<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Descriptions for Explaining Entity Matches</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>(Discussion Paper)</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Paganelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Sottovia</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Maccioni</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Interlandi</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Guerra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DIEF - University of Modena and Reggio Emilia</institution>
          ,
          <addr-line>Modena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Huawei Research</institution>
          ,
          <addr-line>Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Independent Researcher</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Microsoft Research</institution>
          ,
          <addr-line>Seattle</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Finding entity matches in large datasets is currently one of the most attractive research challenges. The recent interest of the research community in Machine and Deep Learning techniques has led to the development of many reliable approaches. Nevertheless, these are conceived as black-box tools that identify the matches between the entities provided as input. The lack of explainability of the process hampers its application to real-world scenarios where domain experts need to know and understand the reasons why entities can be considered a match, i.e., why they represent the same real-world entity. In this paper, we show how data descriptions (sets of compact, readable and insightful formulas of boolean predicates) can be used to guide domain experts in understanding and evaluating the results of entity matching processes.</p>
      </abstract>
      <kwd-group>
        <kwd>data explanation</kwd>
        <kwd>data exploration</kwd>
        <kwd>data profiling</kwd>
        <kwd>outliers</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Entity Matching (EM) is a long-lasting problem in the database research community. Recently,
approaches based on Machine Learning (ML) and Deep Learning (DL) have been proposed.
They conceive EM as a binary classification problem [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] applied on datasets whose records
describe pairs of entities. Since the first proposals (see related work in Section 4), they have
proved to be very effective. Nevertheless, the application of ML and DL approaches in real
scenarios is hampered not only by the need for a significant amount of labeled data for their
configuration and training, but also by the lack of explainability of the results they provide.
      </p>
      <p>
        We have already introduced and showcased (data) descriptions, i.e. compact, readable and
insightful structures formed by predicates expressed on the attribute values and able to
efectively explain the content of large and complex datasets, in [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. We have experimentally
demonstrated the effectiveness of descriptions in performing many tasks of data exploration.
The aim of this paper is to show how they can be useful in another field, making a (large) dataset
describing the result of an EM task understandable at a glance by domain expert users.
      </p>
      <p>[Table 1: a sample of nine song records from the iTunes-Amazon dataset, with attributes Entity Id, Song Name, Artist Name, Album Name, and Genre, plus an extra column M_E reporting the matched entity.]</p>
      <p>
Running Example. Let us introduce Table 1, which contains a sample of entity matches obtained
by applying a generic EM approach to a sample of the iTunes-Amazon dataset published in the
Magellan library<sup>1</sup> and by transitively extending the results through a connected components
algorithm (i.e., following the literature [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we consider entities as referring to the same
real-world entity when the matching elements form a clique). Each record represents a song with
a list of characteristics (Entity Id, Song Name, Artist Name, Album Name, and Genre).
The extra column on the right shows the result of the EM process, which
recognizes four distinct songs among the nine in the table. A domain expert who needs to
validate the result of the EM task has to manually inspect the dataset to figure out the reasons
for the computed matches. Data descriptions can support this process since they are able to
concisely represent any arbitrary set of data tuples.
      </p>
      <p><sup>1</sup> https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md</p>
      <p>Example 1. D<sub>1</sub> is a description of the dataset represented in Table 1, obtained by conjoining a list
of predicates, each characterizing the same attribute of the dataset with the related values. We
call M_E the extra column recording the identifier of the entity resulting from the matching
task. In this example, the description covers the entire dataset (all the tuples are represented by
D<sub>1</sub>) and is built upon a single attribute of the original dataset (i.e., Song_Name). Moreover, D<sub>1</sub>
partitions the dataset into four clusters, each one representing a unique entity resulting from the
EM process. The description based on the Song_Name attribute is then useful to understand whether
the EM model correctly identifies the entities: a different
song name is displayed for each entity, thus confirming the correct subdivision of the dataset
entries into distinct entities. Other descriptions for the same dataset are possible, built upon other attributes, dividing
the dataset into other partitions and using more attributes for each partition (e.g., the dataset can
be grouped by Genre and described with the Album Name).</p>
      <disp-formula>
        <tex-math>\begin{array}{l}
D_1 : (\mathit{M\_E} \in \{1\} \wedge \text{Song Name} \in \{\text{I Ai n't Livin ' Long Like This}\}) \ \vee \\
\quad (\mathit{M\_E} \in \{2\} \wedge \text{Song Name} \in \{\text{Billy}\}) \ \vee \\
\quad (\mathit{M\_E} \in \{3\} \wedge \text{Song Name} \in \{\text{I Ca n't Go There (Acoustic Version)}\}) \ \vee \\
\quad (\mathit{M\_E} \in \{4\} \wedge \text{Song Name} \in \{\text{Afire Love}\})
\end{array}</tex-math>
      </disp-formula>
      <p>
Building the Descriptions: A principled approach. The data descriptions are generated
according to three main principles. The first states that it is more explicative to think of a dataset
as the composition of different groups of related tuples. Attributes of a dataset carry different
degrees of relevance for users who want to understand the content of datasets and glean insight
from their data. Some attributes have intrinsic value because, for example, they can identify
entities in a domain (e.g., the attribute Entity Id in the dataset of Table 1) or because they allow
identifying specific features of the entities (e.g., the attributes Album Name, Artist Name in the
dataset). On the other hand, the importance of certain attributes may depend on the users: on
their specific interests motivating the dataset exploration and on their knowledge of the domain.
Nevertheless, in this case study the idea is to generate data descriptions representing entity
matches, and it is therefore natural to partition the dataset around clusters of matching entities.
We call d-formula the conjunction of predicates describing a partition. A description is formally
a non-empty disjunctive-normal formula of d-formulas (one for each partition).</p>
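      <p>The definition above can be illustrated with a minimal sketch, assuming a predicate is stored as an attribute paired with a set of admissible values. The type names, the helper functions and the toy records are hypothetical and are not taken from the authors' implementation or from Table 1.</p>
      <preformat><![CDATA[
```python
from typing import Dict, FrozenSet, List, Tuple

Row = Dict[str, str]
Predicate = Tuple[str, FrozenSet[str]]   # attribute ∈ {values}
DFormula = List[Predicate]               # conjunction of predicates
Description = List[DFormula]             # disjunction of d-formulas


def holds(d_formula: DFormula, row: Row) -> bool:
    """A d-formula holds when every predicate is satisfied by the row."""
    return all(row.get(attr) in values for attr, values in d_formula)


def describes(description: Description, row: Row) -> bool:
    """A description holds when at least one of its d-formulas holds."""
    return any(holds(d_formula, row) for d_formula in description)


def coverage(description: Description, rows: List[Row]) -> float:
    """Fraction of rows for which the description is true."""
    return sum(describes(description, row) for row in rows) / len(rows)


# Hypothetical toy records (not the content of Table 1).
rows = [
    {"M_E": "1", "Song Name": "song a"},
    {"M_E": "1", "Song Name": "song a"},
    {"M_E": "2", "Song Name": "song b"},
    {"M_E": "2", "Song Name": "song b (live)"},
]
# One d-formula per cluster of matching entities.
description = [
    [("M_E", frozenset({"1"})), ("Song Name", frozenset({"song a"}))],
    [("M_E", frozenset({"2"})), ("Song Name", frozenset({"song b"}))],
]
print(coverage(description, rows))
```
]]></preformat>
      <p>In this toy setting the last record is left uncovered, which is the kind of tuple a domain expert may want to inspect.</p>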
      <p>The exploration of a dataset, as well as the analysis of entity matches, can be conducted for
different purposes. In some cases, users want an accurate and complete, yet readable,
representation of the whole dataset, able to precisely explain the reasons that led to the generation of
those clusters of matching entities. In other cases, a general profile of the dataset is enough
for the user, who only needs to broadly know why entities may match in that domain. The
description represents, in these latter cases, an overview that ignores infrequent values. Often,
instead, users are interested in finding possible mistakes, i.e., entities whose
values are not consistent with the ones of the other entities in the same cluster.
These values can be infrequent values, or outliers, but can also represent possible mistakes. The
second driving principle is that we can accommodate the multi-faceted goals of data explanation
by relaxing the concept of coverage in the descriptions. Our approach allows users to interactively
change the coverage of the expected descriptions (intended as the percentage of the total number
of tuples for which the description is true), so that the descriptions can serve these different needs.</p>
      <p>Finally, the quality of a description is influenced by the subjectivity of the users and the tasks
where the description has to be applied. A prolix description can be suitable for a user who
wants to interpret a matching model but, at the same time, be poor for another one who only
needs to know the main features of a type of entity. The third driving principle is to consider
user preferences for qualifying descriptions. We follow this principle by (1) defining a series of
dimensions that characterize the descriptions and (2) letting the users indicate their preferred
value for these dimensions. Specifically, we identified three dimensions for characterizing
a description from a user perspective: coverage, degree, and diversity. The coverage is the
percentage of items covered by a description in each partition into which the dataset is divided.
Through the coverage, the users select, at one extreme (i.e., low coverage), whether they want to
find the peculiarities of each cluster of matching items (i.e., possible mistakes) or, at the
other (i.e., high coverage), whether they want to describe the common features of all the entities in a
matching group. The degree refers to the number of predicates used to describe each cluster
of entities. Intuitively, a description with lower degree tends to be much easier to read. The
diversity measures how often attributes are shared in describing different clusters of matching
entities. While repetitions of a set of attributes in different partitions increase readability, they
could reduce the fine-grained representation of partitions.</p>
      <p>[Figure 1: the description generation process, from the dataset and the preprocessed dataset, through partitioning (data driven or user driven) and d-formula computation (coverage and offset), to the generation of the descriptions (degree and diversity).]</p>
      <p>The rest of the paper is organized as follows. Section 2 summarizes the approach for building
the descriptions, Section 3 shows its application to the entity matching task, Section 4 discusses
related work, and Section 5 draws some conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. The Approach</title>
      <p>The description generation process is interactive: users can specify their preference parameters—
namely coverage, degree and diversity—and the system generates the best description
according to these preferences. The user can repeatedly vary the preferences and generate further
descriptions until the explanation need is fulfilled.</p>
      <p>Figure 1 illustrates the 5 steps driving the user in the generation process of descriptions.
First of all, the user starts the process by uploading a new dataset (➊). Once the dataset is
selected, the user can specify the attributes of interest that must be considered by the system
when generating the descriptions (➋, by default all attributes are used). Next, the user is guided
through the UI into the 3 major phases in which the actual descriptions are generated.</p>
      <p>In the first phase (➌), the input dataset is partitioned such that the tuples that are expected to
be described together reside in the same group. The user is in charge of selecting how to partition
the dataset. If the dataset describes the results of an EM task, the partition should be defined on
an extra column identifying the clusters of matching entities.</p>
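      <p>For an EM result, the partitioning phase amounts to grouping the tuples by the cluster identifier. The following is a minimal sketch under that assumption; the function name <monospace>partition_by</monospace> and the toy records are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, str]


def partition_by(rows: List[Row], key: str) -> Dict[str, List[Row]]:
    """Group rows sharing the same value of `key` (e.g., the cluster id)."""
    partitions: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


# Hypothetical toy records; M_E is the column holding the matched entity id.
rows = [
    {"Entity Id": "e1", "Genre": "Pop", "M_E": "1"},
    {"Entity Id": "e2", "Genre": "Pop", "M_E": "1"},
    {"Entity Id": "e3", "Genre": "Country", "M_E": "2"},
]
partitions = partition_by(rows, "M_E")
for cluster_id, members in partitions.items():
    print(cluster_id, [row["Entity Id"] for row in members])
```
]]></preformat>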
      <p>
        The generation of the d-formulas (conjunction of predicates) happens during the second
phase (➍). For each partition, the number of possible d-formulas is exponential over the number
of attributes’ values. Generating all possible d-formulas is, therefore, prohibitively expensive.
As explained in Section 2.1, we adopt a heuristic procedure that allows us to prune d-formulas
that are less relevant for the task at hand. In the last phase (➎), the actual descriptions are
computed by combining d-formulas over different partitions. Intuitively, we assemble the top-k
descriptions that maximize a scoring function computed on the generated d-formulas that
takes into account the user preferences of degree and diversity. This problem can be solved
with a dynamic programming approach: in Section 2.2 we will employ a variant of the Viterbi
Algorithm called LVA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] (a.k.a. List Viterbi), although any other algorithm can be used.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Building d-formulas</title>
        <p>The approach starts by building all feasible and relevant d-formulas for each partition. The number
of candidate predicates for each partition depends on the size of the active domain of the
attributes in the partition. Therefore, the complexity of this phase is
<inline-formula><tex-math>O\left(n \times \prod_{a \in A} 2^{|adom(a)|}\right)</tex-math></inline-formula>,
where A is the set of the attributes, n is the number of partitions and |adom(a)| is the
aforementioned size. Given the complexity of generating all possible d-formulas, we follow a
heuristic process that generates only the most relevant candidate d-formulas. We adopt two
heuristics, one for pruning prolix d-formulas, and one for pruning d-formulas and predicates
with undesirable selectivity.</p>
        <sec id="sec-2-1-1">
          <title>2.1.1. Heuristic 1 - pruning prolix d-formulas</title>
          <p>A low degree is specified when the user prefers a small number of predicates and, conversely,
a high degree is for users who prefer descriptions with wide d-formulas. We push the degree
preference of the users into the process of building d-formulas in order to prune early the d-formulas
that do not meet such a requirement. This limits the number of predicates to evaluate and
improves the efficiency of the process.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>2.1.2. Heuristic 2 - pruning d-formulas with undesirable selectivity.</title>
          <p>The coverage parameter (cov) indicates the desired percentage of tuples that make a d-formula
true. Heuristic 2 transforms the coverage value provided by the user into
an interval [cov<sub>min</sub>, cov<sub>max</sub>] of admissible coverage values. The width of the interval grows
with the coverage itself, given that cov<sub>min</sub> = cov − offset(cov), cov<sub>max</sub> = cov + offset(cov) and
offset(cov) = 0.14 · cov<sup>2</sup>. As for Heuristic 1, we push the coverage interval into the process of
generating d-formulas for early pruning.</p>
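          <p>The interval of Heuristic 2 can be computed directly from the offset formula. The sketch below assumes that the coverage is expressed as a fraction in [0, 1] and reads the exponent in the offset as a square; both are assumptions, and the function names are hypothetical.</p>
          <preformat><![CDATA[
```python
from typing import Tuple


def offset(cov: float) -> float:
    """Offset of Heuristic 2, assuming cov is a fraction in [0, 1]."""
    return 0.14 * cov ** 2


def coverage_interval(cov: float) -> Tuple[float, float]:
    """Lower and upper bounds of the admissible coverage values."""
    delta = offset(cov)
    return cov - delta, cov + delta


def is_admissible(observed: float, cov: float) -> bool:
    """True when an observed coverage falls inside the interval."""
    low, high = coverage_interval(cov)
    return low <= observed <= high


for requested in (0.2, 0.5, 1.0):
    print(requested, coverage_interval(requested))
```
]]></preformat>
          <p>Under these assumptions a requested coverage of 1.0 admits observed coverages from roughly 0.86 upwards, while a requested coverage of 0.2 yields a much narrower interval around 0.2.</p>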
        </sec>
        <sec id="sec-2-1-3">
          <title>2.1.3. Generating d-formulas.</title>
          <p>The approach adopted for generating all possible and relevant d-formulas takes as input the list
of partitions, the list of dataset attributes, and the desired coverage and degree. The d-formulas of
each partition are computed separately. First, we generate the atomic predicates (i.e., predicates
with only one value). Atomic predicates are generated only if they are within the expected
coverage as defined by Heuristic 2. Then the atomic predicates are combined together to
generate conjunctions of predicates. The resulting d-formulas, for each partition, are then
returned as output.</p>
          <p>When combining the predicates together, an internal routine makes sure that only those valid
combinations (i.e., d-formulas) for Heuristic 2 are actually returned. The combination of the
input predicates is organized in a lattice which is dynamically generated using a loop. In each
iteration, we generate one level of the lattice (the n-th) by combining pairs of the previously
generated predicates into one d-formula. Only combinations of predicates whose coverage is
within the interval defined by Heuristic 2 are kept for the next level of the lattice. The generation
of the lattice ends when the threshold, as per Heuristic 1, is reached or when it is no longer
possible to generate predicates.</p>
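          <p>The level-wise generation described above can be sketched as follows for a single partition. This is a simplified illustration, not the authors' routine: predicates are restricted to one value per attribute, a d-formula is a set of such predicates on distinct attributes, and all names and toy records are hypothetical.</p>
          <preformat><![CDATA[
```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Row = Dict[str, str]
Atom = Tuple[str, str]        # (attribute, value): attribute ∈ {value}
DFormula = FrozenSet[Atom]    # conjunction of atomic predicates


def coverage(d_formula: DFormula, rows: List[Row]) -> float:
    """Fraction of the partition's rows satisfying every atom."""
    hits = sum(all(row.get(a) == v for a, v in d_formula) for row in rows)
    return hits / len(rows)


def generate_d_formulas(rows: List[Row], attributes: List[str],
                        cov_min: float, cov_max: float,
                        max_degree: int) -> Set[DFormula]:
    def admissible(d_formula: DFormula) -> bool:
        return cov_min <= coverage(d_formula, rows) <= cov_max

    # Level 1: atomic predicates whose coverage lies in the interval.
    atoms = {(a, row[a]) for row in rows for a in attributes}
    level = {frozenset({atom}) for atom in atoms
             if admissible(frozenset({atom}))}
    result = set(level)
    degree = 1
    # Each iteration builds the next lattice level from pairs of the
    # previous one; it stops at max_degree or when nothing survives.
    while level and degree < max_degree:
        candidates = set()
        for left, right in combinations(level, 2):
            merged = left | right
            distinct_attrs = {a for a, _ in merged}
            if len(merged) == degree + 1 and len(distinct_attrs) == degree + 1:
                candidates.add(merged)
        level = {f for f in candidates if admissible(f)}
        result |= level
        degree += 1
    return result


# Hypothetical partition: three records of one cluster.
rows = [
    {"Song Name": "song c", "Genre": "Pop"},
    {"Song Name": "song c", "Genre": "Pop"},
    {"Song Name": "song c", "Genre": "Rock"},
]
result = generate_d_formulas(rows, ["Song Name", "Genre"], 0.6, 1.0, 2)
for d_formula in result:
    print(sorted(d_formula), coverage(d_formula, rows))
```
]]></preformat>
          <p>With the interval [0.6, 1.0] the predicate on the minority genre is pruned at the first level, so it never contributes to longer conjunctions.</p>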
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Building top-k descriptions</title>
        <p>
          Selecting the best d-formulas in isolation does not necessarily lead to the best descriptions. We
need to search for the optimal set of d-formulas across partitions that all together best fit the
users’ needs. To achieve this, we use dynamic programming: our current implementation is
based on the List Viterbi Algorithm [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The Viterbi Algorithm and its variants require modeling the
search space as a trellis. We build a vertical slice for each partition. The nodes in the slice are
the d-formulas we have found in the previous phase. Nodes of a slice are connected to all the
nodes of the next slice via weighted edges. A path represents a list of d-formulas, with at most
one d-formula per partition. The algorithm is used to find the best full path.
        </p>
        <p>Viterbi needs an objective function to score a path, which intuitively represents the score
of the (intermediate) descriptions we are computing.</p>
        <p>
          We introduce a scoring function which measures (1) the adherence between the features of
the description and the expectation of the user (expressed with a preference on level of coverage,
degree and diversity); and (2) the goodness of the chosen attribute in describing the partitions.
Other techniques can be evaluated. For example, we experimented also with the Smooth Local
Search [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] with no improvement in the performance.
        </p>
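        <p>A top-k search over such a trellis can be sketched by keeping, for every node, the k best partial paths that end in it, in the spirit of the parallel List Viterbi scheme. The sketch assumes an additive score made of node and edge terms and exactly one d-formula per partition; the scores, the reuse bonus and the representation of a d-formula by the attributes it uses are hypothetical and do not reproduce the scoring function of the paper.</p>
        <preformat><![CDATA[
```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple

Node = Tuple[str, ...]   # hypothetical: attributes used by a d-formula
Path = Tuple[Node, ...]
Scored = Tuple[float, Path]


def top_k_paths(slices: Sequence[Sequence[Node]],
                node_score: Callable[[Node], float],
                edge_score: Callable[[Node, Node], float],
                k: int) -> List[Scored]:
    """Return the k best-scoring full paths, one node per slice."""
    # best[node] keeps up to k (score, path) pairs ending in that node.
    best: Dict[Node, List[Scored]] = {
        node: [(node_score(node), (node,))] for node in slices[0]
    }
    for current in slices[1:]:
        new_best: Dict[Node, List[Scored]] = {}
        for node in current:
            extended = [
                (score + edge_score(path[-1], node) + node_score(node),
                 path + (node,))
                for entries in best.values()
                for score, path in entries
            ]
            new_best[node] = heapq.nlargest(k, extended, key=lambda e: e[0])
        best = new_best
    finals = [entry for entries in best.values() for entry in entries]
    return heapq.nlargest(k, finals, key=lambda e: e[0])


# Hypothetical scores for two partitions with two candidate d-formulas each.
scores = {("Song Name",): 1.0, ("Genre",): 0.5, ("Artist Name",): 0.8}
slices = [
    [("Song Name",), ("Genre",)],
    [("Song Name",), ("Artist Name",)],
]


def reuse_bonus(previous: Node, current: Node) -> float:
    """Reward reusing the same attributes (a low-diversity preference)."""
    return 0.5 if set(previous) == set(current) else 0.0


result = top_k_paths(slices, lambda n: scores[n], reuse_bonus, k=2)
for score, path in result:
    print(score, path)
```
]]></preformat>
        <p>Keeping k entries per node is sufficient here because the score of a path only depends on consecutive pairs of nodes.</p>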
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Applying Descriptions to the EM scenario</title>
      <p>When we need to explain entity matches, it is natural to create descriptions with a d-formula for
each cluster of matching entities. The use of descriptions enables the computation of different
types of explanation, obtained by varying the preferences of coverage, degree and diversity
specified by the user.</p>
      <p>What do these entities represent? When users are interested in understanding the main
features of the entities in the dataset, they have to select settings with high coverage values.
These configurations generate descriptions where d-formulas aim to represent the clusters
of matching entities as whole units. Descriptions with high degree tend to create d-formulas
with more predicates. Descriptions with high diversity tend to use different attributes in the
predicate lists of different d-formulas.</p>
      <p>Example 2. D<sub>1</sub> in Example 1 is a description with high coverage, low degree and low diversity.
All clusters of matching entities shown in Table 1 are represented and the user can understand
that the values of Song Name qualify the entities. D<sub>2</sub> is an example of a description with high
coverage, low degree and high diversity. In this case, the approach tries to select different attributes
for describing each partition.</p>
      <disp-formula>
        <tex-math>\begin{array}{l}
D_2 : (\mathit{M\_E} \in \{1\} \wedge \text{Song Name} \in \{\text{I Ai n't Livin ' Long Like This}\}) \ \vee \\
\quad (\mathit{M\_E} \in \{2\} \wedge \text{Artist Name} \in \{\text{Keith Urban}\}) \ \vee \\
\quad (\mathit{M\_E} \in \{3\} \wedge \text{Song Name} \in \{\text{I Ca n't Go There (Acoustic Version)}\}) \ \vee \\
\quad (\mathit{M\_E} \in \{4\} \wedge \text{Genre} \in \{\text{Pop}\})
\end{array}</tex-math>
      </disp-formula>
      <p>Descriptions with high degree tend to use the largest possible number of attributes for each
partition. Clearly, degree and diversity are related dimensions and, when the dataset has a low
dimensionality as in Table 1, the generated descriptions cannot satisfy both requirements.</p>
      <p>Is this a mistake? When users are interested in discovering possible mistakes in clusters of
matching entities, they need to generate low coverage descriptions, which apply only to a small
portion of each partition. As for high coverage descriptions, degree and diversity are
used to qualify the generated d-formulas.</p>
      <p>Example 3. D<sub>3</sub> is a description with low coverage, low degree and low diversity. It describes a
single partition, i.e., the cluster of matching entities 3. In this cluster, the entity with Entity
Id 7 has a genre which is different from the ones of the other entities in the same cluster. This can
be a possible mistake that the domain expert needs to check. Note that D<sub>3</sub> shows only the information
about cluster 3, since it is not possible to generate low coverage d-formulas for the other partitions.
In this case, the expert will probably validate the cluster, since all songs in cluster 3 actually refer to
the same real-world item. Nevertheless, let us suppose that the EM model erroneously generates
a cluster of matching entities which is the union of the entities in clusters 1 and 2. D<sub>4</sub> is a low
coverage description for this new cluster. In this case, the item reported by this description will
probably be evaluated by an expert as a mistake, since the song with Entity Id 4 is completely
different from the other ones.</p>
      <disp-formula>
        <tex-math>D_4 : \mathit{M\_E} \in \{1 + 2\} \wedge \text{Song Name} \in \{\text{Billy}\} \wedge \text{Genre} \in \{\text{Country}\}</tex-math>
      </disp-formula>
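      <p>The intuition behind low coverage descriptions can be sketched by listing, for one cluster, the attribute values shared by only a small fraction of its records. This is a simplified illustration with hypothetical names and toy records, not the d-formula generation procedure of Section 2.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from typing import Dict, List, Tuple

Row = Dict[str, str]


def infrequent_values(rows: List[Row], attributes: List[str],
                      max_cov: float) -> List[Tuple[str, str, float]]:
    """List (attribute, value, coverage) for values covering at most
    max_cov of the records in the cluster."""
    findings = []
    for attr in attributes:
        counts = Counter(row[attr] for row in rows)
        for value, count in counts.items():
            cov = count / len(rows)
            if cov <= max_cov:
                findings.append((attr, value, cov))
    return findings


# Hypothetical cluster: one record disagrees with the others on Genre.
cluster = [
    {"Song Name": "song c", "Genre": "Pop"},
    {"Song Name": "song c", "Genre": "Pop"},
    {"Song Name": "song c", "Genre": "Rock"},
]
findings = infrequent_values(cluster, ["Song Name", "Genre"], max_cov=0.4)
for attr, value, cov in findings:
    print(attr, value, round(cov, 2))
```
]]></preformat>
      <p>The reported value is only a candidate for inspection: as in Example 3, the expert decides whether it is a legitimate variation or a matching error.</p>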
    </sec>
    <sec id="sec-4">
      <title>4. Related Work</title>
      <p>
        Describing datasets. Explanation systems help users in gaining knowledge on the behavior of
systems, experiments or query answers [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and “black box” complex models [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Explanations
typically assume the form of association rules, decision lists and decision sets [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Our approach
performs data explanation since it creates partitions from a dataset and builds rules that provide
users with an explanation of their content. The paper develops a subgroup discovery technique
for performing data explanation and exploration on structured datasets [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
Explainable Entity Matching. Entity matching [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] represents one of the main steps of data
integration and has been under study for several years. Many techniques have been proposed:
from the more traditional rule-based approaches to the most recent machine learning and deep
learning methods. Rule-based approaches [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] are intrinsically interpretable; however, the
identification of the most effective set of matching rules is a complex and non-trivial task [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
EM approaches based on Deep Learning (e.g., DeepER [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], DeepMatcher [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], DITTO [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and
many others [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]) have proved particularly effective. Nevertheless, they require a
significant amount of annotated data, they need a complex configuration, and there is no direct
interpretation of their behavior, which affects their usability in business environments. There are
typically two alternative approaches for providing explanations of AI techniques [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]: 1) applying
explanation systems to interpret their behavior a posteriori or 2) building models that are
interpretable by design, i.e., models that base their decisions on humanly interpretable "structures
/ components". The main approaches in EM (e.g., LIME [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and SHAP [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], Mojito [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and
Landmark Explanation [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]) belong to the first area.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>
        This discussion paper shows how data descriptions can be used for evaluating an EM task. In
particular, descriptions can be used to generate an overview of the main features of the entities
in the dataset or for discovering outliers, which can be the symptom of a mistake in the approach.
Interested readers can find further details of the approach and an in-depth evaluation in [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Christen</surname>
          </string-name>
          ,
          <article-title>Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection</article-title>
          , Data-Centric Systems and Applications, Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Paganelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sottovia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maccioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Interlandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guerra</surname>
          </string-name>
          ,
          <article-title>Explaining data with descriptions</article-title>
          ,
          <source>Inf. Syst</source>
          .
          <volume>92</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Paganelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sottovia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maccioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Interlandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guerra</surname>
          </string-name>
          ,
          <article-title>Understanding data in the blink of an eye</article-title>
          , in: CIKM, ACM,
          <year>2019</year>
          , pp.
          <fpage>2885</fpage>
          -
          <lpage>2888</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>Online entity resolution using an oracle</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>9</volume>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Seshadri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-E. W.</given-names>
            <surname>Sundberg</surname>
          </string-name>
          ,
          <article-title>List viterbi decoding algorithms with applications</article-title>
          ,
          <source>IEEE Transactions on Communications</source>
          <volume>42</volume>
          (
          <year>1994</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>U.</given-names>
            <surname>Feige</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Mirrokni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vondrák</surname>
          </string-name>
          ,
          <article-title>Maximizing non-monotone submodular functions</article-title>
          ,
          <source>SIAM J. Comput</source>
          .
          <volume>40</volume>
          (
          <year>2011</year>
          )
          <fpage>1133</fpage>
          -
          <lpage>1153</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Meliou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Suciu</surname>
          </string-name>
          ,
          <article-title>Causality and explanations in databases</article-title>
          ,
          <source>PVLDB</source>
          <volume>7</volume>
          (
          <year>2014</year>
          )
          <fpage>1715</fpage>
          -
          <lpage>1716</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Guidotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Monreale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruggieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Turini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giannotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pedreschi</surname>
          </string-name>
          ,
          <article-title>A survey of methods for explaining black box models</article-title>
          ,
          <source>CSUR</source>
          <volume>51</volume>
          (
          <year>2018</year>
          )
          <fpage>93</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lakkaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Bach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          ,
          <article-title>Interpretable decision sets: A joint framework for description and prediction</article-title>
          , in: SIGKDD,
          <year>2016</year>
          , pp.
          <fpage>1675</fpage>
          -
          <lpage>1684</lpage>
          . URL: http://doi.acm.org/10.1145/2939672.2939874.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Atzmueller</surname>
          </string-name>
          ,
          <article-title>Subgroup discovery</article-title>
          ,
          <source>Wiley Interdiscip. Rev. Data Min. Knowl. Discov.</source>
          <volume>5</volume>
          (
          <year>2015</year>
          )
          <fpage>35</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Papadakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mandilaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gagliardelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Simonini</surname>
          </string-name>
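          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Thanos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakopoulos</surname>
          </string-name>
          ,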
          <string-name>
            <given-names>S.</given-names>
            <surname>Bergamaschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Palpanas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koubarakis</surname>
          </string-name>
          ,
          <article-title>Three-dimensional entity resolution with JedAI</article-title>
          ,
          <source>Information Systems</source>
          <volume>93</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Meduri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Elmagarmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Quiané-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Solar-Lezama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Synthesizing entity matching rules by examples</article-title>
          ,
          <source>Proc. VLDB Endow.</source>
          <volume>11</volume>
          (
          <year>2017</year>
          )
          <fpage>189</fpage>
          -
          <lpage>202</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Paganelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sottovia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guerra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Velegrakis</surname>
          </string-name>
          ,
          <article-title>Tuner: Fine tuning of rule-based entity matchers</article-title>
          , in: CIKM, ACM,
          <year>2019</year>
          , pp.
          <fpage>2945</fpage>
          -
          <lpage>2948</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ebraheem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thirumuruganathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Joty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ouzzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Distributed representations of tuples for entity resolution</article-title>
          ,
          <source>Proc. VLDB Endow.</source>
          <volume>11</volume>
          (
          <year>2018</year>
          )
          <fpage>1454</fpage>
          -
          <lpage>1467</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mudgal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rekatsinas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Deep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Arcaute</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Raghavendra</surname>
          </string-name>
          ,
          <article-title>Deep learning for entity matching: A design space exploration</article-title>
          , in: SIGMOD Conference, ACM,
          <year>2018</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Suhara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>Deep entity matching with pre-trained language models</article-title>
          ,
          <source>Proc. VLDB Endow.</source>
          <volume>14</volume>
          (
          <year>2020</year>
          )
          <fpage>50</fpage>
          -
          <lpage>60</lpage>
          . URL: https://doi.org/10.14778/3421424.3421431. doi:10.14778/3421424.3421431.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>U.</given-names>
            <surname>Brunner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stockinger</surname>
          </string-name>
          ,
          <article-title>Entity matching with transformer architectures - A step forward in data integration</article-title>
          , in: EDBT, OpenProceedings.org,
          <year>2020</year>
          , pp.
          <fpage>463</fpage>
          -
          <lpage>473</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Techniques for interpretable machine learning</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>63</volume>
          (
          <year>2020</year>
          )
          <fpage>68</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <article-title>"Why should I trust you?": Explaining the predictions of any classifier</article-title>
          , in:
          <source>Proceedings of the 22nd ACM SIGKDD</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1135</fpage>
          -
          <lpage>1144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghorbani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <article-title>Data Shapley: Equitable valuation of data for machine learning</article-title>
          , in:
          <source>ICML</source>
          , volume
          <volume>97</volume>
          of
          <source>Proceedings of Machine Learning Research</source>
          , PMLR,
          <year>2019</year>
          , pp.
          <fpage>2242</fpage>
          -
          <lpage>2251</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>V. D.</given-names>
            <surname>Cicco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Koudas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Merialdo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>Interpreting deep learning models for entity resolution: an experience report using LIME</article-title>
          , in: aiDM@SIGMOD, ACM,
          <year>2019</year>
          , pp.
          <fpage>8:1</fpage>
          -
          <lpage>8:4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baraldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. D.</given-names>
            <surname>Buono</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Paganelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guerra</surname>
          </string-name>
          ,
          <article-title>Using landmarks for explaining entity matching models</article-title>
          , in: EDBT, OpenProceedings.org,
          <year>2021</year>
          , pp.
          <fpage>451</fpage>
          -
          <lpage>456</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>