=Paper=
{{Paper
|id=Vol-2536/om2019_poster5
|storemode=property
|title=Towards Explainable Entity Matching via Comparison Queries
|pdfUrl=https://ceur-ws.org/Vol-2536/om2019_poster5.pdf
|volume=Vol-2536
|authors=Alina Petrova,Egor V. Kostylev,Bernardo Cuenca Grau,Ian Horrocks
|dblpUrl=https://dblp.org/rec/conf/semweb/PetrovaKGH19a
}}
==Towards Explainable Entity Matching via Comparison Queries==
<pdf width="1500px">https://ceur-ws.org/Vol-2536/om2019_poster5.pdf</pdf>
<pre>
           Towards explainable entity matching
                 via comparison queries

 Alina Petrova, Egor V. Kostylev, Bernardo Cuenca Grau, and Ian Horrocks

               Department of Computer Science, University of Oxford
             {alina.petrova, egor.kostylev, bernardo.cuenca.grau,
                          ian.horrocks}@cs.ox.ac.uk

    Nowadays there exists an abundance of heterogeneous Semantic Web data
coming from multiple sources. As a result, matching Linked Data has become a
tedious and non-transparent task. One way to facilitate entity matching across
datasets is to provide human-readable explanations that highlight what the two
entities have in common, as well as what differentiates the two entities.
    Entity comparison is an important information exploration problem that has
recently gained considerable research attention [1, 2, 4]. In this paper we propose
a solution towards explainable entity matching in Linked Data where entity
comparison is used as a subroutine that assists in debugging and validation of
matchings. To this end, we adopt the entity comparison framework in which
explanations are modelled as unary conjunctive queries of restricted form [3, 4].
    We concentrate on the data model where a dataset is an RDF graph—that
is, a set of triples of IRIs and literals, jointly called entities. The basic building
block of a query is a triple pattern, which is a triple of entities and variables.
Then, a query is a non-empty finite set of triple patterns in which one variable,
usually denoted by X, is an answer variable. The set Q(D) of answer entities
to a query Q on a dataset D is defined as usual in databases.
    The two main notions of the framework are the similarity and difference
queries for pairs of entities, which are defined as follows: a similarity query for
entities a and b in a dataset D is a query Q satisfying {a, b} ⊆ Q(D); a difference
query for a relative to b is a query Q satisfying a ∈ Q(D) and b 6∈ Q(D).
    In our prior work we proposed an algorithm for computing comparison queries
that can be repurposed for similarity and difference queries [3]. The algorithm is
based on the computation of a similarity tree—a data structure that represents
commonalities and discrepancies in data for input entities a and b. It is a directed
rooted tree with nodes and edges labelled by pairs of sets of entities such that
the root is labelled by ({a}, {b}) and every edge labelled (E1 , E2 ) between nodes
labelled (N1 , N2 ) and (N10 , N20 ) is justified in the sense that for every entity n in
Ni , i ∈ {1, 2}, there is a triple (n, e, n0 ) in the dataset with e ∈ Ei and n0 ∈ Ni0 .
    For instance, suppose there are 3 entities, Emma_Watson, Emily_Watson
and E_Watson, that need to be either matched or disambiguated, and a data
fragment given in Figure 1. Then the similarity trees for Emma_Watson and
E_Watson, and for Emily_Watson and E_Watson are depicted in Figure 2 (where
singleton sets {`} and pairs ({`}, {`}) are both written as ` for readability).
  Copyright c 2019 for this paper by its authors. Use permitted under Creative Com-
  mons License Attribution 4.0 International (CC BY 4.0).
2       A. Petrova et al.


         Fig. 1. A fragment of data involving three concepts to be matched


    Each branch in a similarity tree can be treated as a separate similarity query,
in which each edge is encoded as a triple pattern, and each label (L1 , L2 ) is en-
coded as either an entity `, if L1 = L2 = {`}) or a fresh variable otherwise.
For example, a query Q1 = (X, actedIn, Ballet_Shoes) is a similarity query for
Emma_Watson and E_Watson, while a query Q2 = (X, actedIn, Y ), (Y, year, Z)
is a similarity query for Emily_Watson and E_Watson. Moreover, each branch
involving non-entity labels can also be treated as a difference query, if instead
of some variables we take entities from one of the label sets. For example, query


                Fig. 2. Similarity trees rooted in two pairs of entities


Q3 = (X, actedIn, Little_Women), (Little_Women, year, 2019) is a difference query
for E_Watson relative to Emily_Watson.
    Both types of queries can assist in explaining why two entities should or
should not be merged: Q1 gives a good reason to match Emma_Watson and
E_Watson into one entity, Q2 is not specific enough to match the other pair, and
Q3 can act as an indicator that the two movies named Little_Women are indeed
two different movies, and Emily_Watson and E_Watson are different people.


References
1. Colucci, S., Giannini, S., Donini, F.M., Di Sciascio, E.: Finding commonalities in
   Linked Open Data. In: Proc. of CILC. pp. 324–329 (2014)
2. El Hassad, S., Goasdoué, F., Jaudoin, H.: Learning commonalities in SPARQL. In:
   Proc. of ISWC. pp. 278–295 (2017)
3. Petrova, A., Kostylev, E.V., Cuenca Grau, B., Horrocks, I.: Query-based entity
   comparison in knowledge graphs revisited. In: Proc. of ISWC (2019)
4. Petrova, A., Sherkhonov, E., Cuenca Grau, B., Horrocks, I.: Entity comparison in
   RDF graphs. In: Proc. of ISWC. pp. 526–541 (2017)

</pre>