Comparing entities in RDF graphs

                                                                 Alina Petrova
                                                      supervised by Prof. Ian Horrocks
                                                      and Prof. Bernardo Cuenca Grau
                                                      Department of Computer Science
                                                            University of Oxford
                                                       alina.petrova@cs.ox.ac.uk

ABSTRACT                                                                  aspects two entities are similar and different, by compar-
The Semantic Web has fuelled the appearance of numerous                   ing entities and finding similar features. Such comparison
open-source knowledge bases. Knowledge bases enable new                   is done in many domains and for various types of entities:
types of information search, going beyond classical query an-             hotels,1 cars,2 universities,3 shopping items,4 to name a few.
swering and into the realm of exploratory search, and pro-                However, as a rule, such systems perform a side-by-side com-
viding answers to new types of user questions. One such                   parison of items in a domain-specific manner, i.e., following
question is how two entities are comparable, i.e., what are               a fixed, hard-coded template of aspects to compare (e.g., in
similarities and differences between the information known                case of hotels, it could be price, location, included break-
about the two entities. Entity comparison is an important                 fast, rating etc.). In a few more advanced systems, similar-
task and a widely used functionality available in many in-                ities are computed with respect to the type of information
formation systems. Yet it is usually domain-specific and de-              available about the input entities rather than following a
pends on a fixed set of aspects to compare. In this paper we              rigid pattern. One such example is the Facebook Friendship
propose a formal framework for domain-independent entity                  pages.5 Given two Facebook users, a friendship page con-
comparison that provides similarity and difference explana-               tains all their shared information, be it public posts, photos,
tions for input entities. We model explanations as conjunc-               likes or mutual friends, as well as their relationship, if any
tive queries, we discuss how multiple explanations for an                 (e.g., married, friends etc.). However, as in the aforemen-
entity pair can be ranked and we provide a polynomial-time                tioned examples, comparison is done over a limited set of
algorithm for generating most specific similarity explana-                attributes.
tions.                                                                       Relying on a fixed set of aspects is a reasonable solution
                                                                          for tabular data with rigid and stable structure. On the
                                                                          other hand, a more flexible approach to entity comparison is
1.    INTRODUCTION                                                        needed for Linked Data, namely for loosely structured RDF
  Information seeking is a complex task which can be ac-                  graphs. However, all current systems with such functionality
complished following different types of search behaviour.                 compare items following a predefined, domain-specific list of
Classical information retrieval focuses on the query-response             values to compare. Thus, an interesting research problem
search paradigm, in which a user asks for entities similar to             would be to create a framework for entity comparison that
the input keywords or fitting the formal input constraint.                is domain- and attribute-independent.
Yet there exists a broad area of exploratory search that is                  The Semantic Web has fuelled the appearance of numer-
characterized by open-ended, browsing behaviour [18] and                  ous open-source knowledge bases (KBs). Such KBs enable
that is much less well studied. Exploratory search encom-                 both automatic information processing tasks and manual
passes activities like information discovery, aggregation and             search, and they facilitate new types of information search,
interpretation, as well as comparison [13].                               going beyond classical query answering and providing an-
  Comparing entities, or rather, information available about              swers to new types of user questions. For example, using
the entities, is an important task and in fact a widely-used              KBs one can answer questions like how are the two entities
functionality implemented in many tools and resources. On                 similar or what differs them, i.e., perform entity comparison.
the one hand, systems that highlight similarities between                    In this paper we propose to study such questions posed
entities can focus on how much entities are alike, giving a               over one of the most common types of KBs — RDF graphs.
similarity score to a pair (or a group) of entities [6]. On               In particular, we provide a formal framework for posing such
the other hand, systems can focus on how or why, in which                 questions and we model answers to these questions as sim-
                                                                          ilarity and difference explanations. We then discuss how

                                                                          1
                                                                            www.flightnetwork.com/pages/
                                                                          hotel-comparison-tool/
                                                                          2
                                                                            http://www.cars.com/go/compare/modelCompare.jsp
                                                                          3
                                                                            http://colleges.startclass.com/
                                                                          4
Proceedings of the VLDB 2017 PhD Workshop, August 28, 2017. Munich,         http://www.argos.co.uk/static/Home.html
                                                                          5
Germany.                                                                    Original announcement (cashed by the Wayback Machine):
Copyright (c) 2017 for this paper by its authors. Copying permitted for   https://web.archive.org/web/20101030105622/http:
private and academic purposes..                                           //blog.facebook.com/blog.php?post=443390892130
multiple explanations to a question can be ranked and we                    Using this definition, we can formulate the following de-
provide a polynomial-time algorithm for generating most                  cision problem: given hI, a, bi, SimExp is a problem of
specific similarity explanations. Finally, we outline direc-             whether there exists a CQ Q such that {a, b} ⊆ Q(I). The
tions of future research.                                                corresponding functional problem is to compute a query Q
                                                                         such that {a, b} ⊆ Q(I), given hI, a, bi. Note that both the
2.    PRELIMINARIES                                                      definition of Qsim and SimExp can be easily generalized
                                                                         from a pair of entities to a set of input entities.
  In what follows we use the standard notions of conjunctive
                                                                            We specifically chose the condition to be {a, b} ⊆ Q(I)
queries (CQs), query subsumption and homomorphism. We
                                                                         for two reasons. Firstly, the form of Q does not depend on
disallow trivial CQs of the form >(X). We model RDF
                                                                         the rest of the data: it does not matter whether there exist
graphs as finite sets of triples, where a triple is of the form
                                                                         other entities that match the graph pattern described by the
p(s, o), p and s being URIs and o being a URI or a literal.
                                                                         query; moreover, queries fitting the subsumption condition
Furthermore, we use the notion of a direct product of two
                                                                         will not be affected if new data is added. This is very im-
graphs, adapted to RDF graphs:
                                                                         portant in the context of RDF graphs, since web data is
  Definition 1. Let I and J be RDF graphs, t1 = R(s1 , o1 )              intrinsically incomplete.
and t2 = R(s2 , o2 ) be two triples. The direct product of t1               Secondly, it is known that the definability problem is
and t2 , denoted as t1 ⊗ t2 , is the triple R(hs1 , s2 i, ho1 , o2 i).   coNExpTime-complete for conjunctive queries [3,15]. On
The direct product I ⊗ J of I and J is the instance:                     the other hand, SimExp can easily be shown to be in PTime:
                                                                         for conjunctive queries, it is sufficient to take the full join of
                  {t1 ⊗ t2 | t1 ∈ I and t2 ∈ J}.                         all tables in the database instance.

3.    COMPARISON FRAMEWORK                                                 Let Sim(a, b) be the set of all similarity explanations for
                                                                         a given hI, a, bi. Obviously Sim(a, b) can be quite big, con-
  There are multiple ways of how we can define similarity
                                                                         taining numerous explanations, however, we are interested
and difference explanations and how we can model entity
                                                                         in the most informative ones. Our assumption is that the
comparison. In our framework the formalism of choice is
                                                                         more specific a similarity explanation is, the better.
conjunctive queries (CQs). We model formal explanations
as conjunctive queries and we consider the problem of find-
                                                                            Definition 3. Given hI, a, bi, a most specific similarity ex-
ing such explanations as an instance of the query reverse
                                                                         planation is a similarity explanation Qmspsim s.t. for all simi-
engineering problem.
                                                                         larity explanations Q0sim wrt hI, a, bi: Qmsp       0
                                                                                                                    sim ⊆ Qsim .
3.1    Similarity explanations
                                                                            The decision problem SimExpmsp is the problem of decid-
   We would like similarity explanations to highlight com-
                                                                         ing whether Qsim is a most specific similarity explanation
mon patterns for input entities. Thus, we model them as
                                                                         for the given hI, a, bi.The related functional problem is to
queries that return both of these entities, i.e, they match
                                                                         compute a most specific Qsim .
patterns fitting both entities. Let hI, a, bi be a tuple con-
                                                                            The subsumption relation divides the set of all similarity
sisting of an RDF graph I and two URIs from the domain
                                                                         explanations Sim(a, b) into ⊆-equivalent classes. If Sim(a, b)
of I a, b ∈ dom(I) representing input entities. Furthermore,
                                                                         is not empty, then Sim(a, b)msp is not empty, and there ex-
given a query Q, let Q(I) be the answer set returned by Q
                                                                         ists a finite most specific similarity explanation Qmsp
                                                                                                                             sim , whose
over I.
                                                                         size is bounded by the size of I. This explanation can in fact
  Definition 2. Given hI, a, bi, a similarity explanation for            be constructed in PTime (see Section 4).
a and b is a unary connected conjunctive query Qsim such
that {a, b} ⊆ Qsim (I).
                                                                         3.2    Difference explanations
                                                                           Analogous to similarity explanations, we model difference
   Example 1. Given two entities Marilyn Monroe and Eliza-               explanations as CQs, but this time we require only one of
beth Taylor and the Yago RDF graph [14], a possible simi-                the input entities to be in the answer set.
larity explanation is:
                                                                           Definition 4. Given hI, a, bi, a difference explanation for
                                                                         a wrt b is a unary connected conjunctive query Qadif such
        Qsim (X) =hasWonPrize(X, Golden Globe),                          that a ∈ Qadif (I), but b ∈
                                                                                                   / Qadif (I).
                  diedIn(X, Los Angeles),
                     hasGender(X, female),                                  The notion of a difference explanation can be generalized
                     actedIn(X, Y 1),                                    to sets of entities: given an RDF graph I, a set of entities
                                                                         P os and a set of entities N eg, a difference explanation for
                     isLocatedIn(Y 1, United States),                    I and P os wrt N eg is a unary connected CQ QP          os
                                                                                                                                    s.t.
                                                                                                                               dif
                     isMarriedTo(X, Y 2),                                                   P os               P os
                                                                         ∀p ∈ P os: p ∈ Qdif and N eg ∩ Qdif = ∅.
                     hasGender(Y 2, male),                                  Given hI, a, bi, Dif Exp is the problem of deciding whether
                     hasWonPrize(Y 2, Tony Award), etc.                  there exists a difference explanation Qadif . The generalized
                                                                         difference explanation problem Dif Exp can be solved using
Qsim can be interpreted the following way: both Monroe                   the most specific similarity explanation problem SimExpmsp :
and Taylor received a Golden Globes award, died in Los                   given I, P os and N eg, first construct a most specific sim-
Angeles, acted in movies that were shot in the US and were               ilarity explanation Q for entities in P os (done in PTime),
married to men who received a Tony Award.                                and then check whether none of the elements of N eg are in
the answer set of Q (conjunctive query evaluation is NP-               Algorithm 1: Algorithm for computing a most specific
complete). Hence, the complexity of generalized Dif Exp                similarity explanation
is NP-complete.
                                                                        Input: an RDF graph I, entities a, b from dom(I).
   Furthermore, we would like to introduce another defini-
                                                                        Output: a most specific similarity explanation for a
tion of a difference explanation that is dependent on the
                                                                                   and b.
similarities between a and b. We would like the difference
                                                                     1 Let J = {R(ha, bi, hc, di) | R(a, c), R(b, d) ∈
explanation for a to be as relevant as possible, hence we
                                                                        I} ∪ {R(hc, di, ha, bi) | R(c, a), R(d, b) ∈ I};
model it to be dependent not only on the information about
                                                                     2 if J = ∅ then
b, but also on the common patterns for a and b. One pos-
                                                                     3      return empty query;
sible way to do so is the following: let const(Q) be the set
                                                                               ∗
of constants appearing in a query Q and let const(R(x̄)) be          4 Let J = ∅;
                                                                                   ∗
the set of constants appearing in an atom R(x̄).                     5 while J 6= J do
                                                                     6       J ∗ := J;
  Definition 5. Given hI, a, bi, a difference explanation for        7       J := J ∪ {R(hc, di, he, f i) 6∈ J | R(c, e), R(d, f ) ∈
a wrt b and Qsim is a different explanation Qa,sim      such that            I, and ∃ a fact in J that contains hc, di or he, f i};
                                                  dif
∀R(x̄) ∈ Qa,sim
           dif  : const(R(x̄)) ∩ const(Q  sim ) 6
                                                =    ∅.              8 Construct qJ (xha,bi );
                                                                     9 foreach      xhc,ci in qJ , c 6∈ {a, b} do
   Example 2. Let the input entities be a = John Travolta                    // Replace xhc,ci with constant c
and b = Quentin Tarantino. Let Qsim for a and b be an expla-        10       qJ (xha,bi ) := qJ (xha,bi )[xhc,ci → c];
nation that both persons starred in Pulp Fiction: Qsim (X) =        11 return qJ (xha,bi ).
starredIn(X, Pulp Fiction). Relevant difference explanations
could be that Travolta also starred in Grease and other
movies, while Tarantino has directed several movies, in-                 (ii) For a connected unary conjunctive query q 0 (x), if there
cluding Pulp Fiction: Qadif (X) = starredIn(X, Grease) and                    exist homomorphisms h1 , h2 : q 0 → I such that h1 (x) =
Qbdif (X) = directed(X, Pulp Fiction). On the other hand, an                  a and h2 (x) = b, then there exists a homomorphism
explanation that Travolta (unlike Tarantino) is married to                    h : q 0 → J such that h(x) = ha, bi.
Kelly Preston is rather irrelevant, since we have not com-
pared the two persons with respect to their marital status.             Corollary 1. Algorithm 1 produces a most specific simi-
                                                                     larity explanation.

4.    TECHNICAL RESULTS                                              4.2      Properties of the resulting query
                                                                       The algorithm 1 runs in time polynomial to the size of
4.1    Algorithm for computing a most specific                       the input RDF graph, and the size of resulting most specific
       similarity explanation                                        similarity explanation is also polynomial to I. It should be
   We compute a most specific similarity explanation by con-         noted that the output query tends to be non-minimal. For
structing the direct product of the RDF graph, similar to the        example, since Marilyn Monroe and Elizabeth Taylor acted in
construction of the direct product of a database instances           several movies that were shot in the US, Q(X) will contain
with itself [15]. Any RDF graph I a ∈ dom(I) can be asso-            atoms like:
ciated with a canonical unary conjunctive query qI (xa ) such
that for each fact R(c, d) in I there is an atom R(xc , xd ) in      actedIn(X, Y 1), isLocatedIn(Y 1, United States),
qI , where xc and xd are variables and xa is a free variable.        actedIn(X, Y 2), isLocatedIn(Y 2, United States),
Note that a is an answer to qI (xa ) over I. The following           actedIn(X, Y 3), isLocatedIn(Y 3, United States), etc.
algorithm produces a most specific similarity explanation.
In it, we first produce an instance with the domain from             To avoid such redundancy, we can take the core of the query
dom(I)2 , i.e., tuples hc, di for c, d ∈ dom(I), and then con-       (i.e., apply the query minimization algorithm). Taking the
struct a canonical conjunctive query of this instance.               core is an NP-complete problem [9, 11], hence, obtaining
                                                                     a most specific similarity explanation without redundant
                                                                     atoms is an NP-complete task.
  Claim 1. If J 6= ∅, then J is a maximal connected com-
ponent of I ⊗ I such that a ⊗ b = ha, bi ∈ dom(J).
                                                                     5.      RELATED WORK
Proof sketch: Firstly, if J 6= ∅, then ha, bi ∈ dom(J), by Step         So far only few works have studied explanations over RDF
1. Secondly, the while-loop on Step 5 is in fact the greedy          graphs [5, 10, 12], and there is no single formal definition
procedure that generates the maximal connected component             of an explanation over RDF data. A lot of attention has
in I ⊗ I. Indeed, the condition R(c, e), R(d, f ) ∈ I ensures        been paid to discovering connections (“associations”) be-
that the fact R(hc, di, he, f i) is in I ⊗ I, and the condition      tween nodes [12], which boils down to finding and grouping
that there must exist a fact in J that contains hc, di or he, f i    together paths in the graph that connect one input node to
ensures connectedness.                                               another one. Such connectedness explanations are orthogo-
                                                                     nal (rather than alternative) to the similarity explanations
   Claim 2. Let hI, a, bi be an input of Algorithm 1. Let            modelled as queries, which we propose to study. The two
qJ (xha,bi ) be the output, and J the instance obtained after        types of explanations are intended to capture different rela-
the while loop on Step 5. Then all of the following hold.            tions between nodes: the former explore possible paths that
                                                                     link the two nodes together, while the latter seek to find
  (i) {a, b} ⊆ qJ (I),                                               commonalities in the neighbourhoods of the input nodes.
   The problem of reverse engineering a query given some           7.   REFERENCES
examples originated in late 1970s and was first introduced          [1] D. Angluin. Queries and concept learning. Machine
for the domain of relational databases [20]. Later it was ex-           learning, 2(4):319–342, 1988.
tensively researched with respect to different query formats:       [2] M. Arenas, G. I. Diaz, and E. V. Kostylev. Reverse
regular languages [1], XML queries [7], relational database             engineering SPARQL queries. In Proc. of the 25th Int.
queries [16, 17, 19], graph database queries [4] and SPARQL             Conf. on World Wide Web, pages 239–249, 2016.
queries [2]. The problem of QRE for RDF data was first
                                                                    [3] P. Barcelo and M. Romero. The complexity of reverse
studied by Arenas et al. [2] and was implemented by Diaz et
                                                                        engineering problems for conjunctive queries. In Proc.
al. [8]. In [2], the authors consider three different variations
                                                                        of 20th Int. Conf. on Database Theory, 2017.
of QRE problem: the basic variation that requires the input
                                                                    [4] A. Bonifati, R. Ciucanu, and A. Lemay. Learning path
mappings to be part of the answer set (Ω ⊆ JQKG ); the one
                                                                        queries on graph databases. In 18th Int. Conf. on
that allows positive examples Ω together with negative ex-
                                                                        Extending Database Technology (EDBT), 2015.
amples Ω̄ (such that Ω ⊆ JQKG and Ω̄ ∩ JQKG = ∅); and the
variation that requires the examples from Ω to be exactly           [5] G. Cheng, Y. Zhang, and Y. Qu. Explass: exploring
the answer set of Q (Ω = JQKG ). The complexity of these                associations between entities via top-k ontological
three variations is then provided for fragments of SPARQL               patterns and facets. In International Semantic Web
with AND, FILTER and OPT.                                               Conference, pages 422–437. Springer, 2014.
                                                                    [6] S.-S. Choi, S.-H. Cha, and C. C. Tappert. A survey of
                                                                        binary similarity and distance measures. J. Systemics,
                                                                        Cybernetics and Informatics, 8(1):43–48, 2010.
6.   FUTURE WORK
                                                                    [7] S. Cohen and Y. Y. Weiss. Learning tree patterns
   As part of my PhD, I would like to continue studying the             from example graphs. In LIPIcs-Leibniz International
problem of entity comparison using RDF graphs in several                Proceedings in Informatics, volume 31, 2015.
research directions. So far we have investigated similarity
                                                                    [8] G. Diaz, M. Arenas, and M. Benedikt. SPARQLByE:
and difference explanations, and we rank the former accord-
                                                                        Querying RDF data by example. Proceedings of the
ing to the preference condition based on subsumption. In
                                                                        VLDB Endowment, 9(13), 2016.
particular, we assume that the highest ranked explanations
                                                                    [9] G. Gottlob and A. Nash. Efficient core computation in
are most specific similarity explanations. On the one hand,
                                                                        data exchange. Journal of the ACM, 55(2):9, 2008.
we would like to apply a similar rationale to difference expla-
nations and to study most general difference explanations as       [10] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and
most preferred ones. On the other hand, these may not be                T. Stegemann. RelFinder: Revealing relationships in
the optimal choices for a given user, hence we need to inves-           RDF knowledge bases. In Int. Conf. on Semantic and
tigate other possible ranking conditions as well as means of            Digital Media Technologies, pages 182–187, 2009.
user-specific ranking of explanations.                             [11] P. Hell and J. Nešetřil. The core of a graph. Discrete
   RDF graphs are inherently incomplete, hence it would be              Mathematics, 109(1):117–126, 1992.
useful to consider a scenario where an explanation is pro-         [12] J. Lehmann, J. Schüppel, and S. Auer. Discovering
duced over an RDF graph and a domain ontology that con-                 unknown connections - the DBpedia relationship
tains knowledge not explicitly present in the graph. Con-               finder. CSSW, 113:99–110, 2007.
sider a graph G consisting of two facts: T eacher(Bob) and         [13] G. Marchionini. Exploratory search: from finding to
teaches(Alice, CS), — and a simple EL ontology O consist-               understanding. Communications of the ACM,
ing of one axiom: ∃teaches.Class v T eacher. Let the two                49(4):41–46, 2006.
input entities be Alice and Bob. Then a similarity explana-        [14] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago:
tion wrt G, O could be Q(X) = T eacher(X), while we are                 A large ontology from wikipedia and wordnet. Web
unable to generate such a CQ using only graph data.                     Semantics: Science, Services and Agents on the World
   In our framework explanations are modelled as CQs, and               Wide Web, 6(3):203–217, 2008.
while CQs are formulas with relatively high readability for        [15] B. ten Cate and V. Dalmau. The Product
a user, it is of interest to be able to verbalize explanations,         Homomorphism Problem and Applications. In 18th
transforming them into natural language sentences. For ex-              International Conference on Database Theory (ICDT
ample, a formal explanation Q(X) = livesIn(X, London),                  2015), pages 161–176, 2015.
friendsWith(X, Y ), worksAt(Y, Oracle) could be transformed        [16] Q. T. Tran, C.-Y. Chan, and S. Parthasarathy. Query
into an English sentence “Both input entities live in London            by output. In Proc. of the 2009 ACM SIGMOD Int.
and are friends with someone who works at Oracle”.                      Conf. on Management of data, pages 535–548, 2009.
   While CQs correspond to a large part of queries issued          [17] Q. T. Tran, C.-Y. Chan, and S. Parthasarathy. Query
over relational databases, i.e., they have relatively high ex-          reverse engineering. The VLDB Journal,
pressivity, they cannot express things like negation or dis-            23(5):721–746, 2014.
junction, which is a limitation. Hence, an interesting prob-       [18] R. W. White and R. A. Roth. Exploratory search:
lem would be to consider more expressive languages, in par-             beyond the query-response paradigm, 2009.
ticular, union of CQs and CQs with inequalities and numeric
                                                                   [19] M. Zhang, H. Elmeleegy, C. M. Procopiuc, and
comparison.
                                                                        D. Srivastava. Reverse engineering complex join
   Lastly, we are planning to implement a comprehensive
                                                                        queries. In Proc. of the 2013 ACM SIGMOD Int.
comparison system that would compute most specific simi-
                                                                        Conf. on Management of Data, pages 809–820, 2013.
larity explanations and most general difference explanations,
to test it on real-world RDF graphs and to perform usability       [20] M. M. Zloof. Query-by-example: A data base
tests.                                                                  language. IBM systems Journal, 16(4):324–343, 1977.