Comparing entities in RDF graphs Alina Petrova supervised by Prof. Ian Horrocks and Prof. Bernardo Cuenca Grau Department of Computer Science University of Oxford alina.petrova@cs.ox.ac.uk ABSTRACT aspects two entities are similar and different, by compar- The Semantic Web has fuelled the appearance of numerous ing entities and finding similar features. Such comparison open-source knowledge bases. Knowledge bases enable new is done in many domains and for various types of entities: types of information search, going beyond classical query an- hotels,1 cars,2 universities,3 shopping items,4 to name a few. swering and into the realm of exploratory search, and pro- However, as a rule, such systems perform a side-by-side com- viding answers to new types of user questions. One such parison of items in a domain-specific manner, i.e., following question is how two entities are comparable, i.e., what are a fixed, hard-coded template of aspects to compare (e.g., in similarities and differences between the information known case of hotels, it could be price, location, included break- about the two entities. Entity comparison is an important fast, rating etc.). In a few more advanced systems, similar- task and a widely used functionality available in many in- ities are computed with respect to the type of information formation systems. Yet it is usually domain-specific and de- available about the input entities rather than following a pends on a fixed set of aspects to compare. In this paper we rigid pattern. One such example is the Facebook Friendship propose a formal framework for domain-independent entity pages.5 Given two Facebook users, a friendship page con- comparison that provides similarity and difference explana- tains all their shared information, be it public posts, photos, tions for input entities. We model explanations as conjunc- likes or mutual friends, as well as their relationship, if any tive queries, we discuss how multiple explanations for an (e.g., married, friends etc.). However, as in the aforemen- entity pair can be ranked and we provide a polynomial-time tioned examples, comparison is done over a limited set of algorithm for generating most specific similarity explana- attributes. tions. Relying on a fixed set of aspects is a reasonable solution for tabular data with rigid and stable structure. On the other hand, a more flexible approach to entity comparison is 1. INTRODUCTION needed for Linked Data, namely for loosely structured RDF Information seeking is a complex task which can be ac- graphs. However, all current systems with such functionality complished following different types of search behaviour. compare items following a predefined, domain-specific list of Classical information retrieval focuses on the query-response values to compare. Thus, an interesting research problem search paradigm, in which a user asks for entities similar to would be to create a framework for entity comparison that the input keywords or fitting the formal input constraint. is domain- and attribute-independent. Yet there exists a broad area of exploratory search that is The Semantic Web has fuelled the appearance of numer- characterized by open-ended, browsing behaviour [18] and ous open-source knowledge bases (KBs). Such KBs enable that is much less well studied. Exploratory search encom- both automatic information processing tasks and manual passes activities like information discovery, aggregation and search, and they facilitate new types of information search, interpretation, as well as comparison [13]. going beyond classical query answering and providing an- Comparing entities, or rather, information available about swers to new types of user questions. For example, using the entities, is an important task and in fact a widely-used KBs one can answer questions like how are the two entities functionality implemented in many tools and resources. On similar or what differs them, i.e., perform entity comparison. the one hand, systems that highlight similarities between In this paper we propose to study such questions posed entities can focus on how much entities are alike, giving a over one of the most common types of KBs — RDF graphs. similarity score to a pair (or a group) of entities [6]. On In particular, we provide a formal framework for posing such the other hand, systems can focus on how or why, in which questions and we model answers to these questions as sim- ilarity and difference explanations. We then discuss how 1 www.flightnetwork.com/pages/ hotel-comparison-tool/ 2 http://www.cars.com/go/compare/modelCompare.jsp 3 http://colleges.startclass.com/ 4 Proceedings of the VLDB 2017 PhD Workshop, August 28, 2017. Munich, http://www.argos.co.uk/static/Home.html 5 Germany. Original announcement (cashed by the Wayback Machine): Copyright (c) 2017 for this paper by its authors. Copying permitted for https://web.archive.org/web/20101030105622/http: private and academic purposes.. //blog.facebook.com/blog.php?post=443390892130 multiple explanations to a question can be ranked and we Using this definition, we can formulate the following de- provide a polynomial-time algorithm for generating most cision problem: given hI, a, bi, SimExp is a problem of specific similarity explanations. Finally, we outline direc- whether there exists a CQ Q such that {a, b} ⊆ Q(I). The tions of future research. corresponding functional problem is to compute a query Q such that {a, b} ⊆ Q(I), given hI, a, bi. Note that both the 2. PRELIMINARIES definition of Qsim and SimExp can be easily generalized from a pair of entities to a set of input entities. In what follows we use the standard notions of conjunctive We specifically chose the condition to be {a, b} ⊆ Q(I) queries (CQs), query subsumption and homomorphism. We for two reasons. Firstly, the form of Q does not depend on disallow trivial CQs of the form >(X). We model RDF the rest of the data: it does not matter whether there exist graphs as finite sets of triples, where a triple is of the form other entities that match the graph pattern described by the p(s, o), p and s being URIs and o being a URI or a literal. query; moreover, queries fitting the subsumption condition Furthermore, we use the notion of a direct product of two will not be affected if new data is added. This is very im- graphs, adapted to RDF graphs: portant in the context of RDF graphs, since web data is Definition 1. Let I and J be RDF graphs, t1 = R(s1 , o1 ) intrinsically incomplete. and t2 = R(s2 , o2 ) be two triples. The direct product of t1 Secondly, it is known that the definability problem is and t2 , denoted as t1 ⊗ t2 , is the triple R(hs1 , s2 i, ho1 , o2 i). coNExpTime-complete for conjunctive queries [3,15]. On The direct product I ⊗ J of I and J is the instance: the other hand, SimExp can easily be shown to be in PTime: for conjunctive queries, it is sufficient to take the full join of {t1 ⊗ t2 | t1 ∈ I and t2 ∈ J}. all tables in the database instance. 3. COMPARISON FRAMEWORK Let Sim(a, b) be the set of all similarity explanations for a given hI, a, bi. Obviously Sim(a, b) can be quite big, con- There are multiple ways of how we can define similarity taining numerous explanations, however, we are interested and difference explanations and how we can model entity in the most informative ones. Our assumption is that the comparison. In our framework the formalism of choice is more specific a similarity explanation is, the better. conjunctive queries (CQs). We model formal explanations as conjunctive queries and we consider the problem of find- Definition 3. Given hI, a, bi, a most specific similarity ex- ing such explanations as an instance of the query reverse planation is a similarity explanation Qmspsim s.t. for all simi- engineering problem. larity explanations Q0sim wrt hI, a, bi: Qmsp 0 sim ⊆ Qsim . 3.1 Similarity explanations The decision problem SimExpmsp is the problem of decid- We would like similarity explanations to highlight com- ing whether Qsim is a most specific similarity explanation mon patterns for input entities. Thus, we model them as for the given hI, a, bi.The related functional problem is to queries that return both of these entities, i.e, they match compute a most specific Qsim . patterns fitting both entities. Let hI, a, bi be a tuple con- The subsumption relation divides the set of all similarity sisting of an RDF graph I and two URIs from the domain explanations Sim(a, b) into ⊆-equivalent classes. If Sim(a, b) of I a, b ∈ dom(I) representing input entities. Furthermore, is not empty, then Sim(a, b)msp is not empty, and there ex- given a query Q, let Q(I) be the answer set returned by Q ists a finite most specific similarity explanation Qmsp sim , whose over I. size is bounded by the size of I. This explanation can in fact Definition 2. Given hI, a, bi, a similarity explanation for be constructed in PTime (see Section 4). a and b is a unary connected conjunctive query Qsim such that {a, b} ⊆ Qsim (I). 3.2 Difference explanations Analogous to similarity explanations, we model difference Example 1. Given two entities Marilyn Monroe and Eliza- explanations as CQs, but this time we require only one of beth Taylor and the Yago RDF graph [14], a possible simi- the input entities to be in the answer set. larity explanation is: Definition 4. Given hI, a, bi, a difference explanation for a wrt b is a unary connected conjunctive query Qadif such Qsim (X) =hasWonPrize(X, Golden Globe), that a ∈ Qadif (I), but b ∈ / Qadif (I). diedIn(X, Los Angeles), hasGender(X, female), The notion of a difference explanation can be generalized actedIn(X, Y 1), to sets of entities: given an RDF graph I, a set of entities P os and a set of entities N eg, a difference explanation for isLocatedIn(Y 1, United States), I and P os wrt N eg is a unary connected CQ QP os s.t. dif isMarriedTo(X, Y 2), P os P os ∀p ∈ P os: p ∈ Qdif and N eg ∩ Qdif = ∅. hasGender(Y 2, male), Given hI, a, bi, Dif Exp is the problem of deciding whether hasWonPrize(Y 2, Tony Award), etc. there exists a difference explanation Qadif . The generalized difference explanation problem Dif Exp can be solved using Qsim can be interpreted the following way: both Monroe the most specific similarity explanation problem SimExpmsp : and Taylor received a Golden Globes award, died in Los given I, P os and N eg, first construct a most specific sim- Angeles, acted in movies that were shot in the US and were ilarity explanation Q for entities in P os (done in PTime), married to men who received a Tony Award. and then check whether none of the elements of N eg are in the answer set of Q (conjunctive query evaluation is NP- Algorithm 1: Algorithm for computing a most specific complete). Hence, the complexity of generalized Dif Exp similarity explanation is NP-complete. Input: an RDF graph I, entities a, b from dom(I). Furthermore, we would like to introduce another defini- Output: a most specific similarity explanation for a tion of a difference explanation that is dependent on the and b. similarities between a and b. We would like the difference 1 Let J = {R(ha, bi, hc, di) | R(a, c), R(b, d) ∈ explanation for a to be as relevant as possible, hence we I} ∪ {R(hc, di, ha, bi) | R(c, a), R(d, b) ∈ I}; model it to be dependent not only on the information about 2 if J = ∅ then b, but also on the common patterns for a and b. One pos- 3 return empty query; sible way to do so is the following: let const(Q) be the set ∗ of constants appearing in a query Q and let const(R(x̄)) be 4 Let J = ∅; ∗ the set of constants appearing in an atom R(x̄). 5 while J 6= J do 6 J ∗ := J; Definition 5. Given hI, a, bi, a difference explanation for 7 J := J ∪ {R(hc, di, he, f i) 6∈ J | R(c, e), R(d, f ) ∈ a wrt b and Qsim is a different explanation Qa,sim such that I, and ∃ a fact in J that contains hc, di or he, f i}; dif ∀R(x̄) ∈ Qa,sim dif : const(R(x̄)) ∩ const(Q sim ) 6 = ∅. 8 Construct qJ (xha,bi ); 9 foreach xhc,ci in qJ , c 6∈ {a, b} do Example 2. Let the input entities be a = John Travolta // Replace xhc,ci with constant c and b = Quentin Tarantino. Let Qsim for a and b be an expla- 10 qJ (xha,bi ) := qJ (xha,bi )[xhc,ci → c]; nation that both persons starred in Pulp Fiction: Qsim (X) = 11 return qJ (xha,bi ). starredIn(X, Pulp Fiction). Relevant difference explanations could be that Travolta also starred in Grease and other movies, while Tarantino has directed several movies, in- (ii) For a connected unary conjunctive query q 0 (x), if there cluding Pulp Fiction: Qadif (X) = starredIn(X, Grease) and exist homomorphisms h1 , h2 : q 0 → I such that h1 (x) = Qbdif (X) = directed(X, Pulp Fiction). On the other hand, an a and h2 (x) = b, then there exists a homomorphism explanation that Travolta (unlike Tarantino) is married to h : q 0 → J such that h(x) = ha, bi. Kelly Preston is rather irrelevant, since we have not com- pared the two persons with respect to their marital status. Corollary 1. Algorithm 1 produces a most specific simi- larity explanation. 4. TECHNICAL RESULTS 4.2 Properties of the resulting query The algorithm 1 runs in time polynomial to the size of 4.1 Algorithm for computing a most specific the input RDF graph, and the size of resulting most specific similarity explanation similarity explanation is also polynomial to I. It should be We compute a most specific similarity explanation by con- noted that the output query tends to be non-minimal. For structing the direct product of the RDF graph, similar to the example, since Marilyn Monroe and Elizabeth Taylor acted in construction of the direct product of a database instances several movies that were shot in the US, Q(X) will contain with itself [15]. Any RDF graph I a ∈ dom(I) can be asso- atoms like: ciated with a canonical unary conjunctive query qI (xa ) such that for each fact R(c, d) in I there is an atom R(xc , xd ) in actedIn(X, Y 1), isLocatedIn(Y 1, United States), qI , where xc and xd are variables and xa is a free variable. actedIn(X, Y 2), isLocatedIn(Y 2, United States), Note that a is an answer to qI (xa ) over I. The following actedIn(X, Y 3), isLocatedIn(Y 3, United States), etc. algorithm produces a most specific similarity explanation. In it, we first produce an instance with the domain from To avoid such redundancy, we can take the core of the query dom(I)2 , i.e., tuples hc, di for c, d ∈ dom(I), and then con- (i.e., apply the query minimization algorithm). Taking the struct a canonical conjunctive query of this instance. core is an NP-complete problem [9, 11], hence, obtaining a most specific similarity explanation without redundant atoms is an NP-complete task. Claim 1. If J 6= ∅, then J is a maximal connected com- ponent of I ⊗ I such that a ⊗ b = ha, bi ∈ dom(J). 5. RELATED WORK Proof sketch: Firstly, if J 6= ∅, then ha, bi ∈ dom(J), by Step So far only few works have studied explanations over RDF 1. Secondly, the while-loop on Step 5 is in fact the greedy graphs [5, 10, 12], and there is no single formal definition procedure that generates the maximal connected component of an explanation over RDF data. A lot of attention has in I ⊗ I. Indeed, the condition R(c, e), R(d, f ) ∈ I ensures been paid to discovering connections (“associations”) be- that the fact R(hc, di, he, f i) is in I ⊗ I, and the condition tween nodes [12], which boils down to finding and grouping that there must exist a fact in J that contains hc, di or he, f i together paths in the graph that connect one input node to ensures connectedness. another one. Such connectedness explanations are orthogo- nal (rather than alternative) to the similarity explanations Claim 2. Let hI, a, bi be an input of Algorithm 1. Let modelled as queries, which we propose to study. The two qJ (xha,bi ) be the output, and J the instance obtained after types of explanations are intended to capture different rela- the while loop on Step 5. Then all of the following hold. tions between nodes: the former explore possible paths that link the two nodes together, while the latter seek to find (i) {a, b} ⊆ qJ (I), commonalities in the neighbourhoods of the input nodes. The problem of reverse engineering a query given some 7. REFERENCES examples originated in late 1970s and was first introduced [1] D. Angluin. Queries and concept learning. Machine for the domain of relational databases [20]. Later it was ex- learning, 2(4):319–342, 1988. tensively researched with respect to different query formats: [2] M. Arenas, G. I. Diaz, and E. V. Kostylev. Reverse regular languages [1], XML queries [7], relational database engineering SPARQL queries. In Proc. of the 25th Int. queries [16, 17, 19], graph database queries [4] and SPARQL Conf. on World Wide Web, pages 239–249, 2016. queries [2]. The problem of QRE for RDF data was first [3] P. Barcelo and M. Romero. The complexity of reverse studied by Arenas et al. [2] and was implemented by Diaz et engineering problems for conjunctive queries. In Proc. al. [8]. In [2], the authors consider three different variations of 20th Int. Conf. on Database Theory, 2017. of QRE problem: the basic variation that requires the input [4] A. Bonifati, R. Ciucanu, and A. Lemay. Learning path mappings to be part of the answer set (Ω ⊆ JQKG ); the one queries on graph databases. In 18th Int. Conf. on that allows positive examples Ω together with negative ex- Extending Database Technology (EDBT), 2015. amples Ω̄ (such that Ω ⊆ JQKG and Ω̄ ∩ JQKG = ∅); and the variation that requires the examples from Ω to be exactly [5] G. Cheng, Y. Zhang, and Y. Qu. Explass: exploring the answer set of Q (Ω = JQKG ). The complexity of these associations between entities via top-k ontological three variations is then provided for fragments of SPARQL patterns and facets. In International Semantic Web with AND, FILTER and OPT. Conference, pages 422–437. Springer, 2014. [6] S.-S. Choi, S.-H. Cha, and C. C. Tappert. A survey of binary similarity and distance measures. J. Systemics, Cybernetics and Informatics, 8(1):43–48, 2010. 6. FUTURE WORK [7] S. Cohen and Y. Y. Weiss. Learning tree patterns As part of my PhD, I would like to continue studying the from example graphs. In LIPIcs-Leibniz International problem of entity comparison using RDF graphs in several Proceedings in Informatics, volume 31, 2015. research directions. So far we have investigated similarity [8] G. Diaz, M. Arenas, and M. Benedikt. SPARQLByE: and difference explanations, and we rank the former accord- Querying RDF data by example. Proceedings of the ing to the preference condition based on subsumption. In VLDB Endowment, 9(13), 2016. particular, we assume that the highest ranked explanations [9] G. Gottlob and A. Nash. Efficient core computation in are most specific similarity explanations. On the one hand, data exchange. Journal of the ACM, 55(2):9, 2008. we would like to apply a similar rationale to difference expla- nations and to study most general difference explanations as [10] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and most preferred ones. On the other hand, these may not be T. Stegemann. RelFinder: Revealing relationships in the optimal choices for a given user, hence we need to inves- RDF knowledge bases. In Int. Conf. on Semantic and tigate other possible ranking conditions as well as means of Digital Media Technologies, pages 182–187, 2009. user-specific ranking of explanations. [11] P. Hell and J. Nešetřil. The core of a graph. Discrete RDF graphs are inherently incomplete, hence it would be Mathematics, 109(1):117–126, 1992. useful to consider a scenario where an explanation is pro- [12] J. Lehmann, J. Schüppel, and S. Auer. Discovering duced over an RDF graph and a domain ontology that con- unknown connections - the DBpedia relationship tains knowledge not explicitly present in the graph. Con- finder. CSSW, 113:99–110, 2007. sider a graph G consisting of two facts: T eacher(Bob) and [13] G. Marchionini. Exploratory search: from finding to teaches(Alice, CS), — and a simple EL ontology O consist- understanding. Communications of the ACM, ing of one axiom: ∃teaches.Class v T eacher. Let the two 49(4):41–46, 2006. input entities be Alice and Bob. Then a similarity explana- [14] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: tion wrt G, O could be Q(X) = T eacher(X), while we are A large ontology from wikipedia and wordnet. Web unable to generate such a CQ using only graph data. Semantics: Science, Services and Agents on the World In our framework explanations are modelled as CQs, and Wide Web, 6(3):203–217, 2008. while CQs are formulas with relatively high readability for [15] B. ten Cate and V. Dalmau. The Product a user, it is of interest to be able to verbalize explanations, Homomorphism Problem and Applications. In 18th transforming them into natural language sentences. For ex- International Conference on Database Theory (ICDT ample, a formal explanation Q(X) = livesIn(X, London), 2015), pages 161–176, 2015. friendsWith(X, Y ), worksAt(Y, Oracle) could be transformed [16] Q. T. Tran, C.-Y. Chan, and S. Parthasarathy. Query into an English sentence “Both input entities live in London by output. In Proc. of the 2009 ACM SIGMOD Int. and are friends with someone who works at Oracle”. Conf. on Management of data, pages 535–548, 2009. While CQs correspond to a large part of queries issued [17] Q. T. Tran, C.-Y. Chan, and S. Parthasarathy. Query over relational databases, i.e., they have relatively high ex- reverse engineering. The VLDB Journal, pressivity, they cannot express things like negation or dis- 23(5):721–746, 2014. junction, which is a limitation. Hence, an interesting prob- [18] R. W. White and R. A. Roth. Exploratory search: lem would be to consider more expressive languages, in par- beyond the query-response paradigm, 2009. ticular, union of CQs and CQs with inequalities and numeric [19] M. Zhang, H. Elmeleegy, C. M. Procopiuc, and comparison. D. Srivastava. Reverse engineering complex join Lastly, we are planning to implement a comprehensive queries. In Proc. of the 2013 ACM SIGMOD Int. comparison system that would compute most specific simi- Conf. on Management of Data, pages 809–820, 2013. larity explanations and most general difference explanations, to test it on real-world RDF graphs and to perform usability [20] M. M. Zloof. Query-by-example: A data base tests. language. IBM systems Journal, 16(4):324–343, 1977.