
PhD Workshop, August 28, 2017

Comparing entities in RDF graphs

Alina Petrova
supervised by Ian Horrocks and Bernardo Cuenca Grau
Department of Computer Science, University of Oxford
alina.petrova@cs.ox.ac.uk

The Semantic Web has fuelled the appearance of numerous open-source knowledge bases. Knowledge bases enable new types of information search, going beyond classical query answering and into the realm of exploratory search, and providing answers to new types of user questions. One such question is how two entities are comparable, i.e., what the similarities and differences are between the information known about the two entities. Entity comparison is an important task and a widely used functionality available in many information systems. Yet it is usually domain-specific and depends on a fixed set of aspects to compare. In this paper we propose a formal framework for domain-independent entity comparison that provides similarity and difference explanations for input entities. We model explanations as conjunctive queries, we discuss how multiple explanations for an entity pair can be ranked and we provide a polynomial-time algorithm for generating most specific similarity explanations.

1. INTRODUCTION

Information seeking is a complex task which can be accomplished following different types of search behaviour. Classical information retrieval focuses on the query-response search paradigm, in which a user asks for entities similar to the input keywords or fitting the formal input constraint. Yet there exists a broad area of exploratory search that is characterized by open-ended, browsing behaviour [18] and that is much less well studied. Exploratory search encompasses activities like information discovery, aggregation and interpretation, as well as comparison [13].

Comparing entities, or rather, information available about the entities, is an important task and in fact a widely used functionality implemented in many tools and resources. On the one hand, systems that highlight similarities between entities can focus on how much entities are alike, giving a similarity score to a pair (or a group) of entities [6]. On the other hand, systems can focus on how or why, i.e., in which aspects, two entities are similar and different, by comparing entities and finding similar features. Such comparison is done in many domains and for various types of entities: hotels,^1 cars,^2 universities,^3 shopping items,^4 to name a few. However, as a rule, such systems perform a side-by-side comparison of items in a domain-specific manner, i.e., following a fixed, hard-coded template of aspects to compare (e.g., in the case of hotels, it could be price, location, included breakfast, rating etc.). In a few more advanced systems, similarities are computed with respect to the type of information available about the input entities rather than following a rigid pattern. One such example is the Facebook Friendship pages.^5 Given two Facebook users, a friendship page contains all their shared information, be it public posts, photos, likes or mutual friends, as well as their relationship, if any (e.g., married, friends etc.). However, as in the aforementioned examples, comparison is done over a limited set of attributes.

Relying on a fixed set of aspects is a reasonable solution for tabular data with rigid and stable structure. On the other hand, a more flexible approach to entity comparison is needed for Linked Data, namely for loosely structured RDF graphs. However, all current systems with such functionality compare items following a predefined, domain-specific list of values to compare. Thus, an interesting research problem would be to create a framework for entity comparison that is domain- and attribute-independent.

The Semantic Web has fuelled the appearance of numerous open-source knowledge bases (KBs). Such KBs enable both automatic information processing tasks and manual search, and they facilitate new types of information search, going beyond classical query answering and providing answers to new types of user questions. For example, using KBs one can answer questions such as how two entities are similar or how they differ, i.e., perform entity comparison.

In this paper we propose to study such questions posed over one of the most common types of KBs, namely RDF graphs. In particular, we provide a formal framework for posing such questions and we model answers to these questions as similarity and difference explanations. We then discuss how multiple explanations to a question can be ranked and we provide a polynomial-time algorithm for generating most specific similarity explanations. Finally, we outline directions of future research.

^1 www.flightnetwork.com/pages/hotel-comparison-tool/
^2 http://www.cars.com/go/compare/modelCompare.jsp
^3 http://colleges.startclass.com/
^4 http://www.argos.co.uk/static/Home.html
^5 Original announcement (cached by the Wayback Machine): https://web.archive.org/web/20101030105622/http://blog.facebook.com/blog.php?post=443390892130

2. PRELIMINARIES

In what follows we use the standard notions of conjunctive queries (CQs), query subsumption and homomorphism. We disallow trivial CQs of the form $\top(X)$. We model RDF graphs as finite sets of triples, where a triple is of the form $p(s, o)$, $p$ and $s$ being URIs and $o$ being a URI or a literal. Furthermore, we use the notion of a direct product of two graphs, adapted to RDF graphs:

Definition 1. Let $I$ and $J$ be RDF graphs, and let $t_1 = R(s_1, o_1)$ and $t_2 = R(s_2, o_2)$ be two triples. The direct product of $t_1$ and $t_2$, denoted as $t_1 \otimes t_2$, is the triple $R(\langle s_1, s_2 \rangle, \langle o_1, o_2 \rangle)$. The direct product $I \otimes J$ of $I$ and $J$ is the instance $\{ t_1 \otimes t_2 \mid t_1 \in I \text{ and } t_2 \in J \}$.
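As a rough illustration of Definition 1, the following Python sketch computes the direct product of two small graphs. The `(predicate, subject, object)` tuple encoding, the function names and the toy facts are assumptions made for this example only; the one thing taken from the definition is that two triples are paired only when they share the predicate $R$.

```python
from itertools import product


def triple_product(t1, t2):
    """Direct product of two triples, or None if their predicates differ."""
    (p1, s1, o1), (p2, s2, o2) = t1, t2
    if p1 != p2:
        return None  # Definition 1 only pairs triples over the same predicate R
    return (p1, (s1, s2), (o1, o2))


def graph_product(graph_i, graph_j):
    """Direct product of two graphs given as sets of (predicate, subject, object)."""
    pairs = (triple_product(t1, t2) for t1, t2 in product(graph_i, graph_j))
    return {t for t in pairs if t is not None}


# Toy facts, invented for illustration.
I = {("actedIn", "a", "film1"), ("diedIn", "a", "city1")}
J = {("actedIn", "b", "film2"), ("bornIn", "b", "city2")}

product_graph = graph_product(I, J)
```

In this toy case only the two `actedIn` triples share a predicate, so the product contains a single triple over the pairs `("a", "b")` and `("film1", "film2")`.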

3. COMPARISON FRAMEWORK

There are multiple ways to define similarity and difference explanations and to model entity comparison. In our framework the formalism of choice is conjunctive queries (CQs). We model formal explanations as conjunctive queries and we consider the problem of finding such explanations as an instance of the query reverse engineering problem.

3.1 Similarity explanations

We would like similarity explanations to highlight common patterns for input entities. Thus, we model them as queries that return both of these entities, i.e., they match patterns fitting both entities. Let $\langle I, a, b \rangle$ be a tuple consisting of an RDF graph $I$ and two URIs $a, b \in dom(I)$ from the domain of $I$ representing input entities. Furthermore, given a query $Q$, let $Q(I)$ be the answer set returned by $Q$ over $I$.

Definition 2. Given $\langle I, a, b \rangle$, a similarity explanation for $a$ and $b$ is a unary connected conjunctive query $Q_{sim}$ such that $\{a, b\} \subseteq Q_{sim}(I)$.

Example 1. Given two entities Marilyn Monroe and Elizabeth Taylor and the Yago RDF graph [14], a possible similarity explanation is:

$Q_{sim}(X)$ = hasWonPrize(X, Golden Globe), diedIn(X, Los Angeles), hasGender(X, female), actedIn(X, Y1), isLocatedIn(Y1, United States), isMarriedTo(X, Y2), hasGender(Y2, male), hasWonPrize(Y2, Tony Award), etc.

$Q_{sim}$ can be interpreted in the following way: both Monroe and Taylor received a Golden Globe award, died in Los Angeles, acted in movies that were shot in the US and were married to men who received a Tony Award.
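To make the condition $\{a, b\} \subseteq Q_{sim}(I)$ concrete, here is a minimal Python sketch that evaluates a unary conjunctive query over a small graph by backtracking and then tests the containment. The triple encoding, the `?`-prefix for variables, the function names and the toy facts (which are not taken from Yago) are all assumptions of this sketch; connectedness of the query is not checked.

```python
def is_var(term):
    """Variables are strings starting with '?'; everything else is a constant."""
    return term.startswith("?")


def matches(atoms, graph, binding):
    """Yield every extension of `binding` that maps all atoms into `graph`."""
    if not atoms:
        yield binding
        return
    (pred, s, o), rest = atoms[0], atoms[1:]
    for p, gs, go in graph:
        if p != pred:
            continue
        new = dict(binding)
        ok = True
        for term, value in ((s, gs), (o, go)):
            if is_var(term):
                # Bind the variable, or fail if it is already bound elsewhere.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from matches(rest, graph, new)


def answers(atoms, free_var, graph):
    """Answer set Q(I) of the unary query with free variable `free_var`."""
    return {b[free_var] for b in matches(atoms, graph, {})}


def is_similarity_explanation(atoms, free_var, graph, a, b):
    """Definition 2 without the connectedness check: {a, b} is a subset of Q(I)."""
    return {a, b} <= answers(atoms, free_var, graph)


# Toy facts, invented for illustration.
graph = {
    ("hasWonPrize", "Marilyn_Monroe", "Golden_Globe"),
    ("hasWonPrize", "Elizabeth_Taylor", "Golden_Globe"),
    ("actedIn", "Marilyn_Monroe", "Film_A"),
    ("actedIn", "Elizabeth_Taylor", "Film_B"),
    ("isLocatedIn", "Film_A", "United_States"),
    ("isLocatedIn", "Film_B", "United_States"),
}
query = [
    ("hasWonPrize", "?X", "Golden_Globe"),
    ("actedIn", "?X", "?Y1"),
    ("isLocatedIn", "?Y1", "United_States"),
]
ok = is_similarity_explanation(query, "?X", graph, "Marilyn_Monroe", "Elizabeth_Taylor")
```

Adding an atom that only one entity satisfies, such as `("actedIn", "?X", "Film_A")`, makes the check fail, because the second entity is no longer in the answer set.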

Using this definition, we can formulate the following decision problem: given $\langle I, a, b \rangle$, SimExp is the problem of deciding whether there exists a CQ $Q$ such that $\{a, b\} \subseteq Q(I)$. The corresponding functional problem is to compute a query $Q$ such that $\{a, b\} \subseteq Q(I)$, given $\langle I, a, b \rangle$. Note that both the definition of $Q_{sim}$ and SimExp can be easily generalized from a pair of entities to a set of input entities.

We specifically chose the condition to be $\{a, b\} \subseteq Q(I)$ for two reasons. Firstly, the form of $Q$ does not depend on the rest of the data: it does not matter whether there exist other entities that match the graph pattern described by the query; moreover, queries fitting the subsumption condition will not be affected if new data is added. This is very important in the context of RDF graphs, since web data is intrinsically incomplete.

Secondly, it is known that the definability problem (i.e., requiring $Q(I) = \{a, b\}$ exactly) is coNExpTime-complete for conjunctive queries [3, 15]. On the other hand, SimExp can easily be shown to be in PTime: for conjunctive queries, it is sufficient to take the full join of all tables in the database instance.

Let $Sim(a, b)$ be the set of all similarity explanations for a given $\langle I, a, b \rangle$. $Sim(a, b)$ can be quite large, but we are interested only in the most informative explanations. Our assumption is that the more specific a similarity explanation is, the better.

Definition 3. Given $\langle I, a, b \rangle$, a most specific similarity explanation is a similarity explanation $Q_{sim}^{msp}$ such that for all similarity explanations $Q'_{sim}$ wrt $\langle I, a, b \rangle$: $Q_{sim}^{msp} \subseteq Q'_{sim}$.

The decision problem SimExp$^{msp}$ is the problem of deciding whether $Q_{sim}$ is a most specific similarity explanation for the given $\langle I, a, b \rangle$. The related functional problem is to compute a most specific $Q_{sim}$.

The subsumption relation divides the set of all similarity explanations $Sim(a, b)$ into $\subseteq$-equivalence classes. If $Sim(a, b)$ is not empty, then the set $Sim(a, b)^{msp}$ of most specific similarity explanations is not empty, and there exists a finite most specific similarity explanation $Q_{sim}^{msp}$, whose size is bounded by the size of $I$. This explanation can in fact be constructed in PTime (see Section 4).

3.2 Difference explanations

Analogous to similarity explanations, we model difference explanations as CQs, but this time we require only one of the input entities to be in the answer set.

Definition 4. Given $\langle I, a, b \rangle$, a difference explanation for $a$ wrt $b$ is a unary connected conjunctive query $Q_{dif}^{a}$ such that $a \in Q_{dif}^{a}(I)$, but $b \notin Q_{dif}^{a}(I)$.

The notion of a difference explanation can be generalized to sets of entities: given an RDF graph $I$, a set of entities $Pos$ and a set of entities $Neg$, a difference explanation for $I$ and $Pos$ wrt $Neg$ is a unary connected CQ $Q_{dif}^{Pos}$ such that $\forall p \in Pos: p \in Q_{dif}^{Pos}(I)$ and $Neg \cap Q_{dif}^{Pos}(I) = \emptyset$.

Given $\langle I, a, b \rangle$, DifExp is the problem of deciding whether there exists a difference explanation $Q_{dif}^{a}$. The generalized difference explanation problem DifExp can be solved using the most specific similarity explanation problem SimExp$^{msp}$: given $I$, $Pos$ and $Neg$, first construct a most specific similarity explanation $Q$ for entities in $Pos$ (done in PTime), and then check whether none of the elements of $Neg$ are in the answer set of $Q$ (conjunctive query evaluation is NP-complete). Hence, the complexity of generalized DifExp is NP-complete.

Furthermore, we would like to introduce another definition of a difference explanation that is dependent on the similarities between $a$ and $b$. We would like the difference explanation for $a$ to be as relevant as possible, hence we model it to be dependent not only on the information about $b$, but also on the common patterns for $a$ and $b$. One possible way to do so is the following: let $const(Q)$ be the set of constants appearing in a query $Q$ and let $const(R(\bar{x}))$ be the set of constants appearing in an atom $R(\bar{x})$.

Definition 5. Given $\langle I, a, b \rangle$, a difference explanation for $a$ wrt $b$ and $Q_{sim}$ is a difference explanation $Q_{dif}^{a,sim}$ such that $\forall R(\bar{x}) \in Q_{dif}^{a,sim}: const(R(\bar{x})) \cap const(Q_{sim}) \neq \emptyset$.

Example 2. Let the input entities be $a$ = John Travolta and $b$ = Quentin Tarantino. Let $Q_{sim}$ for $a$ and $b$ be an explanation that both persons starred in Pulp Fiction: $Q_{sim}(X)$ = starredIn(X, Pulp Fiction). Relevant difference explanations could be that Travolta also starred in Grease and other movies, while Tarantino has directed several movies, including Pulp Fiction: $Q_{dif}^{a}(X)$ = starredIn(X, Grease) and $Q_{dif}^{b}(X)$ = directed(X, Pulp Fiction). On the other hand, an explanation that Travolta (unlike Tarantino) is married to Kelly Preston is rather irrelevant, since we have not compared the two persons with respect to their marital status.
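The two conditions can be checked mechanically on the facts of Example 2. The sketch below is only an illustration: the triple encoding, the function names and the brute-force evaluator are assumptions, and so is the choice to count predicate names as constants. That choice seems to be what Example 2 needs, since starredIn(X, Grease) shares only the predicate starredIn with $Q_{sim}$; the flag `with_predicates` lets one see the effect of the stricter reading.

```python
from itertools import product


def is_var(term):
    return term.startswith("?")


def answers(atoms, free_var, graph):
    """Answer set of a CQ, by brute force over all variable assignments."""
    variables = sorted({t for _, s, o in atoms for t in (s, o) if is_var(t)})
    domain = sorted({x for _, s, o in graph for x in (s, o)})
    found = set()
    for values in product(domain, repeat=len(variables)):
        env = dict(zip(variables, values))
        # env.get(t, t) maps a variable to its value and leaves constants as is.
        if all((p, env.get(s, s), env.get(o, o)) in graph for p, s, o in atoms):
            found.add(env[free_var])
    return found


def is_difference_explanation(atoms, free_var, graph, a, b):
    """Definition 4 (without the connectedness check): a is an answer, b is not."""
    result = answers(atoms, free_var, graph)
    return a in result and b not in result


def constants(atoms, with_predicates=True):
    """Constants of a query; predicates are included only under our assumption."""
    consts = {t for _, s, o in atoms for t in (s, o) if not is_var(t)}
    if with_predicates:
        consts |= {p for p, _, _ in atoms}
    return consts


def is_relevant(dif_atoms, sim_atoms, with_predicates=True):
    """Definition 5: every atom of the difference query shares a constant with Q_sim."""
    sim_consts = constants(sim_atoms, with_predicates)
    return all(constants([atom], with_predicates) & sim_consts for atom in dif_atoms)


# Facts mentioned in Example 2, encoded as a toy graph.
graph = {
    ("starredIn", "John_Travolta", "Pulp_Fiction"),
    ("starredIn", "Quentin_Tarantino", "Pulp_Fiction"),
    ("starredIn", "John_Travolta", "Grease"),
    ("directed", "Quentin_Tarantino", "Pulp_Fiction"),
    ("isMarriedTo", "John_Travolta", "Kelly_Preston"),
}
q_sim = [("starredIn", "?X", "Pulp_Fiction")]
q_dif_a = [("starredIn", "?X", "Grease")]
q_dif_b = [("directed", "?X", "Pulp_Fiction")]
q_married = [("isMarriedTo", "?X", "Kelly_Preston")]
```

Under the stated assumption, `q_dif_a` and `q_dif_b` pass both checks, while `q_married` is a difference explanation that fails the relevance condition.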

4. TECHNICAL RESULTS

4.1 Algorithm for computing a most specific similarity explanation

We compute a most specific similarity explanation by constructing the direct product of the RDF graph with itself, similar to the construction of the direct product of a database instance with itself [15]. Any RDF graph $I$ and entity $a \in dom(I)$ can be associated with a canonical unary conjunctive query $q_I(x_a)$ such that for each fact $R(c, d)$ in $I$ there is an atom $R(x_c, x_d)$ in $q_I$, where $x_c$ and $x_d$ are variables and $x_a$ is a free variable. Note that $a$ is an answer to $q_I(x_a)$ over $I$. The following algorithm produces a most specific similarity explanation. In it, we first produce an instance with the domain from $dom(I)^2$, i.e., tuples $\langle c, d \rangle$ for $c, d \in dom(I)$, and then construct a canonical conjunctive query of this instance.

Algorithm 1: Algorithm for computing a most specific similarity explanation
Input: an RDF graph $I$, entities $a, b$ from $dom(I)$.
Output: a most specific similarity explanation for $a$ and $b$.
 1  Let $J = \{ R(\langle a, b \rangle, \langle c, d \rangle) \mid R(a, c), R(b, d) \in I \} \cup \{ R(\langle c, d \rangle, \langle a, b \rangle) \mid R(c, a), R(d, b) \in I \}$;
 2  if $J = \emptyset$ then
 3      return empty query;
 4  Let $J' = \emptyset$;
 5  while $J \neq J'$ do
 6      $J' := J$;
 7      $J := J \cup \{ R(\langle c, d \rangle, \langle e, f \rangle) \notin J \mid R(c, e), R(d, f) \in I$, and $\exists$ a fact in $J$ that contains $\langle c, d \rangle$ or $\langle e, f \rangle \}$;
 8  Construct $q_J(x_{\langle a, b \rangle})$;
 9  foreach $x_{\langle c, c \rangle}$ in $q_J$, $c \notin \{a, b\}$ do
        // Replace $x_{\langle c, c \rangle}$ with constant $c$
10      $q_J(x_{\langle a, b \rangle}) := q_J(x_{\langle a, b \rangle})[x_{\langle c, c \rangle} \to c]$;
11  return $q_J(x_{\langle a, b \rangle})$.

Claim 1. If $J \neq \emptyset$, then $J$ is a maximal connected component of $I \otimes I$ such that $a \otimes b = \langle a, b \rangle \in dom(J)$.

Proof sketch: Firstly, if $J \neq \emptyset$, then $\langle a, b \rangle \in dom(J)$, by Step 1. Secondly, the while-loop on Step 5 is in fact the greedy procedure that generates the maximal connected component in $I \otimes I$. Indeed, the condition $R(c, e), R(d, f) \in I$ ensures that the fact $R(\langle c, d \rangle, \langle e, f \rangle)$ is in $I \otimes I$, and the condition that there must exist a fact in $J$ that contains $\langle c, d \rangle$ or $\langle e, f \rangle$ ensures connectedness.

Claim 2. Let $\langle I, a, b \rangle$ be an input of Algorithm 1. Let $q_J(x_{\langle a, b \rangle})$ be the output, and $J$ the instance obtained after the while loop on Step 5. Then all of the following hold.
(i) $\{a, b\} \subseteq q_J(I)$.
(ii) For a connected unary conjunctive query $q'(x)$, if there exist homomorphisms $h_1, h_2 : q' \to I$ such that $h_1(x) = a$ and $h_2(x) = b$, then there exists a homomorphism $h : q' \to J$ such that $h(x) = \langle a, b \rangle$.

Corollary 1. Algorithm 1 produces a most specific similarity explanation.
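The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not a reference implementation: the triple encoding, the function name, the `?v0`, `?v1`, ... variable naming and the toy facts are assumptions. For brevity the sketch first materialises all facts of the product of the graph with itself and then collects the connected component around $\langle a, b \rangle$, whereas the pseudocode adds product facts incrementally; both yield the same instance $J$.

```python
def most_specific_similarity(graph, a, b):
    """Sketch of Algorithm 1; returns (atoms, free_variable), or None for the empty query."""
    # Group edges by predicate so that only same-predicate triples are paired.
    edges = {}
    for p, s, o in graph:
        edges.setdefault(p, []).append((s, o))
    # All facts of the direct product of the graph with itself.
    product_facts = {
        (p, (s1, s2), (o1, o2))
        for p, pairs in edges.items()
        for s1, o1 in pairs
        for s2, o2 in pairs
    }
    # Step 1: product facts that mention the pair <a, b>.
    start = (a, b)
    J = {f for f in product_facts if start in (f[1], f[2])}
    if not J:  # Steps 2-3
        return None
    # Steps 4-7: grow J until no product fact touches its domain any more.
    while True:
        nodes = {n for _, s, o in J for n in (s, o)}
        new = {f for f in product_facts - J if f[1] in nodes or f[2] in nodes}
        if not new:
            break
        J |= new
    # Steps 8-10: canonical query; <c, c> becomes the constant c unless c is a or b.
    variables = {}

    def term(node):
        c, d = node
        if c == d and c not in (a, b):
            return c
        return variables.setdefault(node, f"?v{len(variables)}")

    free_var = term(start)  # numbered first, so the free variable is always "?v0"
    atoms = [(p, term(s), term(o)) for p, s, o in sorted(J)]
    return atoms, free_var


# Toy facts, invented for illustration.
graph = {
    ("hasWonPrize", "Marilyn_Monroe", "Golden_Globe"),
    ("hasWonPrize", "Elizabeth_Taylor", "Golden_Globe"),
    ("actedIn", "Marilyn_Monroe", "Film_A"),
    ("actedIn", "Elizabeth_Taylor", "Film_B"),
    ("isLocatedIn", "Film_A", "United_States"),
    ("isLocatedIn", "Film_B", "United_States"),
    ("diedIn", "Marilyn_Monroe", "Los_Angeles"),
}
result = most_specific_similarity(graph, "Marilyn_Monroe", "Elizabeth_Taylor")
```

On this toy graph the shared constants `Golden_Globe` and `United_States` connect every product fact to $\langle a, b \rangle$, so the returned query contains one atom per product fact and is far from minimal.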

4.2 Properties of the resulting query

Algorithm 1 runs in time polynomial in the size of the input RDF graph, and the size of the resulting most specific similarity explanation is also polynomial in the size of $I$. It should be noted that the output query tends to be non-minimal. For example, since Marilyn Monroe and Elizabeth Taylor acted in several movies that were shot in the US, $Q(X)$ will contain atoms like: actedIn(X, Y1), isLocatedIn(Y1, United States), actedIn(X, Y2), isLocatedIn(Y2, United States), actedIn(X, Y3), isLocatedIn(Y3, United States), etc. To avoid such redundancy, we can take the core of the query (i.e., apply the query minimization algorithm). Taking the core is an NP-complete problem [9, 11], hence, obtaining a most specific similarity explanation without redundant atoms is an NP-complete task.
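The redundancy in the example above can be removed with the textbook greedy minimization: drop an atom whenever the current query still maps homomorphically into the remaining atoms with the free variable fixed. The sketch below is a naive illustration under assumed names and an assumed triple encoding; its homomorphism test is a plain backtracking search and therefore exponential in the worst case.

```python
def is_var(term):
    return term.startswith("?")


def bind(binding, term, value):
    """Map a query term onto a target term, extending `binding` in place."""
    if is_var(term):
        return binding.setdefault(term, value) == value
    return term == value


def has_homomorphism(atoms, target, binding):
    """True if all `atoms` can be mapped into `target` consistently with `binding`."""
    if not atoms:
        return True
    (pred, s, o), rest = atoms[0], atoms[1:]
    for p, ts, to in target:
        if p != pred:
            continue
        extended = dict(binding)
        if bind(extended, s, ts) and bind(extended, o, to):
            if has_homomorphism(rest, target, extended):
                return True
    return False


def minimize(atoms, free_var):
    """Greedily remove atoms while the query stays equivalent to the original."""
    core = list(atoms)
    for atom in atoms:
        candidate = [x for x in core if x != atom]
        if not candidate or len(candidate) == len(core):
            continue
        # Variables of the candidate act as constants; the free variable is fixed.
        if has_homomorphism(core, candidate, {free_var: free_var}):
            core = candidate
    return core


query = [
    ("actedIn", "?X", "?Y1"), ("isLocatedIn", "?Y1", "United_States"),
    ("actedIn", "?X", "?Y2"), ("isLocatedIn", "?Y2", "United_States"),
    ("actedIn", "?X", "?Y3"), ("isLocatedIn", "?Y3", "United_States"),
]
core = minimize(query, "?X")
```

For the six atoms above, the sketch keeps a single actedIn/isLocatedIn pair.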

5. RELATED WORK

So far only a few works have studied explanations over RDF graphs [5, 10, 12], and there is no single formal definition of an explanation over RDF data. A lot of attention has been paid to discovering connections ("associations") between nodes [12], which boils down to finding and grouping together paths in the graph that connect one input node to another one. Such connectedness explanations are orthogonal (rather than alternative) to the similarity explanations modelled as queries, which we propose to study. The two types of explanations are intended to capture different relations between nodes: the former explore possible paths that link the two nodes together, while the latter seek to find commonalities in the neighbourhoods of the input nodes.

The problem of reverse engineering a query given some examples originated in the late 1970s and was first introduced for the domain of relational databases [20]. Later it was extensively researched with respect to different query formats: regular languages [1], XML queries [7], relational database queries [16, 17, 19], graph database queries [4] and SPARQL queries [2]. The problem of query reverse engineering (QRE) for RDF data was first studied by Arenas et al. [2] and was implemented by Diaz et al. [8]. In [2], the authors consider three different variations of the QRE problem: the basic variation that requires the input mappings $\Omega$ to be part of the answer set ($\Omega \subseteq [\![Q]\!]_G$); the one that allows positive examples $\Omega$ together with negative examples $\bar{\Omega}$ (such that $\Omega \subseteq [\![Q]\!]_G$ and $\bar{\Omega} \cap [\![Q]\!]_G = \emptyset$); and the variation that requires the examples from $\Omega$ to be exactly the answer set of $Q$ ($\Omega = [\![Q]\!]_G$). The complexity of these three variations is then provided for fragments of SPARQL with AND, FILTER and OPT.

6. FUTURE WORK

As part of my PhD, I would like to continue studying the problem of entity comparison using RDF graphs in several research directions. So far we have investigated similarity and difference explanations, and we rank the former according to the preference condition based on subsumption. In particular, we assume that the highest ranked explanations are most specific similarity explanations. On the one hand, we would like to apply a similar rationale to difference explanations and to study most general difference explanations as most preferred ones. On the other hand, these may not be the optimal choices for a given user, hence we need to investigate other possible ranking conditions as well as means of user-specific ranking of explanations.

RDF graphs are inherently incomplete, hence it would be useful to consider a scenario where an explanation is produced over an RDF graph and a domain ontology that contains knowledge not explicitly present in the graph. Consider a graph $G$ consisting of two facts, Teacher(Bob) and teaches(Alice, CS), and a simple $\mathcal{EL}$ ontology $O$ consisting of one axiom: $\exists teaches.Class \sqsubseteq Teacher$. Let the two input entities be Alice and Bob. Then a similarity explanation wrt $G, O$ could be $Q(X)$ = Teacher(X), while we are unable to generate such a CQ using only graph data.
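One simple way to picture this scenario is to saturate the graph with the ontology before looking for explanations. The Python sketch below handles only axioms of the shape $\exists r.C \sqsubseteq D$ and is an assumption-laden illustration: the encoding of unary facts as `rdf:type` triples, the function names and the saturation strategy are not from the text. Note also that the axiom fires for Alice only if CS is known to be a Class, so the sketch adds Class(CS) as an extra fact; that fact is an assumption and is not among the two facts listed for $G$.

```python
TYPE = "rdf:type"  # unary facts such as Teacher(Bob) are encoded as type triples


def saturate(graph, axioms):
    """Apply axioms (role, filler, superclass), read as
    'exists role.filler is subsumed by superclass', until nothing new is derived."""
    facts = set(graph)
    changed = True
    while changed:
        changed = False
        for role, filler, superclass in axioms:
            for p, s, o in list(facts):
                if p == role and (TYPE, o, filler) in facts:
                    derived = (TYPE, s, superclass)
                    if derived not in facts:
                        facts.add(derived)
                        changed = True
    return facts


def instances(facts, cls):
    """Answers of the single-atom query Q(X) = cls(X)."""
    return {s for p, s, o in facts if p == TYPE and o == cls}


G = {
    (TYPE, "Bob", "Teacher"),
    ("teaches", "Alice", "CS"),
    (TYPE, "CS", "Class"),  # assumption: needed for the axiom to apply to Alice
}
O = [("teaches", "Class", "Teacher")]

teachers_before = instances(G, "Teacher")
teachers_after = instances(saturate(G, O), "Teacher")
```

Before saturation only Bob is an answer to Teacher(X); after saturation both Alice and Bob are, so Teacher(X) becomes a similarity explanation for the pair.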

In our framework explanations are modelled as CQs, and while CQs are formulas with relatively high readability for a user, it is of interest to be able to verbalize explanations, transforming them into natural language sentences. For example, a formal explanation $Q(X)$ = livesIn(X, London), friendsWith(X, Y), worksAt(Y, Oracle) could be transformed into an English sentence "Both input entities live in London and are friends with someone who works at Oracle".
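A very simple template-based approach already reproduces the sentence above. The following Python sketch is only one possible starting point: the phrase table, the function names and the restriction to tree-shaped queries whose atoms point away from the free variable are all assumptions of this example.

```python
def describe(var, atoms, phrases, plural):
    """Describe the outgoing atoms of `var`; assumes a tree-shaped (acyclic) query."""
    parts = []
    for pred, subj, obj in atoms:
        if subj != var:
            continue
        verb = phrases[pred][0 if plural else 1]
        if obj.startswith("?"):
            # An existential variable is rendered as "someone who ...".
            nested = describe(obj, atoms, phrases, plural=False)
            parts.append(f"{verb} someone who {nested}")
        else:
            parts.append(f"{verb} {obj}")
    return " and ".join(parts)


def verbalize(atoms, free_var, phrases):
    return f"Both input entities {describe(free_var, atoms, phrases, plural=True)}."


# Hypothetical phrase table: predicate -> (plural form, singular form).
PHRASES = {
    "livesIn": ("live in", "lives in"),
    "friendsWith": ("are friends with", "is friends with"),
    "worksAt": ("work at", "works at"),
}
query = [
    ("livesIn", "?X", "London"),
    ("friendsWith", "?X", "?Y"),
    ("worksAt", "?Y", "Oracle"),
]
sentence = verbalize(query, "?X", PHRASES)
```

Cyclic queries, incoming edges and variables without any further atoms are not handled and would need a more careful treatment.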

While CQs correspond to a large part of queries issued over relational databases, i.e., they have relatively high expressivity, they cannot express negation or disjunction, which is a limitation. Hence, an interesting problem would be to consider more expressive languages, in particular, unions of CQs and CQs with inequalities and numeric comparisons.

Lastly, we are planning to implement a comprehensive comparison system that would compute most specific similarity explanations and most general difference explanations, to test it on real-world RDF graphs and to perform usability tests.

REFERENCES

[1] Angluin. Queries and concept learning. Machine Learning, 2(4):319-342, 1988.

[2] Arenas, G. I. Diaz, and E. V. Kostylev. Reverse engineering SPARQL queries. In Proc. of the 25th Int. Conf. on World Wide Web, pages 239-249, 2016.

[3] Barcelo and Romero. The complexity of reverse engineering problems for conjunctive queries. In Proc. of 20th Int. Conf. on Database Theory, 2017.

[4] Bonifati, Ciucanu, and Lemay. Learning path queries on graph databases. In 18th Int. Conf. on Extending Database Technology (EDBT), 2015.

[5] G. Cheng, Y. Zhang, and Qu. Explass: exploring associations between entities via top-k ontological patterns and facets. In International Semantic Web Conference, pages 422-437. Springer, 2014.

[6] S.-S. Choi, S.-H. Cha, and C. C. Tappert. A survey of binary similarity and distance measures. J. Systemics, Cybernetics and Informatics, 8(1):43-48, 2010.

[7] Cohen and Y. Y. Weiss. Learning tree patterns from example graphs. In LIPIcs-Leibniz International Proceedings in Informatics, volume 31, 2015.

[8] Diaz, Arenas, and Benedikt. SPARQLByE: Querying RDF data by example. Proceedings of the VLDB Endowment, 9(13), 2016.

[9] Gottlob and Nash. Efficient core computation in data exchange. Journal of the ACM, 55(2):9, 2008.

[10] Heim, Hellmann, Lehmann, Lohmann, and Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In Int. Conf. on Semantic and Digital Media Technologies, pages 182-187, 2009.

[11] Hell and J. Nesetril. The core of a graph. Discrete Mathematics, 109(1):117-126, 1992.

[12] Lehmann, J. Schuppel, and Auer. Discovering unknown connections - the DBpedia relationship finder. CSSW, 113:99-110, 2007.

[13] Marchionini. Exploratory search: from finding to understanding. Communications of the ACM, 49(4):41-46, 2006.

[14] F. M. Suchanek, G. Kasneci, and Weikum. Yago: A large ontology from wikipedia and wordnet. Web Semantics: Science, Services and Agents on the World Wide Web, 6(3):203-217, 2008.

[15] B. ten Cate and Dalmau. The Product Homomorphism Problem and Applications. In 18th International Conference on Database Theory (ICDT 2015), pages 161-176, 2015.

[16] Q. T. Tran, C.-Y. Chan, and Parthasarathy. Query by output. In Proc. of the 2009 ACM SIGMOD Int. Conf. on Management of Data, pages 535-548, 2009.

[17] Q. T. Tran, C.-Y. Chan, and Parthasarathy. Query reverse engineering. The VLDB Journal, 23(5):721-746, 2014.

[18] R. W. White and R. A. Roth. Exploratory search: beyond the query-response paradigm, 2009.

[19] Zhang, Elmeleegy, C. M. Procopiuc, and Srivastava. Reverse engineering complex join queries. In Proc. of the 2013 ACM SIGMOD Int. Conf. on Management of Data, pages 809-820, 2013.

[20] M. M. Zloof. Query-by-example: A data base language. IBM Systems Journal, 16(4):324-343, 1977.