Top-K Diversification for Path Queries in Knowledge Graphs Christian Aebeloe1 , Vinay Setty1,2 , Gabriela Montoya1 , and Katja Hose1 1 Aalborg University, Denmark {caebel,vinay,gmontoya,khose}@cs.aau.dk 2 University of Stavanger, Norway vsetty@acm.org Abstract. To explore the relationships between entities in RDF graphs, property path queries were introduced in SPARQL 1.1. However, existing RDF engines return only reachability of the entities ignoring the intermediate nodes in the path. If the paths are output, they are too many, which makes it difficult for users to find the most relevant paths. To address this issue, we propose a generalized top- k ranking technique that balances the trade-off between relevance and diversity. We propose a shortest path based relevance scoring in combination with several path similarity measures for diversification. With preliminary experiments and examples, we show that our diversification strategies provide more novel paths as our technique prioritizes diversity over path length. 1 Introduction Knowledge Graphs (KGs) have become a popular way to represent and query world knowledge. KGs such as YAGO [6] and DBPedia [3] are extensively used for tasks, such as question answering, knowledge exploration, and reasoning. RDF3 and SPARQL4 are standards for representing and querying KGs, recommended by the W3C and widely adopted by the Semantic Web community. SPARQL 1.15 introduces property path queries to check only the transitive reacha- bility of entities via paths of specific properties while omitting the actual paths connect- ing the entities. For example, a property path query with the clause “Alan_Turing linksTo* Princeton_University” checks if Princeton_University is transitively reachable from Alan_Turing via any number of linksTo property re- lationships but omits the intermediate entities in the path. To address this issue, we have proposed an additional operator → that returns the full paths rather than only perform- ing reachability checks [1]. Since enumerating all the paths may be overwhelming, we need ranking techniques to select the top-k most informative paths for the users. Related work on ranking and diversifying property path query results is currently very limited as supporting reachability checks already is a very challenging task. Some recent works rank the paths between entities based on their lengths and frequency of 3 https://www.w3.org/RDF/ 4 https://www.w3.org/TR/rdf-sparql-query/ 5 https://www.w3.org/TR/sparql11-query/ resources [8]. However, they only support queries involving a property path with lim- ited length rather than involving arbitrary length property paths. Therefore, we need a general top-k ranking technique supporting arbitrary property path queries. Selecting paths based on top-k shortest path ranking is a good strategy in general, but shortest paths may have several redundant resources (entities and properties). There- fore, it is also essential to apply diversification techniques to avoid redundancy, while still minimizing the path lengths. However, existing techniques do not balance the trade- off between path lengths and diversity. Path diversification using the Jaccard similarity measure has been proposed before which works well in some cases [8]. But Jaccard measure treats paths as sets and ignores the order of entities. To the best of our knowl- edge, there are no diversification measures considering the semantics of paths. To address these issues, our contributions in this paper are: (1) we formulate a gen- eralized top-k ranking technique for property paths that balances the trade-off between path lengths and diversity. (2) we then propose the Levenshtein similarity measure for respecting path semantics. (3) we perform preliminary evaluations and compare the ef- fectiveness of the two diversification strategies using an evaluation metric to measure novelty and average path lengths. 2 Diversification Metrics for Property Paths Given a property path query Q and the set of all relevant paths R, our goal is to select S ⊆ R to maximize the objective given below: X DivP (R) = arg max (1 − λ) · Rel(Pi , R) + λ · (1 − Sim(Pi , S)) s.t.|S| = k (1) S⊆R Pi ∈S Where Rel(Pi , R) is a function quantifying the relevance of a path Pi to a query Q with result set R. Each path P contains a set of entities and property edges connecting them, which we call “resources”, and each path has a path length represented as |P |, which corresponds to the total number of resources in the path. To avoid paths with redundant resources, the objective function in Equation 1 aims at balancing relevance (Rel) and similarity (Sim) among paths in S. The trade-off between these two scores is balanced by a user-specified diversification parameter 0 ≤ λ ≤ 1. If λ = 0, we rank the paths purely according to their lengths. On the other hand, if λ = 1, the k most dissimilar paths according to some similarity measure irrespective of their path lengths are chosen. Computing a solution for DivP (R) is known to be NP-hard [4]. However, since the objective follows the submodularity property, there are known greedy heuristics for solving it approximately [1, 7]. In this paper, we quantify the relevance relative to the shortest path in the result set. A path is more relevant if its path length is closer to the shortest path’s length. We compute the normalized relevance score as Rel(Pi , R) = (minPj ∈R |Pj |)/|Pi |. We can also use other scoring models such as weighted path lengths and informative- ness measures [8]. In this paper, we explore several Sim functions that we can use for diversification. Jaccard similarity: The simplest way to quantify the overlap in paths is to treat them as sets and compute the Jaccard similarity value. Jaccard similarity is defined as: |Pi ∩ Pj | SimJ (Pi , S) = max (2) Pj ∈S,Pj 6=Pi |Pi ∪ Pj | Since Jaccard similarity captures the overlap in paths by treating them as sets, it ignores the order of resources in the paths. Intuitively, SimJ cannot distinguish between two paths with identical resources but connected in a different order. To address this issue, we need a similarity measure that preserves the order of sequences. Levenshtein similarity: To quantify the path sequence similarity, we adapt the Lev- enshtein distance [5] (LD), which is largely used to quantify the distance between two strings as the minimal number of insertions, deletions, and substitutions of one character for another that will transform one string into the other. For diversification of property paths, in Equation 1, we normalize the similarity value computed from LD as follows: LD(Pi , Pj ) SimL (Pi , S) = max 1− (3) Pj ∈S,Pj 6=Pi max(|Pi |, |Pj |) 3 Evaluation For evaluation we use YAGO [6], which contains 1+ billion triples. To evaluate the ef- fectiveness of the aforementioned diversification metrics, we define an evaluation met- rics inspired by the novelty metric used by Arnaou et al. [2]. Evaluations are performed on sets of top-k paths of R, [P1 , ..., Pk ], sorted according to Equation 1 in descending order. We define an evaluation metric N ov to capture the novelty based on the fraction of unseen elements of each path in the top-k results: S |F(Pi ) \ 1≤j 0.6). For lower values of λ, Jaccard still performs slightly better than Levenshtein. This could be attributed to the fact that in shorter paths, there are fewer triples and hence the impact of Levenshtein is not significant. However, this needs further analysis which we leave as a future work. In Table 1, we can also observe that until λ = 0.4 the path lenghts are identi- cal for both Jaccard and Levenshtein. For λ > 0.6, Levenshtein, prefers shorter paths with higher novelty w.r.t N ovT , while Jaccard prefers longer paths because it treats shorter paths with same resources but connected in different order as similar. Since we are ranking and diversifying paths this behavior is undesirable. This demonstrates that Levenshtein is able to achieve the trade-off between path lengths and diversity. Due to lack of space we omit results with different values of k and specific exam- ples, but they can be found on our website, http://qweb.cs.aau.dk/jedi/. 4 Conclusion In this paper, we addressed the issue of ranking paths in KGs. We proposed a gener- alized top-k diversification technique that is customizable with different relevance and similarity functions. We proposed a new similarity measure Levenshtein, which respects the order of nodes in the paths. We conducted preliminary experiments using novelty metrics and path lengths and show that Levenshtein provides better balancing in trade- off between path lengths and diversification than Jaccard. However, Levenshtein does not always outperform Jaccard and there is a need for further analysis in this regard. Our diversification technique has been integrated into J EDI [1], which can be used to observe the impact of diversification when evaluating queries over different KGs. Acknowledgment This research was partially funded by the Danish Council for In- dependent Research (DFF) under grant agreement no. DFF-4093-00301 and Aalborg University’s Talent Management Programme. References 1. C. Aebeloe, G. Montoya, V. Setty, and K. Hose. Discovering Diversified Paths in Knowledge Bases. PVLDB, 11(12), 2018. 2. H. Arnaout and S. Elbassuoni. Result Diversity for RDF Search. In KDIR, pages 249–256, 2016. 3. S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In The Semantic Web, pages 722–735. Springer, 2007. 4. J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR’98, pages 335–336, 1998. 5. V. I. Levenshtein. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady, 10(8):707–710, 1966. 6. F. Mahdisoltani, J. Biega, and F. Suchanek. YAGO3: A knowledge base from multilingual wikipedias. In CIDR’14, 2014. 7. G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maxi- mizing submodular set functions–I. Mathematical Programming, 14(1):265–294, 1978. 8. G. Pirrò. Explaining and suggesting relatedness in knowledge graphs. In ISWC, pages 622– 639, 2015.