<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Top-K Diversification for Path Queries in Knowledge Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christian Aebeloe</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vinay Setty</string-name>
          <email>vinay@cs.aau.dk</email>
          <email>vsetty@acm.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriela Montoya</string-name>
          <email>gmontoya@cs.aau.dk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katja Hose</string-name>
          <email>khose@cs.aau.dk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aalborg University</institution>
          ,
          <country country="DK">Denmark</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Stavanger</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>To explore the relationships between entities in RDF graphs, property path queries were introduced in SPARQL 1.1. However, existing RDF engines return only reachability of the entities ignoring the intermediate nodes in the path. If the paths are output, they are too many, which makes it difficult for users to find the most relevant paths. To address this issue, we propose a generalized topk ranking technique that balances the trade-off between relevance and diversity. We propose a shortest path based relevance scoring in combination with several path similarity measures for diversification. With preliminary experiments and examples, we show that our diversification strategies provide more novel paths as our technique prioritizes diversity over path length.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Knowledge Graphs (KGs) have become a popular way to represent and query world
knowledge. KGs such as YAGO [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and DBPedia [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] are extensively used for tasks, such
as question answering, knowledge exploration, and reasoning. RDF3 and SPARQL4 are
standards for representing and querying KGs, recommended by the W3C and widely
adopted by the Semantic Web community.
      </p>
      <p>
        SPARQL 1.15 introduces property path queries to check only the transitive
reachability of entities via paths of specific properties while omitting the actual paths
connecting the entities. For example, a property path query with the clause “Alan_Turing
linksTo* Princeton_University” checks if Princeton_University is
transitively reachable from Alan_Turing via any number of linksTo property
relationships but omits the intermediate entities in the path. To address this issue, we have
proposed an additional operator ! that returns the full paths rather than only
performing reachability checks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Since enumerating all the paths may be overwhelming, we
need ranking techniques to select the top-k most informative paths for the users.
      </p>
      <p>
        Related work on ranking and diversifying property path query results is currently
very limited as supporting reachability checks already is a very challenging task. Some
recent works rank the paths between entities based on their lengths and frequency of
3 https://www.w3.org/RDF/
4 https://www.w3.org/TR/rdf-sparql-query/
5 https://www.w3.org/TR/sparql11-query/
resources [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. However, they only support queries involving a property path with
limited length rather than involving arbitrary length property paths. Therefore, we need a
general top-k ranking technique supporting arbitrary property path queries.
      </p>
      <p>
        Selecting paths based on top-k shortest path ranking is a good strategy in general,
but shortest paths may have several redundant resources (entities and properties).
Therefore, it is also essential to apply diversification techniques to avoid redundancy, while
still minimizing the path lengths. However, existing techniques do not balance the
tradeoff between path lengths and diversity. Path diversification using the Jaccard similarity
measure has been proposed before which works well in some cases [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. But Jaccard
measure treats paths as sets and ignores the order of entities. To the best of our
knowledge, there are no diversification measures considering the semantics of paths.
      </p>
      <p>To address these issues, our contributions in this paper are: (1) we formulate a
generalized top-k ranking technique for property paths that balances the trade-off between
path lengths and diversity. (2) we then propose the Levenshtein similarity measure for
respecting path semantics. (3) we perform preliminary evaluations and compare the
effectiveness of the two diversification strategies using an evaluation metric to measure
novelty and average path lengths.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Diversification Metrics for Property Paths</title>
      <p>Given a property path query Q and the set of all relevant paths R, our goal is to select
S R to maximize the objective given below:</p>
      <p>DivP (R) = arg max X (1</p>
      <p>S R Pi2S
) Rel(Pi; R) +
(1</p>
      <p>Sim(Pi; S)) s:t:jSj = k
(1)</p>
      <p>Where Rel(Pi; R) is a function quantifying the relevance of a path Pi to a query Q
with result set R. Each path P contains a set of entities and property edges connecting
them, which we call “resources”, and each path has a path length represented as jP j,
which corresponds to the total number of resources in the path.</p>
      <p>
        To avoid paths with redundant resources, the objective function in Equation 1 aims
at balancing relevance (Rel) and similarity (Sim) among paths in S. The trade-off
between these two scores is balanced by a user-specified diversification parameter 0
1. If = 0, we rank the paths purely according to their lengths. On the other hand,
if = 1, the k most dissimilar paths according to some similarity measure irrespective
of their path lengths are chosen. Computing a solution for DivP (R) is known to be
NP-hard [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, since the objective follows the submodularity property, there are
known greedy heuristics for solving it approximately [
        <xref ref-type="bibr" rid="ref1 ref7">1, 7</xref>
        ].
      </p>
      <p>
        In this paper, we quantify the relevance relative to the shortest path in the result
set. A path is more relevant if its path length is closer to the shortest path’s length.
We compute the normalized relevance score as Rel(Pi; R) = (minPj2R jPj j)=jPij.
We can also use other scoring models such as weighted path lengths and
informativeness measures [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In this paper, we explore several Sim functions that we can use for
diversification.
      </p>
      <p>Jaccard similarity: The simplest way to quantify the overlap in paths is to treat them
as sets and compute the Jaccard similarity value. Jaccard similarity is defined as:
SimJ (Pi; S) = max jPi \ Pjj (2)</p>
      <p>Pj2S;Pj6=Pi jPi [ Pjj</p>
      <p>
        Since Jaccard similarity captures the overlap in paths by treating them as sets, it
ignores the order of resources in the paths. Intuitively, SimJ cannot distinguish between
two paths with identical resources but connected in a different order. To address this
issue, we need a similarity measure that preserves the order of sequences.
Levenshtein similarity: To quantify the path sequence similarity, we adapt the
Levenshtein distance [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] (LD), which is largely used to quantify the distance between two
strings as the minimal number of insertions, deletions, and substitutions of one character
for another that will transform one string into the other. For diversification of property
paths, in Equation 1, we normalize the similarity value computed from LD as follows:
SimL(Pi; S) = max 1 LD(Pi; Pj ) (3)
      </p>
      <p>Pj2S;Pj6=Pi max(jPij; jPj j)
3</p>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>
        For evaluation we use YAGO [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which contains 1+ billion triples. To evaluate the
effectiveness of the aforementioned diversification metrics, we define an evaluation
metrics inspired by the novelty metric used by Arnaou et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Evaluations are performed
on sets of top-k paths of R, [P1; :::; Pk], sorted according to Equation 1 in descending
order. We define an evaluation metric N ov to capture the novelty based on the fraction
of unseen elements of each path in the top-k results:
      </p>
      <p>N ov(Pi; [P1; :::; Pk]) = jF (Pi) n</p>
      <p>S1 j&lt;i F (Pj )j
jF (Pi)j</p>
      <p>Where F (Pi) is a function for computing the set of elements in the path Pi. Here we
observe that the set of elements of a path can be at the granularity of resources or triples.
The former treats each unseen individual resource such as entities and properties in the
path as a novel element, while the latter considers novel elements at the granularity
of triples. Therefore, we define two variants of N ov–N ovR and N ovT which replace
F (Pi) in Equation 4 with resources(Pi) and triples(Pi) respectively, representing
set of resources and triples in the path Pi respectively. Intuitively, N ovT captures the
novelty of a path respecting semantics of paths.
(4)</p>
      <p>We perform evaluations on 10 randomly chosen queries. Table 1 shows the mean
novelty metrics and path lengths, when varying the -values from 0 to 1.0, with k = 10.
When = 0, we observe that the novelty and the path lenghts are identical for both
approaches. This is expected, as the paths are ranked solely based on the path length
and diversification plays no role.</p>
      <p>Since N ovR measures the fraction of unseen resources in the ranked paths, it is
natural to observe that Jaccard consistently achieves higher N ovR values than
Levenshtein. However, when we consider N ovT , which measures the fraction of unseen
triples, Levenshtein performs better with higher values of (&gt; 0.6). For lower values of
, Jaccard still performs slightly better than Levenshtein. This could be attributed to the
fact that in shorter paths, there are fewer triples and hence the impact of Levenshtein is
not significant. However, this needs further analysis which we leave as a future work.</p>
      <p>In Table 1, we can also observe that until = 0:4 the path lenghts are
identical for both Jaccard and Levenshtein. For &gt; 0:6, Levenshtein, prefers shorter paths
with higher novelty w.r.t N ovT , while Jaccard prefers longer paths because it treats
shorter paths with same resources but connected in different order as similar. Since we
are ranking and diversifying paths this behavior is undesirable. This demonstrates that
Levenshtein is able to achieve the trade-off between path lengths and diversity.</p>
      <p>Due to lack of space we omit results with different values of k and specific
examples, but they can be found on our website, http://qweb.cs.aau.dk/jedi/.</p>
    </sec>
    <sec id="sec-4">
      <title>4 Conclusion</title>
      <p>
        In this paper, we addressed the issue of ranking paths in KGs. We proposed a
generalized top-k diversification technique that is customizable with different relevance and
similarity functions. We proposed a new similarity measure Levenshtein, which respects
the order of nodes in the paths. We conducted preliminary experiments using novelty
metrics and path lengths and show that Levenshtein provides better balancing in
tradeoff between path lengths and diversification than Jaccard. However, Levenshtein does
not always outperform Jaccard and there is a need for further analysis in this regard.
Our diversification technique has been integrated into JEDI [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which can be used to
observe the impact of diversification when evaluating queries over different KGs.
Acknowledgment This research was partially funded by the Danish Council for
Independent Research (DFF) under grant agreement no. DFF-4093-00301 and Aalborg
University’s Talent Management Programme.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>C.</given-names>
            <surname>Aebeloe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Montoya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Setty</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Hose</surname>
          </string-name>
          .
          <article-title>Discovering Diversified Paths in Knowledge Bases</article-title>
          . PVLDB,
          <volume>11</volume>
          (
          <issue>12</issue>
          ),
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>H.</given-names>
            <surname>Arnaout</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Elbassuoni</surname>
          </string-name>
          .
          <article-title>Result Diversity for RDF Search</article-title>
          . In KDIR, pages
          <fpage>249</fpage>
          -
          <lpage>256</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          , G. Kobilarov,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          , and
          <string-name>
            <surname>Z. Ives.</surname>
          </string-name>
          <article-title>DBpedia: A Nucleus for a Web of Open Data</article-title>
          .
          <source>In The Semantic Web</source>
          , pages
          <fpage>722</fpage>
          -
          <lpage>735</lpage>
          . Springer,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. J. Carbonell and J.
          <string-name>
            <surname>Goldstein</surname>
          </string-name>
          .
          <article-title>The use of MMR, diversity-based reranking for reordering documents and producing summaries</article-title>
          .
          <source>In SIGIR'98</source>
          , pages
          <fpage>335</fpage>
          -
          <lpage>336</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>V. I.</given-names>
            <surname>Levenshtein</surname>
          </string-name>
          .
          <article-title>Binary Codes Capable of Correcting Deletions, Insertions and Reversals</article-title>
          .
          <source>Soviet Physics Doklady</source>
          ,
          <volume>10</volume>
          (
          <issue>8</issue>
          ):
          <fpage>707</fpage>
          -
          <lpage>710</lpage>
          ,
          <year>1966</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>F.</given-names>
            <surname>Mahdisoltani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Biega</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          .
          <article-title>YAGO3: A knowledge base from multilingual wikipedias</article-title>
          .
          <source>In CIDR'14</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>G. L.</given-names>
            <surname>Nemhauser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Wolsey</surname>
          </string-name>
          , and
          <string-name>
            <surname>M. L. Fisher.</surname>
          </string-name>
          <article-title>An analysis of approximations for maximizing submodular set functions-</article-title>
          <source>I. Mathematical Programming</source>
          ,
          <volume>14</volume>
          (
          <issue>1</issue>
          ):
          <fpage>265</fpage>
          -
          <lpage>294</lpage>
          ,
          <year>1978</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>G.</given-names>
            <surname>Pirrò</surname>
          </string-name>
          .
          <article-title>Explaining and suggesting relatedness in knowledge graphs</article-title>
          .
          <source>In ISWC</source>
          , pages
          <fpage>622</fpage>
          -
          <lpage>639</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>