<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Multi-Facet Snippets for Dataset Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xiaxia Wang</string-name>
          <email>xxwang@smail.nju.edu.cn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gong Cheng</string-name>
          <email>gcheng@nju.edu.cn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Evgeny Kharlamov</string-name>
          <email>evgeny.kharlamov@de.bosch.com</email>
          <email>evgeny.kharlamov@ifi.uio.no</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bosch Center for Arti cial Intelligence</institution>
          ,
          <addr-line>Renningen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Informatics, University of Oslo</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>National Key Laboratory for Novel Software Technology, Nanjing University</institution>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Due to a recent signi cant increase in the number of RDF datasets available on the Web, there is a pressing need in e ective search techniques for nding the right data on demand. A promising approach is to present retrieved datasets as snippets that aim at concisely explaining to the user why this dataset ful ls their demand. Snippets in particular can illustrate the main content of the dataset and explain its relevance to the user's query. Computing optimal snippets is a non-trivial task and a number of approaches have emerged to address this problem. In this short paper, we report our ongoing work on snippets that address multiple facets of optimality. Based on our recently proposed evaluation metrics for dataset snippets, we formulate a weighted maximum coverage problem which directly optimizes three evaluation metrics. We solve the problem with a greedy algorithm, and our current implementation has outperformed four baseline methods.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        The open data movement brings increasingly many datasets to the Web, many
of which are in the RDF format. Reusing these datasets is of great importance to
researchers and developers. In order to enable the reuse there is a pressing need
in e ective search techniques for nding the right data on demand. A promising
approach is to query for datasets with keywords as in Google Dataset Search [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
and to present each retrieved RDF dataset as a snippet, its small representative
subset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Dataset snippets aim at concisely explaining to the user why this
dataset ful ls their demand and in particular can illustrate the main content of
the dataset and explain its relevance to the user's query.
      </p>
      <p>
        Computing optimal snippets is a non-trivial task and a number of approaches
have emerged to address this or related problems [2{6]. In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we presented four
metrics for evaluating the quality of a dataset snippet. In this short paper, we
report our ongoing work on snippets that address multiple facets of optimality.
In particular, in order to improve the quality of a snippet for dataset search, we
Copyright c 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
formulate the selection of RDF triples as a combinatorial optimization problem
that directly optimizes three evaluation metrics proposed in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. A dataset
snippet generated by our approach, which we refer to as KSD, is expected to have
a good coverage of the query Keywords and the content of the dataset at both
the Schema and the Data level. We solve the problem with a greedy algorithm,
and our evaluation demonstrates that KSD outperforms the baselines reported
in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and that there is still a considerable room for quality improvement.
      </p>
      <p>
        The remainder of this paper is structured as follows. Section 2 de nes the
problem and reviews the evaluation metrics proposed in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Section 3 describes
the implementation of KSD. Section 4 presents evaluation results. Section 5
concludes the paper with future work.
2
2.1
      </p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <sec id="sec-2-1">
        <title>Problem Statement</title>
        <p>An RDF dataset is a set of RDF triples denoted by T = ft1; t2; ; tng, where
ts; tip; tioi is a subject-predicate-object triple of RDF resources. The
each ti = h i
subject tis of a triple ti is an entity (i.e., a non-literal resource at the instance
level) that appears in T . The predicate tip represents a property. The object tio
is a value of tip, which can be a class, a literal, or another entity in T .</p>
        <p>A keyword query is a set of keywords denoted by Q = fq1; q2; ; qmg. Given
a dataset T , a keyword query Q, and a positive integer k as the size bound, a
dataset snippet is an optimum subset of triples selected from T , denoted by S
T , satisfying jSj k. We will give our de nition of optimality in Section 3.
2.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>Evaluation Metrics</title>
        <p>
          We brie y review the four metrics proposed in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] for evaluating the quality of
a dataset snippet S: coKyw, coCnx, coSkm, and coDat, all in the range of [0; 1].
        </p>
        <p>Coverage of Query Keywords (coKyw). A resource r covers a keyword q
if r's textual form (e.g., rdfs:label of an IRI or blank node, lexical form of a
literal) contains a keyword match for q. A triple t covers a keyword q, denoted
by t q, if r covers q for any r 2 fts; tp; tog. For a snippet S, the coKyw metric
evaluates its coverage of query keywords:
1
jQj
coKyw(S) =
jfq 2 Q : 9t 2 S; t
qgj :</p>
      </sec>
      <sec id="sec-2-3">
        <title>Coverage of Connections between Query Keywords (coCnx). A snip</title>
        <p>pet S covers the connection between two keywords qi; qj 2 Q, denoted by
S (qi; qj), if there is a path in the RDF graph representation of S that
connects two resources: one covering qi and the other covering qj. For S, the coCnx
metric evaluates its coverage of connections between query keywords:
coCnx(S) =
( 1
(j Q2j) jffqi; qjg
coKyw(S)</p>
        <p>Q : qi 6= qj and S
(qi; qj)gj if jQj &gt; 1 ;
if jQj = 1 :
(1)
(2)
(3)
(4)
(6)
coSkm(S) = hm(</p>
        <p>frqCls(c);</p>
        <p>X
c2Cls(S)</p>
        <p>X
p2Prp(S)
frqPrp(p)) ;
(5)
where Cls(S) is the set of classes instantiated in S and Prp(S) is the set of
properties instantiated in S.</p>
        <p>Coverage of Data (coDat). Central entities represent the key content of a
dataset. Let d+(e) and d (e) be the out-degree and in-degree of an entity e in
the RDF graph representation of a dataset T , respectively. For a snippet S, its
coverage of the entities in T is the harmonic mean (hm) of the mean normalized
out-degree and in-degree of the entities it contains:
coDat(S) = hm(
1
1</p>
        <p>X</p>
        <p>X
jEnt(S)j e2Ent(S) maxe02Ent(T ) log(d+(e0) + 1)
jEnt(S)j e2Ent(S) maxe02Ent(T ) log(d (e0) + 1)
log(d+(e) + 1)
log(d (e) + 1)
;
) ;
When there is only one keyword, coCnx is meaningless and we set it to coKyw.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Coverage of Data Schema (coSkm). Consider the RDF schema of a</title>
        <p>dataset. The relative frequency of a class c observed in a dataset T is
frqCls(c) = jft 2 T : tp = rdf:type and to = cgj :</p>
        <p>jft 2 T : tp = rdf:typegj
Analogously, the relative frequency of a property p observed in T is
frqPrp(p) = jft 2 T : tp = pgj :
jT j</p>
        <p>For a snippet S, its coverage of the schema of T is the harmonic mean (hm)
of the total relative frequency of the classes and properties it contains:
where Ent(X) is the set of entities that appear in a set of triples X.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Approach</title>
      <p>Given the evaluation metrics presented in Section 2.2, a straightforward idea is to
formulate the selection of RDF triples as a combinatorial optimization problem,
and directly optimize these evaluation metrics. Our current work considers three
metrics: coKyw, coSkm, and coDat, leaving coCnx as future work. The three
selected metrics all require a snippet to cover some elements: query keywords
in coKyw, classes and properties in coSkm, and entities in coDat. Furthermore,
the classes, properties, and entities to cover are with di erent weights. It inspires
us to formulate an instance of the weighted maximum coverage problem. We
formalize this idea in Section 3.1, and present a solution in Section 3.2.</p>
      <p>Algorithm 1 Greedy Algorithm
Input: A dataset T , a keyword query Q, and a size bound k
Output: An optimum dataset snippet S T
1: S ;;
2: while jSj &lt; k do
3: t argmaxt2(T nS)(q(S [ ftg) q(S));
4: S S [ ft g;
5: end while
6: return S;
3.1</p>
      <sec id="sec-3-1">
        <title>Snippet Generation as Weighted Maximum Coverage</title>
        <p>Weighted Maximum Coverage. Given a collection of sets, a weighted
maximum coverage (WMC) problem is to select a limited number of sets from the
collection such that the total weight of the covered elements is maximized.</p>
        <p>Snippet Generation as WMC. We formulate the generation of an
optimum dataset snippet as an instance of the WMC problem. Each RDF triple
ti 2 T corresponds to a set denoted by cov(ti) which consists of: the query
keywords covered by ti, the class instantiated in ti, the property instantiated in ti,
and the entities that appear in ti. The universe of elements is denoted by
= Q [ Cls(T ) [ Prp(T ) [ Ent(T ) :
Each element x 2</p>
        <p>has a non-negative weight:
8
&gt;
&gt;
&gt;
&gt;
w(x) = &lt;
&gt;
&gt;
&gt;
&gt;
:
jQ1 j x 2 Q ;
frqCls(x) x 2 Cls(T ) ;
frqPrp(x) x 2 Prp(T ) ;</p>
        <p>log(d+(x)+1) log(d (x)+1)
( Pe2Ent(T) log(d+(e)+1) + Pe2Ent(T) log(d (e)+1) ) x 2 Ent(T ) ;
(7)
(8)
In our experiments, we set = 2; = 1; = 1, to balance between the
coverage of query keywords in coKyw ( ), the coverage of classes and properties in
coSkm ( ), and the coverage of entities in coDat ( ) in our objective function.</p>
        <p>An optimum dataset snippet S T is one that
maximizes q(S) =
w(x) ; subject to jSj
k ;
(9)</p>
        <p>X
x2Sti2S cov(ti)
where k is a prede ned size bound, and q( ) is the objective function.
3.2</p>
      </sec>
      <sec id="sec-3-2">
        <title>Solution</title>
        <p>Algorithm 1 presents the greedy algorithm for the WMC problem which
at each stage chooses a set that contains the maximum weight of uncovered
elements. It achieves an approximation ratio of 1 1e .
IlluSnip
TA+C
PrunedDP++
CES
KSD
coKyw coCnx coSkm coDat Average
0.1000 0.0540 0.6820 0.3850 0.3053
0.9590 0.4703 0.0425 0.0915 0.3908</p>
        <p>1 1 0.0898 0.2133 0.5758
0.9006 0.3926 0.3668 0.2684 0.4821
0.8352 0.3595 0.8651 0.4247 0.6211</p>
        <p>coKyw coCnx coSkm coDat Average
data.gov.uk 0.7643 0.2882 0.8249 0.3870 0.5661
DMOZ-1 0.8977 0.7955 0.8873 0.4726 0.7633
DMOZ-2 0.8433 0.2444 0.8710 0.4569 0.6039
DMOZ-3 0.8395 0.2337 0.8693 0.4145 0.5893
DMOZ-4 0.7936 0.1877 0.8521 0.3731 0.5516</p>
        <p>Assuming q(S [ ftg) q(S) is computed in O(1), the overall running time of
a naive implementation of the algorithm is O(k n), where n is the number of
RDF triples in T . A more e cient implementation may use a priority queue to
hold candidate triples, which is left as our future work.
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>
        Our evaluation reused the 387 query-dataset pairs in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] where datasets were
collected from DataHub and queries included 42 real queries submitted to data.gov.uk
and 345 arti cial queries comprising i category names in DMOZ referred to as
DMOZ-i for i = 1; 2; 3; 4. We compared our proposed KSD with four baseline
methods evaluated in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], namely IlluSnip [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], TA+C [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], PrunedDP++ [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and
CES [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Following [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we set k = 20, i.e., a snippet contained at most 20 triples.
4.1
      </p>
      <sec id="sec-4-1">
        <title>Quality of Snippets</title>
        <p>Table 1 presents the average scores of the four evaluation metrics over all the
query-dataset pairs. Compared with the baselines, KSD achieved the highest
overall score of 0.6211. In particular, its coverage of schema (coSkm = 0:8651)
and data (coDat = 0:4247) were at the top. Its coverage of query keywords
(coKyw = 0:8352) was close to TA+C, PrunedDP++, and CES which are
queryfocused methods. Therefore, KSD achieved a satisfying trade-o between these
evaluation metrics. On the other hand, its coCnx score was not high because
coCnx was not explicitly considered in our approach.</p>
        <p>Table 2 breaks down the scores of KSD into groups of query-dataset pairs.
The scores on di erent groups were generally consistent with each other,
demonstrating the robustness of our approach. One exception was the very high coCnx
score on DMOZ-1, due to Eq. (2) where coCnx = coKyw when jQj = 1.
4.2</p>
      </sec>
      <sec id="sec-4-2">
        <title>Running Time</title>
        <p>We tested the running time of our approach on an Intel Core i7-8700K (3.70GHz)
with 10GB memory for the JVM.</p>
        <p>Among all the 387 query-dataset pairs, for 234 (60.47%) a dataset snippet
was generated within 1 second, and for 341 (88.11%) one was generated within
10 seconds. The median time was 0.51 second, showing promising performance
for practical use. In the worst case, it took 150 seconds to process a large dataset
containing more than 2 million RDF triples. Future work would be needed to
improve the performance of our implementation to handle large datasets.
5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>In this ongoing work, we proposed KSD, a new approach to generating snippets
for dataset search. By directly optimizing three evaluation metrics, KSD
outperformed four baselines. It has established new state-of-the-art results for future
work. We are working towards a full version of KSD which will also optimize
coCnx. We will implement our approach in a prototype of a new dataset search
engine, to help users conveniently judge the relevance of a retrieved dataset.</p>
      <p>There are limitations in our work. First, the current version of KSD has
considered three metrics but we exclude coCnx. The other three metrics are all
about covering some elements with selected RDF triples, whereas coCnx is
related to graph connectivity. The weighted maximum coverage problem seems not
expressive enough to model coCnx. We will explore other possibilities. Second,
although the running time of our current implementation is acceptable in most
cases, its performance is not satisfying on large datasets. We will consider using
priority queue and appropriate indexes to make the generation process faster.</p>
      <sec id="sec-5-1">
        <title>Acknowledgements</title>
        <p>This work was supported by the NSFC under Grant 61572247. Cheng was funded
by the Six Talent Peaks Program of Jiangsu Province under Grant RJFW-011.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Brickley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burgess</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Noy</surname>
            ,
            <given-names>N.F.</given-names>
          </string-name>
          :
          <article-title>Google dataset search: Building a search engine for datasets in an open web ecosystem</article-title>
          .
          <source>In: WWW 2019</source>
          . pp.
          <volume>1365</volume>
          {
          <issue>1375</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Cheng, G.,
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Generating illustrative snippets for open data on the web</article-title>
          .
          <source>In: WSDM 2017</source>
          . pp.
          <volume>151</volume>
          {
          <issue>159</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Elle</surname>
            ,
            <given-names>M.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bellahsene</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Breslin</surname>
            ,
            <given-names>J.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demidova</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dietze</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szymanski</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Todorov</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>RDF dataset pro ling - a survey of features, methods, vocabularies and applications</article-title>
          .
          <source>Semantic Web</source>
          <volume>9</volume>
          (
          <issue>5</issue>
          ),
          <volume>677</volume>
          {
          <fpage>705</fpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Feigenblat</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roitman</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boni</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Konopnicki</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Unsupervised query-focused multi-document summarization using the cross entropy method</article-title>
          .
          <source>In: SIGIR 2017</source>
          . pp.
          <volume>961</volume>
          {
          <issue>964</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ge</surname>
          </string-name>
          , W., Cheng, G.,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Incorporating compactness to generate termassociation view snippets for ontology search</article-title>
          .
          <source>Inf. Process. Manage</source>
          .
          <volume>49</volume>
          (
          <issue>2</issue>
          ),
          <volume>513</volume>
          {
          <fpage>528</fpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>J.X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mao</surname>
          </string-name>
          , R.:
          <article-title>E cient and progressive group steiner tree search</article-title>
          .
          <source>In: SIGMOD 2016</source>
          . pp.
          <volume>91</volume>
          {
          <issue>106</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Cheng, G.,
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kharlamov</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>A framework for evaluating snippet generation for dataset search</article-title>
          .
          <source>In: ISWC</source>
          <year>2019</year>
          (
          <year>2019</year>
          ), https://arxiv.org/abs/
          <year>1907</year>
          .01183
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>