=Paper= {{Paper |id=Vol-2465/profiles_paper1 |storemode=property |title=Towards Multi-Facet Snippets for Dataset Search |pdfUrl=https://ceur-ws.org/Vol-2465/profiles_paper1.pdf |volume=Vol-2465 |authors=Xiaxia Wang,Gong Cheng,Evgeny Kharlamov |dblpUrl=https://dblp.org/rec/conf/semweb/WangCK19 }} ==Towards Multi-Facet Snippets for Dataset Search== https://ceur-ws.org/Vol-2465/profiles_paper1.pdf
Towards Multi-Facet Snippets for Dataset Search

             Xiaxia Wang1 , Gong Cheng1 , and Evgeny Kharlamov2,3
1
    National Key Laboratory for Novel Software Technology, Nanjing University, China
                   xxwang@smail.nju.edu.cn, gcheng@nju.edu.cn
               2
                 Department of Informatics, University of Oslo, Norway
                           evgeny.kharlamov@ifi.uio.no
             3
               Bosch Center for Artificial Intelligence, Renningen, Germany
                          evgeny.kharlamov@de.bosch.com



        Abstract. Due to a recent significant increase in the number of RDF
        datasets available on the Web, there is a pressing need in effective search
        techniques for finding the right data on demand. A promising approach is
        to present retrieved datasets as snippets that aim at concisely explaining
        to the user why this dataset fulfils their demand. Snippets in particular
        can illustrate the main content of the dataset and explain its relevance
        to the user’s query. Computing optimal snippets is a non-trivial task
        and a number of approaches have emerged to address this problem. In
        this short paper, we report our ongoing work on snippets that address
        multiple facets of optimality. Based on our recently proposed evaluation
        metrics for dataset snippets, we formulate a weighted maximum coverage
        problem which directly optimizes three evaluation metrics. We solve the
        problem with a greedy algorithm, and our current implementation has
        outperformed four baseline methods.


1     Introduction
The open data movement brings increasingly many datasets to the Web, many
of which are in the RDF format. Reusing these datasets is of great importance to
researchers and developers. In order to enable the reuse there is a pressing need
in effective search techniques for finding the right data on demand. A promising
approach is to query for datasets with keywords as in Google Dataset Search [1]
and to present each retrieved RDF dataset as a snippet, its small representative
subset [2]. Dataset snippets aim at concisely explaining to the user why this
dataset fulfils their demand and in particular can illustrate the main content of
the dataset and explain its relevance to the user’s query.
    Computing optimal snippets is a non-trivial task and a number of approaches
have emerged to address this or related problems [2–6]. In [7], we presented four
metrics for evaluating the quality of a dataset snippet. In this short paper, we
report our ongoing work on snippets that address multiple facets of optimality.
In particular, in order to improve the quality of a snippet for dataset search, we

    Copyright c 2019 for this paper by its authors. Use permitted under Creative Com-
    mons License Attribution 4.0 International (CC BY 4.0).
X. Wang et al.

formulate the selection of RDF triples as a combinatorial optimization problem
that directly optimizes three evaluation metrics proposed in [7]. A dataset snip-
pet generated by our approach, which we refer to as KSD, is expected to have
a good coverage of the query Keywords and the content of the dataset at both
the Schema and the Data level. We solve the problem with a greedy algorithm,
and our evaluation demonstrates that KSD outperforms the baselines reported
in [7] and that there is still a considerable room for quality improvement.
    The remainder of this paper is structured as follows. Section 2 defines the
problem and reviews the evaluation metrics proposed in [7]. Section 3 describes
the implementation of KSD. Section 4 presents evaluation results. Section 5
concludes the paper with future work.

2     Preliminaries
2.1   Problem Statement
An RDF dataset is a set of RDF triples denoted by T = {t1 , t2 , · · · , tn }, where
each ti = htsi , tpi , toi i is a subject-predicate-object triple of RDF resources. The
subject tsi of a triple ti is an entity (i.e., a non-literal resource at the instance
level) that appears in T . The predicate tpi represents a property. The object toi
is a value of tpi , which can be a class, a literal, or another entity in T .
    A keyword query is a set of keywords denoted by Q = {q1 , q2 , · · · , qm }. Given
a dataset T , a keyword query Q, and a positive integer k as the size bound, a
dataset snippet is an optimum subset of triples selected from T , denoted by S ⊆
T , satisfying |S| ≤ k. We will give our definition of optimality in Section 3.

2.2   Evaluation Metrics
We briefly review the four metrics proposed in [7] for evaluating the quality of
a dataset snippet S: coKyw, coCnx, coSkm, and coDat, all in the range of [0, 1].
    Coverage of Query Keywords (coKyw). A resource r covers a keyword q
if r’s textual form (e.g., rdfs:label of an IRI or blank node, lexical form of a
literal) contains a keyword match for q. A triple t covers a keyword q, denoted
by t ≺ q, if r covers q for any r ∈ {ts , tp , to }. For a snippet S, the coKyw metric
evaluates its coverage of query keywords:
                                    1
                    coKyw(S) =         · |{q ∈ Q : ∃t ∈ S, t ≺ q}| .                   (1)
                                   |Q|
   Coverage of Connections between Query Keywords (coCnx). A snip-
pet S covers the connection between two keywords qi , qj ∈ Q, denoted by
S ≺ (qi , qj ), if there is a path in the RDF graph representation of S that con-
nects two resources: one covering qi and the other covering qj . For S, the coCnx
metric evaluates its coverage of connections between query keywords:
                 ( 1
                    |Q| · |{{qi , qj } ⊆ Q : qi 6= qj and S ≺ (qi , qj )}| if |Q| > 1 ,
 coCnx(S) = ( 2 )                                                                       (2)
                   coKyw(S)                                                if |Q| = 1 .
                                      Towards Multi-Facet Snippets for Dataset Search

When there is only one keyword, coCnx is meaningless and we set it to coKyw.
   Coverage of Data Schema (coSkm). Consider the RDF schema of a
dataset. The relative frequency of a class c observed in a dataset T is

                             |{t ∈ T : tp = rdf:type and to = c}|
              frqCls(c) =                                         .                    (3)
                                   |{t ∈ T : tp = rdf:type}|

Analogously, the relative frequency of a property p observed in T is

                                          |{t ∈ T : tp = p}|
                         frqPrp(p) =                         .                         (4)
                                                  |T |

    For a snippet S, its coverage of the schema of T is the harmonic mean (hm)
of the total relative frequency of the classes and properties it contains:
                                X                         X
             coSkm(S) = hm(              frqCls(c),                frqPrp(p)) ,        (5)
                              c∈Cls(S)                  p∈Prp(S)


where Cls(S) is the set of classes instantiated in S and Prp(S) is the set of
properties instantiated in S.
   Coverage of Data (coDat). Central entities represent the key content of a
dataset. Let d+ (e) and d− (e) be the out-degree and in-degree of an entity e in
the RDF graph representation of a dataset T , respectively. For a snippet S, its
coverage of the entities in T is the harmonic mean (hm) of the mean normalized
out-degree and in-degree of the entities it contains:

                            1            X              log(d+ (e) + 1)
        coDat(S) = hm(            ·                                               ,
                         |Ent(S)|                maxe0 ∈Ent(T ) log(d+ (e0 ) + 1)
                                      e∈Ent(S)
                                                                                       (6)
                            1            X              log(d− (e) + 1)
                                  ·                                               ),
                         |Ent(S)|                maxe0 ∈Ent(T ) log(d− (e0 ) + 1)
                                      e∈Ent(S)


where Ent(X) is the set of entities that appear in a set of triples X.


3    Approach

Given the evaluation metrics presented in Section 2.2, a straightforward idea is to
formulate the selection of RDF triples as a combinatorial optimization problem,
and directly optimize these evaluation metrics. Our current work considers three
metrics: coKyw, coSkm, and coDat, leaving coCnx as future work. The three
selected metrics all require a snippet to cover some elements: query keywords
in coKyw, classes and properties in coSkm, and entities in coDat. Furthermore,
the classes, properties, and entities to cover are with different weights. It inspires
us to formulate an instance of the weighted maximum coverage problem. We
formalize this idea in Section 3.1, and present a solution in Section 3.2.
X. Wang et al.

Algorithm 1 Greedy Algorithm
Input: A dataset T , a keyword query Q, and a size bound k
Output: An optimum dataset snippet S ⊆ T
 1: S ← ∅;
 2: while |S| < k do
 3:   t∗ ← argmaxt∈(T \S) (q(S ∪ {t}) − q(S));
 4:   S ← S ∪ {t∗ };
 5: end while
 6: return S;



3.1   Snippet Generation as Weighted Maximum Coverage

Weighted Maximum Coverage. Given a collection of sets, a weighted max-
imum coverage (WMC) problem is to select a limited number of sets from the
collection such that the total weight of the covered elements is maximized.
    Snippet Generation as WMC. We formulate the generation of an op-
timum dataset snippet as an instance of the WMC problem. Each RDF triple
ti ∈ T corresponds to a set denoted by cov(ti ) which consists of: the query key-
words covered by ti , the class instantiated in ti , the property instantiated in ti ,
and the entities that appear in ti . The universe of elements is denoted by

                       Ω = Q ∪ Cls(T ) ∪ Prp(T ) ∪ Ent(T ) .                      (7)

   Each element x ∈ Ω has a non-negative weight:
          
                 1
          
           α · |Q|                                               x ∈ Q,
          
          β · frqCls(x)
          
                                                                  x ∈ Cls(T ) ,
   w(x) =                                                                         (8)
          
           β · frqPrp(x)                                         x ∈ Prp(T ) ,
                                               −
          γ · ( P log(d+ (x)+1)        P log(d (x)+1)
          
                        log(d+ (e)+1) +        log(d− (e)+1) )    x ∈ Ent(T ) ,
          
                   e∈Ent(T )                   e∈Ent(T )


In our experiments, we set α = 2, β = 1, γ = 1, to balance between the cover-
age of query keywords in coKyw (α), the coverage of classes and properties in
coSkm (β), and the coverage of entities in coDat (γ) in our objective function.
   An optimum dataset snippet S ⊆ T is one that
                                 X
          maximizes q(S) =                 w(x) , subject to |S| ≤ k ,       (9)
                                    S
                               x∈   ti ∈S cov(ti )



where k is a predefined size bound, and q(·) is the objective function.


3.2   Solution

   Algorithm 1 presents the greedy algorithm for the WMC problem which
at each stage chooses a set that contains the maximum weight of uncovered
elements. It achieves an approximation ratio of 1 − 1e .
                                     Towards Multi-Facet Snippets for Dataset Search

          coKyw coCnx coSkm coDat Average                  coKyw coCnx coSkm coDat Average
IlluSnip  0.1000 0.0540 0.6820 0.3850 0.3053   data.gov.uk 0.7643 0.2882 0.8249 0.3870 0.5661
TA+C      0.9590 0.4703 0.0425 0.0915 0.3908   DMOZ-1      0.8977 0.7955 0.8873 0.4726 0.7633
PrunedDP++ 1        1 0.0898 0.2133 0.5758     DMOZ-2      0.8433 0.2444 0.8710 0.4569 0.6039
CES       0.9006 0.3926 0.3668 0.2684 0.4821   DMOZ-3      0.8395 0.2337 0.8693 0.4145 0.5893
KSD       0.8352 0.3595 0.8651 0.4247 0.6211   DMOZ-4      0.7936 0.1877 0.8521 0.3731 0.5516

Table 1: Average scores of different methods Table 2: Average scores of KSD over each
over all the query-dataset pairs.            group of query-dataset pairs.


   Assuming q(S ∪ {t}) − q(S) is computed in O(1), the overall running time of
a naive implementation of the algorithm is O(k · n), where n is the number of
RDF triples in T . A more efficient implementation may use a priority queue to
hold candidate triples, which is left as our future work.


4     Evaluation
Our evaluation reused the 387 query-dataset pairs in [7] where datasets were col-
lected from DataHub and queries included 42 real queries submitted to data.gov.uk
and 345 artificial queries comprising i category names in DMOZ referred to as
DMOZ-i for i = 1, 2, 3, 4. We compared our proposed KSD with four baseline
methods evaluated in [7], namely IlluSnip [2], TA+C [5], PrunedDP++ [6], and
CES [4]. Following [7], we set k = 20, i.e., a snippet contained at most 20 triples.

4.1   Quality of Snippets
Table 1 presents the average scores of the four evaluation metrics over all the
query-dataset pairs. Compared with the baselines, KSD achieved the highest
overall score of 0.6211. In particular, its coverage of schema (coSkm = 0.8651)
and data (coDat = 0.4247) were at the top. Its coverage of query keywords
(coKyw = 0.8352) was close to TA+C, PrunedDP++, and CES which are query-
focused methods. Therefore, KSD achieved a satisfying trade-off between these
evaluation metrics. On the other hand, its coCnx score was not high because
coCnx was not explicitly considered in our approach.
    Table 2 breaks down the scores of KSD into groups of query-dataset pairs.
The scores on different groups were generally consistent with each other, demon-
strating the robustness of our approach. One exception was the very high coCnx
score on DMOZ-1, due to Eq. (2) where coCnx = coKyw when |Q| = 1.

4.2   Running Time
We tested the running time of our approach on an Intel Core i7-8700K (3.70GHz)
with 10GB memory for the JVM.
   Among all the 387 query-dataset pairs, for 234 (60.47%) a dataset snippet
was generated within 1 second, and for 341 (88.11%) one was generated within
10 seconds. The median time was 0.51 second, showing promising performance
X. Wang et al.

for practical use. In the worst case, it took 150 seconds to process a large dataset
containing more than 2 million RDF triples. Future work would be needed to
improve the performance of our implementation to handle large datasets.


5    Conclusion and Future Work
In this ongoing work, we proposed KSD, a new approach to generating snippets
for dataset search. By directly optimizing three evaluation metrics, KSD outper-
formed four baselines. It has established new state-of-the-art results for future
work. We are working towards a full version of KSD which will also optimize
coCnx. We will implement our approach in a prototype of a new dataset search
engine, to help users conveniently judge the relevance of a retrieved dataset.
    There are limitations in our work. First, the current version of KSD has
considered three metrics but we exclude coCnx. The other three metrics are all
about covering some elements with selected RDF triples, whereas coCnx is re-
lated to graph connectivity. The weighted maximum coverage problem seems not
expressive enough to model coCnx. We will explore other possibilities. Second,
although the running time of our current implementation is acceptable in most
cases, its performance is not satisfying on large datasets. We will consider using
priority queue and appropriate indexes to make the generation process faster.

Acknowledgements
This work was supported by the NSFC under Grant 61572247. Cheng was funded
by the Six Talent Peaks Program of Jiangsu Province under Grant RJFW-011.


References
1. Brickley, D., Burgess, M., Noy, N.F.: Google dataset search: Building a search engine
   for datasets in an open web ecosystem. In: WWW 2019. pp. 1365–1375 (2019)
2. Cheng, G., Jin, C., Ding, W., Xu, D., Qu, Y.: Generating illustrative snippets for
   open data on the web. In: WSDM 2017. pp. 151–159 (2017)
3. Ellefi, M.B., Bellahsene, Z., Breslin, J.G., Demidova, E., Dietze, S., Szymanski, J.,
   Todorov, K.: RDF dataset profiling - a survey of features, methods, vocabularies
   and applications. Semantic Web 9(5), 677–705 (2018)
4. Feigenblat, G., Roitman, H., Boni, O., Konopnicki, D.: Unsupervised query-focused
   multi-document summarization using the cross entropy method. In: SIGIR 2017.
   pp. 961–964 (2017)
5. Ge, W., Cheng, G., Li, H., Qu, Y.: Incorporating compactness to generate term-
   association view snippets for ontology search. Inf. Process. Manage. 49(2), 513–528
   (2013)
6. Li, R., Qin, L., Yu, J.X., Mao, R.: Efficient and progressive group steiner tree search.
   In: SIGMOD 2016. pp. 91–106 (2016)
7. Wang, X., Chen, J., Li, S., Cheng, G., Pan, J., Kharlamov, E., Qu, Y.: A frame-
   work for evaluating snippet generation for dataset search. In: ISWC 2019 (2019),
   https://arxiv.org/abs/1907.01183