Live SPARQL Auto-Completion

                                   Stéphane Campinas

        Insight Centre for Data Analytics, National University of Ireland, Galway
                        stephane.campinas@insight-centre.org


        Abstract. The amount of Linked Data has been growing increasingly. How-
        ever, the efficient use of that knowledge is hindered by the lack of informa-
        tion about the data structure. This is reflected by the difficulty of writing
        SPARQL queries. In order to improve the user experience, we propose an
        auto-completion library1 for SPARQL that suggests possible RDF terms. In
        this work, we investigate the feasibility of providing recommendations by
        only querying the SPARQL endpoint directly.


1     Introduction
The Linking Open Data movement has brought a tremendous amount of data avail-
able to the general user. The available knowledge spans a wide range of domains,
from life sciences to films. However, using SPARQL to search through this knowl-
edge is a tedious process, not only because of the syntax barrier but mainly due
to the schema heterogeneity of the data. The expression of an information need in
SPARQL is difficult due to the schema being generally unknown to the user as well
as an heterogeneous of several vocabularies.
    A common solution is for the user to manually gain knowledge about the data
structure, i.e., what predicates and classes are used, by executing additional queries
in parallel to the main one. The paper [3] proposes a “context-aware” auto-completion
method for assisting a user in writing a SPARQL query by recommending schema
terms in various position in the query. The method is context-aware in the sense
that only essential triple patterns are considered for the recommendations. To do
so, it leverage a data-generated schema. Instead, in this work we propose to bypass
this need by executing live SPARQL queries in order to provide recommendations.
Thus, this removes the overhead of pre-computing the data-generated schema. The
proposed approach exposes a trade-off between the performance of the application
and the quality of the recommendations. We make available a library1 for providing
data-based recommendations that can be used with other tools such as YASGUI [8].
    In Section 2 we discuss related works regarding auto-completion for SPARQL.
In Section 3 we present the proposed approach. In Section 4 we report an evaluation
of the system based on query logs of DBpedia.

2     Related Work
Over the years, many contributions have been done towards facilitating the use of
SPARQL, either visually [4], or by completely hiding SPARQL from the user [7]. In
this work, we aim to help users with a knowledge of SPARQL by providing an auto-
completion feature. Several systems have been proposed in this direction. Although
1
    Gosparqled: https://github.com/scampi/gosparqled
    the focus in [1] is the visual interface, it can provide recommendations of terms such
    as predicates and classes. In [6] possible recommendations are taken from query
    logs. The system proposed in [5] provides recommendations based on the data itself,
    with a focus on SPARQL federation. Instead, we aim to make available an easy-to-
    use library which core feature is to provide data-based recommendations. In [3] an
    editor with auto-completion was developed that leverage a data-generated schema
    (i.e., a graph summary). We investigate in this work the practicability of bypassing
    the graph summary by relying only on the data.

    3     Live Auto-Completion
    We propose a data-based auto-completion which retrieves possible items with re-
    gards to the current state of the query. Recommended items can be predicates,
    classes, or even named graphs. Firstly, we indicate the position in the SPARQL
    query that is to be auto-completed, i.e., the Point Of Focus (POF), by inserting the
    character ‘<’. Secondly, we reduce the query down to its recommendation scope [3].
    Finally, we transform the POF into the SPARQL variable “?POF” which is used
    for retrieving recommendations. The retrieved recommendations are then ranked,
    e.g., by the number of occurrences of an item.
    Recommendation Scope. While building a SPARQL query, not all triple patterns are
    relevant for the recommendation. Therefore, we define the scope as the connected
    component that contains the POF. Figure 1a depicts a SPARQL query where the
    POF is associated with the variable “?s”: it seeks possible predicates that occur with
    a “:Person” having the predicate “:name”. Figure 1b depicts the previous SPARQL
    query reduced to its recommendation scope. Indeed, the pattern on line 4 is removed
    since it is not part of the connected component containing the POF.


1    SELECT * {                             1   SELECT ? POF {
2      ? s a : Person ;                     2     ? s a : Person ;
3          : name ? name ; < .              3         : name ? name ; ? POF [] .
4      ? o a : Document                     4
5    }                                      5   }

        (a) A query with ‘<’ as the POF                  (b) Scope of the query

                                Fig. 1: Query auto-completion

    Recommendation Capabilities. The scope may include content-specific terms, e.g,
    resources and filters, unlike to [3] since the graph summary is an abstraction that
    captures only the structure of the data. Recommendations about predicates, classes
    and named graphs are possible as in [3]. In addition, the use of the data directly
    allows to provide recommendations for specific resources.

    4     Evaluation
    Systems. In this section, we evaluate the recommendations returned by the proposed
    system, that we refer to as “S1”, against the ones provided by the approach in [3],
    which we refer to as “S2”.
Settings. We compare the recommendations with regards to (1) the response-time,
i.e., the time spent on retrieving the recommendations via a SPARQL query; and
(2) the quality of the recommendations. A run of the evaluation consists of the fol-
lowing steps. First, we vary the amount of information retrieved via the “LIMIT”
clause. Then, we compare the ranked TOP-10 recommendations against a gold stan-
dard. The ranking is based on the number of occurrences of a recommendation. The
gold standard consists in retrieving recommendations directly from the data without
the LIMIT clause, and retaining only the 10 most occurring terms. The TOP-10 of
the gold standard and the system are compared using the Jaccard similarity. We
consider that the higher the similarity, the higher the quality of recommendations.
Queries. We used the query logs of the DBpedia endpoint version 3.3 available
from the USEWOD20132 dataset. The queries3 were stripped of any pattern about
specific resources, in order to keep only the structure of the query. In addition,
we removed queries that contain more than one connected component. Queries are
grouped according to their complexity, which depends on the number of triple pat-
terns and on the number of star graphs. A group is identified by a string that has
as many numbers as there are stars, with numbers separated by a dash ’-’ and rep-
resenting the number of triple patterns in a star. For example, a query with two
stars and one triple pattern each is then identified with 1-1. This definition of query
complexity exhibits the potential errors, i.e., a recommendation having zero-result,
that a graph summary can have, as described in [2].
Graphs. We loaded into an endpoint the English part of the Dbpedia3.34 dataset,
which consists of 167 199 852 triples. The graph summary consists of 29 706 051
triples, generated by grouping resources sharing the same set of classes.
Endpoint. We used a Virtuoso5 SPARQL endpoint. The endpoint is deployed on a
server with 32GB of RAM and with SSD drives.
Comparison. For each group of query complexity QC, we report in Table 1 the
results of the evaluation, with J1 (resp., J2) the average Jaccard similarity for the
system S1 (resp., S2); and T 1 (resp., T 2) the average response-time in ms for the
system S1 (resp., S2). The reported values are the averages over 5 runs. We can see
that as the LIMIT gets larger, the higher the Jaccard similarity becomes.Since the
graph summary used in S2 is a concise representation of the graph structure, the
data sample at a certain LIMIT value contains more terms than in S1. However,
this impacts negatively on the quality of S2 as reflected by the values of J2. This
shows the graph summary is subject to errors [2], i.e., zero-result recommendations.
Nonetheless, it is interesting to remark that in S1 the recommendations can lead
the query to an “isolated” part of the graph, from which the way out is through
the use of “OPTIONAL” clauses. In S2, the graph summary allows to reduce this
effect. The response-times for either system is similar, with S2 being slightly faster
than S1. This indicates that directly querying the endpoint for recommendations
is feasible. However, the significant difference in sizes between the graph summary
and the original graph would become increasingly pre-dominant as the data grows.
2
  http://usewod.org/
3
  https://github.com/scampi/gosparqled/tree/master/eval/data
4
  http://wiki.dbpedia.org/Downloads33
5
  Virtuoso v7.1.0 at https://github.com/openlink/virtuoso-opensource
      J1 J2         J1 J2         J1 J2         J1 J2         J1 J2       J1 J2       J1 J2       J1 J2       J1 J2
QC      2             3             4             5             6           9          10          1-1         1-2
10 0.12 0.12       0.17 0.21     0.15 0.21     0.16 0.19     0.14 0.16   0.17 0.19   0.16 0.19   0.11 0.09   0.19 0.18
100 0.15 0.17      0.28 0.26     0.27 0.28     0.28 0.29     0.24 0.26   0.25 0.26   0.25 0.27   0.12 0.11   0.24 0.22
500 0.24 0.27      0.34 0.29     0.34 0.30     0.36 0.35     0.38 0.31   0.42 0.27   0.43 0.26   0.15 0.18   0.29 0.29
QC      1-3           1-4           1-5           2-2           3-4        1-1-2       1-1-3       1-1-4
10 0.62 0.64       0.23 0.22     0.15 0.19     0.17 0.17     0.15 0.06   0.55 0.38   0.50 0.49   0.38 0.43
100 0.62 0.60      0.38 0.32     0.24 0.32     0.19 0.19     0.24 0.10   0.57 0.39   0.53 0.52   0.44 0.40
500 0.62 0.59      0.60 0.34     0.25 0.29     0.25 0.22     0.21 0.12   0.57 0.40   0.55 0.51   0.47 0.46
      T1 T2         T1 T2         T1 T2         T1 T2         T1 T2       T1 T2       T1 T2       T1 T2       T1 T2
QC      2             3             4             5             6           9          10          1-1         1-2
10    107     81   119      82   127      81   129      82   144 85      314 197     688 468      97   79    103 80
100   108     81   180      84   147      95   202      86   173 88      311 198     701 458     122   84    140 83
500   141     91   192      96   144      79   172      99   149 101     337 207     701 467     127   89    133 111
QC      1-3           1-4           1-5           2-2           3-4        1-1-2       1-1-3       1-1-4
10    101 108      108      87   104      93   102      80   114 83      107 391     106   87    105 87
100   103 105      102      94   105      84   106      80   142 85      115 385     112   89    105 96
500   126 105      141      92   136      97   158      94   137 117     126 400     133   99    139 102
Table 1: Average Jaccard similarity (J1 for system S1 and J2 for S2) and response-
times in ms (T 1 for system S1 and T 2 for S2) for each group of query complexity QC,
and with the LIMIT varying from 10 to 500. The reported values are the averages
over 5 runs.


Acknowledgement
This material is based upon works supported by the European FP7 projects LOD2
(257943).

Bibliography
[1] Ambrus, O., Mller, K., Handschuh, S.: Konduit vqb: a visual query builder for sparql
    on the social semantic desktop
[2] Campinas, S., Delbru, R., Tummarello, G.: Efficiency and precision trade-offs in graph
    summary algorithms. In: Proceedings of the 17th International Database Engineering
    & Applications Symposium. pp. 38–47. IDEAS ’13, ACM, New York, NY, USA (2013)
[3] Campinas, S., Perry, T.E., Ceccarelli, D., Delbru, R., Tummarello, G.: Introducing rdf
    graph summary with application to assisted sparql formulation. In: Proceedings of the
    2012 23rd International Workshop on Database and Expert Systems Applications. pp.
    261–266. DEXA ’12, IEEE Computer Society, Washington, DC, USA (2012)
[4] Clark, L.: Sparql views: A visual sparql query builder for drupal. In: International
    Semantic Web Conference. pp. –1–1 (2010)
[5] Gombos, G., Kiss, A.: Sparql query writing with recommendations based on datasets.
    In: Yamamoto, S. (ed.) Human Interface and the Management of Information. Infor-
    mation and Knowledge Design and Evaluation, Lecture Notes in Computer Science,
    vol. 8521, pp. 310–319. Springer International Publishing (2014)
[6] Kramer, K., Dividino, R.Q., Gröner, G.: Space: Sparql index for efficient autocom-
    pletion. In: Blomqvist, E., Groza, T. (eds.) International Semantic Web Conference.
    CEUR Workshop Proceedings, vol. 1035, pp. 157–160. CEUR-WS.org (2013)
[7] Lehmann, J., Bühmann, L.: Autosparql: Let users query your knowledge base. In:
    Proceedings of the 8th Extended Semantic Web Conference on The Semantic Web:
    Research and Applications - Volume Part I. pp. 63–79. ESWC’11, Springer-Verlag,
    Berlin, Heidelberg (2011)
[8] Rietveld, L., Hoekstra, R.: Yasgui: Not just another sparql client. In: SALAD@ESWC.
    pp. 1–9 (2013)