Hybrid Graph based Keyword Query
               Interpretation on RDF

         Kaifeng Xu1 , Junquan Chen1 , Haofen Wang1 , and Yong Yu1

                  Apex Data and Knowledge Management Lab
             Shanghai Jiao Tong University, Shanghai, 200240, China
             {kaifengxu,jqchen,whfcarter,yyu}@apex.sjtu.edu.cn


      Abstract. Adopting keyword query interface to semantic search on
      RDF data can help users keep away from learning the SPARQL query
      syntax and understanding the complex and fast evolving data schema.
      The existing approaches are divided into two categories: instance-based
      approaches and schema-based approaches. The instance-based approaches
      relying on the original RDF graph can generate precise answers but take
      a long processing time. In contrast, the schema-based approaches rely-
      ing on the reduced summary graph require much less processing time
      but cannot always generate correct answers. In this paper, we propose
      a novel approach based on a hybrid graph which can achieve significant
      improvements on processing time with a limited accuracy drop compared
      with instance-based approaches, and meanwhile, can achieve promising
      accuracy gains at an affordable time cost compared with schema-based
      approaches.


1   Introduction
On the way to Semantic Web, Resource Description Framework (RDF) is a
language for representing information about resources in the World Wide Web.
The ever growing semantic data in RDF format provides fertile soil for semantic
search, and formal query languages (e.g. SPARQL) are adopted by most current
semantic search systems[1, 2] to accurately express complex information needs.
However, the disadvantages of formal queries are: (1) Complex Syntax : It is hard
to learn and remember complex syntax of formal queries for ordinary users. (2)
Priori Knowledge: Users have to know the schema of the underlying semantic
data beforehand. In contrast, keyword queries cater to user habits since keywords
(or known as keyword phrases) are easier to be understood and convenient to
use. An approach that can leverage the advantages of both query types is to
provide a keyword user interface and then translate keyword queries into formal
queries.
    In XML and database communities, bridging the gap between keyword queries
and formal queries has been widely studied. However, there exists a limited
amount of work on how to answer keyword queries on semantic data in RDF
format. As an early attempt to build a semantic search system, SemSearch [3]
employed a template-based approach to capture the restricted interpretations
of given keywords. Later, improved approaches [4, 5] have been proposed to ad-
dress the problem of finding all possible interpretations. In particular, Thanh
et al. [5] employed the RDF graph (instance-based approaches) to discover the
connections between nodes matching the input keywords, through which the
interpretation accuracy can be ensured, but at the cost of a longer processing
time. This problem has been recently tackled by [6, 7], where keyword queries
are translated using a summary graph extracted from the RDF data (schema-
based approaches). Although schema-based approaches significantly speed up
the processing, the schema-graph loses too much connectivity information of the
corresponding RDF graph to guarantee the interpretation accuracy.
    In this paper, we propose a novel effective and efficient keyword query in-
terpretation approach based on a hybrid graph carefully constructed from the
original RDF graph. A hybrid graph is much smaller than the original RDF
graph, and meanwhile it can preserve as much connectivity information as pos-
sible. In this way, we construct the hybrid graph under the guidance of a graph
score which reflects the best tradeoff between effectiveness and efficiency of key-
word query interpretation.

2   The Hybrid Graph based Approach
Figure 1 describes the entire process (both offline and online stages) of keyword
query interpretation. In the offline stage, a hybrid graph is constructed from
the origin RDF graph. After that, a keyword index is built for the mapping of
keywords to corresponding nodes in the hybrid graph. During the online process,
keywords input by end users are first mapped to the nodes in the hybrid graph
using the keyword index, and then we search on the hybrid graph to construct
top-k potential tree-shaped conjunctive queries (i.e., formal queries). While our
focus is the construction of the hybrid graph, we implement a similar keyword
mapping, query construction and ranking as mentioned in Q2Semantic [6]. Due
to space limitations, we refer you to [6] for details.


         RDF Graph         Extracting            Hybrid Graph


                       Keyword index     Indexing                                     Offline


       Keyword Query
                         Keyword        Mapped         Query Constructions   Formal   Online
                         Mapping        Nodes             and Ranking         Query


                 Fig. 1. The flow chart of keyword query interpretation


    The construction of hybrid graph is an iterative process which extracts and
refines a qualified subgraph from the original RDF graph by means of a graph
score. The graph score is used to define the overall interpretation performance,
which plays an important role from the starting point to the ending point of each
interaction in the whole construction process. More precisely, the graph score is
the linear combination of the size of the hybrid graph and the amount of the
connectivity information contained in the graph. The workflow of a hybrid graph
construction is illustrated in Figure 2(a) which takes the original RDF graph as
input. A construction unit (CU) is employed to carry out the refinement on the
given RDF graph to generate a qualified hybrid subgraph. The CUs are addi-
tionally used several times for further refinement on the remaining unqualified
RDF subgraphs. Finally, several qualified hybrid subgraphs are returned, and
combined together to form an overall hybrid graph.


     RDF Graph                                                  Instance Node   Cluster Node
                                               RDF Graph
         CU
                   Qualified
     Unqualified   Subgraph
      Subgraph
                   Qualified
                   Subgraph     CU
         CU
                                         Unqualified Subgraph
                     ……


     Unqualified               Hybrid
      Subgraph                 Graph
                   Qualified
                   Subgraph
        ……


                                         Qualified Subgraph


        CU


  Fig. 2. (a) The work flow of hybrid graph construction. (b)The work flow of CU.


    Figure 2(b) shows the detailed components inside a CU. Given a RDF graph
G, CU will work through the following streamline: (1) Instance Clustering: This
phase aims to generate the nodes of hybrid graph. Each instance in G is repre-
sented by the feature set of its relations and regarded as a trivial cluster initially.
After that, the cluster pairs with the highest similarity are hierarchically clus-
tered until a cluster tree T is derived. (2) Relation Refinement: This phase aims
to generate the relations of hybrid graph. It tries to substitute the relations in
the given RDF graph with those between the high level nodes in the cluster tree.
The relation replacement will increase the score of the graph, and the goal is to
get a hybrid graph with the highest score. (3) Graph Detachment The score of
the whole graph generated after relation refinement might be too low. Thus, we
extract a part of the graph (called a qualified subgraph) whose score is above
the given score threshold, and feed the remaining part (called an unqualified
subgraph) into the CU for next iterations.
3              Preliminary Results
We compare our approach with instance-based approches and schema-based
counterparts on three different datasets (i.e., semanticweb.org, DBpedia, DBLP)
in terms of processing time and interpretation accuracy. Table 1 lists The statis-
tics of the three data sets. We manually construct 42 scenarios (17 from seman-
ticweb.org, 10 from DBpedia, and 5 from DBLP) for the comparison.

                         Table 1. Statistics of semanticweb.org, DBLP and DBpedia.

                Data set ]Category                       ]Instance               ]Relation ]Inst.degree                         ]Rel.kind             ]Rel/kind
                                             2                       3                               4
semanticweb.org 5.06 × 10 7.483 × 10 1.628 × 10                                                                          2.18 4.77 × 10 3.413 × 101
                                                                                                                                                  2

         DBLP    1.0 × 101 1.640 × 106 3.176 × 106                                                                       1.94   1.0 × 101 3.176 × 105
       DBpedia 2.694 × 105 2.520 × 106 6.868 × 106                                                                       2.73 1.128 × 104 6.088 × 102


                         Instance     HG           Schema                                                            Instance   HG           Schema
                      1.00            1.00 1.00             1.00
               1.00                                                0.93                                                                               Inf
                             0.88                                                                          1200
                                                                                                                                973
               0.80
                                                                                       Runingg Time (ms)   1000                                             882
                                                                                                                  848                 803
                                                                                                           800
    Accuracy
     ccuracy


               0.60            0.53                                                                                                                               647
                                                  0.40                                                     600
               0.40
               0 40                                                       0.33
                                                                                                           400
               0.20                                                                                        200
                                                                                                                        27 26                13
               0.00                                                                                          0
                      SemanticWeb         DBLP                DBpedia                                             SemanticWeb         DBLP             DBpedia

                                       Data set                                                                                  Data set


    Fig. 3. The practical interpretation accuracy and efficiency on different data sets

   Figure 3 shows that our approach has achieved a 61.66% efficiency improve-
ment with a 6.17% accuracy drop over instance-based approaches on average. On
the other hand, our approach achieves a 132.30% accuracy gain with a 20.08%
time increase compared with schema-based approaches.

References
1. Broekstra, J., Kampman, A., Van Harmelen, F.: Sesame: A generic architecture for
   storing and querying rdf and rdf schema. In: ISWC. (2002) 54–68
2. Lu, J., Ma, L., Zhang, L., Brunner, J., Wang, C., Pan, Y., Yu, Y.: SOR: a practical
   system for ontology storage, reasoning and search. In: VLDB. (2007) 1402–1405
3. Lei, Y., Uren, V., Motta, E.: Semsearch: A search engine for the semantic web. In:
   EKAW. (2006) 238
4. Zhou, Q., Wang, C., Xiong, M., Wang, H., Yu, Y.: Spark: Adapting keyword query
   to semantic search. In: ISWC/ASWC. (2007) 694
5. Tran, T., Cimiano, P., Rudolph, S., Studer, R.: Ontology-based interpretation of
   keywords for semantic search. In: ISWC/ASWC. (2007) 523
6. Wang, H., Zhang, K., Liu, Q., Tran, T., Yu, Y.: Q2Semantic: A lightweight keyword
   interface to semantic search. In: ESWC. (2008) 584
7. Tran, T., Wang, H., Rudolph, S., Cimiano, P.: Top-k exploration of query candidates
   for efficient keyword search on graph-shaped (RDF) data. In: ICDE. (2009) 405–416