Yaanii: Effective Keyword Search over Semantic dataset
                                                   - EXTENDED ABSTRACT -
                  Roberto De Virgilio                           Paolo Cappellari                   Michele Miscione
              Dipartimento di Informatica e                Department of Computing            Dipartimento di Informatica e
                      Automazione                                  Science                            Automazione
                  Universitá Roma Tre                        University of Alberta                Universitá Roma Tre
                       Rome, Italy                        Edmonton, Alberta, Canada                    Rome, Italy
             devirgilio@dia.uniroma3.it                   cappellari@ualberta.ca             miscione@dia.uniroma3.it

ABSTRACT                                                                  interest, possibly by using an indexing system or a database engine,
Nowadays data is disseminated in a number of different sources,           then explores the data structure in order to discover a connection
from databases systems to the Web, from a traditional structured          between such identified parts. The common exploration paradigm
organization (relational) to a semi-structured (XML), up to the un-       is similar to the triple of RDF, that is subject, property, object.
structured ones (text in Web documents). Although availability of         Candidate solutions, built out of found connections, are then all
data is constantly increasing, one principal difficulty users have to     generated and finally ranked through a scoring function. To return
face is to find and retrieve the information they are looking for. To     the top-k best solutions, pruning techniques reducing the list of can-
this aim keywords search based systems are increasingly capturing         didate solutions down to those whose score is above a threshold are
the attention of researchers. In this paper, we present Yaanii1, a        implemented. In this framework to achieve efficiency above all,
tool for the effective Keyword Search over semantic datasets. It          current algorithms compute the best answers only in an approxi-
is based on a novel keyword search paradigm for graph-structured          mate way. This is because they use an exploration paradigm that
data, focusing in particular on the RDF data model. While many            is inefficient and the scoring function takes place only when solu-
techniques search the best answer trees, we propose an effective          tions were generated all together. Moreover pruning techniques can
algorithm for the exploration and computation of all matching sub-        have a sensible impact in both the quality of the solutions, as low
graphs. We provide a clustering technique that identifies and groups      scoring results are not shown or even computed, as well as on effi-
graph substructures based on template match. A scoring function,          ciency, as an early pruning reduces the space of candidate solutions
IR inspired, evaluates the relevance of the substructures and the         to investigate.
clusters. A strong point of our approach is that the ranking sup-            In [2] we proposed a novel approach to keyword search in the
ports the generation of Top-k solutions during its execution.             graph-structure data in a RDF representation. The main contribu-
                                                                          tions of our approach are:
1.      INTRODUCTION                                                         • A clustering technique that identifies and groups graph sub-
   Keyword-based search approaches have the huge benefit that users            structures based on template match. The idea is to group
can ignore both the language and the structure of the data they are            paths with respect to the template (i.e schema) they corre-
going to query. A keyword based search engine returns a list of                spond to. A solution is a composition of paths belonging to
candidate pages, documents or set of data that match keywords pro-             different clusters. In this way we avoid the exploration of
vided in input. Then a user has to dedicate time and efforts navi-             overlapping solutions and we build cleaner results for the
gating each result returned from the engine in order to discover the           users, gaining in terms of computation cost. Usually, the
desired information, i.e. the answer he is looking for. Therefore, at-         most promising algorithms of an efficient solution for key-
tention around searching and query processing of graph-structured              word based search are in PTIME class complexity. To this
data continue to increase as the Web, XML documents and even re-               aim, in [1] we demonstrated how Yaanii is more efficient
lational database can be represented as a graph. Current approaches            with respect to the others, presenting a quadratic complexity
rely on a combination of IR and tree/graph exploration techniques              as upper-bound.
whose goal is to rank results according to a relevance criterion.            • An algorithm that ranks solutions while it builds the solu-
Keyword search on tree-structured data counts a good number of                 tions. Unlike most of the approaches to keyword search, that
approaches already [4, 5]. Actual efforts [3, 6] focus on RDF data             first identify all the solutions and then rank them according to
querying, given the great momentum of Semantic Web in which                    a function, our approach leverages on the clusters to assem-
Web pages carry information that can be read and understood by                 bly a solution starting with the most relevant path in the most
machines in a systematic way. Simplifying, a generic approach first            relevant cluster. As a result, the most relevant solution is the
identifies the parts of the data structure containing the keywords of          first to come out of the algorithm, then decreasing monoton-
1
    Yaanii, literally “path” in Sanskrit.                                      ically to the less relevant solutions. This allows users to ex-
                                                                               plore the returned solutions, starting with the most relevant,
Appears in the Proceedings of the 1st Italian Information Retrieval
                                                                               while the elaboration of remaining solutions is undergoing.
Workshop (IIR’10), January 27–28, 2010, Padova, Italy.
http://ims.dei.unipd.it/websites/iir10/
index.html                                                                2. AN ARCHITECTURE OF REFERENCE
 Copyright owned by the authors.                                             We implemented our approach into a tool, called Yaanii. A flex-
IIR 2010, January 27-28, Padova [Italy]                                   ible architecture of the system was design, as shown in Figure 1.
                                                        Figure 1: Architecture of Yaanii


It serves as a logical view of how the system looks like. This is a           From a theoretical point of view, future directions focus on im-
typical use scenario of the system:                                        proving the search algorithm of Yaanii to reach a linear time com-
                                                                           plexity. From a practical point of view, we would improve the in-
     1. The RDF Parser takes as input a collection of RDF Docu-            dexing capabilities by embedding Lucene into a DBMS (e.g. Ora-
        ments and parses them into triples. Here we use the Jena           cle) and provide a query-by-example interface to support the user
        framework 2 ;                                                      to perform the query and navigate the results.

     2. The Indexer builds an index on top of the triple collections to    4. REFERENCES
        achieve structural information useful for the query process.        [1] P. Cappellari, R. De Virgilio, M. Miscione. Keyword based
        Here the indexing is supported by Lucene3 and WordNet 4 .               Search over Semantic Data in Polynomial Time. In Technical
        The last allows query expansion;                                        Report RT-DIA-160, Universitá Roma Tre, Rome, Italy, 2009.
                                                                            [2] R. De Virgilio, P. Cappellari, M. Miscione. Cluster-based
     3. A user performs a query through a GUI helper, handling
                                                                                exploration for Effective Keyword Search over Semantic
        events and the query itself;
                                                                                Datasets. In Proc. of the 28th International Conference on
                                                                                Conceptual Modeling (ER’09), Gramado, Brazil, 2009.
     4. The parsed query is given to the Searcher for processing;
                                                                            [3] H. He, Wang, H., Yang, J., Yu, P.S. Blinks: ranked keyword
                                                                                searches on graphs. In Int. Conf. on Management of Data
     5. The Searcher processes the query over the Indexed Resource
                                                                                (SIGMOD’07), China, 2007.
        Base and returns the search result to the caller. It communi-
        cates with the Indexer to extract the instances matching input      [4] B. Kimelfeld, Sagiv, Y. Finding and approximating top-k
        keywords (i.e. informative paths), group them into clusters             answers in keyword proximity search. In Int. Symposium on
        and compose elements from clusters into the final solutions             Principles of Database Systems (PODS’06), USA, 2006.
        (i.e. subgraph structures). Each structure (i.e. path, cluster      [5] F. Liu, Yu, C.T., Meng, W., Chowdhury, A. Effective
        and solution) is evaluated by a scoring function.                       keyword search in relational databases. In Int. Conf. on
                                                                                Management of Data (SIGMOD’06), USA, 2006.
                                                                            [6] T. Tran, Wang, H., Rudolph, S., Cimiano, P. Top-k
3.     CONCLUSION AND FUTURE WORKS                                              exploration of query graph candidates for efficient keyword
   We presented Yaanii, a tool for effective Keyword Search over                search on rdf. In Int. Conf. on Data Engineering (ICDE’09),
semantic datasets. It is based on a clustering technique and a scor-            China, 2009.
ing function that support the generation of Top-k solutions during
its execution in the first k steps.
2
    http://jena.sourceforge.net/
3
    http://lucene.apache.org/
4
    http://wordnet.princeton.edu