Yaanii: Effective Keyword Search over Semantic dataset - EXTENDED ABSTRACT - Roberto De Virgilio Paolo Cappellari Michele Miscione Dipartimento di Informatica e Department of Computing Dipartimento di Informatica e Automazione Science Automazione Universitá Roma Tre University of Alberta Universitá Roma Tre Rome, Italy Edmonton, Alberta, Canada Rome, Italy devirgilio@dia.uniroma3.it cappellari@ualberta.ca miscione@dia.uniroma3.it ABSTRACT interest, possibly by using an indexing system or a database engine, Nowadays data is disseminated in a number of different sources, then explores the data structure in order to discover a connection from databases systems to the Web, from a traditional structured between such identified parts. The common exploration paradigm organization (relational) to a semi-structured (XML), up to the un- is similar to the triple of RDF, that is subject, property, object. structured ones (text in Web documents). Although availability of Candidate solutions, built out of found connections, are then all data is constantly increasing, one principal difficulty users have to generated and finally ranked through a scoring function. To return face is to find and retrieve the information they are looking for. To the top-k best solutions, pruning techniques reducing the list of can- this aim keywords search based systems are increasingly capturing didate solutions down to those whose score is above a threshold are the attention of researchers. In this paper, we present Yaanii1, a implemented. In this framework to achieve efficiency above all, tool for the effective Keyword Search over semantic datasets. It current algorithms compute the best answers only in an approxi- is based on a novel keyword search paradigm for graph-structured mate way. This is because they use an exploration paradigm that data, focusing in particular on the RDF data model. While many is inefficient and the scoring function takes place only when solu- techniques search the best answer trees, we propose an effective tions were generated all together. Moreover pruning techniques can algorithm for the exploration and computation of all matching sub- have a sensible impact in both the quality of the solutions, as low graphs. We provide a clustering technique that identifies and groups scoring results are not shown or even computed, as well as on effi- graph substructures based on template match. A scoring function, ciency, as an early pruning reduces the space of candidate solutions IR inspired, evaluates the relevance of the substructures and the to investigate. clusters. A strong point of our approach is that the ranking sup- In [2] we proposed a novel approach to keyword search in the ports the generation of Top-k solutions during its execution. graph-structure data in a RDF representation. The main contribu- tions of our approach are: 1. INTRODUCTION • A clustering technique that identifies and groups graph sub- Keyword-based search approaches have the huge benefit that users structures based on template match. The idea is to group can ignore both the language and the structure of the data they are paths with respect to the template (i.e schema) they corre- going to query. A keyword based search engine returns a list of spond to. A solution is a composition of paths belonging to candidate pages, documents or set of data that match keywords pro- different clusters. In this way we avoid the exploration of vided in input. Then a user has to dedicate time and efforts navi- overlapping solutions and we build cleaner results for the gating each result returned from the engine in order to discover the users, gaining in terms of computation cost. Usually, the desired information, i.e. the answer he is looking for. Therefore, at- most promising algorithms of an efficient solution for key- tention around searching and query processing of graph-structured word based search are in PTIME class complexity. To this data continue to increase as the Web, XML documents and even re- aim, in [1] we demonstrated how Yaanii is more efficient lational database can be represented as a graph. Current approaches with respect to the others, presenting a quadratic complexity rely on a combination of IR and tree/graph exploration techniques as upper-bound. whose goal is to rank results according to a relevance criterion. • An algorithm that ranks solutions while it builds the solu- Keyword search on tree-structured data counts a good number of tions. Unlike most of the approaches to keyword search, that approaches already [4, 5]. Actual efforts [3, 6] focus on RDF data first identify all the solutions and then rank them according to querying, given the great momentum of Semantic Web in which a function, our approach leverages on the clusters to assem- Web pages carry information that can be read and understood by bly a solution starting with the most relevant path in the most machines in a systematic way. Simplifying, a generic approach first relevant cluster. As a result, the most relevant solution is the identifies the parts of the data structure containing the keywords of first to come out of the algorithm, then decreasing monoton- 1 Yaanii, literally “path” in Sanskrit. ically to the less relevant solutions. This allows users to ex- plore the returned solutions, starting with the most relevant, Appears in the Proceedings of the 1st Italian Information Retrieval while the elaboration of remaining solutions is undergoing. Workshop (IIR’10), January 27–28, 2010, Padova, Italy. http://ims.dei.unipd.it/websites/iir10/ index.html 2. AN ARCHITECTURE OF REFERENCE Copyright owned by the authors. We implemented our approach into a tool, called Yaanii. A flex- IIR 2010, January 27-28, Padova [Italy] ible architecture of the system was design, as shown in Figure 1. Figure 1: Architecture of Yaanii It serves as a logical view of how the system looks like. This is a From a theoretical point of view, future directions focus on im- typical use scenario of the system: proving the search algorithm of Yaanii to reach a linear time com- plexity. From a practical point of view, we would improve the in- 1. The RDF Parser takes as input a collection of RDF Docu- dexing capabilities by embedding Lucene into a DBMS (e.g. Ora- ments and parses them into triples. Here we use the Jena cle) and provide a query-by-example interface to support the user framework 2 ; to perform the query and navigate the results. 2. The Indexer builds an index on top of the triple collections to 4. REFERENCES achieve structural information useful for the query process. [1] P. Cappellari, R. De Virgilio, M. Miscione. Keyword based Here the indexing is supported by Lucene3 and WordNet 4 . Search over Semantic Data in Polynomial Time. In Technical The last allows query expansion; Report RT-DIA-160, Universitá Roma Tre, Rome, Italy, 2009. [2] R. De Virgilio, P. Cappellari, M. Miscione. Cluster-based 3. A user performs a query through a GUI helper, handling exploration for Effective Keyword Search over Semantic events and the query itself; Datasets. In Proc. of the 28th International Conference on Conceptual Modeling (ER’09), Gramado, Brazil, 2009. 4. The parsed query is given to the Searcher for processing; [3] H. He, Wang, H., Yang, J., Yu, P.S. Blinks: ranked keyword searches on graphs. In Int. Conf. on Management of Data 5. The Searcher processes the query over the Indexed Resource (SIGMOD’07), China, 2007. Base and returns the search result to the caller. It communi- cates with the Indexer to extract the instances matching input [4] B. Kimelfeld, Sagiv, Y. Finding and approximating top-k keywords (i.e. informative paths), group them into clusters answers in keyword proximity search. In Int. Symposium on and compose elements from clusters into the final solutions Principles of Database Systems (PODS’06), USA, 2006. (i.e. subgraph structures). Each structure (i.e. path, cluster [5] F. Liu, Yu, C.T., Meng, W., Chowdhury, A. Effective and solution) is evaluated by a scoring function. keyword search in relational databases. In Int. Conf. on Management of Data (SIGMOD’06), USA, 2006. [6] T. Tran, Wang, H., Rudolph, S., Cimiano, P. Top-k 3. CONCLUSION AND FUTURE WORKS exploration of query graph candidates for efficient keyword We presented Yaanii, a tool for effective Keyword Search over search on rdf. In Int. Conf. on Data Engineering (ICDE’09), semantic datasets. It is based on a clustering technique and a scor- China, 2009. ing function that support the generation of Top-k solutions during its execution in the first k steps. 2 http://jena.sourceforge.net/ 3 http://lucene.apache.org/ 4 http://wordnet.princeton.edu