Keyword-Based Semantic Search Engine Koios++ Björn Forcher1 , Andreas Giloj1 , and Erich Weichselgartner1 Leibniz Institute for Psychology Information, Germany, email: rstname.lastname@zpid.de Abstract. In this paper, we describe the keyword-based semantic search engine KOIOS++ which interprets the keywords and computes a set of SPARQL queries. The special feature of KOIOS++ is that it leverages not only the class hierarchy but also the property hierarchy. The algo- rithm and data structures of KOIOS++ are based on a well-established approach that we extended by minor adjustments of the data structures and a sophisticated weighting strategy. Key words: keyword-based semantic search, RDFS semantics 1 Introduction The RDF data model is a popular data model of the Semantic Web which can be easily transferred to a simple directed graph. RDFS extends RDF and provides representational constructs for describing ontologies, for instance the properties rdfs:subClassOf and rdfs:subPropertyOf. Both properties are transitive relations which are used to describe class or property hierarchies of ontologies. Finding information in graph-shaped data, keyword search seems to be nat- ural because it is the de facto standard for current search engines. The eld of keyword-based search on graph-structured data in general, and in particular over RDF data, is a prevalent research topic and corresponding eorts can be referred in [1], [2], and [3] at dierent glances. Tran et al. [4] describe an interesting ap- proach of keyword-based semantic search for computing the top-k ranked search results from RFD(S) graphs. They rst compute queries from the keywords, al- lowing the users to choose one of them, and nally to process the query using the underlying database engine. Nevertheless, these approaches do not focus on the hierarchy of properties and thus, they cannot make use of the transitive tree of the rdfs:subPropertyOf relation. Recent publications of that group focused on the processing of the approximated top-k ranked results [5] to reduce computa- tion time but they did not consider further aspects of the RDFS semantics. The primary motivation of our work is to make the rdfs:subPropertyOf available for keyword-based semantic search on graph shaped data. We extend the approach of Tran et al. by using a special graph mapping and a sophisticated weighting strategy which is implemented in the KOIOS++ search engine. We claim that in this way we get a semantic benet because we can favor certain graph patterns during the search. As a result, it is possible, for instance, to make use of the property hierarchy by leveraging the rdfs:subPropertyOf relation. The paper is structured as follows. Sec. 2 introduces the search engine KOI- OS++ including its search algorithm and data structures. We conclude with a brief summary and a small outlook (Sec. 3). 2 Koios++ As mentioned in the previous chapter, the approach of KOIOS++ is based on the work of Tran et al.[4] which is depicted in Figure 1. In the Data Preprocess- ing two main data structures are built from the RDF(S) data, namely keyword index and summary graph. The graph represents a summarization of the RDF(S) data comprising only structural information. Thus, it contains only classes and properties, but no literals and instances. Literals and instances are integrated at runtime by means of the keyword index. In general, the keyword index is used to map keywords to elements of the RDF(S) data. The basic thinking behind the establishment of both structures is to enable a high performance search along with scalability. The summary graph (kept in memory) contains only as much information as necessary whereas the keyword index (native database) represents an entry point containing additional information. Keywords Conjunctive Queries Keyword-Element- Augmentation of Top-K Graph Element-Query- Mapping Graph Index Exploration Mapping Query Computation Keyword Index Summary Graph Keyword Indexing Summarization Data Preprocessing RDF(S) Data Fig. 1. Data Processing and Query Computation One contribution of our work is the adjustment of the summary graph in order to leverage both, the class hierarchy and the property hierarchy. Figure 2 is intended to point out the main dierence between the two approaches. The gure contains some example triples and the corresponding summary. In con- trast to the summarization of Tran et al. a triple with two instances is mapped to a directed edge. Source and target correspond to the classes of the instances and the property of the triple is noted as label of the edge. In our approach every argument of a triple is mapped to a typed node (class or property node). The node of the subject and the node of the predicate are connected by a directed edge without label. The same applies for predicate and object. The only excep- tion to that rule are triples including a 'rdfs:subClassOf' or 'rdfs:subPropertyOf' predicate. In these cases the nodes are directly linked with an edge. For computing SPARQL queries, the loose keywords are rst disassembled into its constituent parts. For simplicity, let the keywords already be separated. Thus, there is a set of h terms T = {t1 , ..., th }, whereas ti ∈ T is either a single word or a multi-word expression. In the following step, the terms are mapped to el- ements of the input RDF(S) data which are called M-Resources. In general, a term ti ∈ T is mapped to a set of j resources Ri = {ri1 , ..., rij }. As follows, there are h sets of M-Resources R1 , ..., Rh that are used for further processing. The resource sets R1 to Rh were distributed to h threads and in each thread Zi a graph exploration is performed on the prepared (augmented) summary graph for each M-Resource riq ∈ Ri . Hence, many paths were explored starting from riq . In case there is a resource rc that is reached by any path in each thread a connecting subgraph Gs of the knowledge base can be constructed consisting of h paths. The resource rc is called C-Resource which connects all paths with one another. The outcome of the graph search algorithm is a set of weighted sub- graphs G1 , ..., Gq , whose size can be restricted by an upper weight limit and a general time limit. The weighting strategy can be separated into two parts. The static part weights all nodes and edges in the preprocessing step (further infor- mation is presented in Tran et al. [4]). The dynamic part of the weighting takes place at the runtime of the system and concerns visited paths only. The weight wpn for a new path pn is based on the path before po , the new edge ea and node vb , and the resulting path pattern mn : w(pn ) = w(po ) + w(ea ) + w(vb ) + w(mn ). The last steps of KOIOS++ are straightforward. For each subgraph G1 , ..., Gq a conjunctive SPARQL query is constructed and presented as semantic network to the user. Subsequently, the user can select relevant queries to retrieve the corresponding answer from the triplestore. Consider the example keywords 'person' and 'worked'. The rst one is mapped to the class 'zpid:Person' and the second one is mapped to the property 'zpid:WorkedOn'. The exploration may track two paths starting from both nodes which may end up in the nodes 'zpid:Librarian' or 'zpid:Author'. As follows, two dierent SPARQL queries can be constructed. The presented data structure has an inherent disadvantage because the queries may contain statements that are not included in the origin data, for instance, 'a librarian wrote an article'. This information is not necessarily incorrect, but if this relation is not entailed in the triplestore unnecessary graph traveling is done. As follows, the algorithm gets expensive and time-consuming. This is one reason, why we integrated the dynamic weighting as described above. If the graph trav- eling ends in a path pn with an unwanted pattern the additional weight w(mn ) is set to innite (if not unwanted w(mn ) = 0). That means that the new path is (very likely) not considered for further exploration and query construction. Thus, the described graph mapping in combination with the dynamic weighting enables the integration of the property hierarchy without constructing unwanted SPARQL queries. However, this kind of weighting strategy could also be used to foster (w(mn ) < 0) certain graph patterns which is important for explanation scenarios. zpid:Author zpid:Publication zpid:Librarian zpid:Metadata rdf:type rdf:type rdf:type rdf:type zpid:wrote zpid:wrote zpid:ErichW zpid:Article-01 zpid:JuergenW zpid:Abstract-02 Summarization zpid:Author zpid:Article zpid:Person zpid:wrote zpid:workedOn zpid:Librarian zpid:Metadata Fig. 2. Summarization of some triples 3 Conclusion In this paper, we presented the semantic search engine KOIOS++ that enables a keyword-based search on RDF(S) triple stores. It interprets the keywords and computes a set of SPARQL queries which can be selected by users to search the triple store. The RDF(S) data is mapped to a graph structure which is explored for the computation. In contrast to other approaches a predicate is not mapped to an edge, it is mapped to a node. To avoid unnecessary workload a sophisticated weighting strategy is applied. In particular, the strategy takes place at runtime whereas the exploration of new graph elements is based on the previous explored elements (conditional search). The presented approach enables not only heuristic reasoning on class hierarchies but also on the property hierarchies. The next step of our work is to integrate SPARQL operators into our approach. Thus, it becomes possible to use words, such as "greater" or "smaller". References 1. Achiezra, H., Golenberg, K., Kimelfeld, B., Sagiv, Y.: Exploratory keyword search on data graphs. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. SIGMOD '10, New York, USA, ACM (2010) 2. Cappellari, P., De Virgilio, R., Maccioni, A., Roantree, M.: A path-oriented rdf index for keyword search query processing. In: Proceedings of the 22Nd International Conference on Database and Expert Systems Applications - Volume Part II. (2011) 3. De Virgilio, R., Maccioni, A., Cappellari, P.: A linear and monotonic strategy to keyword search over rdf data. In Daniel, F., Dolog, P., Li, Q., eds.: Web Engineering. Springer Berlin Heidelberg (2013) 4. Tran, D.T., Wang, H., Rudolph, S., Cimiano, P.: Top-k exploration of query candi- dates for ecient keyword search on graph-shaped (rdf) data. In: Proc. of the 25th Intern. Conference on Data Engineering (ICDE'09), Shanghai, China (2009) 5. Wagner, A., Bicer, V., Tran, T.: Pay-as-you-go approximate join top-k processing for the web of data. In Presutti, V., d'Amato, C., Gandon, F., d'Aquin, M., Staab, S., Tordai, A., eds.: The Semantic Web: Trends and Challenges. Springer International Publishing (2014)