Keyword-Based Navigation and Search over the Linked Data Web Luca Matteis Aidan Hogan Roberto Navigli Dept. of Computer Science Dept. of Computer Science Dept. of Computer Science Sapienza University of Rome University of Chile Sapienza University of Rome matteis@di.uniroma1.it ahogan@dcc.uchile.cl navigli@di.uniroma1.it ABSTRACT Keyword search approaches over RDF graphs have proven intuitive for users. However, these approaches rely on local copies of RDF graphs. In this paper, we present an algo- rithm that uses RDF keyword search methodologies to find information in the live Linked Data web rather than against time local indexes. Users navigate between documents by specify- ing keywords that are matched against triples. Navigation Figure 1: Navigating the Web of Documents involves click- is performed through a pipeline which streams results to ing through various URIs coming from different sources. users as soon as they are found. Keyword search is assisted through the resolution of predicate URIs. We evaluate our methodology by converting several natural language ques- filmed tions into lists of keywords and seed URIs. For each question y eb we measured how quickly and how many triples appeared in n tow s ac clo me ge close b y to the output stream of each step of the pipeline. Results show ho o r that relevant triples are streamed back to users in less than 5 seconds on average. We think that this approach can help people analyze and explore various Linked Datasets in a fol- time low your nose fashion by simply typing keywords. hometown geo close by filmed actor 1. INTRODUCTION Figure 2: Navigating the Web of Data using our approach involves typing keywords rather than clicking. On the traditional “Web of Documents”, navigation is characterized by following links (with anchor text) between documents. On the Web of Data (also known as Linked Data), documents contain structured resource descriptions, often wish to explore different combinations of data sources where resources are interlinked with specific properties. This and URIs to get a more complete picture; they are unlikely increased granularity has opened the door to new approaches to know what navigational steps or what URIs are relevant for browsing [1] and querying [2, 3, 4] this information space, until they actually encounter them. navigating links that match user-specified structured pat- In order to avoid the need for users to specify a complex terns. With such methods, we are getting closer to answer- structured query upfront [2, 4] or for local free-text indexes ing complex queries against the live Linked Data web, and to be built [5, 6], we propose an approach that combines key not just against a crawled and hardly up-to-date index. aspects of (i) Linked Data navigation, such as link traversal However, users of these link traversal approaches need: techniques [2] and semantic controlled navigation [4]; and (i) to provide a formal description of the portion of the Web (ii) RDF search methodologies [5, 6]. The overall idea is they wish to traverse upfront and (ii) to know which URIs that starting at (a) given location(s) on the Web, the user to follow at each step of the navigation (similar to writing can interactively specify the next step by typing keywords SPARQL queries, for example, where users need to under- that are matched against the current data and used to find stand the structure of the data). Furthermore, users may and traverse links, adding more data. Users can thus use keywords to navigate and answer questions across multiple datasets without needing to know their structure or vocab- ulary beforehand or without needing to go through a (pos- sibly out-of-date or incomplete) centralized index. All the user needs to start is the seed URL of an initial document. This paper continues with discussion of related work in Section 2. We introduce our methodology in Section 3 and provide a detailed algorithm in Section 4. In Section 5 we Copyright is held by the author/owner(s). evaluate our approach against similar existing techniques. WWW2015 Workshop: Linked Data on the Web (LDOW2015). In Section 6 we conclude and discuss future work. 2. RELATED WORK known interest Several works propose methods for browsing and navigat- Andreas Kjetil Harth Kjernsmo Semantic Web ing Linked Datasets. NautiLOD [4] was our main source of inspiration, providing a language for navigating the Web Sarven Philosophy Capadisli of Data in a controlled manner. Other approaches, such as Link Traversal Querying [7] and Linked Data Frag- Figure 4: Navigation using a streaming pipeline ments [8], allow for querying Linked Datasets by resolv- ing and traversing URIs automatically. Unlike these ap- proaches, we do not require an upfront query or plan. explicit URIs or regular expressions, allowing for a more flex- Instead our approach enables users to perform their dis- ible and less rigorous browsing experience. Furthermore, in coveries “on the go”. In this sense it is somewhat similar other related approaches, users typically have to construct to Linked Data browsers such as Tabulator [1]. However, queries upfront and wait for the entire process to finish be- Tabulator is more akin to a traditional browsing scenario fore they can act on the results. Instead we rely on returning on the Web of Documents. Figures 1 and 2 draw a com- results to users in a streaming fashion as quickly as possible, parison. In both scenarios, the user is involved in every enabling the discovery of data as you go. step. But rather than clicking to resolve a link manually, To enable the streaming of results, we structure the nav- our approach enables users to search for multiple links using igation process in the form of a pipeline. Each element of keywords: in our approach, the navigation can branch. the pipeline is responsible for searching against the data it Versus keyword search approaches (e.g., see [5, 6]), we do receives. Figure 4 gives an example. The resource Andreas not require a local inverted index. We rather aim to enable Harth is fed into the first element of the pipeline, and us- live exploration of the Web in search of (up-to-date) answers. ing the keyword “known”, we look for relevant triples. These A related technique that combines both navigation and triples are then sent to an output, which becomes the in- search is Treo [9], which accepts a natural language ques- put stream for the next element of the pipeline. Rather tion that it uses to navigate a Linked Dataset in search for than having to wait for the process to finish at each step, answers. However, Treo again requires an upfront question. we act on results as soon as they are found. Thus the sec- Likewise Treo is assisted by third-party corpora, such as ond element can already start processing the resource Kjetil Wikipedia, to find specific keyword paths to follow, while Kjernsmo even though the earlier step has not finished look- our approach (currently) relies entirely on Linked Data. ing for triples. This enables users to continue the naviga- In terms of streaming results to users on-the-fly, Triple tion using new keywords, even while other elements of the Pattern Fragments [3] was a source of inspiration. pipeline are still busy looking for relevant results. 3. KEYWORD-BASED NAVIGATION AND 3.2 Search SEARCH To find relevant triples at each step of the pipeline, several To navigate and search the Web of Data using our ap- methodologies can be used. For instance, work such as [5] proach, users (or agents) initiate their browsing similar to show that keyword search against RDF structures is achiev- the usual Web of Documents: they provide a list of starting able with high accuracy. We keep the search component locations (seed URIs).1 The navigation starts by resolving modular, so that specific implementations can decide which the seed URIs after which users are invited to introduce a technique to utilize. In the evaluation section herein we per- keyword. The keyword is used to search for relevant con- form search using a string similarity algorithm against the tent against the information retrieved. This process is illus- string representation of each triple. trated in more detail in Figure 3 where we perform a navi- gation from seed URI http://harth.org/andreas/foaf#ah known looking for relevant information using the keyword “known”. URIs found using this keyword are sent to an output stream. Andreas Kjetil f o a f : knows In the upper-right corner we see this entire process happen- Harth . Kjernsmo ing again, using a new keyword and new seed URIs, namely f o a f : knows r d f s : comment Sarven the ones returned from the earlier step. ”A p e r s o n known by . . . ” . Capadisli In the following sub-sections, we will explain in more de- tail the navigation and search aspects of our approach. Figure 5: Search using predicate resolution 3.1 Navigation To enable further matches, we extend the search with The type of navigation we use to browse the Linked Data content acquired by resolving the predicate URIs associated Web in a programmatic way is similar to other works such as with the current resource. When matching RDF links, we LDpath2 and NautiLOD [4]. These approaches rely specif- are thus not only limited to matching tokens in the URI ically on following RDF links between resources and servers. string itself, but can match labels, comments, etc. Figure However, these methods only work in scenarios where users 5 shows an example of how predicate resolution can enrich know upfront which RDF links they wish to follow whereas a search result. For the Andreas Harth resource, no triples our approach does not make this assumption. Instead, our are found using the keyword “known”. However, by resolv- navigation is interactive and driven by keywords rather than ing predicate URIs we are able to find triples within the 1 foaf:knows predicate that match the keyword. We can there- One could imagine these URIs coming from a search engine or bookmark list, etc. fore use triples with the foaf:knows predicate that are asso- 2 http://marmotta.apache.org/ldpath/ ciated with our main resource to continue our navigation. output Semantic Web Chess Philosophy stream Excerpt of the navigation showing interest the contents of the streams as the http://www.kjetil.kjernsmo.net/foaf#me http://csarven.ca/#i navigation proceeds along the Kjetil rdfs:label “Kjetil Sarven rdfs:label “Sarven keyword pipeline. Kjernsmo Kjernsmo” Capadisli foa Capadisli” foa f:in f:to tere pic foa i st nte f:in Seed URI: r est ter http://harth.org/andreas/foaf#ah Chess est Keywords: known, interest Semantic Web Philosophy Kjetil Sarven output Kjernsmo Capadisli stream known http://harth.org/andreas/foaf#ah rdfs:label Andreas rdfs:label “Andreas rdfs:label “label” Harth foa Harth” rdfs:c f:kn omm “A human-readable ow ent foa s name for the f:kn subject.” Kjetil ow Kjernsmo s rdfs:label foaf:knows “knows” Sarven Capadisli rdfs:co mm ent “A person known by this person...” Figure 3: What are the interests of the people known by Andreas Harth? Many predicate URIs will resolve to an entire ontology or Algorithm 1 Keyword-based navigation and search vocabulary and not just to the term identified. For instance, Input: input stream S and keyword k the URI http://xmlns.com/foaf/0.1/knows resolves to a Output: output stream out of URIs found description of the entire FOAF vocabulary. For this reason 1: upon U RI in stream S do our approach relies only on the triples where the resolved 2: R ← triples from resolving U RI URI appears in either the subject or object position. 3: F ← search for k in R and return triples 4: if F.length > 0 then 4. ALGORITHM 5: N ← { unvisited URIs of F except predicates } 6: out.write(N ) We will now describe our approach in more detail by pre- 7: end if senting an algorithm and explaining its functionality using 8: T ← R − F a walkthrough example. The question in Figure 3 involves 9: for each triple t ∈ T do accessing data from different sources and following specific 10: P ← triples from resolving predicate of t paths based on the people Andreas knows. To represent this 11: F 0 ← search for k in P and return triples question formally we define a seed URI from which the nav- 12: if F 0 .length > 0 then igation initiates, and an ordered set of keywords we need 13: N ← { unvisited URIs of t except predicates } to follow to answer our question (e.g., in the order entered 14: out.write(N ) interactively by the user). Each keyword is effectively an 15: end if element in the pipeline. Algorithm 1 gives an overview of 16: end for the procedure for a single element of the pipeline. The algorithm takes as input a stream which we call S, and a keyword k. By following Figure 3, once the URI URIs from the current triple t (line 13). On line 14 we write http://harth.org/andreas/foaf#ah appears in the stream the extracted URIs to the stream out. Thanks to predicate S (line 1), we resolve this URI and assign the triples returned resolution, we now match triples with predicate foaf:knows, to the set R (line 2). Next on line 3 we apply our search resulting in 2 output URIs, namely http://csarven.ca/#i methodology against the set R using the keyword k (which and http://www.kjetil.kjernsmo.net/foaf#me. is “known”) and the found triples are assigned to the set F . If Figure 3 continues by listening for the keyword “interest”. triples are found we extract the URIs from the found triples The algorithm is re-executed with an input stream fed from (line 5) and write the resulting URIs to our output stream the output stream of the earlier execution. We therefore out. For the keyword “known” no triples were found, there- obtain 2 URIs in the input stream; for each of these URIs fore nothing is written to the output stream yet. the process is repeated. The process terminates when all Next on line 8 we obtain triples that were not yet found processing has finished for the given keywords or the user and assign them to the set T . We proceed by resolving the manually terminates the session. predicate URIs of this set. We loop through each triple of It is important to note that the resolution of URIs can oc- this set on line 9, and we resolve each predicate on line 10. cur in parallel. In the evaluation section we specifically cre- We then apply our search methodology against the set P ate a new thread for each request to obtain higher through- using the keyword “known” with found triples assigned to put. URIs resolved should also be cached to avoid unneces- the set F 0 (line 12); if F 0 is non-empty, we extract new sary re-retrieval of commonly requested documents. 836 115 #triples #triples 100 10 known 10 match interest definition 1 6.26 9.26 31.6 2.03 3 10 27 time(s) time(s) (a) What are the interests of the people known by (b) What are the available definitions for the English Andreas Harth? 2 keywords and 1 seed URI: noun “apple”? 2 keywords and 1 seed URI: http://harth.org/andreas/foaf#ah http://babelnet.org/rdf/s00005054n 74 45 63 #triples #triples 10 dbpedia director 50 based airport 1 runway @it 2.91 5.94 7.55 2.2 4.3 6.8 10 18.4 time(s) time(s) (c) What are the lengths of the runways of the airports (d) Give me the locations of the movies directed by Ridley in Rome, Italy? 3 keywords and 1 seed URI: Scott, in Italian. 3 keywords and 1 seed URI: http://sws.geonames.org/3169070/ http://data.linkedmdb.org/resource/director/8472 Figure 6: Execution of questions 5. EVALUATION errors, we used a thread pool size of 5 (swget was also con- The main characteristic of our keyword-based navigation figured with this size). All results for all the experiments and search is that it allows to answer questions “as you are the average of 5 runs. go” by simply typing keywords. The primary purpose of this preliminary evaluation is to check whether our identi- 5.2 Results fied method produces relevant results in an acceptable time Figures 6a, 6b, 6c and 6d show the contents of the var- frame. To this end, we performed two different experiments: ious output streams at specific times. Each plotted mark represents data being written to the stream. 1. First we executed a series of questions, each repre- We will analyze the results in terms of response time (re- sented as a set of keywords with a seed URI, using our ception of first solution for every keyword) and navigation identified approach. We measured how fast and how hop time (reception of first solution between each keyword). many relevant triples appeared in the output streams. For the question in Figure 6a, the response time is around 6 seconds for the first keyword and 9 seconds for the sec- 2. Next we compared our methodology against a similar ond keyword. This is mainly due to the fact that we had to implementation called swget [10], a multi-threaded resolve the predicate URIs in order to find triples matching tool that executes NautiLOD expressions (swgetM), the first “known” keyword. The average navigation hop time and we measured the response times at each navigation is 7.7 seconds. For the question in Figure 6b, the response step for both approaches. time is 2 seconds for the first keyword and 2.3 seconds for the second. The navigation hop time is 2.2 seconds. For 5.1 Experimental setup this question we see that the results are returned quicker To run the experiments, we implemented our navigation because we did not have to resort to predicate resolution. pipeline component using an observer pattern and our search For the questions in Figure 6c and 6d, triples relating to the component as a multi-threaded Java process that uses the first keyword are found in 2.8 and 2.2 seconds respectively, open source Apache Any233 library to resolve URIs and 5.9 and 4.3 seconds for the second keyword, and 7.5 and 5.8 obtain triples. We instantiated keyword search by utilizing seconds for the third keyword. The navigation hop time is the dice coefficient 4 string matching algorithm which works 5.4 and 4.1 seconds. For these two questions we see that, especially well with strings of variable length. The swget even with a greater number of keywords, the navigation hop comparison does not take into account our search method- time remains below 10 seconds. ology since we use explicit URIs. As HTTP requests were In summary, in these results we can see that response run in parallel in different threads, to avoid many simultane- times are all under 10 seconds. Even more important is the ous connections that may overload the servers and generate navigation hop time, which suggests that users can navigate across resources, and match relevant triples, in around 5 sec- 3 onds on average. These results, although preliminary, show http://any23.apache.org/ 4 us that browsing the Web of Data using keywords is possi- http://www.catalysoft.com/articles/StrikeAMatch. html ble within the bounds of interactive time-frames, albeit a bit sluggish by modern browsing terms. Methods for improving Acknowledgments response times are discussed in the future work section. Sapienza affiliated authors gratefully acknowledge support swget comparsion: The speed difference between swget from the LIDER project (№ 610782), a Coordination and and our approach (indicated as kbld) is shown in Figure Support Action funded by the European Commission un- 7 where we see swget’s response time increasing especially der FP7. This work was also supported by the Millennium for results returned by following owl:sameAs links. With Nucleus Center for Semantic Web Research under Grant our approach the response times are stable without any rel- № NC120004, and Fondecyt under Grant № 11140900. evant jumps. The quicker response times of our approach are mainly due to our pipeline algorithm which, instead of wait- ing for each step to finish, sends results along the pipeline 7. REFERENCES [1] T. Berners-Lee, Y. Chen, L. Chilton, D. Connolly, as soon as they are found; our goal is specifically to enable R. Dhanaraj, J. Hollenbach, A. Lerer, and D. Sheets, interactivity. swget was configured to stream results, how- “Tabulator: Exploring and Analyzing linked data on ever, it still performs a breadth-first search which means the Semantic Web,” in In Proceedings of the 3rd each step has to finish before the navigation can continue. International Semantic Web User Interaction Workshop, 2006. 20 17.52 [2] O. Hartig, C. Bizer, and J.-C. Freytag, “Executing SPARQL Queries over the Web of Linked Data,” in 15 ISWC’09, 2009. 10.35 10.7 11.02 [3] R. Verborgh, O. Hartig, B. De Meester, time(s) 10 11.25 G. Haesendonck, L. De Vocht, M. Vander Sande, 7.5 R. Cyganiak, P. Colpaert, E. Mannens, and R. Van de 8.66 9.08 9.47 3.56 6.7 Walle, “Querying Datasets on the Web with High 5 kbld 1.96 Availability,” in ISWC’14, pp. 180–196, 2014. 2.9 swget 2.06 [4] V. Fionda, C. Gutierrez, and G. Pirró, “Semantic 0 m ity Of tal ry eAs s:labe l Navigation on the Web of Data: Specification of :tea dbp:c art api ego dbp o: isP dbo:c p : cat w l:sam r df Routes, Web Fragments and Actions,” in WWW’12, db db o pp. 281–290, 2012. Figure 7: Response times comparison. Seed URI: [5] S. Elbassuoni and R. Blanco, “Keyword search over http://dbpedia.org/resource/Michael_Jordan RDF graphs,” in Proceedings of the 20th ACM international conference on Information and knowledge management, pp. 237–242, ACM, 2011. [6] A. Hogan, A. Harth, J. Umbrich, S. Kinsella, A. Polleres, and S. Decker, “Searching and browsing 6. CONCLUSION AND FUTURE WORK Linked Data with SWSE: The Semantic Web Search In this short paper we combined RDF keyword search Engine,” Journal of Web Semantics, vol. 9, no. 4, methodologies with Linked Data traversal techniques. Pre- pp. 365–401, 2011. liminary results showed that combining these two method- [7] O. Hartig and J.-C. Freytag, “Foundations of traversal ologies – with pipeline-based navigation and predicate res- based query execution over Linked Data,” in olution search – enables clients to explore the Web of Data HYPERTEXT, pp. 43–52, ACM, 2012. using keywords. Our approach of streaming results back to [8] R. Verborgh, M. Vander Sande, P. Colpaert, the user as quickly as possible shows that applications can S. Coppens, E. Mannens, and R. Van de Walle, be developed to make use of our keyword-based approach. “Web-Scale Querying through Linked Data For future work we would like to further develop and eval- Fragments,” in Proceedings of the 7th Workshop on uate the techniques presented. Reasoning methods could be Linked Data on the Web, 2014. used to induce and find other relevant information. Babel- [9] A. Freitas, J. G. Oliveira, E. Curry, S. O’Riain, and Net [11] may be useful to match synonyms and translations. J. C. P. da Silva, “Treo: Combining Entity-Search, Disambiguation methods such as Babelfy [12], using the Spreading Activation and Semantic Relatedness for context acquired during navigation, may increase the accu- Querying Linked Data,” in Proc. of 1st Workshop on racy of the search. For traversal and navigation techniques, Question Answering over Linked Data (QALD-1) at presenting a summary of the most used keywords available the 8th Extended Semantic Web Conference (ESWC at each resource could assist users with navigation. Sev- 2011), 2011. eral issues arise when URIs are unavailable or when limits are imposed by data providers; better approaches are also [10] V. Fionda, C. Gutierrez, and G. Pirro, “The swget needed for concurrent resolution of URIs so as to provide portal: Navigating and acting on the web of linked quick response times while not straining servers. Future data,” Web Semantics: Science, Services and Agents work could therefore provide better methods for effectively on the World Wide Web, vol. 26, pp. 29–35, 2014. crawling Linked Datasets at runtime. [11] R. Navigli and S. P. Ponzetto, “BabelNet: The We would also like to look at the potential of using our Automatic Construction, Evaluation and Application keyword-based approach inside an application such as Tab- of a Wide-Coverage Multilingual Semantic Network,” ulator [1]. Furthermore we would like to implement a stan- Artificial Intelligence, vol. 193, pp. 217–250, 2012. dalone application to showcase the possibilities of this ap- [12] A. Moro, A. Raganato, and R. Navigli, “Entity proach, namely a browser that navigates Linked Data re- Linking meets Word Sense Disambiguation: a Unified sources using keywords rather than clicking links. Approach,” TACL, vol. 2, pp. 231–244, 2014.