Virtual Document-based Methods for Keyword Search on RDF Graphs Dennis Dosso Gianmaria Silvello Department of Information Engineering, University of Department of Information Engineering, University of Padua Padua Padua, Italy Padua, Italy dosso@dei.unipd.it silvello@dei.unipd.it ABSTRACT Search is a best effort approach based on bag of words query. The In recent years, RDF datasets emerged as the de-facto standard for result is a ranking of potential answers ordered following their publishing data on the web. SPARQL, the structured query language relevance to the user query. Since we are working with construct to interrogate RDF dataset, is however hard to use for non-expert SPARQL query, the output of a keyword query is a ranking of users due to its syntax. Keyword Search, on the other hand, is an answer subgraphs. intuitive query paradigm to which users are today accustomed to. Keyword Search has been thoroughly studied in the contest of In this paper, we discuss the recent research about Keyword Search structured datasets such as relational databases and Knowledge on RDF datasets with virtual document-based approaches and the Bases. Good reviews about this topic are [1] and [10]. However, future directions in the creation of virtual documents in order to as pointed out in [3], no prototype has led to a transition from improve the quality of the retrieval. proof-of-concept implementations into fully deployed systems. In this work, we discuss a particular subclass of keyword search CCS CONCEPTS systems: the one based on the virtual document-based approach. The associated virtual document of an RDF graph is the textual bag • Information systems → Retrieval models and ranking; In- of words derived from the extraction of words from the resources formation retrieval; Document representation; Evaluation of retrieval composing the graph. In [4] we studied already existing state-of- results; the-art methods and developed new strategies that leverage on the virtual documents and showed how they are able to overcome the KEYWORDS limitations of effectiveness and efficiency highlighted in [2] and [3]. RDF datasets, keyword search, virtual documents In the following, we discuss the virtual document-based approach and some systems that make use of it and we highlight new possible 1 INTRODUCTION AND RELATED WORKS directions for the improvement of the approach. In recent years The Resource Description Framework (RDF) has be- come the de facto standard for the Linked Data paradigm and the 2 THE VIRTUAL DOCUMENT-BASED publication of information in the Web of Data. An RDF dataset is a APPROACHES set of triples composed by subject, predicate, and object. This set can Among the keyword search system based on virtual documents we also be seen as a directed labeled graph. Every node is labeled with count SLM [5], MRF-KS [7] and SUMM [6]. In [4] we propose other an IRI or a string called Literal (only when object) while the edges two systems: TSA+BM25 and TSA+VDP. As we show, the two sys- are labeled with an IRI. RDF graphs enable flexible manipulation tems overcome limitations presented by the baselines. SLM ranks of data, their enrichment, their discovery and reuse across differ- its answers using an adapted language model. These are produced ent applications. RDF datasets on the Web today contain typically by concatenating triples that contain keywords. MRF-KS creates an- thousands of millions of edges [9]. swer graphs in the form of trees whose leaves are nodes containing To interrogate these databases we use the structured language keywords and uses a Markov Random Field (MRF) function [8] for SPARQL. This language is complex due to its syntax and the neces- the ranking. SUMM also creates answer trees with a particular focus sity to know the structure of the graph to build the query. This is on the efficiency, using homomorphism among graphs and indexes a hindrance for non-expert users like doctors or other specialists to speed up the computation on-line. TSA+BM25 creates a subset of that do not have the time or the will to learn the language. In the subgraphs extracted from the main database and builds one virtual following, we always consider a particular type of SPARQL queries: document from each of them. It indexes these documents and per- the construct query type, which returns a subgraph. forms the ranking with the BM25 ranking model. The final ranking Keyword Search is a simpler paradigm that can enable users of graphs is based on the ranking produced over the corresponding to easily access data, overcoming SPARQL complexity. Keyword virtual documents by BM25. TSA+VDP improves the ranking of TSA+BM25. In particular, firstly it applies a pruning heuristic. It Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). This is an extended abstract from the proceeds inward, removing all the triples that do not contain at paper “A Scalable Virtual Document-based Keyword Search System for RDF Datasets” least one keyword, starting from the nodes with highest distance from SIGIR 2019 by the same authors. from the source node. Then the system applies a second Markov IIR 2019, September 16–18, 2019, Padova, Italy Random Field ranking function to the new collection of graphs. This function presents two factors, which take into accounts unigrams 38 IIR 2019, September 16–18, 2019, Padova, Italy D. Dosso and G. Silvello Table 1: Performances of the different algorithms. † indi- However, the creation of a virtual document is still quite unso- cates the systems in the top performing group with α < 0.01. phisticated in its execution: the document is built extracting words The best system is in bold. from the IRIs and Literals of the graph. Often, these words are extracted from the last part of the path of an IRI. However, often Dataset Systems tb-DCG time (sec) memory (MB) there is more information contained inside an IRI that can be used TSA+BM25 0.201±0.02 39.64±01.90† 13.19±0.41 to build more structured and useful virtual documents. TSA+VDP 0.490±0.04† 318.78±21.60 21.06±1.18 LinkedMDB 1M SLM 0.011±0.00 39.90±08.69† 0.82±0.22 For example, an RDF triple that can be found in LinkedMDB is MRF-KS 0.400±0.03† 285.22±30.10 0.99±0.09 , which has TSA+BM25 0.284±0.06† 35.28±20.36† 18.59±9.67 corresponding virtual document: “8469 director name Quentin Tarantino”. TSA+VDP 0.243±0.05† 406.64±90.74 19.44±9.69 The number “8469” is not useful for a human reader, since it is sim- LUBM 1M SLM 0.048±0.00† 61.14±28.84† 1.71±0.94 MRF-KS 0.090±0.03† 304.86±61.54† 2.00±0.40 ply a serial inside the database. Taken alone, the single word does SUMM 0.024±0.01 526.14±65.67 11.11±1.41 not contain any useful information. However, the subject’s IRI also TSA+BM25 0.171±0.01 87.52±12.86† t 23.24±00.55 contains the word “resource”, signaling that this is an entity, and LinkedMDB 7M TSA+VDP 0.429±0.04† 425.30±42.26 45.88±03.22 “director”, signaling that this entity is a director. While this infor- SUMM 0.049±0.01 741.16±25.78 37.03±00.76 mation is somehow already contained in the IRI of the predicate of TSA+BM25 0.281±0.07† 29.57±8.27† 26.42±14.72 LUBM 10M TSA+VDP 0.234±0.06† 37.42±10.14† 26.71±14.72 this triple, this may not always be the case. Thus, the whole con- SUMM 0.053±0.00 794.29±79.99 10.93±01.60 tent of the path of the IRI can be used to improve the information contained in the virtual documents. A possibility is to divide a virtual document in fields, for exam- ple metadata and content. The content field contains the bag of and bigrams, and also weights the distance of a keyword from the words document, while the metadata field can contain the meta- center of the graph. In this way TSA+VDP also takes into account data derived from the IRIs, like, in this case, the word “director”, the graph structure of the answer. highlighting the fact that the document is talking about a direc- Table 1 reports some of our results on four databases that we used: tor. It is also often the case that words like “director” or “actor” in LinkedMDB, of circa 7 millions of triples, and a reduced version databases like LinkedMDB are quite frequent. This can be a problem called here LinkedMDB 1M of one million triples and two versions for models like BM25, that can rank a non-relevant document high of LUBM of 1M and 10M of triples respectively. For LinkedMDB the in the ranking simply because a single keyword is repeated many results are computed on average over 50 topics we created by hand. times. Using the metadata field can help the algorithm to avoid the For LUBM we used the 14 queries available on the website of the repetition of the same word many times if it is associated with the database. For every topic, we create one SPARQL query that enables same entity, thus creating more human-friendly documents. us to extract an RDF subgraph. This subgraph works as a Ground Acknowledgments This work is partially supported by the Truth (GT). Every system should return answer graphs that are as Computational Data Citation (CDC-STARS) project of the Univer- close as possible to the GT. Every triple in the GT is considered sity of Padua. as a relevant triple. tb-DCG is a new evaluation metric defined by us which rewards the number of relevant triples in the answer, its REFERENCES position in the ranking and the quantity of noise (non-relevant [1] H. Bast, B. Buchhold, and H. Haussmann. 2016. Semantic Search on Text and triples) in the graph. Knowledge Bases. Foundations and Trends in Information Retrieval 10, 2-3 (2016), 119–271. Only three systems, TSA+BM25, TSA+VDP, and SUMM are able [2] J. Coffman and A. C. Weaver. 2010. A framework for evaluating database keyword to scale to the bigger databases. SLM is not able to scale since search strategies. In Proc. of the 19th ACM International Conference on Information and knowledge management. ACM Press, 729–738. it performs too many operations online, and thus the bigger the [3] J. Coffman and A. C. Weaver. 2014. An Empirical Performance Evaluation of database the bigger the execution time. MRF-KS relies too much on Relational Keyword Search Systems. IEEE Transactions on Knowledge and Data a Dijkstra-based exploration of the graph in its offline phase, which Engineering 26, 1 (2014), 30–42. [4] D. Dosso and G. Silvello. 2019. A Scalable Virtual Document-Based Keyword does not allow to scale to bigger dimensions Search System for RDF Datasets. In Proceedings of the 42nd International ACM TSA+VDP is the top performing system when we work on the SIGIR conference on Research and Development in Information Retrieval, SIGIR. real database LinkedMDB, while TSA+BM25 is the top one on LUBM [5] S. Elbassuoni and R. Blanco. 2011. Keyword Search over RDF Graphs. In Proc. of the 20th ACM Conference on Information and Knowledge Management, CIKM 1M. TSA+BM25 obtains the lowest execution time thanks to its 2011,. ACM Press, New York, USA, 237–242. BM25-based strategy, while TSA+VDP performs a trade-off between [6] W. Le, F. Li, A. Kementsietsidis, and S. Duan. 2014. Scalable Keyword Search on Large RDF Data. IEEE Trans. Knowl. Data Eng. 26, 11 (2014), 2774–2788. time and effectiveness. Similarly, for LinkedMDB 7M TSA+VDP [7] Y. Mass and Y. Sagiv. 2016. Virtual Documents and Answer Priors in Keyword obtains the highest result of tb-DCG among the three systems, Search over Data Graphs. In Proc. of the Workshops of the EDBT/ICDT 2016 Joint while TSA+BM25 is the best one in LUBM 10M. Conference (CEUR Workshop Proceedings), Vol. 1558. CEUR-WS.org. [8] D. Metzler and W. B. Croft. 2005. A Markov random field model for term depen- dencies. In SIGIR 2005: Proceedings of the 28th Annual International ACM SIGIR 3 FUTURE DIRECTIONS Conference on Research and Development in Information Retrieval. ACM, 472–479. Systems based on virtual documents are able to scale to bigger [9] S. Sahu, A. Mhedhbi, S. Salihoglu, J. Lin, and M. T. Özsu. 2017. The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing. PVLDB 11, 4 datasets thanks to the help of the virtual document nature of the (2017), 420–431. subgraphs, that enables them to use adapted state of the art IR [10] H. Wang and C. C. Aggarwal. 2010. A Survey of Algorithms for Keyword Search methods to improve efficiency. on Graph Data. In Managing and Mining Graph Data. Springer, 249–273. 39