Virtual Document-based Methods for Keyword Search on RDF
                            Graphs
                                   Dennis Dosso                                                                    Gianmaria Silvello
      Department of Information Engineering, University of                                      Department of Information Engineering, University of
                            Padua                                                                                       Padua
                          Padua, Italy                                                                               Padua, Italy
                      dosso@dei.unipd.it                                                                        silvello@dei.unipd.it

ABSTRACT                                                                                        Search is a best effort approach based on bag of words query. The
In recent years, RDF datasets emerged as the de-facto standard for                              result is a ranking of potential answers ordered following their
publishing data on the web. SPARQL, the structured query language                               relevance to the user query. Since we are working with construct
to interrogate RDF dataset, is however hard to use for non-expert                               SPARQL query, the output of a keyword query is a ranking of
users due to its syntax. Keyword Search, on the other hand, is an                               answer subgraphs.
intuitive query paradigm to which users are today accustomed to.                                   Keyword Search has been thoroughly studied in the contest of
In this paper, we discuss the recent research about Keyword Search                              structured datasets such as relational databases and Knowledge
on RDF datasets with virtual document-based approaches and the                                  Bases. Good reviews about this topic are [1] and [10]. However,
future directions in the creation of virtual documents in order to                              as pointed out in [3], no prototype has led to a transition from
improve the quality of the retrieval.                                                           proof-of-concept implementations into fully deployed systems.
                                                                                                   In this work, we discuss a particular subclass of keyword search
CCS CONCEPTS                                                                                    systems: the one based on the virtual document-based approach.
                                                                                                The associated virtual document of an RDF graph is the textual bag
• Information systems → Retrieval models and ranking; In-
                                                                                                of words derived from the extraction of words from the resources
formation retrieval; Document representation; Evaluation of retrieval
                                                                                                composing the graph. In [4] we studied already existing state-of-
results;
                                                                                                the-art methods and developed new strategies that leverage on the
                                                                                                virtual documents and showed how they are able to overcome the
KEYWORDS
                                                                                                limitations of effectiveness and efficiency highlighted in [2] and [3].
RDF datasets, keyword search, virtual documents                                                 In the following, we discuss the virtual document-based approach
                                                                                                and some systems that make use of it and we highlight new possible
1     INTRODUCTION AND RELATED WORKS                                                            directions for the improvement of the approach.
In recent years The Resource Description Framework (RDF) has be-
come the de facto standard for the Linked Data paradigm and the                                 2   THE VIRTUAL DOCUMENT-BASED
publication of information in the Web of Data. An RDF dataset is a                                  APPROACHES
set of triples composed by subject, predicate, and object. This set can                         Among the keyword search system based on virtual documents we
also be seen as a directed labeled graph. Every node is labeled with                            count SLM [5], MRF-KS [7] and SUMM [6]. In [4] we propose other
an IRI or a string called Literal (only when object) while the edges                            two systems: TSA+BM25 and TSA+VDP. As we show, the two sys-
are labeled with an IRI. RDF graphs enable flexible manipulation                                tems overcome limitations presented by the baselines. SLM ranks
of data, their enrichment, their discovery and reuse across differ-                             its answers using an adapted language model. These are produced
ent applications. RDF datasets on the Web today contain typically                               by concatenating triples that contain keywords. MRF-KS creates an-
thousands of millions of edges [9].                                                             swer graphs in the form of trees whose leaves are nodes containing
   To interrogate these databases we use the structured language                                keywords and uses a Markov Random Field (MRF) function [8] for
SPARQL. This language is complex due to its syntax and the neces-                               the ranking. SUMM also creates answer trees with a particular focus
sity to know the structure of the graph to build the query. This is                             on the efficiency, using homomorphism among graphs and indexes
a hindrance for non-expert users like doctors or other specialists                              to speed up the computation on-line. TSA+BM25 creates a subset of
that do not have the time or the will to learn the language. In the                             subgraphs extracted from the main database and builds one virtual
following, we always consider a particular type of SPARQL queries:                              document from each of them. It indexes these documents and per-
the construct query type, which returns a subgraph.                                             forms the ranking with the BM25 ranking model. The final ranking
   Keyword Search is a simpler paradigm that can enable users                                   of graphs is based on the ranking produced over the corresponding
to easily access data, overcoming SPARQL complexity. Keyword                                    virtual documents by BM25. TSA+VDP improves the ranking of
                                                                                                TSA+BM25. In particular, firstly it applies a pruning heuristic. It
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0). This is an extended abstract from the        proceeds inward, removing all the triples that do not contain at
paper “A Scalable Virtual Document-based Keyword Search System for RDF Datasets”                least one keyword, starting from the nodes with highest distance
from SIGIR 2019 by the same authors.                                                            from the source node. Then the system applies a second Markov
IIR 2019, September 16–18, 2019, Padova, Italy
                                                                                                Random Field ranking function to the new collection of graphs. This
                                                                                                function presents two factors, which take into accounts unigrams


                                                                                           38
IIR 2019, September 16–18, 2019, Padova, Italy                                                                                             D. Dosso and G. Silvello

Table 1: Performances of the different algorithms. † indi-                             However, the creation of a virtual document is still quite unso-
cates the systems in the top performing group with α < 0.01.                        phisticated in its execution: the document is built extracting words
The best system is in bold.                                                         from the IRIs and Literals of the graph. Often, these words are
                                                                                    extracted from the last part of the path of an IRI. However, often
    Dataset        Systems         tb-DCG        time (sec)   memory (MB)           there is more information contained inside an IRI that can be used
                   TSA+BM25     0.201±0.02   39.64±01.90†       13.19±0.41
                                                                                    to build more structured and useful virtual documents.
                   TSA+VDP    0.490±0.04†     318.78±21.60      21.06±1.18
    LinkedMDB 1M   SLM          0.011±0.00    39.90±08.69†      0.82±0.22              For example, an RDF triple that can be found in LinkedMDB is
                   MRF-KS      0.400±0.03†    285.22±30.10       0.99±0.09          <http:// data.linkedmdb.org/ resource/ director/ 8469, http:// data.linkedmdb.
                   SUMM         0.106±0.01    429.52±37.17      20.54±1.20          org/ resource/ movie/ director_name, "Quentin Tarantino">, which has
                   TSA+BM25   0.284±0.06†    35.28±20.36†       18.59±9.67          corresponding virtual document: “8469 director name Quentin Tarantino”.
                   TSA+VDP     0.243±0.05†    406.64±90.74      19.44±9.69
                                                                                    The number “8469” is not useful for a human reader, since it is sim-
    LUBM 1M        SLM         0.048±0.00†    61.14±28.84†      1.71±0.94
                   MRF-KS      0.090±0.03†   304.86±61.54†       2.00±0.40          ply a serial inside the database. Taken alone, the single word does
                   SUMM         0.024±0.01    526.14±65.67      11.11±1.41          not contain any useful information. However, the subject’s IRI also
                   TSA+BM25     0.171±0.01   87.52±12.86†     t 23.24±00.55         contains the word “resource”, signaling that this is an entity, and
    LinkedMDB 7M   TSA+VDP    0.429±0.04†     425.30±42.26       45.88±03.22        “director”, signaling that this entity is a director. While this infor-
                   SUMM         0.049±0.01    741.16±25.78       37.03±00.76
                                                                                    mation is somehow already contained in the IRI of the predicate of
                   TSA+BM25   0.281±0.07†     29.57±8.27†       26.42±14.72
    LUBM 10M       TSA+VDP     0.234±0.06†    37.42±10.14†       26.71±14.72
                                                                                    this triple, this may not always be the case. Thus, the whole con-
                   SUMM         0.053±0.00    794.29±79.99       10.93±01.60        tent of the path of the IRI can be used to improve the information
                                                                                    contained in the virtual documents.
                                                                                       A possibility is to divide a virtual document in fields, for exam-
                                                                                    ple metadata and content. The content field contains the bag of
and bigrams, and also weights the distance of a keyword from the                    words document, while the metadata field can contain the meta-
center of the graph. In this way TSA+VDP also takes into account                    data derived from the IRIs, like, in this case, the word “director”,
the graph structure of the answer.                                                  highlighting the fact that the document is talking about a direc-
   Table 1 reports some of our results on four databases that we used:              tor. It is also often the case that words like “director” or “actor” in
LinkedMDB, of circa 7 millions of triples, and a reduced version                    databases like LinkedMDB are quite frequent. This can be a problem
called here LinkedMDB 1M of one million triples and two versions                    for models like BM25, that can rank a non-relevant document high
of LUBM of 1M and 10M of triples respectively. For LinkedMDB the                    in the ranking simply because a single keyword is repeated many
results are computed on average over 50 topics we created by hand.                  times. Using the metadata field can help the algorithm to avoid the
For LUBM we used the 14 queries available on the website of the                     repetition of the same word many times if it is associated with the
database. For every topic, we create one SPARQL query that enables                  same entity, thus creating more human-friendly documents.
us to extract an RDF subgraph. This subgraph works as a Ground                         Acknowledgments This work is partially supported by the
Truth (GT). Every system should return answer graphs that are as                    Computational Data Citation (CDC-STARS) project of the Univer-
close as possible to the GT. Every triple in the GT is considered                   sity of Padua.
as a relevant triple. tb-DCG is a new evaluation metric defined by
us which rewards the number of relevant triples in the answer, its                  REFERENCES
position in the ranking and the quantity of noise (non-relevant                      [1] H. Bast, B. Buchhold, and H. Haussmann. 2016. Semantic Search on Text and
triples) in the graph.                                                                   Knowledge Bases. Foundations and Trends in Information Retrieval 10, 2-3 (2016),
                                                                                         119–271.
   Only three systems, TSA+BM25, TSA+VDP, and SUMM are able                          [2] J. Coffman and A. C. Weaver. 2010. A framework for evaluating database keyword
to scale to the bigger databases. SLM is not able to scale since                         search strategies. In Proc. of the 19th ACM International Conference on Information
                                                                                         and knowledge management. ACM Press, 729–738.
it performs too many operations online, and thus the bigger the                      [3] J. Coffman and A. C. Weaver. 2014. An Empirical Performance Evaluation of
database the bigger the execution time. MRF-KS relies too much on                        Relational Keyword Search Systems. IEEE Transactions on Knowledge and Data
a Dijkstra-based exploration of the graph in its offline phase, which                    Engineering 26, 1 (2014), 30–42.
                                                                                     [4] D. Dosso and G. Silvello. 2019. A Scalable Virtual Document-Based Keyword
does not allow to scale to bigger dimensions                                             Search System for RDF Datasets. In Proceedings of the 42nd International ACM
   TSA+VDP is the top performing system when we work on the                              SIGIR conference on Research and Development in Information Retrieval, SIGIR.
real database LinkedMDB, while TSA+BM25 is the top one on LUBM                       [5] S. Elbassuoni and R. Blanco. 2011. Keyword Search over RDF Graphs. In Proc.
                                                                                         of the 20th ACM Conference on Information and Knowledge Management, CIKM
1M. TSA+BM25 obtains the lowest execution time thanks to its                             2011,. ACM Press, New York, USA, 237–242.
BM25-based strategy, while TSA+VDP performs a trade-off between                      [6] W. Le, F. Li, A. Kementsietsidis, and S. Duan. 2014. Scalable Keyword Search on
                                                                                         Large RDF Data. IEEE Trans. Knowl. Data Eng. 26, 11 (2014), 2774–2788.
time and effectiveness. Similarly, for LinkedMDB 7M TSA+VDP                          [7] Y. Mass and Y. Sagiv. 2016. Virtual Documents and Answer Priors in Keyword
obtains the highest result of tb-DCG among the three systems,                            Search over Data Graphs. In Proc. of the Workshops of the EDBT/ICDT 2016 Joint
while TSA+BM25 is the best one in LUBM 10M.                                              Conference (CEUR Workshop Proceedings), Vol. 1558. CEUR-WS.org.
                                                                                     [8] D. Metzler and W. B. Croft. 2005. A Markov random field model for term depen-
                                                                                         dencies. In SIGIR 2005: Proceedings of the 28th Annual International ACM SIGIR
3      FUTURE DIRECTIONS                                                                 Conference on Research and Development in Information Retrieval. ACM, 472–479.
Systems based on virtual documents are able to scale to bigger                       [9] S. Sahu, A. Mhedhbi, S. Salihoglu, J. Lin, and M. T. Özsu. 2017. The Ubiquity
                                                                                         of Large Graphs and Surprising Challenges of Graph Processing. PVLDB 11, 4
datasets thanks to the help of the virtual document nature of the                        (2017), 420–431.
subgraphs, that enables them to use adapted state of the art IR                     [10] H. Wang and C. C. Aggarwal. 2010. A Survey of Algorithms for Keyword Search
methods to improve efficiency.                                                           on Graph Data. In Managing and Mining Graph Data. Springer, 249–273.


                                                                               39