Introduction

A Document-based RDF Keyword Search System: Query-by-Query Analysis

0 Department of Information Engineering, University of Padua

RDF datasets are today used more and more for a great variety of applications mainly due to their exibility. However, accessing these data via the SPARQL query language can be cumbersome and frustrating for end-users accustomed to Web-based search engines. In this context, KS is becoming a key methodology to overcome access and search issues. In this paper, we further dig on our previous work on the state-of-the-art system for keyword search on RDF by giving more insights on the quality of answers produced and its behavior with di erent classes of queries.

RDF Keyword Search Virtual Documents

Introduction

In the last decade, the Web of Data emerged as one of the principal means to access, share and re-use structured data on the Web [ 12 ]. RDF has become the de facto standard for the publication of information on the Web [ 14 ] because it eases the enrichment and discovery of data as well as their interoperability.

In the past years, we have seen a signi cant increase in the number of knowledge bases published as large RDF datasets, such as DBpedia , Wikidata , and OpenPHACTS . The use of RDF is growing also for the representation and management of enterprise data with heterogeneous and very large graphs, often containing millions of edges [ 14 ].

RDF datasets are typically queried through the SPARQL structured language [ 13 ], which is often di cult to use for non-expert users due to its syntax and the required knowledge of the underlying structure of the database and the utilized IRIs. This language is not intuitive for end-users and it does not allow the natural expression of their information needs based on keywords [ 17 ]. 2

Related Work

In recent years keyword search received a lot of attention in the research community, both in the contest of Relational Databases (RDB) and Graph DB. Good reviews about this topic are [ 2 ], [ 18 ] (for a focus on RDB) and [ 16 ] (for a focus on graph data).

Keyword Search systems may be categorized into three main classes: (i) schema-based (e.g. DISCOVER [ 1 ]), which uses the schema of the database to produce a structured query; (ii) graph-based (e.g. BANKS [ 3 ]), which explores the graph (the graph induced by the foreign keys, if we are in a relational database); (iii) document-based ( rst described in [ 10 ]), stemming from the graph-based, this class is based on the creation of textual documents, also called virtual documents, from the words contained in the graphs. It also uses state-of-the-art IR methods for the indexing and ranking of the answers. Thanks to this strategy, the systems belonging to this class showed to scale to real-world scenarios.

In this paper we compare our system TSA+VDP [ 6 ] to the following stateof-the-art keyword search systems: 1. Structured Language Model (SLM) [ 7 ] is among the few native RDF search systems. It starts the searching process from the triples that match at least one of the query keywords and then produces answers by seeking connected components in the graph. SLM is based on virtual documents and employs IR models (statistical language models) for retrieval. SLM returns a ranking of RDF subgraphs to the user and not virtual text documents as in [ 8, 15 ]. 2. Markov Random Fields-Keyword Search (MRF-KS)[ 11 ] is based on a combination of graph and virtual-document-based approaches. This system has proven to be the most e ective one on the shared benchmark in [ 4 ]. MRF-KS was originally thought to work on RDBs, even though in principle it can be used for any kind of graph data. We redesigned it to work on RDF and use it as a competing baseline. 3. SUMM [ 9 ] is a native RDF search system. The core idea behind this system is to produce a rst partition of the whole graph through a BFS-based greedy algorithm. The partitions are edge-disjointed but are connected one to the other through nodes called portal nodes. Once given the keyword query, a backtracking algorithm similar to BANKS [ 3 ] is deployed to nd connected subgraphs containing all the keywords. On these subgraphs, a di erent backtracking algorithm similar to BLINKS [ 8 ] is used to produce tree-shaped answer graphs whose leaves are nodes containing one keyword each. The systems use indexes and particular stopping conditions to speedup the execution. 2.1

TSA+BM25 and TSA+VDP To overcome the limitations of the previous state-of-the-art systems, in [ 6 ] we proposed two new keyword search systems, based on the virtual document approach, one the re nement of the other. These systems are composed of an o -line and an on-line phase: O -line phase The o -line phase is common to both systems and is called

Topological Semantical Aggregator (TSA), a virtual document-based ap

proach that produces documents by aggregating RDF triples containing close concepts. TSA mimics the \clustering hypothesis" de ned in IR which dictates that triples characterized by similar concepts will cluster together.

The algorithm takes an RDF graph as input and creates subgraphs without considering the user query. The idea is to build a subgraph around a single \topic" which characterizes the graph semantics. To do so, the algorithm starts randomly from a node that presents an out-degree higher of an empiricallyde ned threshold. From that node, the algorithm starts a Breadth-First Searchlike exploration that visits only nodes that present in and out-degree between certain thresholds, limiting the exploration inside a radius . All these parameters are also empirically-de ned. In this way, a subgraph is extracted from the bigger database. Once the exploration stops, the algorithm marks the utilized nodes as visited and proceeds randomly to the next unvisited node with high out-degree. Nodes may be visited more than once if they present a high in-degree, and in this case, they are shared by many graphs. The rationale is that these nodes may contain information belonging to more topics.

The result of TSA is a collection of subgraphs covering the whole database. They are then converted to their corresponding virtual documents through the extraction and concatenation of the words contained in the IRIs and Literals in their nodes and edges. We then index these documents with Terrier. On-line phase Given a keyword query, our rst system, TSA+BM25, applies the ranking function BM25 to the collection of virtual documents, and obtains a ranking of graphs. This system then looks for overlapping graphs in the ranking produced. Graphs that present a percentage of shared triples beyond an empirically-found threshold are merged. BM25 is used a second time to produce a new ranking on these merged graphs, which is returned to the user. This system is rather quick, since it uses state-of-the-art implementations of BM25, but does not leverage the information contained in the topology of the graphs.

Our second system is called TSA+VDP (VDP stands for Virtual Document Pruning). The system then considers the result of the set-based union of these ranking produced by TSA+BM25, which is called query graph E. This graph is not necessarily connected.

TSA+VDP then starts a BFS from every subject node within E. This BFS is also limited to a radius 0, and produces subgraphs. Of these subgraphs, only the ones containing all the keywords are kept, producing a new collection, called the best candidates collection. The rationale is that these selected graphs will contain, with high probability, triples that are relevant to the user. Since the BFS exploration could also have introduced many non-relevant triples, the next step is a pruning algorithm over these candidate graphs. The pruning proceeds inward, removing all the triples that do not contain at least one keyword, starting from the nodes with the highest distance from the source node.

These pruned graphs are nally ranked using a Markov Random Field scoring function inspired by the one used by MRF-KS. The nal ranking is the answer returned to the user. 3 3.1

Methods Preliminaries

While SPARQL enables the user to obtain an exact answer to his/her information need, it is not always the most viable approach [ 10 ]. SPARQL is a structured language with a complex syntax, which also requires complete knowledge of the graph being queried. A knowledge that the standard user (one who is not familiar with the concepts of RDF graphs) may not possess.

Keyword Search is a di erent paradigm that can help this kind of user to retrieve her information without SPARQL. In this approach, queries are an array < w1; ; wn > of keywords. The output is a list of answers, ordered depending on their relevance to the information need expressed by the query. This ordering is de ned by a ranking function. It is important to note that while SPARQL returns an exact answer (even an empty one is that is the case), Keyword Search is a best e ort approach.

The nature of the answers provided by the keyword search system depends on the task. In this paper, we consider a user who has an information need that can be formulated with a SPARQL CONSTRUCT query, which he/she is not able to formulate. If that query existed, the answer graph can be considered as the set of all and only the triples of the database that the user wants to see. We consider this set of triples as a ground truth (GT). The triples of the GT are called relevant triples.

Our task is to produce a ranking that contains as many relevant triples as possible in answer graphs ranked as highly as possible. These graphs should also not contain too many non-relevant (noisy) triples. A good system, therefore, produces a list of (preferably not-overlapping) subgraphs extracted from the database such that the subgraphs at the top of the list contain many relevant triples. 3.2

Evaluation

If a Keyword Search System returns a list of answers, we rst of all need to understand when one of these graphs is relevant. We follow the same approach of [ 6 ], where a signal-to-noise ratio is de ned for each answer graph. De nition 1. Given a topic tk 2 T , a ranking Rk and the ground-truth graph GTtk , we de ne the Signal-to-Noise Ratio (SNR) of Gi 2 Rk as SN R(Gi) = j(Gi \ GTtk ) n Sj jGij (1) where S is the union set of all the relevant triples in Gj 2 Rk; 8j 2 [1; i[ such that Gj is relevant.

Given this de nition, we say that a graph is relevant when its SN R is bigger or equal to a threshold value 2 [0; 1]. Note how the SNR keeps into consideration the relevant triples already seen in the ranking, avoiding to count them twice. We can say that a quality answer is a graph that has a high SNR.

We call the relevance parameter, which describes the quality required for a graph to be considered relevant. In this evaluation framework, the relevance is therefore not an absolute, xed concept, but can be varied with . This parameter models the preferences of a user. A user who does not want noise in its answers is modeled with a close to 1. A user that allows for imperfect answers is modeled with a parameter closer to 0. In what follows we study how di erent values of change the evaluation of the output of the considered systems.

As evaluation measures we use recall and tb-DCG.

Recall is the total number of relevant triples found in the ranking produced by a system { typically considering the rst 1000 results { divided by jGT j. The recall enables to understand the extractive power of a system, the raw quantity of relevant triples extrapolated from the database. While useful, it does not convey any information about the disposition of the triples in the ranking and little information on the quantity of noise that surrounds them in the subgraphs where they are contained.

tb-DCG [ 5 ] is a more sophisticated metric, a derivation of the classic NDCG, which takes into consideration aspects such as the relevance of the graphs (the number of triples of the GT contained in them), their position (relevant graphs put early weight more) and their uniqueness (repeated relevant triples are not counted). 3.3

Experimental Setup

We consider two databases: LinkedMDB and IMDB. LinkedMDB is a native RDF database of circa 7M triples. IMDB is a relational database that we converted into an RDF version of about 116M triples.

For both databases, we created 100 topics, divided into 5 classes of 20 queries each. 50 were used for training the parameters of TSA+VDP, the other 50 for testing. Every topic is represented as a pair made of a CONSTRUCT SPARQL query and a keyword counterpart, both representing the same information need. The SPARQL query is used to extract the subgraph which works as GT for the keyword query.

B D M d e k n i L B D M I

C1 a a a

C4 a a b b

c c b b b c c c d d

e e d d d

e e e

Every class is syntactically di erent from the others, i.e. they di er in the shape of the pattern P , as can be seen in Figure 1. The gure presents the 10 patterns associated with each class ordered by increasing complexity. This complexity derives from the number of triples in the pattern and the shape of the connections created by these triples. Note that it is possible that two classes may di er only in the direction of one edge.

Each one of these classes is further divided into two sub-classes, di ering by their semantic. For example, class C1 of LinkedMDB is divided into C1.1, with queries asking for all the lms of a certain director, and class C1.2, made of queries asking for all the lms starring a certain actor.

The databases considered here, LinkedMDB 1M and IMDB 1M (of 1M triples each), are built by randomly extracting connected subsets of triples from the original datasets. The triples of the ground truths produced by the SPARQL are added to these datasets. This was necessary since only SUMM, TSA+BM25 and TSA+VDP are able to scale to the whole dimensions and we needed to be able to compare all systems.

Experimental Performances Average Analysis

Analysis of the quality of the answers through the factor Note that in Table 1 we xed to 0:1. This parameter was empirically chosen to show the di erences in the e ectiveness of the di erent systems. A higher could provoke all the values to be attened to 0.

One may ask now what happens if we change . A higher describes a user who tolerates less noise in her answer graphs. It asks for answers that are more \tailored" around the GT. Figure 2 presents the values of recall and tb-DCG of the ve systems on the two datasets as varies. On the x-axis we have di erent values for , from 0 to 1, with step 0:1. Every point in the plots is obtained computing the averages on the 50 queries issued on each dataset.

Let us start on LinkedMDB 1M. TSA+BM25 appears to be consistently the top-performing system in terms of recall at the varying of . Moreover, when = 0, that is when we do not count the noise in the answers, the values of average recall is around 0:9, illustrating how, on average, many relevant triples are retrieved by the system in the top 1000 positions. Immediately below TSA+BM25,

LinkedMDB 1M IMDB 1M

SLM SUMM TSA+BM25 TSA+VDP MRF-KS SLM SUMM TSA+BM25 TSA+VDP MRF-KS 1 0.8 n0l.6 iiseolca c e rR P0.4 0.2 1 0.8 0.6 G C D b t0.4 we nd TSA+VDP. The heuristics used by TSA+VDP eliminate some relevant triples from the ranking, thus reducing the overall recall.

For tb-DCG, it is evident how TSA+VDP sacri ces recall to obtain a better tb-DCG. In particular, TSA+VDP is consistently the top-performing system until we reach = 0:5. After this point, all the systems collapse on low values, due to the presence of few high-quality answers.

For IMDB 1M and the recall value, the top-performing system is now TSA+VDP, with TSA+BM25 immediately behind. This means that in the case of IMDB, BM25 is less e ective in extracting relevant triples from the whole graph. The pruning heuristics of TSA+VDP improve the quality of some answers, making them relevant and therefore counted in the computation of the average recall. We also note that the trend of the SUMM method is more stable than the other systems. In fact, the system manages to get high-quality answers that satisfy many values. Only after a value of = 0:5 there is a performance decline.

Regarding tb-DCG, TSA+VDP remains the top system, with a sizeable difference from the second-placed TSA+BM25 and MRF-KS, which appear to be quite similar in performances while increases. SUMM maintains its stability but is still below the other systems.

This analysis clearly shows how bigger values of make the task harder. In comparison, however, the TSA-based systems are able to obtain consistently better results with respect to the state of the art systems. Recall Figure 3 shows the behavior query-by-query in terms of recall and time for the two datasets and for the two top-performing systems TSA+VDP and MRF-KS with = 0:1. We grouped the queries by their classes and subclasses and highlighted them with di erent colors.

Starting from LinkedMDB 1M, the queries of certain classes such as C1.1 obtain good results in both systems. Others, like class C5, appear hard for both systems.

What is evident is that a simpler P does not necessarily imply high performances. In fact, a class such as C1.2 presents low results in both systems, while C4, which has a more complex P , has higher recall with TSA+VDP. It seems that the semantic of the queries plays a critical role in determining their di culty. Subclasses such as C1.1 and C1.2, for example, show di erent behaviors, even if they share the same pattern structure.

For IMDB 1M, we can make the same observations. It is even more evident in this instance how a simpler syntax does not correspond to higher recall. For TSA+VDP class C5 almost always presents a recall of 1, while many queries of class C1 present values close to 0. This may be explained considering the fact that a simpler pattern corresponds to a ground truth with fewer relevant triples, harder to extract from the database in an e ective way. tb-DCG Figure 4 presents similar graphics but for the tb-DCG metric.

The high variability in the behavior of the queries is immediately evident from the di erent performances obtained by the systems. Let us start with LinkedMDB 1M. Certain classes, such as C4, appear to be relatively easy for TSA+VDP. On the other hand, they are harder for MRF-KS. If we consider that C1.2, instead, is hard for both systems, we see how it is not necessarily true that a simpler syntax (i.e. a simpler pattern P ) corresponds to an easier query to answer.

On the other hand, a class such as C2 presents subclass C2.1 which appears hard for both systems, while C2.2 is easier. This suggests again how the same syntactic class can become harder or easier depending on its semantic nature, independently from the system.

For IMDB 1M, we see how similar observations can be made. Class C5 appears simpler for TSA+VDP and harder for MRF-KS. More in general, it appears that MRF-KS struggles more in IMDB to answer the queries: its average tb-DCG is much lower than the one of TSA+VDP. This shows how it may be possible that the nature of a database (the number and structure of connections among nodes and the quantity and disposition of keywords among them) may in uence the e ectiveness of a system. Time The execution times are presented in both Figure 3 and 4. Note how we used a time limit of 1000s.

Once again, queries in the same class may present di erent behaviors in terms of execution time. For example, MRF-KS presents relatively low values of time for subclass C5.1 in IMDB, while subclass C5.2 presents much higher times on average. This new evidence of the fact that the syntactical factor is not the only one that plays a role in the determination of the execution time. The choice of keywords (more or less frequent inside the database) appears to also play a role. Queries with more frequent keywords will, in fact, demand much more execution time. 5

Conclusions and Future Works

In this paper, we presented an in-depth analysis of the behavior of state-of-theart systems with a particular focus on the quality of required answers and the behavior presented with di erent kinds of queries.

We used a sub-set of LinkedMDB and IMDB as databases. We de ned ve di erent classes of queries for each dataset, which di er from one another on their syntax and semantic, for a total of 100 test queries. We used recall and tb-DCG as evaluation metrics. The rst one describes the simple raw power of extraction of relevant triples of a system. The latter is a more sophisticated measure that takes into consideration the relevance of the answers, their position in the ranking and the quantity of noise in them. We considered ve di erent state-of-the-art systems: SLM, MRF-KS, SUMM, TSA+BM25, and TSA+VDP.

We performed two di erent classes of experiments. In the rst one, we varied the minimal quantity of SNR required from an answer to be considered as relevant. We showed that the higher the required threshold, the lower the performances of the systems. TSA+BM25 and TSA+VDP consistently appeared to obtain the higher results, although we registered a progressive detriment of the performances.

The second class of experiments showed that certain classes of queries seem to be simpler than others for both systems. It resulted that not necessarily a simpler pattern P implies better results. The complexity of the query is therefore not present only in its syntax, but also in its semantic and in the keywords of choice. To corroborate this intuition, it appeared that queries belonging to the same syntactic class with di erent semantics presented completely di erent performances.

For our future works, we will keep researching, also on di erent databases, how di erent kinds of queries correspond to di erent behaviors, and how poor results may be mitigated with techniques as query rewriting.

1. Balmin , A. , Hristidis , V. , Koudas , N. , Papakonstantinou , Y. , Srivastava , D. , Wang , T. : A system for keyword proximity search on XML databases . In: Proceedings of 29th International Conference on Very Large Data Bases, VLDB . pp. 1069 { 1072 . Morgan Kaufmann ( 2003 )

2. Bast , H. , Buchhold , B. , Haussmann , H.: Semantic search on text and knowledge bases . Foundations and Trends in Information Retrieval 10 ( 2-3 ), 119 { 271 ( 2016 )

3. Bhalotia , G. , Hulgeri , A. , Nakhe , C. , Chakrabarti , S. , Sudarshan , S. : Keyword searching and browsing in databases using BANKS . In: Proceedings of the 18th International Conference on Data Engineering . pp. 431 { 440 . IEEE Computer Society ( 2002 )

4. Co

man

, J., Weaver , A.C. : A framework for evaluating database keyword search strategies . In: Proc. of the 19th ACM International Conference on Information and knowledge management . pp. 729 { 738 . ACM Press ( 2010 )

5. Dosso , D. , Silvello , G.: A scalable virtual document-based keyword search system for rdf datasets . In: Proceedings of the 42nd International ACM SIGIR conference on Research and Development in Information Retrieval , SIGIR ( 2019 )

6. Dosso , D. , Silvello , G. : Search text to retrieve graphs: A scalable RDF keywordbased search system . IEEE Access 8 , 14089 { 14111 ( 2020 )

7. Elbassuoni , S. , Blanco , R.: Keyword Search over RDF Graphs . In: Proc. of the 20th ACM Conference on Information and Knowledge Management , CIKM 2011 ,. pp. 237 { 242 . ACM Press, New York, USA ( 2011 )

8. He , H. , Wang , H. , Yang , J. , Yu , P.S.: BLINKS: ranked keyword searches on graphs . In: Proceedings of the ACM SIGMOD International Conference on Management of Data . pp. 305 { 316 . ACM ( 2007 )

9. Le , W. , Li , F. , Kementsietsidis , A. , Duan , S. : Scalable keyword search on large RDF data . IEEE Trans. Knowl. Data Eng. (11) , 2774 { 2788 . https://doi.org/10.1109/TKDE. 2014 .2302294

10. Lopez-Veyna , J.I. , Sosa-Sosa , V.J. , Lopez-Arevalo , I.: A virtual document approach for keyword search in databases . In: DATA . pp. 39 { 48 . SciTePress ( 2012 )

11. Mass, Y. , Sagiv , Y. : Virtual Documents and Answer Priors in Keyword Search over Data Graphs . In: Proc. of the Workshops of the EDBT/ICDT 2016 Joint Conference. CEUR Workshop Proceedings , vol. 1558 . CEUR-WS.org ( 2016 )

12. Pound , J. , Mika , P. , Zaragoza , H.: Ad-hoc object retrieval in the web of data . In: Proc. of the 19th International Conference on World Wide Web , WWW 2010 . pp. 771 { 780 . ACM Press, New York, USA ( 2010 )

13. Prud , E. , Seaborne , A. , et al.: Sparql query language for rdf ( 2006 )

14. Sahu , S. , Mhedhbi , A. , Salihoglu , S. , Lin , J. , Ozsu, M.T.: The ubiquity of large graphs and surprising challenges of graph processing . PVLDB 11 ( 4 ), 420 { 431 ( 2017 )

15. Su , Q. , Widom , J.: Indexing Relational Database Content O ine for E cient Keyword-Based Search . In: Proc. of the 9th International Database Engineering and Applications Symposium (IDEAS 2005 ). pp. 297 { 306 . IEEE Computer Society ( 2005 )

16. Wang , H. , Aggarwal , C.C. : A survey of algorithms for keyword search on graph data . In: Managing and Mining Graph Data , pp. 249 { 273 . Springer ( 2010 )

17. Wu , W. : Proactive Natural Language Search Engine: Tapping into Structured Data on the Web . In: Proc. of the Joint 2013 EDBT/ICDT Conferences . pp. 143 { 148 . ACM Press, New York, USA ( 2013 )

18. Yu , J.X. , Qin , L. , Chang , L. : Keyword Search in Relational Databases: A Survey. IEEE Data Eng . Bull . 33 ( 1 ), 67 { 78 ( 2010 )