A Document-based RDF Keyword Search System: Query-by-Query Analysis Dennis Dosso1[0000−0001−7307−4607] and Gianmaria Silvello1[0000−0003−4970−4554] Department of Information Engineering, University of Padua {dosso, silvello}@dei.unipd.it Abstract. RDF datasets are today used more and more for a great variety of applications mainly due to their flexibility. However, accessing these data via the SPARQL query language can be cumbersome and frustrating for end-users accustomed to Web-based search engines. In this context, KS is becoming a key methodology to overcome access and search issues. In this paper, we further dig on our previous work on the state-of-the-art system for keyword search on RDF by giving more insights on the quality of answers produced and its behavior with different classes of queries. Keywords: RDF · Keyword Search · Virtual Documents 1 Introduction In the last decade, the Web of Data emerged as one of the principal means to access, share and re-use structured data on the Web [12]. RDF has become the de facto standard for the publication of information on the Web [14] because it eases the enrichment and discovery of data as well as their interoperability. In the past years, we have seen a significant increase in the number of knowl- edge bases published as large RDF datasets, such as DBpedia , Wikidata , and OpenPHACTS . The use of RDF is growing also for the representation and management of enterprise data with heterogeneous and very large graphs, often containing millions of edges [14]. RDF datasets are typically queried through the SPARQL structured lan- guage [13], which is often difficult to use for non-expert users due to its syntax and the required knowledge of the underlying structure of the database and the utilized IRIs. This language is not intuitive for end-users and it does not allow the natural expression of their information needs based on keywords [17]. Copyright c 2020 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0). This volume is published and copyrighted by its editors. SEBD 2020, June 21-24, 2020, Villasimius, Italy. https://wiki.dbpedia.org/ https://www.wikidata.org/wiki/Wikidata:Main Page https://www.openphacts.org/ D. Dosso and G. Silvello Keyword search is a simpler paradigm that enables users to express their information need through keywords [10]. Keyword-based methods have gained importance over time both in research and in industry as a means to facilitate access to structured data. The result of a keyword query is a ranking of potential answers, ordered depending on their relevance to the information need. In this paper, we introduce the TSA+BM25 and TSA+VDP keyword search systems, first published in [6] and the new evaluation methodology presented in [5]. The new contributions of this paper are: (i) a study of the quality of the answers produced by state-of-the-art systems in keyword search on RDF: SLM, MRF-KS, SUMM, TSA+BM25, and TSA+VDP; (ii) a study of the behavior of the two most performing systems, MRF-KS and TSA+VDP, on different classes of queries. Each query is composed of a CONSTRUCT SPARQL query and a corresponding “equivalent” keyword query. These classes differ both in terms of syntax (the SPARQL pattern used to write the query) and in their semantics (the information need expressed by the query). We analyze how the syntax and the semantic impact the performances of the system at hand. The paper is organized as follows: Section 2 summarizes the related works; Section 3 briefly describes our methods, the datasets, the evaluation metrics and the classes of queries we defined; in Section 4 we present the experiments; finally, Section 5 presents the conclusions. 2 Related Work In recent years keyword search received a lot of attention in the research commu- nity, both in the contest of Relational Databases (RDB) and Graph DB. Good reviews about this topic are [2], [18] (for a focus on RDB) and [16] (for a focus on graph data). Keyword Search systems may be categorized into three main classes: (i) schema-based (e.g. DISCOVER [1]), which uses the schema of the database to produce a structured query; (ii) graph-based (e.g. BANKS [3]), which explores the graph (the graph induced by the foreign keys, if we are in a relational database); (iii) document-based (first described in [10]), stemming from the graph-based, this class is based on the creation of textual documents, also called virtual doc- uments, from the words contained in the graphs. It also uses state-of-the-art IR methods for the indexing and ranking of the answers. Thanks to this strategy, the systems belonging to this class showed to scale to real-world scenarios. In this paper we compare our system TSA+VDP [6] to the following state- of-the-art keyword search systems: 1. Structured Language Model (SLM) [7] is among the few native RDF search systems. It starts the searching process from the triples that match at least one of the query keywords and then produces answers by seeking connected components in the graph. SLM is based on virtual documents and employs IR models (statistical language models) for retrieval. SLM returns a ranking of RDF subgraphs to the user and not virtual text documents as in [8, 15]. A Document-based RDF Keyword Search System: Query-by-Query Analysis 2. Markov Random Fields-Keyword Search (MRF-KS)[11] is based on a com- bination of graph and virtual-document-based approaches. This system has proven to be the most effective one on the shared benchmark in [4]. MRF-KS was originally thought to work on RDBs, even though in principle it can be used for any kind of graph data. We redesigned it to work on RDF and use it as a competing baseline. 3. SUMM [9] is a native RDF search system. The core idea behind this sys- tem is to produce a first partition of the whole graph through a BFS-based greedy algorithm. The partitions are edge-disjointed but are connected one to the other through nodes called portal nodes. Once given the keyword query, a backtracking algorithm similar to BANKS [3] is deployed to find connected subgraphs containing all the keywords. On these subgraphs, a different backtracking algorithm similar to BLINKS [8] is used to produce tree-shaped answer graphs whose leaves are nodes containing one keyword each. The systems use indexes and particular stopping conditions to speed- up the execution. 2.1 TSA+BM25 and TSA+VDP To overcome the limitations of the previous state-of-the-art systems, in [6] we proposed two new keyword search systems, based on the virtual document ap- proach, one the refinement of the other. These systems are composed of an off-line and an on-line phase: Off-line phase The off-line phase is common to both systems and is called Topological Semantical Aggregator (TSA), a virtual document-based ap- proach that produces documents by aggregating RDF triples containing close concepts. TSA mimics the “clustering hypothesis” defined in IR which dictates that triples characterized by similar concepts will cluster together. The algorithm takes an RDF graph as input and creates subgraphs with- out considering the user query. The idea is to build a subgraph around a single “topic” which characterizes the graph semantics. To do so, the algorithm starts randomly from a node that presents an out-degree higher of an empirically- defined threshold. From that node, the algorithm starts a Breadth-First Search- like exploration that visits only nodes that present in and out-degree between certain thresholds, limiting the exploration inside a radius τ . All these parame- ters are also empirically-defined. In this way, a subgraph is extracted from the bigger database. Once the exploration stops, the algorithm marks the utilized nodes as visited and proceeds randomly to the next unvisited node with high out-degree. Nodes may be visited more than once if they present a high in-degree, and in this case, they are shared by many graphs. The rationale is that these nodes may contain information belonging to more topics. The result of TSA is a collection of subgraphs covering the whole database. They are then converted to their corresponding virtual documents through the extraction and concatenation of the words contained in the IRIs and Literals in their nodes and edges. We then index these documents with Terrier. D. Dosso and G. Silvello On-line phase Given a keyword query, our first system, TSA+BM25, applies the ranking function BM25 to the collection of virtual documents, and obtains a ranking of graphs. This system then looks for overlapping graphs in the rank- ing produced. Graphs that present a percentage of shared triples beyond an empirically-found threshold are merged. BM25 is used a second time to produce a new ranking on these merged graphs, which is returned to the user. This sys- tem is rather quick, since it uses state-of-the-art implementations of BM25, but does not leverage the information contained in the topology of the graphs. Our second system is called TSA+VDP (VDP stands for Virtual Docu- ment Pruning). The system then considers the result of the set-based union of these ranking produced by TSA+BM25, which is called query graph E. This graph is not necessarily connected. TSA+VDP then starts a BFS from every subject node within E. This BFS is also limited to a radius τ 0 , and produces subgraphs. Of these subgraphs, only the ones containing all the keywords are kept, producing a new collection, called the best candidates collection. The rationale is that these selected graphs will contain, with high probability, triples that are relevant to the user. Since the BFS exploration could also have introduced many non-relevant triples, the next step is a pruning algorithm over these candidate graphs. The pruning proceeds inward, removing all the triples that do not contain at least one keyword, starting from the nodes with the highest distance from the source node. These pruned graphs are finally ranked using a Markov Random Field scoring function inspired by the one used by MRF-KS. The final ranking is the answer returned to the user. 3 Methods 3.1 Preliminaries While SPARQL enables the user to obtain an exact answer to his/her informa- tion need, it is not always the most viable approach [10]. SPARQL is a structured language with a complex syntax, which also requires complete knowledge of the graph being queried. A knowledge that the standard user (one who is not familiar with the concepts of RDF graphs) may not possess. Keyword Search is a different paradigm that can help this kind of user to retrieve her information without SPARQL. In this approach, queries are an array < w1 , · · · , wn > of keywords. The output is a list of answers, ordered depending on their relevance to the information need expressed by the query. This ordering is defined by a ranking function. It is important to note that while SPARQL returns an exact answer (even an empty one is that is the case), Keyword Search is a best effort approach. The nature of the answers provided by the keyword search system depends on the task. In this paper, we consider a user who has an information need that can be formulated with a SPARQL CONSTRUCT query, which he/she is not able to formulate. If that query existed, the answer graph can be considered as the A Document-based RDF Keyword Search System: Query-by-Query Analysis set of all and only the triples of the database that the user wants to see. We consider this set of triples as a ground truth (GT). The triples of the GT are called relevant triples. Our task is to produce a ranking that contains as many relevant triples as possible in answer graphs ranked as highly as possible. These graphs should also not contain too many non-relevant (noisy) triples. A good system, therefore, produces a list of (preferably not-overlapping) subgraphs extracted from the database such that the subgraphs at the top of the list contain many relevant triples. 3.2 Evaluation If a Keyword Search System returns a list of answers, we first of all need to understand when one of these graphs is relevant. We follow the same approach of [6], where a signal-to-noise ratio is defined for each answer graph. Definition 1. Given a topic tk ∈ T , a ranking Rk and the ground-truth graph GTtk , we define the Signal-to-Noise Ratio (SNR) of Gi ∈ Rk as |(Gi ∩ GTtk ) \ S| SN R(Gi ) = (1) |Gi | where S is the union set of all the relevant triples in Gj ∈ Rk , ∀j ∈ [1, i[ such that Gj is relevant. Given this definition, we say that a graph is relevant when its SN R is bigger or equal to a threshold value λ ∈ [0, 1]. Note how the SNR keeps into consid- eration the relevant triples already seen in the ranking, avoiding to count them twice. We can say that a quality answer is a graph that has a high SNR. We call λ the relevance parameter, which describes the quality required for a graph to be considered relevant. In this evaluation framework, the relevance is therefore not an absolute, fixed concept, but can be varied with λ. This pa- rameter models the preferences of a user. A user who does not want noise in its answers is modeled with a λ close to 1. A user that allows for imperfect answers is modeled with a parameter closer to 0. In what follows we study how different values of λ change the evaluation of the output of the considered systems. As evaluation measures we use recall and tb-DCG. Recall is the total number of relevant triples found in the ranking produced by a system – typically considering the first 1000 results – divided by |GT |. The recall enables to understand the extractive power of a system, the raw quantity of relevant triples extrapolated from the database. While useful, it does not convey any information about the disposition of the triples in the ranking and little information on the quantity of noise that surrounds them in the subgraphs where they are contained. tb-DCG [5] is a more sophisticated metric, a derivation of the classic NDCG, which takes into consideration aspects such as the relevance of the graphs (the number of triples of the GT contained in them), their position (relevant graphs D. Dosso and G. Silvello put early weight more) and their uniqueness (repeated relevant triples are not counted). 3.3 Experimental Setup We consider two databases: LinkedMDB and IMDB. LinkedMDB is a native RDF database of circa 7M triples. IMDB is a relational database that we con- verted into an RDF version of about 116M triples. For both databases, we created 100 topics, divided into 5 classes of 20 queries each. 50 were used for training the parameters of TSA+VDP, the other 50 for testing. Every topic is represented as a pair made of a CONSTRUCT SPARQL query and a keyword counterpart, both representing the same information need. The SPARQL query is used to extract the subgraph which works as GT for the keyword query. C1 C2 C3 C4 C5 LinkedMDB a a b b c c a a b b c c d d e e a b c IMDB d d d e e e Fig. 1: The five query syntactical classes defined for LinkedMDB and IMDB. Every class is syntactically different from the others, i.e. they differ in the shape of the pattern P , as can be seen in Figure 1. The figure presents the 10 patterns associated with each class ordered by increasing complexity. This complexity derives from the number of triples in the pattern and the shape of the connections created by these triples. Note that it is possible that two classes may differ only in the direction of one edge. Each one of these classes is further divided into two sub-classes, differing by their semantic. For example, class C1 of LinkedMDB is divided into C1.1, with queries asking for all the films of a certain director, and class C1.2, made of queries asking for all the films starring a certain actor. The databases considered here, LinkedMDB 1M and IMDB 1M (of 1M triples each), are built by randomly extracting connected subsets of triples from the original datasets. The triples of the ground truths produced by the SPARQL are added to these datasets. This was necessary since only SUMM, TSA+BM25 and TSA+VDP are able to scale to the whole dimensions and we needed to be able to compare all systems. A Document-based RDF Keyword Search System: Query-by-Query Analysis Table 1: Performances of the algorithms on the 1M databases. † indicates the top performing systems returned by the Tukey HSD statistical test, with α < 0.01. The best system is in bold. λ = 0.1 Dataset Systems tb-DCG recall time (sec) TSA+BM25 0.201±0.02 0.733±0.02† 39.64±01.90† TSA+VDP 0.490±0.04† 0.592±0.040† 318.78±21.60 LinkedMDB 1M SLM 0.011±0.00 0.076±0.01 39.90±08.69† MRF-KS 0.400±0.03† 0.462±0.03 285.22±30.10 SUMM 0.106±0.01 0.268±0.02 429.52±37.17 TSA+BM25 0.276±0.02 0.423±0.03 30.28±00.25† TSA+VDP 0.583±0.04† 0.623±0.04† 196.90±20.69 IMDB 1M SLM 0.011±0.00 0.057±0.00 15.92±06.57† MRF-KS 0.312±0.02 0.382±0.02 399.68±37.34 SUMM 0.126±0.01 0.254±0.02 510.26±27.02 4 Experimental Performances 4.1 Average Analysis Table 1 reports the average performances obtained by the five systems on the two databases, already presented in [6]. Considering tb-DCG, TSA+VDP is the best performing system, immediately followed by MRF-KS. The former also obtains good values of recall, only exceeded by TSA+BM25 on LinkedMDB 1M. The two fastest algorithms are SLM and TSA+BM25. The former is very quick to produce its answers but has low effectiveness. The latter is fast thanks to its reliance on state-of-the-art implementations of BM25. On this note, TSA+VDP presents higher execution time due to its heuristics which prune and re-rank the graphs, obtaining more relevant graphs that are better ranked. Analysis of the quality of the answers through the λ factor Note that in Table 1 we fixed λ to 0.1. This parameter was empirically chosen to show the differences in the effectiveness of the different systems. A higher λ could provoke all the values to be flattened to 0. One may ask now what happens if we change λ. A higher λ describes a user who tolerates less noise in her answer graphs. It asks for answers that are more “tailored” around the GT. Figure 2 presents the values of recall and tb-DCG of the five systems on the two datasets as λ varies. On the x-axis we have different values for λ, from 0 to 1, with step 0.1. Every point in the plots is obtained computing the averages on the 50 queries issued on each dataset. Let us start on LinkedMDB 1M. TSA+BM25 appears to be consistently the top-performing system in terms of recall at the varying of λ. Moreover, when λ = 0, that is when we do not count the noise in the answers, the values of average recall is around 0.9, illustrating how, on average, many relevant triples are re- trieved by the system in the top 1000 positions. Immediately below TSA+BM25, D. Dosso and G. Silvello LinkedMDB 1M IMDB 1M 1 1 SLM SLM SUMM SUMM TSA+BM25 TSA+BM25 0.8 TSA+VDP 0.8 TSA+VDP MRF-KS MRF-KS 0.6 0.6 Precision Precision Recall Recall 0.4 0.4 0.2 0.2 0 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 LinkedMDB 1M IMDB 1M 1 1 SLM SLM SUMM SUMM TSA+BM25 TSA+BM25 0.8 TSA+VDP 0.8 TSA+VDP MRF-KS MRF-KS 0.6 0.6 tb-DCG tb-DCG 0.4 0.4 0.2 0.2 0 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Fig. 2: Plot to confront average performances of recall and tb-DCG on the real databases LinkedMDB 1M and IMDB 1M for the different systems at the varying of λ. we find TSA+VDP. The heuristics used by TSA+VDP eliminate some relevant triples from the ranking, thus reducing the overall recall. For tb-DCG, it is evident how TSA+VDP sacrifices recall to obtain a better tb-DCG. In particular, TSA+VDP is consistently the top-performing system until we reach λ = 0.5. After this point, all the systems collapse on low values, due to the presence of few high-quality answers. For IMDB 1M and the recall value, the top-performing system is now TSA+VDP, with TSA+BM25 immediately behind. This means that in the case of IMDB, BM25 is less effective in extracting relevant triples from the whole graph. The pruning heuristics of TSA+VDP improve the quality of some answers, making them relevant and therefore counted in the computation of the average recall. We also note that the trend of the SUMM method is more stable than the other systems. In fact, the system manages to get high-quality answers that satisfy many λ values. Only after a value of λ = 0.5 there is a performance decline. Regarding tb-DCG, TSA+VDP remains the top system, with a sizeable dif- ference from the second-placed TSA+BM25 and MRF-KS, which appear to be quite similar in performances while λ increases. SUMM maintains its stability but is still below the other systems. This analysis clearly shows how bigger values of λ make the task harder. In comparison, however, the TSA-based systems are able to obtain consistently better results with respect to the state of the art systems. A Document-based RDF Keyword Search System: Query-by-Query Analysis LinkedMDB 1M TSA+VDP LinkedMDB LinkedMDB 1M TSA+VDP LinkedMDB 1M 1M MRFKSMRF-KS 1 1000 1 1000 c 1.1 c 1.2 c 2.1 c 2.2 c 3.1 c 3.2 c 4.1 c 4.2 c 5.1 c 5.2 c 1.1 c 1.2 c 2.1 c 2.2 c 3.1 c 3.2 c 4.1 c 4.2 c 5.1 c 5.2 0.9 900 0.9 900 0.8 800 0.8 800 0.7 700 0.7 700 Recall, λ = 0.1 Recall, λ = 0.1 Seconds Seconds 0.6 600 0.6 600 = 0.1 = 0.1 seconds seconds 0.5 500 0.5 500 Recall Recall 0.4 400 0.4 400 0.3 300 0.3 300 0.2 200 0.2 200 0.1 100 0.1 100 0 0 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 queries queries IMDB 1M TSA+VDP IMDBIMDB 1M MRF-KS IMDB 1MTSA+VDP 1MMRFKS 1 1000 1 1000 c 1.1 c 1.2 c 2.1 c 2.2 c 3.1 c 3.2 c 4.1 c 4.2 c 5.1 c 5.2 c 1.1 c 1.2 c 2.1 c 2.2 c 3.1 c 3.2 c 4.1 c 4.2 c 5.1 c 5.2 0.9 900 0.9 900 0.8 800 0.8 800 0.7 700 0.7 700 Recall, λ = 0.1 Recall, λ = 0.1 0.6 600 0.6 600 = 0.1 = 0.1 Seconds Seconds seconds seconds 0.5 500 0.5 500 Precision Precision 0.4 400 0.4 400 0.3 300 0.3 300 0.2 200 0.2 200 0.1 100 0.1 100 0 0 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 queries queries Fig. 3: Plots to compare performances of time and recall on LinkedMDB 1M and IMDB 1M for the two systems TSA+VDP and MRF-KS. The horizontal lines represent the averages for recall (blue) and time (red). A time limit of 1000s was considered for the execution of a query. 4.2 Query-by-query analysis Recall Figure 3 shows the behavior query-by-query in terms of recall and time for the two datasets and for the two top-performing systems TSA+VDP and MRF-KS with λ = 0.1. We grouped the queries by their classes and subclasses and highlighted them with different colors. Starting from LinkedMDB 1M, the queries of certain classes such as C1.1 obtain good results in both systems. Others, like class C5, appear hard for both systems. What is evident is that a simpler P does not necessarily imply high perfor- mances. In fact, a class such as C1.2 presents low results in both systems, while C4, which has a more complex P , has higher recall with TSA+VDP. It seems that the semantic of the queries plays a critical role in determining their diffi- culty. Subclasses such as C1.1 and C1.2, for example, show different behaviors, even if they share the same pattern structure. For IMDB 1M, we can make the same observations. It is even more evident in this instance how a simpler syntax does not correspond to higher recall. For TSA+VDP class C5 almost always presents a recall of 1, while many queries of class C1 present values close to 0. This may be explained considering the fact that a simpler pattern corresponds to a ground truth with fewer relevant triples, harder to extract from the database in an effective way. D. Dosso and G. Silvello LinkedMDB 1M TSA+VDP LinkedMDB 1M MRFKS 1 1000 1 1000 c 1.1 c 1.2 c 2.1 c 2.2 c 3.1 c 3.2 c 4.1 c 4.2 c 5.1 c 5.2 c 1.1 c 1.2 c 2.1 c 2.2 c 3.1 c 3.2 c 4.1 c 4.2 c 5.1 c 5.2 0.9 900 0.9 900 0.8 800 0.8 800 0.7 700 = 0.1 0.7 700 = 0.1 seconds 0.6 600 seconds 0.6 600 0.5 500 0.5 500 tb-DCG tb-DCG 0.4 400 0.4 400 0.3 300 0.3 300 0.2 200 0.2 200 0.1 100 0.1 100 0 0 0 0 1 2 3 4 5 6 7 8 9 1011121314151617181920212223242526272829303132333435363738394041424344454647484950 1 2 3 4 5 6 7 8 9 1011121314151617181920212223242526272829303132333435363738394041424344454647484950 queries queries IMDB 1M TSA+VDP IMDB 1M MRFKS 1 1000 1 1000 c 1.1 c 1.2 c 2.1 c 2.2 c 3.1 c 3.2 c 4.1 c 4.2 c 5.1 c 5.2 c 1.1 c 1.2 c 2.1 c 2.2 c 3.1 c 3.2 c 4.1 c 4.2 c 5.1 c 5.2 0.9 900 0.9 900 0.8 800 0.8 800 0.7 700 0.7 700 = 0.1 = 0.1 seconds seconds 0.6 600 0.6 600 0.5 500 0.5 500 tb-DCG tb-DCG 0.4 400 0.4 400 0.3 300 0.3 300 0.2 200 0.2 200 0.1 100 0.1 100 0 0 0 0 1 2 3 4 5 6 7 8 9 1011121314151617181920212223242526272829303132333435363738394041424344454647484950 1 2 3 4 5 6 7 8 9 1011121314151617181920212223242526272829303132333435363738394041424344454647484950 queries queries Fig. 4: Plots to compare performances of time and rb-DCG on LinkedMDB 1M and IMDB 1M for the two systems TSA+VDP and MRF-KS. The horizontal blue line represents the average precision, the red one the time. tb-DCG Figure 4 presents similar graphics but for the tb-DCG metric. The high variability in the behavior of the queries is immediately evident from the different performances obtained by the systems. Let us start with LinkedMDB 1M. Certain classes, such as C4, appear to be relatively easy for TSA+VDP. On the other hand, they are harder for MRF-KS. If we consider that C1.2, instead, is hard for both systems, we see how it is not necessarily true that a simpler syntax (i.e. a simpler pattern P ) corresponds to an easier query to answer. On the other hand, a class such as C2 presents subclass C2.1 which appears hard for both systems, while C2.2 is easier. This suggests again how the same syntactic class can become harder or easier depending on its semantic nature, independently from the system. For IMDB 1M, we see how similar observations can be made. Class C5 appears simpler for TSA+VDP and harder for MRF-KS. More in general, it appears that MRF-KS struggles more in IMDB to answer the queries: its average tb-DCG is much lower than the one of TSA+VDP. This shows how it may be possible that the nature of a database (the number and structure of connections among nodes and the quantity and disposition of keywords among them) may influence the effectiveness of a system. A Document-based RDF Keyword Search System: Query-by-Query Analysis Time The execution times are presented in both Figure 3 and 4. Note how we used a time limit of 1000s. Once again, queries in the same class may present different behaviors in terms of execution time. For example, MRF-KS presents relatively low values of time for subclass C5.1 in IMDB, while subclass C5.2 presents much higher times on average. This new evidence of the fact that the syntactical factor is not the only one that plays a role in the determination of the execution time. The choice of keywords (more or less frequent inside the database) appears to also play a role. Queries with more frequent keywords will, in fact, demand much more execution time. 5 Conclusions and Future Works In this paper, we presented an in-depth analysis of the behavior of state-of-the- art systems with a particular focus on the quality of required answers and the behavior presented with different kinds of queries. We used a sub-set of LinkedMDB and IMDB as databases. We defined five different classes of queries for each dataset, which differ from one another on their syntax and semantic, for a total of 100 test queries. We used recall and tb-DCG as evaluation metrics. The first one describes the simple raw power of extraction of relevant triples of a system. The latter is a more sophisticated measure that takes into consideration the relevance of the answers, their position in the ranking and the quantity of noise in them. We considered five different state-of-the-art systems: SLM, MRF-KS, SUMM, TSA+BM25, and TSA+VDP. We performed two different classes of experiments. In the first one, we var- ied the minimal quantity of SNR required from an answer to be considered as relevant. We showed that the higher the required threshold, the lower the per- formances of the systems. TSA+BM25 and TSA+VDP consistently appeared to obtain the higher results, although we registered a progressive detriment of the performances. The second class of experiments showed that certain classes of queries seem to be simpler than others for both systems. It resulted that not necessarily a simpler pattern P implies better results. The complexity of the query is therefore not present only in its syntax, but also in its semantic and in the keywords of choice. To corroborate this intuition, it appeared that queries belonging to the same syntactic class with different semantics presented completely different performances. For our future works, we will keep researching, also on different databases, how different kinds of queries correspond to different behaviors, and how poor results may be mitigated with techniques as query rewriting. References 1. Balmin, A., Hristidis, V., Koudas, N., Papakonstantinou, Y., Srivastava, D., Wang, T.: A system for keyword proximity search on XML databases. In: Proceedings of D. Dosso and G. Silvello 29th International Conference on Very Large Data Bases, VLDB. pp. 1069–1072. Morgan Kaufmann (2003) 2. Bast, H., Buchhold, B., Haussmann, H.: Semantic search on text and knowledge bases. Foundations and Trends in Information Retrieval 10(2-3), 119–271 (2016) 3. Bhalotia, G., Hulgeri, A., Nakhe, C., Chakrabarti, S., Sudarshan, S.: Keyword searching and browsing in databases using BANKS. In: Proceedings of the 18th International Conference on Data Engineering. pp. 431–440. IEEE Computer So- ciety (2002) 4. Coffman, J., Weaver, A.C.: A framework for evaluating database keyword search strategies. In: Proc. of the 19th ACM International Conference on Information and knowledge management. pp. 729–738. ACM Press (2010) 5. Dosso, D., Silvello, G.: A scalable virtual document-based keyword search system for rdf datasets. In: Proceedings of the 42nd International ACM SIGIR conference on Research and Development in Information Retrieval, SIGIR (2019) 6. Dosso, D., Silvello, G.: Search text to retrieve graphs: A scalable RDF keyword- based search system. IEEE Access 8, 14089–14111 (2020) 7. Elbassuoni, S., Blanco, R.: Keyword Search over RDF Graphs. In: Proc. of the 20th ACM Conference on Information and Knowledge Management, CIKM 2011,. pp. 237–242. ACM Press, New York, USA (2011) 8. He, H., Wang, H., Yang, J., Yu, P.S.: BLINKS: ranked keyword searches on graphs. In: Proceedings of the ACM SIGMOD International Conference on Management of Data. pp. 305–316. ACM (2007) 9. Le, W., Li, F., Kementsietsidis, A., Duan, S.: Scalable keyword search on large RDF data. IEEE Trans. Knowl. Data Eng. (11), 2774–2788. https://doi.org/10.1109/TKDE.2014.2302294 10. Lopez-Veyna, J.I., Sosa-Sosa, V.J., López-Arévalo, I.: A virtual document approach for keyword search in databases. In: DATA. pp. 39–48. SciTePress (2012) 11. Mass, Y., Sagiv, Y.: Virtual Documents and Answer Priors in Keyword Search over Data Graphs. In: Proc. of the Workshops of the EDBT/ICDT 2016 Joint Conference. CEUR Workshop Proceedings, vol. 1558. CEUR-WS.org (2016) 12. Pound, J., Mika, P., Zaragoza, H.: Ad-hoc object retrieval in the web of data. In: Proc. of the 19th International Conference on World Wide Web, WWW 2010. pp. 771–780. ACM Press, New York, USA (2010) 13. Prud, E., Seaborne, A., et al.: Sparql query language for rdf (2006) 14. Sahu, S., Mhedhbi, A., Salihoglu, S., Lin, J., Özsu, M.T.: The ubiquity of large graphs and surprising challenges of graph processing. PVLDB 11(4), 420–431 (2017) 15. Su, Q., Widom, J.: Indexing Relational Database Content Offline for Efficient Keyword-Based Search. In: Proc. of the 9th International Database Engineering and Applications Symposium (IDEAS 2005). pp. 297–306. IEEE Computer Society (2005) 16. Wang, H., Aggarwal, C.C.: A survey of algorithms for keyword search on graph data. In: Managing and Mining Graph Data, pp. 249–273. Springer (2010) 17. Wu, W.: Proactive Natural Language Search Engine: Tapping into Structured Data on the Web. In: Proc. of the Joint 2013 EDBT/ICDT Conferences. pp. 143–148. ACM Press, New York, USA (2013) 18. Yu, J.X., Qin, L., Chang, L.: Keyword Search in Relational Databases: A Survey. IEEE Data Eng. Bull. 33(1), 67–78 (2010)