Virtual Documents and Answer Priors in Keyword Search over Data Graphs∗

Yosi Mass†‡ and Yehoshua Sagiv‡
† IBM Haifa Research Lab, Haifa 31905, Israel, yosimass@il.ibm.com
‡ The Hebrew University, Jerusalem 91904, Israel, sagiv@cs.huji.ac.il

∗ This work was supported by the Israel Science Foundation (Grant No. 1632/12).

© 2016, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2016 Joint Conference (March 15, 2016, Bordeaux, France) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

ABSTRACT

In keyword search over data graphs, an answer is a non-redundant subtree that contains the keywords of the query. Ranking of answers should take into account both their textual relevance and the significance of their semantic structure. A novel method for answer priors is developed and used in conjunction with query-dependent features. Since the space of all possible answers is huge, efficiency is also a major problem. A new algorithm that drastically cuts down the search space is presented. It generates candidate answers by first selecting the top-n roots and the top-n nodes for each query keyword. The selection is by means of a novel concept of virtual documents with weighted term frequencies. Markov random field models are used for ranking the virtual documents and then the generated answers. The proposed approach outperforms existing systems on a standard evaluation framework.

Keywords

Data graph, keyword search, query, ranking, answer prior, virtual documents

1. INTRODUCTION

Data graphs are a convenient, flexible way of representing knowledge bases. They can be constructed from a variety of formats (e.g., XML, RDB and RDF). Their semistructured nature makes it possible to create them incrementally, distributively and heterogeneously. Since they do not have a rigid schema, it is essential to support keyword search over them (rather than querying in some formal language, such as XQuery or SPARQL). Yet, it is still possible to get succinct answers that have some semantic structure. In particular, answers can show semantic connections between distinct units (e.g., Web pages), which is impossible in ordinary keyword search, where a result is always a single unit (e.g., a document).

The nodes of a data graph represent entities and relationships, while the connections among them are introduced as edges. Free text can be associated with nodes and edges. In keyword search over data graphs, answers are non-redundant subtrees that contain all the keywords of the query.

Keyword search over data graphs has been investigated extensively in recent years (cf. [5]). It involves two main challenges: effectiveness and efficiency. The first means that we have to develop effective methods for ranking answers (i.e., subtrees) that take into account the structure (e.g., the importance of entities and the strength of relationships among them) as well as the relevance of the keywords of the query. The second challenge is to generate candidate (i.e., potentially relevant) answers efficiently. This is not an easy problem, because there could be a huge number of subtrees that contain all the keywords of the query.

A common approach begins by assigning weights to the nodes and edges of the data graph. Then, the search for answers starts from nodes containing keywords of the query. The weights are intended to focus the search on paths that lead to roots of the most relevant answers. However, those weights are assigned based on local considerations (e.g., the text of a node) and, hence, their effectiveness is limited for the following reason. Nodes belonging to relevant answers are those that have, in their vicinity, other nodes with keywords of the query. To solve this problem, we introduce virtual documents (VDs) that contain a node and its vicinity. We rank VDs by applying (similarly to [1, 20]) a Markov random field (MRF) model that combines query-dependent and query-independent features. In particular, we adapt positional language models [18] by using distances based on static weights that reflect the structure of the data graph. Also, we use node priors as a query-independent feature.
We apply the ranking of VDs to the process of selecting the top-n keyword nodes (i.e., nodes containing keywords of the query) and the top-n roots. Only after selecting both keyword nodes and roots do we construct paths that connect them, thereby yielding answers. By first selecting roots, and not just keyword nodes, we realize a higher degree of both effectiveness and efficiency. It should be noted that several papers [10, 21] use notions of virtual documents that are also returned as answers. In contrast, we apply VDs just as a first step in the construction of answers.

For the final ranking of the generated answers, we once again use an MRF model for combining query-dependent and query-independent features. Unlike previous work, our query-independent feature is not ad hoc, but based on a rigorous computation of answer priors.

We have evaluated our system on the framework of [5] that consists of three datasets, IMDB, Wikipedia and Mondial, as well as fifty queries for each one. On all three datasets, our method outperforms the state-of-the-art systems. Our experiments include automatic learning of the parameters of the MRF models.

We also show that our approach offers an excellent way of improving the efficiency with just a minor impact on the effectiveness (i.e., the quality of answers). It is done by decreasing n (i.e., the number of selected roots and keyword nodes).

In summary, our main contributions are the following.

1. By developing virtual documents and using them for first selecting the top-n keyword nodes and also the top-n roots, we have an algorithm for generating answers that is both highly efficient and effective.

2. We formally compute answer priors and use them as a query-independent feature in the final ranking. Our experiments show that they make a highly significant contribution to the effectiveness of our system. In the area of keyword search over data graphs, this is the first time that query-independent features are based on a rigorous formalism, rather than ad hoc intuition.

3. We describe an efficient implementation using inverted indexes. We have done extensive experiments showing that our system is highly effective and efficient.

The rest of the paper is organized as follows. In Section 2, we discuss related work. Section 3 reviews basic concepts. Section 4 develops the virtual documents and their features. Section 5 presents the algorithm that uses the selected roots and keyword nodes for generating answers. Section 6 develops the features of answers. Section 7 describes the implementation and the experiments that verify the effectiveness and efficiency of our approach. We conclude in Section 8.
2. RELATED WORK

We discuss systems for keyword search over data graphs along the dimensions of efficiency (i.e., how to generate answers) and effectiveness (i.e., how to rank answers). The common approach to generating answers efficiently [2, 12, 19] uses backward iterators that start from keyword nodes until they meet at a root node. In [13], they improved [2] by adding forward iterators that can go back from the detected roots to keyword nodes; still, they have to start from keyword nodes. In [8], they present a method for keyword search over RDF graphs that starts with triples that match each keyword; answers are then produced by joining the triples through their subjects and objects. In our work, we use virtual documents to start simultaneously from roots and keyword nodes, thus further reducing the search space.

Other papers have also used virtual documents in keyword search over data graphs. In [10, 21], they create a virtual document from a node and its vicinity, and search it for the keywords. In [4], they do entity search over RDF data, and their virtual documents contain a node, but only with its literal neighbors. All of these systems return virtual documents or their roots as answers, whereas we use them just as a first step to generate answers (i.e., subtrees).

For an effective ranking of answers (cf. [7]), systems use features that consider both relevance to the keywords of the query and the structure of answers. In [3], they use pseudo-feedback to apply relevance models to tuples and use them for ranking of answers. The works of [12, 17, 19] resemble ours in that they concatenate the text in the nodes of answers and use IR methods combined with query-independent features for ranking. We also apply IR features to the concatenated text but, unlike those works, we present a probabilistic feature that considers the prior of an answer, and we use the well-established theory of MRF for combining it with the IR features.

In [16], they build priors for paths in graphs for the application of recommending conferences, papers to cite, or experts for a new paper that one writes. In comparison, we assign a different prior to each answer (i.e., subtree), whereas they assign priors to paths, and their priors are fixed for all paths of the same type.

In [15], they learn per-term weights for each field; in our case, it gave inferior results. So, we use a term-independent weight for each field, as in [11, 14].

3. PRELIMINARIES

3.1 The Data Model

Data graphs can be created from any format (XML, RDF, RDB, etc.). Figure 1 shows a tiny portion of the IMDB data graph. In this paper, we experiment with data graphs that are obtained from relational databases as follows. Each tuple t becomes a node v_t that has the relation name as its type. A tuple t can be either an entity or a relationship. In the former case, v_t is an entity node and is shown as a rectangle in Figure 1. In the latter case, v_t is a relationship node and is shown as a diamond. The attribute-value pairs of v_t are those of the tuple t, excluding foreign keys. That is, a foreign key is represented in the data graph by a directed edge (shown in Figure 1 as a solid arrow). An opposite edge (shown as a dashed arrow) is added in the reverse direction in order not to miss relevant answers.

Figure 1: Tiny snippet of the IMDB data graph

In Figure 1, the diamond of a relationship node shows the type (e.g., cast). The top line of a rectangle shows the entity's name (e.g., goldfinger) followed by its type (e.g., movie). The other attribute-value pairs are listed below the top line.

The value of an attribute could be free text. For the name of an entity node, we choose the value of an appropriate attribute, such as title. We shall refer to that attribute generically as name. The value of name is a short string that serves as a not-necessarily-unique identifier; for example, it could be the title of a movie or the name of a person. Relationship nodes typically do not have the attribute name.
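To make the construction concrete, the following is a minimal Python sketch of building a data graph from relational tuples along the lines just described. It is illustrative only, not the authors' implementation; the Node and DataGraph classes and their method names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    node_type: str                 # the relation name of the originating tuple
    is_relationship: bool          # diamond (relationship) vs. rectangle (entity) in Figure 1
    attributes: dict = field(default_factory=dict)   # attribute-value pairs, excluding foreign keys

class DataGraph:
    def __init__(self):
        self.nodes = {}            # node_id -> Node
        self.adj = {}              # node_id -> set of neighboring node_ids

    def add_tuple(self, tid, relation, attributes, is_relationship=False):
        self.nodes[tid] = Node(tid, relation, is_relationship, attributes)
        self.adj.setdefault(tid, set())

    def add_foreign_key(self, src, dst):
        # a foreign key becomes a directed edge; the opposite edge is added as
        # well, so that relevant answers are not missed (Section 3.1)
        self.adj[src].add(dst)
        self.adj[dst].add(src)

    def neighbors(self, tid):
        return self.adj[tid]

    def degree(self, tid):
        return len(self.adj[tid])

    def is_entity(self, tid):
        return not self.nodes[tid].is_relationship
```

Keeping a single undirected adjacency set per node means that a foreign key and its opposite edge contribute one neighbor, which is consistent with the degree-based examples of Section 4.2 (e.g., node 2 of Figure 1 having two neighbors).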
3.2 The Content and Title Fields

In [9, 11, 14], they showed that combining fields yields better results when searching XML, the Web or flat documents. Similarly, we group the attributes of a node into two semantic fields (that could overlap). The content field consists of the names and values of all the attributes. The title field comprises only the value of the attribute name. In Figure 1, for example, the title field of node 2 is: goldfinger, and the content field is: title goldfinger type movie releasedate 1964 genre action adventure producedin uk plot bond is back and his next mission ...

The content and title fields (of a node) contain textual information that we use to assign IR scores. We distinguish between these two fields in order to control the relative importance of an occurrence (of a query keyword) in the title compared with an occurrence only in the content field.

3.3 Queries and Answers

A query Q = (q_1, ..., q_m) on a data graph is a set of keywords. Each q_i should match a term in the content of some node(s); that is, we use the AND semantics, as is usually done in keyword search over data graphs (supporting the OR semantics is left for future work).

Answers are subtrees of the data graph, rather than subgraphs, because a tree is easier to understand quickly and is typically an indivisible unit of information. Formally, an answer to Q is a non-redundant subtree a of the data graph, such that a contains all the keywords of Q. Containment means that each keyword appears in some node(s). Non-redundancy requires an answer not to have a proper subtree that also contains all the keywords of the query.

As an example, consider the data graph of Figure 1 and the query "sean connery fleming" for finding movies that are related to those names. A possible answer comprises five nodes: goldfinger (of type movie), ian fleming and james bond (both of type person), and the two connecting nodes of type cast. Note that non-redundancy does not imply minimality, and a query could have numerous answers.

3.4 Markov Random Fields

Markov random field (MRF) models make it possible to combine query-dependent and query-independent features. We apply the sequential dependency model of [1, 20] and use unigrams and unordered bigrams as query-dependent features. Our query-independent feature is the prior of either a node or an answer. The score with respect to a query Q = (q_1, ..., q_m) is given by

score(Q, x) = \sum_{q_i \in Q} [\lambda_T f_T(q_i, x) + \lambda_{\hat{T}} f_{\hat{T}}(q_i, x)]
            + \sum_{\{q_i, q_{i+1}\} \in Q} [\lambda_U f_U(q_i, q_{i+1}, x) + \lambda_{\hat{U}} f_{\hat{U}}(q_i, q_{i+1}, x)]
            + \lambda_L f_L(x),    (1)

where x is either a node or an answer, and the potential functions are: f_T and f_T̂ for unigrams of the content and title fields, respectively; similarly, f_U and f_Û for unordered bigrams; and f_L for the query-independent feature. The parameters λ_T, λ_T̂, λ_U, λ_Û and λ_L are nonnegative and their sum is 1. We learn them automatically (see Section 7).

We actually use two variants of Equation (1). The algorithm for generating answers (Section 5) starts by selecting the top-n roots and keyword nodes using the potential functions f^n_T, f^n_T̂, f^n_U, f^n_Û and f^n_L that are defined in Section 4.4. After the algorithm generates its answers, they are re-ranked using the potential functions f^a_T, f^a_T̂, f^a_U, f^a_Û and f^a_L that are defined in Section 6.
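The following sketch shows how Equation (1) combines the potential functions. It is a schematic Python rendering (not the authors' code), where the potential functions and the learned λ weights are simply passed in as callables and numbers.

```python
def mrf_score(query, x, f, lam):
    """Equation (1): combine unigram, unordered-bigram and query-independent
    potentials for x (a node or an answer). `f` maps the names 'T', 'That',
    'U', 'Uhat', 'L' to the potential functions and `lam` maps the same names
    to their weights (nonnegative, summing to 1)."""
    score = 0.0
    for q in query:                                    # unigram potentials, content and title
        score += lam['T'] * f['T'](q, x) + lam['That'] * f['That'](q, x)
    for q1, q2 in zip(query, query[1:]):               # unordered bigrams of adjacent keywords
        score += lam['U'] * f['U'](q1, q2, x) + lam['Uhat'] * f['Uhat'](q1, q2, x)
    return score + lam['L'] * f['L'](x)                # query-independent prior
```

The same function serves both variants: plugging in the node potentials of Section 4.4 gives the score used for selecting roots and keyword nodes, and plugging in the answer potentials of Section 6 gives the final ranking score.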
4. VIRTUAL DOCUMENTS FOR RANKING

4.1 Virtual Documents

We consider a data graph G = (V, E), where V and E are the sets of nodes and edges, respectively. In [19], each node v ∈ V is deemed a document. We take a different approach and view a small vicinity of a node (including the node itself) as a virtual document. The intuitive motivation is that often keywords of a query are spread over several nodes that are close to one another.

Formally, the virtual document (abbr. VD) of a node v, denoted by v*, consists of all nodes u, such that the following holds. There is a path p in the data graph G from v to some entity node x (where x could be u), such that p includes u and has at most τ entity nodes, excluding v itself. The parameter τ is called the diameter of the VD. As a running example, we use Figure 1 with τ = 1. The VD of node 6 consists of nodes 2, 3, 4 and 6 (node 3 is not counted in τ, because it is a relationship). The VD of node 2 comprises nodes 1, 2, 3, 4, 5, and 6. Note that v* is defined as a set of nodes. However, v* can also be viewed as the subgraph of G induced by its nodes (i.e., the subgraph comprising the nodes of v* and the edges between them).

A node consists of two fields: content and title. We denote by v_f the text in the field f of node v. In particular, v_cnt and v_ttl are ordinary documents consisting of the text in the content and title fields, respectively, of v. Similarly, v*_f denotes the field f of the VD v*, that is, the concatenation of the text in field f of all the nodes comprising v*. (However, the weighted term frequencies defined later are applied to v*.) Recall that V is the set of nodes of the data graph. We use V_f to denote the collection that comprises all the v_f.
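One possible way to materialize the node set of a VD is a bounded graph search that counts entity nodes along each path, as in the Python sketch below. It relies on the hypothetical graph helpers introduced earlier (neighbors, is_entity) and, for simplicity, keeps relationship nodes even when no entity node lies beyond them, which the formal definition would exclude.

```python
from collections import deque

def virtual_document(graph, v, tau):
    """Nodes of the VD v* (Section 4.1): every node reachable from v along a
    path that contains at most `tau` entity nodes, not counting v itself
    (relationship nodes do not count against tau)."""
    best = {v: 0}          # fewest entity nodes (excluding v) needed to reach each node
    queue = deque([v])
    while queue:
        node = queue.popleft()
        for u in graph.neighbors(node):
            cost = best[node] + (1 if graph.is_entity(u) else 0)
            if cost <= tau and cost < best.get(u, tau + 1):
                best[u] = cost
                queue.append(u)
    return set(best)
```

On the running example of Figure 1 with τ = 1, this search would collect, for node 6, the nodes 2, 3, 4 and 6 quoted in the text, since the relationship node 3 is traversed for free.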
4.2 Static Weights

In the VD v* of a node v, occurrences of terms closer to v are more important. Distances among nodes are determined by minimal-weight paths. We assign static weights (which are query independent) to nodes and edges as follows.

For an entity node u, the importance is proportional to the number of its neighbors. Thus, the static weight of u, denoted by wn_s(u), is defined as

wn_s(u) = \frac{1}{\ln(e + Deg(u))},    (2)

where e is Euler's number and Deg(u) is the degree of u (i.e., the number of its edges) in G. Notice that 0 ≤ wn_s(u) ≤ 1 and a node with a higher degree has a better (i.e., lower) weight. We use a logarithm in the denominator so that the weight does not decay too fast as the degree increases, or else there would be a negligible difference between nodes with large degrees. For example, node 2 in Figure 1 has two neighbors, so its static weight is 0.64; the static weight of node 4 is 0.76, because it has a single neighbor.

Next, we consider relationship nodes and edges. Two entity nodes are directly related if they are connected by either a single edge or a pair of edges that pass through a relationship node. In this paper, we do not consider types of nodes and edges when determining static weights. In addition, the degree of a relationship node is usually 2 or 3. Therefore, we apply the rule that the static weight of a direct relationship is always 1. In this way, we give some preference to smaller answers (i.e., with fewer nodes). To conform to the above rule, the static weight wn_s(u) of a relationship node u is 1. The static weight of an edge e, denoted by we_s(e), is 1 if e connects two entity nodes; otherwise e connects an entity with a relationship node and we_s(e) = 0.

A path p from node v to node u is written as p_{v→u}. The static weight of p_{v→u}, denoted by w_s(p_{v→u}), is the sum of the static weights of all the edges and nodes of p; that is,

w_s(p_{v \to u}) = \sum_{e \in p} we_s(e) + \sum_{x \in p} wn_s(x),    (3)

where the first sum is over all edges e of p and the second is over all nodes x of p. Note that if the path consists of only node v, its weight is wn_s(v).

4.3 Weighted Term Frequencies

Next, we define the weighted term frequencies of unigrams and unordered bigrams in a VD v*. Given a node u of v*, the relative static weight of u in v*, denoted by w_s(v*, u), is the minimum weight over all paths from v to u in v*; that is,

w_s(v^*, u) = \min_{p_{v \to u} \text{ in } v^*} w_s(p_{v \to u}).    (4)

Note that all the nodes and edges of p_{v→u} are in v*. For example, in Figure 1, the relative static weight of node 4 in the VD of node 2 is w_s(2*, 4) = 0.64 + 1 + 0.76 = 2.4.

Let t be either a unigram or an unordered bigram. To define the frequency of t in v*, we adapt the method used in positional language models [18]. That is, the weight of t in a node u of v* is inversely proportional to w_s(v*, u) − w_s(v*, v), which is the weighted distance of u from v in the VD v*. In particular, a kernel serves as a discounting factor. We use a Gaussian kernel, because it was shown to be the best [18]. Formally, the weighted term frequency of t in field f of v*, denoted by wtf(t, v*_f), is

wtf(t, v^*_f) = \sum_{u \in v^*} e^{-\frac{(w_s(v^*, u) - w_s(v^*, v))^2}{2\sigma^2}} \, tf(t, u_f),    (5)

where tf(t, u_f) is the ordinary term frequency of t in the field f of node u and σ is a parameter that controls the spread of the kernel.

Note that the sum in Equation (5) is over all nodes u in v*. Observe that the weight of a single occurrence of t is at most one, and it is exactly one in v. For example, in Figure 1, the keyword bond appears twice in the VD of node 2: once in node 2 itself and once in node 4. If we set σ = 1, then wtf(bond, 2*_cnt) = 1 + 0.21 = 1.21.
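The two definitions above translate directly into a shortest-path computation inside the VD followed by a kernel-weighted sum. The sketch below is illustrative Python (not the authors' code); node_weight, edge_weight, neighbors and tf are assumed helpers that provide the static weights of Section 4.2 and ordinary term frequencies.

```python
import heapq
import math

def relative_static_weights(vd_nodes, v, node_weight, edge_weight, neighbors):
    """Dijkstra restricted to the VD: w_s(v*, u) of Equation (4), where a path
    weight is the sum of the static weights of its nodes and edges
    (Equation (3)); a path consisting of v alone weighs wn_s(v)."""
    dist = {v: node_weight(v)}
    heap = [(dist[v], v)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue                       # stale heap entry
        for u in neighbors(node):
            if u not in vd_nodes:
                continue
            nd = d + edge_weight(node, u) + node_weight(u)
            if nd < dist.get(u, math.inf):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return dist

def weighted_term_frequency(term, fld, vd_nodes, v, dist, tf, sigma=1.0):
    """Equation (5): Gaussian-kernel-discounted frequency of `term` in field
    `fld` of the VD v*; tf(term, u, fld) is the ordinary term frequency."""
    total = 0.0
    for u in vd_nodes:
        gap = dist[u] - dist[v]            # weighted distance of u from v in v*
        total += math.exp(-(gap ** 2) / (2 * sigma ** 2)) * tf(term, u, fld)
    return total
```

With the static weights of Section 4.2 and σ = 1, this reproduces the example above: the occurrence of bond in node 4 lies at weighted distance 2.4 − 0.64 = 1.76 from node 2 and is discounted to exp(−1.76²/2) ≈ 0.21, giving wtf(bond, 2*_cnt) ≈ 1.21.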
4.4 Node Potential Functions

Consider a query Q = (q_1, ..., q_m). We now define potential functions for unigrams, unordered bigrams and nodes. In Section 5, we use Equation (1) with these functions.

Unigrams. For unigrams, we use two potential functions f^n_T(q_i, v) and f^n_T̂(q_i, v) for the content and title fields, respectively. These functions consider the fields of the VD v* (rather than node v itself). They are defined by

f^n_T(q_i, v) = \ln[(1 - \alpha^n_T) P(q_i \mid v^*_{cnt}) + \alpha^n_T P(q_i \mid V_{cnt})],    (6)

f^n_{\hat{T}}(q_i, v) = \ln[(1 - \alpha^n_{\hat{T}}) P(q_i \mid v^*_{ttl}) + \alpha^n_{\hat{T}} P(q_i \mid V_{ttl})],    (7)

where α^n_T and α^n_T̂ are smoothing parameters for the content and title fields, respectively, and V_cnt and V_ttl are the collections comprising the content and title fields, respectively, of all the nodes of the data graph. We use Dirichlet smoothing for α^n_T and α^n_T̂, as described in Section 7.3.

We use the maximum likelihood estimate. Hence, in each one of Equations (6) and (7), the first probability on the right side is given by

P(q_i \mid v^*_x) = \frac{wtf(q_i, v^*_x)}{\sum_{t \in v^*_x} wtf(t, v^*_x)},    (8)

where we use weighted term frequencies (defined by Equation (5)) and x is either cnt or ttl. The summation in the denominator is over all unigrams t that appear in v*_x and is called the length of v*_x.

In each of Equations (6) and (7), the second probability on the right side (which does the smoothing with the collection) is given by

P(q_i \mid V_x) = \frac{\sum_{u \in V} tf(q_i, u_x)}{\sum_{u \in V} \sum_{t \in u_x} tf(t, u_x)},    (9)

where x is either cnt or ttl.

Unordered bigrams. For an unordered bigram {q_i, q_{i+1}} of Q (similarly to unigrams), we use the following two potential functions for the content and title fields.

f^n_U(q_i, q_{i+1}, v) = \ln[(1 - \alpha^n_U) P(\{q_i, q_{i+1}\} \mid v^*_{cnt}) + \alpha^n_U P(\{q_i, q_{i+1}\} \mid V_{cnt})],    (10)

f^n_{\hat{U}}(q_i, q_{i+1}, v) = \ln[(1 - \alpha^n_{\hat{U}}) P(\{q_i, q_{i+1}\} \mid v^*_{ttl}) + \alpha^n_{\hat{U}} P(\{q_i, q_{i+1}\} \mid V_{ttl})].    (11)

Here, α^n_U and α^n_Û are the smoothing parameters for unordered bigrams. Similarly to unigrams, we use Dirichlet smoothing for these parameters. As earlier, the probabilities in Equations (10) and (11) are derived according to the maximum likelihood estimate. That is, they are given by Equations (8) and (9), respectively, except that we substitute {q_i, q_{i+1}} for q_i and assume that t denotes unordered bigrams (rather than unigrams).

Query independent. We use one query-independent potential function that is given by the node prior. In particular, we assume that the probability of a node v is proportional to its degree in the graph. Hence,

f^n_L(v) = \ln \frac{Deg(v)}{\sum_{u \in V} Deg(u)}.    (12)

Overall, there are five potential functions and, thus, we have to learn five parameters (i.e., λ^n_T, λ^n_T̂, λ^n_U, λ^n_Û and λ^n_L). In the next section, we use the five functions twice: once for selecting roots and a second time for choosing keyword nodes. The learning is done separately for each one of these two cases, as described in Section 7.3.
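As an illustration of how one of these potentials is evaluated, the Python sketch below computes the content-field unigram potential of Equation (6) using the maximum-likelihood estimates of Equations (8) and (9) and the Dirichlet smoothing of Section 7.3. It is a simplified sketch; the precomputed statistics (wtf values, VD length, collection counts) are assumed inputs.

```python
import math

def unigram_potential_cnt(q, wtf_q, vd_length, coll_tf_q, coll_length, mu):
    """f^n_T(q, v) of Equation (6) for the content field.
    wtf_q       : wtf(q, v*_cnt), the weighted term frequency of Equation (5)
    vd_length   : sum of wtf(t, v*_cnt) over all unigrams t (the length of v*_cnt)
    coll_tf_q   : total ordinary frequency of q in the content field of all nodes
    coll_length : total length of the content-field collection V_cnt
    mu          : Dirichlet parameter, giving alpha = mu / (mu + |v*_cnt|)"""
    alpha = mu / (mu + vd_length)
    p_vd = wtf_q / vd_length if vd_length > 0 else 0.0    # Equation (8)
    p_coll = coll_tf_q / coll_length                      # Equation (9)
    return math.log((1 - alpha) * p_vd + alpha * p_coll)
```

The title-field and bigram potentials of Equations (7), (10) and (11) have the same shape and differ only in the statistics that are plugged in.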
5. GENERATING ANSWERS

In this section, we rank nodes according to score(Q, v) of Equation (1), where the potential functions are given by Equations (6), (7), (10), (11) and (12).

We begin by selecting roots and keyword nodes. The former will be roots of answers. The latter will appear in answers as nodes containing keywords of the given query Q = (q_1, ..., q_m). We do it as follows. First, we consider all nodes v of G, such that v* contains every q_i. We rank them according to score(Q, v) and select the top-n. These are the selected roots. Second, we consider the set U of all nodes v, such that v is in r*, where r is a selected root. Let U_i be the subset of U that comprises all nodes v, such that v contains the keyword q_i of Q. For each keyword q_i ∈ Q, we rank the nodes of U_i according to score(Q, v) and choose the top-n. These are the selected keyword nodes (for q_i). Let S be the set consisting of all the selected roots and keyword nodes.

We will construct answers from minimal-weight paths that connect the selected roots and keyword nodes. We want the weight of a path from a root r to a keyword node v to reflect also the scores of its endpoints (i.e., score(Q, r) and score(Q, v)), rather than just the static weights of Section 4.2. Hence, we convert scores into dynamic weights. When converting, we invert the scores, because lower weights are better (whereas it is the opposite for scores). The conversion produces dynamic weights in the interval [0, 1], to make them commensurate with the static weights. Formally, the dynamic weight of a node v ∈ S, denoted by wn_d(v), is

wn_d(v) = 1 - \frac{\max_{u \in S} score(Q, u)}{score(Q, v)}.    (13)

Since score(Q, v) is negative (i.e., obtained by applying logarithm to probabilities), wn_d(v) is in the interval [0, 1]. Note that wn_d(v) = 0 if node v has the highest score in S.

The static weight of a path p_{r→v} is given by Equation (3). The combined weight of p_{r→v}, denoted by w_c(p_{r→v}), also incorporates the dynamic weights of r and v (while omitting their static weights). That is,

w_c(p_{r \to v}) = wn_d(r) + wn_d(v) + \sum_{e \in p} we_s(e) + \sum_{x \in p,\, x \notin \{r, v\}} wn_s(x).    (14)

Let r be a selected root. The set U^r_i consists of all the keyword nodes that were selected for q_i and are in r*. Observe that every combination of m minimal-weight paths, such that each one is from r to a keyword node of U^r_i (1 ≤ i ≤ m), yields an answer to Q = (q_1, ..., q_m). (Formally, such a combination may not create a tree; however, it can be easily modified to form a tree.) For each root r and keyword q_i, we generate these paths and keep them in a separate sorted list.

To generate answers, a priority queue A stores, for each selected root r, the next best answer (with r as the root) that has not yet been added to the output (or discarded if it is not valid). Answers are removed from A by increasing height. Note that the height of a tree is the maximum combined weight over all the paths from the root to some leaf.

An answer a is valid if it satisfies the following conditions. First, a is non-redundant, that is, a does not have a proper subtree that also contains all the keywords of the query. If a is redundant, we convert it to a non-redundant answer by recursively removing the root r, thereby decreasing its height. Second, a is not a duplicate of another answer that is already in the output. Duplicates are removed based on an undirected semantics (to conform to the evaluation framework of Section 7). Third, a does not have a relationship node with fewer than two adjacent entity nodes.
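The sketch below gives the overall flow of the generation step in Python. It is a simplified rendering rather than the authors' algorithm: it takes only the single cheapest path per keyword instead of lazily enumerating all combinations, and it omits the validity checks. The helper best_path(r, u) is assumed to return the combined weight of Equation (14) together with the nodes of a minimal-weight path, or None when u is not in r*.

```python
import heapq

def generate_answers(query, roots, keyword_nodes, best_path, k):
    """Simplified sketch of Section 5: build one candidate answer per selected
    root from minimal-weight root-to-keyword paths and emit answers by
    increasing height (the maximum combined path weight)."""
    heap = []
    for r in roots:
        per_keyword = []
        for q in query:
            paths = [p for p in (best_path(r, u) for u in keyword_nodes[q]) if p is not None]
            if not paths:
                break                                   # r cannot reach every keyword
            per_keyword.append(min(paths, key=lambda wp: wp[0]))
        if len(per_keyword) < len(query):
            continue
        height = max(w for w, _ in per_keyword)
        answer_nodes = set().union(*(set(nodes) for _, nodes in per_keyword))
        heapq.heappush(heap, (height, r, answer_nodes))
    answers = []
    while heap and len(answers) < k:
        height, r, answer_nodes = heapq.heappop(heap)
        answers.append((r, answer_nodes))               # validity checks of Section 5 omitted
    return answers
```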
6. ANSWER POTENTIAL FUNCTIONS

We view an answer a as a document by concatenating the instances of each field over all the nodes of a. Thus, a_cnt and a_ttl are ordinary documents obtained by concatenating the content and title fields, respectively, of all the nodes of a. Similarly to nodes, we define potential functions for unigrams, unordered bigrams and query-independent answer priors. These functions are used for scoring answers with respect to Q by means of Equation (1).

Unigrams. Analogously to Equations (6) and (7), we use the following two potential functions for the content and title fields, respectively.

f^a_T(q_i, a) = \ln[(1 - \alpha^a_T) P(q_i \mid a_{cnt}) + \alpha^a_T P(q_i \mid V_{cnt})],    (15)

f^a_{\hat{T}}(q_i, a) = \ln[(1 - \alpha^a_{\hat{T}}) P(q_i \mid a_{ttl}) + \alpha^a_{\hat{T}} P(q_i \mid V_{ttl})].    (16)

Recall that V_cnt and V_ttl are the collections comprising the content and title fields, respectively, of all the nodes of the data graph.

Unordered bigrams. Similarly to Equations (10) and (11), for an unordered bigram {q_i, q_{i+1}} of Q, we define two potential functions for the content and title fields as follows.

f^a_U(q_i, q_{i+1}, a) = \ln[(1 - \alpha^a_U) P(\{q_i, q_{i+1}\} \mid a_{cnt}) + \alpha^a_U P(\{q_i, q_{i+1}\} \mid V_{cnt})],    (17)

f^a_{\hat{U}}(q_i, q_{i+1}, a) = \ln[(1 - \alpha^a_{\hat{U}}) P(\{q_i, q_{i+1}\} \mid a_{ttl}) + \alpha^a_{\hat{U}} P(\{q_i, q_{i+1}\} \mid V_{ttl})].    (18)

As in Section 4.4, we use the maximum likelihood estimate for the probabilities in the above equations. Since a_cnt and a_ttl are ordinary documents, we employ the usual term frequencies, rather than the weighted ones. We use Dirichlet smoothing for α^a_T, α^a_T̂, α^a_U and α^a_Û (see Section 7.3).

Query independent. We use one query-independent potential function f^a_L that is derived from an answer prior as follows. Let a_r be an answer. The subscript of a_r means that node r is the root. To derive the prior P(a_r), we consider the skeleton s_r that is obtained by deleting the content of a_r. Namely, s_r has the same nodes and edges as a_r, but without attribute-value pairs.

To obtain the probability P(a_r), we assume that a_r is generated in two steps. First, the skeleton s_r is generated with probability P(s_r). Second, s_r is instantiated to a_r with probability P(a_r | s_r); this is also done in two steps. First, the root of s_r is instantiated to a specific node of the data graph G. Second, the following is repeated. After instantiating a node v of s_r to some node v' of G, we select a neighbor of v' for each child of v.

The prior of an answer a_r is P(a_r) = P(a_r | s_r) P(s_r). For simplicity, we assume that P(s_r) is the same for all skeletons, so P(a_r) = P(a_r | s_r). To compute P(a_r | s_r), we make another simplifying assumption, namely, the probability of choosing a specific node u of a_r depends only on its parent, denoted by par(u). In particular, the probability of choosing the root r of a_r depends only on the data graph G. And as in the derivation of Equation (12), it is proportional to the degree of r. Thus,

P(r) = \frac{Deg(r)}{\sum_{v \in V} Deg(v)}.    (19)

For a node u ≠ r of a_r, the probability of choosing u is proportional to its degree when compared with all the adjacent nodes of its parent. Hence,

P(u \mid par(u)) = \frac{Deg(u)}{\sum_{v \in N(par(u))} Deg(v)},    (20)

where N(par(u)) consists of all the neighbors of par(u). Therefore, the probability of a_r is

P(a_r) = P(r) \prod_{u \in a,\, u \neq r} P(u \mid par(u)).    (21)

Note that the product of the conditional probabilities is over all nodes u of a, except the root. Since an answer a is an undirected subtree, we can pick any one of its nodes as the root; hence, we define

f^a_L(a) = \ln\Big(\max_{r \text{ is a node of } a} P(a_r)\Big).    (22)
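A direct way to compute the answer prior is to re-root the (undirected) answer at every one of its nodes, take the product of Equations (19)-(21), and keep the maximum, as prescribed by Equation (22). The Python sketch below assumes the graph helpers used earlier (degree, neighbors); it is illustrative rather than the authors' implementation.

```python
import math

def parents_from_root(answer_nodes, root, graph):
    """Assign each node of the answer its parent with respect to `root` by a
    breadth-first traversal restricted to the answer's nodes."""
    parent = {root: None}
    frontier = [root]
    while frontier:
        nxt = []
        for node in frontier:
            for u in graph.neighbors(node):
                if u in answer_nodes and u not in parent:
                    parent[u] = node
                    nxt.append(u)
        frontier = nxt
    return parent

def answer_prior(answer_nodes, graph, total_degree):
    """f^a_L(a) of Equation (22); `total_degree` is the sum of Deg(v) over all
    nodes of the data graph (the denominator of Equation (19))."""
    best = 0.0
    for root in answer_nodes:
        p = graph.degree(root) / total_degree                    # Equation (19)
        for u, par in parents_from_root(answer_nodes, root, graph).items():
            if par is None:
                continue                                          # skip the root itself
            denom = sum(graph.degree(w) for w in graph.neighbors(par))
            p *= graph.degree(u) / denom                          # Equation (20)
        best = max(best, p)                                       # product of Equation (21), maximized over roots
    return math.log(best)
```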
7. EXPERIMENTS

7.1 The Benchmark

We use the evaluation framework of [5] that was specifically developed for testing the effectiveness of systems for keyword search over data graphs. The framework consists of three datasets: IMDB, Wikipedia and Mondial.

The datasets are given as relations. Each tuple of those relations has a unique id and may have some foreign keys pointing to other tuples. IMDB and Wikipedia contain six relations each, and Mondial contains twenty four relations. IMDB has 1.6M tuples, Wikipedia has 206K tuples and Mondial only 17K tuples. IMDB and Wikipedia contain relatively large chunks of text, while Mondial has only short strings, such as names of countries and cities. The Mondial data graph has, on average, a large number of edges per node, compared with the other two datasets.

For each dataset, the evaluation framework of [5] has fifty queries and their qrels (i.e., query relevance judgments). The average number of keywords per query is 2.91, when excluding five queries of IMDB that consist of very long quotations from movies. The average number of answers per query is 4.49 and the largest number of tuples in an answer is 5.

7.2 System Implementation

We translated each dataset into a data graph, as explained in Section 3.1. Each data graph is indexed into three data structures. First, the graph index comprises the nodes and edges, and is kept in a Berkeley DB (http://www.oracle.com/technetwork/products/berkeleydb). It is used for traversing the data graph and for storing query-independent information that is needed for computing the potential functions. The stored information includes, for example, the static weights and, for each node v, the lengths of v*_cnt and v*_ttl; the latter two are used in Equation (8).

The second data structure, called the node index, is an Apache Lucene (https://lucene.apache.org/) inverted index. It handles each node as a separate document consisting of the title and content fields, after applying stemming and stop-word removal. (We used stop words from http://www.textfixer.com/resources/common-english-words.txt, added some negation words, such as can't and won't, and removed stop words that could be names of people, such as Will.) We use the Lucene Field class to implement those two fields. We keep two instances of the inverted node index, one for unigrams and another for unordered bigrams. The node index also stores the full text of each node, because it is needed for building the virtual index that is described next.

The third data structure, called the virtual index, is a Lucene inverted index for the VDs (virtual documents), one per node. Similarly to the node index, it has title and content fields, and two instances (for unigrams and unordered bigrams). The posting list of a term t keeps, as a Lucene payload, the weighted term frequency of t (see Equation (5)).

Table 1 gives the sizes of the indexes for each dataset. (The sizes of the node and virtual indexes are shown for unigrams; the unordered-bigram indexes are between two and eight times larger than those of the unigrams.) For IMDB, the graph index is relatively big, due to the large number of entities and relationships in that dataset. The node index and virtual index for τ = 1 have similar sizes. This is due to the fact that the node index also stores the full text of the title and content fields (rather than just the inverted lists). The virtual index for τ = 2 is larger by two orders of magnitude than the one for τ = 1.

Table 1: Index sizes (in MB)

             raw data   graph index   node index   virtual index (τ = 1)   virtual index (τ = 2)
  Mondial           2          22.8          4.3                     6.1                     137
  Wikipedia       220           330          316                     299                     23K
  IMDB            304          1800          519                     734                     40K

The selection of roots and keyword nodes in Section 5 is efficiently implemented as follows. Let Q = (q_1, ..., q_m) be the given query. In Lucene, we run Q as a boolean conjunctive query on the virtual index and find all nodes v, such that the VD v* contains all the q_i. We rank those nodes by score(Q, v) of Equation (1) and the potential functions of Section 4.4; the top-n are the selected roots. We select the top-n keyword nodes for each q_i (1 ≤ i ≤ m) as follows. Using the node index and the Filter tool of Lucene, we run a Lucene query to find all the nodes that contain q_i and are in the VD of some selected root, and then choose the top-n according to score(Q, v).
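Lucene specifics aside, the selection step amounts to a boolean conjunction over the virtual index followed by per-keyword lookups over the node index. The sketch below expresses that logic over plain Python dictionaries (vd_postings and node_postings map a term to the set of nodes whose VD, or whose own text, contains it); it is a schematic rendering, not the actual Lucene implementation.

```python
def select_roots_and_keyword_nodes(query, vd_postings, node_postings, vd_of, score, n):
    """Selection step of Sections 5 and 7.2: top-n roots whose VD contains every
    keyword, then top-n keyword nodes per keyword among the selected roots' VDs.
    `score(v)` stands for score(Q, v) of Equation (1) with the node potentials,
    and `vd_of(r)` returns the node set of r* (Section 4.1)."""
    candidates = set.intersection(*(vd_postings[q] for q in query))   # boolean conjunctive query
    roots = sorted(candidates, key=score, reverse=True)[:n]
    covered = set().union(*(vd_of(r) for r in roots)) if roots else set()
    keyword_nodes = {}
    for q in query:
        hits = [v for v in node_postings[q] if v in covered]
        keyword_nodes[q] = sorted(hits, key=score, reverse=True)[:n]
    return roots, keyword_nodes
```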
7.3 Parameters Tuning

We use Equation (1) in three places: first, to select the top-n roots; second, to select the top-n keyword nodes; and finally, to rank the top-k answers. Thus, we have to learn the parameters λ_T, λ_T̂, λ_U, λ_Û and λ_L for each of the three cases. We use the coordinate ascent algorithm [20], as implemented in RankLib (http://sourceforge.net/p/lemur/wiki/RankLib/), to learn the parameters.

We generated three labeled files (for roots, keyword nodes and answers) with positive and negative examples for each query as follows. The qrels in the benchmark [5] contain only the correct answers, which are labeled with 1. For each node v of those answers, if v can be a root (i.e., its VD contains all the query keywords) or a keyword node, then it is added to the corresponding file with the label 1. To add negative examples, we ran our system with equal weights of 0.2 for all of the five parameters (when scoring roots, keyword nodes and answers). We used the default values of n = 1,000 and k = 1,000. Among the top-k answers obtained in this way, we chose those that are not in the qrels and labeled them with 0. Each node that can be a root or a keyword node of an answer that is not in the qrels was also labeled with 0 and added to the corresponding file, provided that it had not previously been labeled with 1.

Our experiments use cross validation. Namely, we learn the parameters on two datasets and use them on the third one. We report the results for each dataset with those learned parameters. The files with both 0 and 1 labels are just for training. The MAP is measured on the third dataset by using the original qrels of the evaluation framework.

We use Dirichlet smoothing with the parameter μ / (μ + |x|), where |x| is the length of x. For α^n_T, α^n_T̂, α^n_U and α^n_Û in Equations (6) and (10), the variable x ranges over v*_cnt, whereas in Equations (7) and (11), it ranges over v*_ttl. The parameter μ is the average of |x| over all v*_cnt and v*_ttl when smoothing v*_cnt and v*_ttl, respectively.

For the parameters α^a_T, α^a_T̂, α^a_U and α^a_Û in Equations (15) and (17), the variable x ranges over the a_cnt field of the top-k answers, whereas in Equations (16) and (18), it ranges over their a_ttl field. We view an answer a as an ordinary document. But since answers have a variable number of nodes, we define the length of a_f (where f is either cnt or ttl) as the average per node; namely, (1/s) \sum_{t \in a_f} tf(t, a_f), where s is the number of nodes of a. When smoothing a_f, the parameter μ is the average of |v_f| over all nodes v of the data graph; similarly for unordered bigrams.

The diameter of VDs is τ = 1 (see Section 4.1). Unless otherwise specified, when selecting the top-n roots and keyword nodes for each q_i (see Section 5), we use n = 1,000.
7.4 The Setup of the Experiments

We compare our approach with the top systems, namely, BANKS [2], Bidirectional [13] and CD [6], among those tested in [5], as well as with GraphLM [19]. In [7], they extended the binary relevance set of [5] to include answers with marginal relevance. However, they have not made that extended framework available, so we cannot use it.

Similarly to [5, 19], we use the default k = 1,000 when producing the top-k answers for each query.

7.5 Comparison with the State of the Art

We compare our approach, called MRF-KS, with the state-of-the-art systems using the evaluation framework of [5]. (For all the systems, the MAP on IMDB is based on the updated qrels that appear in http://www.cs.virginia.edu/~jmc7tp/resources.php#search.) Figure 2 shows that MRF-KS outperforms the other systems on each of the three datasets (we show only the top-performing state-of-the-art systems). The second best is GraphLM. On Wikipedia, MRF-KS achieves a MAP of 0.76 compared with 0.63 for GraphLM (an improvement of 20%). On IMDB, MRF-KS has a MAP of 0.76 compared with 0.68 for GraphLM (an improvement of 11.3%). And on Mondial, MRF-KS has a MAP of 0.89 compared with 0.83 for GraphLM (an improvement of 7.6%).

Figure 2: MAP of MRF-KS and state-of-the-art systems (k = 1,000, n = 1,000)

The advantage of MRF-KS over the second-best system, GraphLM, is statistically significant on all three datasets, as measured in a one-tailed t-test (p-value < 0.05). We exclude the work of [3] from this comparison due to the following. Some essential details (e.g., the algorithm for generating answers in ranked order) are missing from their paper, which has made it impossible for us to reproduce their results. Moreover, they declined our request to get their code or, at least, the AP (average precision) they obtained for each individual query of the evaluation framework of [5]; hence, verifying their results is problematic. In any case, they reported that for Mondial, Wikipedia and IMDB, they got a MAP of 0.9, 0.78 and 0.79, respectively. Our results are almost the same: 0.89, 0.76 and 0.76, respectively.

7.6 The Effect of Different Components

We now show the effect of various components of our system by measuring the MAP yielded by different configurations, as shown in Figure 3. Unless otherwise specified, VDs have the default diameter of τ = 1. To magnify the effect of the different configurations, we select a relatively small number (n = 100) of roots and keyword nodes. For each configuration, we relearned the relevant parameters on two datasets and then measured the MAP on the third one, as explained in Section 7.3.

Figure 3: Effect of individual features (k = 1,000, n = 100)

The first two columns show that increasing the diameter τ of VDs from 1 to 2 slightly lowers the MAP. The third column (labeled noWTF) is when the diameter is 2 and ordinary term frequencies (instead of the weighted ones) are used. In this case, the MAP drops on all three datasets and the larger effect is on Wikipedia (from 0.74 to 0.66).

The rest of the columns show an ablation test that disables one feature at a time. The column noBigram is when disabling the unordered bigrams in the content and title fields (setting λ_U, λ_Û = 0) for roots, keyword nodes and answers. The column noTitle denotes the effect of disabling the title field for unigrams and unordered bigrams (setting λ_T̂, λ_Û = 0) for roots, keyword nodes and answers. The column noAnswerPriors is when using all features except the answer priors (setting λ_L = 0). The last column onlyAnswerPriors shows the MAP when using only answer priors for ranking the answers (setting λ_T, λ_U, λ_T̂, λ_Û = 0 for answers and keeping all features for roots and keyword nodes).

We can see that the most dominant feature is the answer priors. When disabled, the MAP drops sharply on all three datasets. For example, on Wikipedia it drops from 0.75 to 0.6 and on IMDB from 0.76 to 0.56. The contribution of the answer priors is statistically significant.
The title field has the second largest effect on IMDB and Wikipedia. The unordered bigrams have a moderate effect on Mondial and Wikipedia, but a larger one on IMDB.

7.7 Efficiency vs. Effectiveness

In this section, we show that our method offers a useful trade-off between effectiveness and efficiency that can be easily tuned, thereby substantially improving the running time while only slightly lowering the MAP. This ability is hardly found in any other system. Figure 4 gives the average running time per query and the MAP for different values of n (i.e., the number of selected roots and keyword nodes). The results are for k = 100, rather than the default k = 1,000, thereby making the results more significant.

Figure 4: Effect of n (number of selected roots and keyword nodes) for k = 100

Figure 4 shows that the running time drops at a much faster rate than the MAP as n gets smaller. On the three datasets, even for the small value of n = 50, the MAP is almost the same as for n = 1,000. Hence, the parameter n enables us to substantially increase the efficiency without sacrificing effectiveness.

8. CONCLUSIONS

We presented a novel approach, couched in probability theory, to finding the top-k answers in keyword search over data graphs. It is based on new ideas and concepts. First, we showed how to estimate the prior of an answer (i.e., subtree) and showed its contribution to the final ranking of answers. Second, we defined virtual documents with weighted term frequencies and showed their effectiveness in selecting the most promising roots and keyword nodes of candidate answers. Third, we presented an efficient algorithm for generating and ranking answers. The ranking of nodes (i.e., roots and keyword nodes) and answers is based on a combination of query-dependent and query-independent features.

We compared our approach with other systems on the evaluation framework of [5] that consists of three datasets: IMDB, Wikipedia and Mondial. In terms of MAP, our approach has a statistically significant advantage over the other tested systems on each of these datasets. For example, on Wikipedia, we achieved an improvement of 20% compared with the best state-of-the-art system.

We performed an extensive analysis of the contribution of the various components. The most significant feature is the answer priors. We further showed that the MAP remains almost the same even when selecting a small number of keyword nodes and roots, thereby reducing the search space and increasing the efficiency.

9. ACKNOWLEDGMENTS

The authors thank Oren Kurland for helpful comments.
10. REFERENCES

[1] M. Bendersky, W. B. Croft, and Y. Diao. Quality-biased ranking of web documents. In WSDM, 2011.
[2] G. Bhalotia, A. Hulgeri, C. Nakhe, and S. Chakrabarti. Keyword searching and browsing in databases using BANKS. In ICDE, pages 431-440, 2002.
[3] V. Bicer, T. Tran, and R. Nedkov. Ranking support for keyword search on structured data using relevance models. In CIKM, 2011.
[4] R. Blanco, P. Mika, and S. Vigna. Effective and efficient entity search in RDF data. In ISWC, 2011.
[5] J. Coffman and A. C. Weaver. A framework for evaluating database keyword search strategies. In CIKM, 2010.
[6] J. Coffman and A. C. Weaver. Structured data retrieval using cover density ranking. In KEYS, pages 115-126, 2010.
[7] J. Coffman and A. C. Weaver. Learning to rank results in relational keyword search. In CIKM, 2011.
[8] S. Elbassuoni and R. Blanco. Keyword search over RDF graphs. In CIKM, 2011.
[9] B. He and I. Ounis. Combining fields for query expansion and adaptive query expansion. Information Processing and Management, 43:1294-1307, 2007.
[10] H. He, H. Wang, J. Yang, and P. S. Yu. BLINKS: Ranked keyword searches on graphs. In SIGMOD, 2007.
[11] D. Hiemstra. Statistical language models for intelligent XML retrieval. In Intelligent Search on XML Data, LNCS 2818, Springer-Verlag, Berlin Heidelberg, 2003.
[12] V. Hristidis, L. Gravano, and Y. Papakonstantinou. Efficient IR-style keyword search over relational databases. In VLDB, pages 850-861, 2003.
[13] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, and R. Desai. Bidirectional expansion for keyword search on graph databases. In VLDB, pages 505-516, 2005.
[14] J. Kamps, G. Mishne, and M. de Rijke. Language models for searching in Web corpora. In TREC, 2004.
[15] J. Y. Kim and W. B. Croft. A field relevance model for structured document retrieval. In ECIR, pages 97-108, 2012.
[16] N. Lao and W. W. Cohen. Relational retrieval using a combination of path-constrained random walks. Machine Learning, 81:53-67, 2010.
[17] Y. Luo, X. Lin, W. Wang, and X. Zhou. Top-k keyword query in relational databases. In SIGMOD, pages 115-126, 2007.
[18] Y. Lv and C. Zhai. Positional language models for information retrieval. In SIGIR, 2009.
[19] Y. Mass and Y. Sagiv. Language models for keyword search over data graphs. In WSDM, pages 363-372, 2012.
[20] D. Metzler and W. B. Croft. A Markov random field model for term dependencies. In SIGIR, 2005.
[21] Q. Su and J. Widom. Indexing relational database content offline for efficient keyword-based search. In IDEAS, pages 297-306, 2005.