Virtual Documents and Answer Priors in Keyword Search over Data Graphs∗

Yosi Mass†‡ and Yehoshua Sagiv‡
† IBM Haifa Research Lab, Haifa 31905, Israel, yosimass@il.ibm.com
‡ The Hebrew University, Jerusalem 91904, Israel, sagiv@cs.huji.ac.il

∗ This work was supported by the Israel Science Foundation (Grant No. 1632/12).

© 2016, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2016 Joint Conference (March 15, 2016, Bordeaux, France) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

ABSTRACT

In keyword search over data graphs, an answer is a non-redundant subtree that contains the keywords of the query. Ranking of answers should take into account both their textual relevance and the significance of their semantic structure. A novel method for answer priors is developed and used in conjunction with query-dependent features. Since the space of all possible answers is huge, efficiency is also a major problem. A new algorithm that drastically cuts down the search space is presented. It generates candidate answers by first selecting the top-n roots and the top-n nodes for each query keyword. The selection is by means of a novel concept of virtual documents with weighted term frequencies. Markov random field models are used for ranking the virtual documents and then the generated answers. The proposed approach outperforms existing systems on a standard evaluation framework.

Keywords

Data graph, keyword search, query, ranking, answer prior, virtual documents

1. INTRODUCTION

Data graphs are a convenient, flexible way of representing knowledge bases. They can be constructed from a variety of formats (e.g., XML, RDB and RDF). Their semistructured nature makes it possible to create them incrementally, distributively and heterogeneously. Since they do not have a rigid schema, it is essential to support keyword search over them (rather than querying in some formal language, such as XQuery or SPARQL). Yet, it is still possible to get succinct answers that have some semantic structure. In particular, answers can show semantic connections between distinct units (e.g., Web pages), which is impossible in ordinary keyword search, where a result is always a single unit (e.g., a document).

The nodes of a data graph represent entities and relationships, while the connections among them are introduced as edges. Free text can be associated with nodes and edges. In keyword search over data graphs, answers are non-redundant subtrees that contain all the keywords of the query.

Keyword search over data graphs has been investigated extensively in recent years (cf. [5]). It involves two main challenges: effectiveness and efficiency. The first means that we have to develop effective methods for ranking answers (i.e., subtrees) that take into account the structure (e.g., the importance of entities and the strength of relationships among them) as well as the relevance of the keywords of the query. The second challenge is to generate candidate (i.e., potentially relevant) answers efficiently. This is not an easy problem, because there could be a huge number of subtrees that contain all the keywords of the query.

A common approach begins by assigning weights to the nodes and edges of the data graph. Then, the search for answers starts from nodes containing keywords of the query. The weights are intended to focus the search on paths that lead to roots of the most relevant answers. However, those weights are assigned based on local considerations (e.g., the text of a node) and, hence, their effectiveness is limited for the following reason. Nodes belonging to relevant answers are those that have, in their vicinity, other nodes with keywords of the query. To solve this problem, we introduce virtual documents (VDs) that contain a node and its vicinity. We rank VDs by applying (similarly to [1, 20]) a Markov random field (MRF) model that combines query-dependent and query-independent features. In particular, we adapt positional language models [18] by using distances based on static weights that reflect the structure of the data graph. Also, we use node priors as a query-independent feature.
We apply the ranking of VDs to the process of selecting the top-n keyword nodes (i.e., nodes containing keywords of the query) and the top-n roots. Only after selecting both keyword nodes and roots do we construct paths that connect them, thereby yielding answers. By first selecting roots, and not just keyword nodes, we realize a higher degree of both effectiveness and efficiency. It should be noted that several papers [10, 21] use notions of virtual documents that are also returned as answers. In contrast, we apply VDs just as a first step in the construction of answers.

For the final ranking of the generated answers, we once again use an MRF model for combining query-dependent and query-independent features. Unlike previous work, our query-independent feature is not ad hoc, but based on a rigorous computation of answer priors.

We have evaluated our system on the framework of [5] that consists of three datasets, IMDB, Wikipedia and Mondial, as well as fifty queries for each one. On all three datasets, our method outperforms the state-of-the-art systems. Our experiments include automatic learning of the parameters of the MRF models.

We also show that our approach offers an excellent way of improving the efficiency with just a minor impact on the effectiveness (i.e., the quality of answers). It is done by decreasing n (i.e., the number of selected roots and keyword nodes).

In summary, our main contributions are the following.

1. By developing virtual documents and using them for first selecting the top-n keyword nodes and also the top-n roots, we have an algorithm for generating answers that is both highly efficient and effective.

2. We formally compute answer priors and use them as a query-independent feature in the final ranking. Our experiments show that they make a highly significant contribution to the effectiveness of our system. In the area of keyword search over data graphs, this is the first time that query-independent features are based on a rigorous formalism, rather than ad hoc intuition.

3. We describe an efficient implementation using inverted indexes. We have done extensive experiments showing that our system is highly effective and efficient.

The rest of the paper is organized as follows. In Section 2, we discuss related work. Section 3 reviews basic concepts. Section 4 develops the virtual documents and their features. Section 5 presents the algorithm that uses the selected roots and keyword nodes for generating answers. Section 6 develops the features of answers. Section 7 describes the implementation and the experiments that verify the effectiveness and efficiency of our approach. We conclude in Section 8.
2. RELATED WORK

We discuss systems for keyword search over data graphs along the dimensions of efficiency (i.e., how to generate answers) and effectiveness (i.e., how to rank answers). The common approach to generating answers efficiently [2, 12, 19] uses backward iterators that start from keyword nodes until they meet at a root node. In [13], they improved [2] by adding forward iterators that can go back from the detected roots to keyword nodes; still, they have to start from keyword nodes. In [8], they present a method for keyword search over RDF graphs that starts with triples that match each keyword; answers are then produced by joining the triples through their subjects and objects. In our work, we use virtual documents to start simultaneously from roots and keyword nodes, thus further reducing the search space.

Other papers have also used virtual documents in keyword search over data graphs. In [10, 21], they create a virtual document from a node and its vicinity, and search it for the keywords. In [4], they do entity search over RDF data, and their virtual documents contain a node, but only with its literal neighbors. All of these systems return virtual documents or their roots as answers, whereas we use them just as a first step to generate answers (i.e., subtrees).

For an effective ranking of answers (cf. [7]), systems use features that consider both relevance to the keywords of the query and the structure of answers. In [3], they use pseudo-feedback to apply relevance models to tuples and use them for ranking of answers. The works of [12, 17, 19] resemble ours in that they concatenate the text in the nodes of answers and use IR methods combined with query-independent features for ranking. We also apply IR features to the concatenated text but, unlike those works, we present a probabilistic feature that considers the prior of an answer, and we use the well-established theory of MRF for combining it with the IR features.

In [16], they build priors for paths in graphs for the application of recommending conferences, papers to cite, or experts for a new paper that one writes. In comparison, we assign a different prior to each answer (i.e., subtree), whereas they assign priors to paths, and their priors are fixed for all paths of the same type.

In [15], they learn per-term weights for each field; in our case, it gave inferior results. So, we use a term-independent weight for each field, as in [11, 14].

3. PRELIMINARIES

3.1 The Data Model

Data graphs can be created from any format (XML, RDF, RDB, etc.). Figure 1 shows a tiny portion of the IMDB data graph. In this paper, we experiment with data graphs that are obtained from relational databases as follows. Each tuple t becomes a node v_t that has the relation name as its type. A tuple t can be either an entity or a relationship. In the former case, v_t is an entity node and is shown as a rectangle in Figure 1. In the latter case, v_t is a relationship node and is shown as a diamond. The attribute-value pairs of v_t are those of the tuple t, excluding foreign keys. That is, a foreign key is represented in the data graph by a directed edge (shown in Figure 1 as a solid arrow). An opposite edge (shown as a dashed arrow) is added in the reverse direction in order not to miss relevant answers.

Figure 1: Tiny snippet of the IMDB data graph

In Figure 1, the diamond of a relationship node shows the type (e.g., cast). The top line of a rectangle shows the entity's name (e.g., goldfinger) followed by its type (e.g., movie). The other attribute-value pairs are listed below the top line.

The value of an attribute could be free text. For the name of an entity node, we choose the value of an appropriate attribute, such as title. We shall refer to that attribute generically as name. The value of name is a short string that serves as a not-necessarily-unique identifier; for example, it could be the title of a movie or the name of a person. Relationship nodes typically do not have the attribute name.
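To make the construction concrete, the following is a minimal Python sketch of building a data graph from relational tuples along the lines just described. It is illustrative only, not the authors' implementation; the Node and DataGraph classes and their method names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    node_type: str                 # the relation name of the originating tuple
    is_relationship: bool          # diamond (relationship) vs. rectangle (entity) in Figure 1
    attributes: dict = field(default_factory=dict)   # attribute-value pairs, excluding foreign keys

class DataGraph:
    def __init__(self):
        self.nodes = {}            # node_id -> Node
        self.adj = {}              # node_id -> set of neighboring node_ids

    def add_tuple(self, tid, relation, attributes, is_relationship=False):
        self.nodes[tid] = Node(tid, relation, is_relationship, attributes)
        self.adj.setdefault(tid, set())

    def add_foreign_key(self, src, dst):
        # a foreign key becomes a directed edge; the opposite edge is added as
        # well, so that relevant answers are not missed (Section 3.1)
        self.adj[src].add(dst)
        self.adj[dst].add(src)

    def neighbors(self, tid):
        return self.adj[tid]

    def degree(self, tid):
        return len(self.adj[tid])

    def is_entity(self, tid):
        return not self.nodes[tid].is_relationship
```

Keeping a single undirected adjacency set per node means that a foreign key and its opposite edge contribute one neighbor, which is consistent with the degree-based examples of Section 4.2 (e.g., node 2 of Figure 1 having two neighbors).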
3.2 The Content and Title Fields

In [9, 11, 14], they showed that combining fields yields better results when searching XML, the Web or flat documents. Similarly, we group the attributes of a node into two semantic fields (that could overlap). The content field consists of the names and values of all the attributes. The title field comprises only the value of the attribute name. In Figure 1, for example, the title field of node 2 is: goldfinger, and the content field is: title goldfinger type movie releasedate 1964 genre action adventure producedin uk plot bond is back and his next mission ...

The content and title fields (of a node) contain textual information that we use to assign IR scores. We distinguish between these two fields in order to control the relative importance of an occurrence (of a query keyword) in the title compared with an occurrence only in the content field.

3.3 Queries and Answers

A query Q = (q_1, ..., q_m) on a data graph is a set of keywords. Each q_i should match a term in the content of some node(s); that is, we use the AND semantics, as is usually done in keyword search over data graphs (supporting the OR semantics is left for future work).

Answers are subtrees of the data graph, rather than subgraphs, because a tree is easier to understand quickly and is typically an indivisible unit of information. Formally, an answer to Q is a non-redundant subtree a of the data graph, such that a contains all the keywords of Q. Containment means that each keyword appears in some node(s). Non-redundancy requires an answer not to have a proper subtree that also contains all the keywords of the query.

As an example, consider the data graph of Figure 1 and the query "sean connery fleming" for finding movies that are related to those names. A possible answer comprises five nodes: goldfinger (of type movie), ian fleming and james bond (both of type person), and the two connecting nodes of type cast. Note that non-redundancy does not imply minimality, and a query could have numerous answers.

3.4 Markov Random Fields

Markov random field (MRF) models make it possible to combine query-dependent and query-independent features. We apply the sequential dependency model of [1, 20] and use unigrams and unordered bigrams as query-dependent features. Our query-independent feature is the prior of either a node or an answer. The score with respect to a query Q = (q_1, ..., q_m) is given by

score(Q, x) = \sum_{q_i \in Q} [\lambda_T f_T(q_i, x) + \lambda_{\hat{T}} f_{\hat{T}}(q_i, x)]
            + \sum_{\{q_i, q_{i+1}\} \in Q} [\lambda_U f_U(q_i, q_{i+1}, x) + \lambda_{\hat{U}} f_{\hat{U}}(q_i, q_{i+1}, x)]
            + \lambda_L f_L(x),    (1)

where x is either a node or an answer, and the potential functions are: f_T and f_T̂ for unigrams of the content and title fields, respectively; similarly, f_U and f_Û for unordered bigrams; and f_L for the query-independent feature. The parameters λ_T, λ_T̂, λ_U, λ_Û and λ_L are nonnegative and their sum is 1. We learn them automatically (see Section 7).

We actually use two variants of Equation (1). The algorithm for generating answers (Section 5) starts by selecting the top-n roots and keyword nodes using the potential functions f^n_T, f^n_T̂, f^n_U, f^n_Û and f^n_L that are defined in Section 4.4. After the algorithm generates its answers, they are re-ranked using the potential functions f^a_T, f^a_T̂, f^a_U, f^a_Û and f^a_L that are defined in Section 6.
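The following sketch shows how Equation (1) combines the potential functions. It is a schematic Python rendering (not the authors' code), where the potential functions and the learned λ weights are simply passed in as callables and numbers.

```python
def mrf_score(query, x, f, lam):
    """Equation (1): combine unigram, unordered-bigram and query-independent
    potentials for x (a node or an answer). `f` maps the names 'T', 'That',
    'U', 'Uhat', 'L' to the potential functions and `lam` maps the same names
    to their weights (nonnegative, summing to 1)."""
    score = 0.0
    for q in query:                                    # unigram potentials, content and title
        score += lam['T'] * f['T'](q, x) + lam['That'] * f['That'](q, x)
    for q1, q2 in zip(query, query[1:]):               # unordered bigrams of adjacent keywords
        score += lam['U'] * f['U'](q1, q2, x) + lam['Uhat'] * f['Uhat'](q1, q2, x)
    return score + lam['L'] * f['L'](x)                # query-independent prior
```

The same function serves both variants: plugging in the node potentials of Section 4.4 gives the score used for selecting roots and keyword nodes, and plugging in the answer potentials of Section 6 gives the final ranking score.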
4. VIRTUAL DOCUMENTS FOR RANKING

4.1 Virtual Documents

We consider a data graph G = (V, E), where V and E are the sets of nodes and edges, respectively. In [19], each node v ∈ V is deemed a document. We take a different approach and view a small vicinity of a node (including the node itself) as a virtual document. The intuitive motivation is that often keywords of a query are spread over several nodes that are close to one another.

Formally, the virtual document (abbr. VD) of a node v, denoted by v*, consists of all nodes u, such that the following holds. There is a path p in the data graph G from v to some entity node x (where x could be u), such that p includes u and has at most τ entity nodes, excluding v itself. The parameter τ is called the diameter of the VD. As a running example, we use Figure 1 with τ = 1. The VD of node 6 consists of nodes 2, 3, 4 and 6 (node 3 is not counted in τ, because it is a relationship). The VD of node 2 comprises nodes 1, 2, 3, 4, 5, and 6. Note that v* is defined as a set of nodes. However, v* can also be viewed as the subgraph of G induced by its nodes (i.e., the subgraph comprising the nodes of v* and the edges between them).

A node consists of two fields: content and title. We denote by v_f the text in the field f of node v. In particular, v_cnt and v_ttl are ordinary documents consisting of the text in the content and title fields, respectively, of v. Similarly, v*_f denotes the field f of the VD v*, that is, the concatenation of the text in field f of all the nodes comprising v*. (However, the weighted term frequencies defined later are applied to v*.) Recall that V is the set of nodes of the data graph. We use V_f to denote the collection that comprises all the v_f.
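One possible way to materialize the node set of a VD is a bounded graph search that counts entity nodes along each path, as in the Python sketch below. It relies on the hypothetical graph helpers introduced earlier (neighbors, is_entity) and, for simplicity, keeps relationship nodes even when no entity node lies beyond them, which the formal definition would exclude.

```python
from collections import deque

def virtual_document(graph, v, tau):
    """Nodes of the VD v* (Section 4.1): every node reachable from v along a
    path that contains at most `tau` entity nodes, not counting v itself
    (relationship nodes do not count against tau)."""
    best = {v: 0}          # fewest entity nodes (excluding v) needed to reach each node
    queue = deque([v])
    while queue:
        node = queue.popleft()
        for u in graph.neighbors(node):
            cost = best[node] + (1 if graph.is_entity(u) else 0)
            if cost <= tau and cost < best.get(u, tau + 1):
                best[u] = cost
                queue.append(u)
    return set(best)
```

On the running example of Figure 1 with τ = 1, this search would collect, for node 6, the nodes 2, 3, 4 and 6 quoted in the text, since the relationship node 3 is traversed for free.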
4.2 Static Weights

In the VD v* of a node v, occurrences of terms closer to v are more important. Distances among nodes are determined by minimal-weight paths. We assign static weights (which are query independent) to nodes and edges as follows.

For an entity node u, the importance is proportional to the number of its neighbors. Thus, the static weight of u, denoted by wn_s(u), is defined as

wn_s(u) = \frac{1}{\ln(e + Deg(u))},    (2)

where e is Euler's number and Deg(u) is the degree of u (i.e., the number of its edges) in G. Notice that 0 ≤ wn_s(u) ≤ 1 and a node with a higher degree has a better (i.e., lower) weight. We use a logarithm in the denominator so that the weight does not decay too fast as the degree increases, or else there would be a negligible difference between nodes with large degrees. For example, node 2 in Figure 1 has two neighbors, so its static weight is 0.64; the static weight of node 4 is 0.76, because it has a single neighbor.

Next, we consider relationship nodes and edges. Two entity nodes are directly related if they are connected by either a single edge or a pair of edges that pass through a relationship node. In this paper, we do not consider types of nodes and edges when determining static weights. In addition, the degree of a relationship node is usually 2 or 3. Therefore, we apply the rule that the static weight of a direct relationship is always 1. In this way, we give some preference to smaller answers (i.e., with fewer nodes). To conform to the above rule, the static weight wn_s(u) of a relationship node u is 1. The static weight of an edge e, denoted by we_s(e), is 1 if e connects two entity nodes; otherwise e connects an entity with a relationship node and we_s(e) = 0.

A path p from node v to node u is written as p_{v→u}. The static weight of p_{v→u}, denoted by w_s(p_{v→u}), is the sum of the static weights of all the edges and nodes of p; that is,

w_s(p_{v \to u}) = \sum_{e \in p} we_s(e) + \sum_{x \in p} wn_s(x),    (3)

where the first sum is over all edges e of p and the second is over all nodes x of p. Note that if the path consists of only node v, its weight is wn_s(v).

4.3 Weighted Term Frequencies

Next, we define the weighted term frequencies of unigrams and unordered bigrams in a VD v*. Given a node u of v*, the relative static weight of u in v*, denoted by w_s(v*, u), is the minimum weight over all paths from v to u in v*; that is,

w_s(v^*, u) = \min_{p_{v \to u} \text{ in } v^*} w_s(p_{v \to u}).    (4)

Note that all the nodes and edges of p_{v→u} are in v*. For example, in Figure 1, the relative static weight of node 4 in the VD of node 2 is w_s(2*, 4) = 0.64 + 1 + 0.76 = 2.4.

Let t be either a unigram or an unordered bigram. To define the frequency of t in v*, we adapt the method used in positional language models [18]. That is, the weight of t in a node u of v* is inversely proportional to w_s(v*, u) − w_s(v*, v), which is the weighted distance of u from v in the VD v*. In particular, a kernel serves as a discounting factor. We use a Gaussian kernel, because it was shown to be the best [18]. Formally, the weighted term frequency of t in field f of v*, denoted by wtf(t, v*_f), is

wtf(t, v^*_f) = \sum_{u \in v^*} e^{-\frac{(w_s(v^*, u) - w_s(v^*, v))^2}{2\sigma^2}} \, tf(t, u_f),    (5)

where tf(t, u_f) is the ordinary term frequency of t in the field f of node u and σ is a parameter that controls the spread of the kernel.

Note that the sum in Equation (5) is over all nodes u in v*. Observe that the weight of a single occurrence of t is at most one, and it is exactly one in v. For example, in Figure 1, the keyword bond appears twice in the VD of node 2: once in node 2 itself and once in node 4. If we set σ = 1, then wtf(bond, 2*_cnt) = 1 + 0.21 = 1.21.
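The two definitions above translate directly into a shortest-path computation inside the VD followed by a kernel-weighted sum. The sketch below is illustrative Python (not the authors' code); node_weight, edge_weight, neighbors and tf are assumed helpers that provide the static weights of Section 4.2 and ordinary term frequencies.

```python
import heapq
import math

def relative_static_weights(vd_nodes, v, node_weight, edge_weight, neighbors):
    """Dijkstra restricted to the VD: w_s(v*, u) of Equation (4), where a path
    weight is the sum of the static weights of its nodes and edges
    (Equation (3)); a path consisting of v alone weighs wn_s(v)."""
    dist = {v: node_weight(v)}
    heap = [(dist[v], v)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue                       # stale heap entry
        for u in neighbors(node):
            if u not in vd_nodes:
                continue
            nd = d + edge_weight(node, u) + node_weight(u)
            if nd < dist.get(u, math.inf):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return dist

def weighted_term_frequency(term, fld, vd_nodes, v, dist, tf, sigma=1.0):
    """Equation (5): Gaussian-kernel-discounted frequency of `term` in field
    `fld` of the VD v*; tf(term, u, fld) is the ordinary term frequency."""
    total = 0.0
    for u in vd_nodes:
        gap = dist[u] - dist[v]            # weighted distance of u from v in v*
        total += math.exp(-(gap ** 2) / (2 * sigma ** 2)) * tf(term, u, fld)
    return total
```

With the static weights of Section 4.2 and σ = 1, this reproduces the example above: the occurrence of bond in node 4 lies at weighted distance 2.4 − 0.64 = 1.76 from node 2 and is discounted to exp(−1.76²/2) ≈ 0.21, giving wtf(bond, 2*_cnt) ≈ 1.21.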
4.4 Node Potential Functions

Consider a query Q = (q_1, ..., q_m). We now define potential functions for unigrams, unordered bigrams and nodes. In Section 5, we use Equation (1) with these functions.

Unigrams. For unigrams, we use two potential functions f^n_T(q_i, v) and f^n_T̂(q_i, v) for the content and title fields, respectively. These functions consider the fields of the VD v* (rather than node v itself). They are defined by

f^n_T(q_i, v) = \ln[(1 - \alpha^n_T) P(q_i \mid v^*_{cnt}) + \alpha^n_T P(q_i \mid V_{cnt})],    (6)

f^n_{\hat{T}}(q_i, v) = \ln[(1 - \alpha^n_{\hat{T}}) P(q_i \mid v^*_{ttl}) + \alpha^n_{\hat{T}} P(q_i \mid V_{ttl})],    (7)

where α^n_T and α^n_T̂ are smoothing parameters for the content and title fields, respectively, and V_cnt and V_ttl are the collections comprising the content and title fields, respectively, of all the nodes of the data graph. We use Dirichlet smoothing for α^n_T and α^n_T̂, as described in Section 7.3.

We use the maximum likelihood estimate. Hence, in each one of Equations (6) and (7), the first probability on the right side is given by

P(q_i \mid v^*_x) = \frac{wtf(q_i, v^*_x)}{\sum_{t \in v^*_x} wtf(t, v^*_x)},    (8)

where we use weighted term frequencies (defined by Equation (5)) and x is either cnt or ttl. The summation in the denominator is over all unigrams t that appear in v*_x and is called the length of v*_x.

In each of Equations (6) and (7), the second probability on the right side (which does the smoothing with the collection) is given by

P(q_i \mid V_x) = \frac{\sum_{u \in V} tf(q_i, u_x)}{\sum_{u \in V} \sum_{t \in u_x} tf(t, u_x)},    (9)

where x is either cnt or ttl.

Unordered bigrams. For an unordered bigram {q_i, q_{i+1}} of Q (similarly to unigrams), we use the following two potential functions for the content and title fields.

f^n_U(q_i, q_{i+1}, v) = \ln[(1 - \alpha^n_U) P(\{q_i, q_{i+1}\} \mid v^*_{cnt}) + \alpha^n_U P(\{q_i, q_{i+1}\} \mid V_{cnt})],    (10)

f^n_{\hat{U}}(q_i, q_{i+1}, v) = \ln[(1 - \alpha^n_{\hat{U}}) P(\{q_i, q_{i+1}\} \mid v^*_{ttl}) + \alpha^n_{\hat{U}} P(\{q_i, q_{i+1}\} \mid V_{ttl})].    (11)

Here, α^n_U and α^n_Û are the smoothing parameters for unordered bigrams. Similarly to unigrams, we use Dirichlet smoothing for these parameters. As earlier, the probabilities in Equations (10) and (11) are derived according to the maximum likelihood estimate. That is, they are given by Equations (8) and (9), respectively, except that we substitute {q_i, q_{i+1}} for q_i and assume that t denotes unordered bigrams (rather than unigrams).

Query independent. We use one query-independent potential function that is given by the node prior. In particular, we assume that the probability of a node v is proportional to its degree in the graph. Hence,

f^n_L(v) = \ln \frac{Deg(v)}{\sum_{u \in V} Deg(u)}.    (12)

Overall, there are five potential functions and, thus, we have to learn five parameters (i.e., λ^n_T, λ^n_T̂, λ^n_U, λ^n_Û and λ^n_L). In the next section, we use the five functions twice: once for selecting roots and a second time for choosing keyword nodes. The learning is done separately for each one of these two cases, as described in Section 7.3.
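As an illustration of how one of these potentials is evaluated, the Python sketch below computes the content-field unigram potential of Equation (6) using the maximum-likelihood estimates of Equations (8) and (9) and the Dirichlet smoothing of Section 7.3. It is a simplified sketch; the precomputed statistics (wtf values, VD length, collection counts) are assumed inputs.

```python
import math

def unigram_potential_cnt(q, wtf_q, vd_length, coll_tf_q, coll_length, mu):
    """f^n_T(q, v) of Equation (6) for the content field.
    wtf_q       : wtf(q, v*_cnt), the weighted term frequency of Equation (5)
    vd_length   : sum of wtf(t, v*_cnt) over all unigrams t (the length of v*_cnt)
    coll_tf_q   : total ordinary frequency of q in the content field of all nodes
    coll_length : total length of the content-field collection V_cnt
    mu          : Dirichlet parameter, giving alpha = mu / (mu + |v*_cnt|)"""
    alpha = mu / (mu + vd_length)
    p_vd = wtf_q / vd_length if vd_length > 0 else 0.0    # Equation (8)
    p_coll = coll_tf_q / coll_length                      # Equation (9)
    return math.log((1 - alpha) * p_vd + alpha * p_coll)
```

The title-field and bigram potentials of Equations (7), (10) and (11) have the same shape and differ only in the statistics that are plugged in.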
5. GENERATING ANSWERS

In this section, we rank nodes according to score(Q, v) of Equation (1), where the potential functions are given by Equations (6), (7), (10), (11) and (12).

We begin by selecting roots and keyword nodes. The former will be roots of answers. The latter will appear in answers as nodes containing keywords of the given query Q = (q_1, ..., q_m). We do it as follows. First, we consider all nodes v of G, such that v* contains every q_i. We rank them according to score(Q, v) and select the top-n. These are the selected roots. Second, we consider the set U of all nodes v, such that v is in r*, where r is a selected root. Let U_i be the subset of U that comprises all nodes v, such that v contains the keyword q_i of Q. For each keyword q_i ∈ Q, we rank the nodes of U_i according to score(Q, v) and choose the top-n. These are the selected keyword nodes (for q_i). Let S be the set consisting of all the selected roots and keyword nodes.

We will construct answers from minimal-weight paths that connect the selected roots and keyword nodes. We want the weight of a path from a root r to a keyword node v to reflect also the scores of its endpoints (i.e., score(Q, r) and score(Q, v)), rather than just the static weights of Section 4.2. Hence, we convert scores into dynamic weights. When converting, we invert the scores, because lower weights are better (whereas it is the opposite for scores). The conversion produces dynamic weights in the interval [0, 1], to make them commensurate with the static weights. Formally, the dynamic weight of a node v ∈ S, denoted by wn_d(v), is

wn_d(v) = 1 - \frac{\max_{u \in S} score(Q, u)}{score(Q, v)}.    (13)

Since score(Q, v) is negative (i.e., obtained by applying logarithm to probabilities), wn_d(v) is in the interval [0, 1]. Note that wn_d(v) = 0 if node v has the highest score in S.

The static weight of a path p_{r→v} is given by Equation (3). The combined weight of p_{r→v}, denoted by w_c(p_{r→v}), also incorporates the dynamic weights of r and v (while omitting their static weights). That is,

w_c(p_{r \to v}) = wn_d(r) + wn_d(v) + \sum_{e \in p} we_s(e) + \sum_{x \in p,\, x \notin \{r, v\}} wn_s(x).    (14)

Let r be a selected root. The set U^r_i consists of all the keyword nodes that were selected for q_i and are in r*. Observe that every combination of m minimal-weight paths, such that each one is from r to a keyword node of U^r_i (1 ≤ i ≤ m), yields an answer to Q = (q_1, ..., q_m). (Formally, such a combination may not create a tree; however, it can be easily modified to form a tree.) For each root r and keyword q_i, we generate these paths and keep them in a separate sorted list.

To generate answers, a priority queue A stores, for each selected root r, the next best answer (with r as the root) that has not yet been added to the output (or discarded if it is not valid). Answers are removed from A by increasing height. Note that the height of a tree is the maximum combined weight over all the paths from the root to some leaf.

An answer a is valid if it satisfies the following conditions. First, a is non-redundant, that is, a does not have a proper subtree that also contains all the keywords of the query. If a is redundant, we convert it to a non-redundant answer by recursively removing the root r, thereby decreasing its height. Second, a is not a duplicate of another answer that is already in the output. Duplicates are removed based on an undirected semantics (to conform to the evaluation framework of Section 7). Third, a does not have a relationship node with fewer than two adjacent entity nodes.
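The sketch below gives the overall flow of the generation step in Python. It is a simplified rendering rather than the authors' algorithm: it takes only the single cheapest path per keyword instead of lazily enumerating all combinations, and it omits the validity checks. The helper best_path(r, u) is assumed to return the combined weight of Equation (14) together with the nodes of a minimal-weight path, or None when u is not in r*.

```python
import heapq

def generate_answers(query, roots, keyword_nodes, best_path, k):
    """Simplified sketch of Section 5: build one candidate answer per selected
    root from minimal-weight root-to-keyword paths and emit answers by
    increasing height (the maximum combined path weight)."""
    heap = []
    for r in roots:
        per_keyword = []
        for q in query:
            paths = [p for p in (best_path(r, u) for u in keyword_nodes[q]) if p is not None]
            if not paths:
                break                                   # r cannot reach every keyword
            per_keyword.append(min(paths, key=lambda wp: wp[0]))
        if len(per_keyword) < len(query):
            continue
        height = max(w for w, _ in per_keyword)
        answer_nodes = set().union(*(set(nodes) for _, nodes in per_keyword))
        heapq.heappush(heap, (height, r, answer_nodes))
    answers = []
    while heap and len(answers) < k:
        height, r, answer_nodes = heapq.heappop(heap)
        answers.append((r, answer_nodes))               # validity checks of Section 5 omitted
    return answers
```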
6. ANSWER POTENTIAL FUNCTIONS

We view an answer a as a document by concatenating the instances of each field over all the nodes of a. Thus, a_cnt and a_ttl are ordinary documents obtained by concatenating the content and title fields, respectively, of all the nodes of a. Similarly to nodes, we define potential functions for unigrams, unordered bigrams and query-independent answer priors. These functions are used for scoring answers with respect to Q by means of Equation (1).

Unigrams. Analogously to Equations (6) and (7), we use the following two potential functions for the content and title fields, respectively.

f^a_T(q_i, a) = \ln[(1 - \alpha^a_T) P(q_i \mid a_{cnt}) + \alpha^a_T P(q_i \mid V_{cnt})],    (15)

f^a_{\hat{T}}(q_i, a) = \ln[(1 - \alpha^a_{\hat{T}}) P(q_i \mid a_{ttl}) + \alpha^a_{\hat{T}} P(q_i \mid V_{ttl})].    (16)

Recall that V_cnt and V_ttl are the collections comprising the content and title fields, respectively, of all the nodes of the data graph.

Unordered bigrams. Similarly to Equations (10) and (11), for an unordered bigram {q_i, q_{i+1}} of Q, we define two potential functions for the content and title fields as follows.

f^a_U(q_i, q_{i+1}, a) = \ln[(1 - \alpha^a_U) P(\{q_i, q_{i+1}\} \mid a_{cnt}) + \alpha^a_U P(\{q_i, q_{i+1}\} \mid V_{cnt})],    (17)

f^a_{\hat{U}}(q_i, q_{i+1}, a) = \ln[(1 - \alpha^a_{\hat{U}}) P(\{q_i, q_{i+1}\} \mid a_{ttl}) + \alpha^a_{\hat{U}} P(\{q_i, q_{i+1}\} \mid V_{ttl})].    (18)

As in Section 4.4, we use the maximum likelihood estimate for the probabilities in the above equations. Since a_cnt and a_ttl are ordinary documents, we employ the usual term frequencies, rather than the weighted ones. We use Dirichlet smoothing for α^a_T, α^a_T̂, α^a_U and α^a_Û (see Section 7.3).

Query independent. We use one query-independent potential function f^a_L that is derived from an answer prior as follows. Let a_r be an answer. The subscript of a_r means that node r is the root. To derive the prior P(a_r), we consider the skeleton s_r that is obtained by deleting the content of a_r. Namely, s_r has the same nodes and edges as a_r, but without attribute-value pairs.

To obtain the probability P(a_r), we assume that a_r is generated in two steps. First, the skeleton s_r is generated with probability P(s_r). Second, s_r is instantiated to a_r with probability P(a_r | s_r); this is also done in two steps. First, the root of s_r is instantiated to a specific node of the data graph G. Second, the following is repeated. After instantiating a node v of s_r to some node v' of G, we select a neighbor of v' for each child of v.

The prior of an answer a_r is P(a_r) = P(a_r | s_r) P(s_r). For simplicity, we assume that P(s_r) is the same for all skeletons, so P(a_r) = P(a_r | s_r). To compute P(a_r | s_r), we make another simplifying assumption, namely, the probability of choosing a specific node u of a_r depends only on its parent, denoted by par(u). In particular, the probability of choosing the root r of a_r depends only on the data graph G. And as in the derivation of Equation (12), it is proportional to the degree of r. Thus,

P(r) = \frac{Deg(r)}{\sum_{v \in V} Deg(v)}.    (19)

For a node u ≠ r of a_r, the probability of choosing u is proportional to its degree when compared with all the adjacent nodes of its parent. Hence,

P(u \mid par(u)) = \frac{Deg(u)}{\sum_{v \in N(par(u))} Deg(v)},    (20)

where N(par(u)) consists of all the neighbors of par(u). Therefore, the probability of a_r is

P(a_r) = P(r) \prod_{u \in a,\, u \neq r} P(u \mid par(u)).    (21)

Note that the product of the conditional probabilities is over all nodes u of a, except the root. Since an answer a is an undirected subtree, we can pick any one of its nodes as the root; hence, we define

f^a_L(a) = \ln\Big(\max_{r \text{ is a node of } a} P(a_r)\Big).    (22)
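A direct way to compute the answer prior is to re-root the (undirected) answer at every one of its nodes, take the product of Equations (19)-(21), and keep the maximum, as prescribed by Equation (22). The Python sketch below assumes the graph helpers used earlier (degree, neighbors); it is illustrative rather than the authors' implementation.

```python
import math

def parents_from_root(answer_nodes, root, graph):
    """Assign each node of the answer its parent with respect to `root` by a
    breadth-first traversal restricted to the answer's nodes."""
    parent = {root: None}
    frontier = [root]
    while frontier:
        nxt = []
        for node in frontier:
            for u in graph.neighbors(node):
                if u in answer_nodes and u not in parent:
                    parent[u] = node
                    nxt.append(u)
        frontier = nxt
    return parent

def answer_prior(answer_nodes, graph, total_degree):
    """f^a_L(a) of Equation (22); `total_degree` is the sum of Deg(v) over all
    nodes of the data graph (the denominator of Equation (19))."""
    best = 0.0
    for root in answer_nodes:
        p = graph.degree(root) / total_degree                    # Equation (19)
        for u, par in parents_from_root(answer_nodes, root, graph).items():
            if par is None:
                continue                                          # skip the root itself
            denom = sum(graph.degree(w) for w in graph.neighbors(par))
            p *= graph.degree(u) / denom                          # Equation (20)
        best = max(best, p)                                       # product of Equation (21), maximized over roots
    return math.log(best)
```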
7. EXPERIMENTS

7.1 The Benchmark

We use the evaluation framework of [5] that was specifically developed for testing the effectiveness of systems for keyword search over data graphs. The framework consists of three datasets: IMDB, Wikipedia and Mondial.

The datasets are given as relations. Each tuple of those relations has a unique id and may have some foreign keys pointing to other tuples. IMDB and Wikipedia contain six relations each, and Mondial contains twenty four relations. IMDB has 1.6M tuples, Wikipedia has 206K tuples and Mondial only 17K tuples. IMDB and Wikipedia contain relatively large chunks of text, while Mondial has only short strings, such as names of countries and cities. The Mondial data graph has, on average, a large number of edges per node, compared with the other two datasets.

For each dataset, the evaluation framework of [5] has fifty queries and their qrels (i.e., query relevance judgments). The average number of keywords per query is 2.91, when excluding five queries of IMDB that consist of very long quotations from movies. The average number of answers per query is 4.49 and the largest number of tuples in an answer is 5.

7.2 System Implementation

We translated each dataset into a data graph, as explained in Section 3.1. Each data graph is indexed into three data structures. First, the graph index comprises the nodes and edges, and is kept in a Berkeley DB (http://www.oracle.com/technetwork/products/berkeleydb). It is used for traversing the data graph and for storing query-independent information that is needed for computing the potential functions. The stored information includes, for example, the static weights and, for each node v, the lengths of v*_cnt and v*_ttl; the latter two are used in Equation (8).

The second data structure, called the node index, is an Apache Lucene (https://lucene.apache.org/) inverted index. It handles each node as a separate document consisting of the title and content fields, after applying stemming and stop-word removal. (We used stop words from http://www.textfixer.com/resources/common-english-words.txt, added some negation words, such as can't and won't, and removed stop words that could be names of people, such as Will.) We use the Lucene Field class to implement those two fields. We keep two instances of the inverted node index, one for unigrams and another for unordered bigrams. The node index also stores the full text of each node, because it is needed for building the virtual index that is described next.

The third data structure, called the virtual index, is a Lucene inverted index for the VDs (virtual documents), one per node. Similarly to the node index, it has title and content fields, and two instances (for unigrams and unordered bigrams). The posting list of a term t keeps, as a Lucene payload, the weighted term frequency of t (see Equation (5)).

Table 1 gives the sizes of the indexes for each dataset. (The sizes of the node and virtual indexes are shown for unigrams; the unordered-bigram indexes are between two and eight times larger than those of the unigrams.) For IMDB, the graph index is relatively big, due to the large number of entities and relationships in that dataset. The node index and virtual index for τ = 1 have similar sizes. This is due to the fact that the node index also stores the full text of the title and content fields (rather than just the inverted lists). The virtual index for τ = 2 is larger by two orders of magnitude than the one for τ = 1.

Table 1: Index sizes (in MB)

             raw data   graph index   node index   virtual index (τ = 1)   virtual index (τ = 2)
  Mondial           2          22.8          4.3                     6.1                     137
  Wikipedia       220           330          316                     299                     23K
  IMDB            304          1800          519                     734                     40K

The selection of roots and keyword nodes in Section 5 is efficiently implemented as follows. Let Q = (q_1, ..., q_m) be the given query. In Lucene, we run Q as a boolean conjunctive query on the virtual index and find all nodes v, such that the VD v* contains all the q_i. We rank those nodes by score(Q, v) of Equation (1) and the potential functions of Section 4.4; the top-n are the selected roots. We select the top-n keyword nodes for each q_i (1 ≤ i ≤ m) as follows. Using the node index and the Filter tool of Lucene, we run a Lucene query to find all the nodes that contain q_i and are in the VD of some selected root, and then choose the top-n according to score(Q, v).
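Lucene specifics aside, the selection step amounts to a boolean conjunction over the virtual index followed by per-keyword lookups over the node index. The sketch below expresses that logic over plain Python dictionaries (vd_postings and node_postings map a term to the set of nodes whose VD, or whose own text, contains it); it is a schematic rendering, not the actual Lucene implementation.

```python
def select_roots_and_keyword_nodes(query, vd_postings, node_postings, vd_of, score, n):
    """Selection step of Sections 5 and 7.2: top-n roots whose VD contains every
    keyword, then top-n keyword nodes per keyword among the selected roots' VDs.
    `score(v)` stands for score(Q, v) of Equation (1) with the node potentials,
    and `vd_of(r)` returns the node set of r* (Section 4.1)."""
    candidates = set.intersection(*(vd_postings[q] for q in query))   # boolean conjunctive query
    roots = sorted(candidates, key=score, reverse=True)[:n]
    covered = set().union(*(vd_of(r) for r in roots)) if roots else set()
    keyword_nodes = {}
    for q in query:
        hits = [v for v in node_postings[q] if v in covered]
        keyword_nodes[q] = sorted(hits, key=score, reverse=True)[:n]
    return roots, keyword_nodes
```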
7.3 Parameters Tuning

We use Equation (1) in three places: first, to select the top-n roots; second, to select the top-n keyword nodes; and finally, to rank the top-k answers. Thus, we have to learn the parameters λ_T, λ_T̂, λ_U, λ_Û and λ_L for each of the three cases. We use the coordinate ascent algorithm [20], as implemented in RankLib (http://sourceforge.net/p/lemur/wiki/RankLib/), to learn the parameters.

We generated three labeled files (for roots, keyword nodes and answers) with positive and negative examples for each query as follows. The qrels in the benchmark [5] contain only the correct answers, which are labeled with 1. For each node v of those answers, if v can be a root (i.e., its VD contains all the query keywords) or a keyword node, then it is added to the corresponding file with the label 1. To add negative examples, we ran our system with equal weights of 0.2 for all of the five parameters (when scoring roots, keyword nodes and answers). We used the default values of n = 1,000 and k = 1,000. Among the top-k answers obtained in this way, we chose those that are not in the qrels and labeled them with 0. Each node that can be a root or a keyword node of an answer that is not in the qrels was also labeled with 0 and added to the corresponding file, provided that it had not previously been labeled with 1.

Our experiments use cross validation. Namely, we learn the parameters on two datasets and use them on the third one. We report the results for each dataset with those learned parameters. The files with both 0 and 1 labels are just for training. The MAP is measured on the third dataset by using the original qrels of the evaluation framework.

We use Dirichlet smoothing with the parameter μ / (μ + |x|), where |x| is the length of x. For α^n_T, α^n_T̂, α^n_U and α^n_Û in Equations (6) and (10), the variable x ranges over v*_cnt, whereas in Equations (7) and (11), it ranges over v*_ttl. The parameter μ is the average of |x| over all v*_cnt and v*_ttl when smoothing v*_cnt and v*_ttl, respectively.

For the parameters α^a_T, α^a_T̂, α^a_U and α^a_Û in Equations (15) and (17), the variable x ranges over the a_cnt field of the top-k answers, whereas in Equations (16) and (18), it ranges over their a_ttl field. We view an answer a as an ordinary document. But since answers have a variable number of nodes, we define the length of a_f (where f is either cnt or ttl) as the average per node; namely, (1/s) \sum_{t \in a_f} tf(t, a_f), where s is the number of nodes of a. When smoothing a_f, the parameter μ is the average of |v_f| over all nodes v of the data graph; similarly for unordered bigrams.

The diameter of VDs is τ = 1 (see Section 4.1). Unless otherwise specified, when selecting the top-n roots and keyword nodes for each q_i (see Section 5), we use n = 1,000.
7.4 The Setup of the Experiments

We compare our approach with the top systems, namely, BANKS [2], Bidirectional [13] and CD [6], among those tested in [5], as well as with GraphLM [19]. In [7], they extended the binary relevance set of [5] to include answers with marginal relevance. However, they have not made that extended framework available, so we cannot use it.

Similarly to [5, 19], we use the default k = 1,000 when producing the top-k answers for each query.

7.5 Comparison with the State of the Art

We compare our approach, called MRF-KS, with the state-of-the-art systems using the evaluation framework of [5]. (For all the systems, the MAP on IMDB is based on the updated qrels that appear in http://www.cs.virginia.edu/~jmc7tp/resources.php#search.) Figure 2 shows that MRF-KS outperforms the other systems on each of the three datasets (we show only the top-performing state-of-the-art systems). The second best is GraphLM. On Wikipedia, MRF-KS achieves a MAP of 0.76 compared with 0.63 for GraphLM (an improvement of 20%). On IMDB, MRF-KS has a MAP of 0.76 compared with 0.68 for GraphLM (an improvement of 11.3%). And on Mondial, MRF-KS has a MAP of 0.89 compared with 0.83 for GraphLM (an improvement of 7.6%).

Figure 2: MAP of MRF-KS and state-of-the-art systems (k = 1,000, n = 1,000)

The advantage of MRF-KS over the second-best system, GraphLM, is statistically significant on all three datasets, as measured in a one-tailed t-test (p-value < 0.05). We exclude the work of [3] from this comparison due to the following. Some essential details (e.g., the algorithm for generating answers in ranked order) are missing from their paper, which has made it impossible for us to reproduce their results. Moreover, they declined our request to get their code or, at least, the AP (average precision) they obtained for each individual query of the evaluation framework of [5]; hence, verifying their results is problematic. In any case, they reported that for Mondial, Wikipedia and IMDB, they got a MAP of 0.9, 0.78 and 0.79, respectively. Our results are almost the same: 0.89, 0.76 and 0.76, respectively.

7.6 The Effect of Different Components

We now show the effect of various components of our system by measuring the MAP yielded by different configurations, as shown in Figure 3. Unless otherwise specified, VDs have the default diameter of τ = 1. To magnify the effect of the different configurations, we select a relatively small number (n = 100) of roots and keyword nodes. For each configuration, we relearned the relevant parameters on two datasets and then measured the MAP on the third one, as explained in Section 7.3.

Figure 3: Effect of individual features (k = 1,000, n = 100)

The first two columns show that increasing the diameter τ of VDs from 1 to 2 slightly lowers the MAP. The third column (labeled noWTF) is when the diameter is 2 and ordinary term frequencies (instead of the weighted ones) are used. In this case, the MAP drops on all three datasets and the larger effect is on Wikipedia (from 0.74 to 0.66).

The rest of the columns show an ablation test that disables one feature at a time. The column noBigram is when disabling the unordered bigrams in the content and title fields (setting λ_U, λ_Û = 0) for roots, keyword nodes and answers. The column noTitle denotes the effect of disabling the title field for unigrams and unordered bigrams (setting λ_T̂, λ_Û = 0) for roots, keyword nodes and answers. The column noAnswerPriors is when using all features except the answer priors (setting λ_L = 0). The last column onlyAnswerPriors shows the MAP when using only answer priors for ranking the answers (setting λ_T, λ_U, λ_T̂, λ_Û = 0 for answers and keeping all features for roots and keyword nodes).

We can see that the most dominant feature is the answer priors. When disabled, the MAP drops sharply on all three datasets. For example, on Wikipedia it drops from 0.75 to 0.6 and on IMDB from 0.76 to 0.56. The contribution of the answer priors is statistically significant.
The title field has the second largest effect on IMDB and Wikipedia. The unordered bigrams have a moderate effect on Mondial and Wikipedia, but a larger one on IMDB.

7.7 Efficiency vs. Effectiveness

In this section, we show that our method offers a useful trade-off between effectiveness and efficiency that can be easily tuned, thereby substantially improving the running time while only slightly lowering the MAP. This ability is hardly found in any other system. Figure 4 gives the average running time per query and the MAP for different values of n (i.e., the number of selected roots and keyword nodes). The results are for k = 100, rather than the default k = 1,000, thereby making the results more significant.

Figure 4: Effect of n (number of selected roots and keyword nodes) for k = 100

Figure 4 shows that the running time drops at a much faster rate than the MAP as n gets smaller. On the three datasets, even for the small value of n = 50, the MAP is almost the same as for n = 1,000. Hence, the parameter n enables us to substantially increase the efficiency without sacrificing effectiveness.

8. CONCLUSIONS

We presented a novel approach, couched in probability theory, to finding the top-k answers in keyword search over data graphs. It is based on new ideas and concepts. First, we showed how to estimate the prior of an answer (i.e., subtree) and showed its contribution to the final ranking of answers. Second, we defined virtual documents with weighted term frequencies and showed their effectiveness in selecting the most promising roots and keyword nodes of candidate answers. Third, we presented an efficient algorithm for generating and ranking answers. The ranking of nodes (i.e., roots and keyword nodes) and answers is based on a combination of query-dependent and query-independent features.

We compared our approach with other systems on the evaluation framework of [5] that consists of three datasets: IMDB, Wikipedia and Mondial. In terms of MAP, our approach has a statistically significant advantage over the other tested systems on each of these datasets. For example, on Wikipedia, we achieved an improvement of 20% compared with the best state-of-the-art system.

We performed an extensive analysis of the contribution of the various components. The most significant feature is the answer priors. We further showed that the MAP remains almost the same even when selecting a small number of keyword nodes and roots, thereby reducing the search space and increasing the efficiency.

9. ACKNOWLEDGMENTS

The authors thank Oren Kurland for helpful comments.
10. REFERENCES

[1] M. Bendersky, W. B. Croft, and Y. Diao. Quality-biased ranking of web documents. In WSDM, 2011.
[2] G. Bhalotia, A. Hulgeri, C. Nakhe, and S. Chakrabarti. Keyword searching and browsing in databases using BANKS. In ICDE, pages 431-440, 2002.
[3] V. Bicer, T. Tran, and R. Nedkov. Ranking support for keyword search on structured data using relevance models. In CIKM, 2011.
[4] R. Blanco, P. Mika, and S. Vigna. Effective and efficient entity search in RDF data. In ISWC, 2011.
[5] J. Coffman and A. C. Weaver. A framework for evaluating database keyword search strategies. In CIKM, 2010.
[6] J. Coffman and A. C. Weaver. Structured data retrieval using cover density ranking. In KEYS, pages 115-126, 2010.
[7] J. Coffman and A. C. Weaver. Learning to rank results in relational keyword search. In CIKM, 2011.
[8] S. Elbassuoni and R. Blanco. Keyword search over RDF graphs. In CIKM, 2011.
[9] B. He and I. Ounis. Combining fields for query expansion and adaptive query expansion. Information Processing and Management, 43:1294-1307, 2007.
[10] H. He, H. Wang, J. Yang, and P. S. Yu. BLINKS: Ranked keyword searches on graphs. In SIGMOD, 2007.
[11] D. Hiemstra. Statistical language models for intelligent XML retrieval. In Intelligent Search on XML Data, LNCS 2818, Springer-Verlag, Berlin Heidelberg, 2003.
[12] V. Hristidis, L. Gravano, and Y. Papakonstantinou. Efficient IR-style keyword search over relational databases. In VLDB, pages 850-861, 2003.
[13] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, and R. Desai. Bidirectional expansion for keyword search on graph databases. In VLDB, pages 505-516, 2005.
[14] J. Kamps, G. Mishne, and M. de Rijke. Language models for searching in Web corpora. In TREC, 2004.
[15] J. Y. Kim and W. B. Croft. A field relevance model for structured document retrieval. In ECIR, pages 97-108, 2012.
[16] N. Lao and W. W. Cohen. Relational retrieval using a combination of path-constrained random walks. Machine Learning, 81:53-67, 2010.
[17] Y. Luo, X. Lin, W. Wang, and X. Zhou. Top-k keyword query in relational databases. In SIGMOD, pages 115-126, 2007.
[18] Y. Lv and C. Zhai. Positional language models for information retrieval. In SIGIR, 2009.
[19] Y. Mass and Y. Sagiv. Language models for keyword search over data graphs. In WSDM, pages 363-372, 2012.
[20] D. Metzler and W. B. Croft. A Markov random field model for term dependencies. In SIGIR, 2005.
[21] Q. Su and J. Widom. Indexing relational database content offline for efficient keyword-based search. In IDEAS, pages 297-306, 2005.