          Towards open-source shared implementations of
          keyword-based access systems to relational data

           Alex Badan, Luca Benvegnù, Matteo Biasetton, Giovanni Bonato, Alessandro
       Brighente, Alberto Cenzato, Piergiorgio Ceron, Giovanni Cogato, Stefano Marchesin,
        Alberto Minetto, Leonardo Pellegrina, Alberto Purpura, Riccardo Simionato, Nicolò
            Soleti, Matteo Tessarotto, Andrea Tonon, Federico Vendramin, Nicola Ferro
                                                    University of Padua, Italy
                  {alex.badan, luca.benvegnu.2, matteo.biasetton, giovanni.bonato,
             alessandro.brighente, alberto.cenzato, piergiorgio.ceron, giovanni.cogato.1,
              stefano.marchesin, alberto.minetto, leonardo.pellegrina, alberto.purpura,
               riccardo.simionato.1, nicolo.soleti, matteo.tessarotto.1, andrea.tonon.3,
                                federico.vendramin}@studenti.unipd.it
                                          ferro@dei.unipd.it

ABSTRACT

Keyword-based access systems to relational data address a challenging and important issue, i.e. letting users exploit natural language to access databases whose schema and instance are possibly unknown. Unfortunately, there are almost no shared implementations of such systems and this hampers the reproducibility of experimental results. We explore the difficulties in reproducing such systems and share implementations of several state-of-the-art algorithms.

1. INTRODUCTION

Keyword-based access to relational data is a key technology for lowering the barriers of access to the huge amounts of data managed by databases. It is an extremely difficult and open challenge [1], since it has to face the "conflict of impedance" between vague and imprecise user information needs and rigorously structured data, allowing users to express their queries in natural language against a potentially unknown database.

Improving a technology is greatly eased by the existence of a reference architecture paired with a proper evaluation framework, in order to have a coherent vision of the components to leverage and to precisely measure and track performance over time. Unfortunately, the situation is very fragmented in the case of keyword-based access to relational data, since both a fully-fledged reference architecture and an evaluation methodology are still missing [2].

In this paper, we start to investigate the problem of the reproducibility of experimental results for keyword-based access systems. Reproducibility is becoming a primary concern in many areas of science and it is key both for fully understanding state-of-the-art solutions and for being able to compare against and improve over them [6, 7]. However, there is a lack of commonly shared open-source platforms implementing state-of-the-art algorithms for keyword-based access to relational data, as Terrier (http://www.terrier.org/), for example, is in the information retrieval field; on the contrary, each system mostly needs to be re-implemented from scratch [4].

Therefore, as part of a student project during the course on databases of the master degree in computer science at the University of Padua, we considered several state-of-the-art algorithms and implemented them from scratch, in order to understand the difficulties and pitfalls in implementing them and to move towards a shared implementation.

The paper is organized as follows: Section 2 presents the related work; Section 3 introduces the implementation of the different algorithms; Section 4 reports the lessons learned; finally, Section 5 wraps up the discussion and outlines some future work.

2017, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

2. RELATED WORKS

In the following, we briefly summarize the algorithms implemented in this paper. For a broader perspective on keyword search over relational databases, please refer to [13, 14].

2.1 Graph-based Approaches

These algorithms start with the construction of a graph where nodes are tuples and edges are the foreign key constraints; their goal is to find a set of trees connecting a group of nodes that together contain all the keywords in an input query. The weights between nodes and the graph type (directed or undirected) depend on each specific approach. Moreover, these algorithms differ in whether they use Dijkstra-based or Steiner-tree methods to visit the graph: the former finds the shortest spanning tree using only the selected nodes, i.e. those that match the query keywords; the latter may also use extra nodes which are not included in the previous selection but exist in the original graph.

Browsing ANd Keyword Searching I (BANKS-I) [3] The algorithm adds a backward weighted edge for each directed edge in the graph and assigns a relevance score to each node in it. Then, it proceeds by finding the shortest path from each keyword node to all the other nodes in the graph. Each path is computed by running Dijkstra's single-source shortest-path algorithm starting from each one of the keyword nodes and traversing all the edges in reverse direction. This process is called backward expanding search. When multiple paths intersect at a common node r in the graph, the resulting tree with root r and the nodes containing the keywords as leaves is examined. If the tree contains all the keywords in the input query, then its weight is computed and the tree is added to a heap data structure of fixed size, which contains all the result trees in decreasing order of relevance. When the heap is full or all the trees have been generated, the most relevant trees are returned until a predefined number of results has been reached.

Browsing ANd Keyword Searching II (BANKS-II) [10] Bidirectional search improves on BANKS-I by allowing forward search from potential roots towards other keyword nodes. The algorithm uses two concurrent iterators, called outgoing and ingoing, to explore the graph in both directions. Both iterators use Dijkstra's single-source shortest-path approach. BANKS-II also allows for preferential expansion of paths that have less branching, using a spreading activation mechanism to prioritize the most promising paths and to avoid wasteful exploration of the graph.

Dynamic Programming Best-First (DPBF) [5] It tries to build a minimum-cost Steiner tree: for each neighbour v of the root in the graph, it creates a new tree with v as the new root; the algorithm tries to merge trees with the same root and a different set of keywords. If a tree contains all the searched keywords, it is a solution for the algorithm. Given a root and a specific set of keywords, DPBF keeps only the tree with the lowest weight, i.e. the best one for that specific root node and set of keywords, and it guarantees the solution to be total, minimal and controlled in size.

Steiner-Tree Approximation in Relationship graphs (STAR) [11] It initially looks for tuples containing one or more of the query's keywords (let us call this set V′ ⊂ V) and then tries to connect these tuples into a least-weight tree, i.e. computing the best Steiner tree between the given nodes, whose total weight is Σ_{e∈E} w(e). In order to build a first interconnecting tree, STAR relies on a strategy similar to BANKS-I but, instead of running single-source shortest-path iterators from each node of V′, STAR runs a simple breadth-first search from each terminal, called in a round-robin manner. As soon as the iterators meet, a result is constructed. In the second phase, STAR aims at improving the current tree iteratively by replacing certain paths in the tree with new, lower-weight paths from the underlying graph.

2.2 Schema-based Approaches

Both DISCOVER-I and II use a generalized inverted index, called Master Index and IR Engine respectively, to retrieve from the database the tuples containing the input keywords given by the user. Both output a list of Candidate Networks, albeit defined slightly differently.

DISCOVER-I [9] Given a relational database and a set of keywords K, Discover exploits the database schema to generate a graph G_TS: from the Basic Tuple Sets R_i^{k_j}, the sets of tuples of the relation R_i containing the keyword k_j ∈ K, ∀i, j, it creates a node for each Tuple Set

    R_i^{K′} = ∩_{k ∈ K′} R_i^k − ∪_{k ∉ K′} R_i^k ,  ∀K′ ⊆ K        (2.1)

i.e. the sets of tuples of the relation R_i containing all and only the keywords of K′, and one for each Free Tuple Set, i.e. the database relations. Edges are representative of each PK-to-FK relationship. The algorithm then computes a complete, non-redundant set of Candidate Networks up to a maximum number of joins T, pruning all the joining networks of tuples that are not minimal (i.e. with no keyword-free tuples as leaves and no tuple appearing twice) or not total (i.e. not containing every keyword in K, AND semantics). To evaluate the Candidate Networks efficiently, Discover eventually computes and stores a list of intermediate results of common join subexpressions. Finding these intermediate results is an NP-complete problem, so it uses an iterative greedy algorithm based on frequencies: the most frequent join expression is stored in a view at every iteration and then reused when possible. The results are finally displayed to the user ordered by size.

DISCOVER-II [8] The algorithm uses an IR engine module when a query Q is issued; in this way it extracts from each relation R the tuple set R^Q with Score > 0, where Score is a value assigned to each tuple by an IR-style relevance scoring function. Candidate Networks, i.e. join expressions used to create joining trees of tuples which are potential answers to the issued query, are then generated based on the tuple sets returned by the IR engine and the database schema. The final step is to identify the top-k results, a task performed by the execution engine module, which uses either the Naive algorithm, which takes the top-k results from each CN and combines them in a sort-merge manner to identify the final top-k results of the given query, or the Sparse algorithm, which computes a bound on the maximum possible score of the tuple trees derived from a CN.

3. IMPLEMENTATION

Students split up in groups of 3-4 people and each group was responsible for the implementation of one algorithm. All the algorithms have been implemented in Java, while PostgreSQL has been used as the relational engine.

The implementation of the different algorithms is available as open source under the BSD license (https://bitbucket.org/ks-bd-2015-2016/ks-unipd).

We briefly summarize the main features of each implementation:

   • BANKS-I The required graph is created prior to the execution of the algorithm by executing multiple queries on the selected database, without any prior knowledge of its schema. In our implementation, for each node in the graph, only the ID of the tuple is saved and an external file is used to store the information. The graph is represented using adjacency lists. In order to speed up the creation of the graph, we used the fastutil library (http://fastutil.di.unimi.it/),
in addition to the Java Standard Library. Finally, to optimize the performance while executing multiple instances of Dijkstra's algorithm (one for each keyword node), we chose to compute the next node to be returned by the algorithm only when it was required. In fact, most of the time, only a handful of nodes were required to build a valid result tree. With this approach we avoided executing Dijkstra's algorithm on the whole graph, optimizing both the execution time and the memory usage. For the execution of each query, the memory limit was set to 5 GB while the maximum running time was set to 1 hour.

   • BANKS-II We divided our implementation in two phases: pre-processing and the BANKS-II execution. We implemented the graph in an object-oriented manner. The vertexes are stored using a two-layered HashMap, to improve performance and to reduce as much as possible the probability of collisions. The first layer refers to the relation name, while the second layer uses the primary key of the tuple as index. The paper suggested a set of different weight assignment strategies (e.g. PageRank); we decided to use equally weighted edges for simplicity. Once the user provides the keywords, the pre-processing ends with the creation of the matching sets. These sets are also organized in a two-layered HashMap data structure: vertexes are inserted in the respective matching sets using SQL queries to find tuples with attributes having matching values. During the BANKS-II execution, outgoing and ingoing iterators start to explore the graph, using two priority queues to extract the highest activated node. We created these queues using the standard Java PriorityQueue class, defining a custom compare method. When a new solution (a rooted tree connecting all the matching sets) is found, we store it in a PriorityQueue; these results are ranked with a score given by the overall sum of the edge weights. Our implementation creates the graph from the DB once, and then accepts keywords from the user to perform multiple searches.

   • DISCOVER-I We first implemented the Master Index to perform the full-text search through the to_tsvector function provided by PostgreSQL, and then we created the initial graph G_TS using an adjacency matrix. We observed that a lot of queries were repeated while computing the Tuple Sets, because the same Basic Tuple Sets were created many times by the DBMS "on the fly". Thus we decided to store a new table for each Basic Tuple Set, to be reused while performing the unions and intersections of (2.1). To further improve the performance of the full-text search, we preprocessed the main search fields to remove English stopwords and to convert all tokens to lowercase characters. We then stored the lexemes in a dedicated Generalized Inverted Index. The Candidate Network algorithm then prunes, expands and accepts the Candidate Networks using static methods. The accepted ones (treated as graphs) were rewritten by the Plan Generator, replacing repeated joins inside them (i.e. the same couples of nodes and the connecting edge) with a reference (i.e. a new node) to a new temporary table. Finally, queries were evaluated from the resulting graphs with a BFS algorithm, in order to guarantee the correct order of joins.

   • DISCOVER-II The graph was built using the information schema and the algorithm was implemented in Java using native collections. The IR Engine module identifies all database tuples that have a non-zero score for a given query; each occurrence of a keyword is recorded as a tuple-attribute pair. Moreover, the IR Engine relies on an inverted index that associates each keyword appearing in the database with a list of occurrences of the keyword, which we implemented using the GIN index of PostgreSQL. To obtain better performance, we store as a view the tuple sets identified for a query Q. The Candidate Network Generator involves a combination of free and non-free tuple sets: the former represent the tuple sets with no occurrences of the query keywords, which help connect the non-free tuple sets via foreign-key joins; the latter contain at least one keyword. The Execution Engine module receives as input the set of CNs along with the non-free tuple sets, and contacts the RDBMS's query execution engine repeatedly to identify the top-k query results. We implemented two algorithms. The Naive algorithm, considering a Boolean-OR semantics, issues a SQL query for each CN to answer a top-k query, then combines the results in a sort-merge manner to identify the final top-k results of the query. The Sparse algorithm discards at any point in time any (unprocessed) CN that is guaranteed not to produce a top-k match for the query: it computes a bound MPS_i on the maximum possible score of a tuple tree derived from a CN C_i. If MPS_i does not exceed the actual score of k already produced tuple trees, then CN C_i can be safely removed from further consideration. MPS_i is the score of a hypothetical joining tree of tuples T that contains the top tuples from every non-free tuple set in C_i. The CNs for a query are evaluated in ascending size order, so that the smallest CNs, i.e. the least expensive to process and the most likely to produce high-score tuple trees, are evaluated first.

   • DPBF The project has been divided into two phases: creation and serialization of the database graph, and the DPBF implementation. To quickly get all the vertexes containing the query's keywords, we used HashMap data structures to store the graph and the sets of vertexes associated with their words. Once the graph is completed, our algorithm gets the vertexes from which to start building the resulting trees. (i) For each of these we move up one level, removing similar trees with higher costs, and then (ii) all trees with the same root node and different sets of keywords are merged. Steps (i) and (ii) are repeated until at least one result is found. We had to implement some Java data structures to do that correctly and efficiently.

   • STAR has been implemented in a schema-agnostic way, so the first step of the algorithm loads all the DB metadata needed to build the tree. Given an i-keyword query, we look for the starting tuples in a multithreaded manner. All those tuples are stored in an ad-hoc graph data structure based on the Java TreeSet class. Considering these tuples as expanding trees, we try to connect them following both ingoing and outgoing foreign keys, dynamically loading referenced tuples from the DB, until we get a tree containing all the query's keywords. Given that our test DBs were not weighted, we decided to set
w(e) = 1 ∀e ∈ E. At this point we worked on this tree to optimize its weight by reshaping its structure: we iteratively remove the highest-weight path, replacing it with a lower-weight one.

4. LESSONS LEARNED

For each examined system, we briefly summarize the difficulties and lessons learned in reproducing it.

   • BANKS-I During the implementation of the algorithm, we encountered some difficulties due to missing information in the descriptions of crucial parts of the algorithm. The technique chosen to efficiently run multiple instances of Dijkstra's single-source shortest-path algorithm was not described in the original paper, even though the choice of one implementation over another considerably influenced the memory usage and the execution time. We also encountered some difficulties due to the lack of precision in the explanation of how to evaluate the result trees. In fact, the way to handle some particular cases, such as nodes containing all the keywords in the input query or trees composed of only two nodes, was not described. It became noticeable that the major weakness of the algorithm was the memory occupation, which became more evident when testing it on larger datasets.

   • BANKS-II We found the paper sufficiently clear in defining the Bidirectional Search algorithm and the search approach; however, the implementation of some procedures has not been trivial. As an example, the paper does not state a proper algorithm to update the reached ancestors' distances and activations when adding a new node to the new better path (the ATTACH and ACTIVATE functions). Therefore, we opted for a recursive solution for these procedures. To increase the performance of the searches, we opted for two heuristic solutions: we set a lower maximum distance for the answer trees than the one suggested by the paper, and we set a minimum threshold value for the activation of a node, to avoid wasteful spreading across the graph.

   • DISCOVER-I The paper provided was heavily focused on a theoretical perspective, thus including a lot of formal definitions of problems and structures, but lacking practical directives. Certain parts of the algorithm, like the Master Index, were almost treated like a black box. Specifically, Oracle8i interMedia Text was used, but we had to implement our own Master Index using the full-text search provided by PostgreSQL. We made our own assumptions to implement it, for example by searching and indexing text fields only and removing English stop words. This probably led to different execution times, making comparisons of time performance with the ones in [4] less significant. This is reasonable because of the larger impact that the indexes have on the query computation compared to the advantages brought by the introduction of the Plan Generator and its execution. Moreover, the Plan Generator was challenging to implement, because Candidate Networks may contain joins of tuples from the same relation through an entry of another one. This operation is refused by the DBMS, because the resulting table would have repeated column names, forcing us to change them to be unique. Finally, there was no suggestion about how to evaluate the final queries: an order in which to consider the tables is needed, with join operations on the correct columns. For this reason, we decided to evaluate every Candidate Network with a BFS algorithm, which guarantees correctness.

   • DISCOVER-II We observed that in a dense dataset such as IMDB the execution time using the OR semantics grows exponentially, so queries do not give results in a useful time with more than 5 keywords. If we used an AND semantics, since it keeps more CNs, a post-processing on the stream of results would be needed, which also worsens the performance. Furthermore, it is not clear whether the authors keep the duplicate CN queries or remove them beforehand (it is not specified whether a Set or Bag semantics is used), and we found biased queries with the inclusion of relation names in them, e.g. char_name. We chose not to implement the AND semantics because we consider such a method highly inefficient. For better performance, we decided to use a HashMap in the score function, filled when the lists were created.

   • DPBF During the algorithm implementation we faced three different problems: making the graph creation efficient, making the algorithm implementation efficient, and using Java and its data structures. We tried to make the graph creation as fast as possible, assuming that it is created only once; so we serialized the graph into a file. The main cause of problems was the lack of details in the paper: only some general cases were described, ignoring specific ones. The last class of problems deals with Java. DPBF's main data structure is a priority queue, based on the trees' weights. Unfortunately, the Java PriorityQueue does not have a useful interface compared to our needs; furthermore, it gave us a lot of problems in sorting weights.

   • STAR In our work we had to deal with some difficulties. The lack of implementation details in [11] gave us a wide range of possibilities, especially in the tree-building phase. The most interesting and challenging issue concerned the choice of the keyword-containing tuples: if some of the DB's columns contain repeated strings, searching a keyword results, with high probability, in many matching tuples, which causes longer execution times and lower accuracy for some queries. To avoid these issues, we tried to add some heuristics, sorting the expanding trees by the number of query keywords contained, i.e. trees with more keywords are expanded first. Furthermore, to increase DB throughput, we used multiple threads and connections to the DB during the query execution. Since the Java standard library does not have a graph data structure, we first tried the JGraphT library but, after some benchmarks, we decided to write our own graph class based on Java TreeSet.

5. CONCLUSIONS AND FUTURE WORK

The paper briefly summarized an ongoing project for releasing an open-source Java implementation of the most common algorithms for keyword-based search over relational data. The project has been initiated as a Master students
project where teams of 3-4 students implemented each algorithm and noted the difficulties and issues in trying to reproduce them. All the developed implementations are available as open source at: https://bitbucket.org/ks-bd-2015-2016/ks-unipd.

We are currently working on testing all these different implementations on the same hardware in order to compare their efficiency. We are also processing the datasets prepared by [4] in order to make them closer to the standard practices adopted in Information Retrieval (IR) evaluation [12], and then we will systematically investigate the effectiveness of the implemented algorithms.

Finally, we will work on harmonizing the different implementations, factoring out commonly used data structures and shared functionalities, in order to move towards a unified open-source framework for keyword-based search over relational data.

6. REFERENCES

[8] V. Hristidis, L. Gravano, and Y. Papakonstantinou. Efficient IR-Style Keyword Search over Relational Databases. In J. C. Freytag, P. C. Lockemann, S. Abiteboul, M. J. Carey, P. G. Selinger, and A. Heuer, editors, Proc. 29th International Conference on Very Large Data Bases (VLDB 2003), pages 850–861. Morgan Kaufmann Publishers, Inc., San Francisco, CA, USA, 2003.

[9] V. Hristidis and Y. Papakonstantinou. DISCOVER: Keyword Search in Relational Databases. In P. A. Bernstein, Y. E. Ioannidis, R. Ramakrishnan, and D. Papadias, editors, Proc. 28th International Conference on Very Large Data Bases (VLDB 2002), pages 670–681. Morgan Kaufmann Publishers, Inc., San Francisco, CA, USA, 2002.

[10] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar. Bidirectional Expansion For Keyword Search on Graph Databases. In K. Böhm, C. S. Jensen, L. M. Haas, M. L. Kersten, P.-A. Larson, and B. Chin Ooi, editors, Proc. 31st
 [1] D. Abadi, R. Agrawal, A. Ailamaki, M. Balazinska,                 International Conference on Very Large Data Bases
     P. A. Bernstein, M. J. Carey, S. Chaudhuri, J. Dean,              (VLDB 2005), pages 505–516. ACM Press, New York,
     A. Doan, M. J. Franklin, J. Gehrke, L. M. Haas, A. Y.             USA, 2004.
     Halevy, J. M. Hellerstein, Y. E. Ioannidis, H. V.            [11] G. Kasneci, M. Ramanath, M. Sozio, F. M. Suchanek,
     Jagadish, D. Kossmann, S. Madden, S. Mehrotra,                    and G. Weikum. STAR: Steiner-Tree Approximation
     T. Milo, J. F. Naughton, R. Ramakrishnan, V. Markl,               in Relationship Graphs. In Z. Li, P. S. Yu, Y. E.
     C. Olston, B. C. Ooi, C. Rè, D. Suciu,                           Ioannidis, D. Lee, and R. Ng, editors, Proc. 25th
     M. Stonebraker, T. Walter, and J. Widom. The                      International Conference on Data Engineering (ICDE
     Beckman Report on Database Research. ACM                          2009), pages 868–879. IEEE Computer Society, Los
     SIGMOD Record, 43(3):61–70, September 2014.                       Alamitos, CA, USA, 2009.
 [2] S. Bergamaschi, N. Ferro, F. Guerra, and G. Silvello.        [12] M. Sanderson. Test Collection Based Evaluation of
     Keyword-based Search over Databases: A Roadmap                    Information Retrieval Systems. Foundations and
     for a Reference Architecture Paired with an                       Trends in Information Retrieval (FnTIR),
     Evaluation Framework. LNCS Transactions on                        4(4):247–375, 2010.
     Computational Collective Intelligence (TCCI),
                                                                  [13] H. Wang and C. C. Aggarwal. A Survey of Algorithms
     9630:1–20, 2016.
                                                                       for Keyword Search on Graph Data. In C. C.
 [3] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti,                Aggarwal and H. Wang, editors, Managing and
     and S. Sudarshan. Keyword Searching and Browsing                  Mining Graph Data, pages 249–273. Springer-Verlag,
     in Databases using BANKS. In R. Agrawal,                          New York, USA, 2010.
     K. Dittrich, and A. H. H. Ngu, editors, Proc. 18th
                                                                  [14] J. X. Yu, L. Qin, and L. Chang. Keyword Search in
     International Conference on Data Engineering (ICDE
                                                                       Databases. Morgan & Claypool Publishers, USA, 2010.
     2002), pages 431–440. IEEE Computer Society, Los
     Alamitos, CA, USA, 2002.
 [4] J. Coffman and A. C. Weaver. An Empirical
     Performance Evaluation of Relational Keyword Search
     Techniques. IEEE Transactions on Knowledge and
     Data Engineering (TKDE), 1(26):30–42, 2014.
 [5] B. Ding, J. Xu Yu, S. Wang, L. Qin, X. Zhang, and
     X. Lin. Finding Top-k Min-Cost Connected Trees in
     Databases. In R. Chirkova, A. Dogac, M. T. Özsu, and
     T. Sellis, editors, Proc. 23rd International Conference
     on Data Engineering (ICDE 2007), pages 836–845.
     IEEE Computer Society, Los Alamitos, CA, USA,
     2007.
 [6] N. Ferro. Reproducibility Challenges in Information
     Retrieval Evaluation. ACM Journal of Data and
     Information Quality (JDIQ), 8(2):8:1–8:4, January
     2017.
 [7] N. Ferro, N. Fuhr, K. Järvelin, N. Kando, M. Lippold,
     and J. Zobel. Increasing Reproducibility in IR:
     Findings from the Dagstuhl Seminar on
     “Reproducibility of Data-Oriented Experiments in
     e-Science”. SIGIR Forum, 50(1):68–82, June 2016.