Towards open-source shared implementations of keyword-based access systems to relational data

Alex Badan, Luca Benvegnù, Matteo Biasetton, Giovanni Bonato, Alessandro Brighente, Alberto Cenzato, Piergiorgio Ceron, Giovanni Cogato, Stefano Marchesin, Alberto Minetto, Leonardo Pellegrina, Alberto Purpura, Riccardo Simionato, Nicolò Soleti, Matteo Tessarotto, Andrea Tonon, Federico Vendramin, Nicola Ferro

University of Padua, Italy
{alex.badan, luca.benvegnu.2, matteo.biasetton, giovanni.bonato, alessandro.brighente, alberto.cenzato, piergiorgio.ceron, giovanni.cogato.1, stefano.marchesin, alberto.minetto, leonardo.pellegrina, alberto.purpura, riccardo.simionato.1, nicolo.soleti, matteo.tessarotto.1, andrea.tonon.3, federico.vendramin}@studenti.unipd.it
ferro@dei.unipd.it

ABSTRACT

Keyword-based access systems to relational data address a challenging and important issue, i.e. letting users exploit natural language to access databases whose schema and instance are possibly unknown. Unfortunately, there are almost no shared implementations of such systems, and this hampers the reproducibility of experimental results. We explore the difficulties in reproducing such systems and share implementations of several state-of-the-art algorithms.

1. INTRODUCTION

Keyword-based access to relational data is a key technology for lowering the barriers of access to the huge amounts of data managed by databases. It is an extremely difficult and open challenge [1], since it has to face the "conflict of impedance" between vague and imprecise user information needs and rigorously structured data, allowing users to express their queries in natural language against a potentially unknown database.

Improving a technology is greatly eased by the existence of a reference architecture paired with a proper evaluation framework, in order to have a coherent vision of the components we have to leverage on and to precisely measure and track performance over time. Unfortunately, the situation is very fragmented in the case of keyword-based access to relational data, since both a fully-fledged reference architecture and an evaluation methodology are still missing [2].

In this paper, we start to investigate the problem of the reproducibility of experimental results for keyword-based access systems. Reproducibility is becoming a primary concern in many areas of science, and it is key both for fully understanding state-of-the-art solutions and for being able to compare against and improve over them [6, 7]. However, there is a lack of commonly shared open-source platforms implementing state-of-the-art algorithms for keyword-based access to relational data, as, for example, Terrier (http://www.terrier.org/) is in the information retrieval field; on the contrary, each system mostly needs to be re-implemented from scratch [4]. Therefore, as part of a student project during the databases course of the master degree in computer science at the University of Padua, we considered several state-of-the-art algorithms and implemented them from scratch, in order to understand the difficulties and pitfalls in implementing them and to move towards a shared implementation of them.

The paper is organized as follows: Section 2 presents the related work; Section 3 introduces the implementation of the different algorithms; Section 4 reports the lessons learned; finally, Section 5 wraps up the discussion and outlines some future work.

2. RELATED WORKS

In the following, we briefly summarize the algorithms implemented in this paper. For a broader perspective on keyword search over relational databases, please refer to [13, 14].
(Copyright is with the authors, 2017. Published in the Workshop Proceedings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.)

2.1 Graph-based Approaches

This type of algorithm starts with the construction of a graph where nodes are tuples and edges are the foreign key constraints; the goal is to find a set of trees that connect a group of nodes containing all the keywords in an input query. The weights between nodes and the graph type (directed or undirected) depend on each specific approach. Moreover, these algorithms differ in the use of Dijkstra-based or Steiner-tree methods to visit the graph: the former finds the shortest spanning tree using only the selected nodes, i.e. those that match the query keywords; the latter may also use extra nodes which are not included in the previous selection but exist in the original graph.

Browsing ANd Keyword Searching I (BANKS-I) [3]: the algorithm adds a backward weighted edge for each directed edge in the graph and assigns a relevance score to each node. Then, it proceeds by finding the shortest path from each keyword node to all the other nodes in the graph. Each path is computed by running Dijkstra's single source shortest path algorithm starting from each one of the keyword nodes and traversing all the edges in reverse direction; this process is called backward expanding search. When multiple paths intersect at a common node r in the graph, the resulting tree with root r and the nodes containing the keywords as leaves is examined. If the tree contains all the keywords in the input query, its weight is computed and the tree is added to a fixed-size heap data structure, which contains all the result trees in decreasing order of relevance. When the heap is full or all the trees have been generated, the most relevant trees are returned, until a predefined number of results has been reached.

Browsing ANd Keyword Searching II (BANKS-II) [10]: Bidirectional Search improves on BANKS-I by also allowing forward search from potential roots towards the other keyword nodes. The algorithm uses two concurrent iterators, called outgoing and ingoing, to explore the graph in both directions; both iterators use Dijkstra's single source shortest path approach. BANKS-II also allows for preferential expansion of paths that have less branching, using a spreading activation mechanism to prioritize the most promising paths and to avoid wasteful exploration of the graph.

Dynamic Programming Best-First (DPBF) [5]: it tries to build a minimum-cost Steiner tree: for each neighbour v of the root in the graph, it creates a new tree with v as the new root, and it tries to merge trees with the same root and a different set of keywords. If a tree contains all the searched keywords, it is a solution for the algorithm. Given a root and a specific set of keywords, DPBF keeps only the tree with the lowest weight, i.e. the best one for that specific root node and set of keywords, and it guarantees solutions that are total, minimal, and controlled in size.

DISCOVER-I [9] (a schema-based approach, see Section 2.2): given a relational database and a set of keywords K, DISCOVER exploits the database schema to generate a graph G_TS: from the Basic Tuple Sets R_i^{k_j}, the sets of tuples of the relation R_i containing the keyword k_j ∈ K, ∀i, j, it creates a node for each Tuple Set

    R_i^{K′} = ⋂_{k ∈ K′} R_i^k − ⋃_{k ∉ K′} R_i^k,  ∀K′ ⊆ K    (2.1)

i.e. the set of tuples of the relation R_i containing all and only the keywords of K′, and one node for each Free Tuple Set, i.e. the database relations. Edges represent the PK-to-FK relationships. The algorithm then computes a complete and non-redundant set of Candidate Networks up to a maximum number of joins T, pruning all the joining networks of tuples that are not minimal (i.e. that have keyword-free tuples as leaves or contain the same tuple twice) or not total (i.e. that do not contain every keyword ∈ K; AND semantics). To evaluate the Candidate Networks efficiently, DISCOVER eventually computes and stores a list of intermediate results of common join subexpressions. Finding these intermediate results is an NP-complete problem, so it uses an iterative greedy algorithm based on frequencies: the most frequent join expression is stored in a view at every iteration and then reused when possible. The results are finally displayed to the user, ordered by size.

DISCOVER-II [8]: when a query Q is issued, the algorithm uses an IR engine module to extract from each relation R the tuple set R^Q of tuples with Score > 0, where Score is computed for each tuple by an IR-style relevance scoring function. Candidate Networks, i.e. join expressions used to create joining trees of tuples which are potential answers to the issued query, are then generated from the tuple sets returned by the IR engine and the database schema. The final step is to identify the top-k results, a task performed by the execution engine module using either the Naive algorithm, which takes the top-k results from each CN and combines them in a sort-merge manner to identify the final top-k results of the given query, or the Sparse algorithm, which computes a bound on the maximum possible score of the tuple trees derived from a CN in order to prune unpromising CNs.
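To make the Tuple Set construction of Equation (2.1) concrete, the following minimal Java sketch (class and method names are our own illustrative choices, not taken from any DISCOVER codebase) derives the tuple set R_i^{K′} of a relation from its Basic Tuple Sets, using plain set intersection and difference:

```java
import java.util.*;

// Illustrative sketch of Equation (2.1): given the Basic Tuple Sets R_i^k
// (the tuple IDs of relation R_i containing keyword k), compute the tuple
// set R_i^{K'} holding the tuples that contain ALL keywords of K' and NO
// keyword outside K'. Names are ours, not DISCOVER's.
public class TupleSets {

    public static Set<Integer> tupleSet(Map<String, Set<Integer>> basic,
                                        Set<String> kPrime) {
        // Intersection over k in K': tuples containing every keyword of K'.
        Set<Integer> result = null;
        for (String k : kPrime) {
            Set<Integer> rk = basic.getOrDefault(k, Collections.emptySet());
            if (result == null) result = new HashSet<>(rk);
            else result.retainAll(rk);
        }
        if (result == null) return new HashSet<>(); // empty K': nothing to do
        // Minus the union over k not in K': drop tuples with extra keywords.
        for (Map.Entry<String, Set<Integer>> e : basic.entrySet())
            if (!kPrime.contains(e.getKey())) result.removeAll(e.getValue());
        return result;
    }
}
```

For instance, with R_i^a = {1, 2} and R_i^b = {2, 3}, the tuple set for K′ = {a} is {1}: tuple 2 also contains b and is therefore removed by the union term.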
Steiner-Tree Approximation in Relationship graphs (STAR) [11]: it initially looks for the tuples containing one or more of the query's keywords (call this set V′ ⊂ V) and then tries to connect these tuples into a least-weight tree, i.e. computing the best Steiner tree between the given nodes, whose total weight is Σ_{e ∈ E} w(e). In order to build a first interconnecting tree, STAR relies on a strategy similar to BANKS-I but, instead of running single source shortest path iterators from each node of V′, STAR runs a simple breadth-first search from each terminal, called in a round-robin manner. As soon as the iterators meet, a result is constructed. In the second phase, STAR iteratively improves the current tree by replacing certain paths in the tree with lower-weight paths from the underlying graph.

2.2 Schema-based Approaches

Both DISCOVER-I [9] and DISCOVER-II [8], described above, use a generalized inverted index, called Master Index and IR Engine respectively, to retrieve from the database the tuples containing the keywords given by the user. Both output a list of Candidate Networks, albeit defined slightly differently.

3. IMPLEMENTATION

Students split up in groups of 3-4 people, and each group was responsible for the implementation of one algorithm. All the algorithms have been implemented in Java, while PostgreSQL has been used as the relational engine. The implementation of the different algorithms is available as open source under the BSD license at https://bitbucket.org/ks-bd-2015-2016/ks-unipd.

We briefly summarize the main features of each implementation:

• BANKS-I: the required graph is created prior to the execution of the algorithm, by executing multiple queries on the selected database without any prior knowledge of its schema. In our implementation, for each node in the graph only the ID of the tuple is saved, and an external file is used to store the remaining information. The graph is represented using adjacency lists. In order to speed up the creation of the graph, we used the fastutil library (http://fastutil.di.unimi.it/) in addition to the Java standard library. Finally, to optimize performance while executing multiple instances of Dijkstra's algorithm (one for each keyword node), we chose to compute the next node to be returned by the algorithm only when it was required: most of the time, only a handful of nodes is needed to build a valid result tree. With this approach we avoided executing Dijkstra's algorithm on the whole graph, optimizing both execution time and memory usage. For the execution of each query, the memory limit was set to 5 GB and the maximum running time to 1 hour.

• BANKS-II: we divided our implementation in two phases: pre-processing and the BANKS-II execution. We implemented the graph in an object-oriented manner. The vertexes are stored using a two-layered HashMap, to improve performance and to reduce as much as possible the probability of collisions: the first layer refers to the relation name, while the second layer uses the primary key of the tuple as index. The paper suggests a set of different weight assignment strategies (e.g. PageRank); we decided to use equally weighted edges for simplicity. Once the user provides the keywords, the pre-processing ends with the creation of the matching sets. These sets are also organized in a two-layered HashMap data structure: vertexes are inserted in the respective matching sets using SQL queries that find tuples with attributes having matching values. During the BANKS-II execution, the outgoing and ingoing iterators start to explore the graph, using two priority queues to extract the highest activated node. We created these queues using the standard Java PriorityQueue class, defining a custom compare method. When a new solution (a rooted tree connecting all the matching sets) is found, we store it in a PriorityQueue; these results are ranked with a score given by the overall sum of the edge weights. Our implementation creates the graph from the DB once, and then accepts keywords from the user to perform multiple searches.

• DISCOVER-II: the graph was built using the information schema, and the algorithm was implemented in Java using native collections. The IR Engine module identifies all the database tuples that have a non-zero score for a given query; each occurrence of a keyword is recorded as a tuple-attribute pair. Moreover, the IR Engine relies on an inverted index that associates each keyword appearing in the database with the list of its occurrences, which we implemented using the GIN index of PostgreSQL. To obtain better performance, we store the tuple sets identified for a query Q as a view. The Candidate Network Generator combines free and non-free tuple sets: the former are tuple sets with no occurrences of the query keywords, which help connect non-free tuple sets via foreign-key joins; the latter contain at least one keyword. The Execution Engine module receives as input the set of CNs along with the non-free tuple sets and repeatedly contacts the RDBMS's query execution engine to identify the top-k query results. We implemented two algorithms: the Naive algorithm, considering a Boolean-OR semantics, issues a SQL query for each CN to answer a top-k query, then combines the results in a sort-merge manner to identify the final top-k results of the query; the Sparse algorithm discards, at any point in time, any (unprocessed) CN that is guaranteed not to produce a top-k match for the query: it computes a bound MPS_i on the maximum possible score of a tuple tree derived from a CN C_i, and if MPS_i does not exceed the actual score of k already produced tuple trees, then C_i can be safely removed from further consideration. MPS_i is the score of a hypothetical joining tree of tuples T that contains the top tuples from every non-free tuple set in C_i. The CNs for a query are evaluated in ascending size order, so that the smallest CNs, i.e. the least expensive to process and the most likely to produce high-score tuple trees, are evaluated first.

• DPBF: the project has been divided into two phases: creation and serialization of the database graph, and the DPBF implementation. To quickly get all the vertexes containing the query's keywords, we used HashMap data structures to store the graph and the sets of vertexes associated with their words. Once the graph is complete, our algorithm gets the vertexes from which it starts to build the resulting trees: (i) for each of these we move up one level, removing similar trees with higher costs, and then (ii) all the trees with the same root node and different sets of keywords are merged. Steps (i) and (ii) are repeated until at least one result is found. We had to implement some Java data structures to do that correctly and efficiently.
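The DPBF steps above can be illustrated with a compact, self-contained sketch of the core dynamic programming: a state (root v, keyword bitmask S) keeps the cheapest known tree, and states either grow along an edge or merge with a same-root tree covering a disjoint keyword set. The graph encoding and all names are our own simplifications, not the actual project code:

```java
import java.util.*;

// Simplified sketch of the DPBF state expansion: a state (v, S) is the
// cheapest known tree rooted at v covering keyword set S (a bitmask).
// Trees grow along edges or merge with same-root trees over disjoint
// keyword sets, always keeping only the lowest-weight tree per state.
public class DpbfSketch {

    // adj.get(v) = array of edges {neighbour, weight};
    // keywordsOf[v] = bitmask of keywords contained in tuple v; k = #keywords.
    public static int minSteinerCost(List<int[][]> adj, int[] keywordsOf, int k) {
        int n = adj.size(), full = (1 << k) - 1;
        int[][] best = new int[n][1 << k];
        for (int[] row : best) Arrays.fill(row, Integer.MAX_VALUE);
        PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) -> a[2] - b[2]);
        for (int v = 0; v < n; v++)
            if (keywordsOf[v] != 0) {
                best[v][keywordsOf[v]] = 0;
                pq.add(new int[]{v, keywordsOf[v], 0});
            }
        while (!pq.isEmpty()) {
            int[] s = pq.poll();
            int v = s[0], mask = s[1], cost = s[2];
            if (cost > best[v][mask]) continue;            // stale queue entry
            if (mask == full) return cost;                 // covers all keywords
            for (int[] e : adj.get(v)) {                   // grow along edge (v, u)
                int u = e[0], c = cost + e[1], m = mask | keywordsOf[u];
                if (c < best[u][m]) { best[u][m] = c; pq.add(new int[]{u, m, c}); }
            }
            for (int other = 1; other <= full; other++) {  // merge same-root trees
                if ((other & mask) != 0 || best[v][other] == Integer.MAX_VALUE) continue;
                int m = mask | other, c = cost + best[v][other];
                if (c < best[v][m]) { best[v][m] = c; pq.add(new int[]{v, m, c}); }
            }
        }
        return -1; // the keyword nodes are not connected
    }
}
```

On a three-node path with unit weights and the two keywords on the endpoints, the sketch returns 2, the weight of the whole path.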
• DISCOVER-I: we first implemented the Master Index, performing the full-text search through the to_tsvector function provided by PostgreSQL, and then we created the initial graph G_TS using an adjacency matrix. We observed that a lot of queries were repeated while computing Tuple Sets, because the same Basic Tuple Sets were created many times by the DBMS "on the fly"; thus we decided to store a new table for each Basic Tuple Set, to be reused while performing the unions and intersections of (2.1). To further improve the performance of the full-text search, we preprocessed the main search fields to remove English stop words and to convert all tokens to lowercase. We then stored the lexemes in a dedicated Generalized Inverted Index. The Candidate Network algorithm then prunes, expands and accepts the Candidate Networks using static methods. The accepted ones (treated as graphs) were rewritten by the Plan Generator, replacing repeated joins inside them (i.e. the same couples of nodes and the connecting edge) with a reference (i.e. a new node) to a new temporary table. Finally, the queries were evaluated from the resulting graphs with a BFS algorithm, in order to guarantee the correct order of joins.

• STAR: STAR has been implemented in a schema-agnostic way, so the first step of the algorithm loads all the DB metadata needed to build the tree. Given an i-keyword query, we look for the starting tuples in a multithreaded manner. All those tuples are stored in an ad-hoc graph data structure based on the Java TreeSet class. Considering these tuples as expanding trees, we try to connect them following both ingoing and outgoing foreign keys, dynamically loading the referenced tuples from the DB, until we get a tree containing all the query's keywords. Given that our test DBs were not weighted, we decided to set w(e) = 1 ∀e ∈ E. At this point we work on this tree to optimize its weight by reshaping its structure: we iteratively remove the highest-weight path, replacing it with a lower-weight one.
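As a companion to the STAR implementation notes above, the following self-contained sketch shows the round-robin breadth-first expansion of the first phase under our unit-weight assumption (w(e) = 1); the graph encoding and all names are illustrative, not the project's actual classes:

```java
import java.util.*;

// Sketch of STAR's first phase under unit edge weights: one BFS per
// terminal, advanced in round-robin; when two searches touch, their
// partial paths can be joined into a first interconnecting tree. Here we
// simply return the node where the first two searches meet (or -1).
public class StarSketch {

    public static int firstMeetingNode(Map<Integer, List<Integer>> adj,
                                       List<Integer> terminals) {
        int t = terminals.size();
        Map<Integer, Integer> owner = new HashMap<>(); // node -> BFS that reached it
        List<Deque<Integer>> frontiers = new ArrayList<>();
        for (int i = 0; i < t; i++) {
            frontiers.add(new ArrayDeque<>(List.of(terminals.get(i))));
            owner.put(terminals.get(i), i);
        }
        boolean progress = true;
        while (progress) {
            progress = false;
            for (int i = 0; i < t; i++) {              // round-robin over iterators
                Deque<Integer> q = frontiers.get(i);
                if (q.isEmpty()) continue;
                progress = true;
                int v = q.poll();                      // expand one node of BFS i
                for (int u : adj.getOrDefault(v, List.of())) {
                    Integer own = owner.get(u);
                    if (own == null) { owner.put(u, i); q.add(u); }
                    else if (own != i) return u;       // two searches meet here
                }
            }
        }
        return -1;                                     // terminals not connected
    }
}
```

On a four-node path with terminals at the two ends, the two searches meet in the middle; the second phase would then reshape the resulting tree to lower its weight.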
4. LESSONS LEARNED

For each examined system, we briefly summarize the difficulties encountered and the lessons learned in reproducing it.

• BANKS-I: during the implementation of the algorithm, we encountered some difficulties due to missing information in the descriptions of crucial parts of the algorithm. The technique for efficiently running multiple instances of Dijkstra's single source shortest path algorithm is not described in the original paper, even though the choice of one implementation over another considerably influences memory usage and execution time. We also encountered difficulties due to the lack of precision in the explanation of how to evaluate the result trees: the way to handle some particular cases, such as nodes containing all the keywords in the input query or trees composed of only two nodes, is not described. It became noticeable that the major weakness of the algorithm is its memory occupation, which became more evident when testing it on larger datasets.

• BANKS-II: we found the paper sufficiently clear in defining the Bidirectional Search algorithm and the search approach; however, the implementation of some procedures was not trivial. As an example, the paper does not state a proper algorithm to update the reached ancestors' distances and activations when adding a new node to a new better path (the ATTACH and ACTIVATE functions); therefore, we opted for a recursive solution for these procedures. To increase the performance of the searches, we opted for two heuristic solutions: we set a maximum distance for the answer trees lower than the one suggested by the paper, and we set a minimum threshold value for the activation of a node, to avoid wasteful spreading across the graph.

• DISCOVER-II: we observed that on a dense dataset such as IMDB the execution time using OR semantics grows exponentially, so queries with more than 5 keywords do not return results in useful time. With AND semantics, which keeps more CNs, a post-processing step on the stream of results is needed, which worsens the performance further. Furthermore, it is not clear whether the authors keep duplicate CN queries or remove them beforehand (it is not specified whether Set or Bag semantics is used), and we found biased queries that include relation names, e.g. char_name. We chose not to implement the AND semantics because we consider such a method highly inefficient. For better performance, we decided to use a HashMap in the score function, filled when the lists were created.

• DPBF: during the implementation we faced three different problems: making the graph creation efficient, making the algorithm implementation efficient, and dealing with Java and its data structures. We tried to make the graph creation as fast as possible, assuming that it is created only once, so we serialized the graph into a file. The main cause of problems was the lack of details in the paper: only some general cases are described, ignoring specific ones. The last class of problems deals with Java: DPBF's main data structure is a priority queue based on the trees' weights, but Java's PriorityQueue does not offer a convenient interface for our needs and gave us a lot of problems in sorting weights.

• DISCOVER-I: the paper is heavily focused on a theoretical perspective, including a lot of formal definitions of problems and structures, but lacking practical directives. Certain parts of the algorithm, like the Master Index, were almost treated like a black box: specifically, Oracle8i interMedia Text was used, so we had to implement our own Master Index using the full-text search provided by PostgreSQL. We made our own assumptions to implement it, for example by searching and indexing text fields only and by removing English stop words. This probably led to different execution times, making time performance comparisons with the ones in [4] less significant. This is reasonable because of the larger impact that the indexes have on query computation compared to the advantages brought by the introduction of the Plan Generator and its execution. Moreover, the Plan Generator was challenging to implement, because Candidate Networks may contain joins of tuples from the same relation through an entry of another one; this operation is rejected by the DBMS, because the resulting table would have repeated column names, forcing us to change them to be unique. Eventually, there was no suggestion about how to evaluate the final queries: the tables need to be considered in an order that performs the join operations on the correct columns. For this reason, we decided to evaluate every Candidate Network with a BFS algorithm, which guarantees correctness.

• STAR: in our work we had to deal with some difficulties. The lack of implementation details in [11] left us a wide range of possibilities, especially in the tree-building phase. The most interesting and challenging issue concerned the choice of the keyword-containing tuples: if some of the DB's columns contain repeated strings, searching for a keyword results with high probability in many matching tuples, which causes longer execution times and lower accuracy for some queries. To mitigate these issues, we added some heuristics, sorting the expanding trees by the number of the query's keywords they contain, i.e. trees with more keywords are expanded first. Furthermore, to increase DB throughput, we used multiple threads and connections to the DB during query execution. Since the Java standard library does not provide a graph data structure, we first tried the JGraphT library but, after some benchmarks, we decided to write our own graph class based on Java's TreeSet.

5. CONCLUSIONS AND FUTURE WORK

The paper briefly summarized an ongoing project for releasing an open-source Java implementation of the most common algorithms for keyword-based search over relational data. The project was initiated as a Master students project, where teams of 3-4 students implemented each algorithm and noted the difficulties and issues in trying to reproduce them. All the developed implementations are available as open source at https://bitbucket.org/ks-bd-2015-2016/ks-unipd.

We are currently working on testing all these different implementations on the same hardware, in order to compare their efficiency. We are also processing the datasets prepared by [4], in order to make them closer to the standard practices adopted in Information Retrieval (IR) evaluation [12]; we will then systematically investigate the effectiveness of the implemented algorithms.

Finally, we will work on harmonizing the different implementations, factoring out commonly used data structures and shared functionalities, in order to move towards a unified open-source framework for keyword-based search over relational data.

6. REFERENCES

[1] D. Abadi, R. Agrawal, A. Ailamaki, M. Balazinska, P. A. Bernstein, M. J. Carey, S. Chaudhuri, J. Dean, A. Doan, M. J. Franklin, J. Gehrke, L. M. Haas, A. Y. Halevy, J. M. Hellerstein, Y. E. Ioannidis, H. V. Jagadish, D. Kossmann, S. Madden, S. Mehrotra, T. Milo, J. F. Naughton, R. Ramakrishnan, V. Markl, C. Olston, B. C. Ooi, C. Ré, D. Suciu, M. Stonebraker, T. Walter, and J. Widom. The Beckman Report on Database Research. ACM SIGMOD Record, 43(3):61–70, September 2014.

[2] S. Bergamaschi, N. Ferro, F. Guerra, and G. Silvello. Keyword-based Search over Databases: A Roadmap for a Reference Architecture Paired with an Evaluation Framework. LNCS Transactions on Computational Collective Intelligence (TCCI), 9630:1–20, 2016.

[3] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword Searching and Browsing in Databases using BANKS. In R. Agrawal, K. Dittrich, and A. H. H. Ngu, editors, Proc. 18th International Conference on Data Engineering (ICDE 2002), pages 431–440. IEEE Computer Society, Los Alamitos, CA, USA, 2002.

[4] J. Coffman and A. C. Weaver. An Empirical Performance Evaluation of Relational Keyword Search Techniques. IEEE Transactions on Knowledge and Data Engineering (TKDE), 26(1):30–42, 2014.

[5] B. Ding, J. Xu Yu, S. Wang, L. Qin, X. Zhang, and X. Lin. Finding Top-k Min-Cost Connected Trees in Databases. In R. Chirkova, A. Dogac, M. T. Özsu, and T. Sellis, editors, Proc. 23rd International Conference on Data Engineering (ICDE 2007), pages 836–845. IEEE Computer Society, Los Alamitos, CA, USA, 2007.

[6] N. Ferro. Reproducibility Challenges in Information Retrieval Evaluation. ACM Journal of Data and Information Quality (JDIQ), 8(2):8:1–8:4, January 2017.

[7] N. Ferro, N. Fuhr, K. Järvelin, N. Kando, M. Lippold, and J. Zobel. Increasing Reproducibility in IR: Findings from the Dagstuhl Seminar on "Reproducibility of Data-Oriented Experiments in e-Science". SIGIR Forum, 50(1):68–82, June 2016.

[8] V. Hristidis, L. Gravano, and Y. Papakonstantinou. Efficient IR-Style Keyword Search over Relational Databases. In J. C. Freytag, P. C. Lockemann, S. Abiteboul, M. J. Carey, P. G. Selinger, and A. Heuer, editors, Proc. 29th International Conference on Very Large Data Bases (VLDB 2003), pages 850–861. Morgan Kaufmann Publishers, Inc., San Francisco, CA, USA, 2003.

[9] V. Hristidis and Y. Papakonstantinou. DISCOVER: Keyword Search in Relational Databases. In P. A. Bernstein, Y. E. Ioannidis, R. Ramakrishnan, and D. Papadias, editors, Proc. 28th International Conference on Very Large Data Bases (VLDB 2002), pages 670–681. Morgan Kaufmann Publishers, Inc., San Francisco, CA, USA, 2002.

[10] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar. Bidirectional Expansion For Keyword Search on Graph Databases. In K. Böhm, C. S. Jensen, L. M. Haas, M. L. Kersten, P.-A. Larson, and B. Chin Ooi, editors, Proc. 31st International Conference on Very Large Data Bases (VLDB 2005), pages 505–516. ACM Press, New York, USA, 2005.

[11] G. Kasneci, M. Ramanath, M. Sozio, F. M. Suchanek, and G. Weikum. STAR: Steiner-Tree Approximation in Relationship Graphs. In Z. Li, P. S. Yu, Y. E. Ioannidis, D. Lee, and R. Ng, editors, Proc. 25th International Conference on Data Engineering (ICDE 2009), pages 868–879. IEEE Computer Society, Los Alamitos, CA, USA, 2009.

[12] M. Sanderson. Test Collection Based Evaluation of Information Retrieval Systems. Foundations and Trends in Information Retrieval (FnTIR), 4(4):247–375, 2010.

[13] H. Wang and C. C. Aggarwal. A Survey of Algorithms for Keyword Search on Graph Data. In C. C. Aggarwal and H. Wang, editors, Managing and Mining Graph Data, pages 249–273. Springer-Verlag, New York, USA, 2010.

[14] J. X. Yu, L. Qin, and L. Chang. Keyword Search in Databases. Morgan & Claypool Publishers, USA, 2010.