=Paper=
{{Paper
|id=Vol-2175/paper11
|storemode=property
|title=Progressive Indices: Indexing Without Prejudice
|pdfUrl=https://ceur-ws.org/Vol-2175/paper11.pdf
|volume=Vol-2175
|authors=Pedro Holanda
|dblpUrl=https://dblp.org/rec/conf/vldb/Holanda18
}}
==Progressive Indices: Indexing Without Prejudice==
Pedro Holanda, supervised by Stefan Manegold
Centrum Wiskunde & Informatica, Amsterdam, Netherlands
holanda@cwi.nl

===ABSTRACT===
Database cracking is a method to create partial indices as a side-effect of processing queries. Cracking efficiently smears out the cost of creating a full index over a stream of queries, creating an index that is overfitted to the queried parts of the data. This core characteristic of cracking leads to unpredictable performance and unreliable convergence towards a full index. These problems are aggravated when considering updates and multidimensional queries. We envision a new indexing technique, Progressive Indexing, that improves database cracking by strictly limiting per-query indexing cost to a budget (e.g., a user-defined fraction of scan costs), allowing the first and subsequent queries to complete without heavy penalties. At the same time, all indexing effort is spent towards predictable convergence towards a full index. We discuss different algorithms to deal with multidimensional queries and updates while maintaining a robust and convergent index, and we then explore the research space and the new challenges that arise from this new technique.

Figure 1: Progressive indexing (the state of the column after 3, 4, and 10 queries).

===1. INTRODUCTION===
Index creation is one of the most difficult decisions in database schema design [4]. Based on the workload, the database administrator needs to decide whether creating a specific index is worth the overhead of creating and maintaining it. This considerable up-front cost creates a trade-off that requires careful consideration and experimentation. Automatic physical design tuning [2] aims to relieve the user of having to manually choose which indexes to create. These tools attempt to find the optimal set of indices for a given query workload, balancing the benefits of having an index against the added costs of creating it and maintaining it during modifications to the database. However, these tools are not able to work on dynamic systems due to the unpredictability of the workload and the lack of idle time for a priori index creation.

Adaptive indexing techniques, such as database cracking [11], attempt to solve this problem by presenting an adaptive partial indexing approach for relational databases. It works by building a partial index as a co-product of query processing. The index is built the first time a column is queried and is continuously refined as subsequent queries are executed. This way, the cost of creating an index is distributed over a stream of queries.

However, database cracking and its variations are not robust against varying workload patterns, since the index is only refined in the areas targeted by the workload. Queries that deviate from that workload will target unrefined sections of the index, leading to large and unpredictable spikes in performance. This problem is exacerbated when dealing with updates, in case the most refined area is also the most updated, and with multidimensional queries (i.e., queries with selections on multiple columns), since one must maintain multiple indices, one for each column.

To address these needs we propose a novel approach for incremental indexes, Progressive Indexing, where every query issued to the database results in a fixed refinement of the index, leading to robust query execution and full convergence towards a full index, while maintaining a low extra cost per query.

Figure 1 depicts how progressive indexing works. Considering an unindexed column, a pivot and a budget δ = 0.1, we start by indexing a δ fraction of the data while scanning the remaining 1 − δ piece of the data. After 3 queries, 30% of our column is already indexed around the pivot. The fourth query starts by performing an index lookup on the fraction of the data that has already been indexed, while scanning the still unindexed piece and expanding the index by another δ fraction of the total column. Finally, after 10 queries, the column is fully indexed and we continue this process by selecting another pivot.
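To make the budget mechanism of Figure 1 concrete, the following is a minimal sketch in C++ (our own illustration, not the paper's implementation; the ProgressiveColumn type and all its members are assumptions made for this sketch). Each query partitions at most a δ fraction of the column around a pivot into a copy, then answers the range query from the already partitioned pieces plus a scan of the untouched remainder, so the indexing cost per query stays bounded while the index still converges.

<pre>
// Minimal sketch of the per-query indexing budget (not the paper's code).
// Assumptions: a single integer column, range-sum queries, and a budget
// delta given as a fraction of the column size.
#include <cstddef>
#include <cstdint>
#include <vector>

struct ProgressiveColumn {
    std::vector<int64_t> data;   // original, unindexed column
    std::vector<int64_t> copy;   // partially partitioned copy being built
    size_t consumed = 0;         // how many source elements were copied so far
    size_t lo_end = 0;           // copy[0..lo_end)   holds values <  pivot
    size_t hi_begin;             // copy[hi_begin..)  holds values >= pivot
    int64_t pivot;

    ProgressiveColumn(std::vector<int64_t> d, int64_t p)
        : data(std::move(d)), copy(data.size()), hi_begin(data.size()), pivot(p) {}

    // Answers SELECT SUM(a) WHERE a BETWEEN low AND high, spending at most a
    // delta fraction of the column on further partitioning.
    int64_t query(int64_t low, int64_t high, double delta) {
        size_t budget = static_cast<size_t>(delta * data.size());
        // 1. Indexing step: copy up to `budget` more elements around the pivot.
        for (size_t b = 0; b < budget && consumed < data.size(); ++b, ++consumed) {
            int64_t v = data[consumed];
            if (v < pivot) copy[lo_end++] = v; else copy[--hi_begin] = v;
        }
        int64_t sum = 0;
        // 2. Probe the partitioned pieces, skipping a piece the range cannot touch.
        if (low < pivot)
            for (size_t i = 0; i < lo_end; i++)
                if (copy[i] >= low && copy[i] <= high) sum += copy[i];
        if (high >= pivot)
            for (size_t i = hi_begin; i < copy.size(); i++)
                if (copy[i] >= low && copy[i] <= high) sum += copy[i];
        // 3. Scan the not-yet-consumed remainder of the original column.
        for (size_t i = consumed; i < data.size(); i++)
            if (data[i] >= low && data[i] <= high) sum += data[i];
        return sum;
    }
};
</pre>

Section 3 generalizes this idea: once the whole column has been consumed, pieces are recursively refined around further pivots that are tracked in a binary search tree.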
Paper Structure. The rest of this paper is structured as follows. Section 2 provides an overview of related work. Section 3 then describes progressive indexing. Section 4 presents a brief proof of concept and experimental analysis. Finally, in Section 5, we discuss our research plan.

===2. RELATED WORK===
Automatic physical database design has been an active research field for the past twenty years. This work resulted in two different areas: self-tuning tools and adaptive indexing.

Self-Tuning Tools [1, 3, 16, 6] attempt to solve this problem by automatically recommending a set of indexes to optimize a known workload of the system. However, these systems depend on previous workload knowledge and are only able to create full indexes, which makes them unsuitable for unpredictable workloads or when there is no idle time to be invested in a priori index creation.

Adaptive Indexing [15] is an alternative to the self-tuning tools. It is especially useful in scenarios where the workload is unpredictable and there is no idle time to invest in index creation. It tackles these problems by creating indexes that are workload dependent in an incremental fashion. Figure 2 depicts an example of database cracking [11]. Query Q1 starts by triggering the creation of the cracker column (i.e., initially a copy of column A), where the tuples are clustered into three pieces reflecting the range predicate of Q1. The result of Q1 is then retrieved as a view on the piece colored in red (i.e., 10 < A < 14). Later, query Q2 requires a refinement of pieces 1 and 3 (i.e., respectively indexing A > 7 and A ≤ 16), splitting each into two new pieces. (A minimal sketch of one such cracking step is shown at the end of this section.)

Figure 2: Database cracking.

Database cracking has multiple issues: (1) poor convergence towards a full index, (2) inefficient tuple reconstruction, (3) unpredictable performance and (4) inefficient updates. Below we briefly discuss the research that addresses these issues.

Convergence. Hybrid cracking [5, 12, 15] mitigates the issue of poor convergence towards a full index by executing many initial cracking runs with random pivots. Although this provides better convergence and higher robustness, it greatly impacts the cost of the first query.

Tuple Reconstruction. Sideways cracking [10, 15] addresses the inefficient tuple reconstruction problem. It minimizes the tuple reconstruction cost by using a "cracker maps" data structure, which provides a mapping between attributes that are combined in queries. However, the strategy is only applied to tuple reconstruction and not to multidimensional queries.

Robustness. Stochastic cracking [7, 15] addresses the unpredictable performance problem by creating partitions using a random pivot element instead of pivoting around the query predicates. However, the actual cracking still occurs in the pieces that the query predicates fall into.

Generalization. Adaptive adaptive indexing [14] attempts to be a general-purpose algorithm for adaptive indexing. It has multiple parameters that can be tuned to mimic the data access of multiple adaptive indexing techniques (e.g., database cracking, sideways cracking, hybrid cracking).

Updates. SPST-Index [8] extends the original cracking work on updates [9] by rotating nodes when they are accessed in order to cluster cold data in the leaves. When updates are executed the leaves are pruned; consequently, the index constraints are relaxed, resulting in faster updates. However, this work still alleviates the cost of updates by increasing the cost of cracking for the subsequent queries, bringing more unpredictability to the query costs.
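To make the cracking example of Figure 2 concrete, here is a minimal sketch of a single crack-in-three step, written for this summary and not taken from MonetDB or any cracking implementation; real cracking additionally records the resulting piece boundaries in a cracker index (e.g., an AVL tree) so that later queries only touch the relevant pieces.

<pre>
// Minimal sketch of one crack-in-three step for a predicate low < A < high:
// the touched piece of the cracker column is partitioned in place into
// [ <= low | low < A < high | >= high ].
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Partitions v[begin, end) in place and returns the [begin, end) offsets of
// the middle (result) piece.
std::pair<size_t, size_t> crack_in_three(std::vector<int>& v, size_t begin, size_t end,
                                         int low, int high) {
    auto first = v.begin() + begin;
    auto last  = v.begin() + end;
    // Move values <= low to the front of the piece...
    auto mid_begin = std::partition(first, last, [&](int x) { return x <= low; });
    // ...then, within the rest, move values < high before the values >= high.
    auto mid_end = std::partition(mid_begin, last, [&](int x) { return x < high; });
    return {static_cast<size_t>(mid_begin - v.begin()),
            static_cast<size_t>(mid_end - v.begin())};
}
</pre>

For query Q1 in Figure 2, calling crack_in_three over the whole cracker column with low = 10 and high = 14 yields the three pieces shown there, with the middle piece as the query result.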
===3. PROGRESSIVE INDICES===
The previous section gave us the necessary motivation for progressive indexing: (1) the workload-dependent pivot selection causes performance spikes and does not guarantee convergence towards a full index, (2) cracking is not able to efficiently handle multidimensional queries, since it must maintain multiple indices and intersect their results, and (3) robustness is penalized when dealing with updates, due to partial index removal and the possibility of updates being focused on refined cracked pieces.

We therefore propose progressive indexing. Progressive indexing is designed to be efficient when dealing with multiple types of queries and updates without sacrificing robustness or convergence. We believe that the following modifications with respect to adaptive indexing must be made: (1) the pivot selection does not need to be workload dependent, every query must have a budget to spend on index creation or maintenance, and when an indexed piece is smaller than the L1 cache it is fully ordered; (2) progressive indexing generates a unique index for multiple columns; and (3) other sorting algorithms, besides quick-sort adaptations, are exploited in order to efficiently merge updates while maintaining robustness and convergence.

As a result of the small initial cost, progressive indexing occurs without significant impact on query performance and has a near-immediate return on investment over performing naive scans. Even if the column is only queried a few times, progressive indexing will still provide a performance benefit. On the other hand, if the column is queried thousands of times, the index will reliably converge towards a full index and queries will be answered with the same performance as if a full index had been built.

Single Column. Figure 3 depicts an example of progressive quick-sort. In this example, we define a budget δ = 0.5. Query Q1 starts by triggering the initialize phase of progressive quick-sort. First, it allocates an uninitialized column of the same size as the original column and then selects 9 as a pivot. The original column is scanned and n ∗ δ elements, where n is the column size, are copied either to the top or to the bottom of the copied column, depending on the pivot; while doing so we also select the elements that fulfill the predicates of Q1. A binary search tree (BST) is also built to keep track of the pivot points. Subsequent queries can already leverage the sorted data by performing lookups in the BST. Later, query Q2 triggers the refine phase, fully indexing the column around the pivot. Q3 then selects another pivot, and the refinement process continues until we reach a full index. The main disadvantage of progressive quick-sort is that fast convergence relies entirely on good pivot point selection.

Figure 3: Progressive quick-sort.
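The BST mentioned above only needs to record, for each pivot, where the values greater than or equal to it start. The sketch below is our own illustration, not the paper's data structure: a std::map stands in for the BST of pivots, and the PivotIndex name and probe helper are assumptions. It shows how a range query uses the pivot positions to narrow the region of the partitioned column that still has to be scanned.

<pre>
// Sketch of the pivot bookkeeping for progressive quick-sort.
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <utility>

struct PivotIndex {
    std::map<int64_t, size_t> pivots;  // pivot value -> first position holding values >= pivot
    size_t column_size = 0;

    // Returns the [begin, end) slice of the partitioned column that can still
    // contain values in [low, high]; everything outside it can be skipped.
    std::pair<size_t, size_t> probe(int64_t low, int64_t high) const {
        size_t begin = 0, end = column_size;
        auto it = pivots.upper_bound(low);   // first pivot > low
        if (it != pivots.begin())
            begin = std::prev(it)->second;   // largest pivot <= low: everything before it is < low
        auto jt = pivots.upper_bound(high);  // first pivot > high: everything from it on is > high
        if (jt != pivots.end())
            end = jt->second;
        return {begin, end};
    }
};
</pre>

Consistent with modification (1) above, once the piece between two consecutive pivots is smaller than the L1 cache it can simply be sorted completely.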
Updates. Updates are stored in an extra vector. When a query is executed, this extra column is also fully scanned. When our original column is fully sorted, we drop the BST, since we can now perform a binary search on the ordered column, and we start the merging process with the column that holds the updates. To efficiently merge the updates into our original column we use progressive merge-sort. Progressive merge-sort consists of two build phases. In the first build phase, one unsorted chunk of size n ∗ δ is sorted, where n is the column size and δ is our budget. In order to answer queries, we perform a binary search in the sorted chunks while scanning the unsorted data. In the second build phase, we merge the sorted chunks together using a cascading two-way merge. Note that we do not necessarily complete a full merge in one query, as this would result in large drops in performance when we merge two large chunks together. Instead, we merge at most n ∗ 2δ elements per query and keep track of how far along the merge we are. After a merge is completed, we replace the two original chunks with the merged chunk. Figure 4 depicts the progressive merge-sort algorithm with δ = 0.5: in the first phase half of the column is ordered into a chunk, the second phase orders the remainder of the column into another chunk, and finally both chunks are merged, resulting in a fully ordered column. The main disadvantage of merge-sort is that while we still have many sorted chunks we are performing many random accesses, as we are doing a full binary search in each of the 1/δ chunks.

Figure 4: Progressive merge-sort.

Multi-column. We adapt progressive quick-sort to work with multidimensional queries by generating a KD-Tree on top of the columns. In a KD-Tree, every level of the tree consists of only one column. To maintain this property we interleave our pivots through the columns. Multidimensional progressive quick-sort is visualized in Figure 5. For simplicity, in our example we set δ = 1 (i.e., every iteration fully indexes one pivot). In the first iteration we select the value 3 to use as a pivot for column X. Since our δ = 1, at the end of the first query we will have fully copied the columns and fully indexed column X around 3, while keeping the alignment with column Y. A KD-Tree is then created to keep track of the indexed pivot. Later, when Q2 is executed, it triggers another iteration of progressive indexing. However, this time we are going to reorder column Y over a new pivot, 5. Since we have already indexed X on 3, we need to reorder Y in the two pieces of X to maintain the alignment. The main disadvantage of multidimensional progressive quick-sort is that it does not provide progressive usage of space: after fully indexing the first pivot, we have already made full copies of all the columns.

Figure 5: Multidimensional progressive quick-sort.
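The KD-Tree bookkeeping described above can be sketched as follows (our own illustration; the node layout and the collect_ranges helper are assumptions for this sketch, not the paper's implementation). Each node records the column it pivots on, the pivot value and the split position; a conjunctive range query descends the tree and only keeps the row ranges whose pieces can intersect the predicates.

<pre>
// Sketch of KD-Tree bookkeeping for multidimensional progressive quick-sort.
// Levels interleave the columns: the root splits on X, its children on Y, etc.
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

struct KDNode {
    int column;                     // column this level pivots on (0 = X, 1 = Y, ...)
    int64_t pivot;                  // pivot value for that column
    size_t split_pos;               // rows before split_pos have column value < pivot
    std::unique_ptr<KDNode> left;   // refines the "< pivot" piece (null while unrefined)
    std::unique_ptr<KDNode> right;  // refines the ">= pivot" piece (null while unrefined)
};

// Collects the [begin, end) row ranges that a conjunctive range query
// (lows[c] <= column c <= highs[c]) still has to inspect.
void collect_ranges(const KDNode* node, size_t begin, size_t end,
                    const std::vector<int64_t>& lows, const std::vector<int64_t>& highs,
                    std::vector<std::pair<size_t, size_t>>& out) {
    if (node == nullptr) {          // unrefined piece: it has to be scanned entirely
        out.push_back({begin, end});
        return;
    }
    // The left piece only holds values < pivot in node->column, so it can only
    // match if the query's lower bound on that column is below the pivot.
    if (lows[node->column] < node->pivot)
        collect_ranges(node->left.get(), begin, node->split_pos, lows, highs, out);
    // The right piece only holds values >= pivot in node->column.
    if (highs[node->column] >= node->pivot)
        collect_ranges(node->right.get(), node->split_pos, end, lows, highs, out);
}
</pre>

In the example of Figure 5, after the first iteration the root would be an X node with pivot 3, and after the second iteration both of its children would be Y nodes with pivot 5.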
===4. PRELIMINARY RESULTS===
In this section, we present a brief experimental analysis to demonstrate the strong potential benefits of progressive indexing.

Setup. We implemented progressive quick-sort in a stand-alone program written in C++ and compiled with optimization level 3. All experiments were conducted on a machine running Fedora 26, with an Intel Core i7-2600K CPU @ 3.40 GHz with 8 cores, 16 GB of main memory and an 8192 KB L3 cache. We use an 8-byte integer array with 10^8 uniformly distributed values as our dataset. All queries are of the form SELECT SUM(R.A) FROM R WHERE R.A BETWEEN V1 AND V2. All queries have a selectivity of 0.1 and we define our budget as δ = 0.1. All progressive quick-sort pivots are randomly selected.

In Figure 6 we depict the per-query performance of progressive quick-sort, database cracking, stochastic cracking, the coarse-granular index (i.e., a stochastic cracking variant), a B+tree as a full index, and column scans. We can observe that the adaptive indexing techniques show very volatile performance, with many spikes throughout the query stream, because their index is only refined where the workload touches it. Progressive quick-sort shows a more robust performance, but still presents performance spikes because the partial index creation depends entirely on the pivot selection. Additionally, one can observe that progressive quick-sort converges to a full index around query 120, whereas the adaptive indexing techniques never converge.

Figure 6: Progressive quick-sort compared against cracking, stochastic cracking, the coarse-granular index, a full index, and a scan (per-query time in log(s) over the query stream).
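For orientation, the driver below reproduces the shape of this experiment; it is a sketch written for this summary, not the paper's benchmark harness, and the random seed, domain size and baseline scan are assumptions. It generates 10^8 uniformly distributed 8-byte integers, issues a stream of range-sum queries with roughly 0.1 selectivity, and times each query; the plain scan inside the loop is the baseline that a progressive index implementation would replace.

<pre>
// Sketch of a query-stream driver matching the setup described above.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const size_t n = 100'000'000;          // 10^8 8-byte integers
    const int64_t domain = 100'000'000;    // uniformly distributed values
    std::mt19937_64 gen(42);
    std::uniform_int_distribution<int64_t> value(0, domain - 1);
    std::vector<int64_t> column(n);
    for (auto& v : column) v = value(gen);

    const int64_t range = domain / 10;     // ~0.1 selectivity on uniform data
    std::uniform_int_distribution<int64_t> lo_dist(0, domain - range - 1);
    for (int q = 0; q < 300; ++q) {
        int64_t lo = lo_dist(gen), hi = lo + range;
        auto start = std::chrono::steady_clock::now();
        // SELECT SUM(R.A) FROM R WHERE R.A BETWEEN lo AND hi
        int64_t sum = 0;                   // plain scan; a progressive index would go here
        for (int64_t v : column) if (v >= lo && v <= hi) sum += v;
        std::chrono::duration<double> t = std::chrono::steady_clock::now() - start;
        std::printf("query %d: sum=%lld time=%.4fs\n", q, (long long)sum, t.count());
    }
}
</pre>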
===5. FUTURE RESEARCH===
Progressive indexing introduces new aspects that were unexplored by adaptive indexing and that require further investigation. We envision, as our ultimate goal, the creation of a formal cost model with a decision tree that is able to choose from, and interleave, different progressive indexes in order to optimize the query response time without penalizing robustness or convergence. We describe the following as the aspects that shall be explored:

* Sorting algorithms. We discussed two different algorithms adapted to work as progressive indexes. However, many other well-known algorithms (e.g., radix-sort, bucket-sort, heap-sort) can be adapted to work in a progressive fashion as well.
* Pivot selection. Some sorting algorithms (e.g., quick-sort) require good pivot point selection. Different strategies can be applied to select a pivot: it can be workload dependent, data dependent, completely random, or even a mix of all of the above.
* Data structures. Different data structures can be used to exploit modern processors and speed up access to the ordered data, like the ART-tree [13]; alternatively, keeping only a fully ordered vector and dropping any extra structure might be beneficial for faster updates.
* Index budget. We used a fixed, user-defined fraction of a scan as our budget. However, this budget could adapt itself to keep the cost more robust, just above a scan, until the index is fully generated and ready to be completely exploited.

We point out the following as the research steps that we will follow in the coming years:

# Adapt other sorting algorithms to work progressively and analyze their advantages and disadvantages, defining a cost model with a decision tree to unify all solutions;
# Explore how progressive indexing can be leveraged in more complex database operations, such as joins and aggregations.

Acknowledgments. This work was funded by the Netherlands Organisation for Scientific Research (NWO), project "Data Mining on High-Volume Simulation Output" (Holanda).

===6. REFERENCES===
[1] S. Agrawal, S. Chaudhuri, and V. R. Narasayya. Automated Selection of Materialized Views and Indexes in SQL Databases. In VLDB, 2000.
[2] N. Bruno. Automated Physical Database Design and Tuning. CRC Press, 2011.
[3] S. Chaudhuri and V. R. Narasayya. An Efficient, Cost-Driven Index Selection Tool for Microsoft SQL Server. In VLDB, volume 97, 1997.
[4] D. Comer. The Difficulty of Optimum Index Selection. TODS, 3(4):440–445, 1978.
[5] G. Graefe and H. Kuno. Self-selecting, Self-tuning, Incrementally Optimized Indexes. In EDBT, pages 371–381. ACM, 2010.
[6] H. Gupta, V. Harinarayan, A. Rajaraman, and J. D. Ullman. Index Selection for OLAP. In Data Engineering, 1997.
[7] F. Halim, S. Idreos, P. Karras, and R. H. Yap. Stochastic Database Cracking: Towards Robust Adaptive Indexing in Main-Memory Column-Stores. VLDB, 5(6):502–513, 2012.
[8] P. Holanda and E. C. de Almeida. SPST-Index: A Self-Pruning Splay Tree Index for Caching Database Cracking. In EDBT, pages 458–461, 2017.
[9] S. Idreos, M. L. Kersten, and S. Manegold. Updating a Cracked Database. In SIGMOD, 2007.
[10] S. Idreos, M. L. Kersten, and S. Manegold. Self-organizing Tuple Reconstruction in Column-Stores. In SIGMOD, pages 297–308, 2009.
[11] S. Idreos, M. L. Kersten, S. Manegold, et al. Database Cracking. In CIDR, volume 3, pages 1–8, 2007.
[12] S. Idreos, S. Manegold, H. Kuno, and G. Graefe. Merging What's Cracked, Cracking What's Merged: Adaptive Indexing in Main-Memory Column-Stores. VLDB, 4(9):586–597, 2011.
[13] V. Leis, A. Kemper, and T. Neumann. The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases. In ICDE, 2013.
[14] F. M. Schuhknecht, J. Dittrich, and L. Linden. Adaptive Adaptive Indexing. In ICDE, 2018.
[15] F. M. Schuhknecht, A. Jindal, and J. Dittrich. The Uncracked Pieces in Database Cracking. Proc. VLDB Endow., 7(2):97–108, Oct. 2013.
[16] G. Valentin, M. Zuliani, D. C. Zilio, G. Lohman, and A. Skelley. DB2 Advisor: An Optimizer Smart Enough to Recommend Its Own Indexes. In Data Engineering, 2000.