=Paper=
{{Paper
|id=Vol-2175/paper11
|storemode=property
|title=Progressive Indices: Indexing Without Prejudice
|pdfUrl=https://ceur-ws.org/Vol-2175/paper11.pdf
|volume=Vol-2175
|authors=Pedro Holanda
|dblpUrl=https://dblp.org/rec/conf/vldb/Holanda18
}}
==Progressive Indices: Indexing Without Prejudice==
<pdf width="1500px">https://ceur-ws.org/Vol-2175/paper11.pdf</pdf>
<pre>
             Progressive Indices: Indexing without prejudice

                                                              Pedro Holanda
                                                    supervised by Stefan Manegold
                                                    Centrum Wiskunde & Informatica
                                                       Amsterdam, Netherlands
                                                             holanda@cwi.nl


                                                                                                           After 4 Queries   After 10 Queries
ABSTRACT                                                                                 After 3 Queries


Database cracking is a method to create partial indices as a
side-effect of processing queries. Cracking efficiently smears                                       ⇢                 ⇢
out the cost of creating a full index over a stream of queries,
creating an index that is overfitted to queried parts of the
data. This core characteristic of cracking leads to unpre-                                                                                ⇢
dictable performance and unreliable convergence towards a
full index. These problems are aggravated when considering
updates and multidimensional queries.
   We envision a new indexing technique, Progressive Indexing
that improves database cracking by strictly limiting per-query
indexing cost to a budget (e.g., a user-defined fraction of scan
costs), allowing the first and subsequent queries to complete                             Figure 1: Progressive indexing.
without heavy penalties. At the same time, all indexing
effort is spent towards predictable convergence towards a
full index. We discuss different algorithms to deal with
                                                                              adaptive partial indexing approach for relational databases.
multidimensional queries and updates while maintaining a
                                                                              It works by building a partial index as a co-product of query
robust and convergent index, we then explore the research
                                                                              processing. The index is built the first time a column is
space and new challenges that arise from this new technique.
                                                                              queried and is continuously refined as subsequent queries
                                                                              are executed. This way the cost of creating an index is
1.    INTRODUCTION                                                            distributed over a stream of queries.
  Index creation is one of the major difficult decisions in                      However, database cracking and its variations are not ro-
database schema design [4]. Based on the workload, the                        bust against varying workload patterns. Since the index is
database administrator needs to decide whether creating a                     only refined in the areas targeted by the workload. Queries
specific index is worth the overhead of creating and main-                    that deviate from them will target unrefined sections of the in-
taining it. This considerable up-front cost creates a trade-off               dex, leading to large and unpredictable spikes in performance.
that requires careful consideration and experimentation.                      This problem is exacerbated when dealing with updates. In
  Automatic physical design tuning [2] aims to release the                    case the most refined area is also the most updated, and mul-
user of having to manually choose which indexes to cre-                       tidimensional queries (i.e., queries with selections in multiple
ate. They attempt to find the optimal set of indices given                    columns) since one must maintain multiple indices, one for
a query workload, by balancing the benefits of having an                      each column.
index versus the added costs of creating the index and main-                     To address these needs we propose a novel approach for in-
taining it during modifications to the database. However,                     cremental indexes, Progressive Indexing. Where every query
these tools are not able to work on dynamic systems due to                    that is issued to the database results in a fixed refinement
the unpredictability and lack of idle time for a priori index                 of the index, leading to a robust query execution and full
creation.                                                                     convergence towards a full index. While maintaining a low
  Adaptive indexing techniques, such as database crack-                       extra cost per query.
ing [11], attempt to solve this problem by presenting an                         Figure 1 depicts how progressive indexing works. Consid-
                                                                              ering an unindexed column, a pivot  and a budget δ = 0.1,
                                                                              we start by indexing a δ fraction of the data while scanning
                                                                              the remaining 1 − δ piece of data. After 3 queries, 30% of
                                                                              our column is already indexed around . The fourth query
                                                                              starts by performing an index lookup on the ρ fraction of
                                                                              the data that has already been indexed while scanning the
                                                                              unindexed 1 − ρ piece and expanding the index by another δ
Proceedings of the VLDB 2018 PhD Workshop, August 27, 2018. Rio de            fraction of the total column. Finally, after 10 queries, we fully
Janeiro, Brazil.
Copyright (C) 2018 for this paper by its authors. Copying permitted for       indexed  and we continue this process by selecting another .
private and academic purposes.


                                                                          1
  Paper Structure. The rest of this paper is structured                 applied to tuple reconstruction and not to multidimensional
as follows. Section 2 provides an overview of related work.             queries.
Then, section 3, describes progressive indexing. Section 4                Robustness. Stochastic cracking [7, 15] address the un-
presents a brief proof of concept and experimental analysis.            predictable performance problem by creating partitions using
Finally, in section 5, we discuss our research plan.                    a random pivot element instead of pivoting around the query
                                                                        predicates. However, the actual cracking still occurs in the
                                                                        pieces where the query predicates fall into.
2.   RELATED WORK                                                         Generalization. Adaptive adaptive indexing [14] at-
   Automatic physical database design has been an active                tempts to be a general-purpose algorithm for adaptive index-
research field for the past twenty years. These work resulted           ing. It has multiple parameters that can be tuned to mimic
in two different areas; self-tuning tools and adaptive indexing.        the data access of multiple adaptive indexing techniques (e.g.,
   Self-Tuning Tools [1, 3, 16, 6] attempt to solve this                database cracking, sideways cracking, hybrid cracking).
problem by automatically recommending a set of indexes to                 Updates. SPST-Index [8] extends the original cracking
optimize a known workload of the system. However, these                 work on updates [9] by rotating nodes when they are accessed
systems depend on previous workload knowledge and are                   in order to cluster cold-data on the leaves. When updates
only able to create full indexes. Unsuitable for unpredictable          are executed the leaves are pruned, consequently the index
workloads or when there is no idle time to be invested in a             constraints are relaxed resulting in faster updates. However
priori index creation.                                                  this work still alleviates the cost of updates by increasing the
                                                                        cost of cracking for the subsequent queries, bringing more
                                                                        unpredictability to the query costs.

                                                                        3.   PROGRESSIVE INDICES
                                                                           The previous section gave us the necessary motivation
                                                                        for progressive indexing. Motivated by: (1) The workload
                                                                        dependent pivot selection causes performance spikes and
                                                                        does not guarantee convergence towards a full index. (2)
                                                                        Cracking is not able to efficiently handle multidimensional
                                                                        queries since it must maintain multiple indices and intersect
                                                                        their points, and (3) the robustness is penalized when dealing
                                                                        with updates due to partial index removal and the possibility
                                                                        of updates being focused in refined cracked pieces.
                                                                           We propose progressive indexing. Progressive indexing is
                                                                        designed to be efficient when dealing with multiple types
              Figure 2: Database cracking.                              of queries and updates without sacrificing robustness or
                                                                        convergence. We believe that the following modifications
   Adaptive Indexing [15] is an alternative to the self-                from adaptive indexing must be taken: (1) the pivot-selection
tuning tools. It is especially useful in scenarios where the            does not need to be workload dependent, every query must
workload is unpredictable and there is no idle time to in-              have a budget to spend on indexing creation or maintenance
vest in index creation. It tackles these problems by creating           and when an indexed piece is smaller than L1 cache it is
indexes that are workload dependent in an incremental fash-             fully ordered. (2) Progressive indexing generates a unique
ion. Figure 2 depicts an example of database cracking [11].             index for multiple columns, and (3) other sorting algorithms,
Query Q1 starts by triggering the creation of the cracker               besides quick-sort adaptations, are exploited in order to
column(i.e., initially a copy of column A) where the tuples             efficiently merge updates while maintaining robustness and
are clustered in three pieces reflecting the range predicate of         convergence.
Q1 . The result of Q1 is then retrieved as a view on the Piece             As a result of the small initial cost, progressive indexing
colored in red (i.e., 10 < A < 14). Later, query Q2 requires            occurs without significant impact on query performance and
a refinement of Pieces 1 and 3 (i.e., respectively indexing             has a near-immediate return on investment over performing
A > 7 and A ≤ 16), splitting each in two new pieces.                    naive scans. Even if the column is only queried a few times,
   Database cracking has multiple issues: (1) poor conver-              progressive indexing will still provide a performance benefit.
gence towards a full index, (2) inefficient tuple reconstruction,       On the other hand, if the column is queried thousands of
(3) unpredictable performance and (4) inefficient updates.              times, the index will reliably converge towards a full index
Below we briefly discuss the research that addresses these              and queries will be answered at the same performance as if
issues.                                                                 a full index had been built.
   Convergence. Hybrid cracking [5, 12, 15] mitigate the                   Single Column. Figure 3 depicts an example of progres-
issue of poor convergence towards a full index by executing             sive quick-sort. In this example, we define a budget δ = 0.5.
many initial cracking runs with random pivots. Although                 Query Q1 starts by triggering the initialize phase from pro-
this provides better convergence and higher robustness it               gressive quick-sort. First, it allocates an uninitialized column
greatly impacts the cost of the first query.                            of the same size of the original column and then selects 9 as
   Tuple Reconstruction. Sideways cracking [10, 15] ad-                 a pivot . The original column is scanned and n ∗ δ, where n
dress the inefficient tuple reconstruction problem. It mini-            is the column size, elements are copied either to the top or
mizes the tuple reconstruction cost by using a “cracker maps”           the bottom of the copied column, depending on the pivot,
data structure. They provide a mapping between attributes               while doing so we also select the elements that fulfill Q1
that are combined in queries. However, the strategy is only             predicates. A binary search tree (BST) is also formed to


                                                                    2
         13 Initialize 4                 4   Refine    4                                                                                           X,3
                                                                                                 X,3 >=                                    <               >=
         16            9 ≤9              9             3                                    <
         4             2                 2             2 ≤4                                                                          Y,5                         Y,5
                                            ≤9                    ≤9                                                          <                            <            >=
         9             ?                 7             1                                 Pos:2           Pos:2                              >=


                         Uninitialized
         2             ?                 1             7 >4
                                                                                                                            Pos:1          Pos:2         Pos:5         Pos:6
         12            ?                 3             9
         7             ?                 11            11
         1             ?                 14            14                       ID X Y      Full Iteration       ID X   Y           Full Iteration             ID X Y
         19            ?                 19
                                            >9
                                                       19
                                                                  >9             1 7 3                            2 1   5                                      9 2 4
         3            12                 12            12
         14           16 > 9             16            16                        2 1 5                            9 2   4                                      2 1 5
         11           13                 13            13                        3 6 2                            7 3   9                                      3 6 2
                                                                                 4 8 7                            8 5   6                                      1 7 3
                                                                                 5 9 1                            6 4   8                                      5 9 1
           Figure 3: Progressive quick-sort.                                     6 4 8                            3 6   2                                      4 8 7
                                                                                 7 3 9                            1 7   3                                      6 4 8
                                                                                 8 5 6                            4 8   7                                      8 5 6
keep track of the pivot points. The subsequent queries can                       9 2 4                            5 9   1                                      7 3 9
already leverage from the sorted data by performing lookups
in the BST. Later, query Q2 triggers the refine phase fully
indexing the column around . Q3 then select another , and                Figure 5: Multidimensional progressive quick-sort.
the refinement process continues until we reach a full index.
The main disadvantage of progressive quick-sort is that fast
convergence relies entirely on good pivot point selection.                    Multi-column. We adapt progressive quick-sort to work
                                                                           with multidimensional queries by generating a KD-Tree on
              13 Initialize 2                   2
                                                                           top of the columns. In a KD-Tree every level of the tree
                                                      Refine 1
              16            4                   4            2             consists of only one column. To maintain this property we
              4             9                   9            3             interleave our pivots through the columns. Multidimensional
              9             12                  12           4
              2             13                  13           7
                                                                           progressive quick-sort can be visualized in figure 5. For
              12            16                  16           9             simplicity in our example we set δ = 1 (i.e., every iteration
              7                                 1            11            fully indexes one pivot). In the first iteration we select a
              1                                 3            12             = 3 to use as a pivot for column X. Since our δ = 1 at the
              19                                7            13
              3                                 11           14            end of the first query, we will have fully copied the columns
              14                                14           16            and fully indexed the column X around 3, while keeping the
              11                                19           19            alignment with column Y . A KD-Tree is then created to
                                                                           keep track of the indexed pivot. Later, when Q2 is executed
                                                                           it triggers another iteration of progressive indexing. However,
          Figure 4: Progressive merge-sort.                                this time we are going to reorder the column Y over a new
                                                                            = 5. Since we have already indexed X on 3, we need to
                                                                           reorder Y in the two pieces of X to maintain the alignment.T
   Updates. Updates are stored in an extra vector. When
                                                                           he main disadvantage of multidimensional progressive quick-
a query is executed this extra column is also fully scanned.
                                                                           sort is that it does not provide progressive usage of space,
When our original column is fully sorted, we drop the BST,
                                                                           after fully indexing the first pivot, we have already made full
since we can now perform a binary search on the ordered
                                                                           copies of all the columns.
column, and start the merging process with the column that
holds the updates. To efficiently merge the updates into our
original column we use progressive merge-sort. Progressive                 4.    PRELIMINARY RESULTS
merge-sort consists of two build phases. In the first build                   In this section, we present a brief experimental analysis
phase, one unsorted chunk of size n ∗ δ is sorted, where n                 to demonstrate the strong potential benefits of progressive
is the column size and δ is our budget. In order to answer                 indexing.
queries, we perform a binary search in the sorted chunks                      Setup. We implemented progressive quick-sort in a stand-
while scanning the unsorted data. In the second build phase,               alone program written in C++ and optimized using level
we merge the sorted chunks together using a cascading two-                 3. All experiments were conducted on a machine running
way merge. Note that we do not necessarily complete a                      Fedora 26, with an Intel Core i7-2600K CPU @ 3.40 GHz
full merge in one query, as this would result in large drops               with 8 cores, 16 GB of main memory and 8192 KB L3 cache
in performance when we merge two large chunks together.                    size.
Instead, we merge at most n ∗ 2δ elements and keep track of                   We use an 8-byte integer array with 108 uniformly dis-
how far along the merge we are. After a merge is completed,                tributed values as our dataset. All queries are of the form:
we replace the two original chunks with the merged chunk.                  SELECT SUM(R.A) FROM R WHERE R.A BETWEEN V1 AND V2 .
Figure 4 depicts the progressive merge-sort algorithm with                 All the queries have selectivity equal to 0.1 and we define
δ = 0.5, in the first phase half the column is ordered in                  our budget δ = 0.1. All the progressive quick-sort pivots are
a chunk. The second phase orders the remaining of the                      randomly selected.
column in another chunk and finally both chunks are merged                    In Figure 6 we depict the per-query performance evalua-
resulting in a fully ordered column. The main disadvantage                 tion of progressive quick-sort, database cracking, stochastic
of merge-sort is that while we have many sorted chunks, we                 cracking, coarse-granular index (i.e., a stochastic cracking
are performing many random accesses, as we are doing a full                variant), a B+tree as a full index and column scans. We
binary search in each of the 1/δ chunks.                                   can observe that adaptive indexing techniques show a very


                                                                       3
                                                              Coarse
                                                                                    1. Adapt other sorting algorithms to work progressively
     Query Time (log(s))                                      Cracking                 and analyze their advantages and disadvantages defin-
                                                              Quick
                                                              Stochastic               ing a cost-model with a decision tree to unify all solu-
                             1.0
                                                                                       tions;
                                                                                    2. Explore how progressive indexing can be leveraged in
                           Scan                                                        more complex database operations, such as joins and
                                                                                       aggregations.
                                                                                  Acknowledgments. This work was funded by the Nether-
                           Index                                                lands Organisation for Scientific Research (NWO), project
                                   0      100        200     300               “Data Mining on High-Volume Simulation Output” (Holanda).
                                                Query (#)
                                                                               6.     REFERENCES
                                                                                [1] S. Agrawal, S. Chaudhuri, and V. R. Narasayya.
                               Figure 6: Progressive Quick-sort.                    Automated Selection of Materialized Views and
                                                                                    Indexes in SQL Databases. In VLDB, 2000.
                                                                                [2] N. Bruno. Automated Physical Database Design and
volatile performance with much more spikes throughout our                           Tunning. CRC-Press, 2011.
query stream. This is due to the index refined by the work-                     [3] S. Chaudhuri and V. R. Narasayya. An Efficient,
load. One can observe that progressive quick-sort has a more                        Cost-Driven Index Selection Tool for Microsoft SQL
robust performance but still presents performance spikes                            Server. In VLDB, volume 97, 1997.
due to the partial index creation depending entirely on the
                                                                                [4] D. Comer. The Difficulty of Optimum Index Selection.
pivot selection. Additionally one can observe that progres-
                                                                                    TODS, 3(4):440–445, 1978.
sive quick-sort converges to a full index around query 120,
                                                                                [5] G. Graefe and H. Kuno. Self-selecting, self-tuning,
whereas the adaptive index techniques never converge.
                                                                                    incrementally optimized indexes. In EDBT, pages
                                                                                    371–381. ACM, 2010.
5.             FUTURE RESEARCH                                                  [6] H. Gupta, V. Harinarayan, A. Rajaraman, and J. D.
  Progressive indexing introduces new aspects that were                             Ullman. Index Selection for OLAP. In Data
unexplored by adaptive indexing and that require further                            Engineering, 1997.
investigation. We envision, as our ultimate goal, the creation                  [7] F. Halim, S. Idreos, P. Karras, and R. H. Yap.
of a formal cost-model with a decision tree that is able to                         Stochastic Database Cracking: Towards Robust
choose and interleave from different progressive indexes. In                        Adaptive Indexing in Main-Memory Column-Stores.
order to optimize the query response time without penalizing                        VLDB, 5(6):502–513, 2012.
robustness or convergence. We describe the following as the                     [8] P. Holanda and E. C. de Almeida. SPST-Index: A
aspects that shall be explored:                                                     Self-Pruning Splay Tree Index for Caching Database
                                                                                    Cracking. In EDBT, pages 458–461, 2017.
     • Sorting-algorithms. We discussed two different al-                       [9] S. Idreos, M. L. Kersten, and S. Manegold. Updating a
       gorithms adapted to work as progressive indexes. How-                        cracked database. In SIGMOD, 2007.
       ever, many other well-known algorithms (e.g., radix-                    [10] S. Idreos, M. L. Kersten, and S. Manegold.
       sort, bucket-sort, heap-sort) can be adapted to work in                      Self-organizing Tuple Reconstruction in Column-stores.
       a progressive fashion as well.                                               SIGMOD, pages 297–308, 2009.
                                                                               [11] S. Idreos, M. L. Kersten, S. Manegold, et al. Database
     • Pivot-selection. Some sorting algorithms (e.g., quick-                       Cracking. In CIDR, volume 3, pages 1–8, 2007.
       sort) require a good pivot point selection. Different
                                                                               [12] S. Idreos, S. Manegold, H. Kuno, and G. Graefe.
       strategies can be applied to select a pivot. It can
                                                                                    Merging What’s Cracked, Cracking What’s Merged:
       be workload dependent, data dependent, completely
                                                                                    Adaptive Indexing in Main-Memory Column-Stores.
       random or even a mix from all of the above;
                                                                                    VLDB, 4(9):586–597, 2011.
                                                                               [13] V. Leis, A. Kemper, and T. Neumann. The adaptive
     • Data Structure. Different data structures can be
                                                                                    radix tree: Artful indexing for main-memory databases.
       used to exploit modern processes and boost access to
                                                                                    In ICDE, 2013.
       the ordered data, like the ART-tree [13], or even just
       having a fully ordered vector and dropping any extra                    [14] F. M. Schuhknecht, J. Dittrich, and L. Linden.
       structure might be beneficial for faster updates;                            Adaptive adaptive indexing. ICDE, 2018.
                                                                               [15] F. M. Schuhknecht, A. Jindal, and J. Dittrich. The
     • Index Budget. We used as a budget a fixed user-                              Uncracked Pieces in Database Cracking. Proc. VLDB
       defined constant from a scan as our budget. However,                         Endow., 7(2):97–108, Oct. 2013.
       this budget can adapt itself to keep a more robust cost,                [16] G. Valentin, M. Zuliani, D. C. Zilio, G. Lohman, and
       above the scan, until the index is fully generated and                       A. Skelley. DB2 Advisor: An Optimizer Smart Enough
       ready to be completely exploited.                                            to Recommend Its Own Indexes. In Data Engineering,
                                                                                    2000.
  We point out the following as the research steps that we
will follow in the next coming years:


                                                                           4

</pre>