=Paper= {{Paper |id=Vol-1594/paper3 |storemode=property |title=Challenges of Index Recommendation for Databases |pdfUrl=https://ceur-ws.org/Vol-1594/paper3.pdf |volume=Vol-1594 |authors=Parinaz Ameri |dblpUrl=https://dblp.org/rec/conf/gvd/Ameri16 }} ==Challenges of Index Recommendation for Databases== https://ceur-ws.org/Vol-1594/paper3.pdf
       Challenges of Index Recommendation for Databases
                            [With Specific Evaluation on a NoSQL Database]
                                                           Parinaz Ameri
                                            Karlsruhe Institute of Technology (KIT)
                                          Hermann-von-Helmholtz-Platz 1, Bldg. 449
                                          76344 Eggenstein-Leopoldshafen, Germany
                                                     parinaz.ameri@kit.edu

ABSTRACT                                                                  The index recommendation problem is defined as the course
One important aspect of physical database design is the se-            of determining a subset of indexes so that the benefit of cre-
lection of a proper set of indexes for a workload. Creation            ating them for a particular workload is maximized concern-
of indexes in a database system is subject to storage con-             ing the storage capacity.
straints. It is also affected by the ratio of update operations           Each database normally has a query optimizer that ex-
in the workload. Therefore, the cost and benefit of each               ecutes the cost of running a query with different scenarios
set of indexes should be evaluated by a proper optimiza-               based on available indexes in the database. Each query op-
tion method. The large number of index sets that must                  timizer itself has a cost function to evaluate different sce-
be assessed and the iterative nature of such optimization              narios. The recent trend in developing index recommen-
methods impose an additional load on the database sys-                 dation solution is to benefit from this cost function. This
tem. Therefore, an efficient algorithm is needed to develop a          approach eliminates the risk of developing an entirely sep-
practical framework for automating index recommendation.               arated cost function for the index recommendation system
Furthermore, due to the fundamental differences between                that would recommend indexes which might not even be
data models and query languages of NoSQL databases to                  considered by the optimizer. The estimated cost by the op-
each other and the relational databases, evaluation of such            timizer can directly be used in benefit function of the opti-
index recommendation system for NoSQL databases faces                  mization method.
many challenges. This paper reviews the challenges and my                 On the other hand, utilizing the estimated cost of the
proposed solutions for developing a self-tuning index recom-           query optimizer requires provoking it by the index recom-
mendation system, especially for a document-based NoSQL                mendation system for all of the candidate indexes. A large
database instance.                                                     number of calls to the optimizer can affect the performance
                                                                       of the database itself in response to its applications. There-
                                                                       fore, a good framework design is needed to avoid putting lots
General Terms                                                          of overhead on the in-production database and having bet-
Index Recommendation                                                   ter performance for the index recommendation system. To
                                                                       prevent this problem, we introduce a virtual environment
Keywords                                                               consisting of a sample of the targeted database. The can-
                                                                       didate indexes can be estimated in this environment rather
Indexing, Query Optimizer, NoSQL Databases, Optimiza-                  than in the in-production system itself.
tion, Representative Sampling, Synthetic Workload, Gener-                 The number of candidate indexes can grow drastically in
ation, Benchmark                                                       proportion to the size of the database. Consecutively, a
                                                                       method is needed to eliminate candidate indexes considered
1.   INTRODUCTION                                                      by the recommendation system without removing the most
   The performance of a database is dependent on its phys-             relevant indexes. Our proposed solution is discussed more
ical design. A crucial part of the physical design is the se-          in Section 3.
lection of the proper set of indexes concerning a particular              Furthermore, the performance of the index recommenda-
workload. Given the large size of conventional in-production           tion system needs to be tested by a series of workloads.
databases and the heavy workload of queries on them, au-               This evaluation on NoSQL databases runs to a lack of well-
tomating the process of choosing proper indexes for them is            defined benchmarks. The reason is the various data model
necessary.                                                             of NoSQL databases and the vast variety of their query lan-
                                                                       guages. These differences result in the ineffectiveness of
                                                                       well-known benchmarking models of traditional relational
                                                                       database models for NoSQL databases. This issue and our
                                                                       developed solution for this challenge is discussed more in
                                                                       Section 5.
                                                                          The rest of this paper is organized as following: some re-
                                                                       lated work are surveyed in Section 2. In Section 3, some of
28th GI-Workshop on Foundations of Databases (Grundlagen von Daten-    the practical challenges in developing a framework for a self-
banken), 24.05.2016 - 27.05.2016, Nörten-Hardenberg, Germany.          tuning index recommendation system are explained. Addi-
Copyright is held by the author/owner(s).




                                                                  10
         Challenges of Index Recommendation for Databases



tionally, some of our solutions for these difficulties and the      to the single and repeatedly present combinations of the fre-
proposed architecture for such a framework are presented.           quent queries in the workload. The Frequent Itemset [1] as a
The Section 4 provides a mathematical definition of index           data mining algorithm is used to build single and compound
recommendation problem followed by a discussion about               indexes related to the most frequent queries of the workload.
possible optimization methods to solve the problem. The                However, as described in [3], the frequency of queries
Section 5 provides a brief discussion on difficulties of eval-      should not be the only determining factor for defining can-
uating such database related systems for NoSQL databases            didate index proposal. For example, consider a query that
and a solution that we developed for this problem. At the           user wishes to issue frequently, but the response time of it
end, Section 6 concludes the work.                                  on a large database takes very long time. Due to the long
                                                                    response time, its frequency might fall below the configured
                                                                    threshold of the frequent itemset. To avoid missing such
2.   RELATED WORK                                                   important queries for candidate index evaluation, a proper
   There are several optimization methods to solve this issue.      combinatorial algorithm is needed to combine the frequency
Some recent approaches [7] neglect storage limitations and          and length of a query as two determining factors for gener-
instead calculate a lower bound for the cost of a workload          ating candidate indexes.
based on each query’s individual optimal index. Despite                Extracting attributes from the most frequent queries and
being advantageous, the derived bounds are not pragmatic            setting order for them in a compound index should be done
in a context of the real storage limitations.                       automatically. This is the responsibility of Syntax Analyzer
   In the literature, many different optimization methods are       in Figure 1. This component is crucial for analyzing queries
used to find the optimal solution for this problem such as          and also for scalability of the system to work with large
Knapsack problem usage [16], genetic algorithm [12] or even         workloads. The functionality of the syntax analyzer com-
linear programming optimization techniques [8] and branch-          ponent is directly affected by different query languages and
and-bound [18]. However, this problem is often solved by            data models of NoSQL and relational databases.
using a greedy algorithm [6, 9, 2]. In order to estimate better        The mostly denormalized structure of NoSQL databases
how close we get to the optimal solution, other solutions such      allows each attribute to contain non-scalar values, e.g. ar-
as Integer Linear Program (ILP) can be used [14].                   rays and nested documents in document-based databases.
   Instead, we propose using ILP, because it not only enables       Such data models eliminate usage of join operations while
us to explore more cases than for example the mostly used           querying the data. Consecutively, most NoSQL databases
greedy algorithm, but it also allows us to evaluate the qual-       developed their query language (normally an object-oriented
ity of the optimal solution. Also, by applying Linear Pro-          one) instead of using SQL queries.
gramming relaxation, we can have useful information about              The query for different data models influences the access
approximate solutions that had optimal performance, but             path to the data. Successively, the query optimizer plans are
due to lack of storage space are not chosen.                        affected. Therefore, while developing the syntax analyzer to
                                                                    gather the attributes and order them after each other, the
3.   FRAMEWORK DESIGN AND ITS CHAL-                                 order that the query optimizer of each database arrange its
                                                                    data should be taken into account. We considered develop-
     LENGES                                                         ing the syntax analyzer for the particular case of MongoDB
   The practical challenges of developing a framework for           based on the rules explained in [5].
recommending indexes and our solutions are presented in                The input of this component is the workload of the database
this section.                                                       that is logged in the Profiles. Its output is an ordered set of
   We refer to the set of indexes chosen for cost evaluation        attributes and their corresponding index type that is passed
as candidate indexes. The first challenge in designing a good       to the Miner. Then, the Miner component generates a set
framework for recommending indexes is limiting the search           of most frequent single and compound attributes and sends
space for candidate indexes.                                        it to the Config Evaluator.
   On modern databases, there are not only traditional as-             The new tendency in developing index recommendation
cending and descending index types, but also many new in-           systems is to use the query optimizer of the database itself
dex types, e.g. spatial, text indexes, etc. Consider there          to estimate the costs of running a query with a configuration
are s types of indexes in a database with n attributes in a         of indexes. This approach eliminates the effort to develop
collection. The following equation gives number of possible         an additional cost function for index recommendation. Also,
single and multi-attribute indexes on that collection:              utilizing the query optimizer of the database itself as op-
                                                                    posed to an external cost function prevents recommending
                         n
                         X                                          indexes that in reality are not considered by the database to
                            (sk )n!
                                                            (1)     execute a query.
                           (n − k)!                                    In our proposed and examined approach in [5], all of the
                         k=1
                                                                    candidate indexes are created on a sample set of the targeted
   where k is number of fields and k ≤ n [5]. On a collection
                                                                    data set. Then the chosen indexes by the query optimizer are
of only five attributes and the possibility to create four types
                                                                    returned as the recommended set of indexes. However, this
of indexes, the number of possible indexes exceeds 150,000.
                                                                    approach relies entirely on the cost function of the database.
   Therefore, it is important to limit the search space for
                                                                    The estimation of the distance of the recommended indexes
candidate indexes to the most relevant ones. As presented
                                                                    to the optimal set of indexes is not possible.
in [5], we consider the most relevant indexes as the ones
                                                                       Therefore, developing an ILP optimization method (dis-
derived from attributes of the most frequent queries in the
                                                                    cussed in Section 4) enables us to compare the performance
workload. Accordingly, the search space reduces from all the
                                                                    of these techniques with each other and also evaluate the
possible combinations of the whole attributes in the data set




                                                               11
         Challenges of Index Recommendation for Databases




                         Figure 1: The Architecture of the Index Recommendation System.


recommended set of indexes. To avoid being separated from          read to write operations to the database, the size of the data
the database query optimizer and its internal evaluations          set and the overhead on the in-production system.
while using our defined cost function, the cost of running            Evidently an evaluation of the performance of such index
queries in formula (2) are taken from the query optimizer.         recommendation system is required. The assessment of this
   For this purpose, the cost of running the query with or         system for a NoSQL database in comparison to the simi-
without each of candidate indexes should be inquired from          lar systems for traditional relational databases is subject to
the optimizer. Accordingly, the Config Evaluator in our            many difficulties that are discussed in Section 5.
design in Figure 1 creates all of the candidate indexes on
a sample set of targeted collection and inquiries the cost         4.   FORMULATION OF THE INDEX REC-
of running each query in the workload with and without
indexes. The obtained information from the optimizer along              OMMENDATION PROBLEM
with the required storage space estimation for each index are         In this section, a description of the mathematical form of
passed to the ILP optimizer. This component - which can be         ISP and discussion on optimization methods to solve it are
replaced with any other optimization method - is responsible       given.
for recommending the optimal set of indexes to the Creator            The objective of index recommendation is to recommend
component to create them on the In-Production Database.            the optimal set of indexes for a given database and a work-
   This approach brings us to the second major challenge           load. The workload is a set of m queries as Q = {Q1 , Q2 , ..., Qm }.
in developing index recommendation framework: number of            Let I = {I1 , I2 , ..., In } be the set of all possible indexes. In-
calls to the query optimizer. Each call to the optimizer           dexes can be single- or multi-attribute. Each index has a
for estimating the cost of a query applies an overhead on          corresponding size of s1 to sn .
the system. For a large number of the candidate indexes,              The execution cost of each query Qi is different, depending
the process can interfere with the work of the in-production       on the indexes that are used by it. Not only a single index
system. Moreover, creating all of the candidate indexes on a       but a set of indexes can be used to run one query.
large in-production system takes lots of resources and effects        A configuration is defined as a subset of indexes that are
the performance of the system negatively.                          all used for executing some query, Ck = {Ik1 , Ik2 , ...}. This
   Thus, our index recommendation system contains a Sam-           is known as atomic configuration for a workload [9]. There is
ple System that all of its characteristics are the same as         a set P of all the possible configurations that can be derived
the original database system only smaller. Sand-boxing the         from indexes in I and potentially be used by some query, as
data set in the sample system enables us with obtaining the        P = {C1 , C2 , ...Cp }. Accordingly, each configuration Ck ∈
query costs from the optimizer for a larger number of candi-       P is associated with some subset of Iτ ⊂ I. A configuration
date indexes. This cost estimation and also estimating the         is known as active if all of its indexes are built.
required storage space can be done without disturbing the             Our objective is to find the optimal configuration which
performance of the in-production system.                           its indexes have the maximum benefit for a specific workload
   However, utilizing a sample of the data set imposes some        under the storage constraints of the system. Each query Qi
challenges itself that require careful research. One problem       has a corresponding cost of running with usage of configu-
is that the ratio of presence of each attribute in the original    ration Ck which we call as cost(i, Ck ). Likewise, the cost
data set to its appearance in the sample changes with each         of running query i without any index is cost(i, ∅). There-
write operation (i.e. insert, update and delete) to the sys-       fore, the benefit of running each query with a configuration
tem. The challenge is to keep the sample representative of         is defined as the following:
the current state of the database over time.
                                                                                   bik = cost(i, ∅) − cost(i, Ck )              (2)
   Therefore, an appropriate interval for updating the sam-
ple set should be chosen. Since determining the sample set           It should be considered that having indexes for update
requires reading the entire data set and it implies an over-       queries enforces a maintenance cost on the system. Each
head on the database, the interval should not be too short.        update operation consists of two parts: finding the proper
The optimization of selecting the appropriate interval for         data unit and modifying it. The finding part is yet another
sampling should be done with consideration of the ratio of         query whose benefit can be calculated from formula (2). The




                                                              12
         Challenges of Index Recommendation for Databases



modification part can be considered as an insert (or a delete)              To compare the performance of databases and database-
that does not benefit from having indexes due to the lack                related systems, different sets of workloads are required to
of finding statement. Moreover, each update operation en-                represent various applications. Some examples of such ap-
forces maintenance of indexes that are associated with that              plications are Online Analytical Processing (OLAP) appli-
update operation. In general, a negative benefit −fj can                 cations with their long and aggregated read-mostly set of
be associated with each index Ij that is related to the to-              queries and Online Transaction Processing (OLTP) applica-
tal overhead of that index in association with m0 update                 tions with their short update-intensive set of queries.
operations.                                                                 Evaluation of traditional relational databases is mostly
  Therefore the objective function can be defined as                     done by using variances of the well-known Transaction Pro-
                      p
                    M X                     n
                                                                         cessing Performance Council (TPC) [15] benchmarks. How-
                    X                       X
             max(             bik · xik −         fj · yj )       (3)    ever, utilizing TPC benchmarks for evaluating NoSQL data-
                    i=1 k=1                 j=1
                                                                         bases runs into at least two major challenges: first map-
                                                                         ping of the tabular data format of the relational model to
  where M = m + m0 . This objective function is subject to               a specific NoSQL data format (e.g. documents, key-values,
the following constraints:                                               graphs, etc.) and second translating SQL queries to the com-
                                      p
                                      X                                  monly object-oriented query language of a NoSQL database.
                ∀i ≤ M :                    xik ≤ 1               (4)       The first challenge arises from the fact that the relational
                                      k=1                                data models are highly normalized. Mapping such normal-
                                                                         ized model to the often denormalized models of NoSQL
                                                                         databases requires careful study. For example, in document-
         ∀k : Ij ∈ Ck , ∀i ≤ M, j ≤ n :              xik ≤ yj     (5)
                                                                         based databases, there is the danger that the mapped model
                                                                         either overdo the usage of nested documents or not using
                                                                         this capability at all. Both of these cases directly affect the
        ∀i ≤ M, j ≤ n, k ≤ p :                xik , yj ∈ {0, 1}   (6)
                                                                         query performance, because it is highly dependent on the
                                                                         data access path.
                          n
                          X                                                 In the case of overdoing usage of nested documents, the
                                sj · yj ≤ S                       (7)    access path to the deepest documents is long. Hence, the
                          j=1                                            database performance is negatively affected. Not using the
                                                                         capability of nested documents at all enforces usage of refer-
  Constraint (4) ensures that at most one configuration is               ences in documents to the related documents. In this case,
used for any query. Constraint (5) represents the fact that              due to the lack of join operation in most NoSQL database,
a configuration can not be used unless all of its indexes are            more queries should be issued to fetch the required data. It
built. Constraint (7) enforce the limitation of available stor-          would also affect the performance of the database in com-
age S on the number of indexes that can be built.                        parison to the relational model. A proper mapping between
  Constraint 6 defines the binary nature of the two intro-               of these databases is hard and currently missing.
duced decision variables xik and yj . For each pair of query                There are however other solutions targeting the problem of
Qi and configuration Ck , xik is defined as:                             generating different workloads for various databases. One of
          
             1 query Qi uses conf iguration Ck                           them is the Yahoo Cloud Serving Benchmark (YCSB) [17].
    xik =                                                   (8)          Usage of this solution for evaluating our index recommen-
             0 otherwise
                                                                         dation system which is developed in combination with Mon-
  In the same way, the variable yj is associated with built              goDb [13], a document-base database had a major difficulty:
indexes and can be defined as:                                           the data set and all queries are handled by the MongoDB
                                                                        specific primary key, id field, which is by default indexed
                          1 index Ij is built                            in MongoDB.
               yj =                                               (9)       Another available solution for this problem is the Apache
                          0 otherwise
                                                                         JMeter [10]. Although JMeter provides a good environment
  We consider using ILP methods so that we can estimate                  for benchmarking, it requires the user to enter manually
the quality of the optimal solution in a tight bound, similar            the set of queries for benchmarking. This setup does not
to the approach take by [14]. Also, usage of methods such as             provide an easy possibility of producing a large and diverse
branch-and-bound enables us to improve the quality of an                 set of queries that are distributed over a time interval.
approximate solution that would have optimal performance,                   Therefore, in order to overcome this problem, we pro-
but cannot be built and used due to storage limitation. In               vided a generic workload generator named Not only Work-
general, usage of ILP can provide the same performance                   load Generator (NoWog) which is available in [4]. NoWog
as the other optimization methods while it examines much                 provides many features to fulfill its main objective: gener-
more alternative solutions.                                              ating synthetic workloads similar to realistic workload in an
                                                                         integrated layer. This layer should make NoWog indepen-
5.   EVALUATION CHALLENGES ON NOSQL                                      dent from the data model of the database.
     DATABASES                                                              Some of these features are the possibility to generate large
                                                                         workloads by defining simple rules that configure the dis-
   In the previous section, the theory to design the fitting
                                                                         tribution of different query types in various time intervals,
objective function and our choice for the proper optimiza-
                                                                         usage of arbitrary keys for querying.
tion method were discussed. This section explains the diffi-
                                                                            Development of such flexible and generic workload gen-
culty of evaluating such system, in particular on a document-
                                                                         erator was absolutely necessary for enabling the evaluation
based NoSQL database.




                                                                    13
         Challenges of Index Recommendation for Databases



process of our index recommendation system.                            Conference on Management of Data, SIGMOD ’05,
                                                                       pages 227–238, New York, NY, USA, 2005. ACM.
6.   CONCLUSION AND FUTURE WORK                                    [8] A. Caprara, M. Fischetti, and D. Maio. Exact and
                                                                       approximate algorithms for the index selection
   In this paper, first, we reviewed many challenges in devel-
                                                                       problem in physical database design. IEEE Trans. on
oping a framework for index recommendation system. Some
                                                                       Knowl. and Data Eng., 7(6):955–967, Dec. 1995.
of the major challenges are providing a solution for reduc-
                                                                       http://dx.doi.org/10.1109/69.476501.
ing the search space for candidate indexes and also reducing
the number of calls to the query optimizer for obtaining the       [9] S. Chaudhuri and V. Narasayya. An Efficient,
costs of running a query with different indexes. Our solu-             Cost-Driven Index Selection Tool for Microsoft SQL
tion for the former is to only consider the evaluation of the          Server. In VLDB. Very Large Data Bases Endowment
most relevant query attributes (most frequent and longest              Inc., August 1997.
queries). We solved the latter with introducing a virtual         [10] JMeter. Apache jmeter. http://jmeter.apache.org/
environment containing a sample of the targeted data set.              accessed 22-April-2016.
   Then, a mathematical definition of index recommendation        [11] C. Jung, M. Gasthuber, A. Giesler, M. Hardt,
is presented and some of the possible optimization methods             J. Meyer, F. Rigoll, K. Schwarz, R. Stotzka, and
to solve the problem are discussed. At the end, the chal-              A. Streit. Optimization of data life cycles. Journal of
lenges of evaluating such system for NoSQL database are                Physics: Conference Series, 513(3):032047, 2014.
briefly discussed. The NoWog is introduced as a solution               http:
for generating synthetic workloads with different distribu-            //stacks.iop.org/1742-6596/513/i=3/a=032047.
tions over time.                                                  [12] J. Kratica, I. Ljubic, and D. Tosic. A genetic algorithm
   Investigating similarity of new queries to the previous one         for the index selection problem. Technical report, In
in order to estimate the cost of recommending and creating             Applications of Evolutionary Computing, 2003.
new indexes in contrast to using the already existing ones is     [13] MongoDB. Mongodb for giant ideas.
part of future plans to expand this work.                              https://www.mongodb.com/ accessed 22-April-2016.
                                                                  [14] S. Papadomanolakis and A. Ailamaki. An integer
7.   ACKNOWLEDGMENTS                                                   linear programming approach to database design. In
                                                                       In ICDE Workshop on Self-Managing Databases, 2007.
  The author like to thank Large-Scale Data Management
                                                                  [15] tpc, 2016.
and Analysis project [11] by the German Helmholtz Associ-
                                                                       http://www.tpc.org/information/benchmarks.asp
ation for funding this research.
                                                                       accessed 22-April-2016.
                                                                  [16] G. Valentin, M. Zuliani, D. C. Zilio, G. M. Lohman,
8.   REFERENCES                                                        and A. Skelley. Db2 advisor: An optimizer smart
 [1] R. Agrawal and R. Srikant. Fast algorithms for mining             enough to recommend its own indexes. In Proceedings
     association rules in large databases. In Proceedings of           of the 16th International Conference on Data
     the 20th International Conference on Very Large Data              Engineering, ICDE ’00, pages 101–, Washington, DC,
     Bases, VLDB ’94, pages 487–499, San Francisco, CA,                USA, 2000. IEEE Computer Society. http:
     USA, 1994. Morgan Kaufmann Publishers Inc. http:                  //dl.acm.org/citation.cfm?id=846219.847390.
     //dl.acm.org/citation.cfm?id=645920.672836.                  [17] YCSB. Home brianfrankcooper/ycsb wiki github.
 [2] S. Agrawal, S. Chaudhuri, and V. R. Narasayya.                    https://github.com/brianfrankcooper/YCSB/wiki
     Automated selection of materialized views and indexes             accessed 22-April-2016.
     in sql databases. In Proceedings of the 26th                 [18] D. C. Zilio. Physical database design decision
     International Conference on Very Large Data Bases,                algorithms and concurrent reorganization for parallel
     VLDB ’00, pages 496–505, San Francisco, CA, USA,                  database systems. Technical report, 1997.
     2000. Morgan Kaufmann Publishers Inc.
 [3] P. Ameri. On a self-tuning index recommendation
     approach for databases. Manuscript is accepted to be
     published by the IEEE International Conference on
     Data Engineering (ICDE) PhD Symposium, 2016.
 [4] P. Ameri and H. Guan. Nowog.
     https://github.com/ParinazAmeri/NoWog accessed
     22-April-2016.
 [5] P. Ameri, J. Meyer, and A. Streit. On a new approach
     to the index selection problem using mining
     algorithms. In Big Data (Big Data), 2015 IEEE
     International Conference on, pages 2801–2810, Oct
     2015.
 [6] K. Aouiche and J. Darmont. Data mining-based
     materialized view and index selection in data
     warehouses. CoRR, abs/0707.1548, 2007.
 [7] N. Bruno and S. Chaudhuri. Automatic physical
     database tuning: A relaxation-based approach. In
     Proceedings of the 2005 ACM SIGMOD International




                                                             14