=Paper=
{{Paper
|id=Vol-1594/paper3
|storemode=property
|title=Challenges of Index Recommendation for Databases
|pdfUrl=https://ceur-ws.org/Vol-1594/paper3.pdf
|volume=Vol-1594
|authors=Parinaz Ameri
|dblpUrl=https://dblp.org/rec/conf/gvd/Ameri16
}}
==Challenges of Index Recommendation for Databases==
Challenges of Index Recommendation for Databases
[With Specific Evaluation on a NoSQL Database]
Parinaz Ameri
Karlsruhe Institute of Technology (KIT)
Hermann-von-Helmholtz-Platz 1, Bldg. 449
76344 Eggenstein-Leopoldshafen, Germany
parinaz.ameri@kit.edu
ABSTRACT The index recommendation problem is defined as the course
One important aspect of physical database design is the se- of determining a subset of indexes so that the benefit of cre-
lection of a proper set of indexes for a workload. Creation ating them for a particular workload is maximized concern-
of indexes in a database system is subject to storage con- ing the storage capacity.
straints. It is also affected by the ratio of update operations Each database normally has a query optimizer that ex-
in the workload. Therefore, the cost and benefit of each ecutes the cost of running a query with different scenarios
set of indexes should be evaluated by a proper optimiza- based on available indexes in the database. Each query op-
tion method. The large number of index sets that must timizer itself has a cost function to evaluate different sce-
be assessed and the iterative nature of such optimization narios. The recent trend in developing index recommen-
methods impose an additional load on the database sys- dation solution is to benefit from this cost function. This
tem. Therefore, an efficient algorithm is needed to develop a approach eliminates the risk of developing an entirely sep-
practical framework for automating index recommendation. arated cost function for the index recommendation system
Furthermore, due to the fundamental differences between that would recommend indexes which might not even be
data models and query languages of NoSQL databases to considered by the optimizer. The estimated cost by the op-
each other and the relational databases, evaluation of such timizer can directly be used in benefit function of the opti-
index recommendation system for NoSQL databases faces mization method.
many challenges. This paper reviews the challenges and my On the other hand, utilizing the estimated cost of the
proposed solutions for developing a self-tuning index recom- query optimizer requires provoking it by the index recom-
mendation system, especially for a document-based NoSQL mendation system for all of the candidate indexes. A large
database instance. number of calls to the optimizer can affect the performance
of the database itself in response to its applications. There-
fore, a good framework design is needed to avoid putting lots
General Terms of overhead on the in-production database and having bet-
Index Recommendation ter performance for the index recommendation system. To
prevent this problem, we introduce a virtual environment
Keywords consisting of a sample of the targeted database. The can-
didate indexes can be estimated in this environment rather
Indexing, Query Optimizer, NoSQL Databases, Optimiza- than in the in-production system itself.
tion, Representative Sampling, Synthetic Workload, Gener- The number of candidate indexes can grow drastically in
ation, Benchmark proportion to the size of the database. Consecutively, a
method is needed to eliminate candidate indexes considered
1. INTRODUCTION by the recommendation system without removing the most
The performance of a database is dependent on its phys- relevant indexes. Our proposed solution is discussed more
ical design. A crucial part of the physical design is the se- in Section 3.
lection of the proper set of indexes concerning a particular Furthermore, the performance of the index recommenda-
workload. Given the large size of conventional in-production tion system needs to be tested by a series of workloads.
databases and the heavy workload of queries on them, au- This evaluation on NoSQL databases runs to a lack of well-
tomating the process of choosing proper indexes for them is defined benchmarks. The reason is the various data model
necessary. of NoSQL databases and the vast variety of their query lan-
guages. These differences result in the ineffectiveness of
well-known benchmarking models of traditional relational
database models for NoSQL databases. This issue and our
developed solution for this challenge is discussed more in
Section 5.
The rest of this paper is organized as following: some re-
lated work are surveyed in Section 2. In Section 3, some of
28th GI-Workshop on Foundations of Databases (Grundlagen von Daten- the practical challenges in developing a framework for a self-
banken), 24.05.2016 - 27.05.2016, Nörten-Hardenberg, Germany. tuning index recommendation system are explained. Addi-
Copyright is held by the author/owner(s).
10
Challenges of Index Recommendation for Databases
tionally, some of our solutions for these difficulties and the to the single and repeatedly present combinations of the fre-
proposed architecture for such a framework are presented. quent queries in the workload. The Frequent Itemset [1] as a
The Section 4 provides a mathematical definition of index data mining algorithm is used to build single and compound
recommendation problem followed by a discussion about indexes related to the most frequent queries of the workload.
possible optimization methods to solve the problem. The However, as described in [3], the frequency of queries
Section 5 provides a brief discussion on difficulties of eval- should not be the only determining factor for defining can-
uating such database related systems for NoSQL databases didate index proposal. For example, consider a query that
and a solution that we developed for this problem. At the user wishes to issue frequently, but the response time of it
end, Section 6 concludes the work. on a large database takes very long time. Due to the long
response time, its frequency might fall below the configured
threshold of the frequent itemset. To avoid missing such
2. RELATED WORK important queries for candidate index evaluation, a proper
There are several optimization methods to solve this issue. combinatorial algorithm is needed to combine the frequency
Some recent approaches [7] neglect storage limitations and and length of a query as two determining factors for gener-
instead calculate a lower bound for the cost of a workload ating candidate indexes.
based on each query’s individual optimal index. Despite Extracting attributes from the most frequent queries and
being advantageous, the derived bounds are not pragmatic setting order for them in a compound index should be done
in a context of the real storage limitations. automatically. This is the responsibility of Syntax Analyzer
In the literature, many different optimization methods are in Figure 1. This component is crucial for analyzing queries
used to find the optimal solution for this problem such as and also for scalability of the system to work with large
Knapsack problem usage [16], genetic algorithm [12] or even workloads. The functionality of the syntax analyzer com-
linear programming optimization techniques [8] and branch- ponent is directly affected by different query languages and
and-bound [18]. However, this problem is often solved by data models of NoSQL and relational databases.
using a greedy algorithm [6, 9, 2]. In order to estimate better The mostly denormalized structure of NoSQL databases
how close we get to the optimal solution, other solutions such allows each attribute to contain non-scalar values, e.g. ar-
as Integer Linear Program (ILP) can be used [14]. rays and nested documents in document-based databases.
Instead, we propose using ILP, because it not only enables Such data models eliminate usage of join operations while
us to explore more cases than for example the mostly used querying the data. Consecutively, most NoSQL databases
greedy algorithm, but it also allows us to evaluate the qual- developed their query language (normally an object-oriented
ity of the optimal solution. Also, by applying Linear Pro- one) instead of using SQL queries.
gramming relaxation, we can have useful information about The query for different data models influences the access
approximate solutions that had optimal performance, but path to the data. Successively, the query optimizer plans are
due to lack of storage space are not chosen. affected. Therefore, while developing the syntax analyzer to
gather the attributes and order them after each other, the
3. FRAMEWORK DESIGN AND ITS CHAL- order that the query optimizer of each database arrange its
data should be taken into account. We considered develop-
LENGES ing the syntax analyzer for the particular case of MongoDB
The practical challenges of developing a framework for based on the rules explained in [5].
recommending indexes and our solutions are presented in The input of this component is the workload of the database
this section. that is logged in the Profiles. Its output is an ordered set of
We refer to the set of indexes chosen for cost evaluation attributes and their corresponding index type that is passed
as candidate indexes. The first challenge in designing a good to the Miner. Then, the Miner component generates a set
framework for recommending indexes is limiting the search of most frequent single and compound attributes and sends
space for candidate indexes. it to the Config Evaluator.
On modern databases, there are not only traditional as- The new tendency in developing index recommendation
cending and descending index types, but also many new in- systems is to use the query optimizer of the database itself
dex types, e.g. spatial, text indexes, etc. Consider there to estimate the costs of running a query with a configuration
are s types of indexes in a database with n attributes in a of indexes. This approach eliminates the effort to develop
collection. The following equation gives number of possible an additional cost function for index recommendation. Also,
single and multi-attribute indexes on that collection: utilizing the query optimizer of the database itself as op-
posed to an external cost function prevents recommending
n
X indexes that in reality are not considered by the database to
(sk )n!
(1) execute a query.
(n − k)! In our proposed and examined approach in [5], all of the
k=1
candidate indexes are created on a sample set of the targeted
where k is number of fields and k ≤ n [5]. On a collection
data set. Then the chosen indexes by the query optimizer are
of only five attributes and the possibility to create four types
returned as the recommended set of indexes. However, this
of indexes, the number of possible indexes exceeds 150,000.
approach relies entirely on the cost function of the database.
Therefore, it is important to limit the search space for
The estimation of the distance of the recommended indexes
candidate indexes to the most relevant ones. As presented
to the optimal set of indexes is not possible.
in [5], we consider the most relevant indexes as the ones
Therefore, developing an ILP optimization method (dis-
derived from attributes of the most frequent queries in the
cussed in Section 4) enables us to compare the performance
workload. Accordingly, the search space reduces from all the
of these techniques with each other and also evaluate the
possible combinations of the whole attributes in the data set
11
Challenges of Index Recommendation for Databases
Figure 1: The Architecture of the Index Recommendation System.
recommended set of indexes. To avoid being separated from read to write operations to the database, the size of the data
the database query optimizer and its internal evaluations set and the overhead on the in-production system.
while using our defined cost function, the cost of running Evidently an evaluation of the performance of such index
queries in formula (2) are taken from the query optimizer. recommendation system is required. The assessment of this
For this purpose, the cost of running the query with or system for a NoSQL database in comparison to the simi-
without each of candidate indexes should be inquired from lar systems for traditional relational databases is subject to
the optimizer. Accordingly, the Config Evaluator in our many difficulties that are discussed in Section 5.
design in Figure 1 creates all of the candidate indexes on
a sample set of targeted collection and inquiries the cost 4. FORMULATION OF THE INDEX REC-
of running each query in the workload with and without
indexes. The obtained information from the optimizer along OMMENDATION PROBLEM
with the required storage space estimation for each index are In this section, a description of the mathematical form of
passed to the ILP optimizer. This component - which can be ISP and discussion on optimization methods to solve it are
replaced with any other optimization method - is responsible given.
for recommending the optimal set of indexes to the Creator The objective of index recommendation is to recommend
component to create them on the In-Production Database. the optimal set of indexes for a given database and a work-
This approach brings us to the second major challenge load. The workload is a set of m queries as Q = {Q1 , Q2 , ..., Qm }.
in developing index recommendation framework: number of Let I = {I1 , I2 , ..., In } be the set of all possible indexes. In-
calls to the query optimizer. Each call to the optimizer dexes can be single- or multi-attribute. Each index has a
for estimating the cost of a query applies an overhead on corresponding size of s1 to sn .
the system. For a large number of the candidate indexes, The execution cost of each query Qi is different, depending
the process can interfere with the work of the in-production on the indexes that are used by it. Not only a single index
system. Moreover, creating all of the candidate indexes on a but a set of indexes can be used to run one query.
large in-production system takes lots of resources and effects A configuration is defined as a subset of indexes that are
the performance of the system negatively. all used for executing some query, Ck = {Ik1 , Ik2 , ...}. This
Thus, our index recommendation system contains a Sam- is known as atomic configuration for a workload [9]. There is
ple System that all of its characteristics are the same as a set P of all the possible configurations that can be derived
the original database system only smaller. Sand-boxing the from indexes in I and potentially be used by some query, as
data set in the sample system enables us with obtaining the P = {C1 , C2 , ...Cp }. Accordingly, each configuration Ck ∈
query costs from the optimizer for a larger number of candi- P is associated with some subset of Iτ ⊂ I. A configuration
date indexes. This cost estimation and also estimating the is known as active if all of its indexes are built.
required storage space can be done without disturbing the Our objective is to find the optimal configuration which
performance of the in-production system. its indexes have the maximum benefit for a specific workload
However, utilizing a sample of the data set imposes some under the storage constraints of the system. Each query Qi
challenges itself that require careful research. One problem has a corresponding cost of running with usage of configu-
is that the ratio of presence of each attribute in the original ration Ck which we call as cost(i, Ck ). Likewise, the cost
data set to its appearance in the sample changes with each of running query i without any index is cost(i, ∅). There-
write operation (i.e. insert, update and delete) to the sys- fore, the benefit of running each query with a configuration
tem. The challenge is to keep the sample representative of is defined as the following:
the current state of the database over time.
bik = cost(i, ∅) − cost(i, Ck ) (2)
Therefore, an appropriate interval for updating the sam-
ple set should be chosen. Since determining the sample set It should be considered that having indexes for update
requires reading the entire data set and it implies an over- queries enforces a maintenance cost on the system. Each
head on the database, the interval should not be too short. update operation consists of two parts: finding the proper
The optimization of selecting the appropriate interval for data unit and modifying it. The finding part is yet another
sampling should be done with consideration of the ratio of query whose benefit can be calculated from formula (2). The
12
Challenges of Index Recommendation for Databases
modification part can be considered as an insert (or a delete) To compare the performance of databases and database-
that does not benefit from having indexes due to the lack related systems, different sets of workloads are required to
of finding statement. Moreover, each update operation en- represent various applications. Some examples of such ap-
forces maintenance of indexes that are associated with that plications are Online Analytical Processing (OLAP) appli-
update operation. In general, a negative benefit −fj can cations with their long and aggregated read-mostly set of
be associated with each index Ij that is related to the to- queries and Online Transaction Processing (OLTP) applica-
tal overhead of that index in association with m0 update tions with their short update-intensive set of queries.
operations. Evaluation of traditional relational databases is mostly
Therefore the objective function can be defined as done by using variances of the well-known Transaction Pro-
p
M X n
cessing Performance Council (TPC) [15] benchmarks. How-
X X
max( bik · xik − fj · yj ) (3) ever, utilizing TPC benchmarks for evaluating NoSQL data-
i=1 k=1 j=1
bases runs into at least two major challenges: first map-
ping of the tabular data format of the relational model to
where M = m + m0 . This objective function is subject to a specific NoSQL data format (e.g. documents, key-values,
the following constraints: graphs, etc.) and second translating SQL queries to the com-
p
X monly object-oriented query language of a NoSQL database.
∀i ≤ M : xik ≤ 1 (4) The first challenge arises from the fact that the relational
k=1 data models are highly normalized. Mapping such normal-
ized model to the often denormalized models of NoSQL
databases requires careful study. For example, in document-
∀k : Ij ∈ Ck , ∀i ≤ M, j ≤ n : xik ≤ yj (5)
based databases, there is the danger that the mapped model
either overdo the usage of nested documents or not using
this capability at all. Both of these cases directly affect the
∀i ≤ M, j ≤ n, k ≤ p : xik , yj ∈ {0, 1} (6)
query performance, because it is highly dependent on the
data access path.
n
X In the case of overdoing usage of nested documents, the
sj · yj ≤ S (7) access path to the deepest documents is long. Hence, the
j=1 database performance is negatively affected. Not using the
capability of nested documents at all enforces usage of refer-
Constraint (4) ensures that at most one configuration is ences in documents to the related documents. In this case,
used for any query. Constraint (5) represents the fact that due to the lack of join operation in most NoSQL database,
a configuration can not be used unless all of its indexes are more queries should be issued to fetch the required data. It
built. Constraint (7) enforce the limitation of available stor- would also affect the performance of the database in com-
age S on the number of indexes that can be built. parison to the relational model. A proper mapping between
Constraint 6 defines the binary nature of the two intro- of these databases is hard and currently missing.
duced decision variables xik and yj . For each pair of query There are however other solutions targeting the problem of
Qi and configuration Ck , xik is defined as: generating different workloads for various databases. One of
1 query Qi uses conf iguration Ck them is the Yahoo Cloud Serving Benchmark (YCSB) [17].
xik = (8) Usage of this solution for evaluating our index recommen-
0 otherwise
dation system which is developed in combination with Mon-
In the same way, the variable yj is associated with built goDb [13], a document-base database had a major difficulty:
indexes and can be defined as: the data set and all queries are handled by the MongoDB
specific primary key, id field, which is by default indexed
1 index Ij is built in MongoDB.
yj = (9) Another available solution for this problem is the Apache
0 otherwise
JMeter [10]. Although JMeter provides a good environment
We consider using ILP methods so that we can estimate for benchmarking, it requires the user to enter manually
the quality of the optimal solution in a tight bound, similar the set of queries for benchmarking. This setup does not
to the approach take by [14]. Also, usage of methods such as provide an easy possibility of producing a large and diverse
branch-and-bound enables us to improve the quality of an set of queries that are distributed over a time interval.
approximate solution that would have optimal performance, Therefore, in order to overcome this problem, we pro-
but cannot be built and used due to storage limitation. In vided a generic workload generator named Not only Work-
general, usage of ILP can provide the same performance load Generator (NoWog) which is available in [4]. NoWog
as the other optimization methods while it examines much provides many features to fulfill its main objective: gener-
more alternative solutions. ating synthetic workloads similar to realistic workload in an
integrated layer. This layer should make NoWog indepen-
5. EVALUATION CHALLENGES ON NOSQL dent from the data model of the database.
DATABASES Some of these features are the possibility to generate large
workloads by defining simple rules that configure the dis-
In the previous section, the theory to design the fitting
tribution of different query types in various time intervals,
objective function and our choice for the proper optimiza-
usage of arbitrary keys for querying.
tion method were discussed. This section explains the diffi-
Development of such flexible and generic workload gen-
culty of evaluating such system, in particular on a document-
erator was absolutely necessary for enabling the evaluation
based NoSQL database.
13
Challenges of Index Recommendation for Databases
process of our index recommendation system. Conference on Management of Data, SIGMOD ’05,
pages 227–238, New York, NY, USA, 2005. ACM.
6. CONCLUSION AND FUTURE WORK [8] A. Caprara, M. Fischetti, and D. Maio. Exact and
approximate algorithms for the index selection
In this paper, first, we reviewed many challenges in devel-
problem in physical database design. IEEE Trans. on
oping a framework for index recommendation system. Some
Knowl. and Data Eng., 7(6):955–967, Dec. 1995.
of the major challenges are providing a solution for reduc-
http://dx.doi.org/10.1109/69.476501.
ing the search space for candidate indexes and also reducing
the number of calls to the query optimizer for obtaining the [9] S. Chaudhuri and V. Narasayya. An Efficient,
costs of running a query with different indexes. Our solu- Cost-Driven Index Selection Tool for Microsoft SQL
tion for the former is to only consider the evaluation of the Server. In VLDB. Very Large Data Bases Endowment
most relevant query attributes (most frequent and longest Inc., August 1997.
queries). We solved the latter with introducing a virtual [10] JMeter. Apache jmeter. http://jmeter.apache.org/
environment containing a sample of the targeted data set. accessed 22-April-2016.
Then, a mathematical definition of index recommendation [11] C. Jung, M. Gasthuber, A. Giesler, M. Hardt,
is presented and some of the possible optimization methods J. Meyer, F. Rigoll, K. Schwarz, R. Stotzka, and
to solve the problem are discussed. At the end, the chal- A. Streit. Optimization of data life cycles. Journal of
lenges of evaluating such system for NoSQL database are Physics: Conference Series, 513(3):032047, 2014.
briefly discussed. The NoWog is introduced as a solution http:
for generating synthetic workloads with different distribu- //stacks.iop.org/1742-6596/513/i=3/a=032047.
tions over time. [12] J. Kratica, I. Ljubic, and D. Tosic. A genetic algorithm
Investigating similarity of new queries to the previous one for the index selection problem. Technical report, In
in order to estimate the cost of recommending and creating Applications of Evolutionary Computing, 2003.
new indexes in contrast to using the already existing ones is [13] MongoDB. Mongodb for giant ideas.
part of future plans to expand this work. https://www.mongodb.com/ accessed 22-April-2016.
[14] S. Papadomanolakis and A. Ailamaki. An integer
7. ACKNOWLEDGMENTS linear programming approach to database design. In
In ICDE Workshop on Self-Managing Databases, 2007.
The author like to thank Large-Scale Data Management
[15] tpc, 2016.
and Analysis project [11] by the German Helmholtz Associ-
http://www.tpc.org/information/benchmarks.asp
ation for funding this research.
accessed 22-April-2016.
[16] G. Valentin, M. Zuliani, D. C. Zilio, G. M. Lohman,
8. REFERENCES and A. Skelley. Db2 advisor: An optimizer smart
[1] R. Agrawal and R. Srikant. Fast algorithms for mining enough to recommend its own indexes. In Proceedings
association rules in large databases. In Proceedings of of the 16th International Conference on Data
the 20th International Conference on Very Large Data Engineering, ICDE ’00, pages 101–, Washington, DC,
Bases, VLDB ’94, pages 487–499, San Francisco, CA, USA, 2000. IEEE Computer Society. http:
USA, 1994. Morgan Kaufmann Publishers Inc. http: //dl.acm.org/citation.cfm?id=846219.847390.
//dl.acm.org/citation.cfm?id=645920.672836. [17] YCSB. Home brianfrankcooper/ycsb wiki github.
[2] S. Agrawal, S. Chaudhuri, and V. R. Narasayya. https://github.com/brianfrankcooper/YCSB/wiki
Automated selection of materialized views and indexes accessed 22-April-2016.
in sql databases. In Proceedings of the 26th [18] D. C. Zilio. Physical database design decision
International Conference on Very Large Data Bases, algorithms and concurrent reorganization for parallel
VLDB ’00, pages 496–505, San Francisco, CA, USA, database systems. Technical report, 1997.
2000. Morgan Kaufmann Publishers Inc.
[3] P. Ameri. On a self-tuning index recommendation
approach for databases. Manuscript is accepted to be
published by the IEEE International Conference on
Data Engineering (ICDE) PhD Symposium, 2016.
[4] P. Ameri and H. Guan. Nowog.
https://github.com/ParinazAmeri/NoWog accessed
22-April-2016.
[5] P. Ameri, J. Meyer, and A. Streit. On a new approach
to the index selection problem using mining
algorithms. In Big Data (Big Data), 2015 IEEE
International Conference on, pages 2801–2810, Oct
2015.
[6] K. Aouiche and J. Darmont. Data mining-based
materialized view and index selection in data
warehouses. CoRR, abs/0707.1548, 2007.
[7] N. Bruno and S. Chaudhuri. Automatic physical
database tuning: A relaxation-based approach. In
Proceedings of the 2005 ACM SIGMOD International
14