=Paper= {{Paper |id=None |storemode=property |title=Challenges in Finding an Appropriate Multi-Dimensional Index Structure with Respect to Specific Use Cases |pdfUrl=https://ceur-ws.org/Vol-850/paper_grebhahn.pdf |volume=Vol-850 |dblpUrl=https://dblp.org/rec/conf/gvd/GrebhahnBSSKS12 }} ==Challenges in Finding an Appropriate Multi-Dimensional Index Structure with Respect to Specific Use Cases== https://ceur-ws.org/Vol-850/paper_grebhahn.pdf
     Challenges in Finding an Appropriate Multi-Dimensional
       Index Structure with Respect to Specific Use Cases

              Alexander Grebhahn                          David Broneske                      Martin Schäler
             Institute of Technical and              Institute of Technical and          Institute of Technical and
           Business Information Systems            Business Information Systems        Business Information Systems
             University of Magdeburg                 University of Magdeburg             University of Magdeburg
             grebhahn@st.ovgu.de                      dbronesk@st.ovgu.de                  schaeler@ovgu.de
                Reimar Schröter                           Veit Köppen                        Gunter Saake
             Institute of Technical and             Center for Digital Engineering       Institute of Technical and
           Business Information Systems               University of Magdeburg          Business Information Systems
             University of Magdeburg                   vkoeppen@ovgu.de                  University of Magdeburg
              rschroet@st.ovgu.de                                                            saake@ovgu.de

ABSTRACT                                                              It is possible to extract feature vectors from an item to man-
In recent years, index structures for managing multi-dimen-           age the data in a compressed and meaningful way. For man-
sional data became increasingly important. Due to hetero-             aging these feature vectors, multi-dimensional index struc-
geneous systems and specific use cases, it is a complex chal-         tures can be used. In general, the question arises, which
lenge to find an appropriate index structure for specific prob-       index structure supports managing data best. Throughout
lems, such as finding similar fingerprints or micro traces in a       this paper, index structure performance describes suitability
database. One aspect that should be considered in general is          with respect to a specific use case. However, we analyze core
the dimensionality and the related curse of dimensionality.           aspects that have to be considered, if trying to answer this
   However, dimensionality of data is just one component              for a specific use case. In order to achieve a reconstructible
that have to be considered. To address the challenges of              and valid comparison, we present the idea of a framework
finding the appropriate index, we motivate the necessity of           that allows the comparison of different index structures in a
a framework to evaluate indexes for specific use cases. Fur-          homogeneous test environment.
thermore, we discuss core components of a framework that                 This paper is organized as follows: In Section 2, we give
supports users in finding the most appropriate index struc-           a short overview of basic components that have to be con-
ture for their use case.                                              sidered for evaluating the performance of index structures.
                                                                      Within Section 3, we give an overview of additional chal-
                                                                      lenges, which have to be handled by using index structures
Keywords                                                              in a specific use case. Finally, in Section 4, we present core
index structures, evaluation, multi-dimensional data                  components that a framework needs for quantitatively eval-
                                                                      uation of multi-dimensional index structures with respect to
                                                                      different use cases.
1.     INTRODUCTION
   In the last years, data storage and management in com-             2.    BASIC CHALLENGES
puter-aided systems became more advanced, because of an
increasing amount of unstructured data being stored. For                 Querying multi-dimensional data in an efficient way is
example, in multimedia databases images or videos are stored          a complex challenge. Within the last decades, new index
and analyzed to find similar data items. A special use case           structures are proposed and existing once are improved to
is the Digi-Dak Database Project1 , where multi-dimensional           solve this challenge. Regarding a specific use case, it is not
feature vectors of fingerprints and micro traces are stored in        suitable to consider an index structure in isolation. Addi-
a database. To manage these data items, methods are re-               tionally data properties, used query types, and underlying
quired to handle unstructured data in an appropriate way.             distance metrics have to be taken into account. In this sec-
                                                                      tion, we give a short overview of these four basic challenges.
1
    https://omen.cs.uni-magdeburg.de/digi-dak
                                                                      2.1   Data Properties
                                                                         Characteristics of data cause main challenges of querying
                                                                      data within a database system. For instance, data dimen-
                                                                      sionality has to be considered, because existing index struc-
                                                                      tures are generally effected by the curse of dimensionality
                                                                      [7, 23]. As a result, index structures, that are suitable for
                                                                      a small number of dimensions are not necessarily suitable
                                                                      for a larger amount of dimensions. An additional important
24th GI-Workshop on Foundations of Databases (Grundlagen von Daten-
banken), 29.05.2012 - 01.06.2012, Lübbenau, Germany.                  property is the data distribution, because some index struc-
Copyright is held by the author/owner(s).                             tures are more practicable for clustered data than others.
Furthermore, value domain and the type of the data has to                         R1              R4
be considered.                                                                                                                    R7


2.2    Query Types                                                                R5         R6

                                                                                                                   R2
   Based on the work of Böhm et al. [8], query types can be                                                                 R8

categorized into two groups: -similarity queries and Nearest-                         R3
Neighbor-similarity (NN-similarity) queries. The former de-                                                  R10
                                                                                                                   A    R9   B
scribes a query, resulting in a set of data points being situ-                         R11
                                                                                                                                  C
ated in a defined -distance to the query point, whereas the
latter results in a data point being the nearest item to the                                           R12
query point. Describing these two groups, the -similarity
and NN-similarity has to be defined.
                                                                        Figure 1: R-Tree with overlapping MBRs.
Definition: -similarity Query.
   Two data points p1 and p2 are -similar if and only if
d(p1 , p2 ) ≤ . The function d defines a similarity measure
                                                                  2.4     Index Structures
for two points. In literature, similarity measures are of-           Since we aim at providing a comprehensive set of indexes,
ten replaced by distance metrics, which we review in Sec-         we want to consider different types of index structures. Thus,
tion 2.3. For finding all points in the data base being -        we use the classification of Weber et al. [23] to address a
similar, an -similarity query is executed. A special case of     broad variety of different approaches. Thus, index struc-
the -similarity is represented for  = 0, because this implies   tures are classified by partitioning of the data space. Index
two identical points and an exact match is executed [8].          structures that partition the whole space are called space
                                                                  partitioning methods, whereas data partitioning methods
Definition: NN-similarity Query.                                  partition the necessary space according to the location of
  The data point p1 is NN-similar to p2 with respect to a         data points [8, 23]. Consequently, there are regions that
data base of points DB if and only if ∀p ∈ DB, p 6= p1 :          are not taken into account by performing a query on data
d(p2 , p1 ) ≤ d(p2 , p). For NN-similarity queries, all points    partitioning methods.
in a database are retrieved that are NN-similar to the query         Alternatively, Andoni and Indyk [2] classify index struc-
point. An extension to the NN-similarity query is presented,      tures by query results. There are exact index structures that
when instead of a nearest neighbor, k nearest neighbors have      guarantee to retrieve the exact result of a query. Although,
to be retrieved. In this paper, we call the resulting query       this behavior is usually preferred, there are approximation-
k-NN query.                                                       based index structures, guaranteeing to retrieve points that
                                                                  are similar to the correct result of a query. For instance for k-
   Apart from the mentioned similarity range query, window        NN queries, approximation-based index structures provide k
queries are common queries and often called range queries in      near neighbors to the query point instead of all exact nearest
literature [24]. These window queries are defined by intervals    neighbors. Hereby, the quality of the retrieved results, called
for every queried dimension.                                      precision, can differ significantly, because approximate in-
2.3    Distance Metrics                                           dex structures aim at improving the query performance by
                                                                  decreasing the precision. Nevertheless, an approximation-
   To execute similarity queries, we require a function com-
                                                                  based index should hold a threshold, because resulting data
puting the similarity of two data items. To this end, similar-
                                                                  would not be useful. In the following sections, we present
ity for equal points is 1 whereas the maximum dissimilarity
                                                                  some representatives of index structures. First, exact index
is expressed by 0. Equivalent information is delivered from
                                                                  structures, such as R-Tree [11], Pyramid Technique [5], and
distance metrics, whereupon two data items are more simi-
                                                                  VA-File [22] are introduced. Subsequently, p-stable Local-
lar, the smaller their distance is.
                                                                  ity Sensitive Hashing [12] as an approximation-based index
   The most common distance metrics are Minkowsky class
                                                                  structure is presented.
metrics, also called Lp distance metrics. The distance of two
data items x and y is computed by:                                2.4.1     Exact Index Structures
                             d
                            X                  1/p                Giving an overview of existing index structures, we intro-
              Lp (x, y) =         (xi − yi )p          .          duce promising exact index structures in this section. Fur-
                            i=1                                   thermore, the difference between space partitioning and data
By choosing different values for p, different representatives     partitioning methods is stated by presenting at least one in-
of this class are produced. For p = 2, the Euclidean distance     dex structure for each category.
metric is generated, which dominates common database sys-
tems according to Bugatti et al. [9].                             R-Tree.
   Beneath these distance metrics, there are many other met-        One of the most important multi-dimensional index struc-
rics, such as Canberra [9] or Dynamical-Partial [16] dis-         tures is the R-Tree [11], introduced by Guttmann in 1984.
tance function. In contrast to Minkowsky distance func-           Since this time, many new index structures are proposed
tions, Dynamical-Partial distance metric dm uses only the         based on the ideas used in the R-Tree. For instance R+ -
m smallest distances for the computation of the distance of       Tree [21], R∗ -Tree [4], X-Tree [6], A-Tree [19], and SR-
data items [16]. As a result, in some specific use cases, it      Tree [13]. Beside these structures, there are many more
can be a great benefit using the Dynamical-Partial distance       index structures which are not mentioned here. For fur-
metric, because the influence of particular dimensions can        ther informations, see Samet [20], giving a comprehensive
deteriorate the distance of data items.                           overview of existing index structures.
   (0; 1)                          (0; 1)                                                                                                     Approximation File

                                                                                                                                          A   00 11      N    11 01
                                                                                   A                    D
                                                                                       B                                                  B   01 11      O    11 01
                                                                  11                                                 J
                                            (0,5; 0,5)                                     C                                              C   01 11      P    11 00
                                                                                                                                      K
             (0,5; 0,5)                                                                                                                   D   10 11      Q    10 01
                                                                                                E                        L
                                                                               H                                                          E   01 10      R    11 00
                                                                                       F                                     M
                                                                  10
                                                                                           G                                              F   01 10      S    10 00
                                                                           I
                                                                                                        W                                 G   01 10      T    01 00

                                                                                                                                     N    H   00 10      U    10 00
                                                                  01               X                                                      I   00 10      V    10 01
                                                                                           Y                V        Q           O
                                                                                                                                          J   10 11      W    10 10
                                                                               Z                            U                         P   K   11 11      X    00 01
                                                                  00                                                                      L   11 10      Y    01 01
   (0; 0)                 (1; 0)   (0; 0)                (1; 0)                                     T            S
                                                                                                                         R
                                                                                                                                          M   11 10      Z    00 00
               (a)                            (b)                         00               01               10                   11



Figure 2: Space partition of a 2-d space by Berchtold                     Figure 3: Partitioning of the VA-File.
et al. [5] (a) and Lee and Kim [15] (b).

   However, the basic idea of these index structures is to ad-    degeneration of most index structures to a sequential scan, if
ministrate points hierarchically in a tree. The R-Tree par-       the dimensionality of data points exceeds a certain limit [23].
titions the data space using minimum bounding rectangles          Hence, the authors propose to accelerate the sequential scan
(MBR). A minimum bounding rectangle can be described              by using vector approximation.
by two points, being the end of the diagonal of the rectan-          The VA-File divides each dimension of the space into 2b
gle. Stepwise, the space is partitioned by MBRs, so that the      equally filled cells, where b is an user defined amount of bits
superordinate MBR encloses all of its subordinate MBRs, as        per dimension. Each cell is labeled with a unique bit string,
we visualize in Figure 1.                                         being the concatenation of the corresponding bit strings for
   With increasing dimensionality, R-Trees face the challenge     every dimension. For every point, the bit string of the cell is
of overlapping MBRs. A query rectangle, situated in a re-         stored, which the point is inserted into. Thus, the VA-File
gion, where two or more MBRs overlap (like the MBR R2             uses two lists: an approximation file that stores the bit string
and R3 in Figure 1), forces the R-Tree to follow up two or        of the cells for every point and a vector file with the vector
more different routes in the tree. Thus, the query perfor-        data for each point. An exemplary space partitioning and
mance decreases [6]. To overcome this disadvantage other          the corresponding approximation file can be seen in Figure 3.
index structures that we mentioned before, are developed.            Generally, the query algorithm of the VA-File traverses
                                                                  the whole approximation file to collect suitable candidates
                                                                  for the query result at first. After that, exact comparisons
Pyramid Technique.                                                between the vector data of the candidates and the query are
   An example for an exact space partitioning index struc-
                                                                  performed.
ture is the Pyramid Technique, which was introduced by
                                                                     The approximation technique of the VA-File helps to re-
Berchtold et al. [5]. The Pyramid Technique divides an n-
                                                                  duce hard-disk accesses, because small bit strings can be
dimensional space into 2d pyramids [5]. A d dimensional
                                                                  kept in main memory. Even if the whole approximation
normalized point x is inserted into a pyramid according to
                                                                  file does not fit into the main memory, the sequential ex-
the dimension jmax with its maximum distance to the center
                                                                  amination of the approximations reduces disk access costs
of the data space. Thus, the pyramid number pi is computed
                                                                  compared to random accesses to many data items [22]. An-
as follows:
                                                                 other advantage is, in contrast to the Pyramid Technique,
                  jmax           if xjmax < 0, 5                  the availability of different algorithms to efficiently support
            i=
                  (jmax + d)     if xjmax ≥ 0, 5                  all query types being executable on a sequential scan with-
  Second, for managing the space enclosed by a pyramid,           out adaption of the space partitioning of the VA-File.
the pyramids are divided in pyramid slices. According to the      2.4.2    Approximation-based Index Structures
query types supported by the index structure, the partition
of pyramids can be done in different ways. In Figure 2,             Typical representatives for an approximation-based index
we present two different possible methods for partitioning a      structure are based on hash schemes. Apart from common
pyramid, for a two dimensional normalized space.                  hashing algorithms, scattering inserted data points over the
  In particular, the partition of Figure 2 (a) is proposed        amount of buckets is not applicable for similarity queries.
by Berchtold et al. [5] to support range queries. The other       Consequently, there is a need for hash functions, causing
partition, shown in Figure 2 (b), is used by the approach of      collisions when hashing locally near situated points. This
Lee and Kim [15] to support k-NN queries. It is possible to       challenge is handled by Locality Sensitive Hashing (LSH).
use the partitioning from Berchtold et al. for k-NN queries as    The aim of LSH is to map the key to a one dimensional
well, but not in an efficient way. Anyway, a point is inserted    hash value. Thus, all comparisons are made on the hash
into the slice depending on its distance to the center of the     value instead of a high dimensional key. Supporting nearest
space. To sum up, for supporting different query types in         neighbor queries, LSH uses (P 1, P 2, r, cr)-sensitive functions
an efficient way, different pyramid partitions are required.      h to compute the hash value. These functions h have to fulfill
                                                                  the following constraints [12]:
VA-File.                                                               For every dataset in a d-dimensional space p, q ∈ Rd :
  In 1997, Weber and Blott [22] introduce the VA-File to
overcome the curse of dimensionality. The VA-File is an                   1. if ||p − q|| ≤ r, then P r[h(p) = h(q)] > P 1
improved sequential scan, because Weber et al. noticed a                  2. if ||p − q|| ≥ cr, then P r[h(p) = h(q)] < P 2
The first constraint demands that the probability for two            the approximation-based index structure p-stable LSH pre-
points to be hashed into the same bucket has to be larger            sented in Section 2.4 are number of hash functions and width
than P 1 if their distance is smaller than r. Whereas, if their      of hash buckets.
distance is bigger than cr, the probability should be smaller           Thus, we have to assume, that these parameters have an
than P 2. In order to be an useful locality sensitive function,      impact on the performance of index structures. Therefore,
P 1 should be much bigger than P 2.                                  it is necessary to analyze suitable parameter values when
   Improving the precision of the index structure, usually           trying to identify an appropriate index structure for a given
several hash tables with different hash functions are used.          use case. However, there are some problems considering an
Consequently, the need for (P 1, P 2, r, cr)-sensitive functions     appropriate value of some parameters. For example, the
is obvious. A promising family of hash functions is used in          vectors used for p-stable LSH are randomly chosen from p-
p-stable LSH.                                                        stable distributions. As a result of this random component
                                                                     it is possible that the performance and precision results of
p-stable LSH.                                                        the same index structure created with different seeds of the
  The approach of p-stable LSH is based on p-stable distri-          random component can differ very much, within the same
butions. A distribution D is p-stable for p ≥ 0 if for any n the     use case. This is problematic when trying to quantitatively
real numbers v1 , ..., vn and i.i.d. random variables X1 , ..., Xn   evaluate the index structure.
with distribution D, the following constraint is fulfilled:
                                                                     3.2   Workload and used Queries
               n
               X             n
                            X           1/p                          Although, two different applications can deal with the
                 (vi Xi ) ∼    (kvi kp )      X,                     same data, they can have a different workload. For that
               i=1             i=1                                   reason, they can differ in requirements of index structures.
                                                                     The workload of an application depends on the used query
∼ means the operands have the same distribution and X is
                                                                     types. Yet, it is obvious, not to use the Pyramid Technique
a random variable from the distribution D [18].
                                                                     presented from Berchtold et al. [5] for performing a k-NN
  Using d random variables from D to form a d-dimensional
                                                                     query, but the version presented by Lee and Kim [15], be-
Vector ~a, the scalar of vector ~a and the data point ~v result in
                                        P d          1/p           cause it is optimized for this query type.
a random variable with distribution          (kvi kp )     X [12].     For defining the workload of a database system we use
                                          i=1                        a definition inspired by Ahmad et al. [1]. As a result, the
Several of these scalar products with different vectors can
                                                                     workload is defined by the percentage of the query types
be used to estimate k~v kp (the Lp distance metric). The
                                                                     used and the amount of concurrent requests performed.
corresponding distributions are:
    • The Cauchy Distribution DC (0, 1), defined by the den-         3.3   DBMS Environment
                                 1
       sity function c(x) = (π(1+x 2 )) is 1-stable and can be
                                                                        In addition to use cases and workload, the test environ-
       used to estimate the Manhattan distance metric.               ment has an impact on the performance of each index struc-
    • The Gaussian (Normal) Distribution DG (0, 1), defined          ture. As already mentioned, the VA-File is optimized for
                                                 2
       by the density function g(x) = √12π e−x /2 is 2-stable        database systems, storing data items on a disk and not in
       and can be used to estimate the Euclidean distance            main memory. Evaluations of VA-File and sequential scan,
       metric.                                                       result in different conclusions according to an evaluation
   Instead of estimating a distance metric, the scalar prod-         with an in-memory database or a database storing items
uct with vectors from p-stable distributions can be used to          on the disk. Consequently, it is necessary to consider the
compute hash values of the data points, because the scalar           underlying storage management of the database system as
product maps the vectors to a one dimensional space. Fur-            well.
thermore, the result of the scalar product has the same dis-            Beside the storage management of a database, the amount
tribution as the Lp distance metric, which guarantees the            of main memory and the CPU performance are other impact
(P 1, P 2, r, cr)-sensitiveness.                                     factors to the performance.

3.    ADVANCED CHALLENGES                                            4.    TOWARDS A FRAMEWORK
  After giving a short introduction to general challenges of           Since we aim at providing a comprehensive library of use
indexing multi-dimensional data, in this section we provide          cases and suitable indexes, we motivate a framework to give
existing challenges of evaluating the performance of index           users the possibility to evaluate own use cases with different
structures for a specific use case. For giving an overview           index structures. In this paper, we summarize key aspects
of possible challenges when evaluating index structures, we          of a framework that supports four groups. In Figure 4, we
group the challenges into three groups.                              give an overview of these four groups.

3.1    Parameter of the Index Structures                             4.1   Extensibility
  Some index structures have specific parameters for tuning             First, the framework has to be extensible w.r.t. four key
their performance. Thus, when evaluating the performance             aspect, we present in Section 2. In other words, for an user,
of index structures, these parameter have to be considered           it has to be possible to implement, integrate, and evaluate
as well. For index structures given in Section 2.4, these pa-        own index structures. Furthermore, it has to be possible
rameter are: the minimum and the maximum number of                   to extend the framework and existing index structures with
points within a MBR for the R-Tree, the number of slices a           additional distance metrics and also other query types. Fi-
pyramid is divided in, for the Pyramid Technique, and the            nally, it has to be possible to integrate existing data in the
length of the bit vector for the VA-File. The parameters of          framework and to create data with specific properties like a
                                      USER                                  or pit-falls of the chosen index structure. For instance, the
                                                                            user is able to follow the split of MBRs in the R-Tree and
                                                                            can easily identify overlapping regions while the tree is be-
                                                                            ing constructed. Another aspect, being worth to visualize,
                                         test environment
           workload   visualization                         extensibility   is a statistic on query performance. These statistics help to
                                             simulator
                                                                            analyze the performance of different index structures for a
                                                                            given workload or an index structure under different work-
                               FRAMEWORK                                    loads. Apart form the query performance, other interesting
                                                                            values may be worth visualizing. The time spent on con-
                                                                            structing the index structure is important for systems with
         Figure 4: Structure of the Framework.                              many delete and update queries, because a reconstruction of
                                                                            the index is sometimes necessary when a certain threshold
                                                                            of changed data is reached. Furthermore, when using an ap-
specific data distribution. Thus, creating data distributions               proximate index structure, the precision of executed queries
is not trivial, an interface has to be created for importing                and the overall precision of the index structure is worth vi-
existing data sets and communicating with systems like R,                   sualizing, because it has an impact on the suitability of an
see for instance [14].                                                      index structure for a special use case.
4.2      Adaptability to Different Workload                                 4.5      Working with the Framework
   In real world applications, the workload differs quite much.               Finally, our framework shall help finding the most suit-
On the one hand, there are use cases that use only read                     able index structures for a given use case. For this, the
transactions. On the other hand, the workload can consist                   expected workload has to be known. These parameters in-
of read and write transactions. Thus, the performance re-                   clude supported query types, exact or approximate results,
sults of workloads can differ very much. Hence, an interface                data dimensionality and distribution, amount of data, the
is needed for importing workloads from existing systems. A                  delay of the data access, and the environment. By finding
further requirement, is to support standardized benchmarks,                 suitable index structures for the given parameters, there are
e.g. the TPC-H Benchmark2 .                                                 index structures that do not have to be taken into account,
   Beside the queries used, the desired precision of the query              because they do not support certain query types or work
results have to be defined by the user. Thus, if approximate                approximately although the result is restricted to be exact.
results are allowed, the user has to define the accuracy of                 After excluding unsuitable index structures, the remaining
the results. Nevertheless, the precision depends on the data                index structures are evaluated under the given workload. By
properties, the given distance metric, the used queries and                 reviewing the performance results, the user can choose the
the parameters of the index structure as well.                              suitable index structure for her use case.
4.3      Test Environment Simulator
   Existing index structures are created with respect to dif-               5.      RELATED WORK
ferent optimization criteria. As already mentioned, the VA-                    In the last decades, many new index structures are cre-
File is optimized for reducing disk accesses. Consequently,                 ated [5, 10, 22]. In addition, existing index structures are
another criteria our framework has to consider is the envi-                 improved for supporting new query types [15] or to increase
ronment the tests are located in. Thus, within the frame-                   performance [6]. However, within the presented evaluation
work a parameter has to exist, for setting whether the test                 of these index structures only a small set of existing in-
is for an in-memory database or if a disk access is needed for              dex structures is considered. For example, within Berchtold
accessing the data. Due to an assumption, that the access                   et al. [5], the Pyramid Technique is evaluated against X-
time of data differs very much considering Hard-Disk-Drives                 Tree, Hilbert R-Tree, and sequential scan. Therefore, it is
and Solid-State-Disks, the framework should have a compo-                   problematic to identify, which is the most appropriate in-
nent for virtualizing the disk access. With this component                  dex structure for a given problem. Additionally, different
it is possible to perform tests on one system while simulat-                performance evaluations are done in different environments
ing an access delay of another system. In addition to this                  with different data characteristics. So, it is problematic to
storage device simulator, a simulator for all hardware com-                 generalize the results of an evaluation.
ponents is required to give an useful hint about the best                      For giving a comparison of the performance of multi-di-
performing index structure.                                                 mensional index structures, there already exists some frame-
   Additionally, within the framework a parameter has to                    works, like the GiST 3 framework or the MESSIF [3] frame-
exist for defining the values of some index structure param-                work. In contrast to the framework we present here, these
eters such as the maximum number of data items of leaves                    frameworks have some additional constraints. For example,
of the R-Tree.                                                              the GiST framework only focuses on trees, hence no other
                                                                            multi-dimensional index structures such as the VA-File or
                                                                            the Pyramid Technique are considered, while the MESSIF
4.4      Visualization                                                      framework only focuses on metric data. Another framework
  In case the index structure has to struggle with a specific               limiting the available index structures is introduced by Muja
data distribution or query type, it can be useful to visualize              et al. [17]. The aim of this framework is to optimize param-
the space partitioning of the index structure. With this vi-                eters of approximate index structures in order to match the
sualization, further hypothesis can be drawn on the benefits                required precision under given data distributions.
2                                                                           3
    http://www.tpc.org/tpch/                                                    http://gist.cs.berkeley.edu/
6.   CONCLUSION                                                        IEEE Trans. on Pattern Analysis and Machine
   In this paper, we provide an overview of existing chal-             Intelligence (TPAMI), 30(9):1647–1658, 2008.
lenges in finding an appropriate index for multi-dimensional      [11] A. Guttman. R-Trees: A dynamic index structure for
data for a specific use case. First, we explain distance met-          spatial searching. In SIGMOD’84, Proc. of Annual
rics and common query types that have to be considered.                Meeting, pages 47–57, 1984.
Second, the parameters of the index structures can have an        [12] P. Indyk and R. Motwani. Approximate nearest
impact on the performance of an index structure. Third, for            neighbors: Towards removing the curse of
users, it has to be possible to define own workload pattern            dimensionality. In Proc. Symp. on Theory of Compu.
and the environment, the application is located in.                    (STOC). ACM, 1998.
   For supporting these characteristics of real-world use cases   [13] N. Katayama and S. Satoh. The SR-tree: An Index
we present requirements of a framework we intend to de-                Structure for High-Dimensional Nearest Neighbor
velope. Our framework has to support four key aspects.                 Queries. In Proc. Int’l. Conf. on Mgmt. of Data
Namely, it has to be extensible, support different workload            (SIGMOD), pages 369–380. ACM, 1997.
patterns, virtualize different use case environments, and con-    [14] V. Köppen. Improving the Quality of Indicator
tain a visualization component for improving user experi-              Systems by MoSi – Methodology and Evaluation. PhD
ences.                                                                 thesis, Freie Universität Berlin, 2008.
                                                                  [15] D.-H. Lee and H.-J. Kim. An efficient technique for
7.   ACKNOWLEDGMENTS                                                   nearest-neighbor query processing on the SPY-TEC.
  The work in this paper has been funded in part by the                Trans. on Knowl. and Data Eng. (TKDE),
German Federal Ministry of Education and Science (BMBF)                15:1472–1486, 2003.
through the Research Programme under Contract No.                 [16] B. Li, E. Chang, and Y. Wu. Discovery of a perceptual
FKZ:13N10817 and FKZ:13N10818. Additionally we want                    distance function for measuring image similarity.
to thank Sandro Schulze for giving us useful comments.                 Multimedia Systems, 8(6):512–522, Apr. 2003.
                                                                  [17] M. Muja and D. G. Lowe. Fast approximate nearest
8.   REFERENCES                                                        neighbors with automatic algorithm configuration. In
 [1] M. Ahmad, A. Aboulnaga, and S. Babu. Query                        Proc. Int’l. Conf. on Computer Vision Theory and
     interactions in database workloads. In Proc. Int’l.               Applications (VISAPP), pages 331–340, 2009.
     Workshop on Testing Database Systems, DBTest,                [18] J. P. Nolan. Stable distributions: Models for heavy
     pages 11:1–11:6. ACM, 2009.                                       tailed data. Springer-Verlag, 2009.
 [2] A. Andoni and P. Indyk. Near-optimal hashing                 [19] Y. Sakurai, M. Yoshikawa, S. Uemura, and H. Kojima.
     algorithms for approximate nearest neighbor in high               The A-tree: An index structure for high-dimensional
     dimensions. Commun. ACM, 51(1):117–122, 2008.                     spaces using relative approximation. In Proc. Int’l.
 [3] M. Batko, D. Novak, and P. Zezula. Messif: Metric                 Conf. on Very Large Data Bases (VLDB), pages
     similarity search implementation framework. In Proc.              516–526. Morgan Kaufmann Publishers Inc., 2000.
     Conf. on Digital Libraries (DELOS), 2007.                    [20] H. Samet. Foundations of Multidimensional and
 [4] N. Beckmann, H.-P. Kriegel, R. Schneider, and                     Metric Data Structures. Morgan Kaufmann Publishers
     B. Seeger. The R*-Tree: An efficient and robust access            Inc., 2005.
     method for points and rectangles. In Proc. Int’l. Conf.      [21] T. K. Sellis, N. Roussopoulos, and C. Faloutsos. The
     on Mgmt. of Data (SIGMOD), pages 322–331. ACM,                    R+-Tree: A dynamic index for multi-dimensional
     1990.                                                             objects. In Proc. Int’l. Conf. on Very Large Data
 [5] S. Berchtold, C. Böhm, and H.-P. Kriegel. The                    Bases (VLDB), pages 507–518. Morgan Kaufmann
     Pyramid-Technique: Towards breaking the curse of                  Publishers Inc., 1987.
     dimensionality. SIGMOD Rec., 27:142–153, 1998.               [22] R. Weber and S. Blott. An approximation-based data
 [6] S. Berchtold, D. A. Keim, and H.-P. Kriegel. The                  structure for similarity search. Technical Report
     X-Tree: An index structure for high-dimensional data.             ESPRIT project, no. 9141, 1997.
     In Proc. Int’l. Conf. on Very Large Data Bases               [23] R. Weber, H.-J. Schek, and S. Blott. A quantitative
     (VLDB), pages 28–39. Morgan Kaufmann Publishers                   analysis and performance study for similarity-search
     Inc., 1996.                                                       methods in high-dimensional spaces. In Proc. Int’l.
 [7] C. Böhm. Efficiently Indexing High-Dimensional Data              Conf. on Very Large Data Bases (VLDB), pages
     Spaces. PhD thesis, Ludwig-Maximilians-Universität               194–205. Morgan Kaufmann Publishers Inc., 1998.
     München, 1998.                                              [24] R. Zhang, B. C. Ooi, and K.-L. Tan. Making the
 [8] C. Böhm, S. Berchtold, and D. A. Keim. Searching in              pyramid technique robust to query types and
     high-dimensional spaces: Index structures for                     workloads. In Proc. Int’l. Conf. on Data Engineering
     improving the performance of multimedia databases.                (ICDE), pages 313–324. IEEE Computer Society,
     ACM Comput. Surv., 33:322–373, 2001.                              2004.
 [9] P. H. Bugatti, A. J. M. Traina, and C. Traina, Jr.
     Assessing the best integration between
     distance-function and image-feature to answer
     similarity queries. In Proc. ACM Symp. on Applied
     Computing (SAC), pages 1225–1230. ACM, 2008.
[10] E. Chavez Gonzalez, K. Figueroa, and G. Navarro.
     Effective proximity retrieval by ordering permutations.