=Paper=
{{Paper
|id=Vol-1810/DOLAP_paper_10
|storemode=property
|title=Secondary Indexing Techniques for Key-Value Stores: Two Rings To Rule Them All
|pdfUrl=https://ceur-ws.org/Vol-1810/DOLAP_paper_10.pdf
|volume=Vol-1810
|authors=Joseph Vinish D’silva,Roger Ruiz-Carrillo,Cong Yu,Muhammad Yousuf Ahmad,Bettina Kemme
|dblpUrl=https://dblp.org/rec/conf/edbt/DsilvaRYAK17
}}
==Secondary Indexing Techniques for Key-Value Stores: Two Rings To Rule Them All==
Secondary Indexing Techniques for Key-Value Stores: Two Rings To Rule Them All Joseph Vinish D’silva Roger Ruiz-Carrillo Cong Yu joseph.dsilva@mail.mcgill.ca roger.ruiz@mail.mcgill.ca cong.yu@mail.mcgill.ca Muhammad Yousuf Ahmad Bettina Kemme muhammad.ahmad2@mail.mcgill.ca kemme@cs.mcgill.ca School of Computer Science, McGill University Montréal, Canada ABSTRACT allowed for easy distribution of data, and thus, scalability. Secondary indices are traditionally used in DBMS to in- Also other typical DBMS functionality, such as transaction crease the performance of queries that do not rely on the management, was only implemented in rudimentary format, keys of the table for data reads. Many of the newer NoSQL leading to much less complexity compared to traditional distributed data stores, even if they provide a table-based RDMBS. However, given the prolific success and decades of data model such as HBase, however, do not yet have a sec- domination of the relational model, applications started to ondary indexing feature built in. In this paper, we explore request the modeling and querying functionality they were the challenges associated with indexing modern distributed used to from RDBMS but at the same time wanted to main- table-based data stores and investigate two secondary index tain the flexibility and scalability of key-value stores. approaches which we have integrated within HBase. Our Therefore, several “hybrid” data stores emerged that adapted detailed analysis and experimental results prove the bene- a relaxed notion of the “table-column” abstraction (we will fits of both the approaches. Further, we demonstrate that cover this further in section 2.2) with a much less restrictive such secondary index implementation decisions cannot be table-based data model than traditional RDBMS, suitable made in isolation of the data distribution and that different for the sparse datasets that are commonly found in BigData indexing approaches can cater to different needs. applications. In tandem with this table concept the query in- terface also became more powerful, allowing, e.g., for predi- cate search. However, answering such complex queries either CCS Concepts requires scanning the entire data sets or access to alternate •Information systems → Point lookups; Unidimen- access paths. As a result, many NoSQL stores are trying to sional range search; implement their own notion of a secondary index. A notable deviation in the development of key-value stores Keywords compared to the traditional DBMS is its open source nature, facilitating the database development community at large to secondary indexing; NoSQL data stores; in-memory indices; pitch in their contributions. This has resulted in various de- sign approaches being attempted in providing a secondary 1. INTRODUCTION index functionality for these data stores, including HBase Secondary indexing plays a key-role in addressing the per- [14] (which we briefly cover in related work in section 6). formance necessities of relational database management sys- However, so far, we are not aware of any in-depth study and tems (RDBMS). It is instrumental in facilitating efficient comparative analysis of these different approaches. More- selection of a subset of the dataset based on business con- over, many implementations fall short on modularity. Our straints by providing alternative access paths to the base analysis of index implementations of HBase shows that most records. 
Its performance benefits are primarily derived from of them require significant changes to the core HBase code the reduced I/O facilitated by retrieving only those data instead of leveraging the existing HBase frameworks to de- pages that contain relevant records. velop a pluggable module. The dawn of BigData resulted in a resurgence of inter- In this paper, we present an in-depth discussion of in- est in NoSQL (non-relational) DBMS, in particular, key- dexing for distributed, table-based NoSQL data stores. In value stores. Their simple data model and query interface particular, our paper makes the following contributions. • We discuss two indexing strategies for distributed key- value stores: one based on distributed tables that is able to exploit the table model of the underlying sys- tem for index management, the other using a co-location c 2017, Copyright is with the authors. Published in the Workshop Proceed- approach allowing for efficient main-memory access. ings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd • Both strategies are implemented and integrated into 4.0 HBase in a non-intrusive way. • We provide an enhanced client interface to query HBase Row keys Column families and associated tables using secondary indexing that supports both are sorted column names under them point queries and range queries. • We present a detailed performance metrics on various Personal Office database operations with secondary indices and a com- Row Key Name Res-Phone Office-phone Dept parative analysis of the different approaches. al23 John 321-124-7894 323-551-7452 1 • We present a thorough analysis on the effects of data ke77 Mark 321-435-7821 323-907-7873 1 distribution on different indexing approaches. liu7 Sally 321-513-3212 323-875-7681 2 2. BACKGROUND pat8 Sean 321-542-5521 323-423-7911 3 smi1 Zoe 321-789-9013 323-791-4231 3 2.1 Indexing Techniques Cells are the most fine grain unit of data Predicate queries that only retrieve a subset of the data of a table can be executed through a table scan where each Figure 1: Key-value Data Model [16] record is inspected and only qualifying records are returned. If the table is already sorted (clustered) based on the at- Although NoSQL DBMSes have been around since the tribute on which the search constraint is defined, then a full 1960s [18] employing hierarchical, network, graph and other table scan can be avoided as the matching tuples can be semi-structured data models, the dawn of Big Data in the re- found in logarithmic time. cent decades and the associated processing frameworks have Alternatively to a scan, special index structures can help spawned a new breed of NoSQL DBMS, key-value, and de- identify the records that qualify, and then only those records rived from there, table-based data stores. The resurgence of are retrieved from the base table. Indices are defined over interest in NoSQL DBMSes was primarily attributed to the one or more attributes. Such indices are called secondary rigid structural requirements of relational DBMS, such as indices and are often constructed over non primary-key at- the need to define the layout of the data in advance. Such tributes. Typically, a secondary index contains an entry for restrictions did not fit well with the processing needs of many each existing value of the attribute to be indexed. 
This entry modern Big Data applications, prompting the search for al- can be seen as a key/value pair with the attribute value as ternative data models. Further, scalability requirements ne- key and as value a list of pointers to all records in the base cessitated the distribution of data and computation. Widely table that have this value. In centralized database manage- used in practice are now data stores with a flexible table- ment systems (DBMS) a pointer is typically a physical iden- based data model. Examples are Bigtable [11], Cassandra tifier indicating the position of the record in the file system. [17] and HBase. In the following, we describe the general In distributed systems or more high-level implementations, structure along HBase as the others are quite similar. a pointer is often the primary key of the record, assuming that a lookup via the primary key can be done efficiently. 2.2.1 Data Model A simple index implementation would be a system table Similar to the relational database data model, HBase ag- (inverted list) with the index attribute as primary key and gregates related data into tables. Each table is comprised of the list of record pointers as extra column. This system table multiple rows. Each row contains a unique row key (or pri- can be sorted by primary key making the lookup of a specific mary key), one or more column families and columns under attribute value fast. More common are tree-based indices, them. The table is (logically) sorted by row key. In general, where inner nodes guide to leaf nodes which contain the each specific value in the columns is defined as a cell, and key/value pairs. Both sorted inverted lists and tree struc- can be referenced uniquely by the combination of row key, tures are good for point queries (search on a single attribute column family, and column. value) and range queries (search on a range of attribute val- An important characteristic of the data model is that not ues). A mechanism that performs very well for point queries all columns need to be provided for a row, and new columns is a hash-table with the index attribute as key and the list can be added on the fly. Column names are stored along with of pointers as value. the associated value in the database. Such tables can con- In some cases, a query can be answered directly using the ceptually be sparse as a given column need not be present index, without having to access the base table. Typically, in many of the rows. Also, column values are treated as index structures are effective if few records qualify the search byte arrays and clients are responsible to perform the proper criteria. Through the index, these records are found very data type conversion. This dynamic structure is a distin- quickly and then can be individually retrieved from the base guishing characteristic compared to RDBMS where the table table. When many records qualify, it might be faster to scan structure is rigid, typically defined ahead and maintained as the entire table, as a scan allows reading tuples in batches meta-table information. (e.g., all on a given block) while using an index not only Fig. 1 shows an HBase table containing employee informa- first requires the index access but the individual retrievals of tion, with two column families: Personal with columns Name the qualifying tuples might result in many random accesses and Res-Phone, and column family Office with columns across the blocks of the base table. In fact, if every block Office-phone and Dept. 
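To make the data model concrete, the first row of Fig. 1 could be written and read back through the HBase Java client roughly as in the sketch below. This is illustrative code, not part of the paper: it uses the HTable-era client API of the HBase 0.9x line (newer releases use Table and addColumn instead), and it assumes the user_info table already exists with the column families Personal and Office.

<syntaxhighlight lang="java">
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class DataModelExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Table and column names mirror Fig. 1; the table is assumed to exist
        // with column families "Personal" and "Office".
        HTable table = new HTable(conf, "user_info");

        // One row: row key "al23", cells spread over the two column families.
        Put put = new Put(Bytes.toBytes("al23"));
        put.add(Bytes.toBytes("Personal"), Bytes.toBytes("Name"), Bytes.toBytes("John"));
        put.add(Bytes.toBytes("Personal"), Bytes.toBytes("Res-Phone"), Bytes.toBytes("321-124-7894"));
        put.add(Bytes.toBytes("Office"), Bytes.toBytes("Dept"), Bytes.toBytes("1"));
        table.put(put);

        // Point lookup by row key, projecting only the Personal column family.
        Get get = new Get(Bytes.toBytes("al23"));
        get.addFamily(Bytes.toBytes("Personal"));
        Result result = table.get(get);
        String name = Bytes.toString(result.getValue(Bytes.toBytes("Personal"), Bytes.toBytes("Name")));
        System.out.println("Name: " + name);

        table.close();
    }
}
</syntaxhighlight>

Because column names travel with each cell, a second row could carry a completely different set of columns without any schema change, which is exactly the sparseness the data model is designed for.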
contains a qualifying record, an index will not reduce the I/O costs for bringing blocks into main memory. 2.2.2 Client API While not providing a full SQL interface, HBase and other 2.2 Table-based NoSQL DBMS table-based NoSQL DBS provide reasonably sophisticated call level interfaces. The most common data retrieval is the manages meta-information, is responsible for the allocation lookup method that requires the row key as input. HBase has of regions to region servers and balances the load in the internal index structures that very efficiently find a record HBase cluster. Clients only communicate with the master to based on its row key. Further, the relational equivalent of retrieve the meta information (e.g. the location of regions) projection is accomplished by providing the lookup method but they perform data transfer directly with region servers with the list of column families and columns of interest. so that the master does not become overloaded. Master Apart from lookup by primary key, HBase also provides a failures are handled by a monitoring system external to the scan operator that can take as input filters similar to SQL DBMS cluster. predicates. HBase then returns all records that satisfy the In the decentralized approach, there is no master node to filter. Queries over several tables, however, are not provided. perform explicit coordination, but each node in the cluster A scan operation on a range of row keys is relatively efficient has identical functionality and can take up the role of a coor- as HBase sorts a table by primary key. However, filters dinator. Clients connect to any node in the cluster. Apache over non row key attributes require a scan over the entire Cassandra follows this approach of cluster management. table to find the matching records. There are first attempts of providing secondary indices but none has been officially 2.2.4 Storage Structure integrated. Details are discussed in section 6. Many table-based NoSQL data stores follow a storage Writes and Updates are the same from the perspective approach first proposed in the Log-Structured Merge-Tree of the client API. They require as input the row key, the (LSM-Tree) [22]. In this approach, the table abstraction column families and column names to be updated and their has an in-memory component and a persistent storage com- corresponding values. Deletes require the row key of the ponent. The in-memory component is used to store the most record impacted, and also the column family names/column recent updates of data. Additionally, these updates are also names to be deleted. This allows specific attributes to be recorded in the database logs for durability, similar to tra- deleted instead of the entire record [14], a necessity when ditional DBMS. When the in-memory component becomes the data model allows the table to be sparse. Deletes in gen- very large, this data structure is persisted in an immutable eral do not translate to physical deletes. Instead, a delete format into the disk. Therefore, at any time a table might marker is used to indicate that the data was requested to be made up of one in-memory component and multiple im- be removed. Retrieval operations will encounter the delete mutable disk components. A search for a row key is first marker and skip that record. Storage optimization opera- performed on the memory component followed by the disk tions are often performed in a periodical manner which take components. 
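For comparison with the index-based access paths studied later, the native predicate path on a non-row-key attribute can be sketched as a full Scan with a SingleColumnValueFilter, which forces HBase to visit every region and evaluate the filter against each row. The code below is an illustration under the same assumptions as the previous sketch (user_info table of Fig. 1, HTable-era API), not code from the paper.

<syntaxhighlight lang="java">
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilterScanExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "user_info");

        // Predicate on a non-row-key attribute (Personal:Name = 'Mark').
        // Without a secondary index, HBase must scan the whole table and
        // test the filter against every row.
        Scan scan = new Scan();
        scan.setFilter(new SingleColumnValueFilter(
                Bytes.toBytes("Personal"), Bytes.toBytes("Name"),
                CompareOp.EQUAL, Bytes.toBytes("Mark")));

        ResultScanner scanner = table.getScanner(scan);
        for (Result row : scanner) {
            System.out.println(Bytes.toString(row.getRow()));
        }
        scanner.close();
        table.close();
    }
}
</syntaxhighlight>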
A background process can periodically scan care of physically removing the deleted data. and merge multiple disk components of a table into one larger data structure in the disk to reduce the amount of 2.2.3 Distributed Architecture structures to be searched during a read operation. In HBase terminology, the in-memory component is called In order to handle very large tables, most data store im- memstore, and the persistence storage is in Hadoop file sys- plementations split the rows of a table into multiple shards, tem (HDFS) [9]. HDFS is a fault tolerant distributed filesys- called regions in HBase. Each region can then be served by tem that is designed to run on commodity hardware and is one of the nodes in the DBS cluster, referred to as region optimized for large datasets. It forms an integral part of the server in HBase, and each region server can host several Hadoop ecosystem, providing storage functionality for many regions, also from different tables. In principle, this is simi- distributed applications such as HBase. When the memstore lar to the horizontal partitioning method employed by large starts filling up, the data is flushed to HDFS where it is scale relational DBMS. stored as immutable HFiles. Thus, a store is a collection of The decision of which shard a particular row belongs to memstore and the HFiles of the corresponding column fam- is often determined by a partitioning function over the row ily. The background process merging these HFiles is referred key. The two common partitioning approaches are based to as compaction. on hashing of the row key or on the range of the row keys. The later is an especially popular approach when locality of associated rows are desirable (such as retrieving over a small 3. INDEXING APPROACHES range of row keys). Such requests can often be processed Secondary indexing becomes complicated in a distributed by searching just one shard. HBase follows this approach. DBMS. As discussed in section 2.1, a straightforward mech- When a region becomes very large, the region server can anism stores the inverted list in a separate table and treats initiate a region split, where the rows are divided into two it as a “system table” automatically maintained by the in- equal sized daughter regions each covering a sub-range of the dex maintenance modules. System table and base table are rows. One of the regions will then typically be reassigned to treated as independent units by the DBMS and thus, in a a different region server. distributed setting, can reside on different nodes. A second HBase also performs vertical partitioning, as each column approach co-locates index and base table in such a way that family is stored in a different partition, referred to as store. the index entries for a record are guaranteed to reside on Such vertical partitioning approaches have been known to the same node as the base record itself. In the following provide better performance [8] when reading only a subset sections we will discuss the relevance and the general design of columns, a concept popularized by columnar DBMS [7]. approach to both these forms of secondary indexing along To utilize this approach, column families are chosen so as to with their pros and cons. bundle columns that are frequently accessed together. There are two fundamental approaches to cluster design. 3.1 Table-based Secondary Indexing In the first approach, a dedicated node acts as a master. In the table-based approach, the secondary index can be This is the case with Bigtable and HBase. 
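The LSM-style storage behaviour described above (a mutable in-memory component, immutable flushed components, and a background merge) can be summarised in a deliberately simplified sketch. This is conceptual code only, not HBase's memstore/HFile implementation; the flush threshold and in-memory "runs" are placeholders for what HBase persists to HDFS.

<syntaxhighlight lang="java">
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.TreeMap;

/** Conceptual LSM-style store: one mutable memstore plus immutable flushed runs. */
public class LsmSketch {
    private TreeMap<String, String> memstore = new TreeMap<>();
    // Flushed, immutable components, newest first (stand-ins for HFiles).
    private final Deque<TreeMap<String, String>> diskComponents = new ArrayDeque<>();
    private static final int FLUSH_THRESHOLD = 4; // tiny value, purely for illustration

    public void put(String rowKey, String value) {
        memstore.put(rowKey, value);
        if (memstore.size() >= FLUSH_THRESHOLD) {
            diskComponents.addFirst(memstore);   // "flush": the memstore becomes immutable
            memstore = new TreeMap<>();
        }
    }

    public String get(String rowKey) {
        String v = memstore.get(rowKey);          // most recent updates are checked first
        if (v != null) return v;
        for (TreeMap<String, String> run : diskComponents) {
            v = run.get(rowKey);                  // then the flushed runs, newest to oldest
            if (v != null) return v;
        }
        return null;
    }

    /** "Compaction": merge all immutable runs into one; newer values overwrite older ones. */
    public void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        Iterator<TreeMap<String, String>> it = diskComponents.descendingIterator();
        while (it.hasNext()) {                    // iterate oldest -> newest so newer values win
            merged.putAll(it.next());
        }
        diskComponents.clear();
        if (!merged.isEmpty()) diskComponents.addFirst(merged);
    }
}
</syntaxhighlight>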
The master server viewed as special system table whose row key is the sec- Node 1 Node 2 Node 3 Node 4 User_Info table User_Info table User_Info table User_Info table 1. Retrieve base table row keys node 1 using secondary attribute’s value Client API row key name … row key name … row key name … row key name … Client al23 John … liu7 Sally … pat8 Sean … see1 Sally … node 2 ke77 Mark … mel2 Mark … pet9 Mow … smi1 Zoe … kit9 John … nei3 Nancy … rid1 Justin … zik1 Alex … secondary index secondary index secondary index node 3 2. Retrieve base table row key user id row key user id row key user id rows using the row keys Alex zik1 ke77 liu7 base table node 4 Mark Sally secondary index al23 mel2 see1 John kit9 Mow pet9 Sean pat8 Justin rid1 Nancy nei3 Zoe smi1 Figure 3: Querying Using a Table-based Secondary Index Figure 2: Table-based Secondary Index ondary attribute’s value and an extra column contains the the node handling that index entry a hot spot. list of row keys of the base table records that contain this A big advantage of this approach is that it is easy to im- secondary attribute’s value. By using the DBMS’ table man- plement as it can exploit the table management mechanisms agement module also for index tables, they can be parti- already provided by the DBMS. tioned and distributed across the cluster in the same man- ner as base tables. Fig. 2 shows a concrete example with a 3.2 Co-located Secondary Indexing base table User_Info that is partitioned across 4 nodes, and A co-located index adheres to the shared nothing architec- a secondary index for attribute name, which is distributed ture [24], a design paradigm that is followed by most mod- across three of the nodes. Notice how the secondary index ern distributed DBMS, owing to its overall performance and entry for Mark, which is stored in node 2, points to two base scalability benefits. In this approach, the secondary index table records, one in node 1 and another in node 2. entries are stored on the same node as the corresponding Fig. 3 shows the control flow for a read request that utilizes base table records. Each node is therefore responsible for such a table-based secondary index. Read requests based on maintaining its portion of the secondary index. Fig. 4 shows a secondary attribute’s value are sent directly to the node a co-located secondary index for the same base table as in responsible for the secondary index entry corresponding to fig. 2. Notice how the secondary index has two entries for that value. This node can then either return the keys of the the value Mark, one in node 1, the other in node 2. base table records to the client which can then lookup each The control flow for a read request using a co-located sec- of the base table records by key (as shown in fig. 3) from ondary index is shown in fig. 5. A read request has to be the corresponding nodes, or the node itself can read the multicast to all the nodes in the system that contain at least base table records and send the results back to the client. one partition of the base table as any partition can contain As can be inferred, this is a four-hop process that leads to a matching record, making it necessary for all the parti- four consecutive message exchanges (the retrieval of the base tions to search their co-located portion of secondary index records from different nodes can be done in parallel). entries. 
If a node finds the value in its portion of secondary An important feature is that base table or index partitions index, it will lookup the corresponding records in the local with no relevant data are never involved. The index lookup base table partition and return them to the client. As the is only sent to the index partitions that maintain attribute secondary index search and base table retrieval steps are ex- values that are requested: for a point query (e.g., name = ecuted locally, a read request in this design has two-hops, ‘Mark’) this will be one partition, for a range query (e.g. as the individual searches on partitions can be executed all name LIKE ‘M%’), the index entries might span more than in parallel. This is, in principle, better than the four-hop one partition. After that, only the nodes that have base cost of table-based indexing. However, if a table has many partitions with matching records are contacted. partitions across many nodes, the message and index lookup If the number of matching base table records is fairly costs on all index partitions can be very high. If at the end small compared to the total number of partitions, this is only few records are returned, the benefits of co-location in an efficient approach as it avoids communication messages terms of message rounds might not outweigh the additional to nodes that have no data to return. However, if the base message and processing costs. records to be returned span most of the partitions, there A general advantage of this approach compared to a table- is little gain in terms of message exchange compared to a based index is that the writes to the base table are less ex- table scan. This is particularly true, as we have the addi- pensive, as the secondary index entries are updated locally, tional round of index access, that might itself contact many without the communication overhead between the nodes. partitions in case of large range queries. A potential disadvantage is that it cannot reuse the table Whenever a record is inserted into the base table or an management module of the underlying DBMS but has to be indexed attribute of a record is updated, the corresponding implemented from scratch. Nevertheless, this allows plenty secondary index (indices) must be updated. These updates of opportunity for optimization. involve non-negligible communication overhead, as in a large cluster, most secondary index entries will be located in a 3.3 Choosing The right approach different node than its base table record. This can lead to Table 1 shows a summary of the various costs associated contention in index updates, as with non-unique and skewed with both index types in a cluster of p partitions. 
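As a back-of-envelope illustration of this trade-off (an approximation introduced here for clarity, not a restatement of Table 1), the hop counts given above translate into message counts roughly as follows, where p is the number of base table partitions and m the number of partitions that actually hold matching rows; a point query is assumed to touch a single index partition in the table-based case.

<syntaxhighlight lang="java">
/**
 * Illustrative message-count model for a point query, derived only from the
 * hop counts stated in the text: four hops for table-based indexing, two for
 * co-located indexing. Every contacted partition counts as one request plus
 * one reply. This is a simplification, not the paper's cost model.
 */
public class ReadCostSketch {

    // Table-based: request/reply with the index partition, then request/reply
    // with each base partition that holds matching rows.
    static int tableBasedMessages(int m) {
        return 2 + 2 * m;
    }

    // Co-located: request/reply with every base partition, since any of them
    // may hold matching rows.
    static int coLocatedMessages(int p) {
        return 2 * p;
    }

    public static void main(String[] args) {
        int p = 16;
        for (int m : new int[] {1, 4, 16}) {
            System.out.printf("p=%d m=%d  table-based=%d  co-located=%d%n",
                    p, m, tableBasedMessages(m), coLocatedMessages(p));
        }
    }
}
</syntaxhighlight>

With few matching partitions (m much smaller than p) the table-based index exchanges far fewer messages despite its extra hops; once matches spread over most partitions, the two-hop co-located design catches up, which is the behaviour the experiments in section 5 confirm.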
data distributions, the probability of multiple concurrent From the table and our previous discussion, we can derive updates on the same secondary index entry will rise, making that the table-based approach will be beneficial when there Node 1 Node 2 Node 3 Node 4 Query using secondary node 1 User_Info table User_Info table User_Info table User_Info table attribute’s value Client API row key name … row key name … row key name … row key name … Client al23 John … liu7 Sally … pat8 Sean … see1 Sally … node 2 ke77 Mark … mel2 Mark … pet9 Mow … smi1 Zoe … kit9 John … nei3 Nancy … rid1 Justin … zik1 Alex … Lookup base table node 3 secondary index secondary index secondary index secondary index base table row key user id row key user id row key user id row key user id secondary index node 4 al23 Sally liu7 Justin rid1 Alex zik1 John Retrieve base table rows kit9 Mark mel2 Mow pet9 Sally see1 Mark ke77 Nancy nei3 Sean pat8 Zoe smi1 Figure 5: Querying Using a co-located Secondary Index Figure 4: Co-located Secondary Index Nodes to be contacted user_info_personal_name_idx Type Hops unique base row base rows > p i table-based 4 2 (1 index / 1 base) 1+ + p row key p co-located 2 p p Mark { ke77, mel2 } Table 1: Index type overheads for read requests Sally { liu7, see1 } Figure 6: Structure of table-based index are many partitions and queries only return few records as this will lead to few messages and low processing costs as index table has one column family i with one column p. The only a few relevant partitions are targeted. In contrast, co- value stored under p is a set of row keys of the base table location will be beneficial when there are generally few par- records, in the serialized form of a java TreeSet [10]. The titions or queries return records from most partitions. TreeSet is based on the Red-Black tree data structure, and We can also consider some advanced use cases for sec- costs only θ(log n) for inserts, updates and deletes, where ondary indices, such as the ability to use two indices simul- as a list-based structure would facilitate an O(1) insert but taneously to satisfy conjunctive queries. This can be per- costs θ(n) to perform updates and deletes [12]. formed easily in the case of co-located indices as the index Fig. 6 shows two example rows for an index created on the rows for both the indices will be in the same node (as the name column of the personal column family of the user_info node responsible for the base table partition). However, in table. The row keys of the content table are the userids of the case of the table-based approach, the DBMS will have the user information stored in the table. There are two rows to bring the various index rows of a base table record from in the index table. The first row’s key ‘Mark’ is a value of the different nodes together to perform this operation. name column from the user_info table, and the set of two Teradata, a very successful commercial parallel RDBMS, userids associated with it are the row keys of the records in follows the co-located secondary indexing approach for at- the user_info table with name ‘Mark’. tributes with non-unique values, while using a strategy sim- It must be noted that, in this approach, a write in the ilar to the table-based approach for attributes that have base table can also result in a write in the index HTable, unique values [25]. thereby increasing the I/O of the overall write operation. 4.1.1 Querying Using the HTable Index 4. 
SECONDARY INDEXING FOR HBASE In our implementation, it is the client who decides whether We integrated both indexing approaches into the HBase a query should use an existing index or whether HBase data store. We used HBase tables (HTable) for the table- should perform a standard table scan. For that purpose based index, and we were able to reuse most of the func- we extended the HTable interface of the HBase client API. tionality already provided by HTable. For the collocation We have created a new query method that has the same approach, we implemented our own solution as significant input parameters as the default query (table name and fil- changes to the underlying HTable distribution would have ter predicates that select specific rows), plus an additional been necessary to exploit it for collocation, and we did not parameter that indicates the column name that is indexed. want to change the HBase source code. Query execution is then controlled through the client. For Among the HBase recommendations to tackle indexing instance, if the filter predicate is a point query (e.g., name needs [5], only the co-processor framework [20] option can = ‘Mark’), the HBase client library first sends a lookup re- keep track of updates to base table near real-time. The quest to the HTable containing the secondary index, to find co-processor framework allows code to be injected into the the secondary entry with the requested attribute value as system without changing the base code, akin to triggers, row key. This query returns the serialized TreeSet of the stored procedures and aspect-oriented programming. It is base table row keys associated with this secondary index therefore suited for developing a modular and pluggable in- value. The set of returned row keys is then used by the dexing solution. We also extend the HBase client interface client library to perform a batched lookup for all matching to allow for the creation of indices and index-based queries. rows on the base table. If the base table has many regions, one batched lookup is sent to each region that contains at 4.1 Table-based Secondary Index least one matching row. The client library then collects all A table-based index for a secondary attribute is imple- results and returns them to the client application. mented via an HTable. Each distinct attribute value is rep- Support for range queries is implemented in a similar fash- resented as one row with the attribute value as row key. The ion. In this case the filter contains the start and end range of the secondary attribute to be constrained. The client library HashTable Index List of secondary values B-link tree first translates this to a range query on the secondary index with row keys in the search range. This query might be sent Sally to one or several regions of the HTable index. Each region will return the qualifying TreeSets. Once the row keys of all matching records are determined, the procedure is the same Mark as for a point query where the batched lookups are sent to all relevant partitions of the base table, and the results are assembled before return to the user. Mark Sally Note that if a query contains predicates for indexed at- tributes and predicates for attributes which have no indices, additional processing is required. There are many ways to { ke77, mel2 } { kit9, liu7 } achieve this. We first determine the row keys of rows that match the attributes that are indexed. 
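The client-side point-query flow of section 4.1.1 (look up the index HTable, deserialize the TreeSet of base table row keys, then issue a batched lookup against the base table) could look roughly like the sketch below. The index table name follows Fig. 6 and the value layout follows the description above (a serialized java TreeSet under family i, column p, assuming standard Java serialization); the helper itself is hypothetical and not the authors' client library.

<syntaxhighlight lang="java">
import java.io.ByteArrayInputStream;
import java.io.ObjectInputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class TableBasedIndexQuery {

    /** Point query: fetch base table rows whose indexed attribute equals 'value'. */
    @SuppressWarnings("unchecked")
    public static Result[] pointQuery(Configuration conf, String value) throws Exception {
        // 1. Look up the secondary index HTable; the attribute value is the row key.
        HTable index = new HTable(conf, "user_info_personal_name_idx");
        Result entry = index.get(new Get(Bytes.toBytes(value)));
        index.close();
        if (entry.isEmpty()) {
            return new Result[0];
        }

        // 2. The cell under family "i", column "p" holds a serialized
        //    java.util.TreeSet of base table row keys (as in Fig. 6).
        byte[] serialized = entry.getValue(Bytes.toBytes("i"), Bytes.toBytes("p"));
        TreeSet<String> rowKeys;
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(serialized))) {
            rowKeys = (TreeSet<String>) in.readObject();
        }

        // 3. Batched lookup of all matching rows on the base table; the HBase
        //    client groups the Gets by region internally.
        List<Get> gets = new ArrayList<>();
        for (String rowKey : rowKeys) {
            gets.add(new Get(Bytes.toBytes(rowKey)));
        }
        HTable base = new HTable(conf, "user_info");
        Result[] rows = base.get(gets);
        base.close();
        return rows;
    }
}
</syntaxhighlight>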
We then push the TreeSets for each secondary index value remaining filters down to the HBase server by the means of modified batch lookups. We are in the process of optimizing Figure 7: Abstract Layout of the Co-located In-Memory this rudimentary approach. Secondary Index 4.1.2 Region Management persistent only during a region shutdown. The split or merge of base table regions or index table re- Second, we wanted to support both point and range queries, gions has no impact on index management, as the approach and make updates very fast. While hash table based data only works with row keys, and HBase automatically redirects structures are capable of doing point lookups in θ(1), they lookups to the appropriate regions. The split of a table re- cannot be used for range queries. B+ -trees, on the other gion is independent of the split for an index region as for hand, are capable of performing range queries efficiently as HBase, these are two different HTables. Thus, region man- they store the secondary index values in a logically sorted agement is completely transparent and orthogonal to index order. But point queries have a cost of θ(n log n). Therefore, management. we create a hybrid data structure that is a combination of a In general, HBase decides automatically when to split re- hash table and a variation of B+ -tree to provide the best of gions and on which nodes to put them. Thus, in princi- both worlds. As before, the keys to these data structures are ple, the index designer does not need to be concerned at the secondary index values themselves. We store the set of all with region management. However, HBase allows the base table row keys associated with a secondary index value user to specify multiple regions at HTable creation time and using a TreeSet, exactly as we did in the table-based design. this might be useful when creating indices in order to evenly Each base table region has its own secondary index parti- distributed the load across region servers, and to prevent tion that references only the base table rows which it serves. a small number of region servers becoming a hot spot for Our tree-index structure is based on Blink -tree, a variation of index lookups. B+ -tree proposed in [19] and stated to have the highest con- currency and overall optimal performance according to [15, 4.2 Co-located In-Memory Secondary Index 23]. For the sake of brevity, we encourage the avid reader to refer to their original work for more details. In the co-location approach, each base table region (parti- Fig. 7 shows an example data of the in-memory index tion) has a collocated index that has index entries covering data structure. The set of row keys associated with each exactly the rows in the base table region. secondary index value is stored in a TreeSet. The references Although it is possible to create an HTable with exactly to the TreeSets are stored in the hash table as well as the this index content, HBase does not provide any straightfor- Blink -tree. Point queries can be easily satisfied by hash ta- ward means to enforce that this HTable-based index parti- ble lookups. The hash table also provides faster access to tion would be collocated with the corresponding base table TreeSets for updates and deletes. However, Blink -tree can region. This is because, by default, HBase considers these facilitate range queries based on the secondary index value. as independent HTables and decides individually on their lo- cation. 
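The hybrid structure of Fig. 7 can be pictured with the sketch below: a hash table for constant-time point lookups and an ordered map for range scans, both referencing the same TreeSets of base table row keys. A ConcurrentSkipListMap stands in for the Blink-tree of the actual implementation, since both provide sorted, concurrent range access; the class is an illustrative reimplementation, not the authors' code.

<syntaxhighlight lang="java">
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentSkipListMap;

/** Per-region, in-memory secondary index: hash view + ordered view over shared TreeSets. */
public class CoLocatedIndexSketch {
    // Point-lookup view: secondary value -> row keys of the local base table region.
    private final Map<String, TreeSet<String>> hashView = new HashMap<>();
    // Ordered view for range queries (stand-in for the Blink-tree of the paper).
    private final NavigableMap<String, TreeSet<String>> orderedView = new ConcurrentSkipListMap<>();

    public synchronized void add(String secondaryValue, String baseRowKey) {
        TreeSet<String> rowKeys = hashView.get(secondaryValue);
        if (rowKeys == null) {
            rowKeys = new TreeSet<>();
            hashView.put(secondaryValue, rowKeys);
            orderedView.put(secondaryValue, rowKeys);  // both views share the same TreeSet
        }
        rowKeys.add(baseRowKey);
    }

    public synchronized void remove(String secondaryValue, String baseRowKey) {
        TreeSet<String> rowKeys = hashView.get(secondaryValue);
        if (rowKeys == null) return;
        rowKeys.remove(baseRowKey);
        if (rowKeys.isEmpty()) {
            hashView.remove(secondaryValue);
            orderedView.remove(secondaryValue);
        }
    }

    /** Point query, e.g. name = 'Mark': served by the hash view. */
    public Set<String> pointLookup(String secondaryValue) {
        TreeSet<String> rowKeys = hashView.get(secondaryValue);
        if (rowKeys == null) {
            return Collections.emptySet();
        }
        return rowKeys;
    }

    /** Range query, e.g. name in ['M', 'N'): served by the ordered view. */
    public Collection<TreeSet<String>> rangeLookup(String fromInclusive, String toExclusive) {
        return orderedView.subMap(fromInclusive, true, toExclusive, false).values();
    }
}
</syntaxhighlight>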
We would need to modify the HBase code to imple- 4.2.1 Querying Using the Co-located Index ment a custom load balancer, which we did not consider an The client API for using the co-located index is similar attractive solution. Instead, we implemented our propriety, to that of the table-based index. The internal execution optimized index structure and embedded it transparently flow, however, is more similar to the one for native HBase into the HBase system using the co-processor framework. table scans than for table-based indexing. When the client Our index structure has been motivated by various as- API receives a query that wants to use a co-located index, it pects. First, we decided to focus on main-memory indexing. sends a special request containing filter conditions contained That is, the entire index partition is kept in main memory in the query to all the regions of the base table. These special and only persisted during a region shutdown. As a result, requests are sent in parallel so that the execution across all index maintenance is very fast and can exploit optimized regions can execute concurrently1 . main-memory data structures. Our motivation is that we At the server side, the special request is directed to a are looking at applications that use HBase’s capability to co-processor method that performs a lookup over the hash- scale to many nodes in order to keep the working data set in main memory and avoid expensive I/O. In such settings 1 Note that the native HBase implementation sends the it should be feasible to keep index structures in memory, query to one region after the other not allowing for simulta- too. Thus, our co-located index is memory resident and is neous scans across all regions table portion or a range query over B+ -tree portion of the Average time per non-batched put memory resident secondary index structure associated with 7,000 μs the base table region to retrieve the base table row keys. 6,000 number of threads = 4 These row keys are used to fetch the base table rows from the 5,000 corresponding region locally. Queries that contain complex 4,000 filters that span attributes with index and attributes without 3,000 index, can be easily handled with co-located indices. The indices are used to determine the row keys of rows that fulfill 2,000 the search criteria on the indexed attributes. Then, the base 1,000 table is accessed to retrieve those rows and return only those 0 number of records that fulfill the remaining criteria on the other attributes. 200,000 500,000 1,000,000 2,000,000 3,000,000 4,000,000 tbl-index : zipf tbl-index : unif coloc-index : zipf coloc-index : unif no-index : zipf no-index : unif 4.2.2 Persistence, Recovery and Region Splits A major challenge in main memory based database sys- Figure 8: Average time per non-batch put tems is the issue of backup and recovery. There is the dilemma of what happens to the data in memory when a used both Uniform distribution as well as Zipfian2 distri- region is shutdown or if it suffers a crash. Main memory bution to generate different test datasets. The uniform dis- database systems depend on some fast logging mechanisms tribution (over the size of the number of records in the table) to permanent storage in order to overcome such issues [13]. ensures that the values are highly distinct. 
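How such a per-region index can be kept in step with base table writes through the coprocessor framework can be sketched with a RegionObserver that hooks the region-local write path, reusing the CoLocatedIndexSketch class from the previous sketch. The observer signature follows the HBase 0.9x coprocessor API; the indexed column family and column, and the wiring in general, are assumptions for illustration and not the paper's implementation.

<syntaxhighlight lang="java">
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

/** Keeps a per-region, in-memory secondary index in step with base table writes. */
public class IndexMaintenanceObserver extends BaseRegionObserver {

    private static final byte[] FAMILY = Bytes.toBytes("personal"); // indexed column family (assumed)
    private static final byte[] QUALIFIER = Bytes.toBytes("name");  // indexed column (assumed)

    // One observer (and hence one index instance) is attached to each region;
    // CoLocatedIndexSketch is the structure sketched in section 4.2.
    private final CoLocatedIndexSketch index = new CoLocatedIndexSketch();

    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                        Put put, WALEdit edit, Durability durability) throws IOException {
        // Only cells that touch the indexed column trigger index maintenance.
        List<Cell> cells = put.get(FAMILY, QUALIFIER);
        for (Cell cell : cells) {
            String secondaryValue = Bytes.toString(CellUtil.cloneValue(cell));
            String baseRowKey = Bytes.toString(put.getRow());
            // Local update only: no cross-node communication is needed, which is
            // the main write-path advantage of co-location discussed in section 3.2.
            index.add(secondaryValue, baseRowKey);
        }
    }
}
</syntaxhighlight>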
Zipfian on the This, however, could introduce non-negligible delays into other hand produces values of a variety of selectivity as the processing that can threaten the performance benefits of frequency of occurrence of the values vary and hence rep- a memory based database. For our memory resident co- resents skewed data sets. We used the scrambled Zipfian located index, we rely on the fact that a completely lost in- distribution module in YCSB3 for our experiments. dex can be recovered by rebuilding it from scratch by scan- To reduce any interference from HBase compactions and ning the base table region. This, being a local operation, java old generation garbage collection [21] on our test met- can be accomplished with reasonable performance overhead. rics, we disabled HBase automatic compactions and increased Given that crashes are not frequent and only always affect the jvm heap. Additionally, test metrics were computed by one node, we believe that these recovery costs are reasonable averaging over five runs. and worth the improved performance during runtime. However, as the index has to be co-located with each base 5.1 Non-Batch Writes table region, we need to handle region splits to ensure that To measure the impact on write to base table, we used the memory resident indices are also split accordingly. This a client that performed non-batched Put using four threads is accomplished by having the coprocessor listen to region into an empty table, incrementally adding up to four mil- splits. Upon the initiation of a region split it will create two lion row keys and associated secondary index values. We new indices from the original in-memory index data struc- logged the average time taken for every hundred thousand ture and persist them into HDFS as separate files. The re- records as the table grew in size to understand the impact gion server will then shutdown the now split region (which of a growing table size. As can be seen in fig. 8, the memory will not exist anymore). The new two daughter regions are resident co-located index does better than the table-based then brought up (by possibly different region servers) and implementation for both uniform and Zipfian distributions. the coprocessor instances attached to them will load the cor- The performance overhead of co-located index is between 9% responding index files from HDFS into main memory. to 15% as the table size increases. However, for table-based In similar spirit, we persist the indices to HDFS storage implementation the performance penalty is almost 125% for during regular region shutdown so that they can be restored uniform distribution. This is to be expected as every Put without having to scan the base table region. We do so, be- is now in effect replaced by two Puts. For Zipfian distri- cause HBase’s load-balancing might determine that regions bution, the performance penalty starts at 300% for 200,000 have to be moved. For such moves, it will be faster to per- records and becomes 600% by a million records indicating sist and transfer the index than re-creating it from scratch severe performance issues. This is attributed due to the at the new region server. skewed representation of data in Zipfian distribution result- ing in more contention as well as larger secondary index entries which takes even more time to process, compound- 5. EXPERIMENTS ing the issue. 
However, this is not an issue for the co-located index as each region processes and updates its secondary in- For experimental setup, we used a cluster of nodes each dex entries locally, causing less contention as the index en- with Intel R Pentium R Dual-Core CPU G2020 @ 2.9GHz , tries for the same secondary attribute value are distributed 8GB 1333MHz RAM, 500GB SATA-2 HDD, running Ubuntu across multiple regions. Further, in-memory processing is 12.04.4 LTS 64-bits. The nodes are connected together using also faster over HTable, reducing any chance of contention a 1 Gbps switch. Four of the nodes are configured as HBase and giving advantage to co-located index. Region Servers, also running HDFS Data Nodes. One node To measure the scalability of processing, we vary the num- is setup for running HMaster and HDFS Name Node. An- ber of client threads. As can be seen in fig. 9, all except the other node was used for running the client processes. We used HBase-0.96 and Hadoop-1.2 for the test setup. 2 https://en.wikipedia.org/wiki/Zipf’s law In order to understand the impact of the distribution of 3 https://research.yahoo.com/news/yahoo-cloud-serving- secondary indexed attribute’s values on performance, we benchmark/ Average time per non-batched put with different Average time per read number of threads 10,000 number of threads = 4 ms 5,000 1,000 4,500 μs #records inserted = 4 million 4,000 100 3,500 10 3,000 2,500 1 2,000 1,500 0 1,000 0 500 0 number of threads 0 number of records 2 4 8 200,000 500,000 1,000,000 2,000,000 3,000,000 4,000,000 tbl-index : zipf tbl-index : unif coloc-index : zipf tbl-index : zipf tbl-index : unif coloc-index : zipf coloc-index : unif no-index : zipf no-index : unif coloc-index : unif no-index : zipf no-index : unif Figure 9: Average time per non-batch put for different num- Figure 12: Average time per read using secondary index ber of threads Average time per read chart Average time per record using batch puts 1000 number of threads = 4 ms 100 μs number of threads = 1 1000 10 100 1 0.1 10 0.01 0.001 number of records 1 200,000 500,000 1,000,000 2,000,000 3,000,000 4,000,000 number of records 500,000 1,000,000 2,000,000 3,000,000 4,000,000 tbl-index : zipf tbl-index : unif tbl-index : zipf tbl-index : unif coloc-index : zipf coloc-index : zipf coloc-index : unif coloc-index : unif no-index : zipf no-index : unif Figure 13: Average time per read using secondary index Figure 10: Average time per record for batch put Average time per read Average time per record using batch puts 6 number of threads = 4 ms 20 uniform distribution μs number of threads = 1 5 18 table size = 1 million 16 4 14 12 3 10 8 2 6 4 1 2 0 number of records 0 records per secondary index value 500,000 1,000,000 2,000,000 3,000,000 4,000,000 1 2 3 4 5 10 coloc-index : zipf coloc-index : unif tbl-index coloc-index no-index : zipf no-index : unif Figure 14: Average time per read using secondary indexed Figure 11: Average time per record for batch put, co-located attributes at varying selectivity index vs no index by the update to the base table and index itself. While table- table-based Zipfian scales well, giving the optimal perfor- based Zipfian continues to show deteriorated performance, mance at four threads, where each region server is occupied what is intriguing is that the performance penalty of table- simultaneously. We observe that the table-based Zipfian fail based index with uniform distribution is also very severe. to scale owing to the contention as discussed before. 
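For reference, the difference between the non-batched writes measured here and the batched loading examined in section 5.2 comes down to how the client submits its Puts. The sketch below illustrates both paths with the HTable-era client API; the table name, column names and record counts are placeholders rather than the actual workload generator used in the experiments.

<syntaxhighlight lang="java">
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WritePathsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "user_info");

        // Non-batched: every put() is a separate round trip to the region server
        // (the style of workload measured in section 5.1).
        for (int i = 0; i < 1000; i++) {
            Put put = new Put(Bytes.toBytes("row-" + i));
            put.add(Bytes.toBytes("Personal"), Bytes.toBytes("Name"), Bytes.toBytes("user-" + i));
            table.put(put);
        }

        // Batched: buffer Puts on the client and ship them together
        // (the style of loading measured in section 5.2).
        table.setAutoFlush(false);          // let the client buffer writes
        List<Put> batch = new ArrayList<>();
        for (int i = 1000; i < 2000; i++) {
            Put put = new Put(Bytes.toBytes("row-" + i));
            put.add(Bytes.toBytes("Personal"), Bytes.toBytes("Name"), Bytes.toBytes("user-" + i));
            batch.add(put);
        }
        table.put(batch);                   // grouped by region server under the hood
        table.flushCommits();               // push any remaining buffered writes
        table.close();
    }
}
</syntaxhighlight>

Note that, as discussed above, server-side coprocessors are still invoked once per Put, so batching helps the base table write but not the per-record index maintenance of the table-based approach.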
The reason for this is that, though the client uses batch mode for performing Puts, the coprocessor architecture of HBase 5.2 Batch Writes results in the coprocessors being invoked once per each Put, Further, we performed the same experiment, this time dampening the benefits of batched approach. The coproces- with batched Put, which is the most efficient way of data sor is therefore forced to update the index HTables using loading in HBase. We used a single thread and batches of individual Puts. Fig. 11 shows the same results without 100,000 records. From the results in fig. 10, we can see that including table-based implementation for better scalability. the co-located index does far better than table-based index- ing. Co-located indexing still has about 95% penalty. This 5.3 Point Reads Using Secondary Indices can be expected because, with batched Puts, inefficient net- Point reads are performed using the indices created on work traffic is reduced, and the only performance bounds is secondary attributes and compared against the naı̈ve HBase Average time per read Response time for range queries in uniform distribution 12 number of threads = 1 seconds 7 records per secondary index value = 1 table size = 1 million ms uniform distribution 10 6 table size = 1 million 5 8 4 6 3 4 2 1 2 0 number of read threads 0 range 1 2 4 8 0 50,000 100,000 150,000 200,000 tbl-index coloc-index tbl-index coloc-index no-index Figure 15: Average time per read using secondary indexed Figure 16: Response time for range queries in uniform dis- attributes with varying number of threads tribution alternative, the scan operation. The scan operation results Response time for range queries in Zipfian distribution in all of the region to be read. We used tables with different 12 number of threads = 1 seconds number of records to better measure the variance in perfor- table size = 1 million 10 mance as a function of table size. A client with 4 threads is used to issue the read requests. The values for secondary in- 8 dexed attributes are drawn from the same distribution used 6 to load them into the tables to perform random reads. The results are shown in fig. 12. For better scale, just 4 the comparison of the two indexing approaches is shown 2 in fig. 13. We can see that all the index implementations provide better performance. For uniform distribution, the 0 range memory resident, co-located index performance increases 0 50,000 100,000 150,000 200,000 from 77 times to 1280 times as the table size grows from tbl-index coloc-index no-index 200,000 to 4 million. The performance benefits for co-located index for Zipfian stays around 21%. This is because with Figure 17: Response time for range queries in Zipfian distri- Zipfian, there are more records for popular values, and this bution results in more processing and data transfer. However, even Therefore on doing detailed analysis on the performance of in such cases the benefit of having the index is very evident. read queries, one can conclude that table-based implemen- Table-based Zipfian shows the least improvement, owing tations are better choice for indexed attributes with near to the fact that a large number of records will have to be read unique distribution, while co-located implementations pro- from all the region servers. Hence retrieving the row keys vides better performance when the secondary index distri- first and then querying the base records using them turns bution tends to be more non-unique. out to be costly across the network hops. 
However, table- based index seems to do better with Uniform distribution. 5.4 Range Queries using Secondary Indices It gives 100 to 1400 times better performance, outdoing even To test range queries, we chose a single region setup in the the memory resident co-located index implementation. interest of fairness, as HBase scans one region at a time in To analyze this further, we next ran the test case for uni- contrast to our parallel approach. Fig. 16 shows the perfor- form distribution, varying the number of records per sec- mance for uniform distribution and fig. 17 shows the same for ondary index value. The results shown in fig. 14 shows that Zipfian distribution. In both distributions, the memory res- as the number of records increase, the table-based index ident co-located index seems to outperform the table-based looses its advantage over the co-located one. This is because implementation. This is in concurrence to our observation with increase in number of records to be retrieved from the in section 5.3 where the co-located index fairs better when base table, the table-based implementation will likely has to there are many records to be returned. interact with all the region servers, a scenario in which the Further, as the range increases, the performance advan- co-located index will do better due to the reduced number of tages of secondary indices decrease and beyond some range, hops in the overall read request due to its broadcast nature. it becomes an overhead to use the index. This threshold We also analyzed the impact of concurrency by varying seems to appear at about the range of 80,000 (8% of records the number of read threads while keeping the number of in the table.). Beyond this range, scanning the base table base table records per secondary index value at 1. From the region directly is beneficial over using the secondary index. results in fig. 15, we can see that co-located index can do better at lower concurrency, but as the number of threads increases, table-based index starts doing better. This is be- 6. RELATED WORK cause, at small workloads, the overhead of broadcast in co- As stated in section 3.3, Teradata follows a co-located in- located implementation does not weigh-in into performance, dexing approach for non-unique attributes and a re-distribution while it also benefits from the reduced request rounds. But of index entries for unique attributes. However, in Teradata, at higher concurrency, this overhead starts to come into play. row distribution (for non co-located index entries as well as base table records) is accomplished via automatic hashing of the attribute values [25]. While this leads to ease of imple- [2] IHBase. https://github.com/ykulbak/ihbase. mentation and maintenance of table and index structures, [3] ITHBase. https://github.com/hbase-trx/ the downside is there is no locality of associated attribute hbase-transactional-tableindexed. values due to hashing, and operations such as range searches [4] Lily. http://www.lilyproject.org/. will have to be performed in all the nodes. [5] Secondary Indexes and Alternate Query Paths. http:// ITHBase [3] is an open source implementation developed hbase.apache.org/0.94/book/secondary.indexes.html. by modifying base HBase source code to include transac- [6] Solr. http://lucene.apache.org/solr/. tion and secondary index support. Their implementation is [7] D. J. Abadi, P. A. Boncz, and S. Harizopoulos. similar to our table-based indexing solution. Column-Oriented Database Systems. 
VLDB, pages IHBase [2] is an in-memory secondary indexing solution 1664–1665, 2009. for HBase, quite similar to our own in-memory co-located in- [8] D. J. Abadi, S. R. Madden, and N. Hachem. dexing solution. By modifying core HBase, only the portion Column-Stores vs. Row-Stores: How Different Are of the data in disk is indexed and the contents of the mem- They Really? In ACM SIGMOD, pages 967–980, 2008. store is searched completely. The index is built from scratch [9] D. Borthakur. HDFS Architecture Guide. http: by scanning the region every time it is brought online. //hadoop.apache.org/docs/r1.2.1/hdfs design.html, Culvert [1] is a secondary index implementation similar 2008. to our table-based approach. However, it seems like they do [10] G. Bracha. Generics in the Java programming their index table updates via modified client side code which language. Sun Microsystems, pages 1–23, 2004. needs to be used to perform data loading. [11] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Lily [4] is a cloud-scale repository for social content ap- Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. plications that uses HBase for data storage and Solr [6] for Gruber. Bigtable: A Distributed Storage System for indexing needs. Their indexing engine therefore resides out- Structured Data. ACM Trans. Comput. Syst., pages side the HBase ecosystem. 4:1–4:26, June 2008. For the sake of brevity, we have covered only the variants of HBase indexing implementations that are popular in liter- [12] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. ature. In our implementation, we focused on leveraging the Introduction To Algorithms. MIT Press, 2001. flexibility of the coprocessor framework over the approaches [13] H. Garcia-Molina and K. Salem. Main Memory that required modification to core HBase classes. This has Database Systems: An Overview. IEEE Trans. on the advantage that existing applications need not be mi- Knowledge and Data Eng., pages 509–516, Dec 1992. grated to perform index maintenance. Further, our analysis [14] L. George. HBase: The Definitive Guide. O’Reilly demonstrates that either of the indexing approaches could Media, 2011. be the optimal solution depending on the data distribution. [15] T. Johnson and D. Sasha. The Performance of Current B-tree Algorithms. ACM Tran. on Database 7. CONCLUSION & FUTURE WORK Systems, pages 51–101, 1993. [16] A. Khurana. Introduction to HBase Schema Design. In this paper we described the challenges associated with Networked Syst, 37:29–36, 2012. indexing in distributed DBMS and provided two implemen- [17] A. Lakshman and P. Malik. Cassandra: A tations for HBase that works with less intrusion into the core Decentralized Structured Storage System. ACM HBase source code. Our indexing solutions can work with SIGOPS Operating Systems Review, 44(2):35–40, 2010. point queries as well as range queries. Further, we provided a very detailed analysis of how different data distributions [18] N. Leavitt. Will NoSQL Databases Live Up to Their warrant different indexing approaches and demonstrated a Promise? Computer, 43(2):12–14, 2010. case for both implementations. Our results show that there [19] P. L. Lehman et al. Efficient Locking for Concurrent is clearly a benefit to having secondary indices in HBase, Operations on B-trees. ACM Trans. on Database and that they can be often built with reasonable perfor- Systems, pages 650–670, 1981. mance overhead. Although there has been some prior works [20] A. P. Mingjie Lai, Eugene Koontz. 
Coprocessor to achieve secondary indexing in HBase, our work have been Introduction. https://blogs.apache.org/hbase/entry/ more detailed and insightful about the various alternatives coprocessor introduction, 2012. and clearly shows that there is no one-stop solution to sec- [21] Oracle Corporation. Java Garbage Collection Basics. ondary indexing needs in HBase. http://www.oracle.com/webfolder/technetwork/ For our future work, we are exploring on how to use End- tutorials/obe/java/gc01/index.html. Point coprocessors to re-build secondary indices for batch [22] P. O’Neil, E. Cheng, D. Gawlick, and E. O’Neil. The processing-only type of workloads where real-time consis- Log-Structured Merge-Tree (LSM-Tree). Acta tency of index structures are not necessary. This approach Informatica, 33(4):351–385, 1996. is also common in large relational databases where the in- [23] V. Srinivasan and M. J. Carey. Performance of B+ dices are dropped during bulk loads and rebuilt after the Tree Concurrency Control Algorithms. VLDB, pages data loads are completed. A key challenge would be how to 361–406, 1993. tackle the client reads while indices are being rebuilt in a [24] M. Stonebraker. The Case for Shared Nothing. IEEE less intrusive manner. Database Eng. Bull., 9(1):4–9, 1986. [25] Teradata Corporation. Introduction to Teradata. 8. REFERENCES www.teradata.com/workarea/downloadasset.aspx?id= [1] Culvert. 17947, 2010. https://github.com/booz-allen-hamilton/culvert.