=Paper=
{{Paper
|id=Vol-1810/DOLAP_paper_10
|storemode=property
|title=Secondary Indexing Techniques for Key-Value Stores: Two Rings To Rule Them All
|pdfUrl=https://ceur-ws.org/Vol-1810/DOLAP_paper_10.pdf
|volume=Vol-1810
|authors=Joseph Vinish D’silva,Roger Ruiz-Carrillo,Cong Yu,Muhammad Yousuf Ahmad,Bettina Kemme
|dblpUrl=https://dblp.org/rec/conf/edbt/DsilvaRYAK17
}}
==Secondary Indexing Techniques for Key-Value Stores: Two Rings To Rule Them All==
Secondary Indexing Techniques for Key-Value Stores: Two Rings To Rule Them All Joseph Vinish D’silva Roger Ruiz-Carrillo Cong Yu joseph.dsilva@mail.mcgill.ca roger.ruiz@mail.mcgill.ca cong.yu@mail.mcgill.ca Muhammad Yousuf Ahmad Bettina Kemme muhammad.ahmad2@mail.mcgill.ca kemme@cs.mcgill.ca School of Computer Science, McGill University Montréal, Canada ABSTRACT allowed for easy distribution of data, and thus, scalability. Secondary indices are traditionally used in DBMS to in- Also other typical DBMS functionality, such as transaction crease the performance of queries that do not rely on the management, was only implemented in rudimentary format, keys of the table for data reads. Many of the newer NoSQL leading to much less complexity compared to traditional distributed data stores, even if they provide a table-based RDMBS. However, given the prolific success and decades of data model such as HBase, however, do not yet have a sec- domination of the relational model, applications started to ondary indexing feature built in. In this paper, we explore request the modeling and querying functionality they were the challenges associated with indexing modern distributed used to from RDBMS but at the same time wanted to main- table-based data stores and investigate two secondary index tain the flexibility and scalability of key-value stores. approaches which we have integrated within HBase. Our Therefore, several “hybrid” data stores emerged that adapted detailed analysis and experimental results prove the bene- a relaxed notion of the “table-column” abstraction (we will fits of both the approaches. Further, we demonstrate that cover this further in section 2.2) with a much less restrictive such secondary index implementation decisions cannot be table-based data model than traditional RDBMS, suitable made in isolation of the data distribution and that different for the sparse datasets that are commonly found in BigData indexing approaches can cater to different needs. applications. In tandem with this table concept the query in- terface also became more powerful, allowing, e.g., for predi- cate search. However, answering such complex queries either CCS Concepts requires scanning the entire data sets or access to alternate •Information systems → Point lookups; Unidimen- access paths. As a result, many NoSQL stores are trying to sional range search; implement their own notion of a secondary index. A notable deviation in the development of key-value stores Keywords compared to the traditional DBMS is its open source nature, facilitating the database development community at large to secondary indexing; NoSQL data stores; in-memory indices; pitch in their contributions. This has resulted in various de- sign approaches being attempted in providing a secondary 1. INTRODUCTION index functionality for these data stores, including HBase Secondary indexing plays a key-role in addressing the per- [14] (which we briefly cover in related work in section 6). formance necessities of relational database management sys- However, so far, we are not aware of any in-depth study and tems (RDBMS). It is instrumental in facilitating efficient comparative analysis of these different approaches. More- selection of a subset of the dataset based on business con- over, many implementations fall short on modularity. Our straints by providing alternative access paths to the base analysis of index implementations of HBase shows that most records. 
Its performance benefits are primarily derived from of them require significant changes to the core HBase code the reduced I/O facilitated by retrieving only those data instead of leveraging the existing HBase frameworks to de- pages that contain relevant records. velop a pluggable module. The dawn of BigData resulted in a resurgence of inter- In this paper, we present an in-depth discussion of in- est in NoSQL (non-relational) DBMS, in particular, key- dexing for distributed, table-based NoSQL data stores. In value stores. Their simple data model and query interface particular, our paper makes the following contributions. • We discuss two indexing strategies for distributed key- value stores: one based on distributed tables that is able to exploit the table model of the underlying sys- tem for index management, the other using a co-location c 2017, Copyright is with the authors. Published in the Workshop Proceed- approach allowing for efficient main-memory access. ings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd • Both strategies are implemented and integrated into 4.0 HBase in a non-intrusive way. • We provide an enhanced client interface to query HBase Row keys Column families and associated tables using secondary indexing that supports both are sorted column names under them point queries and range queries. • We present a detailed performance metrics on various Personal Office database operations with secondary indices and a com- Row Key Name Res-Phone Office-phone Dept parative analysis of the different approaches. al23 John 321-124-7894 323-551-7452 1 • We present a thorough analysis on the effects of data ke77 Mark 321-435-7821 323-907-7873 1 distribution on different indexing approaches. liu7 Sally 321-513-3212 323-875-7681 2 2. BACKGROUND pat8 Sean 321-542-5521 323-423-7911 3 smi1 Zoe 321-789-9013 323-791-4231 3 2.1 Indexing Techniques Cells are the most fine grain unit of data Predicate queries that only retrieve a subset of the data of a table can be executed through a table scan where each Figure 1: Key-value Data Model [16] record is inspected and only qualifying records are returned. If the table is already sorted (clustered) based on the at- Although NoSQL DBMSes have been around since the tribute on which the search constraint is defined, then a full 1960s [18] employing hierarchical, network, graph and other table scan can be avoided as the matching tuples can be semi-structured data models, the dawn of Big Data in the re- found in logarithmic time. cent decades and the associated processing frameworks have Alternatively to a scan, special index structures can help spawned a new breed of NoSQL DBMS, key-value, and de- identify the records that qualify, and then only those records rived from there, table-based data stores. The resurgence of are retrieved from the base table. Indices are defined over interest in NoSQL DBMSes was primarily attributed to the one or more attributes. Such indices are called secondary rigid structural requirements of relational DBMS, such as indices and are often constructed over non primary-key at- the need to define the layout of the data in advance. Such tributes. Typically, a secondary index contains an entry for restrictions did not fit well with the processing needs of many each existing value of the attribute to be indexed. 
This entry modern Big Data applications, prompting the search for al- can be seen as a key/value pair with the attribute value as ternative data models. Further, scalability requirements ne- key and as value a list of pointers to all records in the base cessitated the distribution of data and computation. Widely table that have this value. In centralized database manage- used in practice are now data stores with a flexible table- ment systems (DBMS) a pointer is typically a physical iden- based data model. Examples are Bigtable [11], Cassandra tifier indicating the position of the record in the file system. [17] and HBase. In the following, we describe the general In distributed systems or more high-level implementations, structure along HBase as the others are quite similar. a pointer is often the primary key of the record, assuming that a lookup via the primary key can be done efficiently. 2.2.1 Data Model A simple index implementation would be a system table Similar to the relational database data model, HBase ag- (inverted list) with the index attribute as primary key and gregates related data into tables. Each table is comprised of the list of record pointers as extra column. This system table multiple rows. Each row contains a unique row key (or pri- can be sorted by primary key making the lookup of a specific mary key), one or more column families and columns under attribute value fast. More common are tree-based indices, them. The table is (logically) sorted by row key. In general, where inner nodes guide to leaf nodes which contain the each specific value in the columns is defined as a cell, and key/value pairs. Both sorted inverted lists and tree struc- can be referenced uniquely by the combination of row key, tures are good for point queries (search on a single attribute column family, and column. value) and range queries (search on a range of attribute val- An important characteristic of the data model is that not ues). A mechanism that performs very well for point queries all columns need to be provided for a row, and new columns is a hash-table with the index attribute as key and the list can be added on the fly. Column names are stored along with of pointers as value. the associated value in the database. Such tables can con- In some cases, a query can be answered directly using the ceptually be sparse as a given column need not be present index, without having to access the base table. Typically, in many of the rows. Also, column values are treated as index structures are effective if few records qualify the search byte arrays and clients are responsible to perform the proper criteria. Through the index, these records are found very data type conversion. This dynamic structure is a distin- quickly and then can be individually retrieved from the base guishing characteristic compared to RDBMS where the table table. When many records qualify, it might be faster to scan structure is rigid, typically defined ahead and maintained as the entire table, as a scan allows reading tuples in batches meta-table information. (e.g., all on a given block) while using an index not only Fig. 1 shows an HBase table containing employee informa- first requires the index access but the individual retrievals of tion, with two column families: Personal with columns Name the qualifying tuples might result in many random accesses and Res-Phone, and column family Office with columns across the blocks of the base table. In fact, if every block Office-phone and Dept. 
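To make the data model concrete, the first row of Fig. 1 could be written and read back through the HBase Java client roughly as in the sketch below. This is illustrative code, not part of the paper: it uses the HTable-era client API of the HBase 0.9x line (newer releases use Table and addColumn instead), and it assumes the user_info table already exists with the column families Personal and Office.

<syntaxhighlight lang="java">
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class DataModelExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Table and column names mirror Fig. 1; the table is assumed to exist
        // with column families "Personal" and "Office".
        HTable table = new HTable(conf, "user_info");

        // One row: row key "al23", cells spread over the two column families.
        Put put = new Put(Bytes.toBytes("al23"));
        put.add(Bytes.toBytes("Personal"), Bytes.toBytes("Name"), Bytes.toBytes("John"));
        put.add(Bytes.toBytes("Personal"), Bytes.toBytes("Res-Phone"), Bytes.toBytes("321-124-7894"));
        put.add(Bytes.toBytes("Office"), Bytes.toBytes("Dept"), Bytes.toBytes("1"));
        table.put(put);

        // Point lookup by row key, projecting only the Personal column family.
        Get get = new Get(Bytes.toBytes("al23"));
        get.addFamily(Bytes.toBytes("Personal"));
        Result result = table.get(get);
        String name = Bytes.toString(result.getValue(Bytes.toBytes("Personal"), Bytes.toBytes("Name")));
        System.out.println("Name: " + name);

        table.close();
    }
}
</syntaxhighlight>

Because column names travel with each cell, a second row could carry a completely different set of columns without any schema change, which is exactly the sparseness the data model is designed for.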
contains a qualifying record, an index will not reduce the I/O costs for bringing blocks into main memory. 2.2.2 Client API While not providing a full SQL interface, HBase and other 2.2 Table-based NoSQL DBMS table-based NoSQL DBS provide reasonably sophisticated call level interfaces. The most common data retrieval is the manages meta-information, is responsible for the allocation lookup method that requires the row key as input. HBase has of regions to region servers and balances the load in the internal index structures that very efficiently find a record HBase cluster. Clients only communicate with the master to based on its row key. Further, the relational equivalent of retrieve the meta information (e.g. the location of regions) projection is accomplished by providing the lookup method but they perform data transfer directly with region servers with the list of column families and columns of interest. so that the master does not become overloaded. Master Apart from lookup by primary key, HBase also provides a failures are handled by a monitoring system external to the scan operator that can take as input filters similar to SQL DBMS cluster. predicates. HBase then returns all records that satisfy the In the decentralized approach, there is no master node to filter. Queries over several tables, however, are not provided. perform explicit coordination, but each node in the cluster A scan operation on a range of row keys is relatively efficient has identical functionality and can take up the role of a coor- as HBase sorts a table by primary key. However, filters dinator. Clients connect to any node in the cluster. Apache over non row key attributes require a scan over the entire Cassandra follows this approach of cluster management. table to find the matching records. There are first attempts of providing secondary indices but none has been officially 2.2.4 Storage Structure integrated. Details are discussed in section 6. Many table-based NoSQL data stores follow a storage Writes and Updates are the same from the perspective approach first proposed in the Log-Structured Merge-Tree of the client API. They require as input the row key, the (LSM-Tree) [22]. In this approach, the table abstraction column families and column names to be updated and their has an in-memory component and a persistent storage com- corresponding values. Deletes require the row key of the ponent. The in-memory component is used to store the most record impacted, and also the column family names/column recent updates of data. Additionally, these updates are also names to be deleted. This allows specific attributes to be recorded in the database logs for durability, similar to tra- deleted instead of the entire record [14], a necessity when ditional DBMS. When the in-memory component becomes the data model allows the table to be sparse. Deletes in gen- very large, this data structure is persisted in an immutable eral do not translate to physical deletes. Instead, a delete format into the disk. Therefore, at any time a table might marker is used to indicate that the data was requested to be made up of one in-memory component and multiple im- be removed. Retrieval operations will encounter the delete mutable disk components. A search for a row key is first marker and skip that record. Storage optimization opera- performed on the memory component followed by the disk tions are often performed in a periodical manner which take components. 
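For comparison with the index-based access paths studied later, the native predicate path on a non-row-key attribute can be sketched as a full Scan with a SingleColumnValueFilter, which forces HBase to visit every region and evaluate the filter against each row. The code below is an illustration under the same assumptions as the previous sketch (user_info table of Fig. 1, HTable-era API), not code from the paper.

<syntaxhighlight lang="java">
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilterScanExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "user_info");

        // Predicate on a non-row-key attribute (Personal:Name = 'Mark').
        // Without a secondary index, HBase must scan the whole table and
        // test the filter against every row.
        Scan scan = new Scan();
        scan.setFilter(new SingleColumnValueFilter(
                Bytes.toBytes("Personal"), Bytes.toBytes("Name"),
                CompareOp.EQUAL, Bytes.toBytes("Mark")));

        ResultScanner scanner = table.getScanner(scan);
        for (Result row : scanner) {
            System.out.println(Bytes.toString(row.getRow()));
        }
        scanner.close();
        table.close();
    }
}
</syntaxhighlight>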
A background process can periodically scan care of physically removing the deleted data. and merge multiple disk components of a table into one larger data structure in the disk to reduce the amount of 2.2.3 Distributed Architecture structures to be searched during a read operation. In HBase terminology, the in-memory component is called In order to handle very large tables, most data store im- memstore, and the persistence storage is in Hadoop file sys- plementations split the rows of a table into multiple shards, tem (HDFS) [9]. HDFS is a fault tolerant distributed filesys- called regions in HBase. Each region can then be served by tem that is designed to run on commodity hardware and is one of the nodes in the DBS cluster, referred to as region optimized for large datasets. It forms an integral part of the server in HBase, and each region server can host several Hadoop ecosystem, providing storage functionality for many regions, also from different tables. In principle, this is simi- distributed applications such as HBase. When the memstore lar to the horizontal partitioning method employed by large starts filling up, the data is flushed to HDFS where it is scale relational DBMS. stored as immutable HFiles. Thus, a store is a collection of The decision of which shard a particular row belongs to memstore and the HFiles of the corresponding column fam- is often determined by a partitioning function over the row ily. The background process merging these HFiles is referred key. The two common partitioning approaches are based to as compaction. on hashing of the row key or on the range of the row keys. The later is an especially popular approach when locality of associated rows are desirable (such as retrieving over a small 3. INDEXING APPROACHES range of row keys). Such requests can often be processed Secondary indexing becomes complicated in a distributed by searching just one shard. HBase follows this approach. DBMS. As discussed in section 2.1, a straightforward mech- When a region becomes very large, the region server can anism stores the inverted list in a separate table and treats initiate a region split, where the rows are divided into two it as a “system table” automatically maintained by the in- equal sized daughter regions each covering a sub-range of the dex maintenance modules. System table and base table are rows. One of the regions will then typically be reassigned to treated as independent units by the DBMS and thus, in a a different region server. distributed setting, can reside on different nodes. A second HBase also performs vertical partitioning, as each column approach co-locates index and base table in such a way that family is stored in a different partition, referred to as store. the index entries for a record are guaranteed to reside on Such vertical partitioning approaches have been known to the same node as the base record itself. In the following provide better performance [8] when reading only a subset sections we will discuss the relevance and the general design of columns, a concept popularized by columnar DBMS [7]. approach to both these forms of secondary indexing along To utilize this approach, column families are chosen so as to with their pros and cons. bundle columns that are frequently accessed together. There are two fundamental approaches to cluster design. 3.1 Table-based Secondary Indexing In the first approach, a dedicated node acts as a master. In the table-based approach, the secondary index can be This is the case with Bigtable and HBase. 
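The LSM-style storage behaviour described above (a mutable in-memory component, immutable flushed components, and a background merge) can be summarised in a deliberately simplified sketch. This is conceptual code only, not HBase's memstore/HFile implementation; the flush threshold and in-memory "runs" are placeholders for what HBase persists to HDFS.

<syntaxhighlight lang="java">
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.TreeMap;

/** Conceptual LSM-style store: one mutable memstore plus immutable flushed runs. */
public class LsmSketch {
    private TreeMap<String, String> memstore = new TreeMap<>();
    // Flushed, immutable components, newest first (stand-ins for HFiles).
    private final Deque<TreeMap<String, String>> diskComponents = new ArrayDeque<>();
    private static final int FLUSH_THRESHOLD = 4; // tiny value, purely for illustration

    public void put(String rowKey, String value) {
        memstore.put(rowKey, value);
        if (memstore.size() >= FLUSH_THRESHOLD) {
            diskComponents.addFirst(memstore);   // "flush": the memstore becomes immutable
            memstore = new TreeMap<>();
        }
    }

    public String get(String rowKey) {
        String v = memstore.get(rowKey);          // most recent updates are checked first
        if (v != null) return v;
        for (TreeMap<String, String> run : diskComponents) {
            v = run.get(rowKey);                  // then the flushed runs, newest to oldest
            if (v != null) return v;
        }
        return null;
    }

    /** "Compaction": merge all immutable runs into one; newer values overwrite older ones. */
    public void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        Iterator<TreeMap<String, String>> it = diskComponents.descendingIterator();
        while (it.hasNext()) {                    // iterate oldest -> newest so newer values win
            merged.putAll(it.next());
        }
        diskComponents.clear();
        if (!merged.isEmpty()) diskComponents.addFirst(merged);
    }
}
</syntaxhighlight>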
The master server viewed as special system table whose row key is the sec- Node 1 Node 2 Node 3 Node 4 User_Info table User_Info table User_Info table User_Info table 1. Retrieve base table row keys node 1 using secondary attribute’s value Client API row key name … row key name … row key name … row key name … Client al23 John … liu7 Sally … pat8 Sean … see1 Sally … node 2 ke77 Mark … mel2 Mark … pet9 Mow … smi1 Zoe … kit9 John … nei3 Nancy … rid1 Justin … zik1 Alex … secondary index secondary index secondary index node 3 2. Retrieve base table row key user id row key user id row key user id rows using the row keys Alex zik1 ke77 liu7 base table node 4 Mark Sally secondary index al23 mel2 see1 John kit9 Mow pet9 Sean pat8 Justin rid1 Nancy nei3 Zoe smi1 Figure 3: Querying Using a Table-based Secondary Index Figure 2: Table-based Secondary Index ondary attribute’s value and an extra column contains the the node handling that index entry a hot spot. list of row keys of the base table records that contain this A big advantage of this approach is that it is easy to im- secondary attribute’s value. By using the DBMS’ table man- plement as it can exploit the table management mechanisms agement module also for index tables, they can be parti- already provided by the DBMS. tioned and distributed across the cluster in the same man- ner as base tables. Fig. 2 shows a concrete example with a 3.2 Co-located Secondary Indexing base table User_Info that is partitioned across 4 nodes, and A co-located index adheres to the shared nothing architec- a secondary index for attribute name, which is distributed ture [24], a design paradigm that is followed by most mod- across three of the nodes. Notice how the secondary index ern distributed DBMS, owing to its overall performance and entry for Mark, which is stored in node 2, points to two base scalability benefits. In this approach, the secondary index table records, one in node 1 and another in node 2. entries are stored on the same node as the corresponding Fig. 3 shows the control flow for a read request that utilizes base table records. Each node is therefore responsible for such a table-based secondary index. Read requests based on maintaining its portion of the secondary index. Fig. 4 shows a secondary attribute’s value are sent directly to the node a co-located secondary index for the same base table as in responsible for the secondary index entry corresponding to fig. 2. Notice how the secondary index has two entries for that value. This node can then either return the keys of the the value Mark, one in node 1, the other in node 2. base table records to the client which can then lookup each The control flow for a read request using a co-located sec- of the base table records by key (as shown in fig. 3) from ondary index is shown in fig. 5. A read request has to be the corresponding nodes, or the node itself can read the multicast to all the nodes in the system that contain at least base table records and send the results back to the client. one partition of the base table as any partition can contain As can be inferred, this is a four-hop process that leads to a matching record, making it necessary for all the parti- four consecutive message exchanges (the retrieval of the base tions to search their co-located portion of secondary index records from different nodes can be done in parallel). entries. 
If a node finds the value in its portion of secondary An important feature is that base table or index partitions index, it will lookup the corresponding records in the local with no relevant data are never involved. The index lookup base table partition and return them to the client. As the is only sent to the index partitions that maintain attribute secondary index search and base table retrieval steps are ex- values that are requested: for a point query (e.g., name = ecuted locally, a read request in this design has two-hops, ‘Mark’) this will be one partition, for a range query (e.g. as the individual searches on partitions can be executed all name LIKE ‘M%’), the index entries might span more than in parallel. This is, in principle, better than the four-hop one partition. After that, only the nodes that have base cost of table-based indexing. However, if a table has many partitions with matching records are contacted. partitions across many nodes, the message and index lookup If the number of matching base table records is fairly costs on all index partitions can be very high. If at the end small compared to the total number of partitions, this is only few records are returned, the benefits of co-location in an efficient approach as it avoids communication messages terms of message rounds might not outweigh the additional to nodes that have no data to return. However, if the base message and processing costs. records to be returned span most of the partitions, there A general advantage of this approach compared to a table- is little gain in terms of message exchange compared to a based index is that the writes to the base table are less ex- table scan. This is particularly true, as we have the addi- pensive, as the secondary index entries are updated locally, tional round of index access, that might itself contact many without the communication overhead between the nodes. partitions in case of large range queries. A potential disadvantage is that it cannot reuse the table Whenever a record is inserted into the base table or an management module of the underlying DBMS but has to be indexed attribute of a record is updated, the corresponding implemented from scratch. Nevertheless, this allows plenty secondary index (indices) must be updated. These updates of opportunity for optimization. involve non-negligible communication overhead, as in a large cluster, most secondary index entries will be located in a 3.3 Choosing The right approach different node than its base table record. This can lead to Table 1 shows a summary of the various costs associated contention in index updates, as with non-unique and skewed with both index types in a cluster of p partitions. 
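As a back-of-envelope illustration of this trade-off (an approximation introduced here for clarity, not a restatement of Table 1), the hop counts given above translate into message counts roughly as follows, where p is the number of base table partitions and m the number of partitions that actually hold matching rows; a point query is assumed to touch a single index partition in the table-based case.

<syntaxhighlight lang="java">
/**
 * Illustrative message-count model for a point query, derived only from the
 * hop counts stated in the text: four hops for table-based indexing, two for
 * co-located indexing. Every contacted partition counts as one request plus
 * one reply. This is a simplification, not the paper's cost model.
 */
public class ReadCostSketch {

    // Table-based: request/reply with the index partition, then request/reply
    // with each base partition that holds matching rows.
    static int tableBasedMessages(int m) {
        return 2 + 2 * m;
    }

    // Co-located: request/reply with every base partition, since any of them
    // may hold matching rows.
    static int coLocatedMessages(int p) {
        return 2 * p;
    }

    public static void main(String[] args) {
        int p = 16;
        for (int m : new int[] {1, 4, 16}) {
            System.out.printf("p=%d m=%d  table-based=%d  co-located=%d%n",
                    p, m, tableBasedMessages(m), coLocatedMessages(p));
        }
    }
}
</syntaxhighlight>

With few matching partitions (m much smaller than p) the table-based index exchanges far fewer messages despite its extra hops; once matches spread over most partitions, the two-hop co-located design catches up, which is the behaviour the experiments in section 5 confirm.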
data distributions, the probability of multiple concurrent From the table and our previous discussion, we can derive updates on the same secondary index entry will rise, making that the table-based approach will be beneficial when there Node 1 Node 2 Node 3 Node 4 Query using secondary node 1 User_Info table User_Info table User_Info table User_Info table attribute’s value Client API row key name … row key name … row key name … row key name … Client al23 John … liu7 Sally … pat8 Sean … see1 Sally … node 2 ke77 Mark … mel2 Mark … pet9 Mow … smi1 Zoe … kit9 John … nei3 Nancy … rid1 Justin … zik1 Alex … Lookup base table node 3 secondary index secondary index secondary index secondary index base table row key user id row key user id row key user id row key user id secondary index node 4 al23 Sally liu7 Justin rid1 Alex zik1 John Retrieve base table rows kit9 Mark mel2 Mow pet9 Sally see1 Mark ke77 Nancy nei3 Sean pat8 Zoe smi1 Figure 5: Querying Using a co-located Secondary Index Figure 4: Co-located Secondary Index Nodes to be contacted user_info_personal_name_idx Type Hops unique base row base rows > p i table-based 4 2 (1 index / 1 base) 1+ + p row key p co-located 2 p p Mark { ke77, mel2 } Table 1: Index type overheads for read requests Sally { liu7, see1 } Figure 6: Structure of table-based index are many partitions and queries only return few records as this will lead to few messages and low processing costs as index table has one column family i with one column p. The only a few relevant partitions are targeted. In contrast, co- value stored under p is a set of row keys of the base table location will be beneficial when there are generally few par- records, in the serialized form of a java TreeSet [10]. The titions or queries return records from most partitions. TreeSet is based on the Red-Black tree data structure, and We can also consider some advanced use cases for sec- costs only θ(log n) for inserts, updates and deletes, where ondary indices, such as the ability to use two indices simul- as a list-based structure would facilitate an O(1) insert but taneously to satisfy conjunctive queries. This can be per- costs θ(n) to perform updates and deletes [12]. formed easily in the case of co-located indices as the index Fig. 6 shows two example rows for an index created on the rows for both the indices will be in the same node (as the name column of the personal column family of the user_info node responsible for the base table partition). However, in table. The row keys of the content table are the userids of the case of the table-based approach, the DBMS will have the user information stored in the table. There are two rows to bring the various index rows of a base table record from in the index table. The first row’s key ‘Mark’ is a value of the different nodes together to perform this operation. name column from the user_info table, and the set of two Teradata, a very successful commercial parallel RDBMS, userids associated with it are the row keys of the records in follows the co-located secondary indexing approach for at- the user_info table with name ‘Mark’. tributes with non-unique values, while using a strategy sim- It must be noted that, in this approach, a write in the ilar to the table-based approach for attributes that have base table can also result in a write in the index HTable, unique values [25]. thereby increasing the I/O of the overall write operation. 4.1.1 Querying Using the HTable Index 4. 
SECONDARY INDEXING FOR HBASE In our implementation, it is the client who decides whether We integrated both indexing approaches into the HBase a query should use an existing index or whether HBase data store. We used HBase tables (HTable) for the table- should perform a standard table scan. For that purpose based index, and we were able to reuse most of the func- we extended the HTable interface of the HBase client API. tionality already provided by HTable. For the collocation We have created a new query method that has the same approach, we implemented our own solution as significant input parameters as the default query (table name and fil- changes to the underlying HTable distribution would have ter predicates that select specific rows), plus an additional been necessary to exploit it for collocation, and we did not parameter that indicates the column name that is indexed. want to change the HBase source code. Query execution is then controlled through the client. For Among the HBase recommendations to tackle indexing instance, if the filter predicate is a point query (e.g., name needs [5], only the co-processor framework [20] option can = ‘Mark’), the HBase client library first sends a lookup re- keep track of updates to base table near real-time. The quest to the HTable containing the secondary index, to find co-processor framework allows code to be injected into the the secondary entry with the requested attribute value as system without changing the base code, akin to triggers, row key. This query returns the serialized TreeSet of the stored procedures and aspect-oriented programming. It is base table row keys associated with this secondary index therefore suited for developing a modular and pluggable in- value. The set of returned row keys is then used by the dexing solution. We also extend the HBase client interface client library to perform a batched lookup for all matching to allow for the creation of indices and index-based queries. rows on the base table. If the base table has many regions, one batched lookup is sent to each region that contains at 4.1 Table-based Secondary Index least one matching row. The client library then collects all A table-based index for a secondary attribute is imple- results and returns them to the client application. mented via an HTable. Each distinct attribute value is rep- Support for range queries is implemented in a similar fash- resented as one row with the attribute value as row key. The ion. In this case the filter contains the start and end range of the secondary attribute to be constrained. The client library HashTable Index List of secondary values B-link tree first translates this to a range query on the secondary index with row keys in the search range. This query might be sent Sally to one or several regions of the HTable index. Each region will return the qualifying TreeSets. Once the row keys of all matching records are determined, the procedure is the same Mark as for a point query where the batched lookups are sent to all relevant partitions of the base table, and the results are assembled before return to the user. Mark Sally Note that if a query contains predicates for indexed at- tributes and predicates for attributes which have no indices, additional processing is required. There are many ways to { ke77, mel2 } { kit9, liu7 } achieve this. We first determine the row keys of rows that match the attributes that are indexed. 
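The client-side point-query flow of section 4.1.1 (look up the index HTable, deserialize the TreeSet of base table row keys, then issue a batched lookup against the base table) could look roughly like the sketch below. The index table name follows Fig. 6 and the value layout follows the description above (a serialized java TreeSet under family i, column p, assuming standard Java serialization); the helper itself is hypothetical and not the authors' client library.

<syntaxhighlight lang="java">
import java.io.ByteArrayInputStream;
import java.io.ObjectInputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class TableBasedIndexQuery {

    /** Point query: fetch base table rows whose indexed attribute equals 'value'. */
    @SuppressWarnings("unchecked")
    public static Result[] pointQuery(Configuration conf, String value) throws Exception {
        // 1. Look up the secondary index HTable; the attribute value is the row key.
        HTable index = new HTable(conf, "user_info_personal_name_idx");
        Result entry = index.get(new Get(Bytes.toBytes(value)));
        index.close();
        if (entry.isEmpty()) {
            return new Result[0];
        }

        // 2. The cell under family "i", column "p" holds a serialized
        //    java.util.TreeSet of base table row keys (as in Fig. 6).
        byte[] serialized = entry.getValue(Bytes.toBytes("i"), Bytes.toBytes("p"));
        TreeSet<String> rowKeys;
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(serialized))) {
            rowKeys = (TreeSet<String>) in.readObject();
        }

        // 3. Batched lookup of all matching rows on the base table; the HBase
        //    client groups the Gets by region internally.
        List<Get> gets = new ArrayList<>();
        for (String rowKey : rowKeys) {
            gets.add(new Get(Bytes.toBytes(rowKey)));
        }
        HTable base = new HTable(conf, "user_info");
        Result[] rows = base.get(gets);
        base.close();
        return rows;
    }
}
</syntaxhighlight>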
We then push the TreeSets for each secondary index value remaining filters down to the HBase server by the means of modified batch lookups. We are in the process of optimizing Figure 7: Abstract Layout of the Co-located In-Memory this rudimentary approach. Secondary Index 4.1.2 Region Management persistent only during a region shutdown. The split or merge of base table regions or index table re- Second, we wanted to support both point and range queries, gions has no impact on index management, as the approach and make updates very fast. While hash table based data only works with row keys, and HBase automatically redirects structures are capable of doing point lookups in θ(1), they lookups to the appropriate regions. The split of a table re- cannot be used for range queries. B+ -trees, on the other gion is independent of the split for an index region as for hand, are capable of performing range queries efficiently as HBase, these are two different HTables. Thus, region man- they store the secondary index values in a logically sorted agement is completely transparent and orthogonal to index order. But point queries have a cost of θ(n log n). Therefore, management. we create a hybrid data structure that is a combination of a In general, HBase decides automatically when to split re- hash table and a variation of B+ -tree to provide the best of gions and on which nodes to put them. Thus, in princi- both worlds. As before, the keys to these data structures are ple, the index designer does not need to be concerned at the secondary index values themselves. We store the set of all with region management. However, HBase allows the base table row keys associated with a secondary index value user to specify multiple regions at HTable creation time and using a TreeSet, exactly as we did in the table-based design. this might be useful when creating indices in order to evenly Each base table region has its own secondary index parti- distributed the load across region servers, and to prevent tion that references only the base table rows which it serves. a small number of region servers becoming a hot spot for Our tree-index structure is based on Blink -tree, a variation of index lookups. B+ -tree proposed in [19] and stated to have the highest con- currency and overall optimal performance according to [15, 4.2 Co-located In-Memory Secondary Index 23]. For the sake of brevity, we encourage the avid reader to refer to their original work for more details. In the co-location approach, each base table region (parti- Fig. 7 shows an example data of the in-memory index tion) has a collocated index that has index entries covering data structure. The set of row keys associated with each exactly the rows in the base table region. secondary index value is stored in a TreeSet. The references Although it is possible to create an HTable with exactly to the TreeSets are stored in the hash table as well as the this index content, HBase does not provide any straightfor- Blink -tree. Point queries can be easily satisfied by hash ta- ward means to enforce that this HTable-based index parti- ble lookups. The hash table also provides faster access to tion would be collocated with the corresponding base table TreeSets for updates and deletes. However, Blink -tree can region. This is because, by default, HBase considers these facilitate range queries based on the secondary index value. as independent HTables and decides individually on their lo- cation. 
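The hybrid structure of Fig. 7 can be pictured with the sketch below: a hash table for constant-time point lookups and an ordered map for range scans, both referencing the same TreeSets of base table row keys. A ConcurrentSkipListMap stands in for the Blink-tree of the actual implementation, since both provide sorted, concurrent range access; the class is an illustrative reimplementation, not the authors' code.

<syntaxhighlight lang="java">
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentSkipListMap;

/** Per-region, in-memory secondary index: hash view + ordered view over shared TreeSets. */
public class CoLocatedIndexSketch {
    // Point-lookup view: secondary value -> row keys of the local base table region.
    private final Map<String, TreeSet<String>> hashView = new HashMap<>();
    // Ordered view for range queries (stand-in for the Blink-tree of the paper).
    private final NavigableMap<String, TreeSet<String>> orderedView = new ConcurrentSkipListMap<>();

    public synchronized void add(String secondaryValue, String baseRowKey) {
        TreeSet<String> rowKeys = hashView.get(secondaryValue);
        if (rowKeys == null) {
            rowKeys = new TreeSet<>();
            hashView.put(secondaryValue, rowKeys);
            orderedView.put(secondaryValue, rowKeys);  // both views share the same TreeSet
        }
        rowKeys.add(baseRowKey);
    }

    public synchronized void remove(String secondaryValue, String baseRowKey) {
        TreeSet<String> rowKeys = hashView.get(secondaryValue);
        if (rowKeys == null) return;
        rowKeys.remove(baseRowKey);
        if (rowKeys.isEmpty()) {
            hashView.remove(secondaryValue);
            orderedView.remove(secondaryValue);
        }
    }

    /** Point query, e.g. name = 'Mark': served by the hash view. */
    public Set<String> pointLookup(String secondaryValue) {
        TreeSet<String> rowKeys = hashView.get(secondaryValue);
        if (rowKeys == null) {
            return Collections.emptySet();
        }
        return rowKeys;
    }

    /** Range query, e.g. name in ['M', 'N'): served by the ordered view. */
    public Collection<TreeSet<String>> rangeLookup(String fromInclusive, String toExclusive) {
        return orderedView.subMap(fromInclusive, true, toExclusive, false).values();
    }
}
</syntaxhighlight>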
We would need to modify the HBase code to imple- 4.2.1 Querying Using the Co-located Index ment a custom load balancer, which we did not consider an The client API for using the co-located index is similar attractive solution. Instead, we implemented our propriety, to that of the table-based index. The internal execution optimized index structure and embedded it transparently flow, however, is more similar to the one for native HBase into the HBase system using the co-processor framework. table scans than for table-based indexing. When the client Our index structure has been motivated by various as- API receives a query that wants to use a co-located index, it pects. First, we decided to focus on main-memory indexing. sends a special request containing filter conditions contained That is, the entire index partition is kept in main memory in the query to all the regions of the base table. These special and only persisted during a region shutdown. As a result, requests are sent in parallel so that the execution across all index maintenance is very fast and can exploit optimized regions can execute concurrently1 . main-memory data structures. Our motivation is that we At the server side, the special request is directed to a are looking at applications that use HBase’s capability to co-processor method that performs a lookup over the hash- scale to many nodes in order to keep the working data set in main memory and avoid expensive I/O. In such settings 1 Note that the native HBase implementation sends the it should be feasible to keep index structures in memory, query to one region after the other not allowing for simulta- too. Thus, our co-located index is memory resident and is neous scans across all regions table portion or a range query over B+ -tree portion of the Average time per non-batched put memory resident secondary index structure associated with 7,000 μs the base table region to retrieve the base table row keys. 6,000 number of threads = 4 These row keys are used to fetch the base table rows from the 5,000 corresponding region locally. Queries that contain complex 4,000 filters that span attributes with index and attributes without 3,000 index, can be easily handled with co-located indices. The indices are used to determine the row keys of rows that fulfill 2,000 the search criteria on the indexed attributes. Then, the base 1,000 table is accessed to retrieve those rows and return only those 0 number of records that fulfill the remaining criteria on the other attributes. 200,000 500,000 1,000,000 2,000,000 3,000,000 4,000,000 tbl-index : zipf tbl-index : unif coloc-index : zipf coloc-index : unif no-index : zipf no-index : unif 4.2.2 Persistence, Recovery and Region Splits A major challenge in main memory based database sys- Figure 8: Average time per non-batch put tems is the issue of backup and recovery. There is the dilemma of what happens to the data in memory when a used both Uniform distribution as well as Zipfian2 distri- region is shutdown or if it suffers a crash. Main memory bution to generate different test datasets. The uniform dis- database systems depend on some fast logging mechanisms tribution (over the size of the number of records in the table) to permanent storage in order to overcome such issues [13]. ensures that the values are highly distinct. 
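How such a per-region index can be kept in step with base table writes through the coprocessor framework can be sketched with a RegionObserver that hooks the region-local write path, reusing the CoLocatedIndexSketch class from the previous sketch. The observer signature follows the HBase 0.9x coprocessor API; the indexed column family and column, and the wiring in general, are assumptions for illustration and not the paper's implementation.

<syntaxhighlight lang="java">
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

/** Keeps a per-region, in-memory secondary index in step with base table writes. */
public class IndexMaintenanceObserver extends BaseRegionObserver {

    private static final byte[] FAMILY = Bytes.toBytes("personal"); // indexed column family (assumed)
    private static final byte[] QUALIFIER = Bytes.toBytes("name");  // indexed column (assumed)

    // One observer (and hence one index instance) is attached to each region;
    // CoLocatedIndexSketch is the structure sketched in section 4.2.
    private final CoLocatedIndexSketch index = new CoLocatedIndexSketch();

    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                        Put put, WALEdit edit, Durability durability) throws IOException {
        // Only cells that touch the indexed column trigger index maintenance.
        List<Cell> cells = put.get(FAMILY, QUALIFIER);
        for (Cell cell : cells) {
            String secondaryValue = Bytes.toString(CellUtil.cloneValue(cell));
            String baseRowKey = Bytes.toString(put.getRow());
            // Local update only: no cross-node communication is needed, which is
            // the main write-path advantage of co-location discussed in section 3.2.
            index.add(secondaryValue, baseRowKey);
        }
    }
}
</syntaxhighlight>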
Zipfian on the This, however, could introduce non-negligible delays into other hand produces values of a variety of selectivity as the processing that can threaten the performance benefits of frequency of occurrence of the values vary and hence rep- a memory based database. For our memory resident co- resents skewed data sets. We used the scrambled Zipfian located index, we rely on the fact that a completely lost in- distribution module in YCSB3 for our experiments. dex can be recovered by rebuilding it from scratch by scan- To reduce any interference from HBase compactions and ning the base table region. This, being a local operation, java old generation garbage collection [21] on our test met- can be accomplished with reasonable performance overhead. rics, we disabled HBase automatic compactions and increased Given that crashes are not frequent and only always affect the jvm heap. Additionally, test metrics were computed by one node, we believe that these recovery costs are reasonable averaging over five runs. and worth the improved performance during runtime. However, as the index has to be co-located with each base 5.1 Non-Batch Writes table region, we need to handle region splits to ensure that To measure the impact on write to base table, we used the memory resident indices are also split accordingly. This a client that performed non-batched Put using four threads is accomplished by having the coprocessor listen to region into an empty table, incrementally adding up to four mil- splits. Upon the initiation of a region split it will create two lion row keys and associated secondary index values. We new indices from the original in-memory index data struc- logged the average time taken for every hundred thousand ture and persist them into HDFS as separate files. The re- records as the table grew in size to understand the impact gion server will then shutdown the now split region (which of a growing table size. As can be seen in fig. 8, the memory will not exist anymore). The new two daughter regions are resident co-located index does better than the table-based then brought up (by possibly different region servers) and implementation for both uniform and Zipfian distributions. the coprocessor instances attached to them will load the cor- The performance overhead of co-located index is between 9% responding index files from HDFS into main memory. to 15% as the table size increases. However, for table-based In similar spirit, we persist the indices to HDFS storage implementation the performance penalty is almost 125% for during regular region shutdown so that they can be restored uniform distribution. This is to be expected as every Put without having to scan the base table region. We do so, be- is now in effect replaced by two Puts. For Zipfian distri- cause HBase’s load-balancing might determine that regions bution, the performance penalty starts at 300% for 200,000 have to be moved. For such moves, it will be faster to per- records and becomes 600% by a million records indicating sist and transfer the index than re-creating it from scratch severe performance issues. This is attributed due to the at the new region server. skewed representation of data in Zipfian distribution result- ing in more contention as well as larger secondary index entries which takes even more time to process, compound- 5. EXPERIMENTS ing the issue. 
However, this is not an issue for the co-located index as each region processes and updates its secondary in- For experimental setup, we used a cluster of nodes each dex entries locally, causing less contention as the index en- with Intel R Pentium R Dual-Core CPU G2020 @ 2.9GHz , tries for the same secondary attribute value are distributed 8GB 1333MHz RAM, 500GB SATA-2 HDD, running Ubuntu across multiple regions. Further, in-memory processing is 12.04.4 LTS 64-bits. The nodes are connected together using also faster over HTable, reducing any chance of contention a 1 Gbps switch. Four of the nodes are configured as HBase and giving advantage to co-located index. Region Servers, also running HDFS Data Nodes. One node To measure the scalability of processing, we vary the num- is setup for running HMaster and HDFS Name Node. An- ber of client threads. As can be seen in fig. 9, all except the other node was used for running the client processes. We used HBase-0.96 and Hadoop-1.2 for the test setup. 2 https://en.wikipedia.org/wiki/Zipf’s law In order to understand the impact of the distribution of 3 https://research.yahoo.com/news/yahoo-cloud-serving- secondary indexed attribute’s values on performance, we benchmark/ Average time per non-batched put with different Average time per read number of threads 10,000 number of threads = 4 ms 5,000 1,000 4,500 μs #records inserted = 4 million 4,000 100 3,500 10 3,000 2,500 1 2,000 1,500 0 1,000 0 500 0 number of threads 0 number of records 2 4 8 200,000 500,000 1,000,000 2,000,000 3,000,000 4,000,000 tbl-index : zipf tbl-index : unif coloc-index : zipf tbl-index : zipf tbl-index : unif coloc-index : zipf coloc-index : unif no-index : zipf no-index : unif coloc-index : unif no-index : zipf no-index : unif Figure 9: Average time per non-batch put for different num- Figure 12: Average time per read using secondary index ber of threads Average time per read chart Average time per record using batch puts 1000 number of threads = 4 ms 100 μs number of threads = 1 1000 10 100 1 0.1 10 0.01 0.001 number of records 1 200,000 500,000 1,000,000 2,000,000 3,000,000 4,000,000 number of records 500,000 1,000,000 2,000,000 3,000,000 4,000,000 tbl-index : zipf tbl-index : unif tbl-index : zipf tbl-index : unif coloc-index : zipf coloc-index : zipf coloc-index : unif coloc-index : unif no-index : zipf no-index : unif Figure 13: Average time per read using secondary index Figure 10: Average time per record for batch put Average time per read Average time per record using batch puts 6 number of threads = 4 ms 20 uniform distribution μs number of threads = 1 5 18 table size = 1 million 16 4 14 12 3 10 8 2 6 4 1 2 0 number of records 0 records per secondary index value 500,000 1,000,000 2,000,000 3,000,000 4,000,000 1 2 3 4 5 10 coloc-index : zipf coloc-index : unif tbl-index coloc-index no-index : zipf no-index : unif Figure 14: Average time per read using secondary indexed Figure 11: Average time per record for batch put, co-located attributes at varying selectivity index vs no index by the update to the base table and index itself. While table- table-based Zipfian scales well, giving the optimal perfor- based Zipfian continues to show deteriorated performance, mance at four threads, where each region server is occupied what is intriguing is that the performance penalty of table- simultaneously. We observe that the table-based Zipfian fail based index with uniform distribution is also very severe. to scale owing to the contention as discussed before. 
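For reference, the difference between the non-batched writes measured here and the batched loading examined in section 5.2 comes down to how the client submits its Puts. The sketch below illustrates both paths with the HTable-era client API; the table name, column names and record counts are placeholders rather than the actual workload generator used in the experiments.

<syntaxhighlight lang="java">
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WritePathsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "user_info");

        // Non-batched: every put() is a separate round trip to the region server
        // (the style of workload measured in section 5.1).
        for (int i = 0; i < 1000; i++) {
            Put put = new Put(Bytes.toBytes("row-" + i));
            put.add(Bytes.toBytes("Personal"), Bytes.toBytes("Name"), Bytes.toBytes("user-" + i));
            table.put(put);
        }

        // Batched: buffer Puts on the client and ship them together
        // (the style of loading measured in section 5.2).
        table.setAutoFlush(false);          // let the client buffer writes
        List<Put> batch = new ArrayList<>();
        for (int i = 1000; i < 2000; i++) {
            Put put = new Put(Bytes.toBytes("row-" + i));
            put.add(Bytes.toBytes("Personal"), Bytes.toBytes("Name"), Bytes.toBytes("user-" + i));
            batch.add(put);
        }
        table.put(batch);                   // grouped by region server under the hood
        table.flushCommits();               // push any remaining buffered writes
        table.close();
    }
}
</syntaxhighlight>

Note that, as discussed above, server-side coprocessors are still invoked once per Put, so batching helps the base table write but not the per-record index maintenance of the table-based approach.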
The reason for this is that, though the client uses batch mode for performing Puts, the coprocessor architecture of HBase 5.2 Batch Writes results in the coprocessors being invoked once per each Put, Further, we performed the same experiment, this time dampening the benefits of batched approach. The coproces- with batched Put, which is the most efficient way of data sor is therefore forced to update the index HTables using loading in HBase. We used a single thread and batches of individual Puts. Fig. 11 shows the same results without 100,000 records. From the results in fig. 10, we can see that including table-based implementation for better scalability. the co-located index does far better than table-based index- ing. Co-located indexing still has about 95% penalty. This 5.3 Point Reads Using Secondary Indices can be expected because, with batched Puts, inefficient net- Point reads are performed using the indices created on work traffic is reduced, and the only performance bounds is secondary attributes and compared against the naı̈ve HBase Average time per read Response time for range queries in uniform distribution 12 number of threads = 1 seconds 7 records per secondary index value = 1 table size = 1 million ms uniform distribution 10 6 table size = 1 million 5 8 4 6 3 4 2 1 2 0 number of read threads 0 range 1 2 4 8 0 50,000 100,000 150,000 200,000 tbl-index coloc-index tbl-index coloc-index no-index Figure 15: Average time per read using secondary indexed Figure 16: Response time for range queries in uniform dis- attributes with varying number of threads tribution alternative, the scan operation. The scan operation results Response time for range queries in Zipfian distribution in all of the region to be read. We used tables with different 12 number of threads = 1 seconds number of records to better measure the variance in perfor- table size = 1 million 10 mance as a function of table size. A client with 4 threads is used to issue the read requests. The values for secondary in- 8 dexed attributes are drawn from the same distribution used 6 to load them into the tables to perform random reads. The results are shown in fig. 12. For better scale, just 4 the comparison of the two indexing approaches is shown 2 in fig. 13. We can see that all the index implementations provide better performance. For uniform distribution, the 0 range memory resident, co-located index performance increases 0 50,000 100,000 150,000 200,000 from 77 times to 1280 times as the table size grows from tbl-index coloc-index no-index 200,000 to 4 million. The performance benefits for co-located index for Zipfian stays around 21%. This is because with Figure 17: Response time for range queries in Zipfian distri- Zipfian, there are more records for popular values, and this bution results in more processing and data transfer. However, even Therefore on doing detailed analysis on the performance of in such cases the benefit of having the index is very evident. read queries, one can conclude that table-based implemen- Table-based Zipfian shows the least improvement, owing tations are better choice for indexed attributes with near to the fact that a large number of records will have to be read unique distribution, while co-located implementations pro- from all the region servers. Hence retrieving the row keys vides better performance when the secondary index distri- first and then querying the base records using them turns bution tends to be more non-unique. out to be costly across the network hops. 
However, table- based index seems to do better with Uniform distribution. 5.4 Range Queries using Secondary Indices It gives 100 to 1400 times better performance, outdoing even To test range queries, we chose a single region setup in the the memory resident co-located index implementation. interest of fairness, as HBase scans one region at a time in To analyze this further, we next ran the test case for uni- contrast to our parallel approach. Fig. 16 shows the perfor- form distribution, varying the number of records per sec- mance for uniform distribution and fig. 17 shows the same for ondary index value. The results shown in fig. 14 shows that Zipfian distribution. In both distributions, the memory res- as the number of records increase, the table-based index ident co-located index seems to outperform the table-based looses its advantage over the co-located one. This is because implementation. This is in concurrence to our observation with increase in number of records to be retrieved from the in section 5.3 where the co-located index fairs better when base table, the table-based implementation will likely has to there are many records to be returned. interact with all the region servers, a scenario in which the Further, as the range increases, the performance advan- co-located index will do better due to the reduced number of tages of secondary indices decrease and beyond some range, hops in the overall read request due to its broadcast nature. it becomes an overhead to use the index. This threshold We also analyzed the impact of concurrency by varying seems to appear at about the range of 80,000 (8% of records the number of read threads while keeping the number of in the table.). Beyond this range, scanning the base table base table records per secondary index value at 1. From the region directly is beneficial over using the secondary index. results in fig. 15, we can see that co-located index can do better at lower concurrency, but as the number of threads increases, table-based index starts doing better. This is be- 6. RELATED WORK cause, at small workloads, the overhead of broadcast in co- As stated in section 3.3, Teradata follows a co-located in- located implementation does not weigh-in into performance, dexing approach for non-unique attributes and a re-distribution while it also benefits from the reduced request rounds. But of index entries for unique attributes. However, in Teradata, at higher concurrency, this overhead starts to come into play. row distribution (for non co-located index entries as well as base table records) is accomplished via automatic hashing of the attribute values [25]. While this leads to ease of imple- [2] IHBase. https://github.com/ykulbak/ihbase. mentation and maintenance of table and index structures, [3] ITHBase. https://github.com/hbase-trx/ the downside is there is no locality of associated attribute hbase-transactional-tableindexed. values due to hashing, and operations such as range searches [4] Lily. http://www.lilyproject.org/. will have to be performed in all the nodes. [5] Secondary Indexes and Alternate Query Paths. http:// ITHBase [3] is an open source implementation developed hbase.apache.org/0.94/book/secondary.indexes.html. by modifying base HBase source code to include transac- [6] Solr. http://lucene.apache.org/solr/. tion and secondary index support. Their implementation is [7] D. J. Abadi, P. A. Boncz, and S. Harizopoulos. similar to our table-based indexing solution. Column-Oriented Database Systems. 
VLDB, pages IHBase [2] is an in-memory secondary indexing solution 1664–1665, 2009. for HBase, quite similar to our own in-memory co-located in- [8] D. J. Abadi, S. R. Madden, and N. Hachem. dexing solution. By modifying core HBase, only the portion Column-Stores vs. Row-Stores: How Different Are of the data in disk is indexed and the contents of the mem- They Really? In ACM SIGMOD, pages 967–980, 2008. store is searched completely. The index is built from scratch [9] D. Borthakur. HDFS Architecture Guide. http: by scanning the region every time it is brought online. //hadoop.apache.org/docs/r1.2.1/hdfs design.html, Culvert [1] is a secondary index implementation similar 2008. to our table-based approach. However, it seems like they do [10] G. Bracha. Generics in the Java programming their index table updates via modified client side code which language. Sun Microsystems, pages 1–23, 2004. needs to be used to perform data loading. [11] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Lily [4] is a cloud-scale repository for social content ap- Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. plications that uses HBase for data storage and Solr [6] for Gruber. Bigtable: A Distributed Storage System for indexing needs. Their indexing engine therefore resides out- Structured Data. ACM Trans. Comput. Syst., pages side the HBase ecosystem. 4:1–4:26, June 2008. For the sake of brevity, we have covered only the variants of HBase indexing implementations that are popular in liter- [12] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. ature. In our implementation, we focused on leveraging the Introduction To Algorithms. MIT Press, 2001. flexibility of the coprocessor framework over the approaches [13] H. Garcia-Molina and K. Salem. Main Memory that required modification to core HBase classes. This has Database Systems: An Overview. IEEE Trans. on the advantage that existing applications need not be mi- Knowledge and Data Eng., pages 509–516, Dec 1992. grated to perform index maintenance. Further, our analysis [14] L. George. HBase: The Definitive Guide. O’Reilly demonstrates that either of the indexing approaches could Media, 2011. be the optimal solution depending on the data distribution. [15] T. Johnson and D. Sasha. The Performance of Current B-tree Algorithms. ACM Tran. on Database 7. CONCLUSION & FUTURE WORK Systems, pages 51–101, 1993. [16] A. Khurana. Introduction to HBase Schema Design. In this paper we described the challenges associated with Networked Syst, 37:29–36, 2012. indexing in distributed DBMS and provided two implemen- [17] A. Lakshman and P. Malik. Cassandra: A tations for HBase that works with less intrusion into the core Decentralized Structured Storage System. ACM HBase source code. Our indexing solutions can work with SIGOPS Operating Systems Review, 44(2):35–40, 2010. point queries as well as range queries. Further, we provided a very detailed analysis of how different data distributions [18] N. Leavitt. Will NoSQL Databases Live Up to Their warrant different indexing approaches and demonstrated a Promise? Computer, 43(2):12–14, 2010. case for both implementations. Our results show that there [19] P. L. Lehman et al. Efficient Locking for Concurrent is clearly a benefit to having secondary indices in HBase, Operations on B-trees. ACM Trans. on Database and that they can be often built with reasonable perfor- Systems, pages 650–670, 1981. mance overhead. Although there has been some prior works [20] A. P. Mingjie Lai, Eugene Koontz. 
Coprocessor to achieve secondary indexing in HBase, our work have been Introduction. https://blogs.apache.org/hbase/entry/ more detailed and insightful about the various alternatives coprocessor introduction, 2012. and clearly shows that there is no one-stop solution to sec- [21] Oracle Corporation. Java Garbage Collection Basics. ondary indexing needs in HBase. http://www.oracle.com/webfolder/technetwork/ For our future work, we are exploring on how to use End- tutorials/obe/java/gc01/index.html. Point coprocessors to re-build secondary indices for batch [22] P. O’Neil, E. Cheng, D. Gawlick, and E. O’Neil. The processing-only type of workloads where real-time consis- Log-Structured Merge-Tree (LSM-Tree). Acta tency of index structures are not necessary. This approach Informatica, 33(4):351–385, 1996. is also common in large relational databases where the in- [23] V. Srinivasan and M. J. Carey. Performance of B+ dices are dropped during bulk loads and rebuilt after the Tree Concurrency Control Algorithms. VLDB, pages data loads are completed. A key challenge would be how to 361–406, 1993. tackle the client reads while indices are being rebuilt in a [24] M. Stonebraker. The Case for Shared Nothing. IEEE less intrusive manner. Database Eng. Bull., 9(1):4–9, 1986. [25] Teradata Corporation. Introduction to Teradata. 8. REFERENCES www.teradata.com/workarea/downloadasset.aspx?id= [1] Culvert. 17947, 2010. https://github.com/booz-allen-hamilton/culvert.