             Efficient Indexing of Hashtags using Bitmap Indices
Lawan Thamsuhang Subba (Aalborg University, lawbba@cs.aau.dk)
Christian Thomsen (Aalborg University, chr@cs.aau.dk)
Torben Bach Pedersen (Aalborg University, tbp@cs.aau.dk)

ABSTRACT
The enormous amounts of data being generated regularly mean that rapidly accessing relevant data from data stores is just as important as its storage. This study focuses on the use of a distributed bitmap indexing framework to accelerate query execution times in distributed data warehouses. Previous solutions for bitmap indexing at a distributed scale are rigid in their implementation, use a single compression algorithm, and provide their own mechanisms to store, distribute and retrieve the indices. Users are locked to their implementations even when other alternatives for compression and index storage are available or desirable. We provide an open source, lightweight, and flexible distributed bitmap indexing framework, where the mechanisms to search for keywords to index, the bitmap compression algorithm used, and the key-value store used for the indices are easily interchangeable. We demonstrate, using Roaring bitmaps for compression, HBase for storing key-values, and by adding to Apache Hive an updated version of Apache Orc that uses bitmap indices, that although there is some runtime overhead due to index creation, the search for hashtags and their combinations in tweets can be greatly accelerated.

1 INTRODUCTION
Social media platforms like Facebook, Instagram and Twitter have millions of daily active users. On a daily basis, users upload content onto the platforms on a petabyte scale. To keep their user bases engaged and active, it is critical for such platforms to ensure that users can quickly find content that is relevant to them. Due to ever decreasing storage costs, restrictions are not placed on how much information users can upload. Therefore, efficient retrieval from data warehouses becomes just as important as storage. Most social media platforms support hashtags: keywords containing numbers and letters preceded by a hash sign (#). Hashtags allow users to add specific, targeted keywords to the content they upload on social media platforms, allowing other users in turn to find it. Their simplicity and lack of formal syntax have allowed for their widespread adoption on multiple platforms. Efficiently finding relevant hashtags and their combinations at the Big data scale is a challenge.

The volume, velocity and variety of data arriving every second mean that distributed file systems like the Hadoop Distributed File System (HDFS) [19] are preferred for Big data. HDFS supports several file formats: text/comma-separated values (CSV) files, column-oriented storage formats like Orc [15] and Parquet [16], and row-oriented storage formats like Avro [1]. The difference between row-oriented and column-oriented storage formats lies in how they store contiguous blocks of data. Row-oriented storage formats store successive rows contiguously, whereas column-oriented storage formats ensure that all values of a column are stored contiguously. The former is suitable for transactional (OLTP) workloads, while the latter is suitable for analytical (OLAP) workloads. In order to read only relevant data, column-oriented storage formats like Orc and Parquet support predicate pushdown, where search predicates using =, <, >, <=, >= or != are pushed down to the storage level and evaluated against aggregate based indices holding the minimum and maximum values for each column in each block. Such aggregate based indices work fine when the data is ordered, but when the data is unordered or skewed, they are prone to false positive results. To alleviate this problem, [20] proposed column imprints, which scan the entire column to create bit vectors for every cache line of data. A bit is set within a bit vector if at least one value occurs within the corresponding bin. As a result, a unique imprint of the cache line is created, providing a coarse-grained view of the entire column. However, neither the default aggregate based indices of Orc nor the column imprint index supports the indexing of substrings like hashtags that can exist within non-numeric columns. For queries like SELECT tweet FROM table WHERE string LIKE "%#hashtag%" there is no alternative but to read every single shard of the dataset, which is not practical for large datasets as it is incredibly time consuming.

The compressible nature of bitmap indices, the ability to perform hardware-assisted logical operations (AND, OR, XOR) on them, and their lightweight nature compared to tree-based indices make them ideal for indexing hashtags under such conditions. Distributed bitmap indexing frameworks are not a new concept, and several have been developed [10, 13, 17]. However, they are rigid in their use of a specific bitmap compression algorithm and provide their own implementations to store, distribute and retrieve the bitmap indices. A flexible system where the compression algorithm and the mechanism to store, distribute and retrieve bitmaps can easily be swapped for state-of-the-art alternatives is desirable. Such a system allows for the reuse of existing technologies, and better alternatives can be swapped in when available. With this paper, we develop and evaluate a lightweight and flexible bitmap indexing framework which can be incorporated into existing big data technologies. Our main contributions are:

(1) An open source, lightweight and flexible distributed bitmap indexing framework for big data which integrates with commonly used tools including Apache Hive and Orc. Users can easily plug in their desired functions to find keys to index, bitmap compression algorithm and key-value store. Otherwise, they may use the default setup consisting of the Roaring bitmap index and HBase as the key-value store, and provide only their specific method to find indexable keys.

(2) A demonstration of how the search for substrings like hashtags in tweets can be greatly accelerated by using our bitmap indexing framework. The storage costs for bitmap indices are minimal; however, there is runtime overhead due to index creation.

The paper is structured as follows. Section 2 provides background information on the technologies used in our framework. Section 3 presents the related work. Section 4 describes the implementation details of our indexing framework. Section 5 presents the experiments conducted with our distributed indexing framework. Finally, Section 6 concludes our work.

© 2019 Copyright held by the author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on CEUR-WS.org.

2 BACKGROUND
2.1 Apache HBase
A key-value database stores data as a series of key-value pairs where the key acts as a unique identifier and maps to a value in the database. Both the key and the value can be either simple or complex types, as supported by the key-value database. The three major operations that define a key-value database are put(key, value), get(key) and delete(key). Apache HBase [9] is an open-source distributed wide column store providing key-value store operations on top of HDFS. HBase is used for low latency read/write operations on large datasets. Logically, data in HBase is organized as labeled tables containing rows and columns; each row is defined by a sorting key and an arbitrary number of columns. Several other open source key-value databases are available [5, 11, 14, 18].
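As a concrete illustration of these three operations, the following minimal sketch uses the standard HBase Java client API; the table name "demo", column family "cf" and qualifier "q" are ours, chosen only for the example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyValueOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("demo"))) {
            // put(key, value): store a value under row key "k1"
            Put put = new Put(Bytes.toBytes("k1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"),
                    Bytes.toBytes("v1"));
            table.put(put);
            // get(key): read the stored value back
            Result result = table.get(new Get(Bytes.toBytes("k1")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"));
            // delete(key): remove the row again
            table.delete(new Delete(Bytes.toBytes("k1")));
        }
    }
}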
2.2 Apache Orc
Apache Orc is a column-oriented file format for Hadoop that is both type aware and self-describing [15]. It is optimized for reading data and creates indices at multiple levels to find relevant data efficiently. A low-level example of the Orc file format and the three levels of its aggregate based indices are shown in Figure 1.

[Figure 1: Orc File Structure. (a) A sample dataset with the columns num and string (12 records; e.g., num=8 holds "twt8 #tag1 #tag2"). (b) The dataset stored as an Orc file: per-stripe index data and row data streams, stripe footers, and (min, max) indices at rowgroup, stripe and file level in the file footer, followed by the postscript.]
We use a small dataset in Figure 1a with 12 records (each has two attributes, num and string) to illustrate how a dataset is stored as an Orc file, including its aggregate based indices. An Orc file comprises independent units called stripes, each having a default size of 64MB. Each stripe is further composed of groups of rows called rowgroups, with a default size of 10,000 rows. In our example, a rowgroup size of three is used, resulting in an Orc file containing two stripes and four rowgroups. For every stripe, the index data streams store index information about the columns of every rowgroup within the stripe. For both numeric and non-numeric attributes, the (min, max) values within the rowgroup are used as indices. In the case of non-numeric attributes, the values are compared lexicographically to determine the min and max values. The rowgroup data streams contain the actual data stored in the rowgroups within stripes. Each stripe contains a stripe footer with information on the index stream, the data stream and the encoding of each column in the stripe. The rowgroup indices within a stripe are used to produce the indices at stripe level, and the stripe level indices are used to generate the file level index. The file level and stripe level indices are stored in the file footer section of the Orc file. Additionally, information about the datatype of every column in the file, the number of stripes and the number of rowgroups in every stripe is stored in the file footer. Lastly, the information regarding how to interpret an Orc file, including the version of the file, the length of the file footer and the compression used, is stored in the postscript section.

In our example, the file level index contains (min=1, max=50) for the column num and (min="twt1", max="twt50") for the column string. Queries with filter predicates like SELECT * FROM table WHERE num = 51 or string = "twt51" search for keys outside the min-max range and are stopped immediately. However, in cases where data is unordered or skewed, aggregate based indices are prone to false positives. For instance, in our example, rowgroup-4 contains skewed data, and its index information is distorted for both columns. Queries searching for num = 25 or string = "twt25" end up reading data streams from the Orc file even though no stripes or rowgroups in the Orc file hold those values. [20] proposed the column imprint index to solve this problem for numeric columns. However, neither the default aggregate based indices of Orc nor the column imprint index supports the indexing of substrings that can exist within non-numeric columns. For example, our sample dataset contains tweets and some rare hashtags. There is no way to accelerate queries like SELECT * FROM table WHERE string LIKE "%#tag1%" or (AND and OR) combinations of hashtags.

In addition to aggregate based indices, Orc also supports Bloom filters [2] on its columns. Bloom filters are highly space efficient probabilistic data structures for determining set membership. Compared to aggregate based indices and column imprints, the probabilistic nature of Bloom filters means that false positives are possible but false negatives are not. Although the false positive probability is configurable in Orc, bitmap indices do not suffer from this problem.
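The pushdown path described above can be exercised directly through Orc's public reader API. The following minimal sketch, assuming a recent Orc release (1.5 or later) and the hive-storage-api SearchArgument classes, pushes the predicate num = 51 down so it is evaluated against the (min, max) rowgroup and stripe indices; the file name is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

public class PushdownExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Reader reader = OrcFile.createReader(new Path("sample.orc"),
                OrcFile.readerOptions(conf));
        // WHERE num = 51: checked against the (min, max) rowgroup and stripe
        // indices, so non-matching parts of the file can be skipped entirely.
        SearchArgument sarg = SearchArgumentFactory.newBuilder()
                .startAnd()
                .equals("num", PredicateLeaf.Type.LONG, 51L)
                .end()
                .build();
        Reader.Options opts = new Reader.Options(conf)
                .searchArgument(sarg, new String[]{"num"});
        try (RecordReader rows = reader.rows(opts)) {
            VectorizedRowBatch batch = reader.getSchema().createRowBatch();
            while (rows.nextBatch(batch)) {
                // only batches from rowgroups whose min-max admits 51 arrive here
            }
        }
    }
}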
3 RELATED WORK
A bitmap index is a special kind of index where, if a dataset contains N records and an attribute A has D distinct values, the bitmap index generates D bitmaps having N bits each. Each bit in a bitmap is set to "1" if the corresponding record contains that value; otherwise, the bit is set to "0" [21]. Bitmap indices can be compressed significantly and require less space than other conventional tree-based indices. In addition, hardware supported bitwise operations (AND, OR, NOT and XOR) can be utilized on bitmap indices in order to speed up queries. Bitmap compression schemes based on run-length encoding (RLE), such as WAH [22] and PLWAH [6], have been proposed to reduce the space occupied by bitmap indices. [22] proposed the Word-Aligned Hybrid (WAH) bitmap compression, where a sequence of consecutive bits of ones or zeros can be
represented with their bit value and a count indicating the length of the sequence. WAH runs are composed of fills and tails. A fill is a set of similar bits that is represented as a count plus a bit value indicating whether it is a zero fill or a one fill. The tail is a mixture of zeros and ones, which is represented without compression. [6] observed that WAH compression runs were never long enough to use all the bits allocated for the run-length counter. Hence, they proposed the Position List Word Aligned Hybrid (PLWAH) compression, which uses those unused bits to hold the position list of set/unset bits that follow a zero or one run. Thus, if a tail following a fill differs only by a few bits, the fill word can encode the difference between the tail and fill. The size of PLWAH bitmaps is often half that required for WAH bitmaps, and PLWAH was found to be faster than WAH. While compression algorithms based on run-length encoding perform better on sparse bitmaps, where most of the bits are 0's, they are not as effective on dense bitmaps, where the bits are a mixture of 0's and 1's. In addition, they have slow random access and cannot skip sections of the bitmap. Therefore, [3] developed a hybrid compression technique called Roaring bitmaps that uses packed arrays and uncompressed bitmaps in a two-level index. It partitions the data into sparse and dense chunks. The dense chunks are stored using bitmaps, while the sparse chunks are stored using a packed array of 16-bit integers. Roaring supports fast random access, and in experiments where it was compared to WAH, it compressed several times better and was found to be faster. However, compared to RLE compression algorithms, Roaring bitmap had limitations regarding the compression of long compressible runs. Therefore, a third type of container was added to support such runs of consecutive values [12], making Roaring several times faster than the RLE-based WAH while also compressing better.
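These operations are available directly in the RoaringBitmap Java library [3, 12]; a minimal sketch with illustrative values:

import org.roaringbitmap.RoaringBitmap;

public class RoaringExample {
    public static void main(String[] args) {
        // Row numbers (bit positions) in which two keys occur
        RoaringBitmap tag1 = RoaringBitmap.bitmapOf(0, 20);
        RoaringBitmap tag2 = RoaringBitmap.bitmapOf(9, 20);
        // Hardware-assisted logical operations on the compressed bitmaps
        RoaringBitmap either = RoaringBitmap.or(tag1, tag2);   // {0, 9, 20}
        RoaringBitmap both   = RoaringBitmap.and(tag1, tag2);  // {20}
        // Fast random access without decompressing the whole bitmap
        boolean hit = either.contains(9);                      // true
        // Convert amenable containers to run-length containers [12]
        either.runOptimize();
    }
}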
While bitmap indices were introduced to expedite queries in traditional centralized systems, there are challenges when applying them to Big data platforms utilizing distributed file systems (DFS). In such situations, local indices will be created on every computing node, and a mechanism is required to create and maintain a global index for expediting queries. [10, 13, 17] have proposed scalable distributed bitmap indexing frameworks. [13] proposes the Bitmap Index for Database Service (BIDS) framework for large-scale data stores. The framework utilizes an adaptive indexing technique, which uses either WAH, bit-sliced encoding or partial indexing, depending on the data characteristics, to reduce index size. Their indexing scheme favors the creation of the maximum number of indexed attributes so that a wide variety of queries can be supported. The compute nodes are organized according to the Chord protocol, and the indexes are distributed across the nodes using a load balancing mechanism. In Apache Hive [10], data is stored as logical tables; the tables themselves are stored as files distributed in HDFS, and the metadata is stored in the Hive metastore. The bitmap indices for Hive tables are stored in index tables with columns containing the indexed column, the name of the block storing the data, the offset within the block and the bitmap index for the column values. Hive uses its metastore and both tables to process queries on its bitmap indexed tables. It uses an enhanced version of WAH to compress bitmaps and does not support the indexing of substrings from string columns. The work that most closely resembles ours is the distributed bitmap indexing framework Pilosa [17]. Pilosa uses a modified version of Roaring bitmap based on 64-bit integers and divides each bitmap index into frames, views and fragments. Pilosa runs as a cluster of one or more nodes, and there is no designated master node. The executor processes the call for all relevant slices on the local node and concurrently issues requests to process the call for slices which reside on remote nodes in the cluster. Once local and remote processing has ended, it performs any aggregation or reduction work and returns the results.

All the previous indexing frameworks use a fixed compression algorithm, but a flexible framework where the compression algorithm can be substituted is desirable, as better compression algorithms are developed and released. Also, all of them lock users to their specific implementation to store, distribute and retrieve bitmap indices in a distributed setting. However, a framework where users can use any key-value store allows greater flexibility, as state-of-the-art key-value stores can be utilized. For example, HBase and Hive are regularly used together on the same Hadoop cluster, and Hive provides storage handlers that allow Hive statements to access HBase tables.

4 SYSTEM
In this section, we present our bitmap indexing framework. We will look at how the index creation process takes place for datasets in Hive, and how the bitmap indices are used during query processing. Finally, we discuss how the indexing framework can be used with Hive.

4.1 System Architecture
Hive [10] is a data warehouse solution running on Hadoop [8] that allows users to use the query language HiveQL to write, read and manage datasets in distributed storage structures. It supports the Orc file format as one of its underlying storage formats. The system architecture for Hive and our indexing framework is shown in Figure 2. HiveQL queries are submitted to Hive through the Hive clients. The queries are received by the driver; then the compiler is used to parse, type check and semantically analyze the queries with schema information in the metastore. An execution plan is created for the query, and an optimizer optimizes the execution plan using techniques like column pruning, predicate pushdown (PPD) and pipelining. Finally, the executor executes the optimized execution plan as jobs on Hadoop. Hive supports query execution via three execution engines (MapReduce, Tez and Spark). The job scheduling and resource management tasks are handled by YARN [23]. If Orc use is enabled, all the engines create jobs that use the Orc reader/writer to read and write files in HDFS. As aggregate based indices are prone to false positives, we added the bitmap indexing framework to the Orc reader/writer to support more accurate indices. The framework is agnostic to the execution engine. Users can use our implementation, which searches for hashtags to generate <key, bitmap> values, uses the state-of-the-art Roaring bitmap for compression, and uses HBase as its key-value store. Alternatively, users can easily replace the default implementation of the indexing framework with their desired bitmap compression algorithm, key-value store and functions to search for keys to index by uploading a custom Jar file to Hive's list of working Jar files.

[Figure 2: Orc Bitmap Indexing Framework — Hive clients (CLI, JDBC/ODBC); Hive server with driver (compiler, optimizer and executor) and metastore; execution engines (MapReduce, Tez, Spark); YARN resource management; and an Orc reader/writer extended with the bitmap indexing framework, a pluggable bitmap compression algorithm (e.g., Roaring) and a pluggable key-value store (e.g., HBase), on top of HDFS.]

As bitmap indices are key-value pairs where the key is the search key to be indexed and the value is its bitmap representation, the indexing framework uses a key-value store for persistent storage of the indices of Orc files. For our implementation, HBase [9] was chosen as the key-value store due to its low input/output latency and interoperability with Hadoop, but other key-value stores can easily be swapped in. When Hive stores datasets as Orc files, the indexing framework uses functions defined by the
user to find keys, generates state-of-the-art Roaring bitmap indices and stores them in HBase. For Hive queries that read from tables stored as Orc files, the indexing framework uses the search predicate passed during predicate pushdown to retrieve and process relevant bitmap indices from HBase. After processing the bitmaps, the indexing framework allows Hive to skip irrelevant stripes and rowgroups. Note that our indexing framework only controls which stripes and rowgroups are accessed; the query processing itself is done entirely by Hive's own execution engine.

4.2 Bitmap Index Creation
As Orc files are composed of stripes and each stripe is composed of rowgroups, a bit in a bitmap represents the presence or absence of a key in a tuple at a certain row number in a rowgroup within a stripe. The stripe number and rowgroup number can be determined from a tuple's row number, provided that the maximum number of rows that can fit into a rowgroup is known (default for Orc: 10,000) and the maximum number of rowgroups per stripe (mrgps) is consistent across the Orc file. However, by default Orc uses a stripe size of 64MB, and depending on the nature of the dataset, the number of rowgroups across stripes is unlikely to be consistent. In order to ensure consistency across all stripes, ghost rowgroups can be added to stripes that contain fewer rowgroups than the maximum number of rowgroups per stripe. Ghost rowgroups do not exist in the Orc files but are added only during the bitmap index creation process. Once the number of rowgroups across the stripes has been made consistent, the maximum rowgroups per stripe (mrgps), the rows per rowgroup (rprg) and the row number of a particular tuple (rn) can be used in the integer divisions in equation (1) to determine the containing stripe number (str) and rowgroup number (rg).

    str = rn / (mrgps * rprg)
    rg  = (rn mod (mrgps * rprg)) / rprg        (1)

The approach is similar to the Hakan factor [7] for bitmap indices used by the Oracle DBMS, where the Hakan factor refers to the maximal number of rows a data block can store and is used to determine the data block which contains a row.
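The following small sketch applies equation (1), using the toy sizes from the example in Figure 3 (mrgps = 3, rprg = 2) rather than Orc's defaults:

public class StripeRowgroupLocator {
    // Equation (1): map a (padded) row number to its stripe and rowgroup.
    static int stripe(int rn, int mrgps, int rprg) {
        return rn / (mrgps * rprg);
    }
    static int rowgroup(int rn, int mrgps, int rprg) {
        return (rn % (mrgps * rprg)) / rprg;
    }

    public static void main(String[] args) {
        int mrgps = 3, rprg = 2;   // sizes from the example in Figure 3
        int rn = 20;               // padded row number of "twt16 #tag1 #tag2"
        System.out.println(stripe(rn, mrgps, rprg));   // 3 -> stripe str3
        System.out.println(rowgroup(rn, mrgps, rprg)); // 1 -> second rowgroup
    }
}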
Algorithm 1: Bitmap Index Creation
input : shard shard for mapper, col column to be indexed, udf user-defined function to find keys
output : kBms list of all keys and their bitmaps

Algorithm createIndex(shard, col, udf)
 1  uKeys = List()                  /* unique keys */
 2  rnR = bitmap()                  /* rownumbers in bitmap */
 3  prnR = bitmap()                 /* padded rownumbers in bitmap */
 4  lstR = List()                   /* list of bitmaps */
 5  kBms = List()                   /* list of keys and bitmaps */
 6  for i ← 1 to shard.size do
        /* default writing process of Orc */
 7      createRowNrBitmap(i, shard.get(col, i))
 8  addGhostRowgroups()
 9  createBitmapIndex()
10  return kBms

Procedure createRowNrBitmap(rowNr, col)
 1  nr = 0
 2  bm = new bitmap()
 3  keys = udf(col)
 4  if keys != null then
 5      foreach key in keys do
 6          if uKeys.exists(key) then nr = uKeys.get(key)
 7          else nr = uKeys.add(key).getSize()
 8          bm.add(nr)
 9      lstR.add(bm)
10      rnR.add(rowNr)

Procedure addGhostRowgroups()
 1  rownr = 0
 2  for j ← 1 to shard.size do
 3      if rnR.contains(j) then prnR.add(rownr++)
 4      if isStpBnd(j) then rownr += addPadding(j, mrgps, rprg)

Procedure createBitmapIndex()
 1  for k ← 1 to prnR.size do
 2      setBits = lstR.getAt(k).toArray()
 3      foreach setbit in setBits do
 4          key = uKeys.get(setbit)
 5          if kBms.exists(key) then
 6              kBms.add(key, kBms.get(key).add(k))
 7          else kBms.add(key, new bitmap(k))

The bitmap index creation process documented in Algorithm 1 takes place concurrently with the process of writing datasets into a Hive table as Orc files. When a dataset is being written into a Hive table, several mappers are run, and each mapper processes a shard of the dataset; the shard size ranges between 50 and 1000 MB under default MapReduce and Hive settings. Our algorithm takes as input a shard being processed by the mapper, the column name to be indexed and the user-defined function (UDF) set by the user to find keys within the column. In lines 6-7, as the columns of the dataset are being written to an Orc file, createRowNrBitmap is executed on the column to be indexed. In line 3 of createRowNrBitmap, the UDF set by the user is used to find keys in the column. Then, in lines 5-8, each unique key is stored in a list, and a roaring bitmap is created to identify the positions of the keys in the unique list. Finally, the roaring bitmap identifier of the keys is added to the list lstR, and the row number of the column containing the keys is added to the roaring bitmap rnR. The default value of 10,000 is used for rows per rowgroup (rprg), and the maximum rowgroups per stripe (mrgps) is determined at the end of the default writing process
of Orc. After all the columns of the shard have been written, ghost rowgroups are added in addGhostRowgroups to ensure that the number of rowgroups per stripe is consistent. In lines 2-4 of addGhostRowgroups, the row numbers from the roaring bitmap rnR are moved to the padded roaring bitmap prnR. addPadding is used to calculate the padding at the stripe boundaries for stripes that contain fewer rowgroups than mrgps. Next, in createBitmapIndex, all the keys and the row numbers found are converted to key-bitmap pairs and stored in the list kBms, which is finally returned by createIndex.

Filling a Hive table with data ends with Orc files being created for the shards processed by each mapper. Next, the keys identified by the user-defined function and the bitmaps associated with each key are stored in an HBase table. The index keys are used as HBase keys, and the bitmaps for the keys are stored as values using column identifiers. Using a column identifier helps identify on which worker node the Orc file resides and with which Orc file the bitmap is associated. Its use allows bitmaps for keys from other Orc files to be stored in the same HBase table. The maximum number of rowgroups per stripe (mrgps) for an Orc file is also stored in HBase using column identifiers.
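A minimal sketch of how such a store step might look, assuming the default setup (Roaring for compression, HBase as the key-value store); the column family name "bm" is ours, while the table name Tweets-tweet and the WorkerNode1_0001 column identifier follow the example in Figures 3 and 4:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.roaringbitmap.RoaringBitmap;

public class BitmapStore {
    // Store one <key, bitmap> pair under the given column identifier.
    static void storeKeyBitmap(Connection conn, String key,
                               String columnId, RoaringBitmap bitmap)
            throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        bitmap.runOptimize();                        // favor run containers
        bitmap.serialize(new DataOutputStream(bos)); // portable Roaring format
        try (Table table = conn.getTable(TableName.valueOf("Tweets-tweet"))) {
            Put put = new Put(Bytes.toBytes(key));   // index key as HBase row key
            put.addColumn(Bytes.toBytes("bm"),       // one family for bitmaps
                    Bytes.toBytes(columnId),         // e.g. "WorkerNode1_0001"
                    bos.toByteArray());
            table.put(put);
        }
    }
}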
                                                                                                   to improve parallelism, Orc file footers are read to provide each
    Figure 3 shows an example of the bitmap index creation pro-
                                                                                                   mapper a split of the stripes to process. However, if the average
cess. In Figure 3a, the sample input dataset consists of two columns
                                                                                                   file size is less than the default HDFS block size, each Orc file
(rownr, tweet) and 18 tuples. Figure 3b, shows the dataset stored
                                                                                                   will be treated as a split and a mapper will receive all its stripes.
as an Orc file in HDFS as two HDFS blocks {blk0, blk1}, contain-
                                                                                                   Hive calls the former strategy ETL, while the latter is called BI.
ing four stripes {str0, str1, str2, str3} and a total of 9 rowgroups.
                                                                                                   As our indexing framework creates bitmaps at the file level, for
In order to create a file level bitmap index, the maximum num-
                                                                                                   the BI strategy, each mapper is processing the entire Orc file
ber of rowgroups per stripe across the Orc file is determined
                                                                                                   and bitmaps also cover the entire Orc file. However, for the ETL
by reading the Orc file footer. For the sample dataset there are
                                                                                                   strategy mappers are processing only a portion of the stripe from
4 stripes and stripe str2 with its 3 rowgroups has the maximum
                                                                                                   Orc files. Therefore, parts of the bitmap covering non-relevant
number of rowgroups. Therefore, ghost rowgroups are added to
                                                                                                   stripes need to be removed.
stripes str0, str1 and str3 to make them consistent with stripe str2.
                                                                                                       The bitmap index usage is described in Algorithm 2. As in-
In Figure 3c, we see the bitmap representation for the hashtags
                                                                                                   put the algorithm takes the predicates to search for in the form
#tag1 and #tag2 where the shaded portions represent the ghost
                                                                                                   of an abstract syntax tree (AST), the maximum rowgroups per
rowgroups. Finally, the keys, their bitmaps , the column identifier,
                                                                                                   stripe, rows per rowgroup and information about the stripes
and the maximum rowgroups per stripe is stored in an HBase
                                                                                                   being processed by the mapper. In line 1, the search predicate
table as shown in Figure 3d.
                                                                                                   AST is traversed using ProcessPredAST, the procedure executes
                                                                                                   recursively pulling bitmaps for each predicate using the column
4.3 Bitmap Index Processing
Based on the queries submitted to Hive, it can use our indexing framework to retrieve and process bitmap indices stored in a key-value store to prevent access to irrelevant stripes and rowgroups from the underlying Orc files. HiveQL queries run by users on Hive are translated to MapReduce jobs [4], and each mapper will process a split of the Orc files. To improve performance, Hive employs a hybrid splitting strategy to process Orc files. If the number of Orc files is less than the expected number of mappers, Orc file footers are read to provide each mapper a split of the stripes to process, improving parallelism. However, if the average file size is less than the default HDFS block size, each Orc file will be treated as a split, and a mapper will receive all its stripes. Hive calls the former strategy ETL, while the latter is called BI. As our indexing framework creates bitmaps at the file level, under the BI strategy each mapper processes the entire Orc file and the bitmaps also cover the entire Orc file. However, under the ETL strategy mappers process only a portion of the stripes from Orc files. Therefore, the parts of the bitmap covering non-relevant stripes need to be removed.

Algorithm 2: Bitmap Index Processing
input : ast search predicates in abstract syntax tree, mrgps maximum rowgroups per stripe, rprg rows per rowgroup, stripes list of stripes being processed
output : strrg stripes and rowgroups to read

Algorithm useIndex(ast, mrgps, rprg, stripes)
1  rbm = ProcessPredAST(ast, stripes.start, stripes.end);
2  foreach setBit in rbm do
3      rownr = getRowNr(setBit);
4      strrg.add(getStripe(rownr), getRg(rownr));
5  return strrg;

The bitmap index usage is described in Algorithm 2. As input, the algorithm takes the predicates to search for in the form of an abstract syntax tree (AST), the maximum rowgroups per stripe, the rows per rowgroup and information about the stripes being processed by the mapper. In line 1, the search predicate AST is traversed using ProcessPredAST; the procedure executes recursively, pulling bitmaps for each predicate using the column identifier from HBase. Next, the bitmaps are sliced using the stripe start/end information if Hive is processing the query under ETL mode; under BI mode no slicing occurs, as the mapper is processing all stripes of an Orc file. Then, logical operators (LIKE,
OR, AND or XOR) between predicates are applied to the bitmaps to retrieve the result bitmap rbm. Finally, in lines 2-4, for all set bits in the result bitmap, the row numbers representing the set bits are used in getStripe and getRg to determine the stripes and rowgroups to read, which are stored in the list strrg and returned in line 5.
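A sketch of this slice-and-combine step with RoaringBitmap; the row-number range is illustrative (stripe0 and stripe1 cover padded rows [0, 12) when mrgps = 3 and rprg = 2), while a real task would derive it from stripes.start/end:

import org.roaringbitmap.RoaringBitmap;

public class PredicateBitmaps {
    public static void main(String[] args) {
        // Bitmaps for two predicates, as retrieved from the key-value store
        RoaringBitmap tag1 = RoaringBitmap.bitmapOf(0, 20);
        RoaringBitmap tag2 = RoaringBitmap.bitmapOf(9, 20);
        // ETL mode: mask covering only the padded rows of the assigned stripes
        RoaringBitmap slice = new RoaringBitmap();
        slice.add(0L, 12L);                       // [start, end) row-number mask
        // OR between the two LIKE predicates, restricted to the slice
        RoaringBitmap rbm = RoaringBitmap.or(tag1, tag2);
        rbm.and(slice);                           // {0, 9}
        // each set bit is then mapped to (stripe, rowgroup) via equation (1)
    }
}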
    Figure 4 provides an example of how the indexing processing                                          a new Hive installation, the default Orc reader and writer pack-
framework enables Hive to skip irrelevant blocks during query                                            ages in Hive will have to be replaced with ones containing our
execution if bitmap indexing is enabled. The example shows a cluster containing Hadoop, Hive and HBase with N worker nodes. Tasks are executed by the worker nodes in the cluster to answer queries. Hive contains a table Tweet stored in N Orc files, and the HBase table (Tweets-tweet) contains the indexed keys, their bitmaps and the maximum number of rowgroups per stripe for each Orc file on the worker nodes. A select query with two predicates is executed on the tweet table in Hive. Each task processing the query receives the predicates and passes them to the underlying storage structure. Hive runs using its ETL strategy, so each task processes only a portion of the stripes. One of the tasks runs in the executor of WorkerNode1 and processes stripe0 and stripe1 of the Orc file 0001.orc from DataNode1. The task uses the keys (#tag1 and #tag2) from the predicates and the column identifier WorkerNode1_0001 to retrieve the bitmap values for the keys from HBase. As the task is not processing all the stripes within an Orc file, the bitmaps retrieved from the key-value store are sliced to cover only the stripes being processed. Next, any logical operations between the keys are applied to their bitmaps to obtain the result bitmap. Finally, the stripes and rowgroups to read are determined by applying equation (1) to the set bits of the result bitmap, allowing irrelevant stripes and rowgroups to be skipped.
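   The slicing and the bit-to-rowgroup translation can be illustrated with a short sketch. The code below is our illustration, not the framework's source: it assumes that a set bit at position p in a file's bitmap addresses stripe p / maxRgPerStripe and rowgroup p % maxRgPerStripe, which is one plausible concrete form of the mapping that equation (1) describes, and it uses the same Roaring bitmap library the framework employs.

import org.roaringbitmap.RoaringBitmap;

// Hypothetical sketch (names are ours): slice a file-level bitmap down to
// the stripes a task processes, then map each set bit to a stripe/rowgroup.
public class BitmapSlicer {

    /** Restrict a file-level bitmap to stripes [firstStripe, lastStripe]. */
    static RoaringBitmap slice(RoaringBitmap fileBitmap, int maxRgPerStripe,
                               int firstStripe, int lastStripe) {
        RoaringBitmap range = new RoaringBitmap();
        // bit positions covered by the task's stripes, end exclusive
        range.add((long) firstStripe * maxRgPerStripe,
                  (long) (lastStripe + 1) * maxRgPerStripe);
        return RoaringBitmap.and(fileBitmap, range);
    }

    /** Translate each set bit of the result bitmap into a read target. */
    static void emitTargets(RoaringBitmap result, int maxRgPerStripe) {
        result.forEach((int pos) -> {
            int stripe   = pos / maxRgPerStripe;
            int rowgroup = pos % maxRgPerStripe;
            System.out.printf("read stripe %d, rowgroup %d%n", stripe, rowgroup);
        });
    }

    public static void main(String[] args) {
        // #tag1 occurs in rowgroups 0 and 5; #tag2 in rowgroups 5 and 9
        RoaringBitmap tag1 = RoaringBitmap.bitmapOf(0, 5);
        RoaringBitmap tag2 = RoaringBitmap.bitmapOf(5, 9);
        // OR the key bitmaps, then keep only stripes 0-1 (4 rowgroups each)
        RoaringBitmap result = slice(RoaringBitmap.or(tag1, tag2), 4, 0, 1);
        emitTargets(result, 4); // prints stripe 0/rowgroup 0, stripe 1/rowgroup 1
    }
}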

4.4    Hive Integration
The indexing framework is made publicly available1 under the Apache License 2.02. The index creation and processing functionality are integrated into the WriterImpl and RecordReaderImpl classes of Hive's Orc writer and reader components. During data insertion into a Hive table that uses the Orc file format for storage, Orc files storing the table data are created across the cluster. If bitmap indexing has been enabled, the bitmap index creation process of our framework discussed in 4.2 hooks into the indexing functionality. As Hive allows users to write and upload their own Jar files, the bitmap indexing framework can be uploaded alongside the Hive Jar to its list of working Jar files, and the Orc reader and writer can reference it.
   The interface of our bitmap indexing framework is provided in Listing 1. Users can use our implementation or plug in their own. In line 3, the function findKeys is used by the framework to find indexable keys in non-numeric columns; our implementation returns hashtags as keys. In line 5, the boolean function isProcessable determines whether a predicate can be processed by the framework; if it cannot, Hive's default query processing is run. In line 7, the function createBitmap lets users decide which bitmap compression algorithm is used to index their data and also implements Algorithm 1; our implementation uses Roaring bitmaps for compression. Finally, in lines 9 and 11, the functions storeKeyBitmap and getKeyBitmap store the key-bitmap pairs in a key-value store and return the byte value of a bitmap for a particular key; our implementation uses HBase as the key-value store. To replace the default implementation, users need to override our implementation, rebuild the indexing framework and deploy the Jar file to Hive's working list of Jar files.

Listing 1: Interface for Indexing framework
 1 public interface IBitmapIndexingFramework {
 2     /* find indexable keys in column fields */
 3     String[] findKeys(String column);
 4     /* determine if search predicate is usable by framework */
 5     boolean isProcessable(String ast);
 6     /* create bitmap index from rownumber and column */
 7     boolean createBitmap(int rowNr, String column);
 8     /* store all key-bitmap pairs in key-value store */
 9     boolean storeKeyBitmap(String[] args);
10     /* get bitmap index for a single key */
11     byte[] getKeyBitmap(String[] args);
12 }

1 https://github.com/lawansubba/lbif
2 https://www.apache.org/licenses/LICENSE-2.0
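   To indicate how a plugged-in implementation might look, the sketch below covers the first three methods of Listing 1. It is a minimal illustration rather than the published implementation: the hashtag pattern, the in-memory map standing in for HBase, and the row-number-to-bit mapping are all our assumptions.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.roaringbitmap.RoaringBitmap;

// Hypothetical partial implementation of the interface in Listing 1;
// a deployed version would also implement storeKeyBitmap/getKeyBitmap
// against the configured key-value store.
public class HashtagRoaringIndex /* implements IBitmapIndexingFramework */ {

    private static final Pattern HASHTAG = Pattern.compile("#\\w+");
    // stands in for the HBase table used by the real framework
    private final Map<String, RoaringBitmap> index = new HashMap<>();

    /* find indexable keys (hashtags) in a column value */
    public String[] findKeys(String column) {
        List<String> keys = new ArrayList<>();
        Matcher m = HASHTAG.matcher(column);
        while (m.find()) keys.add(m.group());
        return keys.toArray(new String[0]);
    }

    /* this sketch only claims LIKE predicates containing a hashtag */
    public boolean isProcessable(String ast) {
        return ast != null && ast.contains("like") && ast.contains("#");
    }

    /* set one bit per (row, key); a real implementation would set the bit
       for the rowgroup derived from rowNr, per equation (1) */
    public boolean createBitmap(int rowNr, String column) {
        for (String key : findKeys(column))
            index.computeIfAbsent(key, k -> new RoaringBitmap()).add(rowNr);
        return true;
    }
}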
   Listing 2 shows how users can work with our indexing framework from the Hive console. The statement in line 2 creates a Hive table with two columns, with Orc as the underlying storage structure. In lines 3-5, a flag is set enabling bitmap indexing, the Hive table and the column to index are declared, and the bitmap indexing implementation of Listing 1 to use is declared. An insert statement like the one in line 6 then fills the Orc based table, while our indexing framework uses the configured bitmap indexing implementation to find keys and create <key, bitmap> pairs, which are stored in the predetermined key-value store. How the bitmap indices can be used is shown in lines 8-11. Lines 8-10 enable predicate pushdown, index-based filtering and the bitmap indexing functionality. Lastly, a select query like the one in line 11 uses the search key #tag1 in Algorithm 2 to return only relevant results. These settings can also be defined in Hive's configuration file so that users do not have to specify them every time.

Listing 2: HiveQL for Bitmap Index creation/use
 1 /* bitmap index creation */
 2 CREATE TABLE tblOrc (id INT, tweet VARCHAR) STORED AS ORC;
 3 SET hive.optimize.bitmapindex = true;
 4 SET hive.optimize.bitmapindex.format = tblOrc/tweet/;
 5 SET hive.optimize.bitmapindex.framework = 'com.BIFramework';
 6 INSERT INTO tblOrc SELECT id, tweet FROM tblCSV;
 7 /* bitmap index usage */
 8 SET hive.optimize.ppd = true;
 9 SET hive.optimize.index.filter = true;
10 SET hive.optimize.bitmapindex = true;
11 SELECT * FROM tblOrc WHERE tweet LIKE '%#tag%';
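   The same session can also be driven programmatically. The following sketch is ours, not part of the framework: the host, port, credentials and table name are assumptions, and the standard Hive JDBC driver is used to issue the SET statements from Listing 2 before running a query.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hypothetical usage sketch against a HiveServer2 instance; the SET
// properties are the ones introduced in Listing 2.
public class BitmapIndexQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement st = con.createStatement()) {
            // enable predicate pushdown, index filtering and bitmap indexing
            st.execute("SET hive.optimize.ppd=true");
            st.execute("SET hive.optimize.index.filter=true");
            st.execute("SET hive.optimize.bitmapindex=true");
            try (ResultSet rs = st.executeQuery(
                     "SELECT * FROM tblOrc WHERE tweet LIKE '%#tag1%'")) {
                while (rs.next()) System.out.println(rs.getString("tweet"));
            }
        }
    }
}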

5    EVALUATION
The indexing framework is integrated into the Orc reader and writer components of Hive 2.2.0 and installed in a fully distributed cluster on Microsoft Azure with one node acting as master and seven nodes as slaves. All nodes are in the East US region and run Ubuntu with 4 vCPUs, 8 GB memory and a 192 GB SSD. HDFS 2.7.4 is used as the distributed file system and HBase 1.3.1 for persistent storage of the key-value pairs. Details about the datasets used in our experiments are provided in Table 1. All three datasets contain tweets collected from the Twitter API for different months in 2013. The schema of the datasets contains 13 attributes [tweetYear, tweetNr, userIdNr, username, userId, latitude, longitude, tweetSource, reTweetUserIdNr, reTweetUserId, reTweetNr, tweetTimeStamp, tweet]. The sizes of the datasets are 55, 110 and 220 GB, respectively, and the size of a dataset determines its number of tuples, its total number of hashtags and its number of unique hashtags. The number of Orc files a dataset is stored in is determined by MapReduce configurations such as the number of mappers, the number of cores available for processing and the amount of RAM available to each mapper. The three datasets are stored as 66, 128 and 224 separate Orc files, respectively, across the cluster, each file containing a different number of stripes and rowgroups.
   If a query returns a significant portion of the dataset, at a certain threshold the indexed scan becomes just as or more time consuming than a full scan of the dataset. For any indexing system, it is therefore important to investigate when this threshold is reached. The three datasets are stored in Hive as Orc based tables, and their indices are stored in HBase. To investigate the threshold, the indices for each dataset are analyzed to find hashtags for queries that access tuples in a geometric sequence (1, 2, 4, 8, ...) until the maximum sequence number is found. If no hashtag accesses exactly the number of tuples for a sequence number, the hashtag accessing the closest higher number of tuples is used; a sketch of this selection procedure is shown below. The discovered hashtags are used in LIKE queries to record execution times on the Hive tables both in the default mode and using bitmap indices. Next, the very same hashtags are OR'd together successively in queries to determine execution times for OR-LIKE queries. Lastly, hashtags are discovered and used in self JOIN queries. There are not enough common hashtags between tweets to perform AND operations and test for the threshold, so AND operations have been excluded from the experiments. Each query was run a total of five times, and the median value was taken as the execution time. Hive runs queries on all datasets using the ETL strategy. Note that the experiments report the execution times, the stripes and rowgroups accessed by the LIKE, OR-LIKE and JOIN queries, and the number of matching tuples accessed before the group by operation is performed.
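   The geometric-sequence selection can be made concrete with a small sketch. The code below is our illustration rather than the paper's tooling; the class and variable names are ours, and the frequency map stands in for hashtag counts extracted from the stored indices.

import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the selection described above: for each sequence
// number 1, 2, 4, ... pick the hashtag accessing that many tuples, or the
// one accessing the closest higher number of tuples.
public class GeometricHashtagPicker {
    public static void main(String[] args) {
        // tuple counts per hashtag, e.g. extracted from the HBase index
        Map<String, Integer> freq = Map.of(
            "#a", 1, "#b", 3, "#c", 5, "#d", 17, "#e", 40);

        // invert to a count -> hashtag map, sorted by count
        TreeMap<Integer, String> byCount = new TreeMap<>();
        freq.forEach((tag, n) -> byCount.putIfAbsent(n, tag));

        int max = byCount.lastKey();
        for (int target = 1; target <= max; target *= 2) {
            // ceilingEntry gives the entry with the smallest count >= target
            Map.Entry<Integer, String> e = byCount.ceilingEntry(target);
            System.out.printf("target %d -> %s (%d tuples)%n",
                              target, e.getValue(), e.getKey());
        }
    }
}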
   The three types of queries used in our experiments are shown below.

LIKE: SELECT tweetSource, COUNT(*) as Cnt
FROM TableName
WHERE tweet LIKE '%hashtag1%'
GROUP BY tweetSource;

OR-LIKE: SELECT tweetSource, COUNT(*) as Cnt
FROM TableName
WHERE (tweet LIKE '%hashtag1%'
OR tweet LIKE '%hashtag2%' OR ...)
GROUP BY tweetSource;

JOIN: SELECT t1.tweetSource, COUNT(*) as Cnt
FROM TableName AS t1 JOIN TableName AS t2
ON (t1.tweetNr = t2.reTweetNr)
WHERE t1.tweetNr != -1
AND (t1.tweet LIKE '%hashtag1%')
AND (t2.tweet LIKE '%hashtag1%')
GROUP BY t1.tweetSource;
Table 1: Dataset details

Dataset     Tuples        Total Hashtags   Unique Hashtags   Orc Files   Stripes   Rowgroups
Tweets55    192,665,259   32,534,370       5,363,727         66          285       19,360
Tweets110   381,478,160   62,281,496       9,063,962         128         624       38,351
Tweets220   765,196,395   126,603,736      16,149,621        224         1,342     76,918

[Figure 5: Query execution times and stripes/rowgroups accessed by LIKE and OR-LIKE queries on Tweets220. (a) Execution times in seconds for LIKE and OR-LIKE queries, default mode vs. bitmap indices, against the number of tuples accessed; (b) percentage of stripes and rowgroups accessed by LIKE queries; (c) percentage of stripes and rowgroups accessed by OR-LIKE queries.]

   Figure 5a shows the execution times for the LIKE and OR-LIKE queries on the largest dataset, Tweets220, using both the default mode and bitmap indices. The results for the other two datasets (Tweets55 and Tweets110) are similar, add no new information, and are therefore not included here. Under the default mode, all stripes and rowgroups of the dataset are read and processed. In the case of LIKE queries, a single LIKE comparison is done on the tweet column, and the tweetSource is used in the group by only if the tweet contains the hashtag. Therefore, the execution time remains nearly constant for LIKE queries accessing between 1 and 572,725 tuples. In contrast, for OR-LIKE queries an increasing number of LIKE operations are performed on the tweet, followed by an OR operation between the results. Therefore, as more LIKE conditions are added to an OR-LIKE query, its execution time increases. Compared to the default mode, we observe that if the queries are highly selective, our indexing framework can accelerate execution times for both LIKE and OR-LIKE queries. However, as more tuples are accessed, more stripes and rowgroups are accessed from the Orc files, and as a result the execution time increases.
   Figures 5b and 5c show the percentage of stripes and rowgroups accessed by the LIKE and OR-LIKE queries when bitmap indices are used. We can summarize that response times with bitmap indices are influenced more by the number of rowgroups accessed than by the number of stripes accessed. A significant portion of the queries read nearly all the stripes, but only a few queries read almost all the rowgroups, and the execution times of those queries are nearly equal to the execution time in default mode. A similar pattern is observed in Figure 5c for OR-LIKE queries when bitmap indices are used.


The last three queries access almost all the stripes and rowgroups of the dataset, and their execution times exceed those of the default implementation. In such cases, a full scan is preferable to an indexed scan.
   The results of the last experiment, involving self JOIN queries in both the default mode and with bitmap indices, are shown in Figure 6a. Similar to our previous findings, we find that query execution times can be greatly reduced for highly selective JOIN queries by using bitmap indices. The amounts of data entering the JOIN operation from the left table and the right table are greatly reduced, hence the improvement in execution times. As the queries involve self JOINs, Figure 6b shows the percentage of stripes and rowgroups accessed on either side of the join. An interesting observation is that the query accessing 132 tuples accesses significantly more stripes and rowgroups than the query accessing 322 tuples. The reason is that the hashtag used in the former query is much more common than the one in the latter and occurs throughout the Tweets220 dataset, but only 132 tuples satisfy the join condition (t1.tweetNr = t2.reTweetNr). This explains the sudden spike in execution time in Figure 6a for the query that accesses 132 tuples. However, even in this case the execution time with bitmap indices is better than the default mode, as the indexed solution is able to skip some irrelevant rowgroups.
[Figure 6: Query execution times and stripes/rowgroups accessed by JOIN queries on Tweets220. (a) Execution times in seconds for JOIN queries, default mode vs. bitmap indices, against the number of tuples accessed; (b) percentage of stripes and rowgroups accessed by JOIN queries.]

[Figure 7: Tweets datasets, their index sizes and index creation times. (a) Sizes in GB: Tweets55 is 55 in CSV and 17 as Orc, with a 1.3 HBase index and a 0.29 Roaring bitmap index; Tweets110 is 110, 36, 2.36 and 0.56; Tweets220 is 220, 70, 4.34 and 1.15. (b) Index creation times in minutes, Orc default vs. Orc + bitmap index: Tweets55 takes 3.22 vs. 18.44; Tweets110 takes 5.18 vs. 23.58; Tweets220 takes 10.64 vs. 35.83.]


   The sizes of the three datasets in CSV format, their sizes when stored as Orc based tables, the sizes of their indices when stored in HBase and the sizes of the Roaring bitmap indices are shown in Figure 7a. Compared to the CSV format, the Orc format with its encoders for the different column types reduces the storage footprint of each dataset by more than half. The sizes of the Roaring bitmap indices and of the HBase tables where they are stored are a fraction of the size of the datasets in both the CSV and Orc formats. However, the index creation process comes with an initial index building cost, as shown in Figure 7b. Compared to the default table creation process, which stores the datasets as Orc files, our indexing framework scans the datasets for hashtags, creates bitmap indices for each Orc file and stores them in HBase, resulting in a 4 to 6 times more expensive table creation process.

6    CONCLUSION
In this paper, a lightweight, flexible and open source bitmap indexing framework is proposed to efficiently index and search for keys in big data. The framework provides a function to search for hashtags, uses Roaring bitmaps for compression and uses HBase for storing the key-value pairs; all three components can easily be swapped for other alternatives. The indexing framework was integrated into Hive and tested on a Hadoop, Hive and HBase cluster. Experiments on the cluster using three datasets of different sizes containing tweets demonstrate that execution times can be significantly accelerated for queries of high selectivity.

ACKNOWLEDGEMENTS
This research has been funded by the European Commission through the Erasmus Mundus Joint Doctorate "Information Technologies for Business Intelligence Doctoral College" (IT4BI-DC) and Aalborg University. All experiments were performed on Microsoft Azure using a sponsorship granted by Microsoft.

REFERENCES
[1] Apache Avro. 2018. Apache Avro Home. (2018). Retrieved 2018-10-20 from https://avro.apache.org/
[2] Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (1970), 422–426.
[3] Samy Chambi, Daniel Lemire, Owen Kaser, and Robert Godin. 2016. Better bitmap performance with Roaring bitmaps. Software: Practice and Experience 46, 5 (2016), 709–719.
[4] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107–113.
[5] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. 2007. Dynamo: Amazon's highly available key-value store. In ACM SIGOPS Operating Systems Review, Vol. 41. ACM, 205–220.
[6] François Deliège and Torben Bach Pedersen. 2010. Position list word aligned hybrid: optimizing space and performance for compressed bitmaps. In Proceedings of the 13th International Conference on Extending Database Technology. ACM, 228–239.
[7] Hakan Factor. 2018. Oracle's Hakan Factor. (2018). Retrieved 2018-10-20 from https://www.databasejournal.com/features/oracle/oracles-hakan-factor.html
[8] Apache Hadoop. 2018. Welcome to Apache Hadoop! (2018). Retrieved 2018-10-20 from http://hadoop.apache.org/
[9] Apache HBase. 2018. Apache HBase Home. (2018). Retrieved 2018-10-20 from https://hbase.apache.org/
[10] Apache Hive. 2018. Apache Hive TM. (2018). Retrieved 2018-10-20 from https://hive.apache.org/
[11] Riak KV. 2018. Riak KV Home. (2018). Retrieved 2018-10-20 from http://basho.com/products/riak-kv/
[12] Daniel Lemire, Gregory Ssi-Yan-Kai, and Owen Kaser. 2016. Consistently faster and smaller compressed bitmaps with Roaring. Software: Practice and Experience 46, 11 (2016), 1547–1569.
[13] Peng Lu, Sai Wu, Lidan Shou, and Kian-Lee Tan. 2013. An efficient and compact indexing scheme for large-scale data store. In Data Engineering (ICDE), 2013 IEEE 29th International Conference on. IEEE, 326–337.
[14] Memcached. 2018. memcached - a distributed memory object caching system. (2018). Retrieved 2018-10-20 from http://www.memcached.org/
[15] Apache ORC. 2018. Apache ORC, High-Performance Columnar Storage for Hadoop. (2018). Retrieved 2018-10-18 from https://orc.apache.org/
[16] Apache Parquet. 2018. Apache Parquet Home. (2018). Retrieved 2018-10-20 from https://parquet.apache.org/
[17] Pilosa. 2018. Pilosa Home. (2018). Retrieved 2018-10-20 from https://www.pilosa.com/
[18] Redis. 2018. Redis Home. (2018). Retrieved 2018-10-20 from https://redis.io/
[19] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. 2010. The Hadoop distributed file system. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, 1–10.
[20] Lefteris Sidirourgos and Martin Kersten. 2013. Column imprints: a secondary index structure. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 893–904.
[21] Kurt Stockinger and Kesheng Wu. 2007. Bitmap indices for data warehouses. In Data Warehouses and OLAP: Concepts, Architectures and Solutions. IGI Global, 157–178.
[22] Kesheng Wu, Ekow J. Otoo, and Arie Shoshani. 2006. Optimizing bitmap indices with efficient compression. ACM Transactions on Database Systems (TODS) 31, 1 (2006), 1–38.
[23] Yarn. 2018. Apache Hadoop YARN. (2018). Retrieved 2018-10-20 from https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html