Performance Evaluation of Analytical Queries on a Stand-alone and Sharded Document Store

Aarthi Raghavendra and Karen C. Davis
Electrical Engineering and Computing Systems Department, University of Cincinnati
Cincinnati, OH USA 45221-0030
1-513-556-2214
raghavai@mail.uc.edu, karen.davis@uc.edu

ABSTRACT

Numerous organizations perform data analytics using relational databases by executing data mining queries. These queries include complex joins and aggregate functions. However, due to an explosion of data in terms of volume, variety, veracity, and velocity, known as Big Data [1], many organizations such as Foursquare, Adobe, and Bosch have migrated to NoSQL databases [2] such as MongoDB [3] and Cassandra [4]. We investigate the performance impact of analytical queries on a NoSQL document store. We benchmark the performance of MongoDB [3], a cross-platform document-oriented database, in a stand-alone environment and a sharded environment. The TPC-DS benchmark [5] is used to generate data of different scales, and selected data mining queries are executed in both environments. Our experimental results show that along with the choice of environment, data modeling in MongoDB also has a significant impact on query execution times. Analytical query performance is best when data is stored in a denormalized fashion. When the data is sharded, due to the multiple query predicates in an analytical query, aggregating data from a few or all nodes proves to be an expensive process and hence performs poorly compared to executing the same query in a stand-alone environment.

CCS Concepts

• Information systems➝Database management system engines • Database query processing➝query optimization.

Keywords

document databases; MongoDB; sharded query performance; analytical queries.

2017, Copyright is with the authors. Published in the Workshop DOLAP, Proceedings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1. INTRODUCTION

Relational database systems have been the foundation for enterprise data management for over 30 years. Many organizations use a relational platform to perform data analysis by running data mining queries against a database. With an estimated growth in enterprise data to 35ZB by 2020 [6], along with growing user loads, organizations are adopting newer technologies such as NoSQL databases to store data. Among the types of NoSQL databases [7] (key-value store, column-oriented, document store, and graph), we have chosen MongoDB, a cross-platform document-oriented database, against which we execute data mining queries. It provides features such as aggregation, secondary indexing, sharding, and replication. Parker et al. [8] compare the runtime performance of MongoDB with SQL Server for a modest-sized database (3 tables, at most 12,416 tuples total) and conclude that the former performs equally well or better than SQL Server except when aggregation is utilized. However, the impact of data modeling and deployment environments on aggregation operations was not explored in detail.

In this paper, we investigate the performance of complex data mining queries against datasets of different sizes. We use a stand-alone and a distributed data organization known as sharding [9]. In a sharded database, data is split into chunks and distributed across the cluster nodes. A query run against such a system can target one, a few, or all of the nodes, and the results from the targeted nodes are aggregated and displayed to the user.

In Section 2, we outline features of MongoDB such as data modeling, indexing, sharding, and aggregation. In Section 3, we discuss the TPC-DS benchmarking standard that is used to generate datasets of varying sizes, as well as the criteria used to select data mining queries for our study. In Section 4, we describe the hardware and software configurations of the systems used to conduct the experiments. In Section 5, algorithms for migrating relational data and translating SQL queries to MongoDB are presented. Section 6 outlines the experimental procedures implemented on the stand-alone and sharded environments and discusses our findings. We conclude with a synopsis of the contributions and future work in Section 7.

2. MongoDB/BIG DATA BENCHMARKING

We give an overview of MongoDB concepts and terms, followed by a discussion of relevant benchmarking efforts to date.

2.1 MongoDB

MongoDB is a cross-platform document-oriented database classified under the aegis of the NoSQL databases. Its name is coined from the term huMONGOus, for its support of big data management. Key features of MongoDB include high performance, high availability, and automatic scaling [10]. It is schema-less, or has a flexible schema: unlike SQL, where the structure of the data is defined prior to loading it into the database, MongoDB does not enforce any rigid structure. The flexible structure is achieved by storing data in BSON format [11], which is a binary-encoded serialization of JSON-like [13] key-value pairs.

A document is composed of key-value pairs and is the basic unit of data in MongoDB. The value of a field can be another document, an array, or an array of documents. A group of documents is called a collection. Since documents do not dictate a specific format, a collection can have documents with varying numbers of fields and types of values, thereby giving it a flexible schema.

An example of a document is given in Figure 1. Every document in a collection has a unique _id field that acts as the primary key. Unless explicitly specified by the application, MongoDB uses a special 12-byte BSON type, called ObjectId, as the default value for the _id field. The ObjectId is generated from the timestamp, machine ID, process ID, and a process-local incremental counter that guarantees uniqueness [10].

{
  _id: ObjectId("5480abb8986c9d3197f6682c"),
  customer_id: 23,
  customer_address: {
    apartment_number: 26,
    street_name: "Whitfield",
    state: "CA",
    country: "United States"
  },
  customer_name: "Earl Garrison",
  birth_date: "9/25/1979",
  email_id: "earl.garrison@G3sM4P.com"
}

Figure 1: Document Structure

2.1.1 Data Modeling

An application that uses a relational database management system models data by declaring a table's structure and its relationships to other tables prior to inserting data. Similarly, data modeling in MongoDB focuses on the document structure and the relationships between data. Since MongoDB does not support joins, the two concepts that facilitate data modeling are embedding and referencing [10].

We illustrate data modeling techniques in MongoDB through an example. Consider two entities, Book and Publisher, and a one-to-many relationship between them.

Figure 2: Embedded Data Model

Figure 2 illustrates the embedded data model design for the one-to-many relationship between the Publisher and Book entities. Embedding represents a relationship by encapsulating related information in a single document or structure. It depicts a contains relationship between entities, since related pieces of information are stored in the same database record. The one-to-many relationship between a Publisher and a Book can be modeled by embedding the book data entities in the publisher data. This provides good performance for read operations, as related data can be retrieved in a single database operation. For example, an application can retrieve complete publisher information in one query, and new books published by the publisher can be added as embedded sub-documents to the books array. This ensures no repetition of the publisher details per book, thereby reducing redundant data. However, if the size of the document exceeds the 16 MB limit, the data has to be split and stored in separate documents. In such cases, the referenced data model can be adopted.

Figure 3: Referenced Data Model

Figure 3 illustrates the referenced data model design for the one-to-many relationship between the Publisher and Book entities. References represent relationships between data by associating or referencing one document from another. This is achieved by storing the _id of one document as the value of a field in the other document. The one-to-many relationship between Publisher and Book can be modeled by keeping the publisher and book information in two separate collections and enforcing the relationship by storing the Publisher reference inside the Book document. In doing so, a query to retrieve complete publisher information has to make multiple requests to the server, as follow-up queries are necessary to resolve the references. However, referencing is a suitable choice for modeling entities that are independent of each other. Also, if two or more entities are related but complex, the complexity can be reduced by breaking the data down into multiple documents. Table 1 provides a comparison between the two data models.

Table 1: Embedded and Referenced Data Model Comparison

For fast and efficient data access, indexes can be created to locate data quickly and perform random lookups.

2.1.2 Indexing

An index is a special on-disk data structure that allows the database application to retrieve data faster by not scanning the entire collection. It uses keys built from one or more fields in a document. MongoDB implements indexing by storing the keys in a B-Tree data structure, which helps in finding the rows associated with the keys quickly and efficiently.
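To make the embedding and referencing designs of Section 2.1.1 concrete, here is a small illustrative sketch (not from the paper; the publisher and book values are invented) showing both layouts as plain Python dicts standing in for MongoDB documents, along with the follow-up lookup that the referenced model requires:

```python
# Illustrative sketch: the Publisher/Book one-to-many relationship,
# modeled the two ways described in Section 2.1.1. Sample values are
# hypothetical; dicts stand in for MongoDB documents.

# Embedded model: books live inside the publisher document, so one
# read retrieves the publisher and all of its books.
publisher_embedded = {
    "_id": 1,
    "name": "Example Press",
    "books": [
        {"title": "Document Stores in Practice", "year": 2013},
        {"title": "Sharding for Analysts", "year": 2015},
    ],
}

# Referenced model: books are separate documents that hold the
# publisher's _id, mirroring a foreign key.
publisher = {"_id": 1, "name": "Example Press"}
books = [
    {"_id": 10, "title": "Document Stores in Practice", "publisher_id": 1},
    {"_id": 11, "title": "Sharding for Analysts", "publisher_id": 1},
]

def books_for_publisher(pub_id, book_collection):
    """Simulates the follow-up query needed to resolve references."""
    return [b for b in book_collection if b["publisher_id"] == pub_id]
```

The embedded layout answers "publisher with all books" in one lookup, while the referenced layout needs the extra `books_for_publisher` pass — the trade-off the text describes.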
Indexes in MongoDB are defined at the collection level on any field or sub-field of the document. MongoDB supports 7 different types of indexes [10]. In our work, we use the default _id index and compound indexes. All collections are indexed on the _id field by default. A compound index is created on multiple fields of a document with a specific sort order for each field. If a collection has a compound index on PersonID and Age, the index sorts first by PersonID and then, within each PersonID value, by Age. Therefore, the order of the fields in a compound index should be declared appropriately based on the application needs.

2.1.3 Sharding

A database system deployed on a single server can experience heavy loading due to high query rates or large datasets. When the demand on the server spikes drastically, alternate means should be identified to keep the system online and robust. This scaling issue is addressed in database systems by implementing vertical scaling or horizontal scaling [15]. MongoDB implements horizontal scaling, or sharding [10]: the process of splitting data into chunks and distributing the chunks across multiple servers, known as shards. The data is partitioned at the collection level and stored on multiple shards such that all shards put together make up a single logical database.

A sharded cluster in MongoDB [10] has 3 components:

1. A shard is either a single mongod instance or a replica set [10] that stores data. The mongod is a daemon process that starts the MongoDB server and handles data requests and other background operations. A replica set is a feature of MongoDB that ensures redundancy by storing the same data on multiple servers.

2. A config server is a mongod instance that stores the metadata of the cluster. It maintains a mapping of the chunks to the shards.

3. A query router is a mongos instance that is responsible for directing read and write operations from the client application layer to the specific shard or shards. The mongos is a routing service to which the application queries are directed internally; it uses the metadata stored in the config server to locate the target shard or shards and consolidates all the information returned from the various shards before displaying it to the user.

While deploying the sharded cluster for our research, we encountered issues that affect the application and cluster performance. Most issues are caused by the number of instances of each of the components deployed in the cluster. We discuss the issues faced and the methods adopted to avoid them.

In a sharded environment, the number of instances of each of the components determines the robustness of the cluster. For read-intensive applications, having multiple query routers helps balance the application load, rather than a single query router that can be easily overloaded by a high frequency of read operations. Based on the cluster and application needs, query routers can be added to the cluster on the fly by establishing connections to the config servers and the shards.

The number of config servers and shards is of greater importance, since they perform all the application-critical operations. Deploying a cluster with multiple config servers enables data accessibility and avoids a single point of failure. Similar to config servers, the number of shards can also cause potential problems if it is not aligned with the application needs. For write-intensive applications, shards can exceed their capacity and be exhausted quickly if data is continually written to them. Therefore, the capacity of a shard should be decided before deploying the cluster, based on the amount of data to be stored. If the number of shards is too few, data resides on just a few shards, leading to exhaustion problems. Having more shards reduces the amount of data on each shard, but with too many shards, resources such as RAM and CPU cannot be used effectively. The number of shards in a cluster can be calculated based on the following factors [14]:

1. The sum of disk storage space across shards should be greater than the required storage size. For example, if the application data is 1.5TB and the disk storage available per server is 256GB, then the number of shards needed can be calculated as 1.5TB/256GB ≈ 6 shards.

2. The sum of RAM across shards should be greater than the working set of the sharded cluster. The working set is the segment of client data that is accessed most frequently by the application. For read-intensive applications, storing the entire working set in RAM results in faster read operations. If the working set memory requirement is more than the available RAM, the operating system needs to perform frequent IO operations against the disk to retrieve the information, thereby drastically slowing down the system. The working set size is the size of the frequently accessed collections and their indexes. For a working set of 200GB and server RAM of 64GB, the number of shards can be calculated as 200GB/64GB ≈ 4 shards.

Other factors include disk throughput and operations per second. We calculate the number of shards needed in our cluster based on disk storage and RAM. Since the data load is a write-intensive process, each server needs a sufficient amount of disk space to store the data that is continually written. In doing so, server resources such as CPU, memory, and disk are utilized effectively without being overloaded. Also, since we focus on analytical query performance, read operations should be optimized to achieve the best results. For fast read operations, all the collections and indexes related to the query should reside in RAM to avoid disk IO operations. For this purpose, along with disk storage, we also take the server RAM into consideration when calculating the number of shards.

Data distribution affects the application's read and write performance. If a considerable amount of data resides on a single shard, it can lead to a server crash or latency issues. Similarly, if too little data resides on each shard, the server resources are not fully utilized. In MongoDB, the distribution of data across the cluster members is determined by the shard key. A shard key [10] is either an indexed field or an indexed compound field that is present in all documents of a collection. MongoDB uses the shard key to divide a collection into small non-overlapping ranges called chunks; the default chunk size is 64 MB. MongoDB uses range-based partitioning and hash-based partitioning for dividing shard key values into chunks.

If the shard key is a numeric value, MongoDB can use range-based partitioning [10], where documents with nearby shard key values reside in the same chunk and therefore on the same shard. This makes range queries efficient, though data may not be distributed evenly across the shards.

MongoDB outlines the following strategies for selecting a shard key:

1. High cardinality: the cardinality of a shard key refers to the number of different values associated with it. A shard key with high cardinality has a low probability of creating jumbo chunks.

2. Compound shard key: if a collection does not have a field that can serve as an optimal shard key, additional fields can be added to produce a more ideal key.

3. Hashed shard key: for a shard key with high cardinality, a hashed index on the field can ensure even distribution of data across the shards.

We utilize high cardinality and compound shard keys in our implementation.

2.2 Benchmarking Big Data

Among the many definitions of big data, we adopt the definition given by Dumbill: "Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures." [1] Big data is typically characterized by the 4V properties (i.e., volume, velocity, variety, and veracity [15]). Therefore, in order to benchmark big data, a standard should be chosen that satisfies the 4V properties at least partially, if not completely. The synthetic data generator should meet the following criteria:

1. Volume refers to the ability to generate data of various scaling factors as inputs of typical workloads. The volume of data generated can range from GBs to PBs.

2. Velocity refers to data generation rates.

3. Variety refers to the support for generating diverse data types, including structured, unstructured, and semi-structured data.

4. Veracity, with respect to benchmarking, is the ability to keep the synthetic data aligned with the characteristics of real-world data [15]. The credibility of the benchmark is dependent on how well the raw data features are preserved in the generated data.

Han et al. [15] compare big data benchmarks such as TPC-DS [16], BigBench [17], and HiBench [18] in terms of data generation and benchmarking techniques. We use this paper as a reference to choose our benchmarking standard for generating datasets of various scale factors and data mining queries of varying complexities. We briefly discuss the state of the art, including terms introduced in relation to each of the 4V properties, in order to compare and categorize the existing big data benchmarks. Table 2 gives basic definitions of the terminology.

Table 2: Terms Used to Categorize Big Data Benchmarks based on 4V Properties [15]

We conduct a performance evaluation of MongoDB by executing analytical queries on datasets of two different sizes. Therefore, we need a benchmark that can generate scalable datasets and realistic data. MongoDB is an appropriate choice of database for unstructured data rather than tabular data. However, we are interested in the performance of MongoDB when tabular data is denormalized and modeled in a way that better suits MongoDB; we compare the performance of analytical queries against normalized and denormalized data models. Therefore, among the big data benchmarks [15], we choose one that satisfies a scalable volume, semi-controllable velocity, and structured variety, and that partially considers veracity. The 3 benchmarks that satisfy our needs are TPC-DS [16], BigBench [17], and BigDataBench [19]. Since we are studying the performance of analytical queries, we need the benchmark to be able to generate data that supports joins between entities and queries containing varying aggregate functions. BigBench provides a limited number of data mining queries and BigDataBench provides a dataset consisting of only two tables, whereas TPC-DS provides a dataset of 24 tables and a query set of 100 queries, most of which use aggregate functions. Therefore, we choose TPC-DS as our big data benchmark.

3. TPC-DS AND QUERY SELECTION

The underlying business model of the TPC-DS schema is a retail product supplier, and the schema follows a snowflake design [16]: a database architecture in which a central table, called the fact table, is linked to multiple other tables, called dimension tables. A fact table typically has two types of columns, foreign key columns and measures columns. The foreign key columns reference the primary keys of the dimension tables, and the measures columns hold data used for calculations and analysis.

Table 3: Query Features

Table 4: Table Details for Datasets 1GB and 5GB

The TPC-DS benchmark [5] has a total of 7 fact tables and 17 dimension tables. Among the 24 tables, the representative queries we selected utilize 3 fact tables (Store_Sales, Store_Returns, and Inventory) and a total of 9 dimension tables.
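As an aside, the shard-count arithmetic from Section 2.1.3 can be sketched in a few lines. This is an illustrative Python sketch, not the paper's code; the helper names are ours, and the inputs are the paper's own worked examples (1.5TB of data on 256GB disks, and a 200GB working set on 64GB-RAM servers):

```python
import math

def shards_for_disk(data_gb: float, disk_per_server_gb: float) -> int:
    """Shards needed so total disk across shards exceeds the data size."""
    return math.ceil(data_gb / disk_per_server_gb)

def shards_for_ram(working_set_gb: float, ram_per_server_gb: float) -> int:
    """Shards needed so total RAM across shards can hold the working set."""
    return math.ceil(working_set_gb / ram_per_server_gb)

# The two examples from Section 2.1.3:
print(shards_for_disk(1536, 256))  # 1.5TB / 256GB -> 6 shards
print(shards_for_ram(200, 64))     # 200GB / 64GB  -> 4 shards
```

In practice the larger of the two estimates would be taken, since both the disk and RAM constraints must hold at once.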
Among the 4 query dataset of specific size, the machine configurations for both the classes supported by TPC-DS we choose the data mining class. environments differ. For example, the 41.93GB sharded Among the 23 queries available in that class, we select 4 queries environment has more powerful machines than the 9.94GB sharded which meet three or more of these criteria: (1) join of 4 or more environment since it has more data. Only one stand-alone system is tables, (2) aggregation functions such as sum() and avg(), (3) group setup for both the 9.94GB and 41.93GB datasets. All the AWS by and order by clauses, (4) conditional constructs such as case, machines have the same software configuration: Red Hat and (5) correlated subquery using the from clause. Enterprise Linux 7.1 and MongoDB version 3.0.2. We select 4 queries that satisfy the criteria: Query 7, Query 21, 4.2 MongoDB Cluster Architecture Query 46, and Query 50. Table 3 summarizes the criteria met by A MongoDB sharded system consists of 3 components: shards, each query. Table 4 lists the number of records in the tables for config servers, and query routers. We determine the number of datasets of sizes 1GB and 5GB. instances of each component by taking the 9.94GB dataset as an example. The MongoDB sharded cluster test architecture [10] is 4. EXPERIMENTAL PLATFORM used as a reference for creating our sharded system. The test For setting up the stand-alone and sharded environments, we used architecture uses 1 config server, 1 query router, and 1 or more the Amazon Web Services (AWS), a cloud-computing platform shards based on the application needs. Our MongoDB sharded that provides on-demand delivery of IT resources [20]. We rented system has 1 config server and 1 query router similar to the test virtual computers through the Amazon Elastic Compute Cloud architecture. 
However, the number of shards is decided taking into (EC2) web service for application deployment and experimental consideration the data to be stored on it, which in this case is 9.94 set-up. We boot the Red Hat Enterprise Linux AMI (Amazon GB. Machine Image) to create our virtual machines [20]. AWS provides the capability of starting, stopping, and terminating instances as We use the disk storage and RAM as factors for deciding the needed, whereby active servers are charged by the hour. number of shards in the cluster. Among the two factors, RAM is given priority as it reduces random disk IO thereby improving read TPC-DS is chosen as our benchmark for generating data and performance. Therefore, a system that is capable of accommodating analytical queries. We use datasets of sizes 1GB and 5GB for data, indexes, and other running applications in the RAM is chosen. conducting our research. However, the 1GB and 5GB text data Among the available AWS machines, those with a RAM storage of when migrated to MongoDB increases to 9.94GB and 41.93GB either 4GB or 8GB best suit our application needs. A machine with respectively, an increase by a factor of nearly nine compared to the less RAM would require deploying more systems in the cluster, original dataset size. Therefore, we need machine(s) that can increasing operational costs and a machine with RAM higher than accommodate datasets with a minimum size of 10GB. Hence, an 8GB would make sharding insignificant for the 9.94GB dataset. EC2 instance is chosen such that the RAM is greater than the working set, the portion of data that is accessed often by the The RAM consumption of the operating system and other application server [10]. A RAM that fits all the indexes and applications typically does not exceed 2GB. If an AWS machine working set ensures faster processing there by reducing random with 4GB RAM is chosen, only 2GB space would be available for disk IO. 
storing data and indexes, hence requiring 5 machines (9.94GB/2GB). On the other hand, if an AWS machine with 8GB The MongoDB stand-alone environment uses the m4.4xlarge1 RAM is chosen, 6GB space would be available for storing data and instance that is capable of storing both the 9.94GB and 41.93GB indexes, thus requiring 2 machines (9.94GB/6GB). However, we datasets. The MongoDB sharded environment is a 5 node cluster use 3 machines with 8GB RAM as shards to accommodate not only where every machine/instance has the same configuration. For the data but also indexes and the intermediate and final query application deployment on the MongoDB cluster we use the collections. Therefore, the MongoDB sharded system deployed on t2.large1 instance for the 9.94GB dataset and the m4.xlarge1 AWS has 3 shards, 1 config server, and 1 query router. instance for the 41.93GB dataset. Figure 4 illustrates the organization of the sharded system. Each 4.1 Hardware/Software Configuration node in the sharded system is named after their functionality This section discusses the hardware configurations of all the AWS namely, Shard1, Shard2, Shard3, ConfigServer, and AppServer/ machines utilized for the deployment of the stand-alone and QueryRouter. Since each node has a specific functionality, the sharded environments. Table 5 illustrates the machine processes running on them differ. The 3 shards are responsible for configurations for the MongoDB sharded and stand-alone storing data and therefore run the mongod instance. The config environments. server stores the metadata of the cluster and runs a mongod Table 5: Machine Hardware Configurations instance. The query router is a mongos instance, a routing service to which the application queries are directed to. It internally makes use of the config server metadata to direct queries to the appropriate shards. 
The application server and query router are deployed on the same AWS machine because the query router is a lightweight process and does not utilize much of the server resources. The two segmented lines represent the actual flow path of query or any related read operation and the continuous line indicates the observed flow path. Two different MongoDB sharded environments are created for the 9.94GB and 41.93GB datasets. Each of the sharded environments have 5 machines. Since each sharded environment supports a 1 AWS machine nomenclature and 17 dimension tables. The records in each table correspond to the documents in the respective collections. 4. Since data in MongoDB in stored in the JSON format, the column names of the table correspond to the keys in the JSON document and the column values of the table correspond to the respective key values in the document. The pseudocode of the data migration algorithm is illustrated in Figure 5. It takes a data file as input and generates the MongoDB collection as output. The algorithm makes use of the HashMap data structure in Java [21], a tabular structure that maps keys to values. It uses hashing to generate a hashcode for every key, which determines the object location in the table. Figure 4: MongoDB Cluster Architecture 5. PERFORMANCE STUDY In this section, we describe the algorithms developed for data loading and translation of a SQL query to a Mongo query using Java as the programming language2 and we describe the experimental setup. Table 6 illustrates the different experimental setups based on the choice of dataset sizes, data models, and deployment environments. The experiments are conducted by Figure 5: Data Migration Algorithm taking into consideration the two dataset sizes, two data models, and two deployment environments. 
Table 6: Experiments

Performance of queries, or any read operation, is greatly dependent on the data model that supports the application needs. In a relational database system, data models are broadly classified into normalized and denormalized data models. A normalized data model has a higher number of foreign keys and hence a higher number of joins. In a denormalized data model, on the other hand, data is consolidated into one or a few tables, resulting in fewer foreign keys and hence fewer joins. We are interested in the query performance of MongoDB when run against both the normalized and denormalized data models.

5.1 Migrating Data into MongoDB
We develop an algorithm to migrate the TPC-DS generated data files into MongoDB in JSON format. The same algorithm is used to load datasets of sizes 1GB and 5GB, which translate to 9.94GB and 41.93GB, respectively, when loaded into MongoDB. The increase in size is due to the data storage in JSON format: each document consists of key-value pairs where the key corresponds to the table column, and each key is repeated in every document in the collection, drastically increasing the size of the dataset. The following points summarize the flow of data, starting with its creation using the TPC-DS benchmark up to its migration into MongoDB:
1. TPC-DS generates a .dat file for each table. The column values are delimited by the '|' character.
2. The databases in MongoDB for the 1GB and 5GB TPC-DS datasets are called Dataset_1GB and Dataset_5GB, respectively. Steps 3 and 4 correspond to the Dataset_1GB database; however, they work analogously for the Dataset_5GB database.
3. Each table in the TPC-DS schema corresponds to a separate collection in the Dataset_1GB database. Hence, Dataset_1GB contains a total of 24 collections, representing the 7 fact tables and 17 dimension tables.

Code is available at https://github.com/raghavai/Performance-Evaluation-of-Analytical-Queries-on-a-Stand-alone-and-Sharded-Document-Store.

5.2 Translating SQL Queries to MongoDB
The SQL queries in our study differ from one another in aspects such as the number of joined tables, aggregate functions, where clauses, from clauses, group by and order by clauses, and correlated and uncorrelated subqueries. Manually translating each of these queries into a Mongo query can be tedious and error-prone. We therefore create algorithms for translating an analytical SQL query into its equivalent Mongo query. These algorithms focus on executing the selected queries that use aggregate functions, where clauses, from clauses, group by, and order by clauses.

In order to execute analytical queries against a denormalized model, the collections in the MongoDB database must be denormalized. During data load, the TPC-DS data files are migrated into individual collections; therefore, on initial load the data is completely normalized. We develop an algorithm for creating a denormalized collection from a given fact collection and a set of dimension collections. In this approach, all the dimension collections are joined to the fact collection based on their foreign key relationships. In MongoDB, joining a dimension collection to a fact collection is equivalent to embedding the dimension collection documents in the fact collection. To understand the structure of a denormalized collection in MongoDB, consider the store_sales fact collection connected to multiple dimension collections such as time_dim, store, and item. Foreign keys such as ss_sold_date_sk, ss_item_sk, and ss_cdemo_sk are replaced by the actual documents from the corresponding dimension collections.

Figure 6 illustrates pseudocode for the algorithm to create a denormalized fact collection. It takes a fact collection and a set of dimension collections as input and outputs the denormalized fact collection. An EmbedDocuments method, shown in Figure 7, is called on every dimension collection; it embeds the dimension collection documents in the fact collection.

Figure 6: Denormalized Collection Creation Algorithm

Figure 7: Embedding Documents Algorithm

5.3 Query Translation Algorithm for Normalized Data Model
An analytical query against a normalized data model typically queries data from one or more tables. In SQL, this is accomplished through join operations. The SQL query optimizer plays a vital role in identifying an optimal query plan with low cost. For example, if a query contains join operations, where clauses, and a group by clause, the query optimizer decides the execution order of the operations that yields a query plan with low cost and execution time.

In MongoDB, query optimization is limited to the use of indexes and efficient read operations. Join operations cannot be optimized, as MongoDB does not support joins. To overcome this hurdle, we develop an algorithm that simulates join operations and executes queries on the fly. All the queries used here implement the select-from-where template; therefore, the algorithm is optimized for queries that follow this template. The algorithm does not take into consideration the details of the query predicates, aggregation operations, and sequence of join operations, but only follows a predetermined order of execution:
1. Query all the dimension collections based on their respective where clauses.
2. Perform a semi-join of the fact collection with the filtered dimension collection documents, i.e., obtain only those fact collection documents whose foreign keys are present in the filtered dimension collection documents. For example, a semi-join of the store_sales and customer_address collections would result in a collection containing only those store_sales documents whose foreign key is referenced in the customer_address collection. Store the semi-joined fact collection in an intermediate collection.
3. Embed dimension collection documents in the intermediate collection documents. To improve performance, embed only those dimension collection documents whose attributes are used in query aggregation.
4. Perform aggregation operations against the embedded intermediate collection and store the final query results in an output collection.

We illustrate pseudocode for the query translation algorithm against a normalized data model in Figure 8. It takes the fact collection(s) and dimension collections related to the query as input and outputs the final query collection. In addition to the final query collection, an intermediate collection is created that holds the semi-joined fact collection documents. The algorithm uses the ArrayList data structure in Java [23], a dynamic array that can grow and shrink as needed; it is used in the process of performing the semi-join on the fact collection. The EmbedDocuments method illustrated in Figure 7 is called on the intermediate collection and a dimension collection during the embedding process. To improve performance, only those dimension collections whose attributes are used in query aggregation are embedded. After the embedding process, the MongoDB aggregation framework is used to execute the aggregation operations in the query.

Figure 8: Query Translation Algorithm for the Normalized Data Model
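The four-step execution order above can be sketched in a few lines. The following Python mock-up runs the same filter, semi-join, embed, and aggregate steps over in-memory lists of dicts; it is only an illustrative stand-in for the paper's Java/MongoDB implementation, and the sample documents and values are invented (the column names follow TPC-DS conventions such as ss_sold_date_sk).

```python
# Illustrative sketch of the predetermined execution order in Section 5.3,
# over in-memory lists of dicts instead of MongoDB collections.
# The sample data is hypothetical.
from collections import defaultdict

date_dim = [
    {"d_date_sk": 1, "d_year": 2000},
    {"d_date_sk": 2, "d_year": 2001},
]
store_sales = [  # fact documents referencing date_dim via ss_sold_date_sk
    {"ss_sold_date_sk": 1, "ss_quantity": 5},
    {"ss_sold_date_sk": 2, "ss_quantity": 7},
    {"ss_sold_date_sk": 2, "ss_quantity": 3},
]

# Step 1: filter each dimension collection by its where clause (d_year = 2001).
filtered_dates = [d for d in date_dim if d["d_year"] == 2001]

# Step 2: semi-join -- keep only fact documents whose foreign key appears in
# the filtered dimension documents; store them in an intermediate collection.
date_keys = {d["d_date_sk"] for d in filtered_dates}
intermediate = [f for f in store_sales if f["ss_sold_date_sk"] in date_keys]

# Step 3: embed the matching dimension document in each intermediate document
# (only dimensions whose attributes the aggregation needs are embedded).
by_key = {d["d_date_sk"]: d for d in filtered_dates}
for fact in intermediate:
    fact["date_dim"] = by_key[fact["ss_sold_date_sk"]]

# Step 4: aggregate the embedded documents (SUM(ss_quantity) GROUP BY d_year).
totals = defaultdict(int)
for fact in intermediate:
    totals[fact["date_dim"]["d_year"]] += fact["ss_quantity"]

print(dict(totals))  # -> {2001: 10}
```

In MongoDB itself, step 4 would be expressed as an aggregation pipeline ($match, $group) over the embedded intermediate collection; the in-memory loops here merely make the data flow of the algorithm visible.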
6. EXPERIMENT RESULTS
Table 7 summarizes the data load times for each table for both dataset sizes. We also show the query runtimes for each of the experimental setups. Table 8 illustrates the selectivity of the queries, i.e., the proportion of data retrieved, for both the 9.94GB and 41.93GB datasets. Table 9 summarizes the query execution runtimes for each of the experiments. Every query is run 5 times on each experimental setup, and for each run the data is cached in memory. Among the 5 runtimes we obtain, Table 9 presents the best results. Hours are denoted by h, minutes by m, and seconds by s.

We analyze the results of the data load times for the 1GB and 5GB datasets below.
1. For tables having the same number of records in both datasets, the data load times are nearly identical. This can be observed for the tables catalog_page, customer_demographics, date_dim, household_demographics, income_band, ship_mode, and time_dim.
2. For tables with an unequal number of records in the two datasets, the ratio of the number of records is nearly identical to the ratio of their load times. This can be observed for the tables call_center, catalog_returns, catalog_sales, customer, customer_address, inventory, item, promotion, reason, store, store_returns, store_sales, warehouse, web_page, web_returns, web_sales, and web_site.

Table 7: Data Load Times

Figures 9 and 10 illustrate the query performance on the 9.94GB and 41.93GB datasets, taking into consideration the data models and deployment environments. We analyze the query execution runtimes below.
1. Experiments 3 and 6 have the fastest query runtimes for their respective dataset sizes. It can be observed that these two experiments correspond to the denormalized data models running on stand-alone systems. This indicates that denormalized data models outperform their normalized counterparts. Also, the algorithm implemented to execute queries against a denormalized data model handles scaling effectively, for the scales used in our experiments.
2. Among the experiments conducted against a normalized data model, stand-alone systems are observed to be faster than sharded systems. The runtimes of Queries 7, 21, and 46 for Experiments 1, 2 and 4, 5 support this statement.
3. Query 50 is observed to be faster on the sharded system. This does not by itself indicate that the sharded system is slower or faster than a stand-alone system; it depends on the type of query sent to either system. If a query includes the shard key, the mongos routes the query to a specific shard rather than broadcasting it to all the shards in the cluster, enhancing query performance, which is the case for Query 50. Thus, we can infer that Queries 7, 21, and 46, which are faster on the stand-alone system, are being broadcast on the sharded system, resulting in slower runtimes.

Figure 9: A Comparison of Query Execution Times for 9.94GB Dataset
Table 8: Query Selectivity
Table 9: Query Execution Runtimes
Figure 10: A Comparison of Query Execution Times for 41.93GB Dataset

There have been other studies that benchmark the performance of MongoDB [8, 25, 26, 27, 28]. However, the majority of them benchmark MongoDB against popular databases such as Oracle, Cassandra, HBase, and Couchbase, and none of them study the performance of MongoDB based on deployment environments, data modeling, and scalable datasets. We specifically focus on modeling relational data in a NoSQL environment, tune it in different ways, and study why one data model outperforms the other. Also, we base our conclusions on running complex analytical queries on each of the data models. Based on the results obtained from executing analytical queries on the different experimental setups, we conclude that the performance of queries on MongoDB is influenced by the data model and the deployment environment.
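The targeted-versus-broadcast routing behavior observed for Query 50 can be mimicked with a toy model. The sketch below is not MongoDB's actual routing logic: it assumes a hashed shard key and three shards, whereas a real mongos consults chunk metadata on the config servers; the field names and data are hypothetical.

```python
# Toy model of mongos query routing on a sharded collection.
# Real routing uses config-server chunk metadata, not a simple hash.
NUM_SHARDS = 3
shards = [[] for _ in range(NUM_SHARDS)]

def place(doc, shard_key):
    """Store a document on the shard chosen by hashing its shard-key value."""
    shards[hash(doc[shard_key]) % NUM_SHARDS].append(doc)

def route(query, shard_key):
    """Return the shards a query must visit: one shard if the query filters
    on the shard key (targeted), every shard otherwise (broadcast)."""
    if shard_key in query:
        return [shards[hash(query[shard_key]) % NUM_SHARDS]]
    return shards  # scatter-gather across the whole cluster

for i in range(100):
    place({"ss_item_sk": i, "ss_quantity": i % 10}, "ss_item_sk")

targeted = route({"ss_item_sk": 42}, "ss_item_sk")   # shard-key predicate
broadcast = route({"ss_quantity": 7}, "ss_item_sk")  # non-shard-key predicate

print(len(targeted), len(broadcast))  # 1 shard visited vs. all 3
```

A query with a shard-key predicate touches a single shard, while one filtering on any other field must be merged from every shard, which matches why Queries 7, 21, and 46 slow down on the sharded setup while Query 50 speeds up.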
Given the two choices of normalized and denormalized data models, a decision has to be made based on the amount of data stored in each document. If the size of a document does not exceed 16MB, a denormalized data model should be chosen over a normalized data model. Our experiments indicate that a denormalized data model results in faster query times than its normalized counterpart for both dataset sizes we investigate. Queries against a normalized data model are slower because the join operations followed by the embedding procedure are expensive. The normalized approach is also expensive in terms of storage consumption due to the creation of intermediate collections.

Given the two choices of a stand-alone and a sharded deployment environment, a decision has to be made based on application needs, available resources, and the types of queries to be executed. A sharded system is an appropriate choice when an application deployed on a stand-alone system becomes resource intensive, resulting in an increase in data volumes and read/write throughput. Another scenario considers the high cost incurred from running a stand-alone system as compared to a sharded system. When operating a sharded system, we have to be mindful of the following aspects. Firstly, the shard key is immutable, i.e., it cannot be altered after the collection is sharded. Secondly, queries have to be structured to use the shard key to prevent broadcasting across all the shards in the cluster; this allows the mongos to route the query to a specific shard, providing better performance. On the other hand, if a stand-alone system is utilized within its resource limits, it can be as effective at processing queries as a sharded system with equivalent resources. Since it does not face the constraint of a shard key, indexing can be applied to any field and a wide variety of queries can be directed to the system. Based on the experimental results, we can conclude that a stand-alone system is a suitable choice when queries with varying predicates are directed to the system, and a sharded system is a suitable choice when specific queries containing the shard key are directed to the cluster.

7. CONCLUSIONS AND FUTURE WORK
In this section, we summarize the contributions of our research and elaborate possible extensions for future work.

We address the performance impact organizations may face if they choose to run complex analytical queries on a stand-alone and sharded document store. We highlight the importance of data modeling coupled with deployment environments. Different experimental setups are implemented to evaluate the combination of a data model and a deployment environment and its impact on query performance and scaling.

For conducting the performance evaluation of analytical queries, we use TPC-DS as our benchmark to generate data and analytical queries. We develop a data loading algorithm to migrate data generated by the TPC-DS benchmark into MongoDB for performing query analysis. TPC-DS generated SQL queries employ join and aggregation operations. Since MongoDB does not support joins, we develop an algorithm to simulate join operations and perform aggregation operations.

We assess the performance of a stand-alone MongoDB system against a MongoDB sharded system and conclude that it is dependent on the deployment environments, data models, dataset sizes, and query types. Our results indicate that the denormalized data model speeds up queries by a significant amount when deployed in a stand-alone environment. The trend in execution times remains the same across the scales we investigated.

With the help of the algorithms proposed here, a tool can be created that migrates SQL-like data into MongoDB. Since MongoDB lacks support for join operations, the query translation algorithm developed here can be used as the basis for a tool to streamline join and aggregation operations.

There are many different ways that this study could be both broadened and deepened. Specifically, the preliminary experiments described here could be extended by implementing a denormalized data model on a sharded system and comparing its performance to a denormalized data model implemented on a stand-alone system. Other ways of organizing the data could also be considered. Herrero et al. consider conceptual, logical, and physical design of NoSQL databases and propose techniques that allow for varying amounts of denormalization depending on factors such as query workload [29]. Truică et al. study the performance of three document databases (including MongoDB) and three relational databases for create, read, update, and delete operations [30]. The replication of data considered in their approach would be interesting to examine as part of a future study using benchmark queries.

The research can be furthered in different aspects, such as varying the dataset sizes, increasing the scale, and using different benchmarks suited for schema-less data. MongoDB can be implemented on both a sharded and a stand-alone system. For the case of a stand-alone system, MongoDB is thread safe; multi-threading is handled by the database on the client side. MongoDB allows operations to lock at either the collection level or the database level. In a collection-level lock, only a single thread is allowed to perform write operations on the collection. Aggregation queries usually involve multiple collections that are queried individually, followed by the aggregation operations. Since MongoDB uses collection-level locks, individual threads can be used to query each collection in parallel and then perform aggregation on a single thread that runs after the completion of the other threads. In our research, the entire query was performed on a single thread; using multithreading may reduce query execution times. The same concept can be used in the sharded environment, where individual collections reside on different shards and multiple threads can be issued to query each collection in parallel. Parallelizing can also be performed for aggregation over collection subsets using multiple threads on a sharded database.

The performance of MongoDB can be tested further by employing a more varied dataset in terms of the number of fields and datatypes, and also by using a different benchmark intended for a document store database. Different denormalized data models could be deployed on the sharded cluster and their performance studied by varying the number of shards and the number of mongos instances, and by implementing multi-threading. Additional metrics could be collected, such as the amount of data read from disk and transmitted. All of the possible extensions could be executed on a more recent MongoDB storage engine such as WiredTiger [31], which introduces enhanced features including compression, column-oriented management techniques, and additional indexing styles.

8. REFERENCES
[1] (January 11, 2012). What is big data?. Available: https://beta.oreilly.com/ideas/what-is-big-data.
[2] A. Moniruzzaman and S. A. Hossain, "NoSQL database: New era of databases for big data analytics - classification, characteristics and comparison," arXiv preprint arXiv:1307.0191, 2013.
[3] (2013). MongoDB: A Document Oriented Database. Available: http://www.mongodb.org/about/.
[4] A. Lakshman and P. Malik, "Cassandra: a decentralized structured storage system," ACM SIGOPS Operating Systems Review, vol. 44, pp. 35-40, 2010.
[5] (April, 2012). TPC Benchmark DS. Available: http://www.tpc.org/tpcds/spec/tpcds_1.1.0.pdf.
[6] C. Baru, M. Bhandarkar, R. Nambiar, M. Poess and T. Rabl, "Setting the direction for big data benchmark standards," in Selected Topics in Performance Evaluation and Benchmarking, Springer, 2013, pp. 197-208.
[7] A. Nayak, A. Poriya and D. Poojary, "Type of NoSQL databases and its comparison with relational databases," International Journal of Applied Information Systems, vol. 5, 2013.
[8] Z. Parker, S. Poe and S. V. Vrbsky, "Comparing NoSQL MongoDB to an SQL DB," in Proceedings of the 51st ACM Southeast Conference, 2013, p. 5.
[9] (April 9, 2015). Sharding. Available: http://docs.mongodb.org/manual/sharding/.
[10] (June 04, 2015). MongoDB Documentation Release 3.0.3. Available: http://docs.mongodb.org/master/MongoDB-manual.pdf.
[11] BSON. Available: http://bsonspec.org/.
[12] N. Nurseitov, M. Paulson, R. Reynolds and C. Izurieta, "Comparison of JSON and XML data interchange formats: A case study," CAINE, vol. 2009, pp. 157-162, 2009.
[13] D. Pritchett, "BASE: An ACID alternative," Queue, vol. 6, pp. 48-55, 2008.
[14] (2014, September 25). Sharding Methods for MongoDB. Available: http://www.slideshare.net/mongodb/sharding-v-final.
[15] R. Han, X. Lu and J. Xu, "On big data benchmarking," in Big Data Benchmarks, Performance Optimization, and Emerging Hardware, Springer, 2014, pp. 3-18.
[16] M. Poess, R. O. Nambiar and D. Walrath, "Why you should run TPC-DS: A workload analysis," in Proceedings of the 33rd International Conference on Very Large Data Bases, 2007, pp. 1138-1149.
[17] A. Ghazal, T. Rabl, M. Hu, F. Raab, M. Poess, A. Crolotte and H. Jacobsen, "BigBench: Towards an industry standard benchmark for big data analytics," in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, 2013, pp. 1197-1208.
[18] S. Huang, J. Huang, J. Dai, T. Xie and B. Huang, "The HiBench benchmark suite: Characterization of the MapReduce-based data analysis," in Proceedings of the 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW), 2010, pp. 41-51.
[19] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi and S. Zhang, "BigDataBench: A big data benchmark suite from internet services," in Proceedings of the 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), 2014, pp. 488-499.
[20] (2015). AWS Documentation. Available: http://aws.amazon.com/documentation/.
[21] HashMap (Java Platform SE 7). Available: http://docs.oracle.com/javase/7/docs/api/java/util/HashMap.html.
[22] MongoDB Java Driver Documentation. Available: http://mongodb.github.io/mongo-java-driver/3.0/.
[23] ArrayList (Java Platform SE 7). Available: http://docs.oracle.com/javase/7/docs/api/java/util/ArrayList.html.
[24] B. G. Tudorica and C. Bucur, "A comparison between several NoSQL databases with comments and notes," in Proceedings of the 10th RoEduNet International Conference, 2011, pp. 1-5.
[25] M. Fotache and D. Cogean, "NoSQL and SQL databases for mobile applications. Case study: MongoDB versus PostgreSQL," Informatica Economica, vol. 17, pp. 41-58, 2013.
[26] V. Abramova and J. Bernardino, "NoSQL databases: MongoDB vs. Cassandra," in Proceedings of the International C* Conference on Computer Science and Software Engineering, 2013, pp. 14-22.
[27] TPC Documentation as of 04/27/2015. Available: http://www.tpc.org/information/current_specifications.asp.
[28] (Aug 26, 2013). Using TPC-DS to generate RDBMS performance and benchmark data. Available: http://www.innovation-brigade.com/index.php?module=Content&type=user&func=display&tid=1&pid=3.
[29] V. Herrero, A. Abello and O. Romero, "NOSQL design for analytical workloads: Variability matters," in Proceedings of the 35th International Conference on Conceptual Modeling (ER), 2016, pp. 50-64.
[30] C.-O. Truică, I. I. Bucur and A. Boicea, "Performance evaluation for CRUD operations in asynchronously replicated document oriented database," in Proceedings of the International Conference on Control Systems and Computer Science, 2015, pp. 191-196.
[31] (January 15, 2017). WiredTiger. Available: http://www.wiredtiger.com/.