Performance Evaluation of Analytical Queries on a Stand-alone and Sharded Document Store

Aarthi Raghavendra and Karen C. Davis
Electrical Engineering and Computing Systems Department, University of Cincinnati
Cincinnati, OH USA 45221-0030
1-513-556-2214
raghavai@mail.uc.edu, karen.davis@uc.edu

ABSTRACT

Numerous organizations perform data analytics using relational databases by executing data mining queries. These queries include complex joins and aggregate functions. However, due to an explosion of data in terms of volume, variety, veracity, and velocity, known as Big Data [1], many organizations such as Foursquare, Adobe, and Bosch have migrated to NoSQL databases [2] such as MongoDB [3] and Cassandra [4]. We investigate the performance impact of analytical queries on a NoSQL document store. We benchmark the performance of MongoDB [3], a cross-platform document-oriented database, in a stand-alone environment and a sharded environment. The TPC-DS benchmark [5] is used to generate data of different scales, and selected data mining queries are executed in both environments. Our experimental results show that along with the choice of environment, data modeling in MongoDB also has a significant impact on query execution times. Analytical query performance is best when data is stored in a denormalized fashion. When the data is sharded, due to the multiple query predicates in an analytical query, aggregating data from a few or all nodes proves to be an expensive process and hence performs poorly compared to executing the same query in a stand-alone environment.

CCS Concepts

• Information systems➝Database management system engines • Database query processing➝query optimization.

Keywords

document databases; MongoDB; sharded query performance; analytical queries.

2017, Copyright is with the authors. Published in the Workshop DOLAP, Proceedings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1. INTRODUCTION

Relational database systems have been the foundation for enterprise data management for over 30 years. Many organizations use a relational platform to perform data analysis by running data mining queries against a database. With an estimated growth in enterprise data to 35ZB by 2020 [6], along with growing user loads, organizations are adopting newer technologies such as NoSQL databases to store data. Among the types of NoSQL databases [7] (key-value store, column-oriented, document store, and graph), we have chosen MongoDB, a cross-platform document-oriented database, against which we execute data mining queries. It provides features such as aggregation, secondary indexing, sharding, and replication. Parker et al. [8] compare the runtime performance of MongoDB with SQL Server for a modest-sized database (3 tables, at most 12,416 tuples total) and conclude that the former performs equally well or better than SQL Server except when aggregation is utilized. However, the impact of data modeling and deployment environments on aggregation operations was not explored in detail.

In this paper, we investigate the performance of complex data mining queries against datasets of different sizes. We use a stand-alone and a distributed data organization known as sharding [9]. In a sharded database, data is split into chunks and distributed across the cluster nodes. A query run against such a system can target one, a few, or all of the nodes, and the results from the targeted nodes are aggregated and displayed to the user.

In Section 2, we outline features of MongoDB such as data modeling, indexing, sharding, and aggregation. In Section 3, we discuss the TPC-DS benchmarking standard that is used to generate datasets of varying sizes, as well as the criteria used to select data mining queries for our study. In Section 4, we describe the hardware and software configurations of the systems used to conduct the experiments. In Section 5, algorithms for migrating relational data and translating SQL queries to MongoDB are presented. Section 6 outlines the experimental procedures implemented on the stand-alone and sharded environments and discusses our findings. We conclude with a synopsis of the contributions and future work in Section 7.

2. MongoDB/BIG DATA BENCHMARKING

We give an overview of MongoDB concepts and terms, followed by a discussion of relevant benchmarking efforts to date.

2.1 MongoDB

MongoDB is a cross-platform document-oriented database classified under the aegis of the NoSQL databases. Its name is coined from the term huMONGOus, for its support of big data management. Key features of MongoDB include high performance, high availability, and automatic scaling [10]. It is schema-less, or has a flexible schema: unlike SQL, where the structure of the data is defined prior to loading it into the database, MongoDB does not enforce any rigid structure. The flexible structure is achieved by storing data in BSON format [11], which is a binary-encoded serialization of JSON-like [13] key-value pairs.

A document is composed of key-value pairs and is the basic unit of data in MongoDB. The value of a field can be another document, an array, or an array of documents. A group of documents is called a collection. Since documents do not dictate a specific format, a collection can have documents with varying numbers of fields and types of values, thereby giving it a flexible schema.

An example of a document is given in Figure 1. Every document in a collection has a unique _id field that acts as the primary key. Unless explicitly specified by the application, MongoDB uses a special 12-byte BSON type, called ObjectId, as the default value for the _id field. The ObjectId is generated from the timestamp, machine ID, process ID, and a process-local incremental counter that guarantees uniqueness [10].

{
  _id: ObjectId("5480abb8986c9d3197f6682c"),
  customer_id: 23,
  customer_address: {
    apartment_number: 26,
    street_name: "Whitfield",
    state: "CA",
    country: "United States"
  },
  customer_name: "Earl Garrison",
  birth_date: "9/25/1979",
  email_id: "earl.garrison@G3sM4P.com"
}

Figure 1: Document Structure

2.1.1 Data Modeling

An application that uses a relational database management system models data by declaring a table's structure and its relationships to other tables prior to inserting data. Similarly, data modeling in MongoDB focuses on the document structure and the relationships between data. Since MongoDB does not support joins, the two concepts that facilitate data modeling are embedding and referencing [10].

We illustrate data modeling techniques in MongoDB through an example. Consider two entities, Book and Publisher, and a one-to-many relationship between them.

Figure 2: Embedded Data Model

Figure 2 illustrates the embedded data model design for the one-to-many relationship between the Publisher and Book entities. Embedding represents a relationship by encapsulating related information in a single document or structure. It depicts a contains relationship between entities, since related pieces of information are stored in the same database record. The one-to-many relationship between a Publisher and a Book can be modeled by embedding the book data entities in the publisher data. This provides good performance for read operations, as related data can be retrieved in a single database operation. For example, an application can retrieve complete publisher information in one query, and new books published by the publisher can be added as embedded sub-documents to the books array. This ensures no repetition of the publisher details per book, thereby reducing redundant data. However, if the size of the document exceeds the 16 MB limit, the data has to be split and stored in separate documents. In such cases, the referenced data model can be adopted.

Figure 3: Referenced Data Model

Figure 3 illustrates the referenced data model design for the one-to-many relationship between the Publisher and Book entities. References represent relationships between data by associating or referencing one document from another. This is achieved by storing the _id of one document as the value of a field in the other document. The one-to-many relationship between Publisher and Book can be modeled by keeping the publisher and book information in two separate collections and enforcing the relationship by storing the Publisher reference inside the Book document. In doing so, a query to retrieve complete publisher information has to make multiple requests to the server, as follow-up queries are necessary to resolve the references. However, referencing is a suitable choice for modeling entities that are independent of each other. Also, if two or more entities are related but complex, the complexity can be reduced by breaking the data down into multiple documents. Table 1 provides a comparison between the two data models.

Table 1: Embedded and Referenced Data Model Comparison

For fast and efficient data access, indexes can be created to locate data quickly and perform random lookups.

2.1.2 Indexing

An index is a special on-disk data structure that allows the database application to retrieve data faster by not scanning the entire collection. It uses keys built from one or more fields in a document. MongoDB implements indexing by storing the keys in a B-Tree data structure, which helps in finding the rows associated with the keys quickly and efficiently.
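To make the embedding and referencing designs of Section 2.1.1 concrete, here is a small illustrative sketch (not from the paper; the publisher and book values are invented) showing both layouts as plain Python dicts standing in for MongoDB documents, along with the follow-up lookup that the referenced model requires:

```python
# Illustrative sketch: the Publisher/Book one-to-many relationship,
# modeled the two ways described in Section 2.1.1. Sample values are
# hypothetical; dicts stand in for MongoDB documents.

# Embedded model: books live inside the publisher document, so one
# read retrieves the publisher and all of its books.
publisher_embedded = {
    "_id": 1,
    "name": "Example Press",
    "books": [
        {"title": "Document Stores in Practice", "year": 2013},
        {"title": "Sharding for Analysts", "year": 2015},
    ],
}

# Referenced model: books are separate documents that hold the
# publisher's _id, mirroring a foreign key.
publisher = {"_id": 1, "name": "Example Press"}
books = [
    {"_id": 10, "title": "Document Stores in Practice", "publisher_id": 1},
    {"_id": 11, "title": "Sharding for Analysts", "publisher_id": 1},
]

def books_for_publisher(pub_id, book_collection):
    """Simulates the follow-up query needed to resolve references."""
    return [b for b in book_collection if b["publisher_id"] == pub_id]
```

The embedded layout answers "publisher with all books" in one lookup, while the referenced layout needs the extra `books_for_publisher` pass — the trade-off the text describes.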
Indexes in MongoDB are defined at the collection level on any field or sub-field of the document. MongoDB supports 7 different types of indexes [10]. In our work, we use the default _id index and compound indexes. All collections are indexed on the _id field by default. A compound index is created on multiple fields of a document with a specific sort order for each field. If a collection has a compound index on PersonID and Age, the index sorts first by PersonID and then, within each PersonID value, by Age. Therefore, the order of the fields in a compound index should be declared appropriately based on the application needs.

2.1.3 Sharding

A database system deployed on a single server can experience heavy loading due to high query rates or large datasets. When the demand on the server spikes drastically, alternate means should be identified to keep the system online and robust. This scaling issue is addressed in database systems by implementing vertical scaling or horizontal scaling [15]. MongoDB implements horizontal scaling, or sharding [10]: the process of splitting data into chunks and distributing the chunks across multiple servers, known as shards. The data is partitioned at the collection level and stored on multiple shards such that all shards put together make up a single logical database.

A sharded cluster in MongoDB [10] has 3 components:

1. A shard is either a single mongod instance or a replica set [10] that stores data. The mongod is a daemon process that starts the MongoDB server and handles data requests and other background operations. A replica set is a feature of MongoDB that ensures redundancy by storing the same data on multiple servers.

2. A config server is a mongod instance that stores the metadata of the cluster. It maintains a mapping of the chunks to the shards.

3. A query router is a mongos instance that is responsible for directing read and write operations from the client application layer to the specific shard or shards. The mongos is a routing service to which the application queries are directed internally; it uses the metadata stored in the config server to locate the target shard or shards and consolidates all the information returned from the various shards before displaying it to the user.

While deploying the sharded cluster for our research, we encountered issues that affect the application and cluster performance. Most issues are caused by the number of instances of each of the components deployed in the cluster. We discuss the issues faced and the methods adopted to avoid them.

In a sharded environment, the number of instances of each of the components determines the robustness of the cluster. For read-intensive applications, having multiple query routers helps balance the application load, rather than a single query router that can be easily overloaded by a high frequency of read operations. Based on the cluster and application needs, query routers can be added to the cluster on the fly by establishing connections to the config servers and the shards.

The number of config servers and shards is of greater importance, since they perform all the application-critical operations. Deploying a cluster with multiple config servers enables data accessibility and avoids a single point of failure. Similar to config servers, the number of shards can also cause potential problems if it is not aligned with the application needs. For write-intensive applications, shards can exceed their capacity and be exhausted quickly if data is continually written to them. Therefore, the capacity of a shard should be decided before deploying the cluster, based on the amount of data to be stored. If the number of shards is too few, data resides on just a few shards, leading to exhaustion problems. Having more shards reduces the amount of data on each shard, but with too many shards, resources such as RAM and CPU cannot be used effectively. The number of shards in a cluster can be calculated based on the following factors [14]:

1. The sum of disk storage space across shards should be greater than the required storage size. For example, if the application data is 1.5TB and the disk storage available per server is 256GB, then the number of shards needed can be calculated as 1.5TB/256GB ≈ 6 shards.

2. The sum of RAM across shards should be greater than the working set of the sharded cluster. The working set is the segment of client data that is accessed most frequently by the application. For read-intensive applications, storing the entire working set in RAM results in faster read operations. If the working set memory requirement is more than the available RAM, the operating system needs to perform frequent IO operations against the disk to retrieve the information, thereby drastically slowing down the system. The working set size is the size of the frequently accessed collections and their indexes. For a working set of 200GB and server RAM of 64GB, the number of shards can be calculated as 200GB/64GB ≈ 4 shards.

Other factors include disk throughput and operations per second. We calculate the number of shards needed in our cluster based on disk storage and RAM. Since the data load is a write-intensive process, each server needs a sufficient amount of disk space to store the data that is continually written. In doing so, server resources such as CPU, memory, and disk are utilized effectively without being overloaded. Also, since we focus on analytical query performance, read operations should be optimized to achieve the best results. For fast read operations, all the collections and indexes related to the query should reside in RAM to avoid disk IO operations. For this purpose, along with disk storage, we also take the server RAM into consideration when calculating the number of shards.

Data distribution affects the application's read and write performance. If a considerable amount of data resides on a single shard, it can lead to a server crash or latency issues. Similarly, if too little data resides on each shard, the server resources are not fully utilized. In MongoDB, the distribution of data across the cluster members is determined by the shard key. A shard key [10] is either an indexed field or an indexed compound field that is present in all documents of a collection. MongoDB uses the shard key to divide a collection into small non-overlapping ranges called chunks; the default chunk size is 64 MB. MongoDB uses range-based partitioning and hash-based partitioning for dividing shard key values into chunks.

If the shard key is a numeric value, MongoDB can use range-based partitioning [10], where documents with nearby shard key values reside in the same chunk and therefore on the same shard. This makes range queries efficient, though data may not be distributed evenly across the shards.

MongoDB outlines the following strategies for selecting a shard key:

1. High cardinality: the cardinality of a shard key refers to the number of different values associated with it. A shard key with high cardinality has a low probability of creating jumbo chunks.

2. Compound shard key: if a collection does not have a field that can serve as an optimal shard key, additional fields can be added to produce a more ideal key.

3. Hashed shard key: for a shard key with high cardinality, a hashed index on the field can ensure even distribution of data across the shards.

We utilize high cardinality and compound shard keys in our implementation.

2.2 Benchmarking Big Data

Among the many definitions of big data, we adopt the definition given by Dumbill: "Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures." [1] Big data is typically characterized by the 4V properties (i.e., volume, velocity, variety, and veracity [15]). Therefore, in order to benchmark big data, a standard should be chosen that satisfies the 4V properties at least partially, if not completely. The synthetic data generator should meet the following criteria:

1. Volume refers to the ability to generate data of various scaling factors as inputs of typical workloads. The volume of data generated can range from GBs to PBs.

2. Velocity refers to data generation rates.

3. Variety refers to the support for generating diverse data types, including structured, unstructured, and semi-structured data.

4. Veracity, with respect to benchmarking, is the ability to keep the synthetic data aligned with the characteristics of real-world data [15]. The credibility of the benchmark is dependent on how well the raw data features are preserved in the generated data.

Han et al. [15] compare big data benchmarks such as TPC-DS [16], BigBench [17], and HiBench [18] in terms of data generation and benchmarking techniques. We use this paper as a reference to choose our benchmarking standard for generating datasets of various scale factors and data mining queries of varying complexities. We briefly discuss the state of the art, including terms introduced in relation to each of the 4V properties, in order to compare and categorize the existing big data benchmarks. Table 2 gives basic definitions of the terminology.

Table 2: Terms Used to Categorize Big Data Benchmarks based on 4V Properties [15]

We conduct a performance evaluation of MongoDB by executing analytical queries on datasets of two different sizes. Therefore, we need a benchmark that can generate scalable datasets and realistic data. MongoDB is an appropriate choice of database for unstructured data rather than tabular data. However, we are interested in the performance of MongoDB when tabular data is denormalized and modeled in a way that better suits MongoDB; we compare the performance of analytical queries against normalized and denormalized data models. Therefore, among the big data benchmarks [15], we choose one that satisfies a scalable volume, semi-controllable velocity, and structured variety, and that partially considers veracity. The 3 benchmarks that satisfy our needs are TPC-DS [16], BigBench [17], and BigDataBench [19]. Since we are studying the performance of analytical queries, we need the benchmark to be able to generate data that supports joins between entities and queries containing varying aggregate functions. BigBench provides a limited number of data mining queries and BigDataBench provides a dataset consisting of only two tables, whereas TPC-DS provides a dataset of 24 tables and a query set of 100 queries, most of which use aggregate functions. Therefore, we choose TPC-DS as our big data benchmark.

3. TPC-DS AND QUERY SELECTION

The underlying business model of the TPC-DS schema is a retail product supplier, and the schema follows a snowflake design [16]: a database architecture in which a central table, called the fact table, is linked to multiple other tables, called dimension tables. A fact table typically has two types of columns, foreign key columns and measures columns. The foreign key columns reference the primary keys of the dimension tables, and the measures columns hold data used for calculations and analysis.

Table 3: Query Features

Table 4: Table Details for Datasets 1GB and 5GB

The TPC-DS benchmark [5] has a total of 7 fact tables and 17 dimension tables. Among the 24 tables, the representative queries we selected utilize 3 fact tables (Store_Sales, Store_Returns, and Inventory) and a total of 9 dimension tables.
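As an aside, the shard-count arithmetic from Section 2.1.3 can be sketched in a few lines. This is an illustrative Python sketch, not the paper's code; the helper names are ours, and the inputs are the paper's own worked examples (1.5TB of data on 256GB disks, and a 200GB working set on 64GB-RAM servers):

```python
import math

def shards_for_disk(data_gb: float, disk_per_server_gb: float) -> int:
    """Shards needed so total disk across shards exceeds the data size."""
    return math.ceil(data_gb / disk_per_server_gb)

def shards_for_ram(working_set_gb: float, ram_per_server_gb: float) -> int:
    """Shards needed so total RAM across shards can hold the working set."""
    return math.ceil(working_set_gb / ram_per_server_gb)

# The two examples from Section 2.1.3:
print(shards_for_disk(1536, 256))  # 1.5TB / 256GB -> 6 shards
print(shards_for_ram(200, 64))     # 200GB / 64GB  -> 4 shards
```

In practice the larger of the two estimates would be taken, since both the disk and RAM constraints must hold at once.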
Among the 4 query dataset of specific size, the machine configurations for both the classes supported by TPC-DS we choose the data mining class. environments differ. For example, the 41.93GB sharded Among the 23 queries available in that class, we select 4 queries environment has more powerful machines than the 9.94GB sharded which meet three or more of these criteria: (1) join of 4 or more environment since it has more data. Only one stand-alone system is tables, (2) aggregation functions such as sum() and avg(), (3) group setup for both the 9.94GB and 41.93GB datasets. All the AWS by and order by clauses, (4) conditional constructs such as case, machines have the same software configuration: Red Hat and (5) correlated subquery using the from clause. Enterprise Linux 7.1 and MongoDB version 3.0.2. We select 4 queries that satisfy the criteria: Query 7, Query 21, 4.2 MongoDB Cluster Architecture Query 46, and Query 50. Table 3 summarizes the criteria met by A MongoDB sharded system consists of 3 components: shards, each query. Table 4 lists the number of records in the tables for config servers, and query routers. We determine the number of datasets of sizes 1GB and 5GB. instances of each component by taking the 9.94GB dataset as an example. The MongoDB sharded cluster test architecture [10] is 4. EXPERIMENTAL PLATFORM used as a reference for creating our sharded system. The test For setting up the stand-alone and sharded environments, we used architecture uses 1 config server, 1 query router, and 1 or more the Amazon Web Services (AWS), a cloud-computing platform shards based on the application needs. Our MongoDB sharded that provides on-demand delivery of IT resources [20]. We rented system has 1 config server and 1 query router similar to the test virtual computers through the Amazon Elastic Compute Cloud architecture. 
However, the number of shards is decided taking into (EC2) web service for application deployment and experimental consideration the data to be stored on it, which in this case is 9.94 set-up. We boot the Red Hat Enterprise Linux AMI (Amazon GB. Machine Image) to create our virtual machines [20]. AWS provides the capability of starting, stopping, and terminating instances as We use the disk storage and RAM as factors for deciding the needed, whereby active servers are charged by the hour. number of shards in the cluster. Among the two factors, RAM is given priority as it reduces random disk IO thereby improving read TPC-DS is chosen as our benchmark for generating data and performance. Therefore, a system that is capable of accommodating analytical queries. We use datasets of sizes 1GB and 5GB for data, indexes, and other running applications in the RAM is chosen. conducting our research. However, the 1GB and 5GB text data Among the available AWS machines, those with a RAM storage of when migrated to MongoDB increases to 9.94GB and 41.93GB either 4GB or 8GB best suit our application needs. A machine with respectively, an increase by a factor of nearly nine compared to the less RAM would require deploying more systems in the cluster, original dataset size. Therefore, we need machine(s) that can increasing operational costs and a machine with RAM higher than accommodate datasets with a minimum size of 10GB. Hence, an 8GB would make sharding insignificant for the 9.94GB dataset. EC2 instance is chosen such that the RAM is greater than the working set, the portion of data that is accessed often by the The RAM consumption of the operating system and other application server [10]. A RAM that fits all the indexes and applications typically does not exceed 2GB. If an AWS machine working set ensures faster processing there by reducing random with 4GB RAM is chosen, only 2GB space would be available for disk IO. 
storing data and indexes, hence requiring 5 machines (9.94GB/2GB). On the other hand, if an AWS machine with 8GB The MongoDB stand-alone environment uses the m4.4xlarge1 RAM is chosen, 6GB space would be available for storing data and instance that is capable of storing both the 9.94GB and 41.93GB indexes, thus requiring 2 machines (9.94GB/6GB). However, we datasets. The MongoDB sharded environment is a 5 node cluster use 3 machines with 8GB RAM as shards to accommodate not only where every machine/instance has the same configuration. For the data but also indexes and the intermediate and final query application deployment on the MongoDB cluster we use the collections. Therefore, the MongoDB sharded system deployed on t2.large1 instance for the 9.94GB dataset and the m4.xlarge1 AWS has 3 shards, 1 config server, and 1 query router. instance for the 41.93GB dataset. Figure 4 illustrates the organization of the sharded system. Each 4.1 Hardware/Software Configuration node in the sharded system is named after their functionality This section discusses the hardware configurations of all the AWS namely, Shard1, Shard2, Shard3, ConfigServer, and AppServer/ machines utilized for the deployment of the stand-alone and QueryRouter. Since each node has a specific functionality, the sharded environments. Table 5 illustrates the machine processes running on them differ. The 3 shards are responsible for configurations for the MongoDB sharded and stand-alone storing data and therefore run the mongod instance. The config environments. server stores the metadata of the cluster and runs a mongod Table 5: Machine Hardware Configurations instance. The query router is a mongos instance, a routing service to which the application queries are directed to. It internally makes use of the config server metadata to direct queries to the appropriate shards. 
The application server and query router are deployed on the same AWS machine because the query router is a lightweight process and does not utilize much of the server resources. The two segmented lines represent the actual flow path of query or any related read operation and the continuous line indicates the observed flow path. Two different MongoDB sharded environments are created for the 9.94GB and 41.93GB datasets. Each of the sharded environments have 5 machines. Since each sharded environment supports a 1 AWS machine nomenclature and 17 dimension tables. The records in each table correspond to the documents in the respective collections. 4. Since data in MongoDB in stored in the JSON format, the column names of the table correspond to the keys in the JSON document and the column values of the table correspond to the respective key values in the document. The pseudocode of the data migration algorithm is illustrated in Figure 5. It takes a data file as input and generates the MongoDB collection as output. The algorithm makes use of the HashMap data structure in Java [21], a tabular structure that maps keys to values. It uses hashing to generate a hashcode for every key, which determines the object location in the table. Figure 4: MongoDB Cluster Architecture 5. PERFORMANCE STUDY In this section, we describe the algorithms developed for data loading and translation of a SQL query to a Mongo query using Java as the programming language2 and we describe the experimental setup. Table 6 illustrates the different experimental setups based on the choice of dataset sizes, data models, and deployment environments. The experiments are conducted by Figure 5: Data Migration Algorithm taking into consideration the two dataset sizes, two data models, and two deployment environments. 
Table 6: Experiments

Performance of queries, or any read operation, is greatly dependent on the data model that supports the application needs. In a relational database system, data models are broadly classified into normalized and denormalized data models. A normalized data model has a higher number of foreign keys and hence a higher number of joins. In a denormalized data model, on the other hand, data is consolidated into one or a few tables, resulting in fewer foreign keys and hence fewer joins. We are interested in the query performance of MongoDB when run against both the normalized and denormalized data models.

5.1 Migrating Data into MongoDB
We develop an algorithm to migrate the TPC-DS generated data files into MongoDB in JSON format. The same algorithm is used to load datasets of sizes 1GB and 5GB, which translate to 9.94GB and 41.93GB, respectively, when loaded into MongoDB. The increase in size is due to the data storage in JSON format: each document consists of key-value pairs where the key corresponds to the table column, and each key is repeated in every document in the collection, drastically increasing the size of the dataset. The following points summarize the flow of data, starting with its creation using the TPC-DS benchmark up to its migration into MongoDB:
1. TPC-DS generates a .dat file for each table. The column values are delimited by the '|' character.
2. The databases in MongoDB for the 1GB and 5GB TPC-DS datasets are called Dataset_1GB and Dataset_5GB, respectively. Steps 3 and 4 correspond to the Dataset_1GB database; however, they work analogously for the Dataset_5GB database.
3. Each table in the TPC-DS schema corresponds to a separate collection in the Dataset_1GB database. Hence, Dataset_1GB contains a total of 24 collections, representing the 7 fact tables and 17 dimension tables.

Code is available at https://github.com/raghavai/Performance-Evaluation-of-Analytical-Queries-on-a-Stand-alone-and-Sharded-Document-Store.

5.2 Translating SQL Queries to MongoDB
The SQL queries in our study differ from one another in aspects such as the number of joined tables, aggregate functions, where clauses, from clauses, group by and order by clauses, and correlated and uncorrelated subqueries. Manually translating each of these queries into a Mongo query can be tedious and error-prone. We therefore create algorithms for translating an analytical SQL query into its equivalent Mongo query. These algorithms focus on executing the selected queries that use aggregate functions, where clauses, from clauses, group by, and order by clauses.

In order to execute analytical queries against a denormalized model, the collections in the MongoDB database must be denormalized. During data load, the TPC-DS data files are migrated into individual collections; therefore, on initial load the data is completely normalized. We develop an algorithm for creating a denormalized collection from a given fact collection and a set of dimension collections. In this approach, all the dimension collections are joined to the fact collection based on their foreign key relationships. In MongoDB, joining a dimension collection to a fact collection is equivalent to embedding the dimension collection documents in the fact collection. To understand the structure of a denormalized collection in MongoDB, consider the store_sales fact collection connected to multiple dimension collections such as time_dim, store, and item. Foreign keys such as ss_sold_date_sk, ss_item_sk, and ss_cdemo_sk are replaced by the actual documents from the corresponding dimension collections.

Figure 6 illustrates pseudocode for the algorithm to create a denormalized fact collection. It takes a fact collection and a set of dimension collections as input and outputs the denormalized fact collection. An EmbedDocuments method, shown in Figure 7, is called on every dimension collection; it embeds the dimension collection documents in the fact collection.

Figure 6: Denormalized Collection Creation Algorithm

Figure 7: Embedding Documents Algorithm

5.3 Query Translation Algorithm for Normalized Data Model
An analytical query against a normalized data model typically queries data from one or more tables. In SQL, this is accomplished through join operations. The SQL query optimizer plays a vital role in identifying an optimal query plan with low cost. For example, if a query contains join operations, where clauses, and a group by clause, the query optimizer decides the execution order of the operations that yields a query plan with low cost and execution time.

In MongoDB, query optimization is limited to the use of indexes and efficient read operations. Join operations cannot be optimized, as MongoDB does not support joins. To overcome this hurdle, we develop an algorithm that simulates join operations and executes queries on the fly. All the queries used here implement the select-from-where template; therefore, the algorithm is optimized for queries that follow this template. The algorithm does not take into consideration the details of the query predicates, aggregation operations, and sequence of join operations, but only follows a predetermined order of execution:
1. Query all the dimension collections based on their respective where clauses.
2. Perform a semi-join of the fact collection with the filtered dimension collection documents, i.e., obtain only those fact collection documents whose foreign keys are present in the filtered dimension collection documents. For example, a semi-join of the store_sales and customer_address collections would result in a collection containing only those store_sales documents whose foreign key is referenced in the customer_address collection. Store the semi-joined fact collection in an intermediate collection.
3. Embed dimension collection documents in the intermediate collection documents. To improve performance, embed only those dimension collection documents whose attributes are used in query aggregation.
4. Perform aggregation operations against the embedded intermediate collection and store the final query results in an output collection.

We illustrate pseudocode for the query translation algorithm against a normalized data model in Figure 8. It takes the fact collection(s) and dimension collections related to the query as input and outputs the final query collection. In addition to the final query collection, an intermediate collection is created that holds the semi-joined fact collection documents. The algorithm uses the ArrayList data structure in Java [23], a dynamic array that can grow and shrink as needed; it is used in the process of performing the semi-join on the fact collection. The EmbedDocuments method illustrated in Figure 7 is called on the intermediate collection and a dimension collection during the embedding process. To improve performance, only those dimension collections whose attributes are used in query aggregation are embedded. After the embedding process, the MongoDB aggregation framework is used to execute the aggregation operations in the query.

Figure 8: Query Translation Algorithm for the Normalized Data Model
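The four-step execution order above can be sketched in a few lines. The following Python mock-up runs the same filter, semi-join, embed, and aggregate steps over in-memory lists of dicts; it is only an illustrative stand-in for the paper's Java/MongoDB implementation, and the sample documents and values are invented (the column names follow TPC-DS conventions such as ss_sold_date_sk).

```python
# Illustrative sketch of the predetermined execution order in Section 5.3,
# over in-memory lists of dicts instead of MongoDB collections.
# The sample data is hypothetical.
from collections import defaultdict

date_dim = [
    {"d_date_sk": 1, "d_year": 2000},
    {"d_date_sk": 2, "d_year": 2001},
]
store_sales = [  # fact documents referencing date_dim via ss_sold_date_sk
    {"ss_sold_date_sk": 1, "ss_quantity": 5},
    {"ss_sold_date_sk": 2, "ss_quantity": 7},
    {"ss_sold_date_sk": 2, "ss_quantity": 3},
]

# Step 1: filter each dimension collection by its where clause (d_year = 2001).
filtered_dates = [d for d in date_dim if d["d_year"] == 2001]

# Step 2: semi-join -- keep only fact documents whose foreign key appears in
# the filtered dimension documents; store them in an intermediate collection.
date_keys = {d["d_date_sk"] for d in filtered_dates}
intermediate = [f for f in store_sales if f["ss_sold_date_sk"] in date_keys]

# Step 3: embed the matching dimension document in each intermediate document
# (only dimensions whose attributes the aggregation needs are embedded).
by_key = {d["d_date_sk"]: d for d in filtered_dates}
for fact in intermediate:
    fact["date_dim"] = by_key[fact["ss_sold_date_sk"]]

# Step 4: aggregate the embedded documents (SUM(ss_quantity) GROUP BY d_year).
totals = defaultdict(int)
for fact in intermediate:
    totals[fact["date_dim"]["d_year"]] += fact["ss_quantity"]

print(dict(totals))  # -> {2001: 10}
```

In MongoDB itself, step 4 would be expressed as an aggregation pipeline ($match, $group) over the embedded intermediate collection; the in-memory loops here merely make the data flow of the algorithm visible.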
6. EXPERIMENT RESULTS
Table 7 summarizes the data load times for each table for both dataset sizes. We also show the query runtimes for each of the experimental setups. Table 8 illustrates the selectivity of the queries, i.e., the proportion of data retrieved, for both the 9.94GB and 41.93GB datasets. Table 9 summarizes the query execution runtimes for each of the experiments. Every query is run 5 times on each experimental setup, and for each run the data is cached in memory. Among the 5 runtimes we obtain, Table 9 presents the best results. Hours are denoted by h, minutes by m, and seconds by s.

We analyze the results of the data load times for the 1GB and 5GB datasets below.
1. For tables having the same number of records in both datasets, the data load times are nearly identical. This can be observed for the tables catalog_page, customer_demographics, date_dim, household_demographics, income_band, ship_mode, and time_dim.
2. For tables with an unequal number of records in the two datasets, the ratio of the number of records is nearly identical to the ratio of their load times. This can be observed for the tables call_center, catalog_returns, catalog_sales, customer, customer_address, inventory, item, promotion, reason, store, store_returns, store_sales, warehouse, web_page, web_returns, web_sales, and web_site.

Table 7: Data Load Times

Figures 9 and 10 illustrate the query performance on the 9.94GB and 41.93GB datasets, taking into consideration the data models and deployment environments. We analyze the query execution runtimes below.
1. Experiments 3 and 6 have the fastest query runtimes for their respective dataset sizes. It can be observed that these two experiments correspond to the denormalized data models running on stand-alone systems. This indicates that denormalized data models outperform their normalized counterparts. Also, the algorithm implemented to execute queries against a denormalized data model handles scaling effectively, for the scales used in our experiments.
2. Among the experiments conducted against a normalized data model, stand-alone systems are observed to be faster than sharded systems. The runtimes of Queries 7, 21, and 46 for Experiments 1, 2 and 4, 5 support this statement.
3. Query 50 is observed to be faster on the sharded system. This does not by itself indicate that the sharded system is slower or faster than a stand-alone system; it depends on the type of query sent to either system. If a query includes the shard key, the mongos routes the query to a specific shard rather than broadcasting it to all the shards in the cluster, enhancing query performance, which is the case for Query 50. Thus, we can infer that Queries 7, 21, and 46, which are faster on the stand-alone system, are being broadcast on the sharded system, resulting in slower runtimes.

Figure 9: A Comparison of Query Execution Times for 9.94GB Dataset
Table 8: Query Selectivity
Table 9: Query Execution Runtimes
Figure 10: A Comparison of Query Execution Times for 41.93GB Dataset

There have been other studies that benchmark the performance of MongoDB [8, 25, 26, 27, 28]. However, the majority of them benchmark MongoDB against popular databases such as Oracle, Cassandra, HBase, and Couchbase, and none of them study the performance of MongoDB based on deployment environments, data modeling, and scalable datasets. We specifically focus on modeling relational data in a NoSQL environment, tune it in different ways, and study why one data model outperforms the other. Also, we base our conclusions on running complex analytical queries on each of the data models. Based on the results obtained from executing analytical queries on the different experimental setups, we conclude that the performance of queries on MongoDB is influenced by the data model and the deployment environment.
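The targeted-versus-broadcast routing behavior observed for Query 50 can be mimicked with a toy model. The sketch below is not MongoDB's actual routing logic: it assumes a hashed shard key and three shards, whereas a real mongos consults chunk metadata on the config servers; the field names and data are hypothetical.

```python
# Toy model of mongos query routing on a sharded collection.
# Real routing uses config-server chunk metadata, not a simple hash.
NUM_SHARDS = 3
shards = [[] for _ in range(NUM_SHARDS)]

def place(doc, shard_key):
    """Store a document on the shard chosen by hashing its shard-key value."""
    shards[hash(doc[shard_key]) % NUM_SHARDS].append(doc)

def route(query, shard_key):
    """Return the shards a query must visit: one shard if the query filters
    on the shard key (targeted), every shard otherwise (broadcast)."""
    if shard_key in query:
        return [shards[hash(query[shard_key]) % NUM_SHARDS]]
    return shards  # scatter-gather across the whole cluster

for i in range(100):
    place({"ss_item_sk": i, "ss_quantity": i % 10}, "ss_item_sk")

targeted = route({"ss_item_sk": 42}, "ss_item_sk")   # shard-key predicate
broadcast = route({"ss_quantity": 7}, "ss_item_sk")  # non-shard-key predicate

print(len(targeted), len(broadcast))  # 1 shard visited vs. all 3
```

A query with a shard-key predicate touches a single shard, while one filtering on any other field must be merged from every shard, which matches why Queries 7, 21, and 46 slow down on the sharded setup while Query 50 speeds up.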
Given the two choices of normalized and denormalized data models, a decision has to be made based on the amount of data stored in each document. If the size of a document does not exceed 16MB, a denormalized data model should be chosen over a normalized data model. Our experiments indicate that a denormalized data model results in faster query times than its normalized counterpart for both dataset sizes we investigate. Queries against a normalized data model are slower because the join operations followed by the embedding procedure are expensive. The normalized approach is also expensive in terms of storage consumption due to the creation of intermediate collections.

Given the two choices of a stand-alone and a sharded deployment environment, a decision has to be made based on application needs, available resources, and the types of queries to be executed. A sharded system is an appropriate choice when an application deployed on a stand-alone system becomes resource intensive, resulting in an increase in data volumes and read/write throughput. Another scenario considers the high cost incurred from running a stand-alone system as compared to a sharded system. When operating a sharded system, we have to be mindful of the following aspects. Firstly, the shard key is immutable, i.e., it cannot be altered after the collection is sharded. Secondly, queries have to be structured to use the shard key to prevent broadcasting across all the shards in the cluster; this allows the mongos to route the query to a specific shard, providing better performance. On the other hand, if a stand-alone system is utilized within its resource limits, it can be as effective at processing queries as a sharded system with equivalent resources. Since it does not face the constraint of a shard key, indexing can be applied to any field and a wide variety of queries can be directed to the system. Based on the experimental results, we can conclude that a stand-alone system is a suitable choice when queries with varying predicates are directed to the system, and a sharded system is a suitable choice when specific queries containing the shard key are directed to the cluster.

7. CONCLUSIONS AND FUTURE WORK
In this section, we summarize the contributions of our research and elaborate possible extensions for future work.

We address the performance impact organizations may face if they choose to run complex analytical queries on a stand-alone and sharded document store. We highlight the importance of data modeling coupled with deployment environments. Different experimental setups are implemented to evaluate the combination of a data model and a deployment environment and its impact on query performance and scaling.

For conducting the performance evaluation of analytical queries, we use TPC-DS as our benchmark to generate data and analytical queries. We develop a data loading algorithm to migrate data generated by the TPC-DS benchmark into MongoDB for performing query analysis. TPC-DS generated SQL queries employ join and aggregation operations. Since MongoDB does not support joins, we develop an algorithm to simulate join operations and perform aggregation operations.

We assess the performance of a stand-alone MongoDB system against a MongoDB sharded system and conclude that it is dependent on the deployment environments, data models, dataset sizes, and query types. Our results indicate that the denormalized data model speeds up queries by a significant amount when deployed in a stand-alone environment. The trend in execution times remains the same across the scales we investigated.

With the help of the algorithms proposed here, a tool can be created that migrates SQL-like data into MongoDB. Since MongoDB lacks support for join operations, the query translation algorithm developed here can be used as the basis for a tool to streamline join and aggregation operations.

There are many different ways that this study could be both broadened and deepened. Specifically, the preliminary experiments described here could be extended by implementing a denormalized data model on a sharded system and comparing its performance to a denormalized data model implemented on a stand-alone system. Other ways of organizing the data could also be considered. Herrero et al. consider conceptual, logical, and physical design of NoSQL databases and propose techniques that allow for varying amounts of denormalization depending on factors such as query workload [29]. Truică et al. study the performance of three document databases (including MongoDB) and three relational databases for create, read, update, and delete operations [30]. The replication of data considered in their approach would be interesting to examine as part of a future study using benchmark queries.

The research can be furthered in different aspects, such as varying the dataset sizes, increasing the scale, and using different benchmarks suited for schema-less data. MongoDB can be implemented on both a sharded and a stand-alone system. For the case of a stand-alone system, MongoDB is thread safe; multi-threading is handled by the database on the client side. MongoDB allows operations to lock at either the collection level or the database level. In a collection-level lock, only a single thread is allowed to perform write operations on the collection. Aggregation queries usually involve multiple collections that are queried individually, followed by the aggregation operations. Since MongoDB uses collection-level locks, individual threads can be used to query each collection in parallel and then perform aggregation on a single thread that runs after the completion of the other threads. In our research, the entire query was performed on a single thread; using multithreading may reduce query execution times. The same concept can be used in the sharded environment, where individual collections reside on different shards and multiple threads can be issued to query each collection in parallel. Parallelizing can also be performed for aggregation over collection subsets using multiple threads on a sharded database.

The performance of MongoDB can be tested further by employing a more varied dataset in terms of the number of fields and datatypes, and also by using a different benchmark intended for a document store database. Different denormalized data models could be deployed on the sharded cluster and their performance studied by varying the number of shards and the number of mongos instances, and by implementing multi-threading. Additional metrics could be collected, such as the amount of data read from disk and transmitted. All of the possible extensions could be executed on a more recent MongoDB storage engine such as WiredTiger [31], which introduces enhanced features including compression, column-oriented management techniques, and additional indexing styles.

8. REFERENCES
[1] (January 11, 2012). What is big data?. Available: https://beta.oreilly.com/ideas/what-is-big-data.
[2] A. Moniruzzaman and S. A. Hossain, "NoSQL database: New era of databases for big data analytics - classification, characteristics and comparison," arXiv preprint arXiv:1307.0191, 2013.
[3] (2013). MongoDB: A Document Oriented Database. Available: http://www.mongodb.org/about/.
[4] A. Lakshman and P. Malik, "Cassandra: a decentralized structured storage system," ACM SIGOPS Operating Systems Review, vol. 44, pp. 35-40, 2010.
[5] (April, 2012). TPC Benchmark DS. Available: http://www.tpc.org/tpcds/spec/tpcds_1.1.0.pdf.
[6] C. Baru, M. Bhandarkar, R. Nambiar, M. Poess and T. Rabl, "Setting the direction for big data benchmark standards," in Selected Topics in Performance Evaluation and Benchmarking, Springer, 2013, pp. 197-208.
[7] A. Nayak, A. Poriya and D. Poojary, "Type of NoSQL databases and its comparison with relational databases," International Journal of Applied Information Systems, vol. 5, 2013.
[8] Z. Parker, S. Poe and S. V. Vrbsky, "Comparing NoSQL MongoDB to an SQL DB," in Proceedings of the 51st ACM Southeast Conference, 2013, p. 5.
[9] (April 9, 2015). Sharding. Available: http://docs.mongodb.org/manual/sharding/.
[10] (June 04, 2015). MongoDB Documentation Release 3.0.3. Available: http://docs.mongodb.org/master/MongoDB-manual.pdf.
[11] BSON. Available: http://bsonspec.org/.
[12] N. Nurseitov, M. Paulson, R. Reynolds and C. Izurieta, "Comparison of JSON and XML data interchange formats: A case study," CAINE, vol. 2009, pp. 157-162, 2009.
[13] D. Pritchett, "BASE: An ACID alternative," Queue, vol. 6, pp. 48-55, 2008.
[14] (2014, September 25). Sharding Methods for MongoDB. Available: http://www.slideshare.net/mongodb/sharding-v-final.
[15] R. Han, X. Lu and J. Xu, "On big data benchmarking," in Big Data Benchmarks, Performance Optimization, and Emerging Hardware, Springer, 2014, pp. 3-18.
[16] M. Poess, R. O. Nambiar and D. Walrath, "Why you should run TPC-DS: A workload analysis," in Proceedings of the 33rd International Conference on Very Large Data Bases, 2007, pp. 1138-1149.
[17] A. Ghazal, T. Rabl, M. Hu, F. Raab, M. Poess, A. Crolotte and H. Jacobsen, "BigBench: Towards an industry standard benchmark for big data analytics," in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, 2013, pp. 1197-1208.
[18] S. Huang, J. Huang, J. Dai, T. Xie and B. Huang, "The HiBench benchmark suite: Characterization of the MapReduce-based data analysis," in Proceedings of the 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW), 2010, pp. 41-51.
[19] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi and S. Zhang, "BigDataBench: A big data benchmark suite from internet services," in Proceedings of the 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), 2014, pp. 488-499.
[20] (2015). AWS Documentation. Available: http://aws.amazon.com/documentation/.
[21] HashMap (Java Platform SE 7). Available: http://docs.oracle.com/javase/7/docs/api/java/util/HashMap.html.
[22] MongoDB Java Driver Documentation. Available: http://mongodb.github.io/mongo-java-driver/3.0/.
[23] ArrayList (Java Platform SE 7). Available: http://docs.oracle.com/javase/7/docs/api/java/util/ArrayList.html.
[24] B. G. Tudorica and C. Bucur, "A comparison between several NoSQL databases with comments and notes," in Proceedings of the 10th RoEduNet International Conference, 2011, pp. 1-5.
[25] M. Fotache and D. Cogean, "NoSQL and SQL databases for mobile applications. Case study: MongoDB versus PostgreSQL," Informatica Economica, vol. 17, pp. 41-58, 2013.
[26] V. Abramova and J. Bernardino, "NoSQL databases: MongoDB vs. Cassandra," in Proceedings of the International C* Conference on Computer Science and Software Engineering, 2013, pp. 14-22.
[27] TPC Documentation as of 04/27/2015. Available: http://www.tpc.org/information/current_specifications.asp.
[28] (Aug 26, 2013). Using TPC-DS to generate RDBMS performance and benchmark data. Available: http://www.innovation-brigade.com/index.php?module=Content&type=user&func=display&tid=1&pid=3.
[29] V. Herrero, A. Abello and O. Romero, "NOSQL design for analytical workloads: Variability matters," in Proceedings of the 35th International Conference on Conceptual Modeling (ER), 2016, pp. 50-64.
[30] C.-O. Truică, I. I. Bucur and A. Boicea, "Performance evaluation for CRUD operations in asynchronously replicated document oriented database," in Proceedings of the International Conference on Control Systems and Computer Science, 2015, pp. 191-196.
[31] (January 15, 2017). WiredTiger. Available: http://www.wiredtiger.com/.