CEUR Workshop Proceedings Vol-2212, paper 55 (https://ceur-ws.org/Vol-2212/paper55.pdf)
The Creation of Scalable Tools for Solving Big Data
Analysis Problems Based on the MongoDB Database


                    O I Vasilchuk1, A A Nechitaylo2, D L Savenkov3 and K S Vasilchuk4

                    1
                      Volga Region State University of Service, Gagarin st. 4, Togliatti, Russia, 445677
                    2
                      Samara National Research University, Moskovskoye shosse 34, Samara, Russia, 443086
                    3
                      Samara State University of Economics, Sovetskoi Armii st. 141, Samara, Russia, 443090
                    4
                      National Research University of Electronic Technology (MIET), Shokin Square 1, Zelenograd,
                    Moscow, Russia, 124498



                    Abstract. This article presents an analysis of using the MongoDB database for storing and
                    effectively mining data from open network sources. The paper attempts to use a NoSQL
                    database instead of a traditional SQL database in systems with strongly related information,
                    comparing the relational and non-relational approaches in terms of performance and
                    architecture.




1. Introduction
Modern data storage technologies have provided a practical opportunity to accumulate huge
amounts of information, which has allowed a qualitative change in attitude to the results of
analyzing stored information. It became possible to move from a descriptive process of analyzing
results obtained over a certain period of time to predictive data processing technologies that
make it possible to offer valid recommendations for the future. The use of relational databases
(MySQL, PostgreSQL, Oracle Database and others) to solve large data storage problems
becomes problematic. The main advantage of relational databases is the availability of
techniques for maintaining data integrity, achieved by storing links between data elements.
However, the storage and validation of these links require additional time, which, with significant
data volumes and poor data structure, makes relational databases difficult to use in some
real-time systems.
   As an alternative, there are NoSQL databases. One of their advantages is alternative formats
for storing data and the links between them.

2. Alternative Data Storage for Unified Text Formats
2.1. The Problem Formulation
With the development of metaprogramming came the concept of reflection: the ability of a
program to inspect and modify its own structure [1]. In object-oriented programming, reflection
popularized the technique whereby objects created by a program are serialized, based on
knowledge of the class structure, into representations in given formats [2]. Most often this
technique is used in web programming, where data is serialized to XML, JSON and other text
formats for transmission.


IV International Conference on "Information Technology and Nanotechnology" (ITNT-2018)
Data Science
O I Vasilchuk, A A Nechitaylo, D L Savenkov and K S Vasilchuk




   This led to the task of filtering such data across the different data fields of text formats,
including with standard filters (for example, XPath [3]).
   A common problem is storing data whose final form is some unified format (hereinafter
referred to as UF) [4]. Since a web application user (usually a client application) works only
with the UF, and internal data views are not available to it, the filters available to the client
are reduced to UF fields.
   In classical relational databases, storing such formats involves creating several linked tables
and subsequent cross queries [5]. To speed up such queries, indexes are created for the
corresponding keys.
   Consider an example of a web service that provides access to a UF of type JSON for a class
”book”. The schema of the UF is defined as follows:


      {
        "definitions": {},
        "$schema": "http://json-schema.org/example/schema#",
        "$id": "http://itnt18.ru/itnt18.json",
        "type": "object",
        "properties": {
          "title": {
            "$id": "/properties/title",
            "type": "string",
            "title": "The Title Schema."},
          "description": {
            "$id": "/properties/description",
            "type": "string",
            "title": "The Description Schema."},
          "comments": {
            "$id": "/properties/comments",
            "type": "array",
            "items": {
              "$id": "/properties/comments/items",
              "type": "object",
              "properties": {
                "user": {
                  "$id": "/properties/comments/items/properties/user",
                  "type": "string",
                  "title": "The User Schema." },
                "comment": {
                  "$id": "/properties/comments/items/properties/comment",
                  "type": "string",
                  "title": "The Comment Schema." }}}}}
      }


                                      Listing 1: UF definition in JSON format.

   We assume that the book is most often filtered by title. Sometimes we just want to find a
book, for example for a search function on a web site, and sometimes we





download a page with the book where, besides the book and its description, we also want to
display all the comments.
    It is common practice in a relational database to use the schema shown in figure 1.




                                      Figure 1. SQL data relations scheme.

   MongoDB stores data in a JSON-like format and can therefore keep documents directly in
the output view.
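To make the relational variant concrete, the linked-tables approach can be sketched as follows (a minimal sketch using SQLite for illustration; the table and column names are our own, derived from the UF fields, and are not taken from the original experiment):

```python
import sqlite3

# In-memory SQLite stands in for the relational side of the comparison.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE books (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    description TEXT
);
CREATE TABLE comments (
    id INTEGER PRIMARY KEY,
    book_id INTEGER NOT NULL REFERENCES books(id),
    user TEXT,
    comment TEXT
);
-- Indexes for the corresponding keys, as described above.
CREATE INDEX idx_books_title ON books(title);
CREATE INDEX idx_comments_book ON comments(book_id);
""")
cur.execute("INSERT INTO books VALUES (1, 'Moby-Dick', 'A novel')")
cur.executemany("INSERT INTO comments VALUES (?, 1, ?, ?)",
                [(1, "alice", "great"), (2, "bob", "too long")])
conn.commit()

# Reconstructing the UF requires a cross query (JOIN) plus
# client-side regrouping of the rows into one document.
rows = cur.execute("""
    SELECT b.title, b.description, c.user, c.comment
    FROM books b LEFT JOIN comments c ON c.book_id = b.id
    WHERE b.title = ?""", ("Moby-Dick",)).fetchall()
book = {"title": rows[0][0], "description": rows[0][1],
        "comments": [{"user": u, "comment": t}
                     for _, _, u, t in rows if u is not None]}
```

In the document store, by contrast, the whole `book` dictionary above would be stored and retrieved as a single document, with no JOIN or regrouping step.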

2.2. Object-Relational mapping advantages
ORM, or Object-Relational mapping, is a programming technique for mapping database
relations to the entities of an object-oriented programming language. It creates a virtual object
database inside the representation of a specific language.
    The main goal of the technique is to remove the need to write SQL queries to access database
data. The dual representation of data, relational and object-oriented, usually requires
programmers to write code that fetches data from the database in relational form, transforms
it into object-oriented form, and transforms it back into relational form to save changes.
Relational databases operate over sets of tables with simple data representations, which leads
to using the SQL ”JOIN” operation to obtain full object information. Since relational database
management systems usually do not implement a relational representation of the physical link
layer, the execution of several consecutive queries (referring to one ”object-oriented” data
structure) can be too expensive.
    Relational database management systems perform well on global queries affecting a large
area of memory, but object-oriented access is more effective when working with small amounts
of data, as it reduces the semantic gap between the object and relational data
representations [6].
    This dual representation increases the complexity of object-oriented code that works with
relational databases, making it more prone to errors.
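The round trip from relational rows to objects can be sketched in miniature (a hand-rolled sketch, not a real ORM framework; the class, table and column names are illustrative):

```python
import sqlite3
from dataclasses import dataclass, field

@dataclass
class Comment:
    user: str
    comment: str

@dataclass
class Book:
    title: str
    description: str
    comments: list = field(default_factory=list)

def load_book(cur: sqlite3.Cursor, book_id: int) -> Book:
    # Two consecutive queries to assemble one "object-oriented"
    # structure -- exactly the overhead the section describes.
    title, description = cur.execute(
        "SELECT title, description FROM books WHERE id = ?",
        (book_id,)).fetchone()
    book = Book(title, description)
    for user, text in cur.execute(
            "SELECT user, comment FROM comments WHERE book_id = ?",
            (book_id,)):
        book.comments.append(Comment(user, text))
    return book
```

A real ORM generates this fetching and transformation code automatically, but the underlying queries, and their cost, remain.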

2.3. Object-Relational mapping disadvantages
The most common problem with ORM as an abstraction over SQL is that it cannot fully abstract
away implementation details. Some ORM implementations work as SQL code generation tools,
while others do not use SQL equivalents at the external level.
   The reason the abstraction makes sense is that it simplifies writing code; but if using an ORM
framework still requires knowing SQL, the programmer’s effort is doubled. For example, the
popular ORM framework Hibernate uses its own HQL language for complex requests, which is very






                                           Figure 2. ORM data mapping.


semantically close to SQL. This breaks the uniformity of the code abstraction whenever the
programmer needs a specific combination of data processing.
   Inefficiency is another common problem of ORM. If a programmer needs to extract object
data from a relational database, the ORM cannot know which of the object’s properties are
going to be used or changed, so it is forced to extract all of them, causing many requests instead
of a few. The lack of context sensitivity means that the ORM cannot consolidate requests,
which precludes data caching and other compensation mechanisms.

2.4. Object-Document mapping
ODM, or Object-Document mapping, is the analogue of ORM for document-oriented databases.
The basic idea of ODM frameworks is the same, mapping data to objects, but there are a few
differences.
    Firstly, with ORM we must assemble complete data for the object, and complete data for
the backward mapping into the database. With ODM there is no requirement for the data to be
complete: a document can be partly mapped to the database, without changing multiple tables.
    Secondly, ODM can perform the mapping independently of the data source, which makes it
easier to operate over the data in a program.
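The partial mapping can be sketched as follows (a sketch with an in-memory simulation; the document contents are illustrative, and applying such an update against a real server would require a driver, which is not shown):

```python
# A stored document already in the final UF view.
book = {
    "title": "Moby-Dick",
    "description": "A novel",
    "comments": [{"user": "alice", "comment": "great"}],
}

# With ODM, a partial change maps to a partial update document;
# nothing outside the touched fields has to be assembled, and no
# second table is involved. MongoDB expresses such changes with
# update operators such as "$set" and "$push".
partial_update = {
    "$set": {"description": "A novel about a whale"},
    "$push": {"comments": {"user": "bob", "comment": "too long"}},
}

def apply_update(doc, update):
    # Minimal in-memory simulation of the two operators used above.
    for key, value in update.get("$set", {}).items():
        doc[key] = value
    for key, value in update.get("$push", {}).items():
        doc.setdefault(key, []).append(value)
    return doc

apply_update(book, partial_update)
```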








2.5. NoSQL databases
Relational databases are usually based on ACID – Atomicity, Consistency, Isolation,
Durability [7]. ACID is a common requirement for transactional systems.
   NoSQL databases are usually based on BASE:
  • basic availability – every request will be completed (successfully or not)
  • soft state – the system state may change over time, even without new input, as the data
    moves toward consistency
  • eventual consistency – the data can be inconsistent for some time but will become consistent
    after a while.
   Obviously, NoSQL databases cannot be used in every application. Some applications require
transactional guarantees (banking, e-commerce, etc.), while at the same time a typical ACID
system does not suit systems built on very large data storages, such as amazon.com and others.
NoSQL databases therefore sacrifice data consistency to obtain a more scalable system that can
operate over large amounts of data.
   NoSQL databases also offer the following features:
  • Application of various types of storage facilities.
  • Ability to develop a database without specifying a schema.
  • Linear scalability (adding CPUs increases performance).
  • Innovation: many opportunities for data storage and processing.

2.6. NoSQL databases common types
Unlike relational databases, NoSQL databases have various data schemas, implemented through
the use of different data structures.
    Depending on the data schema and the approaches to distribution and replication, four types
of storage can be distinguished: key-value stores, document stores, column database stores and
graph databases.
    Key-value storage
    The key-value store is the simplest data store, using a key to access a value. Such stores
are used to hold media images, to create specific file systems, as caches for objects, and as
systems that are well scalable by design. Examples of such storage facilities are Berkeley DB,
MemcacheDB, Redis, Riak, Amazon DynamoDB [8].
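The defining restriction of this model, access strictly by key with no queries over values, can be sketched in a few lines (a toy in-memory store for illustration, not any of the systems listed above):

```python
class KeyValueStore:
    """Toy key-value store: values are opaque, lookup is by key only."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        # O(1) access by key; there is deliberately no way to query
        # by value, which is what makes the model trivially shardable.
        return self._data.get(key, default)

store = KeyValueStore()
store.put("session:42", {"user": "alice"})
```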
    Bigtable-like databases (column database stores)
    In this store, data is kept as a sparse matrix whose rows and columns are used as keys. A
typical application of this type of database is web indexing, as well as tasks related to large
data with reduced requirements for data consistency. Examples of databases of this type are:
Apache HBase, Apache Cassandra, Apache Accumulo, Hypertable, SimpleDB.
    Column family stores and document-based repositories have similar usage scenarios: content
management systems, blogs, event logging. The use of timestamps allows this type of storage
to be used for organizing counters, as well as for registering and processing various time-related
data.
    Column family stores should not be confused with column stores. The latter are relational
databases with separate storage of columns (in contrast to the more traditional row-by-row
data storage) [9].
    Document-based database management system
    Document-oriented databases serve to store hierarchical data structures. They are used in
content management systems, publishing, document search, and so on. Examples of this type
of database are CouchDB, Couchbase, MarkLogic, MongoDB, eXist, Berkeley DB XML.







   Databases based on graphs
   Graph databases are used for tasks in which the data has a large number of links, for example
social networks and fraud detection. Examples: Neo4j, OrientDB, AllegroGraph, Blazegraph,
InfiniteGraph, FlockDB, Titan [10].
   Since the edges of a graph are materialized, that is, stored explicitly, traversing the graph
does not require additional computation (like JOIN in SQL), but finding the initial vertex of
the traversal requires indexes. Graph databases generally support ACID and have their own
query languages, such as Gremlin and Cypher (Neo4j).
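The contrast with JOIN-based traversal can be illustrated with a toy adjacency structure (materialized edges; the vertex names are illustrative):

```python
# Materialized edges: each vertex stores direct references to its
# neighbours, so traversal needs no join-like computation.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["carol"],
    "carol": [],
}

def reachable(graph, start):
    # An index (here simply the dict lookup) is still needed to find
    # the initial vertex; after that, edges are followed directly.
    seen, stack = set(), [start]
    while stack:
        vertex = stack.pop()
        if vertex not in seen:
            seen.add(vertex)
            stack.extend(graph[vertex])
    return seen
```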

2.7. Motivation for MongoDB
As an alternative to the classical approach, there are NoSQL databases. For tasks involving a
UF, document-based databases are the most commonly used.
    They have the following advantages:
  • They work with unstructured data, which makes it possible to add new data fields at no
    additional cost.
  • They allow a compromise to be found between performance and reliability.
  • They work with ODM frameworks (Object-Document mapping) – an alternative to ORM
    [11]. For optional data this makes it possible to map an object without cross queries,
    making the mapping operation faster.
  • JavaScript is supported on the server side.
    As a working example, the authors used MongoDB. This choice is due to its prevalence and
its record in high-load projects [12].
    Based on the benchmarking tests of top NoSQL databases conducted by ”End Point
Corporation”, the authors systematized MongoDB performance indicators for various hardware
configurations, summarized in table 1 [13].


             Table 1. Query time analysis results (operations per second).
     Node numbers Reading Reading/Writing Reading/Writing/Changing
     1             2 149      1 278                  1 261
     2             2 588      1 441                  1 480
     4             2 752      1 801                  1 754
     8             2 165      2 195                  2 028
     16            7 782      1 230                  1 114
     32            6 983      2 335                  2 263

   The performance comparison experiment was conducted in the cloud services of ”Amazon
Web Services EC2”, which provides an industrial platform for systems that require horizontal
extension of the architecture, such as distributed non-relational databases. In order to minimize
measurement errors related to the current load of ”Amazon Web Services EC2”, each set of test
scenarios was run three times, at least 24 hours apart, using newly created clusters with the
hardware configurations described in table 2:


                Table 2. Hardware configuration of ”Amazon Web Services EC2”.
      Node Class Configuration                                   Application
      i2.xlarge     30.5 GB RAM, 4 CPU, one SSD with 800GB database nodes
      c3.xlarge     7.5GB RAM, 4 CPU                             database client nodes







   The nodes ran the Ubuntu 14.04 LTS AMI in HVM (hardware virtual machine) virtualization
mode, customized with Oracle Java 7. For each test, an empty database was used as the starting
point. The client applications were programmed to enter randomly generated information into
the database. After the database was populated, each of the test scenarios was executed
sequentially. All clients performed requests in parallel, and then waited for all operations to
complete and the corresponding results to be obtained. The client software was supplemented
with the free YCSB software package, designed to analyze the performance of NoSQL
databases [14]. The ”End Point Corporation” study makes it possible to determine the number
of nodes needed to host MongoDB for particular business tasks, based on the anticipated load,
when each node machine uses an SSD drive. Nevertheless, despite its obvious competitive
advantages, such as small size and weight, as well as random IOPS exceeding the more common
HDDs by an order of magnitude, the SSD remains inferior to the latter in cost [15].

2.8. Preconditions
A SQL database with 5 million entries for the book instance and 10 million for comments (two
per book entry) was used, and 5 million entries for the full book instance (with comments inside
the view, also two comments per book) were used in the MongoDB collection.
   Three main scenarios were examined:
  • Find by title
  • Find by comment
  • Find comments for the book
   Since MongoDB stores the data directly in the final UF view, the query is identical for
searching for a book and for requesting the extended output.
   Values averaged over 10 experiments are used to exclude the non-deterministic influence of
external factors.
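The three scenarios can be sketched as queries (a sketch under our own naming assumptions: the tables, the single collection holding the UF, and the concrete search values are illustrative and are not taken from the experiment):

```python
# Relational (MySQL-style) variants of the three scenarios.
sql_queries = {
    "find_by_title":
        "SELECT * FROM books WHERE title = %s",
    "find_by_comment":
        "SELECT b.* FROM books b "
        "JOIN comments c ON c.book_id = b.id WHERE c.comment LIKE %s",
    "comments_for_book":
        "SELECT c.* FROM comments c "
        "JOIN books b ON c.book_id = b.id WHERE b.title = %s",
}

# MongoDB variants: filters over a single collection holding the UF.
# The first and third scenarios hit the very same document, which is
# why MongoDB needs no extra query to return the comments as well.
mongo_filters = {
    "find_by_title": {"title": "Moby-Dick"},
    "find_by_comment": {"comments.comment": {"$regex": "whale"}},
    "comments_for_book": {"title": "Moby-Dick"},
}
```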

2.9. Results


                        Table 3. MongoDB and MySQL comparison (in seconds).
                      DB                         MongoDB 3.4 MySQL 5.4
                      Find by title              0.434527       1.646251
                      Find by comment            5.826707       1.387221
                      Find comments for the book 0.434527       4.360173


    As can be seen from the results (figure 3 and table 3), the search query in the MongoDB
collection is more effective than the search in MySQL. The most effective scenario is searching
for comments on the book, that is, additional information related to the main entity; here the
speed of MongoDB surpasses MySQL almost 5 times. Nevertheless, it is important to understand
that if the search target itself is the additional information (the comments here), the query speed
for MongoDB is considerably worse than in traditional relational databases.








                           Figure 3. MongoDB and MySQL results comparison.


3. Conclusion
The conducted research leads to the conclusion that document-oriented databases are expedient
for storing large amounts of data for indexing purposes when the number of supported links is
small or zero. The expediency is confirmed by the fact that such use allows optimal computing-
system configurations to be selected for current and future business tasks, with the possibility
of horizontal and vertical scaling, under better performance conditions than with relational
equivalents.


4. References
[1] Kiczales G, des Rivières J and Bobrow D G 1991 The Art of the Metaobject Protocol (MIT Press)
[2] JavaScript Reflect Global Object (Access mode: https://developer.mozilla.org/en-US/docs/
Web/JavaScript/Reference/GlobalObjects/Reflect) (18.10.2017)
[3] XPath 3.1 Specification (Access mode: https://www.w3.org/TR/xpath-31/) ( 18.10.2017)
[4] Bell C 2012 Expert MySQL (APress)
[5] Protsenko V I, Kazanskiy N L and Serafimovich P G 2015 Computer Optics 39(4)
582-591 DOI: 10.18287/0134-2452-2015-39-4-582-591
[6] Lourenço J R, Cabral B and Carreiro P 2015 Choosing the right NoSQL database for the
job: a quality attribute evaluation Journal of Big Data 2 18
[7] Lake P and Crowther P 2013 NoSQL databases Concise Guide to Databases Undergraduate Topics in
Computer Science DOI: 10.1007/978-1-4471-5601-7_5
[8] Kazanskiy N L, Protsenko V I and Serafimovich P G 2014 Computer Optics 38(4) 804-810
[9] Singh M and Kaur K 2015 Sql2neo: Moving health-care data from relational to graph
databases IEEE International Advance Computing Conference 7154801






[10] Kazanskiy N L, Protsenko V I and Serafimovich P G 2017 Procedia Engineering 201 817
DOI: 10.1016/j.proeng.2017.09.602
[11] Richardson L and Ruby S 2007 RESTful Web Services (Beijing: O’Reilly)
[12] MongoDB Official site (Access mode: https://www.mongodb.org) (18.10.2017)
[13] Benchmarking Top NoSQL Databases (Access mode: https://www.datastax.com/wp-content/
themes/datastax-2014-08/files/NoSQL Benchmarks EndPoint.pdf) (18.10.2017)
[14] Yahoo! Cloud System Benchmark (Access mode: https://github.com/joshwilliams/YCSB)
(18.10.2017)
[15] Amazon EC2 Instance Types (Access mode: https://aws.amazon.com/ru/ec2/instance-types/)
(18.10.2017)



