



ANALYTICS AND STORAGE OF BIG DATA
Shubham Upadhyay, Rakesh Manwani, Saksham Varshney and Sarika Jain
NIT Kurukshetra, Kurukshetra, Haryana, 136119, India
supadhyay567@gmail.com (S. Upadhyay); rehan.manwani.56@gmail.com (R. Manwani); sakshamvarshney8@gmail.com (S. Varshney); jasarika@gmail.com (S. Jain)

International Semantic Intelligence Conference (ISIC 2021)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073), http://ceur-ws.org


Abstract
Data generated by devices and users in modern times is high in volume and variable in structure. Collectively termed Big Data, it is difficult to store and process using traditional processing tools. Traditional systems store data on physical servers or the cloud, resulting in higher cost and space complexity. In this paper, we provide a survey of various state-of-the-art research works that address the inefficient storage problem of Big Data, together with a comparative review of the existing works. As a solution to the problem encountered, we propose to split the Big Data into small chunks and provide each chunk to a different cluster for removing the redundant data and compressing it. Once every cluster has completed its task, the data chunks are combined back and stored on the cloud instead of on physical servers. This effectively reduces storage space and achieves parallel processing, thereby decreasing the processing time for very large data sets.

Keywords
Big Data, Cloud Computing, Data Analytics, Data Compression, Storage System



1. Introduction

Digitization is generating a huge amount of data, and Information Technology organizations have to deal with this huge data [1], which is very difficult to manage and store. This data comes from various sources like social media, IoT devices, sensors, mobile networks, etc. According to some figures, two-thirds of the total data has been generated in the last two years [2]. According to Intel, smart cars generate 4000 GB of data per day.

Traditionally, companies prefer to deploy their own servers for data storage, but as the volume of data increases it becomes challenging for the companies to manage the required infrastructure and the cost associated with it; this also poses flexibility issues. The problem of managing data can be handled by using infrastructures like the Cloud, which provide close to unlimited storage along with services such as data security, so that data owners do not have to put much effort into it and can focus on their day-to-day tasks.

Along with being large, the data is also complex, which poses problems when it has to be processed with traditional processing tools. For this, we need dedicated tools that facilitate the processing of this data, which are part of Big Data Computing. It involves a master-slave architecture in which a single master node assigns tasks to slave nodes, which work in a parallel fashion. This facilitates faster processing. The data is generally scattered and broadly classified into three types:

• Structured: Data that can be stored in tables made of rows and columns comprising a database is termed structured data. This type of data can be easily processed. E.g., relational data.
• Semi-Structured: Data that cannot be stored in the form of a database but can still be easily analyzed is termed semi-structured data. It usually occupies less space. E.g., XML data.
• Unstructured: Unstructured data needs an alternative platform for storage and management and is mainly used in organizations with business intelligence. E.g., PDF and media files.

As each structure has different features, they need to be processed by different tools, making it difficult to define a single mechanism to process big data efficiently. Along with its complex structure, big data is also characterized by the 5 V's, which define the things a system developer has to keep in mind while dealing with Big Data. These V's are:

• Velocity: Velocity is the speed of data generation, analysis, and collection. With each day, the velocity of data keeps on increasing.
• Volume: Volume is the amount of data, which is generated from social media, credit cards, sensors, etc. The volume of data is so large that it is difficult to manage, store, and analyze it.
• Value: In the current scenario, data is money. It is worthless if we cannot extract value from it. Having huge data is something, but if we cannot extract value from it, then it is useless.
• Variety: Data is available in three forms: structured, semi-structured, and unstructured. The volume of structured data is very small as compared to unstructured data.
• Veracity: Veracity refers to the quality of the data, that is, how accurate the data is. It tests the reliability and accuracy of the content.

The Cloud is a better storage platform where our data is easily accessible and secure, so business firms have started storing their data in the cloud. But the rate of growth of data is exponential; as a result, cloud servers also lack such a huge volume of storage. Therefore, there emerges a need to select important data and store it in a way that it fits in less memory space and is cost-effective. To achieve this objective we require a system that can perform this task in less time, but a single system is not able to do this task efficiently; thus we require an environment where we can achieve parallel processing to perform this task fast.

Figure 1: Parallel Processing Environment.

Fig. 1 shows a way to process the data faster by dividing it into small chunks and assigning each chunk to a cluster, which processes the data chunk provided by the master node. This achieves parallel processing and increases the rate of processing (a minimal code sketch of this idea is given after the benefits listed below).

Challenges with Big Data Storage System

Where there are opportunities, there are challenges. Along with the various benefits that can be gained from large data sets, big data storage poses various challenges too. Various computational methods that work well with small data sets do not work well with big data sets. The following are some major challenges with big data storage:

• Format of Data: While working on Big Data storage management, one of the prime challenges is to build a system that deals with both structured and unstructured data equally well.
• Data Storage: Volume, Variety, and Velocity of Big Data lead to various storage challenges. Traditional Big Data storage is quite challenging as Hard Disk Drives (HDDs) fail very often and cannot assure efficient data storage even with several data protection mechanisms. In addition, the velocity of big data creates a need for scalable storage management to cope with it. Though the Cloud provides a potential solution to this problem with unlimited, highly fault-tolerant storage options, transferring the Big Data to the cloud and hosting it there is quite expensive for such a huge amount of data [3].
• Data Transmission: Transmission of data consists of several stages: a) collecting data coming from different origins, like sensors, social media, etc.; b) integrating the collected data that comes from different origins; c) the controlled transfer of integrated data to processing platforms; d) movement of data from server to host [4]. Transferring this amount of data between the different stages is quite challenging in various ways.
• Data Processing: Processing enormous volumes of data requires dedicated computing resources, and this is mostly addressed by the increasing speed of CPUs, networks, and storage. However, the computing resources required for handling huge data far surpass the processing power offered by traditional computing paradigms.
• Data Duplication: In a big data environment, most of the data sets have identical content and are redundant. Data duplication increases the cost and also takes more space.

Benefits of Data Analytics and Compression

Data Analytics comprises collecting data, removing data redundancy, analyzing results, etc., and compression comprises reducing the file size to save space. Combined, both can be used to address our research question. The following are the benefits of data analytics and compression of data:

• Less disk space: As the size of big data is very large, compressing it after analyzing it can help in reducing disk space to a great extent. This process frees a lot of space on the drive and, as a result, reduces the time required to retrieve the data from a server.
• Faster file transfer: After file compression, the size of the file is reduced, so transmitting a file of reduced size will be faster.
• Accessibility: Accessing managed data is relatively easier, as it allows faster searching over hugely populated data. Also, data can be accessed remotely from any place with internet connectivity.
• Cost Reduction: Using the cloud as a storage medium helps in reducing hardware costs as well as cutting energy costs. We can rent as much space as we want at a minimal cost.
• Virtualization: The cloud provides backup for big data and takes the burden of growing data off the enterprises. To provide a backup, it is recommended to make virtual copies of the applications instead of making physical copies of the analytics.
• Scalability: Big data management allows the application to grow exponentially as it deals with the storage space by itself. It reduces the need for new servers or supporting hardware as it manages the data in the existing ones.
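To make the chunk-and-cluster idea of Fig. 1 concrete, the following is a minimal, single-machine sketch in Python: worker processes stand in for the clusters, duplicates are assumed to be whole repeated lines of a plain-text data set, and the file names and chunk size are illustrative placeholders rather than values from this work.

import gzip
import multiprocessing as mp


def process_chunk(lines):
    # Remove duplicate lines inside this chunk (order-preserving) and
    # gzip-compress what remains. Duplicates spanning different chunks
    # are not removed in this simplified sketch.
    unique = dict.fromkeys(lines)
    return gzip.compress("".join(unique).encode("utf-8"))


def read_chunks(path, chunk_size=100_000):
    # Yield lists of at most chunk_size lines from the input file.
    chunk = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            chunk.append(line)
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk


if __name__ == "__main__":
    # One worker process per CPU core plays the role of one cluster in Fig. 1.
    with mp.Pool() as pool:
        compressed = pool.map(process_chunk, read_chunks("data.txt"))
    # Concatenated gzip members form a valid gzip file; the combined result
    # could then be uploaded to a cloud store instead of a physical server.
    with open("data_deduplicated.txt.gz", "wb") as out:
        for blob in compressed:
            out.write(blob)

In the proposed system, the same three steps (remove redundancy, compress, combine and upload) would run on separate cluster nodes rather than on local worker processes.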




A lot of work has been done in this direction by different authors, and the following are the contributions of this work: (1) An exhaustive study of the existing systems has been carried out and, based on three parameters, namely the data processing tools used, the data compression technique, and the data storage option used, a review has been done and summarized to give a gist of all the existing ways to deal with Big Data. (2) Based on the comparative study mentioned above, a few gaps in the existing systems are extracted and a solution to fill those gaps is discussed. This paper is structured into four sections. After introducing the paper in Section 1, we move to Section 2, discussing the works done so far in this direction and a comparative study of systems of different authors that deal with big data. Section 3 presents the gaps that come out of the previous work, and a solution is proposed to fill these gaps. Finally, Section 4 concludes the work, and targeted future work is also mentioned.

2. Related Work

Various systems have tried to achieve efficient data storage using various techniques. Based on some differentiating parameters, we have done a comparative study between the different storage management systems proposed by different authors in their research work. To identify the better techniques among the various systems, we use the following differentiating parameters: (a) Data Processing, (b) Data Compression, and (c) Data Storage. Different data processing and pre-processing techniques are used by the various systems to analyze big data and reduce data redundancy, and we differentiate the existing systems based on these techniques as well. The following are some important data processing tools for big data (an illustrative sketch using one of them follows the list):

• Apache Hive: Hive is a software project developed by Apache. It is used for providing query and analysis of Big Data. It has an SQL-like interface.
• Apache Spark: Spark is an open-source big data analytical engine that provides an environment for cluster computing and parallel processing of big data. It comes with inbuilt modules for SQL, machine learning, graph processing, etc.
• Hadoop MapReduce: Another tool that can be used for programming the Big Data model is Hadoop MapReduce. Various languages like Java, Python, and C++ are popularly used to write MapReduce programs.
• Apache Giraph: Apache Giraph is real-time graph processing software built for high scalability which is used to analyze social media data. Many multinational companies are using Giraph, tweaking the software for their purposes.
• PySpark: PySpark is a Python API exposing the Spark programming model to Python. It is a necessary supplementary API when one wants to process Big Data and apply machine learning algorithms at the same time.
• Prophet Method: Prophet is a procedure for forecasting time series data based on an additive model in which non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It gives its best results on time series that have strong seasonal effects and several seasons of historical data.
• Neural Network: Neural networks are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling, or clustering of raw input. Neural networks, or connectionist systems, are computing systems loosely inspired by the biological neural networks that constitute human brains. Such systems "learn" to perform tasks by considering examples, generally without being programmed with task-specific rules.
• MongoDB: MongoDB is a database management system intermediate between key-value stores and traditional Relational Database Management Systems. It is a document-oriented DB system, is classified as a NoSQL database system, and is a common database for Big Data processing.
• NoSQL Database: NoSQL stands for "Not Only SQL" and is used as an alternative to RDBMSs, where the schema has to be built before the actual database and the data is stored in the form of tables.
• Hashing: A hash function is any function that can be used to map data of arbitrary size to fixed-size values. These values are used to index a table of fixed size, termed a hash table. The result of a hash function is known as a hash value or simply a hash.
• Docker: Docker is a lightweight computing framework that provides containers as a service and is one of the leading container platforms. These containers are open source and easily accessible to anyone free of cost, which is why container services are in such enormous demand. Docker makes containers progressively more secure and easier to use, because it gives regular updates to the containers so that the container does not compromise the speed of its execution.
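As a small illustration of how one of these tools can be applied to the redundancy problem addressed in this paper, the following PySpark sketch removes duplicate records from a data set in parallel across a cluster. It assumes a working Spark installation; the input and output paths and the application name are placeholders, not values taken from the surveyed systems.

from pyspark.sql import SparkSession

# Build (or reuse) a Spark session; the application name is a placeholder.
spark = SparkSession.builder.appName("dedup-example").getOrCreate()

# Read the data set with one record per line and drop duplicate records.
# dropDuplicates() is executed in parallel on the worker nodes.
records = spark.read.text("hdfs:///data/input")
deduplicated = records.dropDuplicates()

print("records before:", records.count(), "after:", deduplicated.count())

# Write the de-duplicated data back out; it could then be compressed and
# uploaded to cloud storage, as proposed later in Section 3.
deduplicated.write.mode("overwrite").text("hdfs:///data/output")
spark.stop()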




Data compression techniques are also used by these systems, as compression of data results in an efficient and faster upload of data to storage. Again, we differentiate the systems based on the different compression algorithms and techniques they use. The following are some important data compression algorithms for structured big data files (a small sketch applying two of them is given after the storage options below):

• Huffman Coding: It is a greedy algorithm used for lossless compression of data. It makes sure that the prefix of one code is not the prefix of another code. It works on character bits and ensures that there is no ambiguity while decoding the bitstream.
• Entropy Encoding: Shannon, the father of information theory, proposed the forms of entropy encoding that are frequently used for lossless compression. This compression technique is based on information-theoretic techniques.
• Simple Repetition: n successive appearances of the same token in a sequence can be replaced with the token and a count representing the number of appearances. A flag is used to mark where the repeated token appears.
• Gzip: GNU zip is a modern-day compression algorithm whose main function is to compress and decompress files for faster network transfer. It reduces the size of the named file using Lempel-Ziv coding. It is based on Huffman encoding and uses the LZ77 approach, which looks at repeated partial strings within the text.
• Bzip2: It is based on the Burrows-Wheeler text-sorting algorithm and Huffman encoding, and works on blocks ranging from 100 to 900 KB. It is an open-source compression program for files and is free to all.

Various systems have either opted for traditional physical servers to store their big data or for cloud servers to make the system cost-effective. So, we also differentiate the storage option used by the different systems to identify the better system among them all. The following are some important storage options that can be used in a system to store data files:

• Physical Servers: Traditionally, Big Data is stored on physical servers and is extracted, transformed, and translated on that same server. They are generally managed, owned, and maintained by the company's staff.
• Cloud: A cloud server is a virtual server running in a cloud computing environment. It is accessed via the internet, can be accessed remotely, and is maintained by a third party. Customers need not pay for hardware and software; rather, they pay for resources. Some of the cloud providers are AWS, Microsoft Azure, Google Cloud Platform, IBM Cloud, Rackspace, Oracle Cloud, and Verizon Cloud.
• Hard Disk Drives: Manufacturers are making major improvements so that newer 10,000 rpm 2.5-inch Hard Disk Drives provide better performance than older 15,000 rpm 3.5-inch devices. These advancements include heat-assisted magnetic recording, which boosts the storage capacity of the device. These better Hard Disk Drives are improving rapidly and are providing a better environment to store Big Data.
• Federated Cloud System: Federated cloud systems are a combination of pre-existing or newly created internal or external clouds assembled to fulfill business needs. In this combination of several clouds, the clouds may perform different actions or a common action.
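To illustrate two of the codecs listed above together with the cloud storage option, the following sketch compresses one file with gzip and bzip2 from Python's standard library, keeps the smaller result, and, purely as an assumed example of the upload step, pushes it to S3-style object storage with boto3. The file name, bucket name, and credentials handling are hypothetical.

import bz2
import gzip
import os

import boto3  # only needed for the assumed cloud-upload step

with open("dataset.csv", "rb") as handle:   # placeholder input file
    raw = handle.read()

gz_bytes = gzip.compress(raw)   # DEFLATE: LZ77 plus Huffman coding
bz_bytes = bz2.compress(raw)    # Burrows-Wheeler transform plus Huffman coding

print(f"original {len(raw)} B, gzip {len(gz_bytes)} B, bzip2 {len(bz_bytes)} B")

# Keep whichever codec produced the smaller output for this data set.
name, data = min(
    [("dataset.csv.gz", gz_bytes), ("dataset.csv.bz2", bz_bytes)],
    key=lambda pair: len(pair[1]),
)
with open(name, "wb") as out:
    out.write(data)

# Assumed upload step: requires valid AWS credentials; the bucket name is hypothetical.
boto3.client("s3").upload_file(name, "example-big-data-bucket", os.path.basename(name))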




The efficient storage of big data is a major issue in most organizations, and that is why many researchers have tried to deal with it in many different ways. A discussion of a few works follows.

Jahanara et al. [5] proposed more secure big data storage protocols. The writers implemented this using an access control model (using parameters) and a honeypot (for invalid users) in a cloud computing environment. They concluded that it is necessary to change the trust and admission control procedures in a cloud location for big data processing. This system was suitable for analyzing, storing, and retrieving big data in a cloud environment.

Krish et al. [6] aimed to enhance the overall I/O throughput of Big Data storage using solid-state drives (SSDs), and for that they designed and implemented a dynamic data management system for big data processing. The system used a map-reduce function for data processing and SSDs for storage. They divided the SSD into two tiers and kept the faster tier as a cache for frequently accessed data. This system, as a result, shows only 5% overhead in performance and offers inexpensive and efficient storage management.

Hongbo et al. [7] proposed a system that reduces the movement of the data and relieves intense I/O performance congestion. They came up with a combination of two data compression algorithms that effectively selects and switches between them to accomplish the most optimal input-output results. They experimented with a real-time application on a cluster of 80 nodes comprising 1280 cores. They came up with the result that the available processors and the compression ratio are the two most important factors affecting the decision for compression.

Eugen et al. [8] aimed to evaluate the performance and energy friendliness of various Big Data tools and applications that use Hadoop map-reduce in the cloud environment. They conducted this evaluation on physical as well as virtual clusters with different configurations. They concluded that although Hadoop is popularly used in a virtual cloud environment, we are still unable to find its correct configuration.

Mingchen et al. [23] propose a systematic approach to identify trends and patterns with big data. Here Big Data Analytics is applied to criminal data: criminal data were collected and analyzed. They used the Prophet model and a neural network model for identifying the trend. In conclusion, they found that the Prophet model works better than the neural network model. They have not mentioned any compression technique used in their research.

Lu et al. [24] aimed to explore duplicate detection in news articles. They proposed a tool, NDFinder, that uses hashing to detect article duplication. They checked 33,244 news articles and detected 2150 duplicate articles; their precision reached 97%. They matched the hash values of different articles and reported those having similar values (a generic sketch of this hash-matching idea is given after the findings below). They used this research for finding plagiarism in various articles. They also found that the three fields with the highest proportion of plagiarism are sports news, technology news, and military news.

Moustafa et al. [13] proposed an algorithm to minimize bandwidth cost, as big data requires a lot of resources for computation and high bandwidth to allow data transfer. They proposed that a federated cloud system provides a cost-effective solution for analyzing, storing, and computing big data. This work also proposed a way to minimize electricity costs for analyzing and storing big data. They concluded that their proposed algorithms perform better than the existing approach when the application is large. The proposed algorithm minimizes the cost of energy and bandwidth utilization to a certain extent.

Carlos et al. [26] proposed a comparative study of database engines based on electronic health records for obesity detection. They collected 20,706,947 records from different hospitals. While loading a huge amount of data they found that MongoDB shows better performance; they also found that MySQL took double the time of MongoDB for database loading. For data query and retrieval, MySQL was found to be more efficient. So, they concluded that MongoDB is better for database loading and MySQL is better for data query and data retrieval.

Containers reduce the downtime in the backend and provide a better service to the clients. As Avanish et al. [29] showed in their research paper, Docker containers present better results in their benchmark testing than clusters made on virtual machines; they also stated that clusters on containers show better efficiency than those on virtual machines.

Based on the above-mentioned study, a conclusive comparison has been derived which depicts how the existing systems differ from each other based on our differentiating parameters. The existing systems can be categorized into the following two categories:
1. Systems with compression (in Table 1)
2. Systems without compression (in Table 2)

Findings from systems with compression:

• System 1: there is a requirement for RDF-specific storage techniques and algorithms for building efficient and high-performance RDF stores.
• System 2 performs parallel compression on different processors using different compression techniques. It discovered that the compression ratio and the available processors are the most significant factors affecting the compression choice.

Findings from systems without compression:

• System 1 processes Big Data using deep learning, a neural network, and the Prophet model, and discovered that the deep learning model and the Prophet model have better precision than the neural network model.
• System 2 implemented a tool, NDFinder, which detects duplicated data with a precision of 97%. From 33,240 records, they detected 2,150 positive duplicate records.
• System 3 reduced the operating cost arising from big data applications deployed on a federated cloud.
• System 4 compares Spark and Hadoop and found that Spark is a faster tool for Big Data processing as compared to Hadoop.
• System 5 analyzed the current technologies for big data storage and processing and came out with the result that MongoDB is better for database loading and MySQL is better for data query and data retrieval.
• System 6 shows that Hadoop clusters on Docker containers perform considerably faster than the Hadoop framework on virtual machines.
• System 7 states that NoSQL gives faster results than SQL.
• System 8 shows that Hadoop/MapReduce requires a highly specified environment, making the process almost impossible for low-specification systems, though parallel processing gives better performance.
• System 9 depicts that data locality is an important factor in energy efficiency.
• System 10 proposed a model to secure Big Data while storing it in the cloud; a honeypot is used as a trap to catch a thief, hacker, or unauthorized user.
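As a generic illustration of the hash-based duplicate detection used by systems such as NDFinder [24] (this is not the authors' implementation), documents can be fingerprinted with a cryptographic hash and grouped by identical digests; the small in-memory corpus below is a placeholder. Exact hashing only finds identical texts; near-duplicate detection would additionally need shingling or similarity measures.

import hashlib
from collections import defaultdict

# Placeholder corpus: article name -> article text.
documents = {
    "article_1.txt": "Rain expected over the weekend.",
    "article_2.txt": "Stocks closed higher on Friday.",
    "article_3.txt": "Rain expected over the weekend.",
}

# Fingerprint each document with SHA-256 after a simple normalization,
# then group the documents that share a digest.
groups = defaultdict(list)
for name, text in documents.items():
    digest = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
    groups[digest].append(name)

for names in groups.values():
    if len(names) > 1:
        print("duplicate group:", names)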




Table 1
Comparative study of systems with compression

       System No.          Paper Reference            Data Processing      Data Compression      Storage
            1          Yuan et al. (2019) [28]      Hashing & Indexing     Huffman Encoding      Physical Servers
             2        Hongbo Zou et al. (2014) [7]     Not mentioned         bzip2 and gzip       Not mentioned


Table 2
Comparative study of systems without compression

       System No.            Paper Reference                       Data Processing               Storage
            1.       Feng, Mingchen, et al. (2019) [23]   Prophet method and neural network      Cloud Storage
            2.              Lu et al. (2019) [24]           NDFinder, a hashing-based tool       Physical Servers
            3.        Moustafa Najm et al. (2019) [13]            Virtual Machines               Federated Cloud System
            4.           Cheng et al. (2019) [25]                  Spark and Hadoop              Cloud Storage
            5.            Carlos et al. (2019) [26]              MongoDB and MySQL               Cloud Storage
            6.           Avanish et al. (2018) [29]          Docker Containers vs Hadoop         Physical Storage
            7.           Pandey et al. (2016) [27]                      NoSQL                    Cloud Storage
            8.             Krish et al. (2016) [6]           Hadoop/MapReduce and Spark          Hard Disk Drive
            9.         Feller, Eugen et al. (2015) [8]            Hadoop/MapReduce               Physical Servers
           10.              Jahanara et al. [5]                  Hadoop/MapReduce                Cloud Storage




3. Research Question & Hypothesis

After summarizing the works depicted in Table 1 and Table 2, the following gaps are identified:

• Existing systems tried to reduce the size by compression, but the results were not worth the time spent on it, because compression of such a huge amount of data requires a lot of time and the result we get is not fast enough to compensate for that time.
• The transmission of big data involves a large number of bits and a lot of time is consumed, so during transmission there is a higher probability of data loss or bit corruption.

As the size of data increases day by day, servers are getting exhausted, requiring the installation of high-cost hardware. That is why there is a need to store the data in such a way that it utilizes less space and is cost-effective. The main objective is to reduce the storage size of big data and to provide an approach to keep working with the existing physical data storage even with an increased number of users and files. The system aims to reduce data redundancy and provide a cost-effective and flexible environment.

Expected output and outcome of the solution:
A storage management system that provides space-efficient storage of big data, reduction of data redundancy, and a cost-effective and flexible way to access data. Based on the facts mentioned in the previous section, our research revolves around the question of whether such a system can be built using different tools and algorithms that provide efficient yet faster storage of big data. This paper gives an idea of how our research progresses towards this goal. We hypothesize that different techniques can be used and combined to create a better system than the existing ones.

Design of proposed solution:
The aim is to reduce the storage space for storing big data. As the data sets are huge and also contain a lot of redundant data, we first try to remove the redundant data from the original data set. After removing redundant data, our data set is more accurate and precise. We then compress the data using any good data compression technique, and the data is finally stored on any desirable cloud server. We propose a system composed of three modules, as depicted in Fig. 2 and described below:




Figure 2: Proposed Architecture.

3.1. Data Processing:

Big data processing is the process of handling a large volume of data. Data processing is the manipulation of data by a computer; its main purpose is to find valuable information in raw data. Data processing can be of three types: manual data processing, batch data processing, and real-time data processing. We have the following tools and techniques for data processing:

• Apache Giraph
• Apache Hive
• Apache Spark
• Hadoop MapReduce
• PySpark
• Prophet Method
• Neural Network Method
• MongoDB
• NoSQL
• Hashing

3.2. Data Compression:

Data compression is the process of reducing the amount of information required for storage or transmission. It involves modifying, encoding, and converting bit structures so that the data consumes less space. A typical compression procedure eliminates or replaces repetitive bits and symbols to decrease the size. Compression can be of two types, lossy and lossless. We have the following tools and techniques for data compression:

• Huffman Encoding
• Entropy Encoding
• Simple Repetition
• Bzip2
• Gzip

3.3. Data Storage:

Data storage refers to storing information on a storage medium. There are many storage devices available in which we can store data, such as magnetic tape, disks like floppy disks, and ROMs, or we can store the data in the Cloud.

Benchmark System: The following two works can serve as benchmarks for any such project:

• The system discussed by Haoyu et al. in their work "CloST: A Hadoop-based Storage System for Big Spatio-Temporal Data Analytics" [9]. They tried to reduce the storage space and query processing time. They proposed a three-level hierarchical partitioning approach for parallel processing of all the objects at different nodes. They used Hadoop MapReduce for data processing and column-level gzip for data compression.
• The one discussed by Avanish et al. in their work "Comparative Study of Hadoop over Containers and Hadoop over Virtual Machine" [29]. They tried to attain a faster parallel processing environment, which is developed using Docker containers rather than the Hadoop framework on virtual machines.

4. Conclusion

In this paper, a review has been done of data processing and compression techniques, noting that each technique has its benefit for a specific type of data set. This effort is made to explore as many of the important existing techniques used for the purpose as possible. We have also proposed an architecture that can achieve the initial objective of reducing storage space. To achieve this, the architecture first performs analytics on the data set, and then its size is reduced to an extent that makes it easy to store. After analysis and compression, the resultant data set is stored in a cloud environment to provide better scalability and accessibility. Much research has already been conducted to resolve the issue of efficient big data storage, but some more decisive steps still need to be taken.




References

[1] Li, Yang, and Yike Guo. "Wiki-health: from quantified self to self-understanding." Future Generation Computer Systems 56 (2016): 333-359.

[2] Bernard Marr. "How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read." Forbes, 21 May 2018.

[3] Yang, Chaowei, Yan Xu, and Douglas Nebert. "Redefining the possibility of digital Earth and geosciences with spatial cloud computing." International Journal of Digital Earth 6.4 (2013): 297-312.

[4] Huang, Qunying, et al. "Cloud computing for geosciences: deployment of GEOSS clearinghouse on Amazon's EC2." Proceedings of the ACM SIGSPATIAL international workshop on high performance and distributed geographic information systems. 2010. Sharma, Sushil, et al. "Image Steganography using Two's Complement." International Journal of Computer Applications 145.10 (2016): 39-41.

[5] Akhtar, Jahanara, et al. "Big Data Security with Access Control Model and Honeypot in Cloud Computing." International Journal of Computer Applications 975: 8887.

[6] Krish, K. R., et al. "On efficient hierarchical storage for big data processing." 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid). IEEE, 2016.

[7] Zou, Hongbo, et al. "Improving I/O performance with adaptive data compression for big data applications." 2014 IEEE International Parallel & Distributed Processing Symposium Workshops. IEEE, 2014.

[8] Feller, Eugen, Lavanya Ramakrishnan, and Christine Morin. "Performance and energy efficiency of big data applications in cloud environments: A Hadoop case study." Journal of Parallel and Distributed Computing 79 (2015): 80-89.

[9] Tan, Haoyu, Wuman Luo, and Lionel M. Ni. "CloST: a Hadoop-based storage system for big spatio-temporal data analytics." Proceedings of the 21st ACM International Conference on Information and Knowledge Management. 2012.

[10] Prasad, Bakshi Rohit, and Sonali Agarwal. "Comparative study of big data computing and storage tools: a review." International Journal of Database Theory and Application 9.1 (2016): 45-66.

[11] Schneider, Robert D. "Hadoop for dummies." John Wiley & Sons (2012).

[12] Prasetyo, Bayu, et al. "A review: evolution of big data in developing country." Bulletin of Social Informatics Theory and Application 3.1 (2019): 30-37.

[13] Najm, Moustafa, and Venkatesh Tamarapalli. "Cost-efficient Deployment of Big Data Applications in Federated Cloud Systems." 2019 11th International Conference on Communication Systems & Networks (COMSNETS). IEEE, 2019.

[14] Chatterjee, Amlan, Rushabh Jitendrakumar Shah, and Khondker S. Hasan. "Efficient Data Compression for IoT Devices using Huffman Coding Based Techniques." 2018 IEEE International Conference on Big Data (Big Data). IEEE, 2018.

[15] Yin, Chao, et al. "Robot: An efficient model for big data storage systems based on erasure coding." 2013 IEEE International Conference on Big Data. IEEE, 2013.

[16] Yang, Chaowei, et al. "Big Data and cloud computing: innovation opportunities and challenges." International Journal of Digital Earth 10.1 (2017): 13-53.

[17] Jagadish, Hosagrahar V., et al. "Big data and its technical challenges." Communications of the ACM 57.7 (2014): 86-94.

[18] Bryant, Randal, Randy H. Katz, and Edward D. Lazowska. "Big-data computing: creating revolutionary breakthroughs in commerce, science and society." (2008).

[19] Agrawal, Divyakant, et al. "Challenges and opportunities with Big Data 2011-1." (2011).

[20] Xin, Luna Dong. "Big Data Integration (Synthesis Lectures on Data Management)." (2015).

[21] Khan, Nawsher, et al. "Big data: survey, technologies, opportunities, and challenges." The Scientific World Journal 2014 (2014).

[22] Kodituwakku, S. R., and U. S. Amarasinghe. "Comparison of lossless data compression algorithms for text data." Indian Journal of Computer Science and Engineering 1.4 (2010): 416-425.

[23] Feng, Mingchen, et al. "Big data analytics and mining for effective visualization and trends forecasting of crime data." IEEE Access 7 (2019): 106111-106123.

[24] Lu, Lu, and Pengcheng Wang. "Duplication Detection in News Articles Based on Big Data." 2019 IEEE 4th International Conference on Cloud Computing and Big Data Analysis (ICCCBDA). IEEE, 2019.

[25] Cheng, Yan, Qiang Zhang, and Ziming Ye. "Research on the Application of Agricultural Big Data Processing with Hadoop and Spark." 2019 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA). IEEE, 2019.

[26] Martinez-Millana, Carlos, et al. "Comparing data base engines for building big data analytics in obesity detection." 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS). IEEE, 2019.

[27] Pandey, Manish Kumar, and Karthikeyan Subbiah. "A novel storage architecture for facilitating efficient analytics of health informatics Big Data in cloud." 2016 IEEE International Conference on Computer and Information Technology (CIT). IEEE, 2016.

[28] Yuan, Pingpeng, et al. "Big RDF Data Storage, Computation, and Analysis: A Strawman's Arguments." 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS). IEEE, 2019.

[29] Singh, Avanish, et al. "Comparative Study of Hadoop over Containers and Hadoop over Virtual Machine." International Journal of Applied Engineering Research 13.6 (2018): 4373-4378.

[30] Sayood, Khalid. Introduction to Data Compression. Morgan Kaufmann, 2017.