<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BIG DATA</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shubham Upadhyay</string-name>
          <email>supadhyay567@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rakesh Manwani</string-name>
          <email>rehan.manwani.56@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Saksham Varshney</string-name>
          <email>sakshamvarshney8@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sarika Jain</string-name>
          <email>jasarika@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>NIT Kurukshetra</institution>
          ,
          <addr-line>Kurukshetra, Haryana, 136119</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <fpage>202</fpage>
      <lpage>210</lpage>
      <abstract>
        <p>Data generated by the devices and the users in modern times is high in volume and variable in structure. Collectively termed Big Data, it is difficult to store and process using traditional processing tools. Traditional systems store data on physical servers or the cloud, resulting in higher cost and space complexity. In this paper, we provide a survey of various state-of-the-art research works done to handle the inefficient storage problem of Big Data. We provide comparative literature to compare the existing works that handle Big Data. As a solution to the problem encountered, we propose to split the Big Data into small chunks and provide each chunk to a different cluster for removing the redundant data and compressing it. Once every cluster has completed its task, the data chunks are combined back and stored on the cloud rather than on physical servers. This effectively reduces storage space and achieves parallel processing, thereby decreasing the processing time for very large data sets.</p>
      </abstract>
      <kwd-group>
        <kwd>Big Data</kwd>
        <kwd>Cloud Computing</kwd>
        <kwd>Data Analytics</kwd>
        <kwd>Data Compression</kwd>
        <kwd>Storage System</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Digitization is generating a huge amount of data, and
Information Technology Organizations have to deal with
this huge data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which is very difficult to manage and
store. This data comes from various sources like social
media, IoT devices, sensors, mobile networks, etc.
According to some figures, two-thirds of the total data has been
generated in the last two years [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. According to Intel, smart
cars generate 4000GB of data per day.
      </p>
      <p>Traditionally, companies prefer to deploy their own servers
for data storage, but as the volume of data increases it
becomes challenging for them to manage the required
infrastructure and the cost associated with it; this also
poses flexibility issues. The problem of managing this data
can be handled by using infrastructures like the Cloud, which
provide close to unlimited storage along with services such
as data security, so that data owners don't have to put much
effort into it and can focus on their day-to-day tasks.</p>
      <p>Along with being large, the data is also complex, which
poses problems when it has to be processed with
traditional processing tools. For this, we need dedicated tools
that facilitate the processing of this data; such tools are
part of Big Data Computing. It involves a master-slave
architecture in which a single master node assigns tasks
to slave nodes, which work in parallel. This facilitates
faster processing. The data is generally scattered and
broadly classified by the following characteristics:
• Value: In the current scenario, data is money. It is
worthless if we can't extract value from it. Having
huge data is something, but if we can't extract value
from it, then it is useless.
• Variety: Data is available in three forms:
structured, semi-structured, and unstructured. The volume
of structured data is very small compared to
unstructured data.
• Veracity: Veracity refers to the quality of the data. It
means how accurate the data is. It tests the reliability and
accuracy of the content.</p>
      <p>The Cloud is a better storage platform where our data is
easily accessible and secure, so business firms have started
storing their data in the cloud. But the rate of growth of
data is exponential; as a result, even cloud servers lack
such a huge volume of storage. Therefore, there emerges
a need to select important data and store it in a way
that fits in less memory space and is cost-effective. To
achieve this objective we require a system that can perform
this task in less time; a single system cannot do this
efficiently, so we require an environment where we can
achieve parallel processing to perform the task fast.</p>
      <p>Fig. 1 shows a way to process the data faster by dividing it
into small chunks and assigning each chunk to a cluster,
which processes the data chunk provided by the master
node. This achieves parallel processing and increases
the rate of processing, as in the sketch below.</p>
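      <p>As a minimal illustration of the idea in Fig. 1 (a sketch only,
assuming the data is an in-memory byte string and using Python's
standard multiprocessing and zlib modules; the 64 MB chunk size
is an arbitrary choice, not taken from the paper):</p>
      <preformat>
# Master splits the data into chunks; each worker ("cluster" in
# Fig. 1) compresses its own chunk in parallel.
import zlib
from multiprocessing import Pool

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB per chunk (illustrative)

def compress_chunk(chunk):
    # One worker handles one chunk independently.
    return zlib.compress(chunk)

def parallel_compress(data, workers=4):
    chunks = [data[i:i + CHUNK_SIZE]
              for i in range(0, len(data), CHUNK_SIZE)]
    with Pool(workers) as pool:
        # Results come back in order, so the chunks can be recombined.
        return pool.map(compress_chunk, chunks)
      </preformat>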
      <p>
        Challenges with Big Data Storage Systems:
Where there are opportunities, there are challenges. Along
with the various benefits that can be gained from large data
sets, big data storage poses various challenges too.
Computational methods that work well with small
data sets won't work well with big data sets. The
following are some major challenges with big data storage:
• Format of Data: While working on Big Data storage
management, one of the prime challenges is to build
a system that deals with both structured and
unstructured data equally well.
• Data Storage: The volume, variety, and velocity of Big
Data lead to various storage challenges. Traditional
Big Data storage is quite challenging, as Hard Disk
Drives (HDDs) fail very often and can't assure
efficient data storage with adequate data protection
mechanisms. In addition, the velocity of big data creates
a need for scalable storage management to cope
with it. Though the Cloud provides a potential solution
to the problem, with unlimited storage options that are
highly fault tolerant, transferring the Big Data
to and hosting it on the cloud is quite expensive with
this huge amount of data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
• Data Transmission: Transmission of data consists
of several stages: a) collecting data coming from
different origins, like sensors, social media, etc.; b)
integration of the data collected from the different
origins; c) controlled transfer of the integrated data
to processing platforms; d) movement of data from
server to host [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Transferring this amount of data
across the different stages is quite challenging in various
ways.
• Data Processing: Processing enormous volumes of
information requires dedicated computing resources, and
this is mostly taken care of by the increasing speed of
CPUs, networks, and storage. However, the
computing resources required for handling huge information
far surpass the processing power offered by traditional
computing paradigms.
• Data Duplication: In a big data environment, many of
the data sets have identical content and are redundant.
Data duplication increases the cost and also takes more
space.
      </p>
      <p>Benefits of Data Analytics and Compression:
Data analytics comprises collecting data, removing data
redundancy, analyzing results, etc., and compression
comprises reducing the file size to save space. Combined,
the two can be used to address our research question.
The following are the benefits of data analytics and
compression of data:
• Less disk space: As the size of big data is very large,
compressing it after analysis can reduce
disk space to a great extent. This process releases a lot
of space on the drive, which in turn reduces the time
required to retrieve the data from a server.
• Faster file transfer: After compression, the
size of the file is reduced, so transmitting
a file of reduced size is faster.
• Accessibility: Accessing managed data is relatively
easy, as it allows faster searching over hugely
populated data. Also, data can be accessed remotely from
any place with internet connectivity.
• Cost Reduction: Using the cloud as a storage medium
helps in reducing hardware costs as well as cutting
energy costs. We can rent as much space as we want at
a minimal cost.
• Virtualization: The cloud provides backup for big data
and takes the burden of growing data off the
enterprises. To provide a backup, it is recommended
to make virtual copies of the applications instead of
making physical copies for the analytics.
• Scalability: Big data management allows the
application to grow exponentially, as it deals with the storage
space by itself. It reduces the need for new servers
or supporting hardware, as it manages the data in the
existing ones.</p>
      <p>A lot of work has been done in this direction
by different authors, and the following are the
contributions of this work: (1) An exhaustive study of the existing
systems has been carried out; based on three parameters,
namely the data processing tools used, the data compression
technique, and the data storage option used, a review has
been done and summarized to give a gist of all the existing
ways to deal with Big Data. (2) Based on the
comparative study mentioned above, a few gaps in the existing
systems are extracted and a solution to fill those gaps is
discussed. This paper is structured into four sections.
After introducing the paper in section 1, we move to
section 2, discussing the works done so far in this
direction and a comparative study of the systems of different
authors that deal with big data. Section 3 presents the
gaps that come out of the previous work, and
a solution is proposed to fill these gaps. Finally, section
4 concludes the work, and targeted future work is also
mentioned.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Various systems have tried to achieve an efficient data
storage system using various techniques. Based on some
differentiating parameters, we have done a comparative
study of the different Storage Management Systems
proposed by different authors in their research work. To
identify the better technique among the various systems, we
will use the following differentiating parameters: (a) Data
Processing, (b) Data Compression, and (c) Data Storage.</p>
      <p>Different data processing/pre-processing techniques are
used by various systems to analyze big data and
reduce data redundancy, and we will differentiate the
existing systems based on these techniques as well. The
following are some important data processing tools for
big data:
• Apache Hive: Hive is a software project developed
by Apache. It is used for providing query and analysis
of Big Data. It has a SQL-like interface.
• Apache Spark: Spark is an open-source big data
analytics engine that provides an environment for cluster
computing and parallel processing of big data. It comes
with built-in modules for SQL, machine learning, graph
processing, etc.
• Hadoop MapReduce: Another tool that can be
used for programming the Big Data model is Hadoop
MapReduce. Languages like Java, Python, and C++ are
popularly used to write MapReduce programs.
• Apache Giraph: Apache Giraph is real-time graph
processing software built for high scalability which is
used to analyze social media data. Many multinational
companies are using Giraph, tweaking the software
for their purposes.
• PySpark: PySpark is a Python API that brings the
Spark programming model to Python. It is a
necessary supplementary API when one wants to process Big
Data and apply machine learning algorithms at the
same time.
• Prophet Method: Prophet is a procedure for
forecasting time series data based on an additive model
where trends are fit with yearly, weekly, and daily
seasonality, plus holiday effects. Its best results are
with time series that have strong seasonal effects and
several seasons of historical data.
• Neural Network: Neural networks are a set of
algorithms, modeled loosely after the human brain,
that are designed to recognize patterns. They interpret
sensory data through a kind of machine perception,
labeling, or clustering of raw input. Neural
networks, or connectionist systems, are computing
systems loosely inspired by the biological neural
networks that constitute human brains. Such
systems "learn" to perform tasks by considering examples,
generally without being programmed with task-specific
rules.
• MongoDB: MongoDB is an intermediate database
management system between key-value stores and traditional
Relational Database Management Systems. It is a
document-oriented DB system, is classified as a NoSQL
database system, and is a database for Big Data processing.
• NoSQL Database: It stands for "Not Only SQL" and
is used in contrast to RDBMS, where we build the
schema before the actual database and the data is
stored in the form of tables.
• Hashing: A hash function is any function that can be
used to map data of arbitrary size to fixed-size values.</p>
      <p>The values returned are used to index a table of fixed size,
termed a hash table. The result of a hash
function is known as a hash value or simply, a hash. Hashing
can be used to detect redundant records, as sketched below.</p>
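      <p>As an illustration of how hashing supports redundancy detection
(a sketch only, using Python's standard hashlib; the record format
is assumed to be plain strings):</p>
      <preformat>
# Identical records map to identical digests, so duplicates can be
# dropped before storage or compression.
import hashlib

def deduplicate(records):
    seen = set()
    unique = []
    for record in records:
        digest = hashlib.sha256(record.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique

print(deduplicate(["a", "b", "a"]))  # ['a', 'b']
      </preformat>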
      <p>Data compression techniques are also used by the
systems, as compression of data results in efficient and
faster upload of data to storage. Again, we will
differentiate the systems based on the different compression
algorithms and techniques they use. The
following are some important data compression
algorithms for structured big data files:
• Huffman Coding: It is a greedy algorithm used for
lossless compression of data. It makes sure that the
code of one symbol is not the prefix of another
code. It works on character bits. It ensures that there
is no ambiguity while decoding the bitstream (see the
sketch after this list).
• Entropy Encoding: Shannon, the father of information
theory, proposed forms of entropy encoding that are
frequently used for lossless compression. This compression
technique is based on information-theoretic techniques.
• Simple Repetition: An n-times successive
appearance of the same token in a series can be
replaced with the token and a count representing the number
of appearances. A flag is used to mark where the
repeating token appears.
• Gzip: GNU zip is a modern-day compression
algorithm whose main function is to compress and
decompress files for faster network transfer. It
reduces the size of the named file using
Lempel-Ziv coding. It is based on Huffman encoding
and uses the LZ77 approach, which looks for repeated partial
strings within the text.
• Bzip2: It is based on the Burrows-Wheeler text-sorting
algorithm and Huffman encoding, and works on
blocks that range from 100 to 900 KB. It is an
open-source compression program for files and is
free for all.</p>
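      <p>To make the prefix property of Huffman coding concrete, here is a
minimal code construction (a sketch using Python's standard heapq
and collections modules; the sample string is an arbitrary choice),
followed by one-call uses of the gzip and bzip2 compressors named
above:</p>
      <preformat>
# Huffman coding: greedily merge the two least-frequent subtrees;
# the resulting codes are prefix-free, so decoding is unambiguous.
import heapq
from collections import Counter

def huffman_codes(text):
    heap = [[freq, [sym, ""]] for sym, freq in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) != 1:
        lo = heapq.heappop(heap)   # least frequent subtree
        hi = heapq.heappop(heap)   # second least frequent subtree
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict(heap[0][1:])

print(huffman_codes("abracadabra"))
# Frequent symbols ('a') get short codes; no code prefixes another.

# The general-purpose compressors are one call each and lossless.
import bz2
import gzip

payload = b"example big data record " * 10000
gz = gzip.compress(payload)   # DEFLATE: LZ77 + Huffman
bz = bz2.compress(payload)    # Burrows-Wheeler + Huffman
print(len(payload), len(gz), len(bz))
assert gzip.decompress(gz) == payload
assert bz2.decompress(bz) == payload
      </preformat>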
      <p>Various systems have opted either for traditional
physical servers or for cloud servers to store their big data and
make the system cost-effective. So, we will also be
differentiating the storage option used by the different systems
to identify the better system among them. The following are some
important storage options that can be used in a system
to store data files:
• Docker: Docker is one of the top organizations on
the planet, offering a lightweight computing framework that
provides containers as a service. These containers are
open source and are easily accessible to
anyone free of cost. It is because of this that container
services are enjoying enormous interest. Docker makes
the containers progressively more secure and easy to use,
because it gives regular updates to the containers, which
is why a container won't compromise the speed of its
execution.
• Physical Servers: Traditionally, Big Data is stored
on physical servers and is extracted, transformed, and
translated on that same server. They are generally
managed, owned, and maintained by the company's
staff.
• Cloud: A cloud server is a virtual server running in a cloud
computing environment. It is accessed via the internet,
can be accessed remotely, and is maintained by a third
party. Customers need not pay for hardware and
software; rather, they pay for resources (see the upload
sketch after this list). Some of the cloud providers are
AWS, Microsoft Azure, Google Cloud Platform, IBM Cloud,
Rackspace, Oracle Cloud, and Verizon Cloud.
• Hard Disk Drives: Manufacturers are making major
improvements so that newer 10,000 rpm 2.5-inch Hard Disk
Drives perform better than older 15,000 rpm 3-inch
devices. These advancements include heat-assisted magnetic
recording, which boosts the storage capacity of the device.
These better Hard Disk Drives are growing rapidly in use
and provide a better environment for storing Big Data.
• Federated Cloud System: Federated cloud systems
are a combination of pre-existing or newly created
internal or external clouds that fulfill business needs. In
this combination of several clouds, the clouds may perform
different actions or a common action.</p>
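      <p>For the cloud option, uploading a compressed chunk to object
storage is typically a one-call operation. A hypothetical sketch
using AWS S3 via boto3 follows; the file name, bucket name, and
object key are placeholders, and credentials are assumed to be
configured in the environment:</p>
      <preformat>
# Upload one compressed chunk to an S3 bucket.
import boto3

s3 = boto3.client("s3")
s3.upload_file("chunk-000.gz",         # local file (placeholder)
               "my-bigdata-bucket",    # bucket name (placeholder)
               "chunks/chunk-000.gz")  # object key (placeholder)
      </preformat>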
      <p>
        The efficient storage of big data is a major issue in
most organizations, and that is why many researchers
have tried to deal with it in many different ways. A
discussion of a few works follows:
Jahanara et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposed more secure big data storage
protocols. The authors implemented this using an access
control model (using parameters) and a honeypot (for
invalid users) in a cloud computing environment. They
concluded that there is a need to change the trust and
access control procedures in a cloud location for big data
processing. This system was suitable for analyzing, storing,
and retrieving big data in a cloud environment.
      </p>
      <p>
        Krish et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] aimed to enhance the overall I/O
throughput of Big Data storage using solid-state drives (SSDs),
and for that they designed and implemented a dynamic
data management system for big data processing. The
system used a map-reduce function for data processing
and SSDs for storage. They divided the SSDs into two tiers
and kept the faster tier as a cache for frequently accessed
data. This system, as a result, shows only 5% overhead in
performance and offers inexpensive and efficient storage
management.
      </p>
      <p>
        Hongbo et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] proposed a system that reduces the
movement of the data and relieves intense I/O performance
congestion. They came up with a combination of two data
compression algorithms that effectively selects and switches
between them to accomplish the most optimal input-output
results. They experimented with a real-time application on
a cluster machine with 1,280 cores across 80 nodes. They
came to the result that the available processors and the
compression ratio are the two most important factors
affecting the decision for compression.
      </p>
      <p>
        Eugen et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] aimed to evaluate the performance and
energy efficiency of various Big Data tools and applications
in the cloud environment which use Hadoop MapReduce.
They conducted this evaluation on physical as well as
virtual clusters via different configurations. They concluded
that although Hadoop is popularly used in a virtual cloud
environment, we are still unable to find its correct
configuration.
      </p>
      <p>
        Mingchen et al. [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] propose a systematic approach to
identify trends and patterns in big data. Here Big Data
Analytics is applied to criminal data. Criminal data were
collected and analyzed. They used the Prophet model
and a neural network model for identifying the trend. In
conclusion, they found that the Prophet model works
better than the neural network model. They haven't
mentioned any compression technique used in their research.
      </p>
      <p>
        Lu et al. [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] aimed to explore duplicate detection in
news articles. They proposed a tool, NDFinder, using
a hash technique to detect article duplication. They
checked 33,244 news articles and detected 2,150 duplicate
articles. Their precision reached 97%. They matched the
hash values of different articles and reported those having
similar values. They used this research for finding
plagiarism in various articles. They also found that the 3 fields
with the highest proportion of plagiarism are sports news,
technology news, and military news.
      </p>
      <p>
        Moustafa et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] proposed an algorithm to minimize bandwidth
cost, as big data requires a lot of resources for computation
and high bandwidth for data transfer. They proposed that
a federated cloud system provides a cost-effective solution
for analyzing, storing, and computing big data. The
algorithm also proposed a way to minimize electricity
costs for analyzing and storing big data. They concluded
that their proposed algorithms perform better than the
existing approach when the application is large. Their
algorithm minimizes the cost of energy and bandwidth
utilization to a certain extent.
      </p>
      <p>
        Carlos et al. [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] presented a comparative study on obesity
based on electronic health records. They collected
20,706,947 records from different hospitals. While
loading a huge amount of data, they found that MongoDB
shows better performance. They also found that MySQL
took double the time of MongoDB for database loading.
For data query and retrieval, MySQL was found to be more
efficient. So, they concluded that MongoDB is better for
database loading and MySQL is better for data query and
data retrieval.
      </p>
      <p>
        Containers reduce the downtime in the backend and
provide a better service to the clients. It can be quoted from
another research paper, as Avanish et al. [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] proved
in their research paper that Docker containers present
better results in their benchmark testing than clusters
made on Virtual Machines. They also stated that clusters
on containers show better efficiency than those on
Virtual Machines.
      </p>
      <p>
        Based on the above-mentioned study, a conclusive
comparison has been derived which depicts how the existing
systems differ from each other based on our differentiating
parameters. Existing systems can be categorized into the
following two categories:
1. Systems with compression (in Table 1)
2. Systems without compression (in Table 2)
Findings from systems with compression:
• System 1: there is a requirement for RDF-specific
storage techniques and algorithms for building
efficient and high-performance RDF stores.
• System 2 performs parallel compression on different
processors using different compression techniques. It
discovered that the compression ratio and the available
processors are the most significant elements affecting
the compression choice.
Findings from systems without compression:
• System 1 processes Big Data using Deep Learning, a
neural network, and the Prophet model and discovered
that the Deep Learning model and the Prophet model have
better precision than the neural network model.
• System 2 implemented a tool, NDFinder, which detects
duplicated data with a precision of 97%. From 33,240
records, they detected 2,150 positive duplicate records.
• System 3 reduced the operation cost arising from
big data applications being deployed on a federated
cloud.
• System 4 compares Spark and Hadoop and found
that Spark is a faster tool for Big Data processing as
compared to Hadoop.
• System 5 analyzed the current technologies for big
data storage and processing and came to the result
that MongoDB is better for database loading and
MySQL is better for data query and data retrieval.
      </p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Systems with compression</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>System No.</th>
              <th>Paper Reference</th>
              <th>Data Processing</th>
              <th>Data Compression</th>
              <th>Storage</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>1</td>
              <td>Yuan et al. (2019) [<xref ref-type="bibr" rid="ref28">28</xref>]</td>
              <td>Hashing &amp; Indexing</td>
              <td>Huffman Encoding</td>
              <td>Physical Servers</td>
            </tr>
            <tr>
              <td>2</td>
              <td>Hongbo Zou et al. (2014) [<xref ref-type="bibr" rid="ref7">7</xref>]</td>
              <td>Not mentioned</td>
              <td>bzip2 and gzip</td>
              <td>Not mentioned</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-3">
      <title>3. Research Question &amp; Hypothesis</title>
      <p>After summarizing the works depicted in Table 1 and
Table 2, the following gaps are identified:
• Existing systems tried to reduce the size by compression,
but the results were not worth the time spent:
compression of such a huge amount of data requires a lot
of time, and the gains are not fast enough to compensate
for that time.
• The transmission of big data involves a large number
of bits, and a lot of time is consumed during transfer.</p>
      <p>As the size of data is increasing day by day, servers are
getting exhausted, requiring the installation of high-cost
hardware. That is why there is a need to store the data in
such a way that it utilizes less space and is cost-effective.
The main objective is to reduce the storage size of big
data and to provide an approach that keeps working with
the existing physical data storage even with an increased
number of users and files. The system aims to reduce
data redundancy and provide a cost-effective and flexible
environment.</p>
      <p>Expected output and outcome of the solution:
a storage management system that provides space-efficient
storage of big data, reduction of data redundancy,
and a cost-effective and flexible way to access data. Based
on the facts mentioned in the previous section, our research
revolves around the question of whether such a system
can be built using different tools and algorithms to
provide efficient yet fast storage of big data. This paper
gives an idea of how our research progresses towards this
goal. We hypothesize that different techniques can
be used and combined to create a better system than the
existing ones.</p>
      <p>Design of proposed solution:</p>
      <p>The aim is to reduce the storage space for storing big
data. As the data sets are huge and also contain a lot of
redundant data, first we will try to remove the redundant
data from the original data set. After removing redundant
data, our data set will be more accurate and precise. We
will then compress our data using any good data compression
technique, and the data will be stored on any desirable
cloud server. We propose a system that can be produced
with three modules, as depicted in Fig. 2 and described
below.</p>
      <sec id="sec-4-1">
        <title>3.1. Data Processing</title>
        <p>Big data processing is a process of handling a large
volume of data. Data processing is the act of manipulation
of data by a computer. Its main purpose is to find
valuable information in raw data. Data processing can
be of three types: manual data processing, batch data
processing, and real-time data processing. We have the
following tools and techniques for data processing:
• Apache Giraph
• Apache Hive
• Apache Spark
• Hadoop MapReduce
• PySpark
• Prophet Method
• Neural Network Method
• MongoDB
• NoSQL
• Hashing</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Data Compression</title>
        <p>Data compression is the process of reducing the
information required for storage or transmission. It
involves changing, encoding, and converting bit structures
so that the data consumes less space. A typical compression
procedure eliminates and replaces repetitive bits and
symbols to reduce size. Compression can be of two types:
lossy and lossless. We have the following tools and
techniques for data compression:
• Huffman Encoding
• Entropy Encoding
• Simple Repetition
• Bzip2
• Gzip</p>
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Data Storage</title>
        <p>Data storage refers to storing information on a storage
medium. There are many storage devices available in
which we can store data. Some of the storage devices are
magnetic tapes, disks such as floppies, and ROMs, or we
can store the data in the Cloud.</p>
        <p>
          Benchmark System: The following two works can
serve as benchmarks for any such project:
• The system discussed by Haoyu et al. in their work
“CloST: A Hadoop-based Storage System for Big
Spatio-Temporal Data Analytics” [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. They tried to reduce the
storage space and query processing time. They
proposed a three-level hierarchical partitioning approach
for parallel processing of all the objects at different
nodes. They used Hadoop MapReduce for data
processing and column-level gzip for data compression.
• The one discussed by Avanish et al. in their work
“Comparative Study of Hadoop over Containers and Hadoop
Over Virtual Machine” [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. They tried to attain a faster
parallel processing environment, developed using Hadoop
over Docker containers rather than over virtual machines.
        </p>
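        <p>Putting the three modules together, a minimal end-to-end
sketch of the proposed pipeline follows (an illustration only,
assuming string records, bzip2 as the chosen compressor, and a
placeholder object-store client with a put method; none of these
names come from the paper):</p>
        <preformat>
# Module 1: remove redundancy; Module 2: compress;
# Module 3: store the result in the cloud.
import bz2
import hashlib

def pipeline(records, store):
    seen = set()
    unique = []
    for record in records:                 # Module 1: de-duplication
        digest = hashlib.sha256(record.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    blob = bz2.compress("\n".join(unique).encode("utf-8"))  # Module 2
    store.put("dataset.bz2", blob)   # Module 3: cloud upload (stub)
        </preformat>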
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Conclusion</title>
      <p>A review has been done in this paper of data processing
and compression techniques, noting that each technique has
its benefit for a specific type of data set. This effort is
made to explore as many important techniques as possible
from among all the existing techniques used for the purpose.
We have also proposed an architecture that can achieve the
initial objective of reducing storage space. To achieve this,
the architecture first performs analytics on the dataset, and
then its size is reduced to an extent that makes it easy to
store. After analysis and compression, the resultant dataset
is stored in a cloud environment to provide better scalability
and accessibility. Much research has already been conducted
to resolve the issue of efficient big data storage, but some
more persuasive steps still need to be taken.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
          </string-name>
          , and Yike Guo. ”
          <article-title>Wiki-health: from quantified self to self-understanding</article-title>
          .
          <source>” Future Generation Computer Systems</source>
          <volume>56</volume>
          (
          <year>2016</year>
          ):
          <fpage>333</fpage>
          -
          <lpage>359</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Bernard</given-names>
            <surname>Marr</surname>
          </string-name>
          . ”
          <article-title>How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read</article-title>
          .”
          <source>Forbes, 21 May</source>
          <year>2018</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Yang</surname>
          </string-name>
          , Chaowei, Yan Xu, and Douglas Nebert.
          <article-title>”Redefining the possibility of digital Earth and geosciences with spatial cloud computing</article-title>
          .”
          <source>International Journal of Digital Earth 6.4</source>
          (
          <year>2013</year>
          ):
          <fpage>297</fpage>
          -
          <lpage>312</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <surname>Qunying</surname>
          </string-name>
          , et al. ”
          <article-title>Cloud computing for geosciences: deployment of GEOSS clearinghouse on Amazon's EC2</article-title>
          .”
          <source>Proceedings of the ACM SIGSPATIAL international workshop on high performance and distributed geographic information systems</source>
          .
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Akhtar</surname>
          </string-name>
          ,
          <string-name>
            <surname>Jahanara</surname>
          </string-name>
          , et al. ”
          <article-title>Big Data Security with Access Control Model and Honeypot in Cloud Computing</article-title>
          .”
          <source>International Journal of Computer Applications</source>
          <volume>975</volume>
          :
          <fpage>8887</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Krish</surname>
            ,
            <given-names>K. R.</given-names>
          </string-name>
          , et al. ”
          <article-title>On efficient hierarchical storage for big data processing</article-title>
          .
          <source>” 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)</source>
          . IEEE,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <surname>Hongbo</surname>
          </string-name>
          , et al. ”
          <article-title>Improving I/O performance with adaptive data compression for big data applications</article-title>
          .”
          <source>2014 IEEE International Parallel &amp; Distributed Processing Symposium Workshops. IEEE</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Feller</surname>
            , Eugen,
            <given-names>Lavanya</given-names>
          </string-name>
          <string-name>
            <surname>Ramakrishnan</surname>
          </string-name>
          , and Christine Morin. ”
          <article-title>Performance and energy eficiency of big data applications in cloud environments: A Hadoop case study</article-title>
          .
          <source>” Journal of Parallel and Distributed Computing</source>
          <volume>79</volume>
          (
          <year>2015</year>
          ):
          <fpage>80</fpage>
          -
          <lpage>89</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Tan</surname>
            , Haoyu,
            <given-names>Wuman</given-names>
          </string-name>
          <string-name>
            <surname>Luo</surname>
          </string-name>
          , and
          <string-name>
            <surname>Lionel</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ni</surname>
          </string-name>
          . ”
          <article-title>Clost: a hadoop-based storage system for big spatiotemporal data analytics</article-title>
          .
          ”
          <source>Proceedings of the 21st ACM international conference on Information and Knowledge Management</source>
          .
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Prasad</surname>
            ,
            <given-names>Bakshi</given-names>
          </string-name>
          <string-name>
            <surname>Rohit</surname>
          </string-name>
          , and Sonali Agarwal. ”
          <article-title>Comparative study of big data computing and storage tools: a review</article-title>
          .”
          <source>International Journal of Database Theory and Application</source>
          <volume>9</volume>
          .1 (
          <year>2016</year>
          ):
          <fpage>45</fpage>
          -
          <lpage>66</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Schneider</surname>
            ,
            <given-names>Robert D.</given-names>
          </string-name>
          ”
          <article-title>Hadoop for dummies</article-title>
          .” John Willey &amp; sons (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Prasetyo</surname>
          </string-name>
          ,
          <string-name>
            <surname>Bayu</surname>
          </string-name>
          , et al. ”
          <article-title>A review: evolution of big data in developing country</article-title>
          .
          <source>” Bulletin of Social Informatics Theory and Application</source>
          <volume>3</volume>
          .1 (
          <year>2019</year>
          ):
          <fpage>30</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Najm</surname>
          </string-name>
          , Moustafa, and Venkatesh Tamarapalli. ”
          <source>Cost-efficient Deployment of Big Data Applications in Federated Cloud Systems.” 2019 11th International Conference on Communication Systems &amp; Networks (COMSNETS)</source>
          . IEEE,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Chatterjee</surname>
          </string-name>
          , Amlan, Rushabh Jitendrakumar Shah, and
          <string-name>
            <surname>Khondker</surname>
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Hasan</surname>
          </string-name>
          . ”
          <article-title>Efficient Data Compression for IoT Devices using Huffman Coding Based Techniques</article-title>
          .”
          <source>2018 IEEE International Conference on Big Data (Big Data)</source>
          . IEEE,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Chao</surname>
          </string-name>
          , et al. ”
          <article-title>Robot: An eficient model for big data storage systems based on erasure coding</article-title>
          .”
          <source>2013 IEEE International Conference on Big Data. IEEE</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <surname>Chaowei</surname>
          </string-name>
          , et al. ”
          <article-title>Big Data and cloud computing: innovation opportunities and challenges</article-title>
          .”
          <source>International Journal of Digital Earth 10.1</source>
          (
          <year>2017</year>
          ):
          <fpage>13</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Jagadish</surname>
            ,
            <given-names>Hosagrahar V.</given-names>
          </string-name>
          , et al. ”
          <article-title>Big data and its technical challenges</article-title>
          .”
          <source>Communications of the ACM 57.7</source>
          (
          <year>2014</year>
          ):
          <fpage>86</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Bryant</surname>
            , Randal,
            <given-names>Randy H.</given-names>
          </string-name>
          <string-name>
            <surname>Katz</surname>
            , and
            <given-names>Edward D.</given-names>
          </string-name>
          <string-name>
            <surname>Lazowska</surname>
          </string-name>
          . ”
          <article-title>Big-data computing: creating revolutionary breakthroughs in commerce, science and society</article-title>
          .” (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <surname>Divyakant</surname>
          </string-name>
          , et al. ”
          <article-title>Challenges and opportunities with Big Data 2011-1</article-title>
          .” (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Xin</surname>
            ,
            <given-names>Luna</given-names>
          </string-name>
          <string-name>
            <surname>Dong</surname>
          </string-name>
          . ”
          <article-title>Big Data Integration (Synthesis Lectures on Data Management)</article-title>
          .
          <source>”</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <surname>Nawsher</surname>
          </string-name>
          , et al. ”
          <article-title>Big data: survey, technologies, opportunities, and challenges</article-title>
          .”
          <source>The scientific world journal</source>
          <year>2014</year>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Kodituwakku</surname>
            ,
            <given-names>S. R.</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>U. S.</given-names>
            <surname>Amarasinghe</surname>
          </string-name>
          . ”
          <article-title>Comparison of lossless data compression algorithms for text data</article-title>
          .”
          <source>Indian journal of computer science and engineering 1</source>
          .4 (
          <year>2010</year>
          ):
          <fpage>416</fpage>
          -
          <lpage>425</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <surname>Mingchen</surname>
          </string-name>
          , et al. ”
          <article-title>Big data analytics and mining for efective visualization and trends forecasting of crime data</article-title>
          .
          <source>” IEEE Access 7</source>
          (
          <year>2019</year>
          ):
          <fpage>106111</fpage>
          -
          <lpage>106123</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Lu</surname>
            , Lu, and
            <given-names>Pengcheng</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
          </string-name>
          . ”
          <source>Duplication Detection in News Articles Based on Big Data.” 2019 IEEE 4th International Conference on Cloud Computing and Big Data Analysis (ICCCBDA)</source>
          . IEEE,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25] Cheng, Yan, Qiang Zhang, and Ziming Ye.
          <source>”Research on the Application of Agricultural Big Data Processing with Hadoop and Spark.” 2019 IEEE International Conference on Artificial Intelligence and Computer</source>
          Applications (ICAICA). IEEE,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Martinez-Millana</surname>
          </string-name>
          , Carlos, et al. ”
          <article-title>Comparing data base engines for building big data analytics in obesity detection</article-title>
          .”
          <source>2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS)</source>
          . IEEE,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Pandey</surname>
            ,
            <given-names>Manish</given-names>
          </string-name>
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          , and Karthikeyan Subbiah.
          <article-title>”A novel storage architecture for facilitating eficient analytics of health informatics Big Data in cloud</article-title>
          .”
          <source>2016 IEEE International Conference on Computer and Information Technology (CIT)</source>
          . IEEE,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <surname>Pingpeng</surname>
          </string-name>
          , et al. ”
          <article-title>Big RDF Data Storage, Computation, and Analysis: A Strawman's Arguments</article-title>
          .”
          <source>2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS)</source>
          . IEEE,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <surname>Avanish</surname>
          </string-name>
          , et al. ”
          <article-title>Comparative Study of Hadoop over Containers and Hadoop Over Virtual Machine</article-title>
          .”
          <source>International Journal of Applied Engineering Research 13.6</source>
          (
          <year>2018</year>
          ):
          <fpage>4373</fpage>
          -
          <lpage>4378</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Sayood</surname>
            ,
            <given-names>Khalid.</given-names>
          </string-name>
          <article-title>Introduction to data compression</article-title>
          . Morgan Kaufmann,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>