=Paper=
{{Paper
|id=Vol-2786/Paper27
|storemode=property
|title=Analytics and Storage of Big Data
|pdfUrl=https://ceur-ws.org/Vol-2786/Paper27.pdf
|volume=Vol-2786
|authors=Shubham Upadhyay,Rakesh Manwani,Saksham Varshney,Sarika Jain
|dblpUrl=https://dblp.org/rec/conf/isic2/UpadhyayMV021
}}
==Analytics and Storage of Big Data==
Shubham Upadhyay, Rakesh Manwani, Saksham Varshney and Sarika Jain
NIT Kurukshetra, Kurukshetra, Haryana, 136119, India
Abstract
Data generated by devices and users in modern times is high in volume and variable in structure. Collectively termed Big Data, it is difficult to store and process using traditional processing tools. Traditional systems store data on physical servers or in the cloud, resulting in higher cost and space complexity. In this paper, we provide a survey of various state-of-the-art research works that handle the inefficient storage problem of Big Data, and we present a comparative review of these existing works. As a solution to the problems encountered, we propose to split the Big Data into small chunks and provide each chunk to a different cluster, which removes the redundant data and compresses it. Once every cluster has completed its task, the data chunks are combined back and stored on the cloud rather than on physical servers. This effectively reduces storage space and achieves parallel processing, thereby decreasing the processing time for very large data sets.
Keywords
Big Data, Cloud Computing, Data Analytics, Data Compression, Storage System
1. Introduction

Digitization is generating a huge amount of data, and Information Technology organizations have to deal with this data [1], which is very difficult to manage and store. It comes from various sources such as social media, IoT devices, sensors, and mobile networks. According to some estimates, two-thirds of the total data has been generated in the last two years [2]. According to Intel, smart cars generate 4000 GB of data per day.
Traditionally, companies prefer to deploy their own servers for data storage, but as the volume of data increases it becomes challenging to manage the required infrastructure and the associated cost, and this also poses flexibility issues. The problem of managing data can be handled by using infrastructures such as the Cloud, which provides close to unlimited storage along with services such as data security, so data owners do not have to put much effort into it and can focus on their day-to-day tasks.
Along with being large, the data is also complex, which poses problems when it has to be processed with traditional processing tools. For this, we need dedicated tools that facilitate the processing of this data; these are part of Big Data computing. It involves a master-slave architecture in which a single master node assigns tasks to slave nodes that work in parallel, which facilitates faster processing. The data is generally scattered and broadly classified into three types:

• Structured: Data that can be stored in tables made of rows and columns comprising a database is termed structured data. This type of data can be easily processed. E.g., relational data.

• Semi-structured: Data that cannot be stored in the form of a database but can still be analyzed easily is termed semi-structured data. It usually occupies less space. E.g., XML data.

• Unstructured: Unstructured data requires an alternative platform for storage and management and is mainly used in organizations along with business intelligence. E.g., PDF and media files.

As each structure has different features, the types need to be processed by different tools, which makes it difficult to define a single mechanism to process big data efficiently. Along with its complex structure, big data is also characterized by the 5 V's, which define what a system developer has to keep in mind while dealing with Big Data. These V's are:

• Velocity: Velocity is the speed of data generation, analysis, and collection. With each day, the velocity of data keeps on increasing.

• Volume: Volume is the amount of data, which is generated from social media, credit cards, sensors, etc. The volume of data is so large that it is difficult to manage, store, and analyze it.

• Value: In the current scenario, data is money. It is worthless if we cannot extract value from it. Having huge data is something, but if we cannot extract value from it, then it is useless.
• Variety: Data is available in three forms: structured, semi-structured, and unstructured. The volume of structured data is very small compared to unstructured data.

• Veracity: Veracity refers to the quality of the data, i.e., how accurate the data is. It tests the reliability and accuracy of the content.

The Cloud is a better storage platform where our data is easily accessible and secure, so business firms have started storing their data in the cloud. But the rate of growth of data is exponential; as a result, even cloud servers lack such a huge volume of storage. Therefore, there emerges a need to select the important data and store it in a way that fits in less memory space and is cost-effective. To achieve this objective we require a system that can perform this task in less time, but a single system cannot do it efficiently; thus we require an environment where we can achieve parallel processing to perform the task fast.

Figure 1: Parallel Processing Environment.

Fig. 1 shows a way to process the data faster by dividing it into small chunks and assigning each chunk to a cluster that processes the chunk provided by the master node. This achieves parallel processing and increases the rate of processing.
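To make the idea in Fig. 1 concrete, the following minimal Python sketch splits a data set into chunks and hands each chunk to a separate worker process that removes duplicate lines, mimicking a master node assigning work to clusters. The file name and chunk size are hypothetical, and a real deployment would use a cluster framework such as Hadoop or Spark rather than a single machine's process pool.

```python
from multiprocessing import Pool

def process_chunk(lines):
    """Worker task: drop duplicate lines inside one chunk (order preserved)."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique

def split_into_chunks(lines, chunk_size):
    """Master task: cut the data set into fixed-size chunks."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

if __name__ == "__main__":
    # "records.txt" is a placeholder for the large data set.
    with open("records.txt", "r", encoding="utf-8") as f:
        lines = f.readlines()

    chunks = split_into_chunks(lines, chunk_size=100_000)

    # Each chunk is processed by a different worker, in parallel.
    with Pool() as pool:
        results = pool.map(process_chunk, chunks)

    # The master combines the processed chunks back into one data set.
    deduplicated = [line for chunk in results for line in chunk]
    print(f"{len(lines)} lines reduced to {len(deduplicated)} lines")
```

Note that this per-chunk pass only removes duplicates inside each chunk; a final merge step (or a shared hash set) would still be needed for exact deduplication across chunks.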
Challenges with Big Data Storage Systems
Where there are opportunities, there are challenges. Along with the various benefits that can be gained from large data sets, big data storage poses various challenges too. Computational methods that work well with small data sets will not work well with big data sets. The following are some major challenges with big data storage:

• Format of Data: While working on big data storage management, one of the prime challenges is to build a system that deals equally well with both structured and unstructured data.

• Data Storage: The Volume, Variety, and Velocity of Big Data lead to various storage challenges. Traditional Big Data storage is quite challenging, as Hard Disk Drives (HDDs) fail very often and cannot assure efficient data storage even with several data protection mechanisms. In addition, the velocity of big data generates a need for scalable storage management to cope with it. Although the Cloud provides a potential solution with unlimited, highly fault-tolerant storage options, transferring this huge amount of data to, and hosting it on, the cloud is quite expensive [3].

• Data Transmission: Transmission of data consists of several stages: a) collecting data coming from different origins, such as sensors and social media; b) integrating the collected data that comes from different origins; c) controlled transfer of the integrated data to processing platforms; d) movement of data from server to host [4]. Transferring this amount of data between the different stages is quite challenging in various ways.

• Data Processing: Processing enormous volumes of information requires dedicated computing resources, and this is mostly taken care of by the increasing pace of CPU, network, and storage capacity. However, the computing resources required for handling huge information far exceed the processing power offered by traditional computing paradigms.

• Data Duplication: In a big data environment, most of the data sets have identical content and are redundant. Data duplication increases the cost and also takes more space.

Benefits of Data Analytics and Compression
Data analytics comprises collecting data, removing data redundancy, analyzing results, etc., and compression comprises reducing the file size to save space. Combined, they can be used to address our research question. The following are the benefits of data analytics and compression of data:

• Less disk space: As the size of big data is very large, compressing it after analysis can reduce the required disk space to a great extent. This process releases a lot of space on the drive and, as a result, reduces the time required to retrieve the data from a server.

• Faster file transfer: After file compression, the size of the file is reduced, so transmitting a file of reduced size will be faster.

• Accessibility: Accessing managed data is relatively easier as it allows faster searching over hugely populated data. Also, data can be accessed remotely from any place with internet connectivity.
• Cost Reduction: Using the cloud as a storage medium helps in reducing hardware costs as well as cutting energy costs. We can rent as much space as we want at a minimal cost.

• Virtualization: The cloud provides backup for big data and takes the burden of growing data off the enterprises. To provide a backup, it is recommended to make virtual copies of the applications instead of making physical copies of the analytics.

• Scalability: Big data management allows the application to grow exponentially as it deals with the storage space by itself. It reduces the need for new servers or supporting hardware as it manages the data in the existing ones.

A lot of work has been done in this direction by different authors, and the following are the contributions of this work: (1) An exhaustive study of the existing systems has been carried out; based on three parameters, namely the data processing tools used, the data compression technique, and the data storage option used, a review has been done and summarized to get a gist of all the existing ways to deal with Big Data. (2) Based on the comparative study, a few gaps in the existing systems are identified and a solution to fill those gaps is discussed. This paper is structured into four sections. After introducing the paper in Section 1, we move to Section 2, discussing the works done so far in this direction and a comparative study of the systems of different authors that deal with big data. Section 3 presents the gaps that come out of the previous work, and a solution is proposed to fill these gaps. Finally, Section 4 concludes the work, and targeted future work is also mentioned.

2. Related Work

Various systems have tried to achieve an efficient data storage system using various techniques. Based on some differentiating parameters, we have done a comparative study between the different storage management systems proposed by different authors in their research work. To draw out a better technique from the various systems, we use the following differentiating parameters: (a) Data Processing, (b) Data Compression, and (c) Data Storage. Different data processing and pre-processing techniques are used by the various systems to analyze big data and reduce data redundancy, and we will differentiate the existing systems based on these techniques as well. The following are some important data processing tools for big data (a short usage sketch follows the list):

• Apache Hive: Hive is a software project developed by Apache. It is used for providing query and analysis of Big Data. It has a SQL-like interface.

• Apache Spark: Spark is an open-source big data analytical engine that provides an environment for cluster computing and parallel processing of big data. It comes with inbuilt modules for SQL, machine learning, graph processing, etc.

• Hadoop MapReduce: Another tool that can be used for programming the Big Data model is Hadoop MapReduce. MapReduce programs are popularly written in languages such as Java, Python, and C++.

• Apache Giraph: Apache Giraph is real-time graph processing software built for high scalability which is used to analyze social media data. Many multinational companies use Giraph, tweaking the software for their purposes.

• PySpark: PySpark is a Python API that exposes the Spark programming model to Python. It is a necessary supplement when one wants to process Big Data and apply machine learning algorithms at the same time.

• Prophet Method: Prophet is a procedure for forecasting time series data based on an additive model in which trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It gives its best results on time series that have strong seasonal effects and several seasons of historical data.

• Neural Network: Neural networks are a set of algorithms, loosely modeled after the human brain, that are designed to recognize patterns. They interpret raw data through a kind of machine perception, labeling, or clustering. Neural networks, or connectionist systems, are computing systems loosely inspired by the biological neural networks that constitute human brains. Such systems "learn" to perform tasks by considering examples, generally without being programmed with task-specific rules.

• MongoDB: MongoDB is a database management system intermediate between key-value stores and traditional Relational Database Management Systems. It is a document-oriented DB system, is classified as a NoSQL database system, and is a common database for Big Data processing.

• NoSQL Database: NoSQL stands for "Not Only SQL" and is used in contrast to RDBMS, where the schema is built before the actual database and the data is stored in the form of tables.
• Hashing: A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values are used to index a fixed-size table, termed a hash table. The result of a hash function is known as a hash value or simply a hash.

• Docker: Docker is a lightweight containerization framework, offered by one of the leading companies in this space, that provides containers as a service. These containers are open source and are easily accessible to anyone free of cost, which is why container services are in enormous demand. Docker makes containers progressively more secure and easy to use, because it gives regular updates to the containers, so a container does not compromise the speed of its execution.
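As a usage sketch for the tools above, the hedged PySpark example below removes duplicate records from a CSV data set before it is stored. The file paths and the local master setting are hypothetical placeholders and would differ in a real cluster deployment.

```python
from pyspark.sql import SparkSession

# Start a local Spark session; on a real cluster the master URL would differ.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("dedup-example")
         .getOrCreate())

# "records.csv" is a placeholder for the big data set.
df = spark.read.csv("records.csv", header=True, inferSchema=True)

# Spark distributes this deduplication across its worker nodes.
deduplicated = df.dropDuplicates()

print("rows before:", df.count(), "rows after:", deduplicated.count())

# Write the reduced data set back out (e.g., for later compression and upload).
deduplicated.write.mode("overwrite").parquet("records_dedup.parquet")

spark.stop()
```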
Data compression techniques are also used by the systems, as compression of data results in efficient and faster upload of data to storage. Again, we will differentiate the systems based on the different compression algorithms and techniques they use. The following are some important data compression algorithms for structured big data files:

• Huffman Coding: Huffman coding is a greedy algorithm used for lossless compression of data. It ensures that no codeword is the prefix of another codeword. It works on character bits and ensures that there is no ambiguity while decoding the bitstream.

• Entropy Encoding: Shannon, the father of information theory, proposed forms of entropy encoding that are frequently used for lossless compression. This compression technique is based on information-theoretic techniques.

• Simple Repetition: The n successive appearances of the same token in a sequence can be replaced with the token and a count representing the number of appearances. A flag is used to mark wherever the repeating token appears.

• Gzip: GNU zip is a modern-day compression tool whose main function is to compress and decompress files for faster network transfer. It reduces the size of the named file using Lempel-Ziv coding; it is based on Huffman encoding combined with the LZ77 approach, which looks for repeated partial strings within the text.

• Bzip2: Bzip2 is based on the Burrows-Wheeler text-sorting algorithm and Huffman encoding and works on blocks of 100 to 900 KB. It is an open-source compression program for files and is free to all.
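To illustrate the last two algorithms, the following small Python sketch compresses the same byte string with the standard-library gzip and bz2 modules and compares the resulting sizes. The sample data is made up, and real big data files would be streamed rather than held in memory.

```python
import bz2
import gzip

# Made-up repetitive sample standing in for a structured big data file.
data = b"sensor_id,reading\n" + b"42,17.5\n" * 100_000

gz = gzip.compress(data)   # DEFLATE: LZ77 + Huffman coding
bz = bz2.compress(data)    # Burrows-Wheeler transform + Huffman coding

print("original:", len(data), "bytes")
print("gzip    :", len(gz), f"bytes ({len(gz) / len(data):.2%})")
print("bzip2   :", len(bz), f"bytes ({len(bz) / len(data):.2%})")

# Both are lossless: decompressing restores the exact original bytes.
assert gzip.decompress(gz) == data
assert bz2.decompress(bz) == data
```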
Various systems have opted either for traditional physical servers to store their big data or for cloud servers, to make the system cost-effective. So, we will also differentiate the storage option used by the different systems in order to draw out a better system from among them. The following are some important storage options that can be used in a system to store data files:

• Physical Servers: Traditionally, Big Data is stored on physical servers and is extracted, transformed, and translated on that same server. These servers are generally managed, owned, and maintained by the company's staff.

• Cloud: A cloud server is a virtual server running in a cloud computing environment. It is accessed via the internet, can be accessed remotely, and is maintained by a third party. Customers need not pay for hardware and software; rather, they pay for the resources they use. Some of the cloud providers are AWS, Microsoft Azure, Google Cloud Platform, IBM Cloud, Rackspace, Oracle Cloud, and Verizon Cloud.

• Hard Disk Drives: Manufacturers are making major improvements to give newer 10,000 rpm 2.5-inch Hard Disk Drives better performance than older 15,000 rpm 3-inch devices. These advancements include heat-assisted magnetic recording, which boosts the storage capacity of the device. These better Hard Disk Drives are growing rapidly and are providing a better environment to store Big Data.

• Federated Cloud System: Federated cloud systems are a combination of pre-existing or newly created internal or external clouds put together to fulfill business needs. In this combination of several clouds, the clouds may perform different actions or a common action.

The efficient storage of big data is a major issue in most organizations, and that is why many researchers have tried to deal with it in many different ways. A discussion of a few works follows.
Jahanara et al. [5] proposed more secure big data storage protocols. The authors implemented this using an access control model (using parameters) and a honeypot (for invalid users) in a cloud computing environment. They concluded that the trust and admission control procedure in a cloud location needs to change for big data processing. This system was suitable for analyzing, storing, and retrieving big data in a cloud environment.
Krish et al. [6] aimed to enhance the overall I/O throughput of Big Data storage using solid-state drives (SSDs), and for that they designed and implemented a dynamic data management system for big data processing. The system used a MapReduce function for data processing and SSDs for storage. They divided the SSDs into two tiers and kept the faster tier as a cache for frequently accessed data. As a result, this system shows only 5% overhead in performance and offers inexpensive and efficient storage management.
Hongbo et al. [7] proposed a system that reduces data movement and relieves intense I/O performance congestion. They came up with a combination of two data compression algorithms that the system selects between and switches to accomplish the optimal input-output results. They experimented with a real-time application run on an 80-node cluster machine with 1280 cores. They came up with the result that the available processors and the compression ratio are the two most important factors affecting the decision for compression.
Eugen et al. [8] aimed to evaluate the performance and energy friendliness of various Big Data tools and applications that use Hadoop MapReduce in the cloud environment. They conducted this evaluation on physical as well as virtual clusters with different configurations. They concluded that although Hadoop is popularly used in virtual cloud environments, we are still unable to find its correct configuration.
Mingchen et al. [23] propose a systematic approach to identify trends and patterns in big data. Here Big Data analytics is applied to criminal data: criminal data were collected and analyzed. They used the Prophet model and a neural network model for identifying the trend. In conclusion, they found that the Prophet model works better than the neural network model. They have not mentioned any compression technique used in their research.
Lu et al. [24] aimed to explore duplicate detection in news articles. They proposed a tool, NDFinder, that uses a hashing technique to detect article duplication. They checked 33,244 news articles and detected 2,150 duplicate articles; their precision reached 97%. They matched the hash values of different articles and reported those having similar values. They applied this research to finding plagiarism in various articles and also found that the three fields with the highest proportion of plagiarism are sports news, technology news, and military news.
Moustafa et al. [13] proposed an algorithm to minimize bandwidth cost, as big data requires a lot of resources for computation and high bandwidth to allow data transfer. They proposed that a federated cloud system provides a cost-effective solution for analyzing, storing, and computing big data. The work also proposed a way to minimize electricity costs for analyzing and storing big data. They concluded that their proposed algorithms perform better than the existing approach when the application is large, and that the algorithm minimizes the cost of energy and bandwidth utilization to a certain extent.
Carlos et al. [26] presented a comparative study of database engines based on obesity detection from electronic health records. They collected 20,706,947 records from different hospitals. While loading a huge amount of data they found that MongoDB shows better performance; they also found that MySQL took double the time of MongoDB for database loading. For data query and retrieval, MySQL was found to be more efficient. So, they concluded that MongoDB is better for database loading and MySQL is better for data query and data retrieval.
Containers reduce the downtime in the backend and provide a better service to the clients. It can be quoted from another research paper, as Avanish et al. [29] showed that Docker containers give better results on their benchmark tests than clusters made on virtual machines. They also stated that a cluster on containers shows better efficiency than one on virtual machines.

Based on the above-mentioned study, a conclusive comparison has been derived which depicts how the existing systems differ from each other based on our differentiating parameters. The existing systems can be categorized into the following two categories:
1. Systems with compression (in Table 1)
2. Systems without compression (in Table 2)

Findings from systems with compression:

• System 1 identifies a requirement for RDF-specific storage techniques and algorithms for building efficient and high-performance RDF stores.

• System 2 performs parallel compression on different processors using different compression techniques. It discovered that the compression ratio and the available processors are the most significant elements affecting the compression choice.

Findings from systems without compression:

• System 1 processes Big Data using Deep Learning, a neural network, and the Prophet model, and discovered that the Deep Learning and Prophet models have better precision than the neural network model.

• System 2 implemented a tool, NDFinder, which detects duplicated data with a precision of 97%. From 33,244 records, it detected 2,150 positive duplicate records.

• System 3 reduced the operating cost arising from big data applications by deploying them on a federated cloud.

• System 4 compares Spark and Hadoop and found that Spark is a faster tool for Big Data processing as compared to Hadoop.

• System 5 analyzed the current technologies for big data storage and processing and came out with the result that MongoDB is better for database loading and MySQL is better for data query and data retrieval.
Table 1: Comparative study of systems with compression

System No. | Paper Reference | Data Processing | Data Compression | Storage
1 | Yuan et al. (2019) [28] | Hashing & Indexing | Huffman Encoding | Physical Servers
2 | Hongbo Zou et al. (2014) [7] | Not mentioned | bzip2 and gzip | Not mentioned

Table 2: Comparative study of systems without compression

System No. | Paper Reference | Data Processing | Storage
1 | Feng, Mingchen, et al. (2019) [23] | Prophet method and neural network | Cloud Storage
2 | Lu et al. (2019) [24] | NDFinder, a hashing-based tool | Physical Servers
3 | Moustafa Najm et al. (2019) [13] | Virtual Machines | Federated Cloud System
4 | Cheng et al. (2019) [25] | Spark and Hadoop | Cloud Storage
5 | Carlos et al. (2019) [26] | MongoDB and MySQL | Cloud Storage
6 | Avanish et al. (2018) [29] | Docker Containers vs Hadoop | Physical Storage
7 | Pandey et al. (2016) [27] | NoSQL | Cloud Storage
8 | Krish et al. (2016) [6] | Hadoop/MapReduce and Spark | Hard Disk Drive
9 | Feller, Eugen, et al. (2015) [8] | Hadoop/MapReduce | Physical Servers
10 | Jahanara et al. [5] | Hadoop/MapReduce | Cloud Storage
• In System 6, clusters on Docker containers perform considerably faster than the Hadoop framework on virtual machines.

• System 7 states that NoSQL gives faster results than SQL.

• System 8 shows that Hadoop/MapReduce requires a highly specified environment, making the process almost impossible for low-specification systems, although parallel processing gives better performance.

• System 8 also depicts that data locality is an important factor in energy efficiency.

• System 9 proposed a model to secure Big Data while storing it in the cloud. A honeypot is used as a trap to catch a thief, hacker, or unauthorized user.

3. Research Question & Hypothesis

After summarizing the works depicted in Table 1 and Table 2, the following gaps are identified:

• Existing systems tried to reduce the size by compression, but the results were not worth the time spent on it, because compressing such a huge amount of data requires a lot of time and the result is not fast enough to compensate for that time.

• The transmission of big data involves a large number of bits and consumes a lot of time. So, during transmission, there is a higher probability of data loss or bit corruption.

As the size of data is increasing day by day, servers are getting exhausted, requiring the installation of high-cost hardware. That is why there is a need to store the data in such a way that it utilizes less space and is cost-effective. The main objective is to reduce the storage size of big data and to provide an approach that keeps working with the existing physical data storage even with an increased number of users and files. The system aims to reduce data redundancy and provide a cost-effective and flexible environment.

Expected output and outcome of the solution:
A storage management system that provides space-efficient storage of big data, reduction of data redundancy, and a cost-effective and flexible way to access data. Based on the facts mentioned in the previous section, our research revolves around the question of whether such a system can be built using different tools and algorithms to provide efficient yet faster storage of big data. This paper gives an idea of how our research progresses towards this goal. We hypothesize that different techniques can be combined to create a better system than the existing ones.

Design of proposed solution:
The aim is to reduce the storage space needed for storing big data. As the data sets are huge and also contain a lot of redundant data, we first try to remove the redundant data from the original data set; after that, the data set is more accurate and precise. After removing the redundant data, we compress the data using a suitable data compression technique, and the compressed data is then stored on any desirable cloud server.
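As a rough illustration of this design, the hedged Python sketch below chains the three steps: duplicate records are dropped using line hashes, the result is compressed with bz2, and the compressed file is handed to an upload step. The file names are placeholders, and the upload function is only a stub standing in for a real cloud SDK call.

```python
import bz2
import hashlib

def deduplicate(in_path: str, out_path: str) -> None:
    """Module 1: keep only the first occurrence of each record (by line hash)."""
    seen = set()
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for line in src:
            digest = hashlib.sha256(line).hexdigest()
            if digest not in seen:
                seen.add(digest)
                dst.write(line)

def compress(in_path: str, out_path: str) -> None:
    """Module 2: compress the deduplicated file with bzip2."""
    with open(in_path, "rb") as src, bz2.open(out_path, "wb") as dst:
        dst.write(src.read())

def upload_to_cloud(path: str) -> None:
    """Module 3: stub for the cloud upload (e.g., an object-storage SDK call)."""
    print(f"would upload {path} to the chosen cloud storage")

if __name__ == "__main__":
    deduplicate("records.txt", "records_dedup.txt")   # placeholder file names
    compress("records_dedup.txt", "records_dedup.txt.bz2")
    upload_to_cloud("records_dedup.txt.bz2")
```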
We propose a system that can be produced with three modules, as depicted in Fig. 2 and described below.

Figure 2: Proposed Architecture.

3.1. Data Processing:
Big data processing is the process of handling a large volume of data. Data processing is the act of manipulating data by computer, and its main purpose is to find valuable information in raw data. Data processing can be of three types: manual data processing, batch data processing, and real-time data processing. We have the following tools and techniques for data processing:

• Apache Giraph
• Apache Hive
• Apache Spark
• Hadoop MapReduce
• PySpark
• Prophet Method
• Neural Network Method
• MongoDB
• NoSQL
• Hashing

3.2. Data Compression:
Data compression is the process of reducing the data required for storage or transmission. It involves modifying, encoding, and converting bit structures so that the data consumes less space. A typical compression procedure eliminates or replaces repetitive bits and symbols to reduce the size. There are two types of compression: lossy and lossless. We have the following tools and techniques for data compression:

• Huffman Encoding
• Entropy Encoding
• Simple Repetition
• Bzip2
• Gzip

3.3. Data Storage:
Data storage refers to storing information on a storage medium. There are many storage devices available in which we can store data, such as magnetic tape, disks like floppies, and ROMs, or we can store the data in the Cloud.

Benchmark System: The following two works can serve as a benchmark for any such project:

• The system discussed by Haoyu et al. in their work "CloST: A Hadoop-based Storage System for Big Spatio-Temporal Data Analytics" [9]. They tried to reduce the storage space and the query processing time. They proposed a three-level hierarchical partitioning approach for parallel processing of all the objects at different nodes. They used Hadoop MapReduce for data processing and column-level gzip for data compression.

• The system discussed by Avanish et al. in their work "Comparative Study of Hadoop over Containers and Hadoop Over Virtual Machine" [29]. They tried to attain a faster parallel processing environment, developed using Docker containers rather than the Hadoop framework on virtual machines.

4. Conclusion

In this paper a review has been carried out of data processing and compression techniques, noting that each technique has its benefit for a specific type of data set. This effort is made to explore as many of the important existing techniques used for the purpose as possible. We have also proposed an architecture that can achieve the initial objective of reducing storage space. To achieve this, the architecture first performs analytics on the dataset, and then its size is reduced to an extent that makes it easy to store. After analysis and compression, the resultant dataset is stored in a cloud environment to provide better scalability and accessibility. Much research has already been conducted to some extent to resolve the issue of efficient big data storage, but some more persuasive steps still need to be taken.
References

[1] Li, Yang, and Yike Guo. "Wiki-health: from quantified self to self-understanding." Future Generation Computer Systems 56 (2016): 333-359.
[2] Bernard Marr. "How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read." Forbes, 21 May 2018.
[3] Yang, Chaowei, Yan Xu, and Douglas Nebert. "Redefining the possibility of digital Earth and geosciences with spatial cloud computing." International Journal of Digital Earth 6.4 (2013): 297-312.
[4] Huang, Qunying, et al. "Cloud computing for geosciences: deployment of GEOSS clearinghouse on Amazon's EC2." Proceedings of the ACM SIGSPATIAL International Workshop on High Performance and Distributed Geographic Information Systems. 2010. Sharma, Sushil, et al. "Image Steganography using Two's Complement." International Journal of Computer Applications 145.10 (2016): 39-41.
[5] Akhtar, Jahanara, et al. "Big Data Security with Access Control Model and Honeypot in Cloud Computing." International Journal of Computer Applications 975: 8887.
[6] Krish, K. R., et al. "On efficient hierarchical storage for big data processing." 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid). IEEE, 2016.
[7] Zou, Hongbo, et al. "Improving I/O performance with adaptive data compression for big data applications." 2014 IEEE International Parallel & Distributed Processing Symposium Workshops. IEEE, 2014.
[8] Feller, Eugen, Lavanya Ramakrishnan, and Christine Morin. "Performance and energy efficiency of big data applications in cloud environments: A Hadoop case study." Journal of Parallel and Distributed Computing 79 (2015): 80-89.
[9] Tan, Haoyu, Wuman Luo, and Lionel M. Ni. "CloST: a Hadoop-based storage system for big spatio-temporal data analytics." Proceedings of the 21st ACM International Conference on Information and Knowledge Management. 2012.
[10] Prasad, Bakshi Rohit, and Sonali Agarwal. "Comparative study of big data computing and storage tools: a review." International Journal of Database Theory and Application 9.1 (2016): 45-66.
[11] Schneider, Robert D. "Hadoop for Dummies." John Wiley & Sons (2012).
[12] Prasetyo, Bayu, et al. "A review: evolution of big data in developing country." Bulletin of Social Informatics Theory and Application 3.1 (2019): 30-37.
[13] Najm, Moustafa, and Venkatesh Tamarapalli. "Cost-efficient Deployment of Big Data Applications in Federated Cloud Systems." 2019 11th International Conference on Communication Systems & Networks (COMSNETS). IEEE, 2019.
[14] Chatterjee, Amlan, Rushabh Jitendrakumar Shah, and Khondker S. Hasan. "Efficient Data Compression for IoT Devices using Huffman Coding Based Techniques." 2018 IEEE International Conference on Big Data (Big Data). IEEE, 2018.
[15] Yin, Chao, et al. "Robot: An efficient model for big data storage systems based on erasure coding." 2013 IEEE International Conference on Big Data. IEEE, 2013.
[16] Yang, Chaowei, et al. "Big Data and cloud computing: innovation opportunities and challenges." International Journal of Digital Earth 10.1 (2017): 13-53.
[17] Jagadish, Hosagrahar V., et al. "Big data and its technical challenges." Communications of the ACM 57.7 (2014): 86-94.
[18] Bryant, Randal, Randy H. Katz, and Edward D. Lazowska. "Big-data computing: creating revolutionary breakthroughs in commerce, science and society." (2008).
[19] Agrawal, Divyakant, et al. "Challenges and opportunities with Big Data 2011-1." (2011).
[20] Xin, Luna Dong. "Big Data Integration (Synthesis Lectures on Data Management)." (2015).
[21] Khan, Nawsher, et al. "Big data: survey, technologies, opportunities, and challenges." The Scientific World Journal 2014 (2014).
[22] Kodituwakku, S. R., and U. S. Amarasinghe. "Comparison of lossless data compression algorithms for text data." Indian Journal of Computer Science and Engineering 1.4 (2010): 416-425.
[23] Feng, Mingchen, et al. "Big data analytics and mining for effective visualization and trends forecasting of crime data." IEEE Access 7 (2019): 106111-106123.
[24] Lu, Lu, and Pengcheng Wang. "Duplication Detection in News Articles Based on Big Data." 2019 IEEE 4th International Conference on Cloud Computing and Big Data Analysis (ICCCBDA). IEEE, 2019.
[25] Cheng, Yan, Qiang Zhang, and Ziming Ye. "Research on the Application of Agricultural Big Data Processing with Hadoop and Spark." 2019 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA). IEEE, 2019.
[26] Martinez-Millana, Carlos, et al. "Comparing database engines for building big data analytics in obesity detection." 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS). IEEE, 2019.
[27] Pandey, Manish Kumar, and Karthikeyan Subbiah. "A novel storage architecture for facilitating efficient analytics of health informatics Big Data in cloud." 2016 IEEE International Conference on Computer and Information Technology (CIT). IEEE, 2016.
[28] Yuan, Pingpeng, et al. "Big RDF Data Storage, Computation, and Analysis: A Strawman's Arguments." 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS). IEEE, 2019.
[29] Singh, Avanish, et al. "Comparative Study of Hadoop over Containers and Hadoop over Virtual Machine." International Journal of Applied Engineering Research 13.6 (2018): 4373-4378.
[30] Sayood, Khalid. Introduction to Data Compression. Morgan Kaufmann, 2017.