<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BIG DATA</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shubham Upadhyay</string-name>
          <email>supadhyay567@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rakesh Manwani</string-name>
          <email>rehan.manwani.56@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Saksham Varshney</string-name>
          <email>sakshamvarshney8@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sarika Jain</string-name>
          <email>jasarika@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>NIT Kurukshetra</institution>
          ,
          <addr-line>Kurukshetra, Haryana, 136119</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <fpage>202</fpage>
      <lpage>210</lpage>
      <abstract>
        <p>Data generated by the devices and the users in modern times is high in volume and variable in structure. Collectively termed Big Data, it is difficult to store and process using traditional processing tools. Traditional systems store data on physical servers or the cloud, resulting in higher cost and space complexity. In this paper, we provide a survey of various state-of-the-art research works done to handle the inefficient storage problem of Big Data. We provide comparative literature to compare the existing works that handle Big Data. As a solution to the problem encountered, we propose to split the Big Data into small chunks and provide each chunk to a different cluster for removing the redundant data and compressing it. Once every cluster has completed its task, the data chunks are combined back and stored on the cloud rather than on physical servers. This effectively reduces storage space and achieves parallel processing, thereby decreasing the processing time for very large data sets.</p>
      </abstract>
      <kwd-group>
        <kwd>Big Data</kwd>
        <kwd>Cloud Computing</kwd>
        <kwd>Data Analytics</kwd>
        <kwd>Data Compression</kwd>
        <kwd>Storage System</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Digitization is generating a huge amount of data, and
Information Technology Organizations have to deal with
this huge data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which is very difficult to manage and
store. This data comes from various sources like social
media, IoT devices, sensors, mobile networks, etc.
According to some figures, two-thirds of the total data has been
generated in the last two years [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. According to Intel, smart
cars generate 4000GB of data per day.
      </p>
      <p>Traditionally, companies prefer to deploy their own servers
for data storage, but as the volume of data increases it
becomes challenging for them to manage the required
infrastructure and the cost associated with it; this also
poses flexibility issues. The problem of managing this data
can be handled by using infrastructures like the Cloud, which
provide close to unlimited storage along with services such
as data security, so that data owners don't have to put much
effort into it and can focus on their day-to-day tasks.</p>
      <p>Along with being large, the data is also complex, which
poses problems when it has to be processed with
traditional processing tools. For this, we need dedicated tools
that facilitate the processing of this data; such tools are
part of Big Data Computing. It involves a master-slave
architecture in which a single master node assigns tasks
to slave nodes, which work in parallel. This facilitates
faster processing. The data is generally scattered and
broadly classified by the following characteristics:
• Value: In the current scenario, data is money. It is
worthless if we can't extract value from it. Having
huge data is something, but if we can't extract value
from it, then it is useless.
• Variety: Data is available in three forms:
structured, semi-structured, and unstructured. The volume
of structured data is very small compared to
unstructured data.
• Veracity: Veracity refers to the quality of the data. It
means how accurate the data is. It tests the reliability and
accuracy of the content.</p>
      <p>The Cloud is a better storage platform where our data is
easily accessible and secure, so business firms have started
storing their data in the cloud. But the rate of growth of
data is exponential; as a result, even cloud servers lack
such a huge volume of storage. Therefore, there emerges
a need to select important data and store it in a way
that fits in less memory space and is cost-effective. To
achieve this objective we require a system that can perform
this task in less time; a single system cannot do this
efficiently, so we require an environment where we can
achieve parallel processing to perform the task fast.</p>
      <p>Fig. 1 shows a way to process the data faster by dividing it
into small chunks and assigning each chunk to a cluster,
which processes the data chunk provided by the master
node. This achieves parallel processing and increases
the rate of processing, as in the sketch below.</p>
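      <p>As a minimal illustration of the idea in Fig. 1 (a sketch only,
assuming the data is an in-memory byte string and using Python's
standard multiprocessing and zlib modules; the 64 MB chunk size
is an arbitrary choice, not taken from the paper):</p>
      <preformat>
# Master splits the data into chunks; each worker ("cluster" in
# Fig. 1) compresses its own chunk in parallel.
import zlib
from multiprocessing import Pool

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB per chunk (illustrative)

def compress_chunk(chunk):
    # One worker handles one chunk independently.
    return zlib.compress(chunk)

def parallel_compress(data, workers=4):
    chunks = [data[i:i + CHUNK_SIZE]
              for i in range(0, len(data), CHUNK_SIZE)]
    with Pool(workers) as pool:
        # Results come back in order, so the chunks can be recombined.
        return pool.map(compress_chunk, chunks)
      </preformat>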
      <p>
        Challenges with Big Data Storage Systems:
Where there are opportunities, there are challenges. Along
with the various benefits that can be gained from large data
sets, big data storage poses various challenges too.
Computational methods that work well with small
data sets won't work well with big data sets. The
following are some major challenges with big data storage:
• Format of Data: While working on Big Data storage
management, one of the prime challenges is to build
a system that deals with both structured and
unstructured data equally well.
• Data Storage: The volume, variety, and velocity of Big
Data lead to various storage challenges. Traditional
Big Data storage is quite challenging, as Hard Disk
Drives (HDDs) fail very often and can't assure
efficient data storage with adequate data protection
mechanisms. In addition, the velocity of big data creates
a need for scalable storage management to cope
with it. Though the Cloud provides a potential solution
to the problem, with unlimited storage options that are
highly fault tolerant, transferring the Big Data
to and hosting it on the cloud is quite expensive with
this huge amount of data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
• Data Transmission: Transmission of data consists
of several stages: a) collecting data coming from
different origins, like sensors, social media, etc.; b)
integration of the data collected from the different
origins; c) controlled transfer of the integrated data
to processing platforms; d) movement of data from
server to host [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Transferring this amount of data
across the different stages is quite challenging in various
ways.
• Data Processing: Processing enormous volumes of
information requires dedicated computing resources, and
this is mostly taken care of by the increasing speed of
CPUs, networks, and storage. However, the
computing resources required for handling huge information
far surpass the processing power offered by traditional
computing paradigms.
• Data Duplication: In a big data environment, many of
the data sets have identical content and are redundant.
Data duplication increases the cost and also takes more
space.
      </p>
      <p>Benefits of Data Analytics and Compression:
Data analytics comprises collecting data, removing data
redundancy, analyzing results, etc., and compression
comprises reducing the file size to save space. Combined,
the two can be used to address our research question.
The following are the benefits of data analytics and
compression of data:
• Less disk space: As the size of big data is very large,
compressing it after analysis can reduce
disk space to a great extent. This process releases a lot
of space on the drive, which in turn reduces the time
required to retrieve the data from a server.
• Faster file transfer: After compression, the
size of the file is reduced, so transmitting
a file of reduced size is faster.
• Accessibility: Accessing managed data is relatively
easy, as it allows faster searching over hugely
populated data. Also, data can be accessed remotely from
any place with internet connectivity.
• Cost Reduction: Using the cloud as a storage medium
helps in reducing hardware costs as well as cutting
energy costs. We can rent as much space as we want at
a minimal cost.
• Virtualization: The cloud provides backup for big data
and takes the burden of growing data off the
enterprises. To provide a backup, it is recommended
to make virtual copies of the applications instead of
making physical copies for the analytics.
• Scalability: Big data management allows the
application to grow exponentially, as it deals with the storage
space by itself. It reduces the need for new servers
or supporting hardware, as it manages the data in the
existing ones.</p>
      <p>A lot of work has been done in this direction
by different authors, and the following are the
contributions of this work: (1) An exhaustive study of the existing
systems has been carried out; based on three parameters,
namely the data processing tools used, the data compression
technique, and the data storage option used, a review has
been done and summarized to give a gist of all the existing
ways to deal with Big Data. (2) Based on the
comparative study mentioned above, a few gaps in the existing
systems are extracted and a solution to fill those gaps is
discussed. This paper is structured into four sections.
After introducing the paper in section 1, we move to
section 2, discussing the works done so far in this
direction and a comparative study of the systems of different
authors that deal with big data. Section 3 presents the
gaps that come out of the previous work, and
a solution is proposed to fill these gaps. Finally, section
4 concludes the work, and targeted future work is also
mentioned.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Various systems have tried to achieve an efficient data
storage system using various techniques. Based on some
differentiating parameters, we have done a comparative
study of the different Storage Management Systems
proposed by different authors in their research work. To
identify the better technique among the various systems, we
will use the following differentiating parameters: (a) Data
Processing, (b) Data Compression, and (c) Data Storage.</p>
      <p>Different data processing/pre-processing techniques are
used by various systems to analyze big data and
reduce data redundancy, and we will differentiate the
existing systems based on these techniques as well. The
following are some important data processing tools for
big data:
• Apache Hive: Hive is a software project developed
by Apache. It is used for providing query and analysis
of Big Data. It has a SQL-like interface.
• Apache Spark: Spark is an open-source big data
analytics engine that provides an environment for cluster
computing and parallel processing of big data. It comes
with built-in modules for SQL, machine learning, graph
processing, etc.
• Hadoop MapReduce: Another tool that can be
used for programming the Big Data model is Hadoop
MapReduce. Languages like Java, Python, and C++ are
popularly used to write MapReduce programs.
• Apache Giraph: Apache Giraph is real-time graph
processing software built for high scalability which is
used to analyze social media data. Many multinational
companies are using Giraph, tweaking the software
for their purposes.
• PySpark: PySpark is a Python API that brings the
Spark programming model to Python. It is a
necessary supplementary API when one wants to process Big
Data and apply machine learning algorithms at the
same time.
• Prophet Method: Prophet is a procedure for
forecasting time series data based on an additive model
where trends are fit with yearly, weekly, and daily
seasonality, plus holiday effects. Its best results are
with time series that have strong seasonal effects and
several seasons of historical data.
• Neural Network: Neural networks are a set of
algorithms, modeled loosely after the human brain,
that are designed to recognize patterns. They interpret
sensory data through a kind of machine perception,
labeling, or clustering of raw input. Neural
networks, or connectionist systems, are computing
systems loosely inspired by the biological neural
networks that constitute human brains. Such
systems "learn" to perform tasks by considering examples,
generally without being programmed with task-specific
rules.
• MongoDB: MongoDB is an intermediate database
management system between key-value stores and traditional
Relational Database Management Systems. It is a
document-oriented DB system, is classified as a NoSQL
database system, and is a database for Big Data processing.
• NoSQL Database: It stands for "Not Only SQL" and
is used in contrast to RDBMS, where we build the
schema before the actual database and the data is
stored in the form of tables.
• Hashing: A hash function is any function that can be
used to map data of arbitrary size to fixed-size values.</p>
      <p>The values returned are used to index a table of fixed size,
termed a hash table. The result of a hash
function is known as a hash value or simply, a hash. Hashing
can be used to detect redundant records, as sketched below.</p>
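      <p>As an illustration of how hashing supports redundancy detection
(a sketch only, using Python's standard hashlib; the record format
is assumed to be plain strings):</p>
      <preformat>
# Identical records map to identical digests, so duplicates can be
# dropped before storage or compression.
import hashlib

def deduplicate(records):
    seen = set()
    unique = []
    for record in records:
        digest = hashlib.sha256(record.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique

print(deduplicate(["a", "b", "a"]))  # ['a', 'b']
      </preformat>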
      <p>Data compression techniques are also used by the
systems, as compression of data results in efficient and
faster upload of data to storage. Again, we will
differentiate the systems based on the different compression
algorithms and techniques they use. The
following are some important data compression
algorithms for structured big data files:
• Huffman Coding: It is a greedy algorithm used for
lossless compression of data. It makes sure that the
code of one symbol is not the prefix of another
code. It works on character bits. It ensures that there
is no ambiguity while decoding the bitstream (see the
sketch after this list).
• Entropy Encoding: Shannon, the father of information
theory, proposed forms of entropy encoding that are
frequently used for lossless compression. This compression
technique is based on information-theoretic techniques.
• Simple Repetition: An n-times successive
appearance of the same token in a series can be
replaced with the token and a count representing the number
of appearances. A flag is used to mark where the
repeating token appears.
• Gzip: GNU zip is a modern-day compression
algorithm whose main function is to compress and
decompress files for faster network transfer. It
reduces the size of the named file using
Lempel-Ziv coding. It is based on Huffman encoding
and uses the LZ77 approach, which looks for repeated partial
strings within the text.
• Bzip2: It is based on the Burrows-Wheeler text-sorting
algorithm and Huffman encoding, and works on
blocks that range from 100 to 900 KB. It is an
open-source compression program for files and is
free for all.</p>
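      <p>To make the prefix property of Huffman coding concrete, here is a
minimal code construction (a sketch using Python's standard heapq
and collections modules; the sample string is an arbitrary choice),
followed by one-call uses of the gzip and bzip2 compressors named
above:</p>
      <preformat>
# Huffman coding: greedily merge the two least-frequent subtrees;
# the resulting codes are prefix-free, so decoding is unambiguous.
import heapq
from collections import Counter

def huffman_codes(text):
    heap = [[freq, [sym, ""]] for sym, freq in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) != 1:
        lo = heapq.heappop(heap)   # least frequent subtree
        hi = heapq.heappop(heap)   # second least frequent subtree
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict(heap[0][1:])

print(huffman_codes("abracadabra"))
# Frequent symbols ('a') get short codes; no code prefixes another.

# The general-purpose compressors are one call each and lossless.
import bz2
import gzip

payload = b"example big data record " * 10000
gz = gzip.compress(payload)   # DEFLATE: LZ77 + Huffman
bz = bz2.compress(payload)    # Burrows-Wheeler + Huffman
print(len(payload), len(gz), len(bz))
assert gzip.decompress(gz) == payload
assert bz2.decompress(bz) == payload
      </preformat>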
      <p>Various systems have opted either for traditional
physical servers or for cloud servers to store their big data and
make the system cost-effective. So, we will also be
differentiating the storage option used by the different systems
to identify the better system among them. The following are some
important storage options that can be used in a system
to store data files:
• Docker: Docker is one of the top organizations on
the planet, offering a lightweight computing framework that
provides containers as a service. These containers are
open source and are easily accessible to
anyone free of cost. It is because of this that container
services are enjoying enormous interest. Docker makes
the containers progressively more secure and easy to use,
because it gives regular updates to the containers, which
is why a container won't compromise the speed of its
execution.
• Physical Servers: Traditionally, Big Data is stored
on physical servers and is extracted, transformed, and
translated on that same server. They are generally
managed, owned, and maintained by the company's
staff.
• Cloud: A cloud server is a virtual server running in a cloud
computing environment. It is accessed via the internet,
can be accessed remotely, and is maintained by a third
party. Customers need not pay for hardware and
software; rather, they pay for resources (see the upload
sketch after this list). Some of the cloud providers are
AWS, Microsoft Azure, Google Cloud Platform, IBM Cloud,
Rackspace, Oracle Cloud, and Verizon Cloud.
• Hard Disk Drives: Manufacturers are making major
improvements so that newer 10,000 rpm 2.5-inch Hard Disk
Drives perform better than older 15,000 rpm 3-inch
devices. These advancements include heat-assisted magnetic
recording, which boosts the storage capacity of the device.
These better Hard Disk Drives are growing rapidly in use
and provide a better environment for storing Big Data.
• Federated Cloud System: Federated cloud systems
are a combination of pre-existing or newly created
internal or external clouds that fulfill business needs. In
this combination of several clouds, the clouds may perform
different actions or a common action.</p>
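      <p>For the cloud option, uploading a compressed chunk to object
storage is typically a one-call operation. A hypothetical sketch
using AWS S3 via boto3 follows; the file name, bucket name, and
object key are placeholders, and credentials are assumed to be
configured in the environment:</p>
      <preformat>
# Upload one compressed chunk to an S3 bucket.
import boto3

s3 = boto3.client("s3")
s3.upload_file("chunk-000.gz",         # local file (placeholder)
               "my-bigdata-bucket",    # bucket name (placeholder)
               "chunks/chunk-000.gz")  # object key (placeholder)
      </preformat>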
      <p>
        The efficient storage of big data is a major issue in
most organizations, and that is why many researchers
have tried to deal with it in many different ways. A
discussion of a few works follows:
Jahanara et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposed more secure big data storage
protocols. The authors implemented this using an access
control model (using parameters) and a honeypot (for
invalid users) in a cloud computing environment. They
concluded that there is a need to change the trust and
access control procedures in a cloud location for big data
processing. This system was suitable for analyzing, storing,
and retrieving big data in a cloud environment.
      </p>
      <p>
        Krish et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] aimed to enhance the overall I/O
throughput of Big Data storage using solid-state drives (SSDs),
and for that they designed and implemented a dynamic
data management system for big data processing. The
system used a map-reduce function for data processing
and SSDs for storage. They divided the SSDs into two tiers
and kept the faster tier as a cache for frequently accessed
data. This system, as a result, shows only 5% overhead in
performance and offers inexpensive and efficient storage
management.
      </p>
      <p>
        Hongbo et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] proposed a system that reduces the
movement of the data and relieves intense I/O performance
congestion. They came up with a combination of two data
compression algorithms that effectively selects and switches
between them to accomplish the most optimal input-output
results. They experimented with a real-time application on
a cluster machine with 1,280 cores across 80 nodes. They
came to the result that the available processors and the
compression ratio are the two most important factors
affecting the decision for compression.
      </p>
      <p>
        Eugen et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] aimed to evaluate the performance and
energy efficiency of various Big Data tools and applications
in the cloud environment which use Hadoop MapReduce.
They conducted this evaluation on physical as well as
virtual clusters via different configurations. They concluded
that although Hadoop is popularly used in a virtual cloud
environment, we are still unable to find its correct
configuration.
      </p>
      <p>
        Mingchen et al. [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] propose a systematic approach to
identify trends and patterns in big data. Here Big Data
Analytics is applied to criminal data. Criminal data were
collected and analyzed. They used the Prophet model
and a neural network model for identifying the trend. In
conclusion, they found that the Prophet model works
better than the neural network model. They haven't
mentioned any compression technique used in their research.
      </p>
      <p>
        Lu et al. [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] aimed to explore duplicate detection in
news articles. They proposed a tool, NDFinder, using
a hash technique to detect article duplication. They
checked 33,244 news articles and detected 2,150 duplicate
articles. Their precision reached 97%. They matched the
hash values of different articles and reported those having
similar values. They used this research for finding
plagiarism in various articles. They also found that the 3 fields
with the highest proportion of plagiarism are sports news,
technology news, and military news.
      </p>
      <p>
        Moustafa et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] proposed an algorithm to minimize bandwidth
cost, as big data requires a lot of resources for computation
and high bandwidth for data transfer. They proposed that
a federated cloud system provides a cost-effective solution
for analyzing, storing, and computing big data. The
algorithm also proposed a way to minimize electricity
costs for analyzing and storing big data. They concluded
that their proposed algorithms perform better than the
existing approach when the application is large. Their
algorithm minimizes the cost of energy and bandwidth
utilization to a certain extent.
      </p>
      <p>
        Carlos et al. [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] presented a comparative study on obesity
based on electronic health records. They collected
20,706,947 records from different hospitals. While
loading a huge amount of data, they found that MongoDB
shows better performance. They also found that MySQL
took double the time of MongoDB for database loading.
For data query and retrieval, MySQL was found to be more
efficient. So, they concluded that MongoDB is better for
database loading and MySQL is better for data query and
data retrieval.
      </p>
      <p>
        Containers reduce the downtime in the backend and
provide a better service to the clients. It can be quoted from
another research paper, as Avanish et al. [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] proved
in their research paper that Docker containers present
better results in their benchmark testing than clusters
made on Virtual Machines. They also stated that clusters
on containers show better efficiency than those on
Virtual Machines.
      </p>
      <p>
        Based on the above-mentioned study, a conclusive
comparison has been derived which depicts how the existing
systems differ from each other based on our differentiating
parameters. Existing systems can be categorized into the
following two categories:
1. Systems with compression (in Table 1)
2. Systems without compression (in Table 2)
Findings from systems with compression:
• System 1: there is a requirement for RDF-specific
storage techniques and algorithms for building
efficient and high-performance RDF stores.
• System 2 performs parallel compression on different
processors using different compression techniques. It
discovered that the compression ratio and the available
processors are the most significant elements affecting
the compression choice.
Findings from systems without compression:
• System 1 processes Big Data using Deep Learning, a
neural network, and the Prophet model and discovered
that the Deep Learning model and the Prophet model have
better precision than the neural network model.
• System 2 implemented a tool, NDFinder, which detects
duplicated data with a precision of 97%. From 33,240
records, they detected 2,150 positive duplicate records.
• System 3 reduced the operation cost arising from
big data applications being deployed on a federated
cloud.
• System 4 compares Spark and Hadoop and found
that Spark is a faster tool for Big Data processing as
compared to Hadoop.
• System 5 analyzed the current technologies for big
data storage and processing and came to the result
that MongoDB is better for database loading and
MySQL is better for data query and data retrieval.
      </p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Systems with compression</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>System No.</th>
              <th>Paper Reference</th>
              <th>Data Processing</th>
              <th>Data Compression</th>
              <th>Storage</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>1</td>
              <td>Yuan et al. (2019) [<xref ref-type="bibr" rid="ref28">28</xref>]</td>
              <td>Hashing &amp; Indexing</td>
              <td>Huffman Encoding</td>
              <td>Physical Servers</td>
            </tr>
            <tr>
              <td>2</td>
              <td>Hongbo Zou et al. (2014) [<xref ref-type="bibr" rid="ref7">7</xref>]</td>
              <td>Not mentioned</td>
              <td>bzip2 and gzip</td>
              <td>Not mentioned</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-3">
      <title>3. Research Question &amp; Hypothesis</title>
      <p>After summarizing the works depicted in Table 1 and
Table 2, the following gaps are identified:
• Existing systems tried to reduce the size by compression,
but the results were not worth the time spent:
compression of such a huge amount of data requires a lot
of time, and the gains are not fast enough to compensate
for that time.
• The transmission of big data involves a large number
of bits, and a lot of time is consumed during transfer.</p>
      <p>As the size of data is increasing day by day, servers are
getting exhausted, requiring the installation of high-cost
hardware. That is why there is a need to store the data in
such a way that it utilizes less space and is cost-effective.
The main objective is to reduce the storage size of big
data and to provide an approach that keeps working with
the existing physical data storage even with an increased
number of users and files. The system aims to reduce
data redundancy and provide a cost-effective and flexible
environment.</p>
      <p>Expected output and outcome of the solution:
a storage management system that provides space-efficient
storage of big data, reduction of data redundancy,
and a cost-effective and flexible way to access data. Based
on the facts mentioned in the previous section, our research
revolves around the question of whether such a system
can be built using different tools and algorithms to
provide efficient yet fast storage of big data. This paper
gives an idea of how our research progresses towards this
goal. We hypothesize that different techniques can
be used and combined to create a better system than the
existing ones.</p>
      <p>Design of proposed solution:</p>
      <p>The aim is to reduce the storage space for storing big
data. As the data sets are huge and also contain a lot of
redundant data, first we will try to remove the redundant
data from the original data set. After removing redundant
data, our data set will be more accurate and precise. We
will then compress our data using any good data compression
technique, and the data will be stored on any desirable
cloud server. We propose a system that can be produced
with three modules, as depicted in Fig. 2 and described
below.</p>
      <sec id="sec-4-1">
        <title>3.1. Data Processing</title>
        <p>Big data processing is a process of handling a large
volume of data. Data processing is the act of manipulation
of data by a computer. Its main purpose is to find
valuable information in raw data. Data processing can
be of three types: manual data processing, batch data
processing, and real-time data processing. We have the
following tools and techniques for data processing:
• Apache Giraph
• Apache Hive
• Apache Spark
• Hadoop MapReduce
• PySpark
• Prophet Method
• Neural Network Method
• MongoDB
• NoSQL
• Hashing</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Data Compression</title>
        <p>Data compression is the process of reducing the
information required for storage or transmission. It
involves changing, encoding, and converting bit structures
so that the data consumes less space. A typical compression
procedure eliminates and replaces repetitive bits and
symbols to reduce size. Compression can be of two types:
lossy and lossless. We have the following tools and
techniques for data compression:
• Huffman Encoding
• Entropy Encoding
• Simple Repetition
• Bzip2
• Gzip</p>
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Data Storage</title>
        <p>Data storage refers to storing information on a storage
medium. There are many storage devices available in
which we can store data. Some of the storage devices are
magnetic tapes, disks such as floppies, and ROMs, or we
can store the data in the Cloud.</p>
        <p>
          Benchmark System: The following two works can
serve as benchmarks for any such project:
• The system discussed by Haoyu et al. in their work
“CloST: A Hadoop-based Storage System for Big
Spatio-Temporal Data Analytics” [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. They tried to reduce the
storage space and query processing time. They
proposed a three-level hierarchical partitioning approach
for parallel processing of all the objects at different
nodes. They used Hadoop MapReduce for data
processing and column-level gzip for data compression.
• The one discussed by Avanish et al. in their work
“Comparative Study of Hadoop over Containers and Hadoop
Over Virtual Machine” [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. They tried to attain a faster
parallel processing environment, developed using Hadoop
over Docker containers rather than over virtual machines.
        </p>
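        <p>Putting the three modules together, a minimal end-to-end
sketch of the proposed pipeline follows (an illustration only,
assuming string records, bzip2 as the chosen compressor, and a
placeholder object-store client with a put method; none of these
names come from the paper):</p>
        <preformat>
# Module 1: remove redundancy; Module 2: compress;
# Module 3: store the result in the cloud.
import bz2
import hashlib

def pipeline(records, store):
    seen = set()
    unique = []
    for record in records:                 # Module 1: de-duplication
        digest = hashlib.sha256(record.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    blob = bz2.compress("\n".join(unique).encode("utf-8"))  # Module 2
    store.put("dataset.bz2", blob)   # Module 3: cloud upload (stub)
        </preformat>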
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Conclusion</title>
      <p>A review has been done in this paper of data processing
and compression techniques, noting that each technique has
its benefit for a specific type of data set. This effort is
made to explore as many important techniques as possible
from among all the existing techniques used for the purpose.
We have also proposed an architecture that can achieve the
initial objective of reducing storage space. To achieve this,
the architecture first performs analytics on the dataset, and
then its size is reduced to an extent that makes it easy to
store. After analysis and compression, the resultant dataset
is stored in a cloud environment to provide better scalability
and accessibility. Much research has already been conducted
to resolve the issue of efficient big data storage, but some
more persuasive steps still need to be taken.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
          </string-name>
          , and Yike Guo. ”
          <article-title>Wiki-health: from quantified self to self-understanding</article-title>
          .
          <source>” Future Generation Computer Systems</source>
          <volume>56</volume>
          (
          <year>2016</year>
          ):
          <fpage>333</fpage>
          -
          <lpage>359</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Bernard</given-names>
            <surname>Marr</surname>
          </string-name>
          . ”
          <article-title>How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read</article-title>
          .”
          <source>Forbes, 21 May</source>
          <year>2018</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Yang</surname>
          </string-name>
          , Chaowei, Yan Xu, and Douglas Nebert.
          <article-title>”Redefining the possibility of digital Earth and geosciences with spatial cloud computing</article-title>
          .”
          <source>International Journal of Digital Earth 6.4</source>
          (
          <year>2013</year>
          ):
          <fpage>297</fpage>
          -
          <lpage>312</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <surname>Qunying</surname>
          </string-name>
          , et al. ”
          <article-title>Cloud computing for geosciences: deployment of GEOSS clearinghouse on Amazon's EC2</article-title>
          .”
          <source>Proceedings of the ACM SIGSPATIAL international workshop on high performance and distributed geographic information systems</source>
          .
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Akhtar</surname>
          </string-name>
          ,
          <string-name>
            <surname>Jahanara</surname>
          </string-name>
          , et al. ”
          <article-title>Big Data Security with Access Control Model and Honeypot in Cloud Computing</article-title>
          .”
          <source>International Journal of Computer Applications</source>
          <volume>975</volume>
          :
          <fpage>8887</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Krish</surname>
            ,
            <given-names>K. R.</given-names>
          </string-name>
          , et al. ”
          <article-title>On efficient hierarchical storage for big data processing</article-title>
          .
          <source>” 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)</source>
          . IEEE,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <surname>Hongbo</surname>
          </string-name>
          , et al. ”
          <article-title>Improving I/O performance with adaptive data compression for big data applications</article-title>
          .”
          <source>2014 IEEE International Parallel &amp; Distributed Processing Symposium Workshops. IEEE</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Feller</surname>
            , Eugen,
            <given-names>Lavanya</given-names>
          </string-name>
          <string-name>
            <surname>Ramakrishnan</surname>
          </string-name>
          , and Christine Morin. ”
          <article-title>Performance and energy eficiency of big data applications in cloud environments: A Hadoop case study</article-title>
          .
          <source>” Journal of Parallel and Distributed Computing</source>
          <volume>79</volume>
          (
          <year>2015</year>
          ):
          <fpage>80</fpage>
          -
          <lpage>89</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Tan</surname>
            , Haoyu,
            <given-names>Wuman</given-names>
          </string-name>
          <string-name>
            <surname>Luo</surname>
          </string-name>
          , and
          <string-name>
            <surname>Lionel</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ni</surname>
          </string-name>
          . ”
          <article-title>Clost: a hadoop-based storage system for big spatiotemporal data analytics</article-title>
          .
          ”
          <source>Proceedings of the 21st ACM international conference on Information and Knowledge Management</source>
          .
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Prasad</surname>
            ,
            <given-names>Bakshi</given-names>
          </string-name>
          <string-name>
            <surname>Rohit</surname>
          </string-name>
          , and Sonali Agarwal. ”
          <article-title>Comparative study of big data computing and storage tools: a review</article-title>
          .”
          <source>International Journal of Database Theory and Application</source>
          <volume>9</volume>
          .1 (
          <year>2016</year>
          ):
          <fpage>45</fpage>
          -
          <lpage>66</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Schneider</surname>
            ,
            <given-names>Robert D.</given-names>
          </string-name>
          ”
          <article-title>Hadoop for dummies</article-title>
          .” John Willey &amp; sons (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Prasetyo</surname>
          </string-name>
          ,
          <string-name>
            <surname>Bayu</surname>
          </string-name>
          , et al. ”
          <article-title>A review: evolution of big data in developing country</article-title>
          .
          <source>” Bulletin of Social Informatics Theory and Application</source>
          <volume>3</volume>
          .1 (
          <year>2019</year>
          ):
          <fpage>30</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Najm</surname>
          </string-name>
          , Moustafa, and Venkatesh Tamarapalli. ”
          <source>Cost-efficient Deployment of Big Data Applications in Federated Cloud Systems.” 2019 11th International Conference on Communication Systems &amp; Networks (COMSNETS)</source>
          . IEEE,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Chatterjee</surname>
          </string-name>
          , Amlan, Rushabh Jitendrakumar Shah, and
          <string-name>
            <surname>Khondker</surname>
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Hasan</surname>
          </string-name>
          . ”
          <article-title>Efficient Data Compression for IoT Devices using Huffman Coding Based Techniques</article-title>
          .”
          <source>2018 IEEE International Conference on Big Data (Big Data)</source>
          . IEEE,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Chao</surname>
          </string-name>
          , et al. ”
          <article-title>Robot: An eficient model for big data storage systems based on erasure coding</article-title>
          .”
          <source>2013 IEEE International Conference on Big Data. IEEE</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <surname>Chaowei</surname>
          </string-name>
          , et al. ”
          <article-title>Big Data and cloud computing: innovation opportunities and challenges</article-title>
          .”
          <source>International Journal of Digital Earth 10.1</source>
          (
          <year>2017</year>
          ):
          <fpage>13</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Jagadish</surname>
            ,
            <given-names>Hosagrahar V.</given-names>
          </string-name>
          , et al. ”
          <article-title>Big data and its technical challenges</article-title>
          .”
          <source>Communications of the ACM 57.7</source>
          (
          <year>2014</year>
          ):
          <fpage>86</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Bryant</surname>
            , Randal,
            <given-names>Randy H.</given-names>
          </string-name>
          <string-name>
            <surname>Katz</surname>
            , and
            <given-names>Edward D.</given-names>
          </string-name>
          <string-name>
            <surname>Lazowska</surname>
          </string-name>
          . ”
          <article-title>Big-data computing: creating revolutionary breakthroughs in commerce, science and society</article-title>
          .” (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <surname>Divyakant</surname>
          </string-name>
          , et al. ”
          <article-title>Challenges and opportunities with Big Data 2011-1</article-title>
          .” (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Xin</surname>
            ,
            <given-names>Luna</given-names>
          </string-name>
          <string-name>
            <surname>Dong</surname>
          </string-name>
          . ”
          <article-title>Big Data Integration (Synthesis Lectures on Data Management)</article-title>
          .
          <source>”</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <surname>Nawsher</surname>
          </string-name>
          , et al. ”
          <article-title>Big data: survey, technologies, opportunities, and challenges</article-title>
          .”
          <source>The scientific world journal</source>
          <year>2014</year>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Kodituwakku</surname>
            ,
            <given-names>S. R.</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>U. S.</given-names>
            <surname>Amarasinghe</surname>
          </string-name>
          . ”
          <article-title>Comparison of lossless data compression algorithms for text data</article-title>
          .”
          <source>Indian journal of computer science and engineering 1</source>
          .4 (
          <year>2010</year>
          ):
          <fpage>416</fpage>
          -
          <lpage>425</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <surname>Mingchen</surname>
          </string-name>
          , et al. ”
          <article-title>Big data analytics and mining for efective visualization and trends forecasting of crime data</article-title>
          .
          <source>” IEEE Access 7</source>
          (
          <year>2019</year>
          ):
          <fpage>106111</fpage>
          -
          <lpage>106123</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Lu</surname>
            , Lu, and
            <given-names>Pengcheng</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
          </string-name>
          . ”
          <source>Duplication Detection in News Articles Based on Big Data.” 2019 IEEE 4th International Conference on Cloud Computing and Big Data Analysis (ICCCBDA)</source>
          . IEEE,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25] Cheng, Yan, Qiang Zhang, and Ziming Ye.
          <source>”Research on the Application of Agricultural Big Data Processing with Hadoop and Spark.” 2019 IEEE International Conference on Artificial Intelligence and Computer</source>
          Applications (ICAICA). IEEE,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Martinez-Millana</surname>
          </string-name>
          , Carlos, et al. ”
          <article-title>Comparing data base engines for building big data analytics in obesity detection</article-title>
          .”
          <source>2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS)</source>
          . IEEE,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Pandey</surname>
            ,
            <given-names>Manish</given-names>
          </string-name>
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          , and Karthikeyan Subbiah.
          <article-title>”A novel storage architecture for facilitating eficient analytics of health informatics Big Data in cloud</article-title>
          .”
          <source>2016 IEEE International Conference on Computer and Information Technology (CIT)</source>
          . IEEE,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <surname>Pingpeng</surname>
          </string-name>
          , et al. ”
          <article-title>Big RDF Data Storage, Computation, and Analysis: A Strawman's Arguments</article-title>
          .”
          <source>2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS)</source>
          . IEEE,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <surname>Avanish</surname>
          </string-name>
          , et al. ”
          <article-title>Comparative Study of Hadoop over Containers and Hadoop Over Virtual Machine</article-title>
          .”
          <source>International Journal of Applied Engineering Research 13.6</source>
          (
          <year>2018</year>
          ):
          <fpage>4373</fpage>
          -
          <lpage>4378</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Sayood</surname>
            ,
            <given-names>Khalid.</given-names>
          </string-name>
          <article-title>Introduction to data compression</article-title>
          . Morgan Kaufmann,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>