<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Analysis of Vulnerabilities in Hadoop Map Reduce Framework: A Review</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shubham Jambhulkar</string-name>
          <email>shubham.pj0806@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Deepak Singh Tomar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>R K Pateriya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Big Data</institution>
          ,
          <addr-line>Map-Reduce, Hadoop, Vulnerability, Kerberos</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Maulana Azad National Institute of Technology</institution>
          ,
          <addr-line>Bhopal</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Big Data is a collection of diverse hardware and software technologies built on a heterogeneous infrastructure. The Hadoop framework plays the main role in managing and storing it. It provides intelligent, economical, and fast data processing applied in domains such as healthcare, social networks, and defense. The Hadoop framework is based on a distributed model and is used to manage and store data across a wide range of commodity computers. Because of the flexibility of the system, several vulnerabilities emerge. These vulnerabilities are threats to the data and lead to attacks. In this paper, various kinds of vulnerabilities are discussed and potential solutions are given to reduce or eliminate them. The test setup used to perform common attacks, in order to understand the concept and implementation of a solution for avoiding those attacks, is presented. The results show the impact of attacks on performance. According to the results, data needs to be protected using defense in depth.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Big Data is a collection of extremely large datasets [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] that are
too complex or too large to be handled by conventional data processing
applications. For any data to be regarded as big data, it must satisfy the 4 V’s, namely Velocity,
Veracity, Volume, and Variety [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. With the advancement of technology, a large
amount of information is produced by sources such as social networking sites, transaction
records, sensors, and log files. Because of these varied sources, terabytes of structured,
semi-structured, and unstructured data are produced at every moment. If this data is not
stored or pre-processed, important data may be lost. To avoid this loss, the Hadoop
framework is used with different analytics tools, which are often much faster than the conventional
analytical methods of the past.
      </p>
      <p>Big data is a term closely associated with Hadoop. As previously discussed, for any data to</p>
      <p>2022 Copyright for this paper by its authors.</p>
      <p>
        Big data draws valuable insights from the data pool, and many countries are running
major schemes on the basis of big data. As a result of these global-level schemes, many new models
and frameworks have been developed. Some frameworks were developed to provide substantial
storage capacity, real-time data analysis, and parallel processing of data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. One
such popular framework is Hadoop. The advantages offered by big data are vast. The technology
offers better scalability and flexibility at affordable rates. Following recent
growth in sustainable technology, the cost of processing and storage
continues to decrease [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>Recently developed technologies are designed to guarantee privacy and security better than
traditional technologies. Even with these advantages, they are increasingly
misused for malicious purposes. With the growing number of fields and organizations using this technology
to store and process their private data, it has become a target for
attacks.</p>
    </sec>
    <sec id="sec-2">
      <title>1.1. Hadoop Framework</title>
      <p>Apache Hadoop provides a way to process very large or
complex datasets in parallel across a distributed cluster. The Hadoop framework offers advantages such as distributed computing and
parallel processing of datasets. Hadoop comprises components such as HDFS, MapReduce, and
YARN. HDFS manages storage, MapReduce manages parallel processing, and YARN
is responsible for resource management in the Hadoop cluster.</p>
      <sec id="sec-2-1">
        <title>1.1.1. Hadoop</title>
        <p>
          Hadoop first appeared in 2005 and was released later, in 2011, to support the distribution of the web
search engine project at Yahoo [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The release had very little security support, as it was made for users
in a trusted environment. Hadoop has since emerged among the state-of-the-art
technologies to store, process, and analyze large data using clusters in a distributed
environment [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Users of the framework subsequently spread around the world,
mostly among large organizations [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>1.1.2. Hadoop Distributed File System</title>
      <p>The Hadoop Distributed File System (HDFS) is responsible for the storage mechanism of the Hadoop framework
and is designed to operate without interruption. In HDFS, a large file is distributed over a cluster
network that comprises multiple Data Nodes and their associated storage. During this process,
the Name Node splits the original file into blocks, 64 MB in size, and replicates them on various
Data Nodes according to predefined rules. The Name Node also holds the
metadata for this replication and distribution. Every data block is replicated multiple
times for high availability: two copies on Data Nodes in one rack and one copy on a Data Node in a different rack. Each
Data Node in the cluster stores only a portion of the whole file. The Name Node keeps track of which
data blocks belong to which file, where the data blocks are located, and
how much storage capacity is in use. Using periodic signals, the Name Node always knows which
Data Nodes are still available. When the signal (heartbeat) is missing, the Name Node detects a Data
Node failure, removes the failed Data Node from the Hadoop cluster, and attempts to distribute the
data load evenly across the remaining Data Nodes. In addition, the Name Node ensures that the
specified number of replicas is maintained for high availability.
Figure 2 shows the architecture of the Hadoop Distributed File System.</p>
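      <p>The block splitting and heartbeat monitoring described above can be sketched in a few lines. This is only an illustrative model and not HDFS code: the 64 MB block size and the three replicas come from the text, while the function names and the 30-second heartbeat timeout are assumptions made for the example.</p>
      <preformat><![CDATA[
```python
import math

BLOCK_SIZE_MB = 64        # block size stated in the text
REPLICATION_FACTOR = 3    # two replicas in one rack, one in another
HEARTBEAT_TIMEOUT_S = 30  # hypothetical timeout, chosen for illustration


def block_count(file_size_mb):
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)


def raw_storage_mb(file_size_mb):
    """Total cluster storage consumed once every block is replicated."""
    return file_size_mb * REPLICATION_FACTOR


def failed_nodes(last_heartbeat, now):
    """Return the Data Nodes whose last heartbeat is older than the timeout."""
    return sorted(
        node for node, seen in last_heartbeat.items()
        if now - seen > HEARTBEAT_TIMEOUT_S
    )


if __name__ == "__main__":
    print(block_count(200))       # blocks for a 200 MB file
    print(raw_storage_mb(200))    # storage used after replication
    print(failed_nodes({"dn1": 100, "dn2": 60}, now=105))
```
]]></preformat>
      <p>In this model, a Data Node is reported as failed only when no heartbeat has arrived within the timeout, which mirrors how the Name Node decides that replicas must be re-created elsewhere.</p>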
    </sec>
    <sec id="sec-4">
      <title>1.1.3. Map Reduce</title>
      <p>MapReduce is a parallel processing framework based on the master-slave principle, like the
Hadoop Distributed File System. MapReduce consists of one master agent (the JobTracker) per cluster and
slave agents on the slave nodes. MapReduce job management relies on various algorithms for
scheduling and failure handling. It works in two phases: the map function and the reduce function. The
JobTracker divides the dataset into separate chunks called map tasks and automatically directs them to
Data Nodes across all participating commodity computers in the cluster for
parallel processing.</p>
      <p>Figure 3 shows a block diagram of the MapReduce phases.</p>
      <p>Ordinarily, the map tasks run on the same cluster Data Nodes where the data resides
(data locality). If a node is already heavily loaded, another node that is close to the data,
ideally a node in the same rack, is chosen. Intermediate results are not visible to the user and are
exchanged among the nodes (shuffling) and thereafter merged by the reduce tasks to
produce the final result. Figure 4 shows the internal algorithm of MapReduce.
Table 1 shows the internal key-value pair structure in MapReduce.</p>
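      <p>The map, shuffle, and reduce phases, and the key-value shapes that pass between them, can be illustrated with a small single-process word count. This is a sketch of the programming model only, assuming in-memory lists instead of Data Nodes; it does not use the Hadoop API, and all function names are chosen for the example.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict


def map_phase(line):
    """Map: emit one (key, value) pair per word."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group values by key, giving (key, list(value))."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key, values):
    """Reduce: collapse (key, list(value)) into a single (key, value)."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```
]]></preformat>
      <p>In a real cluster the map calls run on the nodes that hold the input blocks and the shuffle moves data over the network; here everything runs in one process so that the data flow is easy to follow.</p>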
      <sec id="sec-4-1">
        <title>Table 1. Output: (Key, Value); (Key, list(Value)); (Key, Value)</title>
        <p>Intermediate results of the map phase are aggregated and kept as small as
possible to reduce the data transferred to the reduce tasks. Intermediate results are stored in
the local file system of the nearest Data Node. In the event of a failure, the JobTracker responds by
restarting the affected task. If a task does not report any progress within a predetermined time, or
if a Data Node fails completely, all tasks that have not yet finished,
including those of the failed node, are restarted on another
server. If a task runs very slowly, the
JobTracker likewise restarts the task on another server so that the overall job completes in a
reasonable time.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>1.1.4. Yet Another Resource Negotiator</title>
      <p>
        MapReduce was split into two components: Yet Another Resource Negotiator (YARN) and MapReduce [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
The primary principle of YARN is to separate the resource management and job
scheduling functionalities into independent daemons. A Resource Manager arbitrates resources among
the applications in the system, with the help of Node Managers. The Resource Manager has two main components:
the Applications Manager and the Scheduler. The Scheduler allocates resources to the various running
applications depending on their resource requirements. The Resource Manager accepts job
submissions, and each job is handed to the Applications Manager. Figure 5
represents the YARN log file architecture.
      </p>
    </sec>
    <sec id="sec-6">
      <title>2. Related Work</title>
      <p>This section reviews and analyzes different vulnerabilities in the Hadoop framework.
Our main area of concern is the MapReduce functionality provided by the Apache Hadoop
framework. Numerous tools and methods have been defined to tackle the problem of
vulnerabilities, but each tool provides only one functionality while leaving several other holes in
the system. The use of the Kerberos system with proper authorization is suggested to tackle multiple
problems responsible for vulnerabilities.</p>
    </sec>
    <sec id="sec-7">
      <title>3. Literature Review</title>
      <p>
        In a report [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], vulnerabilities are categorized into data privacy, infrastructure security, and data
management. These categories are further classified along three dimensions: dimensional
modeling, architecture dimension, and information flow. The data life cycle covers data in
transit, and data privacy covers data at rest.
      </p>
      <p>
        According to another report [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], big data security and privacy issues fall into five categories,
namely Hadoop security, key management, anonymization, monitoring, and auditing. The author
also proposed algorithms for securing and monitoring sensitive information,
such as the Bull Eye algorithm.
      </p>
      <p>
        Another report [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] on cloud security designed a security model for a proposed layered cloud
infrastructure. The model was further classified into four categories:
logical, basic, governance, and value-added security. This report specifies the infrastructure policy
framework of Hadoop.
      </p>
      <p>
        In another report [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], the author describes different types of attacks that have taken place on
the Hadoop framework, namely Denial of Service, Man in the Middle,
impersonation, repudiation, and replay attacks. According to the author, the distributed
nature of the MapReduce component of Hadoop makes a wide range of attacks possible, leaving
it in a vulnerable state. An ideal MapReduce component would provide proper
authentication control, access control, authorization, data confidentiality, and data availability
for the mapper and reducer classes of MapReduce. For better authentication control, the author recommends
the Kerberos protocol.
      </p>
      <p>
        In one more report [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], the challenges to security and privacy were categorized
into access control, access control policy, data confidentiality, and
smart objects. The report puts forward the research challenges faced in developing comprehensive
solutions for security and privacy.
      </p>
      <p>
        Another report [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] lists the challenges faced in ensuring privacy and security.
The challenges were broadly classified into privacy risks,
credibility of data, lack of recent technologies, and threats. To cope with these challenges, the author
introduces data supervision, protection mechanisms, a protection agency, and data quality.
      </p>
      <p>
        In another report [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], different categories of security and privacy aspects, and the connections
between them, were discussed briefly. The aspects were classified as confidentiality, analytics,
integrity, privacy, stream processing, data format, and visualization.
      </p>
      <p>
        Another report [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] presents a simple and effective investigation of corporate perspectives on big
data. These perspectives cover the economic
perspective, investment decisions, fighting cybercrime, and cyber insurance. Table 2 lists
the vulnerabilities reported in online databases.
      </p>
      <p>Cross-Site Scripting: 7, 6, 9, 6, 5, 17, 22, 9, 14, 16, 10</p>
    </sec>
    <sec id="sec-8">
      <title>3.1. Vulnerability Databases</title>
      <p>Numerous online databases are available on the internet whose main purpose is
to expose possible security vulnerabilities in products and hardware.
Examples include Common Vulnerabilities and Exposures (CVE), the Computer
Emergency Readiness Team, the National Vulnerability Database, and the Open Source Vulnerability
Database.</p>
      <p>
        CVE uniquely identifies each vulnerability with an identification number. The list of
vulnerabilities encountered in Hadoop, based on CVE, is shown in tabular form below [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
Table 3 shows the different vulnerabilities reported in CVE.
      </p>
      <sec id="sec-8-1">
        <title>An issue was discovered in gif2apng 1.9. There is a heap-based buffer overflow in the main function. It allows an attacker to write 2 bytes outside the boundaries of the buffer.</title>
      </sec>
      <sec id="sec-8-2">
        <title>OpenWrt 21.02.1 allows XSS via the NAT Rules Name screen.</title>
      </sec>
      <sec id="sec-8-3">
        <title>In Apache Hadoop 2.8.0 the LinuxContainerExecutor runs docker commands as root with insufficient input validation. When the docker feature is enabled, authenticated users can run commands as root.</title>
      </sec>
      <sec id="sec-8-4">
        <title>HDFS clients interact with a servlet on the Data Node to browse the HDFS namespace. The Name Node is provided as a query parameter that is not validated in Apache Hadoop before 2.7.0.</title>
      </sec>
      <sec id="sec-8-5">
        <title>The HDFS web UI in Apache Hadoop before 2.7.0 is vulnerable to a cross-site scripting (XSS) attack through an unescaped query parameter.</title>
      </sec>
      <sec id="sec-8-6">
        <title>Vulnerability in Apache Hadoop 3.0.0 allows a cluster user to expose private files owned by the user running the MapReduce job history server process. The malicious user can construct a configuration file containing XML directives that reference sensitive files on the MapReduce job history server host.</title>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>3.2. Patch Management</title>
      <p>
        Patch management is a mechanism for detecting and eliminating vulnerabilities before
attackers try to exploit them. Its effectiveness depends directly on how quickly
vulnerabilities are detected and rectified, using methods such as scanning, testing, and code
review. According to a 2017 report, the use of scanning methods has been rising gradually
worldwide [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
    </sec>
    <sec id="sec-10">
      <title>3.3. Security Issues in Hadoop</title>
      <p>Hadoop was designed primarily for performance, not for security.
The developers decided to add security functionality over time and to focus first on the
efficiency of the framework. As a result, the security mechanism of Hadoop was very weak and prone to
many attacks. Because of recent
attacks, researchers are now focusing on the security aspects of Hadoop. However, there is presently
no evaluation method for the security policies of Hadoop.</p>
      <p>Because of the recent growth of Big Data, the available security policies are not mature enough to
be considered for evaluation. The Hadoop ecosystem comprises a collection of different
applications, each of which requires its own security mechanism to function properly for
Big Data.</p>
      <p>
        Out of all the models previously proposed to work with big data, Hadoop stood out
because of its distributed system with parallel processing, but it lacked security policies.
Whereas the distributed nature of Hadoop was previously favored, distributed
computing now poses a set of new vulnerabilities for professionals and security managers [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
    </sec>
    <sec id="sec-11">
      <title>3.4. Security threats and possible attacks</title>
      <p>
        Any possible danger to an information system can be referred to as a threat. A threat is essentially
what an attacker tries to identify and use to attack a company or organization [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Readers
are likely already familiar with the CIA triad: for any system to be regarded as secure, it must satisfy
confidentiality, integrity, and availability. To support confidentiality,
an authentication server can be implemented to control access to the whole system.
      </p>
    </sec>
    <sec id="sec-12">
      <title>3.4.1. Impersonation Attacks</title>
      <p>This type of attack occurs when an attacker tries to impersonate a registered or legitimate
authority to access resources. Attackers can use different tools and
methods to steal sensitive information by attacking the Hadoop cluster directly, leaving the system
vulnerable. To perform an impersonation attack, an attacker can try to replay the acknowledgement
received from the Kerberos protocol. Once attackers gain access to the Hadoop framework,
they can perform actions such as leaking data and throttling the processing time of MapReduce.</p>
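      <p>A common defense against replayed authentication messages is to accept each timestamped message only once and only within a short time window. The sketch below illustrates that idea in simplified form. It is not the Kerberos implementation: the class name, the method signature, and the 300-second window are assumptions made for the example.</p>
      <preformat><![CDATA[
```python
MAX_SKEW_S = 300  # acceptance window in seconds, assumed for illustration


class ReplayCache:
    """Rejects authenticators that are stale or have been seen before."""

    def __init__(self, max_skew_s=MAX_SKEW_S):
        self.max_skew_s = max_skew_s
        self.seen = set()  # (client, timestamp) pairs already accepted

    def accept(self, client, timestamp, now):
        """Return True only for a fresh, previously unseen authenticator."""
        # Forget entries that are already outside the window; they would
        # be rejected as stale anyway, so the cache stays small.
        self.seen = {
            entry for entry in self.seen
            if abs(now - entry[1]) <= self.max_skew_s
        }
        if abs(now - timestamp) > self.max_skew_s:
            return False  # too old or too far in the future
        key = (client, timestamp)
        if key in self.seen:
            return False  # exact replay of an accepted message
        self.seen.add(key)
        return True


if __name__ == "__main__":
    cache = ReplayCache()
    print(cache.accept("alice", 1000, now=1010))  # first use
    print(cache.accept("alice", 1000, now=1020))  # replayed copy
```
]]></preformat>
      <p>The first call is accepted and the replayed copy is rejected. A message captured and replayed after the window has passed is rejected by the timestamp check alone.</p>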
    </sec>
    <sec id="sec-13">
      <title>3.4.2. Denial-of-Service Attacks</title>
      <p>
        A Denial-of-Service attack [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] is a type of attack in which an attacker floods the system with an enormous
number of requests, leaving it unable to allocate resources to legitimate users. As per the report,
more than 11247 attacks have taken place, of which 5 were able to breach
security. The flood of requests or traffic
causes the system's servers to crash or halts all operations. A Denial-of-Service attack can be initiated in two
ways: by crashing the services or by flooding them. Hadoop components such as the Name Node and
the authentication server are prone to Denial-of-Service attacks. A simple Denial-of-Service attack on
the Name Node is enough to halt all MapReduce operations and stop read and write operations on
the Hadoop Distributed File System.
      </p>
    </sec>
    <sec id="sec-14">
      <title>3.4.3. Cross-Site Scripting</title>
      <p>
        Cross-site Scripting [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] is a type of attack in which malicious code is injected into a vulnerable web
application. Cross-site Scripting differs from other attacks in that the application itself
is not the intended target; rather, the users of the web application are at risk.
Cross-site Scripting attacks can be categorized into two types: stored and reflected. Stored attacks,
also known as persistent attacks, are more damaging than reflected attacks because the script is
injected directly into the vulnerable web application. In a reflected attack, the malicious script is reflected
directly onto the user's web browser.
      </p>
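      <p>The usual remedy for an unescaped query parameter is to escape HTML metacharacters before the value is written into a page. The sketch below contrasts a vulnerable and an escaped rendering using Python's standard <monospace>html.escape</monospace> function. The function names and the page fragment are hypothetical and are not taken from the Hadoop web UI.</p>
      <preformat><![CDATA[
```python
from html import escape


def render_unsafe(path):
    """Vulnerable: the query parameter is copied into the page as-is."""
    return "<p>Browsing: " + path + "</p>"


def render_safe(path):
    """Escapes HTML metacharacters so the browser treats input as text."""
    return "<p>Browsing: " + escape(path, quote=True) + "</p>"


if __name__ == "__main__":
    payload = "<script>alert(1)</script>"
    print(render_unsafe(payload))  # script tag reaches the browser intact
    print(render_safe(payload))    # script tag is shown as harmless text
```
]]></preformat>
      <p>With the escaped version, a reflected payload is displayed as text instead of being executed in the user's browser.</p>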
    </sec>
    <sec id="sec-15">
      <title>3.4.4. Present Attacks</title>
      <p>
        Because of its open ports and exposed IP addresses, the Hadoop framework has always been a target
for attackers; around 5307 Hadoop clusters have been exposed with
vulnerable security settings that attackers use to exploit the framework [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. Shodan is an online
search engine designed to show details of servers and the peripheral devices connected to
them over the internet. The advantage of Shodan is that it helps in
recommending security policies; the disadvantage is that attackers use it to exploit
systems. To counter such attacks and stop data theft, strategies with strong security policies must
be implemented.
      </p>
      <p>Table 4 gives a comparative analysis of attacks that have been carried out on
Hadoop.</p>
      <sec id="sec-15-1">
        <title>Authors</title>
        <p>Jose Ancy Sherin [29]; M Mizukoshi [26]; Xianqing Yu [27]; Bhathal Gurjit Singh [4]; Bhathal Gurjit Singh [4]</p>
      </sec>
      <sec id="sec-15-2">
        <title>Years</title>
        <p>2014; 2015; 2019 (three entries)</p>
      </sec>
      <sec id="sec-15-3">
        <title>Attacks</title>
        <p>Cloud Attacks; DNS reflection amplification; Distributed Denial of Service</p>
      </sec>
      <sec id="sec-15-4">
        <title>Security aspects</title>
        <p>Confidentiality, authentication, authorization; Confidentiality; Confidentiality</p>
      </sec>
      <sec id="sec-15-5">
        <title>Description</title>
        <p>Impersonation: This type of attack occurs when an attacker tries to impersonate the registered or legitimate authority for accessing the resources.</p>
        <p>Denial of Service: A type of attack where an attacker floods the system with an enormous number of requests, which makes the system unable to allocate resources to legitimate users.</p>
        <p>Cross-site Scripting: A type of attack where malicious code is injected into any web application that is vulnerable.</p>
        <p>Data Leakage: The unauthorized transmission of information from inside an organization to an external destination or recipient.</p>
        <p>DNS reflection: Essentially a type of Distributed Denial of Service attack.</p>
        <p>Distributed Denial of Service: A DDoS attack involves many connected online devices, collectively known as a botnet, which are used to overwhelm a target website with fake traffic.</p>
        <p>Using the connection characteristics of the public cloud, an attacker can try to hide his breaches.</p>
        <p>Sending of packets to a specific port on the host.</p>
        <p>Repetitive initiation of a connection without establishing it, to keep the server busy.</p>
      </sec>
      <sec id="sec-15-6">
        <title>Challenges</title>
        <p>How to authenticate that the person is actually legitimate and not impersonated by an attacker.</p>
        <p>The collection of attacks can be diverse or complex.</p>
        <p>Set up some anti-triggered methods to avoid hijacking of user accounts.</p>
        <p>Avoid leaking, destruction, and corruption of confidential information.</p>
        <p>Misconfiguration of DNS leads to DDoS.</p>
        <p>Manual intervention requirement is too much.</p>
        <p>How to avoid misconfiguration, unauthorized access, and hijacking.</p>
        <p>How to avoid overcomplication of the application.</p>
        <p>How to configure the firewall and set up an IPS.</p>
      </sec>
    </sec>
    <sec id="sec-16">
      <title>4. Acknowledgements</title>
      <p>This paper and the research behind it were only possible because of the guidance of my guide, Dr.
Deepak Singh Tomar, Associate Professor at MANIT Bhopal. His attention to detail helped
keep my work on track from the first encounter.</p>
      <p>I would also like to thank my other supervisor, Dr. R K Pateriya, Professor at MANIT Bhopal, for
his encouragement and guidance in carrying out the project work. I also thank MANIT Bhopal for
giving me the opportunity to embark on this project.</p>
    </sec>
    <sec id="sec-17">
      <title>5. Conclusion</title>
      <p>In this study, big data vulnerabilities, security threats, and possible attacks were
reviewed for a popular framework, Hadoop. Hadoop was designed
to provide maximum efficiency, but the exponential growth of big data has left it
vulnerable to attacks and lacking in security policies, mechanisms, and proper access control.
To make Hadoop a more reliable and secure framework, a proper authentication server with
authorization and auditing is required. In addition, an ideal framework would include a mechanism
to ensure data protection.</p>
    </sec>
    <sec id="sec-18">
      <title>6. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Lai</surname>
            <given-names>TL</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuan</surname>
            <given-names>H</given-names>
          </string-name>
          .
          <article-title>Stochastic approximation: from statistical origin to big-data, multidisciplinary applications</article-title>
          .
          <source>Statistical Science</source>
          . 2021 Apr;
          <volume>36</volume>
          (
          <issue>2</issue>
          ):
          <fpage>291</fpage>
          -
          <lpage>302</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Yun</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Manzhu</given-names>
            <surname>Yu</surname>
          </string-name>
          , Mengchao Xu,
          <string-name>
            <given-names>Jingchao</given-names>
            <surname>Yang</surname>
          </string-name>
          , Dexuan Sha, Qian Liu, and
          <string-name>
            <given-names>Chaowei</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>"Big data and cloud computing."</article-title>
          <source>In Manual of Digital Earth</source>
          , pp.
          <fpage>325</fpage>
          -
          <lpage>355</lpage>
          . Springer, Singapore,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Oussous</surname>
            ,
            <given-names>Ahmed</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Fatima-Zahra</given-names>
            <surname>Benjelloun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ayoub</given-names>
            <surname>Ait Lahcen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Samir</given-names>
            <surname>Belfkih</surname>
          </string-name>
          .
          <article-title>"Big Data technologies: A survey."</article-title>
          <source>Journal of King Saud University-Computer and Information Sciences</source>
          <volume>30</volume>
          , no.
          <issue>4</issue>
          (
          <year>2018</year>
          ):
          <fpage>431</fpage>
          -
          <lpage>448</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Bhathal</surname>
            ,
            <given-names>Gurjit Singh</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>Amardeep</given-names>
            <surname>Singh</surname>
          </string-name>
          .
          <article-title>"Big data: Hadoop framework vulnerabilities, security issues and attacks."</article-title>
          <source>Array</source>
          <volume>1</volume>
          (
          <year>2019</year>
          ):
          <fpage>100002</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Braun</surname>
            ,
            <given-names>T. D.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Siegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Beck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Bölöni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maheswaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. I.</given-names>
            <surname>Reuther</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Theys</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hensgen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. F.</given-names>
            <surname>Freund</surname>
          </string-name>
          .
          <article-title>"A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems."</article-title>
          <source>Journal of Parallel and Distributed Computing</source>
          <volume>61</volume>
          , no.
          <issue>6</issue>
          (
          <year>2001</year>
          ):
          <fpage>810</fpage>
          -
          <lpage>837</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Gautam</surname>
            ,
            <given-names>Akansha</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>Indranath</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          .
          <article-title>"Big data and cloud computing: A critical review."</article-title>
          <source>International Journal of Operations Research and Information Systems (IJORIS)</source>
          <volume>11</volume>
          , no.
          <issue>3</issue>
          (
          <year>2020</year>
          ):
          <fpage>19</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>Xiaojun</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Feng</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ping</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lei</given-names>
            <surname>Ju</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Zhiping</given-names>
            <surname>Jia</surname>
          </string-name>
          .
          <article-title>"SLA-aware energy-efficient scheduling scheme for Hadoop YARN."</article-title>
          <source>The Journal of Supercomputing</source>
          <volume>73</volume>
          , no.
          <issue>8</issue>
          (
          <year>2017</year>
          ):
          <fpage>3526</fpage>
          -
          <lpage>3546</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
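          <string-name>
            <surname>Dunn-Rankin</surname>
            ,
            <given-names>Peter</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Gerald A.</given-names>
            <surname>Knezek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Susan R.</given-names>
            <surname>Wallace</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Shuqiang</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <source>Scaling methods</source>
          . Psychology Press,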
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Mavridis</surname>
            ,
            <given-names>Ilias</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>Helen</given-names>
            <surname>Karatza</surname>
          </string-name>
          .
          <article-title>"Performance evaluation of cloud-based log file analysis with Apache Hadoop and Apache Spark."</article-title>
          <source>Journal of Systems and Software</source>
          <volume>125</volume>
          (
          <year>2017</year>
          ):
          <fpage>133</fpage>
          -
          <lpage>151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
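          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>Haina</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Xinzhou</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mingqiang</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lexi</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jie</given-names>
            <surname>Gao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Chen</given-names>
            <surname>Cheng</surname>
          </string-name>
          .
          <article-title>"A survey of security and privacy in big data."</article-title>
          In
          <source>2016 16th International Symposium on Communications and Information Technologies (ISCIT)</source>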
          , pp.
          <fpage>268</fpage>
          -
          <lpage>272</lpage>
          . IEEE,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Terzi</surname>
            ,
            <given-names>Duygu Sinanc</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Ramazan</given-names>
            <surname>Terzi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Seref</given-names>
            <surname>Sagiroglu</surname>
          </string-name>
          .
          <article-title>"A survey on security and privacy issues in big data."</article-title>
          <source>In 2015 10th International Conference for Internet Technology and Secured Transactions (ICITST)</source>
          , pp.
          <fpage>202</fpage>
          -
          <lpage>207</lpage>
          . IEEE,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Sharif</surname>
            ,
            <given-names>Ather</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Sarah</given-names>
            <surname>Cooney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Shengqi</given-names>
            <surname>Gong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Drew</given-names>
            <surname>Vitek</surname>
          </string-name>
          .
          <article-title>"Current security threats and prevention measures relating to cloud services, Hadoop concurrent processing, and big data."</article-title>
          In
          <source>2015 IEEE International Conference on Big Data (Big Data)</source>
          , pp.
          <fpage>1865</fpage>
          -
          <lpage>1870</lpage>
          . IEEE,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
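          <string-name>
            <surname>Derbeko</surname>
            ,
            <given-names>Philip</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Shlomi</given-names>
            <surname>Dolev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ehud</given-names>
            <surname>Gudes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Shantanu</given-names>
            <surname>Sharma</surname>
          </string-name>
          .
          <article-title>"Security and privacy aspects in MapReduce on clouds: A survey."</article-title>
          <source>Computer Science Review</source>
          <volume>20</volume>
          (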
          <year>2016</year>
          ):
          <fpage>1</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Bertino</surname>
            ,
            <given-names>Elisa</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>Elena</given-names>
            <surname>Ferrari</surname>
          </string-name>
          .
          <article-title>"Big data security and privacy."</article-title>
          In
          <source>A comprehensive guide through the Italian database research over the last 25 years</source>
          , pp.
          <fpage>425</fpage>
          -
          <lpage>439</lpage>
          . Springer, Cham,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Dongpo</given-names>
          </string-name>
          .
          <article-title>"Big data security and privacy protection."</article-title>
          <source>In 8th International Conference on Management and Computer Science (ICMCS 2018)</source>
          , vol.
          <volume>77</volume>
          , pp.
          <fpage>275</fpage>
          -
          <lpage>278</lpage>
          . Atlantis Press,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Nelson</surname>
            ,
            <given-names>Boel</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Olovsson</surname>
          </string-name>
          .
          <article-title>"Security and privacy for big data: A systematic literature review."</article-title>
          In
          <source>2016 IEEE International Conference on Big Data (Big Data)</source>
          , pp.
          <fpage>3693</fpage>
          -
          <lpage>3702</lpage>
          . IEEE,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
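          <string-name>
            <surname>Tao</surname>
            ,
            <given-names>Hai</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Md Zakirul Alam</given-names>
            <surname>Bhuiyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Md Arafatur</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Guojun</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tian</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Md Manjur</given-names>
            <surname>Ahmed</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jing</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>"Economic perspective analysis of protecting big data security and privacy."</article-title>
          <source>Future Generation Computer Systems</source>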
          <volume>98</volume>
          (
          <year>2019</year>
          ):
          <fpage>660</fpage>
          -
          <lpage>671</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Erraissi</surname>
            ,
            <given-names>Allae</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>Mouad</given-names>
            <surname>Banane</surname>
          </string-name>
          .
          <article-title>"Managing Big Data using Model Driven Engineering: From Big Data Meta-model to Cloudera PSM meta-model."</article-title>
          <source>In 2020 International Conference on Decision Aid Sciences and Application (DASA)</source>
          , pp.
          <fpage>1235</fpage>
          -
          <lpage>1239</lpage>
          . IEEE,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <collab>Mitre Corp</collab>
          , “CVE Details”, 12 October
          <year>2021</year>
          . [Online]. Available: https://www.cvedetails.com/vendor/45/Apache.html
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Salleh</surname>
            ,
            <given-names>Khairulliza Ahmad</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>Lech</given-names>
            <surname>Janczewski</surname>
          </string-name>
          .
          <article-title>"Security considerations in big data solutions adoption: Lessons from a case study on a banking institution."</article-title>
          <source>Procedia Computer Science</source>
          <volume>164</volume>
          (
          <year>2019</year>
          ):
          <fpage>168</fpage>
          -
          <lpage>176</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
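          <string-name>
            <surname>Parmar</surname>
            ,
            <given-names>Raj R.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Sudipta</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Debnath</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Samir Kumar</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          , and
          <string-name>
            <given-names>TaiHoon</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <article-title>"Large-scale encryption in the Hadoop environment: Challenges and solutions."</article-title>
          <source>IEEE Access</source>
          <volume>5</volume>
          (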
          <year>2017</year>
          ):
          <fpage>7156</fpage>
          -
          <lpage>7163</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
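          <string-name>
            <surname>Dahbur</surname>
            ,
            <given-names>Kamal</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Bassil</given-names>
            <surname>Mohammad</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ahmad Bisher</given-names>
            <surname>Tarakji</surname>
          </string-name>
          .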
          <article-title>"A survey of risks, threats and vulnerabilities in cloud computing."</article-title>
          <source>In Proceedings of the 2011 International conference on intelligent semantic Web-services and applications</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Gavric</surname>
            ,
            <given-names>Zeljko</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>Dejan</given-names>
            <surname>Simic</surname>
          </string-name>
          .
          <article-title>"Overview of DOS attacks on wireless sensor networks and experimental results for simulation of interference attacks."</article-title>
          <source>Ingeniería e Investigación</source>
          <volume>38</volume>
          , no.
          <issue>1</issue>
          (
          <year>2018</year>
          ):
          <fpage>130</fpage>
          -
          <lpage>138</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
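          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>Shashank</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>Brij Bhooshan</given-names>
            <surname>Gupta</surname>
          </string-name>
          .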
          <article-title>"Cross-Site Scripting (XSS) attacks and defense mechanisms: classification and state-of-the-art."</article-title>
          <source>International Journal of System Assurance Engineering and Management</source>
          <volume>8</volume>
          , no.
          <issue>1</issue>
          (
          <year>2017</year>
          ):
          <fpage>512</fpage>
          -
          <lpage>530</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Millman</surname>
            ,
            <given-names>Rene</given-names>
          </string-name>
          .
          <article-title>"Thousands of Hadoop clusters still not being secured against attacks."</article-title>
          <source>SC Media</source>
          <volume>10</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mizukoshi</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Munetomo</surname>
          </string-name>
          ,
          <article-title>"Distributed denial of services attack protection system with genetic algorithms on Hadoop cluster computing framework,"</article-title>
          <source>2015 IEEE Congress on Evolutionary Computation (CEC)</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1575</fpage>
          -
          <lpage>1580</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Xianqing</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ning</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Vouk</surname>
          </string-name>
          ,
          <article-title>"Enhancing security of Hadoop in a public cloud,"</article-title>
          <source>2015 6th International Conference on Information and Communication Systems (ICICS)</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
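          <string-name>
            <surname>Fu</surname>
            ,
            <given-names>Xiao</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Yun</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bin</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiaojiang</given-names>
            <surname>Du</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Mohsen</given-names>
            <surname>Guizani</surname>
          </string-name>
          .
          <article-title>"Security threats to Hadoop: data leakage attacks and investigation."</article-title>
          <source>IEEE Network</source>
          <volume>31</volume>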
          , no.
          <issue>2</issue>
          (
          <year>2017</year>
          ):
          <fpage>67</fpage>
          -
          <lpage>71</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Jose</surname>
            ,
            <given-names>Ancy Sherin</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Binu</surname>
          </string-name>
          .
          <article-title>"Automatic detection and rectification of dns reflection amplification attacks with hadoop mapreduce and chukwa."</article-title>
          <source>In 2014 Fourth International Conference on Advances in Computing and Communications</source>
          , pp.
          <fpage>195</fpage>
          -
          <lpage>198</lpage>
          . IEEE,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>