Machine Learning Algorithms in Cyber Security

           Eljona Proko                               Alketa Hyso                        Dezdemona Gjylapi
    Computer Science Dept.                     Computer Science Dept.                 Computer Science Dept.
University “Ismail Qemali”, Vlore          University “Ismail Qemali”, Vlore      University “Ismail Qemali”, Vlore
 eljona.proko@univlora.edu.al                alketa.hyso@univlora.edu.al         dezdemona.gjylapi@univlora.edu.al


Abstract

Artificial intelligence (AI) has made incredible progress, resulting in highly
capable software and advanced autonomous machines. Meanwhile, the cyber domain
has become a battleground for access, influence, security and control. This
paper addresses key AI technologies, including machine learning, in an attempt
to clarify their role in cyber security and the implications of these new
technologies. It discusses and highlights different applications of machine
learning in cyber security.

1 Introduction

Technologies such as Big Data, Cloud Computing and Artificial Intelligence have
been mentioned again and again in multiple forums, in many cases without a clear
understanding of their significance or their application to solving real
problems effectively. AI is the creation of intelligent machines that can learn
from experience, allowing them to work and react as a human would. This
technology enables computers to be trained to process large amounts of data and
identify trends and patterns. Machine learning techniques have been applied in
many areas of science due to their unique properties such as adaptability,
scalability, and the potential to rapidly adjust to new and unknown challenges.
Cyber security is a fast-growing field demanding a great deal of attention
because of remarkable progress in social networks, cloud and web technologies,
online banking, mobile environments, smart grids, etc. Machine Learning, a
branch of AI (Figure 1), is being successfully applied to solve a small part of
these problems. Machine Learning, sometimes referred to more generally as
Artificial Intelligence (AI), is a powerful tool used by cyber-security
companies. Applied Artificial Intelligence (AI powered by Machine Learning) is
an increasingly important way in which we can scale the detection and
classification of malware.

Figure 1: Artificial intelligence branch

Diverse machine learning methods have been successfully deployed to address such
wide-ranging problems in computer security.

2 Machine Learning

Artificial Intelligence [Wan 08] is the field of science that studies the
synthesis and analysis of computational agents that act intelligently. Machine
learning is a subset of the broader field of Artificial Intelligence. The
current applications of AI are mostly restricted to Machine Learning (ML).

Machine Learning and Artificial Intelligence [Mar 18] are being applied more
broadly across industries and applications than at any other time in recent
memory as computing power, storage capacities and data collection increase.

Machine Learning teaches a machine how to answer a question or how to make a
decision on its own. It contrasts with traditional programming, which requires
giving a machine explicit instructions for it to answer specific questions. In
traditional programming, every imaginable case has to be programmed ahead of
time in order to cover all possible situations. ML draws on techniques from
statistics, mathematical optimization and data mining. ML algorithms try to make
decisions about their behavior and find ways to solve problems by inferring them
from models based on sample inputs that represent real-life scenarios.

There are multiple types of ML and each works very differently. If we generalize
the field, we can define three main categories of ML (illustrated in Figure 2):
supervised learning, unsupervised learning and reinforcement learning.

2.1 Supervised Learning

In supervised learning, the machine is trained using sample data that is labeled
to tell the machine what the data represents. Given an input variable denoted P
and an output variable denoted Q, a supervised learning algorithm learns a
mapping function f from the input to the output, Q = f(P). The goal is to
approximate the mapping function well enough that for every new input P a
predicted output Q can be produced. In other words, the learning algorithm
receives a set of inputs with their corresponding outputs, and it learns by
comparing its actual output with the correct outputs in order to find errors and
modify the learning model accordingly. Supervised learning algorithms make use
of patterns to predict the values of the label on unlabeled data. This is
achieved by classification, regression, prediction, etc.

Based on that training, the machine should be able to analyze new data and
predict the correct answer. Supervised learning has applications such as disease
diagnostics or speech recognition.

2.2 Unsupervised Learning

In unsupervised learning, the machine is trained using data that does not have
labels. Only the input data (P) is available, with no corresponding output
variables. The aim of unsupervised learning is to model the underlying structure
of the data in order to learn more about the data. Algorithms are required to
discover structure and meaning within the data in order to arrive at a
conclusion. Unlike supervised algorithms, these algorithms have no historical
labeled data from which to predict the output. That means that the machine does
not know what the data represents nor what answers are expected. The machine has
to figure out on its own the patterns and structure of the unlabeled input and
discover the expected output. The grouping of movies by genre on Netflix is an
example of unsupervised learning.

2.3 Reinforcement Learning

In reinforcement learning, the machine interacts with its environment to achieve
a certain goal. It is similar to unsupervised learning, as the machine is
trained using unlabeled data. However, in reinforcement learning, the machine
receives feedback on the outcome.

Figure 2: Three main categories of Machine Learning

3 Cyber Security

Security is becoming one of the most important topics in industrial IT and
Operational Technology (OT), i.e. the hardware and software used in the
production area. Cyber security is defined as the technologies and processes
constructed to protect computers, computer hardware, software, networks and data
from unauthorized access and from vulnerabilities exploited through the Internet
by cyber criminals, terrorist groups and hackers. Cyber security is concerned
with protecting internet- and network-based digital equipment and information
from unauthorized access and alteration. One of the most challenging
elements of cyber security is the quickly and constantly evolving nature of
security risks. In the past, the enterprise network comprised mainframes,
client-server models and closed groups of systems, and attacks were very
limited, with viruses, worms and Trojan horses being the major cyber threats.
The focus was mainly on malware such as viruses, worms and Trojans whose purpose
was to cause damage to systems. Cyber threats randomly targeted computers
directly connected to the Internet.

Artificial Intelligence methods are robust and more flexible; as a result, they
improve security performance and provide better defense against an increasing
number of advanced cyber threats.

Different AI techniques can be used in cyber security, such as intelligent
agents, neural nets, expert systems, data mining, machine learning and deep
learning.

4 Machine Learning in Cyber Security

Machine learning is an effective tool that can be employed in many areas of
information security. There exist some robust anti-phishing algorithms and
network intrusion detection systems. Machine learning [Jor 15] can be
successfully used for developing authentication systems, evaluating protocol
implementations, assessing the security of human interaction proofs, smart meter
data profiling, etc.

Figure 3: Cyber Security

Machine learning [Kan 17] has presented a significant opportunity to the cyber
security industry. New machine learning methods can vastly improve the accuracy
of threat detection and enhance network visibility thanks to the greater amount
of computational analysis they can handle. They are also heralding a new era of
autonomous response, where a machine system is sufficiently intelligent to
understand how and when to fight back against in-progress threats.

Different machine learning methods have been successfully deployed to address
wide-ranging problems in computer security. We discuss three areas where most
cyber ML algorithms are finding application: spam detection, malware analysis
and intrusion detection.

4.1 Spam and phishing detection

Spam and phishing detection includes a large set of techniques aimed at reducing
the waste of time and potential hazard caused by unwanted emails. Nowadays,
unsolicited emails, and phishing emails in particular, represent the preferred
way through which an attacker establishes a first foothold within an enterprise
network. Phishing emails include malware or links to compromised websites. Spam
and phishing detection is increasingly difficult because of the advanced evasion
strategies used by attackers to bypass traditional filters. ML approaches can
improve the spam detection process.

Spam filtering based on the textual content of email messages can be seen as a
special case of text categorization, with the categories being spam and
non-spam. Today the most successful spam filters are based upon the statistical
foundations of Machine Learning. Machine Learning based spam filters [Bla 08]
also retrain themselves while in use and minimize manual effort while delivering
superior filtering accuracy.

Although the task of text categorization has been researched extensively, its
particular application to email data, and to the detection of spam specifically,
is relatively recent. Some initial research studies focused primarily on the
problem of filtering spam, applying Naïve Bayes (NB) to the problem of building
a personal spam filter. Naïve Bayes is a classic machine learning algorithm that
uses all available features to estimate whether an item, such as a file or a
message, is malicious or not, and it is widely used for the purpose
of classification. NB was advocated due to its previously demonstrated
robustness in the text-classification domain and due to its ability to be easily
implemented in a cost-sensitive decision framework. Although high performance
levels were achieved using word features only, it was observed that by
additionally incorporating non-textual features and some domain knowledge, the
filtering performance could be improved significantly.

Phishing is aimed at stealing sensitive personal information. Researchers
[Cha 06] have identified three principal groups of anti-phishing methods:
detective (monitoring, content filtering, anti-spam), preventive
(authentication, patch and change management), and corrective (site takedown,
forensics).

4.1.1 E-mail Spam Filtering

Automatic e-mail classification uses statistical approaches or machine learning
techniques and aims at building a model or a classifier specifically for the
task of filtering spam from a user's mail stream. Building the model or
classifier requires a set of pre-classified messages. The process of building
the model is called training. Machine learning algorithms have achieved more
success than any of the earlier techniques employed in the task of spam
filtering. In fact, the success of Gmail can be ascribed to its timely
transition to, and successful use of, Machine Learning for filtering not just
incoming spam but also other abuses such as Denial-of-Service (DoS), virus
delivery, and other imaginative attacks.

4.2 Malware detection

Malware detection is an extremely relevant problem because modern malware can
automatically generate novel variants that have the same malicious effects but
appear as completely different executable files. These polymorphic and
metamorphic features defeat traditional rule-based malware identification
approaches. Malware can be divided into several classes depending on its
purpose: virus, worm, Trojan, adware, spyware, rootkit, backdoor, keylogger,
ransomware and Remote Administration Tools.

ML techniques can be used to analyze malware variants and attribute them to the
correct malware family.

4.3 Intrusion Detection

An Intrusion Detection System (IDS) is a defense measure that supervises the
activities of the computer network and reports malicious activities to the
network administrator. Intruders make many attempts to gain access to the
network and try to harm the organization’s data. Security is therefore a
critical aspect for any type of organization.

Intrusion detection aims to discover illicit activities within a computer or a
network through Intrusion Detection Systems (IDS). Network IDS are widely
deployed in modern enterprise networks. These systems were traditionally based
on patterns of known attacks, but modern deployments include other approaches
for anomaly detection, threat detection [Tor 16] and classification based on
machine learning. Within the broader intrusion detection area, two specific
problems are relevant to our analysis: the detection of botnets and of Domain
Generation Algorithms (DGA). A botnet is a network of infected machines
controlled by attackers and misused to conduct multiple illicit activities.
Botnet detection aims to identify communications between infected machines
within the monitored network and the external command-and-control servers.
Despite many research proposals and commercial tools that address this threat,
several botnets still exist. DGAs automatically generate domain names, and are
often used by an infected machine to communicate with external server(s) by
periodically generating new hostnames. They represent a real threat for
organizations because DGAs, which can rely on language processing techniques,
make it possible to evade defenses based on static blacklists of domain names.

Network Intrusion Detection (NID) systems are used to identify malicious network
activity leading to confidentiality, integrity, or availability violations of
the systems in a network. Many intrusion detection systems are specifically
based on machine learning techniques [Kha 10] due to their adaptability to new
and unknown attacks.

Although machine learning helps to keep various systems safe, machine learning
classifiers themselves are vulnerable to malicious attacks. Some work has been
directed at improving the effectiveness of machine learning
algorithms and protecting them from diverse attacks.

5 Conclusions

Machine learning approaches are increasingly employed for multiple applications
and are also being adopted for cyber security; hence it is important to evaluate
when and which category of algorithms can achieve adequate results. We analyze
these techniques for three relevant cyber security problems: intrusion
detection, malware analysis and spam detection. Machine learning as a technology
has spread rapidly across the whole cyber security space. These decision-making
algorithms are known to solve several problems. There are many opportunities in
information security to apply machine learning to address various challenges in
such a complex domain. Spam detection, virus detection, and surveillance camera
robbery detection are only some examples. Machine learning techniques have been
applied in many areas of science due to their unique properties such as
adaptability, scalability, and the potential to rapidly adjust to new and
unknown challenges.

References

[Wan 08] X. B. Wang, G. Y. Yang, Y. C. Li, and D. Liu, “Review on the
         application of Artificial Intelligence in Antivirus Detection
         System,” IEEE Conference on Cybernetics and Intelligent Systems,
         pp. 506-509, 2008.
[Mar 18] R. Marty, AI and Machine Learning in Cyber Security - Towards Data
         Science. March 16, 2018, from
         https://towardsdatascience.com/ai-and-machine-learning-in-cyber-security
         Applications of Artificial Intelligence (AI) to Network Security
[Kan 17] E. Kanal, Machine Learning in Cybersecurity. Carnegie Mellon
         University, Software Engineering Institute, January 2017.
         March 9, 2018
[Tyu 07] E. Tyugu, Algorithms and Architectures of Artificial Intelligence.
         IOS Press, 2007.
[Cha 06] M. Chandrasekaran, K. Narayanan, and S. Upadhyaya, “Phishing Security
         Conference, 2006
[Jor 15] M. I. Jordan and T. M. Mitchell, “Machine learning: Trends,
         perspectives, and prospects,” Science, 2015.
[Bla 08] E. Blanzieri and A. Bryl, “A survey of learning-based techniques of
         email spam filtering,” Artificial Intelligence Review, 2008.
[Jav 16] A. Javaid, Q. Niyaz, W. Sun, and M. Alam, “A deep learning approach
         for network intrusion detection system,” in EAI International
         Conference on Bio-inspired Information and Communications
         Technologies (formerly BIONETICS), 2016.
[Tzo 07] G. Tzortzis and A. Likas, “Deep belief networks for spam filtering,”
         in IEEE International Conference on Tools with Artificial
         Intelligence (ICTAI), 2007.
[Kha 10] A. Khan, B. Baharudin, L. H. Lee, and K. Khan, “A review of machine
         learning algorithms for text-documents classification,” Journal of
         Advances in Information Technology, 2010.
[Tor 16] P. Torres, C. Catania, S. Garcia, and C. G. Garino, “An analysis of
         Recurrent Neural Networks for Botnet detection behavior,” in IEEE
         Biennial Congress of Argentina (ARGENCON), 2016.
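
As an illustration of the Naïve Bayes spam filtering discussed in Section 4.1,
the following is a minimal sketch in Python of a word-count (multinomial) Naïve
Bayes classifier with Laplace smoothing. It is not the method of any cited
work: the class name NaiveBayesSpamFilter, the helper tokenize, the labels
"spam" and "ham", and the four training messages are all assumptions made up
for this example, and a real filter would need far more data and the
non-textual features mentioned in Section 4.1.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a message and split it into whitespace-separated words."""
    return text.lower().split()


class NaiveBayesSpamFilter:
    """Hypothetical multinomial Naive Bayes over word counts (Laplace smoothing)."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.message_counts = {label: 0 for label in self.LABELS}
        self.vocabulary = set()

    def train(self, messages):
        """Training: count words per class from (text, label) pairs."""
        for text, label in messages:
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.message_counts[label] += 1
            self.vocabulary.update(tokens)

    def log_score(self, text, label):
        """Log prior of the class plus the log likelihood of each word.

        Assumes at least one training message was seen for every class.
        """
        total_messages = sum(self.message_counts.values())
        score = math.log(self.message_counts[label] / total_messages)
        total_words = sum(self.word_counts[label].values())
        vocab_size = len(self.vocabulary)
        for token in tokenize(text):
            count = self.word_counts[label][token]
            # Add-one smoothing keeps unseen words from zeroing the product.
            score += math.log((count + 1) / (total_words + vocab_size))
        return score

    def classify(self, text):
        """Return the label with the higher score (ties go to "spam")."""
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


# Toy, hand-written training set: illustrative only, not real data.
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached for review", "ham"),
]

spam_filter = NaiveBayesSpamFilter()
spam_filter.train(TRAINING)
```

Calling spam_filter.classify("free prize money") selects the class whose words
best explain the message. Working in log space avoids numeric underflow when
many small word probabilities are multiplied together.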