Machine Learning Algorithms in Cyber Security

Eljona Proko, Computer Science Dept., University “Ismail Qemali”, Vlore, eljona.proko@univlora.edu.al
Alketa Hyso, Computer Science Dept., University “Ismail Qemali”, Vlore, alketa.hyso@univlora.edu.al
Dezdemona Gjylapi, Computer Science Dept., University “Ismail Qemali”, Vlore, dezdemona.gjylapi@univlora.edu.al

Abstract

Artificial intelligence (AI) has made incredible progress, resulting in highly capable software and advanced autonomous machines. Meanwhile, the cyber domain has become a battleground for access, influence, security and control. This paper addresses key AI technologies, including machine learning, to help in understanding their role in cyber security and the implications of these new technologies. It discusses and highlights different applications of machine learning in cyber security.

1 Introduction

Terms such as Big Data, Cloud Computing and Artificial Intelligence have been repeated again and again in multiple forums, in many cases without a clear understanding of their significance or of their application to solving real problems effectively. AI is the creation of intelligent machines that can learn from experience, allowing them to work and react as a human would. This technology enables computers to be trained to process large amounts of data and identify trends and patterns. Machine learning techniques have been applied in many areas of science due to their unique properties like adaptability, scalability, and potential to rapidly adjust to new and unknown challenges. Cyber security is a fast-growing field demanding a great deal of attention because of remarkable progress in social networks, cloud and web technologies, online banking, mobile environments, smart grids, etc. Machine Learning, a branch of AI (Figure 1), is being successfully applied to solve a small part of the problems. Machine Learning, sometimes referred to more generally as Artificial Intelligence (AI), is a powerful tool used by cyber security companies. Applied Artificial Intelligence (AI powered by Machine Learning) is an increasingly important way in which we can scale the detection and classification of malware.

Figure 1: Artificial intelligence branch

Diverse machine learning methods have been successfully deployed to address such wide-ranging problems in computer security.

2 Machine Learning

Artificial Intelligence [Wan 08] is the field of science that studies the synthesis and analysis of computational agents that act intelligently. Machine learning is a subset of the broader field of Artificial Intelligence. The current applications of AI are mostly restricted to Machine Learning (ML).

Machine Learning and Artificial Intelligence [Mar 18] are being applied more extensively across industries and applications than at any other time in recent memory, as computing power, storage capacities and data collection increase.

Machine Learning teaches a machine how to answer a question or how to make a decision on its own. It contrasts with traditional programming, which requires giving a machine explicit instructions for it to answer specific questions. In traditional programming, every imaginable case has to be programmed ahead of time in order to cover all possible situations. ML may encompass techniques such as statistics, mathematical optimization, or data mining. ML algorithms try to make decisions about their behavior and find ways to solve problems by inferring them from models based on sample inputs that represent real-life scenarios.

There are multiple types of ML and each works very differently. If we generalize the field, we can define three main categories of ML (illustrated in Figure 2): supervised learning, unsupervised learning and reinforcement learning.

2.1 Supervised Learning

In supervised learning, the machine is trained using sample data that is labeled to tell the machine what the data represents. Given an input variable denoted as P and an output variable denoted as Q, a supervised learning algorithm learns a mapping function (f) from the input to the output. The goal is to estimate the mapping function well enough that, for every new input (P), a predicted output (Q) is produced. In other words, the learning algorithm receives a set of inputs with their corresponding outputs, and it learns by comparing its actual output with the correct outputs in order to find errors and modify the learning model accordingly. Supervised learning algorithms make use of patterns to predict the values of the label on unlabelled data. This is achieved by classification, regression, prediction, etc. Based on that training, the machine should be able to analyze new data and predict the correct answer. Supervised learning has applications such as disease diagnostics or speech recognition.

2.2 Unsupervised Learning

In unsupervised learning, the machine is trained using data that doesn't have labels. Only the input data (P) is available, with no corresponding output variables. The aim of unsupervised learning is to model the underlying structure of the data in order to learn more about the data. Algorithms are required to discover structure, inferences and meaning within the data in order to arrive at a conclusion. Unlike supervised algorithms, these algorithms do not have any type of historical data from which to predict the output. That means that the machine does not know what the data represents nor what answers are expected. The machine will have to figure out on its own the patterns and structure of the unlabeled input and discover the expected output. The classification of movie genres in Netflix is an example of unsupervised learning.

2.3 Reinforcement Learning

In reinforcement learning, the machine interacts with its environment to achieve a certain goal. It is similar to unsupervised learning, as the machine is trained using unlabeled data. However, in reinforcement learning, the machine receives feedback on the outcome.

Figure 2: Three main categories of Machine Learning

3 Cyber Security

Security is becoming one of the most important topics in industrial IT and Operational Technology (OT), i.e. the hardware and software used in the production area. Cyber security is defined as technologies and processes constructed to protect computers, computer hardware, software, networks and data from unauthorized access and from vulnerabilities exploited through the Internet by cyber criminals, terrorist groups and hackers. Cyber security is related to protecting your Internet and network based digital equipment and information from unauthorized access and alteration. One of the most challenging elements of cyber security is the quickly and constantly evolving nature of security risks. In the past, the enterprise network comprised mainframes, the client-server model and closed groups of systems, and attacks were very limited, with viruses, worms and Trojan horses being the major cyber threats. The focus was more on malware such as viruses, worms and Trojans with the purpose of causing damage to the systems. Cyber threats randomly targeted computers directly connected to the Internet.

Artificial Intelligence methods are robust and more flexible; as a result, they improve security performance and provide better defense against an increasing number of advanced cyber threats.

Different AI techniques can be used in cyber security, such as intelligent agents, neural nets, expert systems, data mining, machine learning and deep learning.

4 Machine Learning in Cyber Security

Machine learning is an effective tool that can be employed in many areas of information security. There exist some robust anti-phishing algorithms and network intrusion detection systems. Machine learning [Jor 15] can be successfully used for developing authentication systems, evaluating protocol implementations, assessing the security of human interaction proofs, smart meter data profiling, etc.

Machine learning [Kan 17] has presented a significant opportunity to the cyber security industry. New machine learning methods can vastly improve the accuracy of threat detection and enhance network visibility thanks to the greater amount of computational analysis they can handle. They are also heralding a new era of autonomous response, where a machine system is sufficiently intelligent to understand how and when to fight back against in-progress threats.

Different machine learning methods have been successfully deployed to address wide-ranging problems in computer security. We discuss three areas where most ML algorithms for cyber security are finding application: spam detection, malware analysis and intrusion detection.

4.1 Spam and phishing detection

Spam and phishing detection includes a large set of techniques aimed at reducing the waste of time and potential hazard caused by unwanted emails. Nowadays, unsolicited emails, and phishing in particular, represent the preferred way through which an attacker establishes a first foothold within an enterprise network. Phishing emails include malware or links to compromised websites. Spam and phishing detection is increasingly difficult because of the advanced evasion strategies used by attackers to bypass traditional filters. ML approaches can improve the spam detection process.

Spam filtering based on the textual content of email messages can be seen as a special case of text categorization, with the categories being spam and non-spam. Today the most successful spam filters are based upon the statistical foundations of Machine Learning.
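Viewing spam filtering as text categorization with two categories can be illustrated with a small statistical classifier. The following is a minimal sketch, not the method of any particular filter. It assumes a multinomial Naive Bayes model over whitespace-separated words with add-one smoothing. The class name NaiveBayesSpamFilter, the label "ham" for non-spam, the 0.5 threshold and the four training messages are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a message and split it into word tokens."""
    return text.lower().split()


class NaiveBayesSpamFilter:
    """Illustrative multinomial Naive Bayes over word counts.

    Assumes at least one training message per class before prediction.
    """

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}
        self.vocabulary = set()

    def train(self, text, label):
        """Update the counts with one pre-classified message."""
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def _log_score(self, words, label):
        """Return log P(label) + sum of log P(word | label)."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        # Add-one (Laplace) smoothing so unseen words do not zero the product.
        denom = sum(self.word_counts[label].values()) + len(self.vocabulary)
        for word in words:
            score += math.log((self.word_counts[label][word] + 1) / denom)
        return score

    def spam_probability(self, text):
        """Posterior probability that the message is spam."""
        words = tokenize(text)
        log_spam = self._log_score(words, "spam")
        log_ham = self._log_score(words, "ham")
        # Subtract the maximum before exponentiating for numerical stability.
        top = max(log_spam, log_ham)
        p_spam = math.exp(log_spam - top)
        p_ham = math.exp(log_ham - top)
        return p_spam / (p_spam + p_ham)

    def is_spam(self, text, threshold=0.5):
        """Flag the message when the spam posterior exceeds the threshold."""
        return self.spam_probability(text) > threshold


# Hypothetical pre-classified messages used only for this illustration.
training = [
    ("win money now claim your free prize", "spam"),
    ("free offer click now to win", "spam"),
    ("meeting agenda for monday attached", "ham"),
    ("please review the project report before the meeting", "ham"),
]

spam_filter = NaiveBayesSpamFilter()
for text, label in training:
    spam_filter.train(text, label)

print(spam_filter.is_spam("claim your free prize now"))   # True
print(spam_filter.is_spam("project meeting on monday"))   # False
```

Because train only updates word counts, a filter of this kind can keep learning from newly labeled messages while it is in use. Raising the threshold makes it more conservative about flagging legitimate mail, at the price of letting more spam through.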
Machine Learning based spam filters [Bla 08] also retrain themselves while in use, minimizing manual effort while delivering superior filtering accuracy. Although the task of text categorization has been researched extensively, its particular application to email data, and to the detection of spam specifically, is relatively recent. Some initial research studies applied Naïve Bayes (NB) to the problem of building a personal spam filter. Naïve Bayes is a classic machine learning algorithm that uses all the available features to estimate whether an item, such as a message or a file, is malicious or not, and uses that estimate for classification. NB was advocated due to its previously demonstrated robustness in the text-classification domain and due to its ability to be easily implemented in a cost-sensitive decision framework. Although high performance levels were achieved using word features only, it was observed that by additionally incorporating non-textual features and some domain knowledge, the filtering performance could be improved significantly.

Figure 3: Cyber Security

Phishing is aimed at stealing personal sensitive information. Researchers [Cha 06] have identified three principal groups of anti-phishing methods: detective (monitoring, content filtering, anti-spam), preventive (authentication, patch and change management), and corrective (site takedown, forensics).

4.1.1 E-mail Spam Filtering

Automatic e-mail classification uses statistical approaches or machine learning techniques and aims at building a model or a classifier specifically for the task of filtering spam from a user's mail stream. Building the model or classifier requires a set of pre-classified messages. The process of building the model is called training. Machine learning algorithms have achieved more success than all previous techniques employed in the task of spam filtering. In fact, the success of Gmail can be ascribed to its timely transition to, and successful use of, Machine Learning for filtering not just incoming spam but also other abuses like Denial-of-Service (DoS), virus delivery, and other imaginative attacks.

4.2 Malware detection

Malware detection is an extremely relevant problem because modern malware can automatically generate novel variants with the same malicious effects but appearing as completely different executable files. These polymorphic and metamorphic features defeat traditional rule-based malware identification approaches. Malware can be divided into several classes depending on its purpose: virus, worm, Trojan, adware, spyware, rootkit, backdoor, keylogger, ransomware and Remote Administration Tools. ML techniques can be used to analyze malware variants and attribute them to the correct malware family.

4.3 Intrusion Detection

An Intrusion Detection System (IDS) is a defense measure that supervises the activities of the computer network and reports malicious activities to the network administrator. Intruders make many attempts to gain access to the network and try to harm the organization's data. Thus security is the most important aspect for any type of organization.

Intrusion detection aims to discover illicit activities within a computer or a network through Intrusion Detection Systems (IDS). Network IDS are widely deployed in modern enterprise networks. These systems were traditionally based on patterns of known attacks, but modern deployments include other approaches for anomaly detection, threat detection [Tor 16] and classification based on machine learning. Within the broader intrusion detection area, two specific problems are relevant to our analysis: the detection of botnets and of Domain Generation Algorithms (DGA). A botnet is a network of infected machines controlled by attackers and misused to conduct multiple illicit activities. Botnet detection aims to identify communications between infected machines within the monitored network and the external command-and-control servers. Despite many research proposals and commercial tools that address this threat, several botnets still exist. DGAs automatically generate domain names and are often used by an infected machine to communicate with external servers by periodically generating new hostnames. They represent a real threat for organizations because, through DGA, which relies on language processing techniques, it is possible to evade defenses based on static blacklists of domain names.

Network Intrusion Detection (NID) systems are used to identify malicious network activity leading to violations of the confidentiality, integrity, or availability of the systems in a network. Many intrusion detection systems are specifically based on machine learning techniques [Kha 10] due to their adaptability to new and unknown attacks.

Although machine learning facilitates keeping various systems safe, the machine learning classifiers themselves are vulnerable to malicious attacks. There has been some work directed at improving the effectiveness of machine learning algorithms and protecting them from diverse attacks.

5 Conclusions

Machine learning approaches are increasingly employed for multiple applications and are also being adopted for cyber security, hence it is important to evaluate when and which category of algorithms can achieve adequate results. We analyze these techniques for three relevant cyber security problems: intrusion detection, malware analysis and spam detection. Machine learning as a technology has expanded rapidly across the whole cyber implementation space. These decision-making algorithms are known to solve several problems. There are many opportunities in information security to apply machine learning to address various challenges in such a complex domain. Spam detection, virus detection, and surveillance camera robbery detection are only some examples. Machine learning techniques have been applied in many areas of science due to their unique properties like adaptability, scalability, and potential to rapidly adjust to new and unknown challenges.

References

[Wan 08] X. B. Wang, G. Y. Yang, Y. C. Li, and D. Liu, “Review on the application of Artificial Intelligence in Antivirus Detection System,” IEEE Conference on Cybernetics and Intelligent Systems, pp. 506-509, 2008.
[Mar 18] R. Marty, “AI and Machine Learning in Cyber Security,” Towards Data Science. March 16, 2018, from https://towardsdatascience.com/ai-and-machine-learning-in-cyber-security
Applications of Artificial Intelligence (AI) to Network Security
[Kan 17] E. Kanal, “Machine Learning in Cybersecurity,” Carnegie Mellon University, Software Engineering Institute, January 2017. March 9, 2018.
[Tyu 07] E. Tyugu, Algorithms and Architectures of Artificial Intelligence. IOS Press, 2007.
[Cha 06] M. Chandrasekaran, K. Narayanan, and S. Upadhyaya, “Phishing Security Conference, 2006
[Jor 15] M. I. Jordan and T. M. Mitchell, “Machine learning: Trends, perspectives, and prospects,” Science, 2015.
[Bla 08] E. Blanzieri and A. Bryl, “A survey of learning-based techniques of email spam filtering,” Artificial Intelligence Review, 2008.
[Jav 16] A. Javaid, Q. Niyaz, W. Sun, and M. Alam, “A deep learning approach for network intrusion detection system,” in EAI International Conference on Bio-inspired Information and Communications Technologies (formerly BIONETICS), 2016.
[Tzo 07] G. Tzortzis and A. Likas, “Deep belief networks for spam filtering,” in IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 2007.
[Kha 10] A. Khan, B. Baharudin, L. H. Lee, and K. Khan, “A review of machine learning algorithms for text-documents classification,” Journal of Advances in Information Technology, 2010.
[Tor 16] P. Torres, C. Catania, S. Garcia, and C. G. Garino, “An analysis of Recurrent Neural Networks for Botnet detection behavior,” in IEEE Biennial Congress of Argentina (ARGENCON), 2016.
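As a supplementary illustration of the supervised setting described in Section 2.1, the following sketch learns a mapping function f from inputs P to outputs Q by comparing its predictions with the correct labels and adjusting the model on each error. It assumes a perceptron as the learner, which the paper does not prescribe. The two numeric features per sample, the toy data and the function names are hypothetical.

```python
def predict(weights, bias, p):
    """Mapping function f: return the predicted label Q (0 or 1) for input P."""
    activation = bias + sum(w * x for w, x in zip(weights, p))
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=0.1):
    """Learn f from labeled (P, Q) pairs by correcting prediction errors."""
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for p, q in samples:
            # Compare the model's output with the correct output.
            error = q - predict(weights, bias, p)
            if error != 0:
                # Modify the model in the direction that reduces the error.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, p)]
                bias += learning_rate * error
    return weights, bias


# Hypothetical labeled data: each P holds two numeric features of an event,
# and Q is 1 for "suspicious" or 0 for "normal".
samples = [
    ([1, 1], 0), ([2, 1], 0), ([1, 2], 0),
    ([6, 5], 1), ([7, 7], 1), ([5, 6], 1),
]

weights, bias = train_perceptron(samples)

# Apply the learned mapping to inputs that were not in the training data.
print(predict(weights, bias, [8, 6]))
print(predict(weights, bias, [0, 1]))
```

The toy data is linearly separable, so the error-driven updates stop once every training pair is classified correctly. Real security data rarely has this property, and a different learner would usually be needed.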