Machine Learning Based Phishing E-mail Detection
Security-CEN@Amrita

Nidhin A Unnithan, Harikrishnan NB, Akarsh S, Vinayakumar R, Soman KP
Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore
Amrita Vishwa Vidyapeetham, India
nidhinkittu5470@gmail.com

Copyright © by the paper's authors. Copying permitted for private and academic purposes.
In: R. Verma, A. Das (eds.): Proceedings of the 1st AntiPhishing Shared Pilot at 4th ACM International Workshop on Security and Privacy Analytics (IWSPA 2018), Tempe, Arizona, USA, 21-03-2018, published at http://ceur-ws.org

Abstract

Phishing email is a significant threat in today's world, and the rate at which phishing emails are generated is increasing tremendously day by day. It is high time to deploy a self-learning system that detects and prevents phishing email efficiently and within a bounded time. This work proposes a system that uses the term-document matrix as its feature engineering mechanism and classical machine learning techniques to distinguish phishing emails from legitimate ones. The system also incorporates domain knowledge and lexical features as part of the feature engineering mechanism. The efficiency of the system is compared using different classical machine learning techniques. Based on accuracy, we propose the best model that solves the formulated problem efficiently.

1 Introduction

Email plays an important part in everybody's life. It is one of the easiest and most effective means of transferring messages and files. Even though there are many modes of communication, the popularity of e-mail has not diminished, as it is considered one of the safest and fastest ways to transfer messages over networks and is an inexpensive method of communication.

E-mail usage has increased tremendously compared to previous decades. In 2017 there were nearly 4.8 billion persons using email, and it is estimated that by 2021 this will increase to 5.6 billion users, as email is considered the main medium for transferring messages over other apps. The main problem with email is the presence of phishing mails. These phishing mails are unwanted mails which may carry malware, fraud schemes, advertisements, etc. In comparison to previous years, phishing mails have increased and have caused serious damage to businesses, corporations, individuals and economies. Detecting fraud/phishing emails precisely is essential; extracting and analyzing these mails can reveal complex and interesting patterns, from which appropriate decisions can be made within a company to block phishing mails. During the early stages of communication via email, clear rules were followed. Nowadays, due to the diversity of email services such as Microsoft Outlook, Mozilla Thunderbird and Google's Gmail, mails are grouped into conversations, and clients attempt to hide the quoted parts in order to improve readability.

One type of spam mail that is hazardous to users is the phishing mail. A phishing mail is one that disguises itself as a legitimate mail but, once opened, can steal our data without our knowledge. Thus identifying phishing mails among spam mails is very important. One way to protect our data from phishing mail is to add a secondary password to the login credentials. Another way is to alert the user once a phishing mail tries to steal data.

In [SAZ+15], Sami Smadi et al. proposed a model for detecting phishing emails that relies on a preprocessing technique which extracts different parts of an email as features. The extracted features are fed into a J48 classification algorithm to perform classification. In [SZL+15], the authors considered meaningless tokens and new pages as the feature set, and selected the features with better predictability from the initial feature set.
They provide an O(1)-complexity evaluation method for each feature set to evaluate its predictive ability. In [KK15], Sukhjeel Kaui et al. used a genetic algorithm for the detection of phishing webpages, and used a filter function to categorize pages. Lv Fang et al. [FBJ+15] propose a solution to overcome the time lag in detecting phishing websites: they detect phishing websites by analyzing peculiarities in their WHOIS and URL information. In [VSP18b, VSP18a], deep learning methods were employed to detect malicious URLs and domains. Binay Kumar et al. used HTML contents for detecting email phishing in [KKMK15]. Rachna Dhamija et al. in [TC09] mainly concentrated on which phishing activities work during an attack and why, using a large data set of reported phishing activities. Fergus Toolan et al. took a different approach: they used only five features for classification, with a C5.0 algorithm that has higher precision compared to other algorithms. Mayank Pandey et al. in [PR12] used different types of classification methods such as Multilayer Perceptron (MLP), Decision Trees (DT), Support Vector Machine (SVM), Group Method of Data Handling (GMDH), Probabilistic Neural Net (PNN), Genetic Programming (GP) and Logistic Regression (LR). Lew May Form et al. in [FCT+15] proposed a method which uses hybrid features for detecting phishing emails. The features are called hybrid because they combine URL-based, behavior-based and content-based features. They achieved an overall accuracy of 97.25% with an error percentage of 2.75%.

Even though there are different ways to detect phishing, [DAY+15] gives an overall evaluation of different classifiers used for phishing detection. Recently, count-based representations combined with domain-level features and integrated with machine learning techniques have been used for classifying phishing and legitimate mails [EDB+18, BMS08]. The proposed methodology uses a feature engineering approach combined with deep learning, which is one of the significant directions in which the world is moving, because it has performed well in most text classification tasks [LBH15] and even in phishing detection [LNRW, EC].

The rest of the sections are organized as follows. Section 2 discusses the background details of email representation and the machine learning algorithms. Section 3 includes the description of the data set, experiments and proposed architecture. Section 4 includes results. The conclusion is placed in Section 5.

2 Background

This section discusses the mathematical details of various traditional machine learning algorithms and details of vector space modeling techniques such as TF-IDF and Bag of words.

2.1 Logistic Regression

This is a classification algorithm used to separate data into different classes. It can be binary, ordinal or multinomial. In binary Logistic Regression the outcome is classified as 0 or 1, whereas in multinomial Logistic Regression the outcome can fall into more than two classes. The activation function used is the sigmoid function, whose mathematical representation is as follows:

$$\sigma(x) = \frac{1}{1 + \exp(-w^{T} x)} \tag{1}$$

2.2 Naive Bayes

Naive Bayes is a set of supervised learning algorithms that work on the principle of Bayes' theorem. This theorem works on conditional probability, by which the probability of events is calculated. Binary and multi-class classification are done using different algorithms such as GaussianNB, MultinomialNB and BernoulliNB [MN+98]. For this problem we used MultinomialNB from scikit-learn as our algorithm.

2.3 Support Vector Machine

SVM is a supervised classification algorithm which builds the model by classifying the data into two classes. The SVM is defined according to the number of classes. There are two types: linear SVM and non-linear SVM. The decision boundary for a linear SVM is formulated as a hyperplane in feature space, i.e. a linear function of the features. Non-linear SVMs result in non-linear decision boundaries in the original feature space. From the different types of kernels available, we used the radial basis function (RBF) for our SVM model.
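As a minimal sketch of the formulas behind Sections 2.1 and 2.3, the snippet below evaluates the sigmoid of Eq. (1) for a weight vector w and a feature vector x, and the RBF kernel in its common form exp(-gamma * ||x - z||^2). The weights, the inputs, the 0.5 decision threshold and the value of gamma are illustrative assumptions, not values from the paper.

```python
import math

def sigmoid_score(w, x):
    """Eq. (1): sigma(x) = 1 / (1 + exp(-w^T x))."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def predict_binary(w, x, threshold=0.5):
    """Binary logistic decision: 1 if the score reaches the (assumed) threshold, else 0."""
    return 1 if sigmoid_score(w, x) >= threshold else 0

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel exp(-gamma * ||x - z||^2); gamma is an assumed value."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Hypothetical weights and feature vectors, chosen only for illustration.
w = [2.0, -1.0]
x_a = [1.0, 0.5]   # w^T x = 1.5
x_b = [0.0, 1.0]   # w^T x = -1.0

print(sigmoid_score(w, x_a), predict_binary(w, x_a))
print(sigmoid_score(w, x_b), predict_binary(w, x_b))
print(rbf_kernel(x_a, x_a), rbf_kernel(x_a, x_b))
```

If scikit-learn is used for all three classifiers (the paper states this only for Naive Bayes), the corresponding estimators would be `LogisticRegression`, `MultinomialNB` and `SVC(kernel="rbf")`.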
2.4 TF-IDF

TF-IDF stands for term frequency-inverse document frequency. Its weight is a statistical measure of how important a word is to a document, which can in turn be used for information retrieval and text mining. Term frequency gives an idea of how frequently a term occurs in a document. It is defined as

$$tf(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} \tag{2}$$

where $f_{t,d}$ is the number of times term $t$ occurs in document $d$.

Inverse document frequency gives an idea of how important a term is. When we compute term frequency, all terms are given equal importance whether they are stop words or terminology words. Thus we need to weigh up terminology words, which are less frequent than stop words in a document, by computing the inverse document frequency:

$$idf(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} \tag{3}$$

where $N$ is the total number of documents in the corpus. TF-IDF can now be calculated as

$$tfidf(t, d, D) = tf(t, d) \cdot idf(t, D) \tag{4}$$

Additionally, the domain-level features are added. These include a list of the most commonly appearing words and a list of special characters.
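Equations (2) to (4) can be checked on a small example. The sketch below implements them directly in plain Python; the three-document corpus is hypothetical, and the natural logarithm is an assumption, since the paper does not state the base of the log.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Eq. (2): occurrences of the term divided by the total token count of the document."""
    counts = Counter(doc_tokens)
    return counts[term] / len(doc_tokens)

def idf(term, corpus):
    """Eq. (3): log(N / number of documents containing the term).

    Assumes the term occurs in at least one document of the corpus.
    """
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tfidf(term, doc_tokens, corpus):
    """Eq. (4): product of term frequency and inverse document frequency."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical corpus, already tokenized and lower-cased.
corpus = [
    ["verify", "your", "password", "now"],
    ["meeting", "notes", "for", "your", "team"],
    ["your", "invoice", "is", "attached"],
]

print(tfidf("password", corpus[0], corpus))  # term found in one document only
print(tfidf("your", corpus[0], corpus))      # term found in every document, so idf = log(1) = 0
```

A word that appears in every document receives zero weight, which is the behavior the text describes for stop words. Library implementations may differ from these formulas; scikit-learn's `TfidfVectorizer`, for instance, applies idf smoothing and L2 normalization by default.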
3 Experiments

3.1 Dataset details

Email phishing detection is a task in the anti-phishing shared task at the 4th ACM International Workshop on Security and Privacy Analytics [EDMB+18]. Let E = [e1, e2, ..., en] be a set of emails and C = [c1, c2, ..., cn] the corresponding email types, each either legitimate or phishing; the task was to classify each given email sample as either legitimate or phishing. Two data sets were used, one with headers and one without. Data set statistics are given in Table 1 for training and Table 2 for testing.

Table 1: Training Dataset details

| Training Dataset | Legitimate | Phishing | Total |
|------------------|------------|----------|-------|
| With header      | 4082       | 501      | 4583  |
| Without header   | 5088       | 612      | 5700  |

Table 2: Testing Dataset details

| Testing Dataset | Data Samples |
|-----------------|--------------|
| With header     | 4195         |
| Without header  | 4300         |

3.2 Proposed Architecture

We used a count-based representation to create our model. A diagrammatic representation of our architecture is shown in Figure 1. The email samples from the data set are first passed through a count-based representation, here TF-IDF, for word representation. It is then combined with domain-level features to get our input word representation for the machine learning algorithms. The domain-level features include the most commonly appearing words (40 features), for example password, fraudulent and business, and special characters like $, #, !, (, [, &, etc.; all stop words were removed. These are then passed through Logistic Regression, Naive Bayes and Support Vector Machine to classify phishing and legitimate mails.

Figure 1: Proposed Architecture

Table 3: Statistics of 10-fold cross validation

| Method                 | Task           | Accuracy (%) |
|------------------------|----------------|--------------|
| Logistic Regression    | Without Header | 92.2         |
| Naive Bayes            | Without Header | 93.4         |
| Support Vector Machine | Without Header | 94.3         |
| Logistic Regression    | With Header    | 91.2         |
| Naive Bayes            | With Header    | 92.2         |
| Support Vector Machine | With Header    | 93.3         |
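The feature construction described in Section 3.2 can be sketched roughly as follows: each email gets a TF-IDF vector over the vocabulary, and counts of domain words and special characters are appended to it. This is only an illustration under stated assumptions. The word and character lists contain just the examples named in the text (the full 40-word list is not given), the stop-word list and tokenizer are hypothetical, and the paper does not say whether the domain-level features are counts or binary flags.

```python
import math
from collections import Counter

# Assumed, partial lists: only the examples named in the paper.
DOMAIN_WORDS = ["password", "fraudulent", "business"]
SPECIAL_CHARS = ["$", "#", "!", "(", "[", "&"]
STOP_WORDS = {"the", "a", "is", "to", "your", "for"}  # hypothetical stop-word list

def tokenize(text):
    """Lower-case, split on whitespace, strip punctuation, and drop stop words."""
    tokens = (tok.strip(".,;:!?()[]$#&\"'") for tok in text.lower().split())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]

def build_features(emails):
    """Return the vocabulary and one vector per email: TF-IDF part + domain-level counts."""
    docs = [tokenize(e) for e in emails]
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))

    vectors = []
    for raw, doc in zip(emails, docs):
        counts = Counter(doc)
        total = max(len(doc), 1)  # guard against emails with no remaining tokens
        tfidf_part = [
            (counts[t] / total) * math.log(n_docs / doc_freq[t]) for t in vocab
        ]
        word_part = [counts[w] for w in DOMAIN_WORDS]     # domain word counts
        char_part = [raw.count(c) for c in SPECIAL_CHARS]  # special character counts
        vectors.append(tfidf_part + word_part + char_part)
    return vocab, vectors

# Hypothetical emails, for illustration only.
emails = [
    "Verify your password now! Fraudulent activity detected & account locked.",
    "Agenda for the business meeting is attached.",
]
vocab, X = build_features(emails)
print(len(vocab), len(X[0]))
```

Each resulting vector has one entry per vocabulary term followed by the domain-level counts, and all entries are non-negative, so the vectors could be passed to any of the three classifiers.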
4 Results

Our model, built using the above architecture, was trained on the data sets with headers and without headers for classification of phishing and legitimate mails. We trained a total of six models, one each for Logistic Regression, Naive Bayes and Support Vector Machine, for mails with header and without header. We used 10-fold cross validation on our training data, and the results obtained by our models are consolidated in Table 3. For the data set without headers, SVM gave the highest accuracy with 94.3%, and for the data set with headers SVM gave the highest accuracy with 93.3%. We did not extract any features from the headers, but extracting features from headers may increase the accuracy. Our model was tested on test data by the IWSPA-AP Shared Task committee, and the corresponding results for True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN), Accuracy, Precision, Recall and F1 score for our six models are summarized in Table 4.

Table 4: Statistics of Test Result

| Method                 | Task           | TP   | TN  | FP  | FN  | Accuracy | Precision | Recall | F1 score |
|------------------------|----------------|------|-----|-----|-----|----------|-----------|--------|----------|
| Logistic Regression    | Without Header | 3784 | 325 | 150 | 41  | 0.95     | 0.96      | 0.98   | 0.97     |
| Naive Bayes            | Without Header | 3807 | 258 | 217 | 18  | 0.94     | 0.94      | 0.99   | 0.97     |
| Support Vector Machine | Without Header | 3671 | 337 | 138 | 154 | 0.93     | 0.96      | 0.95   | 0.96     |
| Logistic Regression    | With Header    | 3612 | 490 | 6   | 87  | 0.97     | 0.99      | 0.97   | 0.98     |
| Naive Bayes            | With Header    | 3572 | 489 | 7   | 127 | 0.96     | 0.99      | 0.96   | 0.98     |
| Support Vector Machine | With Header    | 3561 | 458 | 38  | 138 | 0.95     | 0.98      | 0.96   | 0.97     |
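The derived columns of Table 4 follow from the four counts through the standard definitions of accuracy, precision, recall and F1. The sketch below recomputes them from the TP, TN, FP and FN values in the table. One assumption is made about presentation: the reported two-decimal values appear consistent with truncation rather than rounding, so the code truncates.

```python
import math

def truncate(value, places=2):
    """Cut a value to the given number of decimals without rounding."""
    factor = 10 ** places
    return math.floor(value * factor) / factor

def metrics(tp, tn, fp, fn):
    """Standard accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# (method, task) -> (TP, TN, FP, FN), copied from Table 4.
counts = {
    ("LR", "without header"): (3784, 325, 150, 41),
    ("NB", "without header"): (3807, 258, 217, 18),
    ("SVM", "without header"): (3671, 337, 138, 154),
    ("LR", "with header"): (3612, 490, 6, 87),
    ("NB", "with header"): (3572, 489, 7, 127),
    ("SVM", "with header"): (3561, 458, 38, 138),
}

# (accuracy, precision, recall, F1) as reported in Table 4.
reported = {
    ("LR", "without header"): (0.95, 0.96, 0.98, 0.97),
    ("NB", "without header"): (0.94, 0.94, 0.99, 0.97),
    ("SVM", "without header"): (0.93, 0.96, 0.95, 0.96),
    ("LR", "with header"): (0.97, 0.99, 0.97, 0.98),
    ("NB", "with header"): (0.96, 0.99, 0.96, 0.98),
    ("SVM", "with header"): (0.95, 0.98, 0.96, 0.97),
}

recomputed = {key: tuple(truncate(m) for m in metrics(*c)) for key, c in counts.items()}
for key, values in recomputed.items():
    print(key, values, "matches table:", values == reported[key])
```

The counts in each group also sum to the test set sizes in Table 2 (4300 without header, 4195 with header).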
5 Conclusion

This paper evaluated the performance of machine learning based classifiers for distinguishing phishing emails from legitimate ones. We created a model using a count-based representation combined with domain-level features as the word representation, and passed it to various machine learning techniques such as Logistic Regression, Naive Bayes and Support Vector Machine to classify whether an email is phishing or legitimate. Both sub tasks belong to the unconstrained category, i.e., any data sets can be used during training, and the data sets for both tasks were highly imbalanced. Even so, we did not use any external data sources and were still able to achieve a good detection rate for phishing email in both sub tasks. By adding additional data sources, we can considerably increase the detection rate of phishing emails for the proposed methodology.

Acknowledgements

This research was supported in part by Paramount Computer Systems. We are grateful to NVIDIA India for the GPU hardware support to the research grant. We are grateful to the Computational Engineering and Networking (CEN) department for encouraging the research.

References

[BMS08] Ram Basnet, Srinivas Mukkamala, and Andrew H Sung. Detection of phishing attacks: A machine learning approach. In Soft Computing Applications in Industry, pages 373–383. Springer, 2008.

[DAY+15] Ammar Yahya Daeef, R Badlishah Ahmad, Yasmin Yacob, Naimah Yaakob, and Mohd Nazri Bin Mohd Warip. Phishing email classifiers evaluation: Email body and header approach. Journal of Theoretical and Applied Information Technology, 80(2):354, 2015.

[EC] Louis Eugene and Isaac Caswell. Making a manageable email experience with deep learning.

[EDB+18] Ayman Elaassal, Avisha Das, Shahryar Baki, Luis De Moraes, and Rakesh Verma. IWSPA-AP: Anti-phishing shared task at ACM International Workshop on Security and Privacy Analytics. In Proceedings of the 1st IWSPA Anti-Phishing Shared Task. CEUR, 2018.

[EDMB+18] Ayman Elaassal, Luis De Moraes, Shahryar Baki, Rakesh Verma, and Avisha Das. IWSPA-AP shared task email dataset, 2018.

[FBJ+15] Lv Fang, Wang Bailing, Huang Junheng, Sun Yushan, and Wei Yuliang. A proactive discovery and filtering solution on phishing websites. In Big Data (Big Data), 2015 IEEE International Conference on, pages 2348–2355. IEEE, 2015.

[FCT+15] Lew May Form, Kang Leng Chiew, Wei King Tiong, et al. Phishing email detection technique by using hybrid features. In IT in Asia (CITA), 2015 9th International Conference on, pages 1–5. IEEE, 2015.

[KK15] Sukhjeel Kaui and Amrit Kaur. Detection of phishing webpages using weights computed through genetic algorithm. In MOOCs, Innovation and Technology in Education (MITE), 2015 IEEE 3rd International Conference on, pages 331–336. IEEE, 2015.

[KKMK15] Binay Kumar, Pankaj Kumar, Ankit Mundra, and Shikha Kabra. Dc scanner: Detecting phishing attack. In Image Information Processing (ICIIP), 2015 Third International Conference on, pages 271–276. IEEE, 2015.

[LBH15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.

[LNRW] Christopher Lennan, Bastian Naber, Jan Reher, and Leon Weber. End-to-end spam classification with neural networks.

[MN+98] Andrew McCallum, Kamal Nigam, et al. A comparison of event models for naive bayes text classification. In AAAI-98 workshop on learning for text categorization, volume 752, pages 41–48. Citeseer, 1998.

[PR12] Mayank Pandey and Vadlamani Ravi. Detecting phishing e-mails using text and data mining. In Computational Intelligence & Computing Research (ICCIC), 2012 IEEE International Conference on, pages 1–6. IEEE, 2012.

[SAZ+15] Sami Smadi, Nauman Aslam, Li Zhang, Rafe Alasem, and MA Hossain. Detection of phishing emails using data mining algorithms. In Software, Knowledge, Information Management and Applications (SKIMA), 2015 9th International Conference on, pages 1–8. IEEE, 2015.

[SZL+15] Hongzhou Sha, Zhou Zhou, Qingyun Liu, Tingwen Liu, and Chao Zheng. Limited dictionary builder: An approach to select representative tokens for malicious urls detection. In Communications (ICC), 2015 IEEE International Conference on, pages 7077–7082. IEEE, 2015.

[TC09] Fergus Toolan and Joe Carthy. Phishing detection using classifier ensembles. In eCrime Researchers Summit, 2009. eCRIME'09., pages 1–9. IEEE, 2009.

[VSP18a] R Vinayakumar, KP Soman, and Prabaharan Poornachandran. Detecting malicious domain names using deep learning approaches at scale. Journal of Intelligent & Fuzzy Systems, 34(3):1355–1367, 2018.

[VSP18b] R Vinayakumar, KP Soman, and Prabaharan Poornachandran. Evaluating deep learning approaches to characterize and classify malicious urls. Journal of Intelligent & Fuzzy Systems, 34(3):1333–1343, 2018.