Machine Learning Based Phishing E-mail Detection
Security-CEN@Amrita

Nidhin A Unnithan, Harikrishnan NB, Akarsh S, Vinayakumar R, Soman KP
Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, India

Abstract

Phishing email is a significant threat in today's world. The rate at which phishing emails are generated is increasing tremendously day by day. It is high time to deploy a self-learning system that detects and prevents phishing email efficiently and within a bounded time. This work proposes a system which uses a term-document matrix as the feature engineering mechanism and classical machine learning techniques to distinguish phishing emails from legitimate ones. The system also incorporates domain knowledge and lexical features as part of the feature engineering mechanism. The efficiency of the system is compared across different classical machine learning techniques. Based on accuracy, we propose the best model for the formulated problem.

1 Introduction

Email plays an important part in everybody's life. It is one of the easiest and most effective means of transferring messages and files. Even though there are many modes of communication, the popularity of email has not diminished, as it is considered one of the safest and fastest ways to transfer messages over networks and is an inexpensive method of communication.

Email usage has increased tremendously compared to previous decades. In 2017 there were nearly 4.8 billion email users, and it is estimated that this will rise to 5.6 billion users by 2021, as email is considered the main medium for transferring messages over other apps. The main problem with email, however, is the presence of phishing mails. These are unwanted mails which may carry malware, fraud schemes, advertisements, etc. Compared to previous years, phishing mails have increased and have caused serious damage to businesses, corporations, individuals, and economies. Detecting fraud/phishing emails precisely is essential: extracting and analyzing these mails can reveal complex and interesting patterns, which allow a company to make appropriate decisions to block phishing mails. In the early stages of email communication, clear rules were followed. Nowadays, email services such as Microsoft Outlook, Mozilla Thunderbird, and Google's Gmail vary widely: they group mails into conversations and attempt to hide quoted parts in order to improve readability.

Copyright © by the paper's authors. Copying permitted for private and academic purposes.

One type of spam mail which is hazardous to users is the phishing mail. A phishing mail disguises itself as a legitimate mail but, once opened, can steal the user's data without their knowledge. Identifying phishing mails among spam mails is therefore very important. One way to protect data from phishing mail is to add a secondary password to the login credentials. Another is to alert the user when a phishing mail tries to steal data.

In [SAZ+15], Sami S et al. proposed a model for detecting phishing emails that relies on a preprocessing technique which extracts different parts of an email as features. The extracted features are fed into a J48 classification algorithm to perform classification. In [SZL+15], the authors considered meaningless tokens and new pages as the feature set and selected the features with better predictability from the initial feature set. They provide O(1) complexity as an evaluation method for each feature set to evaluate its predictive ability. In [KK15], Sukhjeel Kaui et al. used a genetic algorithm for the detection of phishing webpages and preferred a filter function for categorizing pages. Lv Fang et al. [FBJ+15] propose a solution to overcome the time lag in detecting phishing websites: they detect phishing websites by analyzing peculiarities in their WHOIS and URL information. In [VSP18b, VSP18a], deep learning methods were employed to detect malicious URLs and domains. Binay Kumar et al. used HTML content for detecting email phishing in [KKMK15]. Rachna Dhamija et al. in [TC09] mainly concentrated on understanding which phishing activities work during an attack and why, using a large data set of reported phishing activities. Fergus Toolan et al. took a different approach: they used only five features for classification with the C5.0 algorithm, which achieved higher precision than other algorithms. Mayank Pandey et al. [PR12] used different classification methods such as Multilayer Perceptron (MLP), Decision Trees (DT), Support Vector Machine (SVM), Group Method of Data Handling (GMDH), Probabilistic Neural Net (PNN), Genetic Programming (GP), and Logistic Regression (LR). Lew May Form et al. [FCT+15] proposed a method which uses hybrid features for detecting phishing emails. The features are called hybrid because they combine URL-based, behavior-based, and content-based features. They achieved an overall accuracy of 97.25% with an error rate of 2.75%.

Even though there are different ways to detect phishing, [DAY+15] gives an overall evaluation of the different classifiers used for phishing detection. Recently, count-based representations combined with domain-level features and machine learning techniques have been used for classifying phishing and legitimate mails [EDB+18, BMS08]. The proposed methodology uses a feature engineering approach combined with deep learning, which is one of the significant directions in which the field is moving, because it has performed well in most text classification tasks [LBH15] and even in phishing detection [LNRW, EC].

The remaining sections are organized as follows. Section 2 discusses the background of email representation and the machine learning algorithms. Section 3 describes the data set, the experiments, and the proposed architecture. Section 4 presents the results. The conclusion is given in Section 5.

2 Background

This section discusses the mathematical details of the traditional machine learning algorithms used and of vector space modeling techniques such as TF-IDF and bag of words.

2.1 Logistic Regression

Logistic regression is a classification algorithm used to separate data into different classes. It can be binary, ordinal, or multinomial. In binary logistic regression the outcome is one of two classes, 0 or 1, whereas in multinomial logistic regression the outcome can take more than two values. The activation function used is the sigmoid function, defined as

$$\sigma(x) = \frac{1}{1 + \exp(-w^{T} x)} \tag{1}$$

where $w$ is the weight vector and $x$ is the feature vector.
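As an illustration of Eq. (1), the following minimal sketch evaluates the sigmoid of a weighted feature sum and thresholds it at 0.5 to obtain a binary label. The weights `w`, the feature vector `x`, and the threshold are hypothetical values chosen for the example, not parameters from the experiments.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, x) -> float:
    """Probability of class 1: the sigmoid applied to the dot product w^T x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(z)


def predict(w, x, threshold: float = 0.5) -> int:
    """Binary decision: 1 if the probability reaches the threshold, else 0."""
    return 1 if predict_proba(w, x) >= threshold else 0


# Hypothetical weights and features, for illustration only.
w = [1.5, -2.0, 0.5]
x = [1.0, 0.2, 1.0]
p = predict_proba(w, x)  # w^T x = 1.5 - 0.4 + 0.5 = 1.6
print(f"P(class 1) = {p:.3f}, label = {predict(w, x)}")
```

In practice the weights are learned from training data; this sketch only shows how a prediction follows from Eq. (1) once they are known.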

2.2 Naive Bayes

Naive Bayes is a family of supervised learning algorithms based on Bayes' theorem, which uses conditional probability to compute the probability of events. Binary and multi-class classification can be done with different variants such as GaussianNB, MultinomialNB, and BernoulliNB [MN+98]. For this problem we used MultinomialNB from scikit-learn.
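To make the multinomial variant concrete, here is a minimal from-scratch sketch using only the Python standard library. It is not the scikit-learn implementation used in the experiments. The toy emails, the smoothing constant `alpha = 1.0`, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    vocab = sorted({word for doc in docs for word in doc})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            word: math.log((word_counts[c][word] + alpha) / total) for word in vocab
        }
    return log_prior, log_likelihood


def predict_nb(model, doc):
    """Return the class with the highest posterior log score."""
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][word] for word in doc if word in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Toy tokenized emails, invented for this example.
docs = [
    ["verify", "your", "password", "now"],
    ["account", "suspended", "verify", "password"],
    ["meeting", "agenda", "attached"],
    ["lunch", "meeting", "tomorrow"],
]
labels = ["phishing", "phishing", "legitimate", "legitimate"]

model = train_multinomial_nb(docs, labels)
print(predict_nb(model, ["please", "verify", "password"]))
print(predict_nb(model, ["meeting", "tomorrow"]))
```

Working in log space avoids numerical underflow when many word probabilities are multiplied together.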

2.3 Support Vector Machine

SVM is a supervised classification algorithm which builds a model that separates the data into two classes. There are two types, linear SVM and non-linear SVM. The decision boundary of a linear SVM is a hyperplane in feature space, i.e. a linear function of the features. Non-linear SVMs result in non-linear decision boundaries in the original feature space. Of the different kernels available, we used the radial basis function (RBF) for our SVM model.
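The sketch below illustrates how an RBF kernel yields a non-linear decision boundary. It evaluates a kernel decision function on an XOR-style layout, which no single hyperplane can separate. The support vectors, dual coefficients, bias, and `GAMMA` are hand-picked for the example and are not the result of training an SVM.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Hand-picked values for illustration; a trained SVM would learn these.
SUPPORT_VECTORS = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
LABELS = [-1, -1, 1, 1]
DUAL_COEFS = [1.0, 1.0, 1.0, 1.0]
BIAS = 0.0
GAMMA = 2.0


def decision_function(x):
    """Weighted sum of kernel similarities to the support vectors, plus bias."""
    return BIAS + sum(
        coef * label * rbf_kernel(sv, x, GAMMA)
        for coef, label, sv in zip(DUAL_COEFS, LABELS, SUPPORT_VECTORS)
    )


def classify(x):
    """Sign of the decision function gives the predicted class."""
    return 1 if decision_function(x) >= 0 else -1


for point in [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]:
    print(point, classify(point), round(decision_function(point), 3))
```

Each point is assigned the label of the support vectors it lies closest to, which is what lets the kernel separate classes that a hyperplane in the original feature space cannot.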

2.4 TF-IDF

TF-IDF stands for term frequency-inverse document frequency. Its weight is a statistical measure of how important a word is to a document, which can in turn be used for information retrieval and text mining. Term frequency indicates how frequently a term occurs in a document. It is defined as

$$\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} \tag{2}$$

where $f_{t,d}$ is the number of occurrences of term $t$ in document $d$.

Inverse document frequency indicates how important a term is. When we compute term frequency, all terms are given equal importance, whether they are stop words or terminology words. We therefore need to weigh up terminology words, which are less frequent than stop words, by computing the inverse document frequency

$$\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} \tag{3}$$

where $N$ is the total number of documents in the corpus.

TF-IDF can now be calculated as

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D) \tag{4}$$

Additionally, domain-level features are added. These include a list of the most commonly occurring words and a list of special characters.
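A small worked example may help make the TF-IDF definitions concrete. The sketch below computes term frequency, inverse document frequency, and their product directly from token lists. The three-document corpus is invented for illustration, and returning 0 for a term that occurs in no document is an added assumption, since the formula is undefined in that case.

```python
import math
from collections import Counter


def tf(term, doc):
    """Term frequency: occurrences of the term divided by the token count of the document."""
    return Counter(doc)[term] / len(doc)


def idf(term, corpus):
    """Inverse document frequency: log(N / number of documents containing the term)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        return 0.0  # assumption: unseen terms get weight 0 instead of a division error
    return math.log(len(corpus) / n_containing)


def tfidf(term, doc, corpus):
    """TF-IDF weight of a term in one document of the corpus."""
    return tf(term, doc) * idf(term, corpus)


# Toy corpus of tokenized emails, invented for this example.
corpus = [
    ["reset", "the", "password", "the"],
    ["the", "meeting", "is", "today"],
    ["see", "the", "agenda"],
]

doc = corpus[0]
print(tfidf("the", doc, corpus))       # occurs in every document, so idf = log(3/3) = 0
print(tfidf("password", doc, corpus))  # occurs in one document: (1/4) * log(3)
```

The stop word "the" is frequent in the document but receives zero weight, while the rarer term "password" receives a positive weight. Library implementations often apply smoothing and normalization, so their values can differ from this direct calculation.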

3 Experiments

3.1 Dataset details

Phishing email detection was the subject of the anti-phishing shared task at the 4th ACM International Workshop on Security and Privacy Analytics [EDMB+18]. Let E = [e1, e2, ..., en] be a set of email samples and C = [c1, c2, ..., cn] the corresponding email types, each either legitimate or phishing; the task was to classify each given email sample as legitimate or phishing. Two data sets were used, one with headers and one without headers. Data set statistics are given in Table 1 for training and Table 2 for testing.

We used a count-based representation to create our model. A diagrammatic representation of our architecture is shown in Figure 1. The email samples from the data set are first passed through a count-based representation, here TF-IDF, for word representation. This is then combined with domain-level features to obtain the input representation for the machine learning algorithms. The domain-level features include the most commonly occurring words (40 features), for example password, fraudulent, and business, and special characters such as $, #, !, (, [, and &. All stop words were removed. The resulting features are then passed through Logistic Regression, Naive Bayes, and Support Vector Machine to classify mails as phishing or legitimate.

The model built using the above architecture was trained on the data sets with headers and without headers. We trained a total of six models: one each for Logistic Regression, Naive Bayes, and Support Vector Machine, for mails with headers and without headers. We used 10-fold cross-validation on our training data, and the results obtained are consolidated in Table 3. For the data set without headers, SVM gave the highest accuracy at 94.3%, and for the data set with headers, SVM again gave the highest accuracy at 93.3%. We did not extract any features from the headers, although doing so may increase the accuracy.
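The feature construction and validation scheme described above can be sketched as follows. This is a minimal illustration, not the authors' code. Only the three keywords and six special characters named in the text are used, since the full list of 40 words is not given. The stop-word list, the use of counts rather than binary indicators, and the unshuffled fold assignment are all assumptions.

```python
import re

KEYWORDS = ["password", "fraudulent", "business"]  # the 3 examples named in the text
SPECIAL_CHARS = ["$", "#", "!", "(", "[", "&"]
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "your"}  # hypothetical list


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Drop stop words before building the count-based representation."""
    return [t for t in tokens if t not in STOP_WORDS]


def domain_features(text):
    """Counts of each keyword and each special character (counts are an assumption)."""
    tokens = tokenize(text)
    keyword_counts = [tokens.count(k) for k in KEYWORDS]
    char_counts = [text.count(c) for c in SPECIAL_CHARS]
    return keyword_counts + char_counts


def combine(text_vector, text):
    """Concatenate a count-based text vector (e.g. TF-IDF) with the domain features."""
    return list(text_vector) + domain_features(text)


def k_fold_splits(n_samples, k=10):
    """Yield (train, validation) index lists; each sample is held out exactly once."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


email = "Your password is about to expire! Confirm your business account (now) & win $100!"
print(remove_stop_words(tokenize(email)))
print(domain_features(email))
print([len(val) for _, val in k_fold_splits(25, k=10)])
```

Each classifier would then be trained on the training indices of every fold and scored on the held-out indices, with the ten scores averaged. With imbalanced data, shuffling and stratifying the folds by class is usually advisable.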
Our models were tested on test data by the IWSPA-AP Shared Task committee, and the corresponding results (True Positive, True Negative, False Positive, False Negative, Accuracy, Precision, Recall, and F1 score) for our six models are summarized in Table 4.
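For reference, the reported metrics follow from the four confusion-matrix counts as sketched below. The counts in the example are hypothetical and are not taken from Table 4. They are chosen to show how accuracy can stay high on imbalanced data while recall on the phishing class is low.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for an imbalanced test set (90 phishing, 910 legitimate).
metrics = classification_metrics(tp=40, tn=900, fp=10, fn=50)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Here phishing is treated as the positive class, which is an assumption of the example.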

5 Conclusion

This paper evaluated the performance of machine learning based classifiers for distinguishing phishing emails from legitimate ones. We created a model using a count-based representation combined with domain-level features as the word representation and passed it to various machine learning techniques, namely Logistic Regression, Naive Bayes, and Support Vector Machine, to classify each email as phishing or legitimate. Both sub tasks belong to the unconstrained category, i.e., any data sets can be used during training, and the data sets for both tasks were highly imbalanced. Even so, we did not use any external data sources and were still able to achieve a good detection rate for phishing email in both sub tasks. By adding additional data sources, the detection rate of the proposed methodology could be increased considerably.

Acknowledgements

This research was supported in part by Paramount Computer Systems. We are grateful to NVIDIA India for the GPU hardware support provided through the research grant. We are also grateful to the Computational Engineering and Networking (CEN) department for encouraging the research.

References

[BMS08] Ram Basnet, Srinivas Mukkamala, and Andrew H Sung. Detection of phishing attacks: A machine learning approach. In Soft Computing Applications in Industry, pages 373-383. Springer, 2008.

[DAY+15] Ammar Yahya Daeef, R Badlishah Ahmad, Yasmin Yacob, Naimah Yaakob, and Mohd Nazri Bin Mohd Warip. Phishing email classifiers evaluation: Email body and header approach. Journal of Theoretical and Applied Information Technology, 80(2):354, 2015.

[EC] Louis Eugene and Isaac Caswell. Making a manageable email experience with deep learning.

[EDB+18] Ayman Elaassal, Avisha Das, Shahryar Baki, Luis De Moraes, and Rakesh Verma. IWSPA-AP: Anti-phishing shared task at ACM International Workshop on Security and Privacy Analytics. In Proceedings of the 1st IWSPA Anti-Phishing Shared Task. CEUR, 2018.

[EDMB+18] Ayman Elaassal, Luis De Moraes, Shahryar Baki, Rakesh Verma, and Avisha Das. IWSPA-AP shared task email dataset, 2018.

[FBJ+15] Lv Fang, Wang Bailing, Huang Junheng, Sun Yushan, and Wei Yuliang. A proactive discovery and filtering solution on phishing websites. In Big Data (Big Data), 2015 IEEE International Conference on, pages 2348-2355. IEEE, 2015.

[FCT+15] Lew May Form, Kang Leng Chiew, Wei King Tiong, et al. Phishing email detection technique by using hybrid features. In IT in Asia (CITA), 2015 9th International Conference on, pages 1-5. IEEE, 2015.

[KK15] Sukhjeel Kaui and Amrit Kaur. Detection of phishing webpages using weights computed through genetic algorithm. In MOOCs, Innovation and Technology in Education (MITE), 2015 IEEE 3rd International Conference on, pages 331-336. IEEE, 2015.

[KKMK15] Binay Kumar, Pankaj Kumar, Ankit Mundra, and Shikha Kabra. DC scanner: Detecting phishing attack. In Image Information Processing (ICIIP), 2015 Third International Conference on, pages 271-276. IEEE, 2015.

[LBH15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553), 2015.

[LNRW] Christopher Lennan, Bastian Naber, Jan Reher, and Leon Weber. End-to-end spam classification with neural networks.

[MN+98]

[PR12] Mayank Pandey and Vadlamani Ravi. Detecting phishing e-mails using text and data mining. In Computational Intelligence & Computing Research (ICCIC), 2012 IEEE International Conference on, pages 1-6. IEEE, 2012.

[SAZ+15]

[SZL+15]

[TC09]

[VSP18a] R Vinayakumar, KP Soman, and Prabaharan Poornachandran. Detecting malicious domain names using deep learning approaches at scale. Journal of Intelligent & Fuzzy Systems, 34(3):1355-1367, 2018.

[VSP18b] R Vinayakumar, KP Soman, and Prabaharan Poornachandran. Evaluating deep learning approaches to characterize and classify malicious URLs. Journal of Intelligent & Fuzzy Systems, 34(3):1333-1343, 2018.