
Tempe, Arizona, USA

A Machine Learning approach towards Phishing Email Detection CEN-Security@IWSPA 2018

Harikrishnan NB, Vinayakumar R, Soman KP
Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, India

2018


Email is a platform through which we communicate and exchange ideas. In today's world email plays a key role irrespective of the field. In such a scenario, phishing emails are one of the major threats. These emails seem legitimate but lead users to malicious sites. As a result, the user, organization, or institution ends up as the prey of online predators. To tackle such problems, several statistical methods have been applied. In this paper we make use of a distributional representation, namely TF-IDF, for the numeric representation of phishing emails. We also present a comparative study of classical machine learning techniques such as Random Forest, AdaBoost, Naive Bayes, Decision Tree, and SVM.

In today's world communication plays a key role in all aspects of life. Email is a common platform used for fast and efficient communication, and it has become an inevitable part of everyday life. With the advance of digitization, the dependency on email has been increasing day by day. This increasing dependency calls for a way to manage the huge volume of emails. The emails received include important emails as well as phishing emails. Phishing emails often lead to malicious websites and result in personal details being shared with attackers. To thwart such situations, spam and phishing email classifiers are widely used. Blacklisting, which falls under the category of list-based filters, is a popular method to thwart phishing emails. It works by blocking emails from senders that appear in the blacklist. The blacklist consists of records of the IP addresses and email addresses of malicious users. When a new email arrives, the spam and phishing email filter checks its IP and email address against those in the blacklist and decides whether the email has to be marked as phishing or not. Other list-based filters include the whitelist, which allows emails from senders specified by the user. Other popular methods include content-based filters, such as word-based filters, heuristic filters, and Bayesian filters. Word-based filters block emails that contain certain specific words. The main drawback of this method is its failure to classify new malicious emails: human intervention is required to update the list.
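The list-based and word-based filters described above can be sketched roughly as follows. This is a minimal illustration and not a filter used in this work: the `Email` class, the list contents, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Email:
    sender: str      # email address of the sender
    sender_ip: str   # IP address the message arrived from
    body: str


# Hypothetical lists; in practice these are maintained by administrators or users.
BLACKLIST_ADDRESSES = {"offers@bad.example"}
BLACKLIST_IPS = {"203.0.113.7"}
WHITELIST_ADDRESSES = {"colleague@corp.example"}
BLOCKED_WORDS = {"lottery", "password"}


def list_based_filter(email: Email) -> str:
    """Return 'legitimate', 'phishing', or 'unknown' from list lookups alone."""
    if email.sender in WHITELIST_ADDRESSES:
        return "legitimate"
    if email.sender in BLACKLIST_ADDRESSES or email.sender_ip in BLACKLIST_IPS:
        return "phishing"
    return "unknown"


def word_based_filter(email: Email) -> bool:
    """Flag the email if its body contains any blocked word."""
    words = set(email.body.lower().split())
    return bool(words & BLOCKED_WORDS)


# A sender already on the blacklist is caught by the list lookup.
known = Email("offers@bad.example", "198.51.100.1", "hello")
# A new sender with new wording passes both filters until the lists are updated.
novel = Email("new@attacker.example", "198.51.100.9", "Verify your account now")
print(list_based_filter(known), list_based_filter(novel), word_based_filter(novel))
```

The second email illustrates the drawback noted above: neither filter reacts to it until someone adds the sender or the wording to a list by hand.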

Phishing email is a common name for spam emails that have malicious intentions. Phishing emails are a potential danger, especially to multinational companies, the banking sector, and even hospitals. Phishing emails are also used by hackers to inject malware into a system; the recent ransomware attacks [KRB+15] are a good example of this. These phishing emails seem legitimate but contain malicious content that can steal one's valuable details such as account numbers and credit/debit card details. In such a situation, a model has to be developed that can detect and classify phishing emails efficiently. The traditional methods rely on human intervention, which calls for automation in recognizing emails as either phishing or not. For this reason, research is moving in the direction of machine learning and deep learning.

Recent developments in machine learning and deep learning have shown promising results in fields such as computer vision, natural language processing, and cyber security. Taking this into account, we use machine learning models such as Decision Tree, Logistic Regression, Random Forest, Naive Bayes, KNN, AdaBoost, and SVM to classify emails as either phishing or legitimate. The proposed method uses SVD (Singular Value Decomposition) and NMF (Non-negative Matrix Factorization) for feature extraction and dimensionality reduction. We use TF-IDF (Term Frequency Inverse Document Frequency) for the numeric representation of words.

The paper is structured as follows: Section 2 presents related work, Section 3 describes the dataset, Section 4 highlights the methodology used, and Sections 5, 6, and 7 present the results, conclusion, and acknowledgement respectively.

2 Related Work

Phishing email detection can be treated as a sub-problem of spam detection. For several years spam detection has been a rich area of research; [AKCS00], [Sch03], and [CL06] are examples of earlier work on anti-spam filters. Comparatively less work has been done specifically on phishing email detection. The dataset most commonly used in research on phishing email is PhishingCorpus [Naz10], [SVKS15], [BVP]. PhishingCorpus consists of a group of hand-screened emails [GNN11], which makes the dataset challenging. The existing learning-based approaches are presented in a structured overview in [BB08]. Currently, various experts are tackling the problem of phishing email classification from the perspective of text classification [BB08]. The authors of [CNU06] performed phishing email detection by identifying structural features of the emails; these features are passed to an SVM for detecting phishing emails. [BCP+08] proposed two methods, adaptive Dynamic Markov Chains (DMC) and a latent class-topic model, to classify emails. The adaptive Dynamic Markov Chains gave performance similar to the standard version while using two thirds less memory. [ANNWN07] proposed machine learning based models such as logistic regression, SVM, and random forest for classifying emails as either spam or legitimate. [AGA+13] describes the types of phishing attacks and their classification; however, it does not explore the available datasets and feature engineering techniques. Researchers have also analyzed the classification of emails based on their contents. This paper uses a TF-IDF representation followed by dimensionality reduction, both to capture the major contributing factors in the dataset and to reduce the computational cost. The result is then passed to classical machine learning techniques for classifying the data as either legitimate or phishing.
Researchers have also moved in the direction of applying deep learning techniques to classify URLs as benign or malicious [VSP18b], [VSP18a]. In [VSPSK18] and [VSP17] the authors have used deep learning techniques to classify and evaluate domain generation algorithms.

3 Dataset description

The shared task consists of two tasks. Task 1 is Email with headers and Task 2 is Email with no headers. The dataset details [EDMB+18], [EDB+18] are provided in the table below:

We are given a set of emails represented as $D = [e_1, e_2, \ldots, e_n]$ and its labels $C = [c_1, c_2, \ldots, c_n]$. The labels are either 0 or 1. The machine learning model learns the patterns that map the training data to the corresponding labels. After learning, the model is used to predict the labels for the test data.

To represent the data in numeric format we used the TF-IDF (Term Frequency Inverse Document Frequency) representation for both tasks. TF-IDF represents the importance of a word in a corpus. The TF-IDF representation is followed by SVD/NMF for feature selection and dimensionality reduction. We used a train-test split and held out 33% of the training data as validation data for evaluating the performance of the model.
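As a rough illustration of the representation and the 33% hold-out described above, the sketch below computes one common TF-IDF variant (relative term frequency times the logarithm of inverse document frequency) using only the standard library. The paper does not state which TF-IDF formula or tokenizer was used, and library implementations often apply smoothing and normalization, so the function names, the whitespace tokenizer, and the exact formula here are assumptions.

```python
import math
import random
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, matrix) with matrix[i][j] = tf * idf of term j in doc i."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        length = max(len(doc), 1)  # guard against empty documents
        matrix.append([counts[t] / length * idf[t] for t in vocab])
    return vocab, matrix


def holdout_split(rows, labels, val_fraction=0.33, seed=0):
    """Shuffle the sample indices and hold out val_fraction of them for validation."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_val = round(len(idx) * val_fraction)
    val_idx, train_idx = idx[:n_val], idx[n_val:]

    def pick(seq, ids):
        return [seq[i] for i in ids]

    return (pick(rows, train_idx), pick(rows, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))


# Toy corpus; the label 1 is assumed here to mean phishing.
emails = ["verify your account now", "meeting notes attached", "verify your password now"]
labels = [1, 0, 1]
vocab, X = tfidf(emails)
X_train, X_val, y_train, y_val = holdout_split(X, labels)
```

A term that occurs in every document receives an IDF of zero under this variant, which is one reason smoothed formulas are popular in practice.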

We evaluated the performance of the TF-IDF representation and of the TF-IDF + SVD/NMF representation on the validation data. For TF-IDF + SVD/NMF, the rank is taken as 30, i.e., the train and test data matrices have 30 columns after dimensionality reduction. The performance of TF-IDF + SVD/NMF with 30 columns was similar to that of the plain TF-IDF representation on the validation data. This numeric representation of the data is passed to different machine learning algorithms.

4.1.1 Data representation for emails with headers

1. TF-IDF representation of the data. The vocabulary is built using the train and test data.
2. SVD/NMF for feature extraction and dimensionality reduction.
3. Step 2 is followed by applying classical ML techniques such as Decision Tree, Random Forest, AdaBoost, KNN, and SVM.

4.1.2 Data representation for emails with no headers

1. Data preprocessing: count the number of '@' and '#' symbols in each data sample. The '@' and '#' symbols are then removed from the original corpus.
2. TF-IDF representation of the data, followed by appending the '@' count and '#' count.
3. SVD/NMF for feature extraction and dimensionality reduction.
4. Step 3 is followed by applying classical ML techniques such as Decision Tree, Random Forest, AdaBoost, KNN, and SVM.
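The preprocessing for emails with no headers could look roughly like the sketch below. It reads the description as: count the two symbols, strip them from the text before TF-IDF, then append the counts as two extra feature columns. The function names are hypothetical, and replacing each symbol with a space (rather than deleting it) is an assumption.

```python
def split_symbol_counts(text):
    """Count '@' and '#' in text; return (text_without_symbols, at_count, hash_count)."""
    at_count = text.count("@")
    hash_count = text.count("#")
    # Assumption: each symbol is replaced by a space so neighbouring tokens stay separate.
    cleaned = text.replace("@", " ").replace("#", " ")
    return cleaned, at_count, hash_count


def append_counts(feature_row, at_count, hash_count):
    """Append the two symbol counts as extra columns to a TF-IDF feature row."""
    return list(feature_row) + [float(at_count), float(hash_count)]


cleaned, n_at, n_hash = split_symbol_counts("contact admin@bank.example #urgent #verify")
row = append_counts([0.1, 0.0], n_at, n_hash)  # [0.1, 0.0] stands in for a TF-IDF row
```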

In this paper we have used classical machine learning techniques such as Decision Tree, K-Nearest Neighbors, Logistic Regression, Naive Bayes, Random Forest, and SVM. The metrics used for understanding the performance are the following:

1. Accuracy
2. Precision
3. Recall
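For reference, these metrics (together with the F1-score reported in the results) can be computed from the confusion counts as in the sketch below. The function name is hypothetical, and treating label 1 as the positive class is an assumption, since the text only states that the labels are 0 or 1.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    # Each ratio falls back to 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 2 true positives, 1 false negative, 1 false positive, 4 true negatives.
scores = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
```

On imbalanced data, accuracy alone can look high even when few phishing emails are caught, which is why precision and recall are reported alongside it.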

The techniques used for feature extraction and dimensionality reduction are NMF and SVD. [LS99] describes Non-negative Matrix Factorization in detail. The TF-IDF matrix is passed as input to NMF and a group of topics is generated. Each topic represents a weighted set of co-occurring terms. The topics identified act as a basis, providing an efficient representation of the original corpus. NMF is useful as a feature extraction technique when the data has many attributes.
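To make the factorization concrete, the sketch below applies the multiplicative update rules of [LS99] for the squared-error objective to a small non-negative matrix, written in plain Python for readability. It is an illustration only: the paper does not say which NMF implementation or settings were used, the helper names are hypothetical, and the toy rank of 2 stands in for the rank of 30 used in the experiments.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)] for i in range(n)]
    return w, h


# Toy "TF-IDF" matrix with two groups of co-occurring terms.
V = [[1, 1, 0, 0], [2, 2, 0, 0], [0, 0, 3, 3], [0, 0, 1, 1]]
W0, H0 = nmf(V, rank=2, n_iter=0)   # random initialisation only
W, H = nmf(V, rank=2)               # after the multiplicative updates
err_before, err_after = frobenius_error(V, W0, H0), frobenius_error(V, W, H)
```

In this reading, the rows of H play the role of topics (weighted sets of terms), and each row of W gives one document's weights over those topics, which is the reduced feature vector passed to the classifiers.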

SVD (singular value decomposition) decomposes the TF-IDF matrix $T$ into three matrices, $U$, $\Sigma$, and $V^T$. $U$ contains the orthonormal eigenvectors of $T T^T$, $\Sigma$ is a diagonal matrix whose diagonal entries are the singular values, and $V^T$ contains the orthonormal eigenvectors of $T^T T$. SVD is a powerful tool with many applications in signal processing and image processing. It is mainly used for dimensionality reduction and for representing important features. The product $U \Sigma$ is used for extracting the features. In all cases the rank is taken as 30, so the train and test matrices shrink to size (number of data samples x 30). These extracted features are passed to different classical machine learning techniques.
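The rank-30 reduction can be written out as follows. The symbols $n$ (number of samples), $m$ (vocabulary size), $k$, and $Z$ are introduced here for illustration and do not appear in the original text; $U_k$, $\Sigma_k$, and $V_k$ denote the parts of the decomposition belonging to the $k$ largest singular values.

```latex
T = U \Sigma V^{T}, \qquad T \in \mathbb{R}^{n \times m}
\\
T \approx T_k = U_k \Sigma_k V_k^{T}, \qquad k = 30
\\
Z = U_k \Sigma_k = T V_k \in \mathbb{R}^{n \times k}
```

The last identity holds because the columns of $V$ are orthonormal, so $V^{T} V_k$ selects the first $k$ columns. Each row of $Z$ is the 30-dimensional feature vector of one email. Projecting unseen rows with the same $V_k$ is the usual way to reduce new data, though the text does not state whether the test matrix was reduced this way or decomposed separately.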

5 Results

This section provides details of the accuracy, precision, recall, and F1-score with respect to the training data. The following tables describe the performance of each classical machine learning technique on the formulated binary classification problem of detecting whether an email is phishing or legitimate. We used the train-test split from scikit-learn to split the training data into training and validation sets, with 33% of the training data used for validation. Tables 3 and 4 present the validation metrics for sub-task 1 (no header) and sub-task 2 (with header) respectively; the results in Tables 3 and 4 correspond to the TF-IDF representation of the data. Similarly, Tables 5 and 6 present the validation metrics for sub-task 1 (no header) and sub-task 2 (with header) respectively with the TF-IDF + SVD/NMF representation. In terms of training accuracy, Decision Tree and Random Forest outperformed the other classifiers in almost all cases. The results in Tables 3, 4, 5, and 6 show that the performance of the TF-IDF and TF-IDF + SVD/NMF representations is almost the same, which motivates us to use dimensionality reduction. Since 30 singular values are used, the pre-processed data set has size (number of rows, 30). Tables 7 and 8 present the metrics for the test set: Table 7 gives the metrics for the TF-IDF + SVD representation on the sub-task 1 and 2 test sets, and Table 8 gives the metrics for the TF-IDF + NMF representation on the sub-task 1 and 2 test sets.

[Table fragment, F1-Score column: 0.835, 0.877, 0.821, 0.733, 0.837, 0.882, 0.936]

6 Conclusion

In this paper we used TF-IDF + SVD and TF-IDF + NMF representations followed by ML techniques for classifying emails as either legitimate or phishing. Decision Tree and Random Forest achieved the highest training accuracy, but their results on the test data indicate overfitting. We attribute the overfitting to the datasets for both sub-tasks being highly imbalanced. Both sub-tasks belong to the unconstrained category, which means any other data sets may be used during training; even so, we did not use any external sources. Despite the highly imbalanced data sets, we were able to achieve a considerable phishing email detection rate in both sub-tasks. The detection rate of the proposed methodology could be enhanced by adding extra data sources, which we consider a significant direction for future work. Due to computational constraints, we could not try deep learning based methods; this can also be taken up as future work.

Acknowledgement

This research was supported in part by Paramount Computer Systems. We are also grateful to NVIDIA India for the GPU hardware support to the research grant. We are grateful to the Computational Engineering and Networking (CEN) department for encouraging the research.

References

[AGA+13] Ammar Almomani, BB Gupta, Samer Atawneh, A Meulenberg, and Eman Almomani. A survey of phishing email filtering techniques. IEEE Communications Surveys & Tutorials, 15(4):2070-2090, 2013.

[AKCS00] Ion Androutsopoulos, John Koutsias, Konstantinos V Chandrinos, and Constantine D Spyropoulos. An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pages 160-167. ACM, 2000.

[ANNWN07] Saeed Abu-Nimeh, Dario Nappa, Xinlei Wang, and Suku Nair. A comparison of machine learning techniques for phishing detection. In Proceedings of the anti-phishing working groups 2nd annual eCrime researchers summit, pages 60-69. ACM, 2007.

[EDMB+18] Ayman Elaassal, Luis De Moraes, Shahryar Baki, Rakesh Verma, and Avisha Das. IWSPA-AP shared task email dataset, 2018.

[GNN11] Hugo Gonzalez, Kara Nance, and Jose Nazario. Phishing by form: The abuse of form sites. In Malicious and Unwanted Software (MALWARE), 2011 6th International Conference on, pages 95-101. IEEE, 2011.

[KRB+15] Amin Kharraz, William Robertson, Davide Balzarotti, Leyla Bilge, and Engin Kirda. Cutting the Gordian knot: A look under the hood of ransomware attacks. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pages 3-24. Springer, 2015.

[LS99] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788, 1999.

[Naz10] J Nazario. Phishingcorpus homepage. Retrieved February, 2010.

[Sch03] Karl-Michael Schneider. A comparison of event models for naive Bayes anti-spam e-mail filtering. In Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1, pages 307-314. Association for Computational Linguistics, 2003.

[SVKS15] Shriya Se, R Vinayakumar, M Anand Kumar, and KP Soman. Amrita-cen@sail2015: Sentiment analysis in Indian languages. In International Conference on Mining Intelligence and Knowledge Exploration, pages 703-710. Springer, 2015.

[VSP17] R Vinayakumar, KP Soman, and Prabaharan Poornachandran. Deep encrypted text categorization. In Advances in Computing, Communications and Informatics (ICACCI), 2017 International Conference on, pages 364-370. IEEE, 2017.

[VSP18a] R Vinayakumar, KP Soman, and Prabaharan Poornachandran. Detecting malicious domain names using deep learning approaches at scale. Journal of Intelligent & Fuzzy Systems, 34(3):1355-1367, 2018.

[VSP18b] R Vinayakumar, KP Soman, and Prabaharan Poornachandran. Evaluating deep learning approaches to characterize and classify malicious URLs. Journal of Intelligent & Fuzzy Systems, 34(3):1333-1343, 2018.

[VSPSK18] R Vinayakumar, KP Soman, Prabaharan Poornachandran, and S Sachin Kumar. Evaluating deep learning approaches to characterize and classify the DGAs at scale. Journal of Intelligent & Fuzzy Systems, 34(3):1265-1276, 2018.