
PED-ML: Phishing Email Detection Using Classical Machine Learning Techniques CENSec@Amrita

Anu Vazhayil, Harikrishnan NB, Vinayakumar R, Soman KP
Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, India


In the modern era, all services are maintained online and everyone uses them to speed up day-to-day activities. These include social as well as financial activities, which involve the use of sensitive information to carry out the intended task. The growing use of such facilities underlines the importance of securing the data used to perform these actions. Over the last decade, phishing has become a serious threat to society, stealing sensitive information to gain access to these facilities. It is considered to be the most profitable cybercrime and, according to statistics from IBM's X-Force researchers, the number of people falling victim to such activities is increasing tremendously. As the risk posed by phishing emails increases steadily, detecting and countering them stands as one of the highest-priority tasks at hand. In the present work, we use a non-sequential representation, the term document matrix approach, followed by Singular Value Decomposition (SVD) and Non-negative Matrix Factorization (NMF), to model phishing email detection as a supervised classification problem that separates phishing emails from legitimate ones.

In: R. Verma, A. Das (eds.): Proceedings of the 1st AntiPhishing Shared Pilot at 4th ACM International Workshop on Security and Privacy Analytics (IWSPA 2018), Tempe, Arizona, USA, 21-03-2018, published at http://ceur-ws.org

1 Introduction

The growth of the internet has revolutionized the digital era. This revolution has entirely changed the way we communicate, carry out business, advertise and so on. In fact, in today's world a web presence is mandatory in order to establish a successful business, and in all cases important communication takes place through email. At the same time, there are instances where phishing emails are sent to users, and the main goal of such emails is to steal the user's sensitive information. Phishing emails do this by claiming to originate from trusted sources, and they contain links or attachments that try to obtain sensitive information from the user. In such a scenario, an efficient mechanism that detects and classifies phishing emails is needed. The conventional techniques used are blacklisting, greylisting and whitelisting. In blacklisting, the IP and email addresses of mails that attempt to collect users' private information are stored in a list, and all emails arriving from the addresses specified in the list are marked as phishing scams. A whitelist functions in exactly the opposite way, allowing emails from the trusted users specified in the whitelist. The drawback of these methods is that they require human involvement in defining and updating the lists, and they also fail to detect new phishing emails or variants of existing ones. Another popular method is the Bayesian filter, a heuristic approach that was widely used for detection during the 1990s. With the increase in computational capability, there has been a paradigm shift from conventional techniques to data-driven techniques, which have popularized machine learning in the area of cyber security [NVK+15].

A significant amount of research has gone into phishing email classification, and researchers have come up with many mathematical models to detect phishing emails. Some of the commonly used techniques are the naive Bayes classifier, boosted decision trees [CM01], SVM [DWV99a] and the LVQ-based neural network [CXMX05]. These methods need Bayesian prior knowledge about the nature of phishing emails [SVKS15], [BVP].

Recent trends in the fields of computer vision and Natural Language Processing (NLP) clearly convey the potential of machine learning techniques to tackle many significant problems in these areas. Accordingly, our research focuses on a machine learning based solution to classify emails as either phishing or legitimate. In this paper, a Term Document Matrix (TDM) is used for the non-sequential representation of the corpus. Feature engineering is an important step in all machine learning tasks. In order to extract the important features, SVD and NMF are applied to the data. The resulting features are then passed to machine learning algorithms such as Decision Tree, KNN, Naive Bayes, Random Forest, SVM and Logistic Regression.

The remaining part of the paper is arranged as follows: Section 2 presents related works; Section 3 discusses the model as a whole, covering the dataset description, the representation of the data and the methodology used; Sections 4 and 5 present the results and the conclusion respectively, followed by the acknowledgements.

2 Related Works

Phishing attacks are serious cyber threats for multinational companies as well as individual users. These emails appear legitimate but contain malicious content that can steal important information such as bank account numbers and credit card details, and bring huge losses to individuals and organizations. This calls for segregating such emails. Methods like blacklisting require human intervention to manually select and classify the emails. On the other hand, there are feature engineering techniques that analyse the contents of emails and help in the classification process. The work in [SDHH98] conveyed the importance of phishing-specific features for classification. In [KMAH04] the classification error was reduced by utilizing the temporal relations in email sequences and using those as features. Heuristics-based feature selection was highlighted in [MW04]. With the growth of computing facilities, data-driven methods came to be widely used in email classification. In [Faw03] and [Gee03] data mining techniques were introduced for filtering non-legitimate emails. [DWV99b] used PCA as a preprocessing technique for extracting features as well as for dimensionality reduction. The authors of [ANNWN07] used machine learning based models such as logistic regression, SVM and random forest for classifying emails as either phishing or legitimate. In this work we build on the importance of dimensionality reduction and the TDM representation of data. For dimensionality reduction we use SVD and NMF. The representation is then followed by the application of classical machine learning techniques to the processed data.

3 Proposed Architecture

The proposed architecture for an anti-phishing framework that separates phishing emails from legitimate ones is explained using a flow chart in Figure 1. The same model is used in both cases, where the data contains emails with and without headers. A detailed explanation of all the levels is given below.

3.1 Dataset description

As part of the first anti-phishing shared task at the workshop on Security and Privacy Analytics (IWSPA-AP 2018), two subtasks were held. Task 1 is classifying emails with headers and Task 2 is classifying emails with no headers. The dataset details [EDMB+18], [EDB+18] are provided in Tables 1 and 2.

3.2 Dataset representation

Data representation is considered to be the most important part of any machine learning task and needs to be chosen carefully depending on the nature of the dataset. The corpus received for the shared task contains text and special symbols, so the first step is to produce a meaningful representation of the data. In this work, TDM is used for the numerical representation of the data in all experiments, for both subtasks. The second step involves feature extraction and dimensionality reduction, carried out using Singular Value Decomposition (SVD) and Non-negative Matrix Factorization (NMF). For this, the TDM is passed to the feature extraction block, where the rank is taken as 30 in all cases; that is, the train and test data matrices have 30 columns after dimensionality reduction. This numeric representation of the data is then passed to the different machine learning algorithms for classification.

Figure 1 describes the steps involved in the proposed architecture, which consists of 5 blocks. Block 1 represents the raw dataset, i.e. the set of emails with and without headers. In block 2 the data is preprocessed by removing special characters and unnecessary details from the raw data. Block 3 represents the data representation of the emails. This is followed by the dimensionality reduction block (block 4), where the SVD and NMF techniques are applied to the output of block 3. The result is passed to block 5, where different classical machine learning algorithms are incorporated. Finally, the emails are classified as either legitimate or phishing.

The mathematical formulation of the task is as follows: given a set of emails $D = [e_1, e_2, \ldots, e_n]$ and their classes $C = [c_1, c_2, \ldots, c_n]$, where each class value is either 0 or 1, the machine learning models used in this work learn from the training data and label accordingly. After the learning process, the model is used to predict the classes of unseen test data.

3.2.1 Data representation of samples with headers:

1. TDM representation of the data is done and the vocabulary is built using train and test data.
2. SVD or NMF is used for feature extraction and dimensionality reduction.
3. Step 2 is followed by applying different classical ML techniques like Decision Tree, Random Forest, AdaBoost, KNN and SVM.

3.2.2 Data representation of samples with no headers:

SVD or NMF is applied for feature extraction and dimensionality reduction.
Step 3 is followed by applying different classical ML techniques like Decision Tree, Random Forest, AdaBoost, KNN and SVM on the numeric representation of the data.

3.3 Methodology

The paper discusses classical machine learning approaches like Decision Tree, K-Nearest Neighbors, Logistic Regression, Naive Bayes, Random Forest and SVM. The metrics used for analyzing the performance of the model are as follows:

1. Accuracy
2. Precision
3. Recall
4. F1-Score

For the numeric representation of the data, TDM is used. The TDM is passed to SVD and NMF to extract the best features.
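As an illustration of this flow from TDM to classifier, a minimal sketch follows. It assumes scikit-learn; the paper names sklearn only for its train-test split, so `CountVectorizer`, `TruncatedSVD`, the toy corpus, the rank of 2 and the convention 1 = phishing are assumptions made here rather than details from the paper. `CountVectorizer` produces a matrix with one row per email and one column per term, which matches the paper's description of reduced matrices with one row per data point.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (not the shared-task data); 1 = phishing, 0 = legitimate.
emails = [
    "verify your account password now",
    "urgent update your bank account details",
    "click the link to confirm your password",
    "your account is suspended verify now",
    "confirm your bank details to avoid suspension",
    "urgent click here to update password",
    "meeting agenda for monday attached",
    "lunch plans for friday with the team",
    "project report draft for your review",
    "notes from the monday team meeting",
    "friday deadline for the project report",
    "please review the attached agenda",
]
labels = [1] * 6 + [0] * 6

# Hold out 33% of the labelled data for validation, as in the paper's Results.
X_train, X_val, y_train, y_val = train_test_split(
    emails, labels, test_size=0.33, stratify=labels, random_state=0
)

# Term counts -> rank-k SVD features -> classifier.
# The paper uses rank 30; 2 is used here only because the toy corpus is tiny.
model = make_pipeline(
    CountVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    RandomForestClassifier(random_state=0),
)
model.fit(X_train, y_train)
predictions = model.predict(X_val)
print(list(predictions), y_val)
```

In this sketch the vocabulary is fitted on the training split only, whereas the paper reports building the vocabulary from train and test data for the emails with headers. Swapping `TruncatedSVD` for `sklearn.decomposition.NMF` would give the NMF variant of the same pipeline.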

SVD decomposes a matrix into the product of three matrices, which can be interpreted geometrically as a rotation, a stretching and another rotation. The mathematical representation of SVD is $A = U \Sigma V^T$, where the columns of $U$ are the orthonormal eigenvectors of $AA^T$ and the rows of $V^T$ are the orthonormal eigenvectors of $A^TA$. $\Sigma$ is a diagonal matrix that holds the singular values. For extracting features, the product $U\Sigma$ is sufficient. In all cases the rank is chosen as 30, so the resulting train and test datasets are reduced to size (total number of data points) × 30.
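The feature extraction step can be sketched with NumPy as below. The random count matrix standing in for a TDM and the toy rank `k = 3` are assumptions for illustration (the paper uses rank 30 on the real data). The sketch keeps the first `k` columns of $U$ scaled by the first `k` singular values, and checks that this equals projecting $A$ onto the top `k` right singular vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a TDM: 6 documents (rows) x 8 terms (columns).
A = rng.integers(0, 4, size=(6, 8)).astype(float)

k = 3  # toy rank; the paper uses 30
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k features: first k columns of U, each scaled by its singular value.
features = U[:, :k] * s[:k]          # shape (6, k)

# Equivalent view: project the rows of A onto the top-k right singular vectors.
projected = A @ Vt[:k].T             # shape (6, k)

print(features.shape, np.allclose(features, projected))
```

The second form suggests one possible way to map unseen emails into the same 30-dimensional space, by multiplying their term counts with the same right singular vectors. The paper does not state how the test matrix was reduced, so this is only one plausible reading.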

The second technique used for feature extraction is NMF. It factorizes a matrix into the product of two matrices, $W$ and $H$, neither of which contains negative elements. The TDM is passed as the input to NMF, which generates a list of topics. These topics act as a basis for representing the original dataset.
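A minimal sketch of this factorization, assuming scikit-learn's `NMF` class (the paper does not say which implementation was used), is shown below. The random count matrix, the toy rank of 3 and the `init` and `max_iter` settings are illustrative choices, not values from the paper.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Hypothetical stand-in for a TDM: 6 documents (rows) x 8 terms (columns).
A = rng.integers(0, 4, size=(6, 8)).astype(float)

k = 3  # toy rank; the paper uses 30
nmf = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)

W = nmf.fit_transform(A)   # (6, k): weight of each topic in each document
H = nmf.components_        # (k, 8): weight of each term in each topic

print(W.shape, H.shape)
```

Here the rows of `H` play the role of the topics, and each row of `W` gives the non-negative coordinates of one email in that topic basis, so `W` is what would be passed to the classifiers.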

4 Results

The datasets provided are highly imbalanced, yet the models still give considerably high classification accuracy. The following tables list the performance of each classical machine learning technique applied to the formulated binary classification problem of detecting whether an email is phishing or legitimate. The results in Tables 3, 4 and 5 are obtained on the training data using the sklearn train-test split, where 33% of the training data is used for validating the result and the rest for training the model. From these results, Random Forest outperformed all other techniques on the training data set. Test data results are provided in Tables 6 and 7. Table 6 describes the results for classification using TDM with SVD for both subtasks, and Table 7 presents the results for classification using TDM with NMF for both subtasks. The shared task organizers provided the true positive (TP), true negative (TN), false positive (FP) and false negative (FN) values for the test dataset, which are listed in Tables 6 and 7 along with accuracy, precision, recall and F1-score. These are estimated from the TP, TN, FP and FN values using the following equations:

$$\text{accuracy} = \frac{tp + tn}{tp + fp + tn + fn} \tag{1}$$

$$\text{precision} = \frac{tp}{tp + fp} \tag{2}$$

$$\text{recall} = \frac{tp}{tp + fn} \tag{3}$$

$$\text{f1-score} = \frac{2\,tp}{2\,tp + fp + fn} \tag{4}$$
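Equations (1) to (4) translate directly into code. The sketch below is plain Python; the function name and the example counts are hypothetical and are not values from the paper's tables.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the four metrics of Equations (1)-(4) from confusion counts.

    Assumes every denominator is non-zero (at least one predicted positive
    and at least one actual positive).
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),   # Eq. (1)
        "precision": tp / (tp + fp),                    # Eq. (2)
        "recall": tp / (tp + fn),                       # Eq. (3)
        "f1": (2 * tp) / (2 * tp + fp + fn),            # Eq. (4)
    }


# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=8, tn=9, fp=2, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the datasets are imbalanced, accuracy alone can look high even when one class is poorly detected, which is one reason to read precision, recall and F1-score alongside it.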

5 Conclusion

The paper focuses on phishing email detection, which addresses a major threat in the present scenario. For both subtasks, the numeric representation of the data is obtained using two methodologies, TDM with SVD and TDM with NMF. Classical machine learning techniques are then applied to these representations in order to classify an email as phishing or legitimate. One of the drawbacks of the current model is that the proposed mechanism relies on feature selection, which requires domain knowledge. To overcome this issue, deep learning models can be incorporated, which can learn more complex patterns from the raw data and use them as features with greater efficacy; this can be considered as possible future work. In addition, both subtasks belong to the unconstrained category, which allows external datasets to be used for training. The datasets provided in the subtasks are highly imbalanced, yet we are able to achieve a considerably high phishing email detection rate in both subtasks. Although the tasks are unconstrained, we have not used datasets from any external sources. Thus, the phishing email detection rate of the proposed architecture can be enhanced by combining data from external sources with the data provided in the shared task. This will be considered as one of the significant directions for future work.

Acknowledgements

This research was supported in part by Paramount Computer Systems. We are grateful to NVIDIA India for the GPU hardware support to the research grant. We are grateful to Computational Engineering and Networking (CEN) department for encouraging the research.

References

[ANNWN07] Saeed Abu-Nimeh, Dario Nappa, Xinlei Wang, and Suku Nair. A comparison of machine learning techniques for phishing detection. In Proceedings of the anti-phishing working groups 2nd annual eCrime researchers summit, pages 60-69. ACM, 2007.

[BVP] Barathi Ganesh Hullathy Balakrishnan, Anand Kumar Madasamy Vinayakumar, and Soman Kotti Padannayil. Nlp cen amrita@ smm4h: Health care text classification through class embeddings.

[CM01] Xavier Carreras and Lluis Marquez. Boosting trees for anti-spam email filtering. arXiv preprint cs/0109015, 2001.

[CXMX05] Zhan Chuan, Lu Xianliang, Hou Mengshu, and Zhou Xu. A lvq-based neural network anti-spam email approach. ACM SIGOPS Operating Systems Review, 39(1):34-39, 2005.

[DWV99a] Harris Drucker, Donghui Wu, and Vladimir N Vapnik. Support vector machines for spam categorization. IEEE Transactions on Neural networks, 10(5):1048-1054, 1999.

[DWV99b] Harris Drucker, Donghui Wu, and Vladimir N Vapnik. Support vector machines for spam categorization. IEEE Transactions on Neural networks, 10(5):1048-1054, 1999.

[EDB+18] Ayman Elaassal, Avisha Das, Shahryar Baki, Luis De Moraes, and Rakesh Verma. Iwspa-ap: Anti-phishing shared task at acm international workshop on security and privacy analytics. In Proceedings of the 1st IWSPA Anti-Phishing Shared Task. CEUR, 2018.

[EDMB+18] Ayman Elaassal, Luis De Moraes, Shahryar Baki, Rakesh Verma, and Avisha Das. Iwspa-ap shared task email dataset, 2018.

[Faw03] Tom Fawcett. In vivo spam filtering: a challenge problem for kdd. ACM SIGKDD Explorations Newsletter, 5(2):140-148, 2003.

[Gee03] Kevin R Gee. Using latent semantic indexing to filter spam. In Proceedings of the 2003 ACM symposium on Applied computing, pages 460-464. ACM, 2003.

[KMAH04] Svetlana Kiritchenko, Stan Matwin, and Suhayya Abu-Hakima. Email classification with temporal features. In Intelligent Information Processing and Web Mining, pages 523-533. Springer, 2004.

[MW04] Tony A Meyer and Brendon Whateley. Spambayes: Effective open-source, bayesian based, email classification system. In CEAS. Citeseer, 2004.

[NVK+15] Maryam M Najafabadi, Flavio Villanustre, Taghi M Khoshgoftaar, Naeem Seliya, Randall Wald, and Edin Muharemagic. Deep learning applications and challenges in big data analytics. Journal of Big Data, 2(1):1, 2015.