=Paper=
{{Paper
|id=Vol-2124/paper_17
|storemode=property
|title=Detecting Phishing E-mail Using Machine Learning Techniques CEN-SecureNLP
|pdfUrl=https://ceur-ws.org/Vol-2124/paper_17.pdf
|volume=Vol-2124
|authors=Nidhin A Unnithan,Harikrishnan NB,Vinayakumar R,Soman KP
}}
==Detecting Phishing E-mail Using Machine Learning Techniques CEN-SecureNLP==
Detecting Phishing E-mail Using Machine Learning Techniques: CEN-SecureNLP

Nidhin A Unnithan, Harikrishnan NB, Vinayakumar R, Soman KP
Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, India
nidhinkittu5470@gmail.com

Sai Sundarakrishna
Caterpillar, Bangalore, India
sai.sundarakrishna@gmail.com

Abstract

The number of unsolicited, or phishing, emails is increasing tremendously day by day, which suggests the need for a reliable framework to filter them out. In the proposed work, we develop a supervised classifier for distinguishing phishing emails from legitimate ones. Term frequency-inverse document frequency (tf-idf) matrices and Doc2Vec representations are formed for legitimate and phishing emails and passed to various traditional machine learning classifiers. The classifiers performed better with the Doc2Vec representation than with tf-idf, so we conclude that Doc2Vec is the more appropriate representation for detecting and classifying phishing and legitimate emails.

Copyright (c) by the paper's authors. Copying permitted for private and academic purposes. In: R. Verma, A. Das (eds.): Proceedings of the 1st AntiPhishing Shared Pilot at 4th ACM International Workshop on Security and Privacy Analytics (IWSPA 2018), Tempe, Arizona, USA, 21-03-2018, published at http://ceur-ws.org

1 Introduction

Electronic mail (email) is one of the most effective and convenient means of transferring messages. It is considered a safe and inexpensive way to send messages over networks, and even though many other modes of message transfer exist, email remains popular in business, colleges, and other private and government sectors. Email communication plays an important part in everyday life, and its usage keeps growing: nearly 4.8 billion people used email in 2017, an increase over 2016, and projections indicate that the number will rise to 5.6 billion users by 2021 [RH11]. The main problem with email, however, is phishing mail, which delivers malware and is used in fraud schemes, advertisements, and similar abuse. Email phishing has increased in recent years, and the resulting security threats cause serious damage to businesses, individuals, and economies.

Phishing mails are a type of spam mail that is hazardous to users: a phishing mail can steal our data without our knowledge once it is opened, so identifying phishing mails among spam is very important. One way to protect data from phishing is to add a secondary password to login credentials; another is to alert the user when a phishing mail tries to steal data. Especially for business emails, extracting and analyzing communication networks can reveal interesting and complex patterns of processes and decision making within a company, which makes precisely detecting fraud/phishing emails in communication networks essential.

During the infant stages of email communication, clear rules were followed [SHP08], but recently, owing to the diversity of email programs and formatting standards, users have the freedom to edit and change quoted text. Despite these limitations, Symantec Brightmail [SHP08] still shows good performance in detecting phishing emails.
Moreover, it has the capability to keep track of the IP (Internet Protocol) addresses that sent the phishing mail, and its performance is comparable to [MW04]. Email services such as Microsoft Outlook and Mozilla Thunderbird, and online services such as Gmail, usually group emails into conversations and attempt to hide quoted parts in order to improve readability.

In 2011, 2.3 billion users were using email, a number that increased to about 4.3 billion by 2016 [RH11]. TREC has defined phishing as unwanted email sent indiscriminately [C+08], and such emails have long been used for marketing and advertising purposes [CL98]. Datasets such as the Enron [KY04] and Avocado [OWKG15] corpora provide real-world information about business communication and contain a mix of professional emails, personal emails, and phishing; [PS05] published parts of a personal email archive for research. A recent survey shows the diversity of email classification tasks [MSR+17], and an interesting analysis of communication networks based on metadata such as sender, recipients, and time extracted from emails is discussed in [BCGJ11]. Models based on the written contents of emails may be confused by automatically inserted text blocks or quoted messages, so working with real-world data requires normalization prior to solving the problem at hand.

Rauscher et al. [RMA15] developed an approach to detect zones inside work-related emails where relevant business knowledge may be found. By finding overlapping text passages across the corpus, Jamison et al. managed to resolve email threads of the Enron corpus almost perfectly [JG13]; it has to be noted, however, that the claimed accuracy of almost 100% was tested on only 20 email threads. To reassemble email threads, Yeh et al. considered a similar approach with a more elaborate evaluation, reaching an accuracy of 98% in separating email conversations into parts [YWD05]. To do so, they rely on additional metadata in emails sent through Microsoft Outlook (the thread index) and on rules that match specific client headers, so such an approach neither works on arbitrary emails nor handles different localizations or edits by the user. Among the different ways to detect phishing, [DAY+15] gives an overall evaluation of the classifiers used for phishing detection. Recently, deep learning methods have also been used extensively for detecting phishing mails as stated in [BMS08] and for detecting malicious URLs and domains as stated in [VSP18b, VSP18a]; Domain Generation Algorithms used by malicious families have likewise been classified with deep learning methods [VSPSK18].

In this task we propose a machine-learning-based approach that extracts the underlying structure of email text to overcome the problems of error-prone rule-based approaches. This enables downstream tasks to work with much cleaner data and additional information by focusing on particular parts. We also show performance improvements and flexibility over previous work on similar tasks.

Table 1: Training dataset details

Category         Legitimate   Phishing   Total
With header      4082         501        4583
With no header   5088         612        5700

Table 2: Testing dataset details

Category         Email samples
With header      4195
With no header   4300

2 Background

This section discusses the mathematical details of various traditional machine learning algorithms and of vector space modeling techniques such as tf-idf and Doc2Vec.

2.1 Term frequency-inverse document frequency (tf-idf)

Term frequency-inverse document frequency (tf-idf) is widely used in information retrieval: it reflects how important a word is to a document in a corpus. It is also used as a weighting factor in text mining and user modeling, giving less importance to words that occur in many documents of the corpus, and it can serve to remove stop words from a corpus. Tf-idf also plays a major role in modern search engines. It is calculated by the following equations [YWD05]:

    tf(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}            (1)

    idf(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}             (2)

    tfidf(t, d, D) = tf(t, d) \cdot idf(t, D)                      (3)

where f_{t,d} is the number of occurrences of term t in document d, and N is the total number of documents in the corpus.
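Equations (1) to (3) are easy to check directly. The following is a minimal pure-Python sketch on a hypothetical two-document toy corpus; it is an illustration of the formulas, not the paper's actual feature-extraction code, which builds a full tf-idf matrix over the email corpus.

```python
import math

def tf(term, doc):
    # Eq. (1): raw count of the term divided by the total number of terms in the document
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Eq. (2): log of (number of documents / number of documents containing the term)
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    # Eq. (3): product of term frequency and inverse document frequency
    return tf(term, doc) * idf(term, corpus)

corpus = [
    ["click", "here", "to", "verify", "your", "account"],   # phishing-like toy email
    ["meeting", "agenda", "for", "your", "review"],         # legitimate-like toy email
]

# "your" appears in every document, so its idf (and hence its tf-idf weight) is zero
print(tf_idf("your", corpus[0], corpus))    # 0.0
# "verify" appears in only one document, so it receives a positive weight
print(tf_idf("verify", corpus[0], corpus))
```

Note how a term shared by all documents is weighted to zero, which is exactly the stop-word-suppressing behaviour described above.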
Table 3: 10-fold cross-validation accuracy on training data without header

Task        Representation   Algorithm                Accuracy
No Header   Doc2Vec          Decision Tree            81.2
No Header   Doc2Vec          Naive Bayes              79.5
No Header   Doc2Vec          AdaBoost                 83.4
No Header   Doc2Vec          Logistic Regression      80.1
No Header   Doc2Vec          K-nearest neighbour      76.8
No Header   Doc2Vec          Support vector machine   88.4
No Header   Doc2Vec          Random Forest            87.4
No Header   tf-idf           Decision Tree            74.2
No Header   tf-idf           Naive Bayes              71.4
No Header   tf-idf           AdaBoost                 75.6
No Header   tf-idf           Logistic Regression      70.2
No Header   tf-idf           K-nearest neighbour      63.2
No Header   tf-idf           Support vector machine   79.4
No Header   tf-idf           Random Forest            78.1

2.2 Doc2Vec

Doc2Vec is an unsupervised learning algorithm that produces a fixed-length vector representation of a variable-length text; the text can be a sentence, a paragraph, or a whole document. It extends Word2Vec, which, given vector representations of context words as input, predicts the word most likely to accompany them, thereby capturing the semantics of a sentence even though the word vectors are randomly initialized. In Doc2Vec a document vector is used alongside the word vectors to predict the next word for a context drawn from that document: every document is represented by a unique column of a document matrix, and words are represented by unique vectors in a word matrix. The next word in a context is predicted from the concatenation or averaging of the document and word vectors. The document vector is shared by all contexts generated from the same document but differs across documents, whereas the word matrix is shared across documents, i.e., a given word has the same vector representation regardless of the document in which it occurs.
2.3 Machine Learning

2.3.1 Decision Tree

A decision tree is a supervised learning algorithm that represents its output in graphical form. It maps each element of the input domain to an element of its range, which can be either discrete or continuous, and it is well suited to categorical target variables. Each split is chosen so that it reduces the variance of the target variable. The input is typically an object or scenario described by a set of properties, and the output is usually a decision such as YES or NO: at every internal node of the tree a test is conducted, and the leaves hold the numerical or categorical value that is the outcome after the tests.

2.3.2 Naive Bayes

Naive Bayes classification applies Bayes' theorem under a strong ("naive") independence assumption: within a class, the value of any one feature is assumed not to affect the others. A Naive Bayes model is easy to build and tends to perform well when the feature dimension is high, although the independence assumption does not by itself overcome problems related to dimensionality. It uses a conditional probability model: a sample to be classified is represented as a feature vector x = (x_1, x_2, ..., x_n), and the model yields probabilities P(C_k | x_1, ..., x_n) for each of the k possible outcomes. Mathematically,

    P(C_k \mid x) = \frac{P(C_k) \, P(x \mid C_k)}{P(x)}           (4)

2.3.3 AdaBoost

AdaBoost is an ensemble learning algorithm whose main purpose is to boost the performance of a base learning algorithm; it is mainly used for classification. It builds a strong classifier as a weighted sequence of many weak classifiers. When AdaBoost is combined with decision trees, it is often considered the best out-of-the-box classifier, and besides its speed in classification it has also been used as a feature learner.
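Bayes' rule in Eq. (4) can be illustrated with hypothetical counts; the numbers below are invented for the example and do not come from the paper's data.

```python
# Hypothetical counts: 40 of 100 training emails are phishing,
# the word "verify" occurs in 30 of the 40 phishing emails
# and in 5 of the 60 legitimate ones.
p_phish = 40 / 100              # P(C_k): prior probability of the phishing class
p_word_given_phish = 30 / 40    # P(x | C_k): likelihood of "verify" given phishing
p_word = (30 + 5) / 100         # P(x): overall probability of seeing "verify"

# Eq. (4): posterior P(phishing | "verify") = P(C_k) * P(x | C_k) / P(x)
posterior = p_phish * p_word_given_phish / p_word
print(round(posterior, 4))  # 0.8571
```

A single indicative word already pushes the posterior well above the 0.4 prior, which is why Naive Bayes works surprisingly well on word-count features despite its independence assumption.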
Table 4: 10-fold cross-validation accuracy on training data with header

Task          Representation   Algorithm                Accuracy
With Header   Doc2Vec          Decision Tree            73.1
With Header   Doc2Vec          Naive Bayes              70.1
With Header   Doc2Vec          AdaBoost                 77.4
With Header   Doc2Vec          Logistic Regression      72.2
With Header   Doc2Vec          K-nearest neighbour      69.1
With Header   Doc2Vec          Support vector machine   75.4
With Header   Doc2Vec          Random Forest            73.4
With Header   tf-idf           Decision Tree            68.2
With Header   tf-idf           Naive Bayes              64.2
With Header   tf-idf           AdaBoost                 69.4
With Header   tf-idf           Logistic Regression      66.7
With Header   tf-idf           K-nearest neighbour      62.2
With Header   tf-idf           Support vector machine   72.4
With Header   tf-idf           Random Forest            71.2

Table 5: Test data results for SVM combined with Doc2Vec

Task          TP     TN   FP    FN
No Header     3825   0    475   0
With Header   3593   7    489   106

2.3.4 Logistic Regression

Logistic regression is used when the target variable is categorical. It relies on maximum likelihood estimation (MLE) and is a qualitative choice model, used to predict whether a risk factor increases the odds of a given outcome by a specific factor. Logistic regression can be used to model binary classification problems. Its mathematical representation is

    F(x) = \frac{1}{1 + \exp(-w^T x)}                              (5)

where F takes values in the range 0 to 1.

2.3.5 k-nearest neighbour (KNN)

KNN is one of the simplest machine learning algorithms. It is known as lazy learning because it defers computation and furnishes only approximate values, and it can be misled by the local structure of the data. The procedure estimates the local posterior probability of each class as the average of the class memberships over the k nearest neighbours of a query point.
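The logistic function of Eq. (5) can be verified numerically; the weight and feature vectors below are hypothetical values chosen only for illustration.

```python
import math

def logistic(w, x):
    # Eq. (5): F(x) = 1 / (1 + exp(-w^T x)), squashing the linear score into (0, 1)
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-score))

w = [0.8, -1.2, 0.3]    # hypothetical learned weights
x = [1.0, 0.5, 2.0]     # hypothetical feature vector (w^T x = 0.8)

p = logistic(w, x)
print(0.0 < p < 1.0)            # True: the output is always a valid probability
print(logistic([0.0], [1.0]))   # 0.5: a zero score maps to the midpoint
```

Thresholding F(x) at 0.5 (equivalently, thresholding the linear score at zero) turns this probability into the binary legitimate/phishing decision.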
2.3.6 Support Vector Machine (SVM)

The Support Vector Machine (SVM) is a supervised learning algorithm, in its basic form a linear classifier. It creates a boundary between the classes: a hyperplane with the maximum possible margin separating the training points, which makes the algorithm robust to outliers. The training observations closest to the hyperplane are called support vectors, and they determine the position of the hyperplane. SVMs are among the most popular supervised machine learning techniques and can be applied to both regression and classification tasks. If the training data cannot be separated linearly, they are mapped into a high-dimensional space in which they are assumed to be linearly separable.

2.3.7 Random Forest

Random Forest is a supervised learning algorithm used for both classification and regression. A random forest classifier builds a large number of decision trees to obtain high accuracy, and its predictions tend to be far better than those of an individual decision tree. Random Forest uses bagging to create many minimally correlated decision trees; its advantages include the ability to handle missing values and to avoid overfitting.

Figure 1: Proposed architecture for email phishing detection
3 Experiments

3.1 Description of the data set

The anti-phishing shared task is part of the First Security and Privacy Analytics Anti-Phishing Shared Task (IWSPA-AP 2018) at the 4th ACM International Workshop on Security and Privacy Analytics [EDMB+18][EDB+18]. Let E = [e_1, e_2, ..., e_n] be a set of emails and C = [c_1, c_2, ..., c_n] be a set of email types, namely legitimate or phishing. The task is to classify each given email sample as either legitimate or phishing. The training and testing data sets are summarized in Table 1 and Table 2.

3.2 Proposed Architecture

In our proposed architecture we used both a count-based and a distributed word representation: tf-idf for the count-based method and Doc2Vec, via the gensim library, for the distributed method. Once the representations were created, we applied different machine learning techniques to classify the data as legitimate or phishing: Naive Bayes, Logistic Regression, Decision Tree, K-Nearest Neighbour, Random Forest, AdaBoost, and Support Vector Machine.

4 Results

Our models were trained with seven machine learning techniques on two representations for each of the two data sets (with and without header); all results are consolidated in Table 3 and Table 4. Of all the models, SVM combined with Doc2Vec gave the highest accuracy on both data sets, so only that model was submitted, even though seven techniques were trained. The submitted models were evaluated on the test data, and the True Positive, True Negative, False Positive, and False Negative counts are consolidated in Table 5.
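The raw counts in Table 5 can be condensed into familiar summary metrics. This is a small sketch assuming the usual definitions of accuracy, precision, and recall over the table's TP/TN/FP/FN columns, with the table's "positive" class taken as-is.

```python
def metrics(tp, tn, fp, fn):
    # Standard derived metrics from a binary confusion matrix
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Counts from Table 5 for SVM combined with Doc2Vec
no_header = metrics(tp=3825, tn=0, fp=475, fn=0)
with_header = metrics(tp=3593, tn=7, fp=489, fn=106)

print([round(m, 3) for m in no_header])     # [0.89, 0.89, 1.0]
print([round(m, 3) for m in with_header])   # [0.858, 0.88, 0.971]
```

Note that in the no-header setting TN = FN = 0, so recall is trivially 1.0 and accuracy coincides with precision; the counts alone do not show how the classifier treats the other class.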
5 Conclusion

The main objective of this work was to develop a supervised classifier that can distinguish phishing from legitimate emails. We used count-based and distributed representations and applied several machine learning techniques (Naive Bayes, Logistic Regression, Decision Tree, K-Nearest Neighbour, Random Forest, AdaBoost, and Support Vector Machine) for classification. The proposed methodology relies on feature engineering; applying deep learning to phishing detection can be considered a future direction.

Acknowledgements

This research was supported in part by Paramount Computer Systems. We are grateful to NVIDIA India for the GPU hardware supporting the research grant, and to the Computational Engineering and Networking (CEN) department for encouraging the research.

References

[BCGJ11] Francesco Bonchi, Carlos Castillo, Aristides Gionis, and Alejandro Jaimes. Social network analysis and mining for business applications. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):22, 2011.

[BMS08] Ram Basnet, Srinivas Mukkamala, and Andrew H Sung. Detection of phishing attacks: A machine learning approach. In Soft Computing Applications in Industry, pages 373-383. Springer, 2008.

[C+08] Gordon V Cormack et al. Email spam filtering: A systematic review. Foundations and Trends in Information Retrieval, 1(4):335-455, 2008.

[JG13] Emily Jamison and Iryna Gurevych. Headerless, quoteless, but not hopeless? Using pairwise email classification to disentangle email threads. In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, pages 327-335, 2013.

[KY04] Bryan Klimt and Yiming Yang. The Enron corpus: A new dataset for email classification research. In European Conference on Machine Learning, pages 217-226. Springer, 2004.

[MSR+17] Ghulam Mujtaba, Liyana Shuib, Ram Gopal Raj, Nahdia Majeed, and Mohammed Ali Al-Garadi. Email classification research trends: Review and open issues. IEEE Access, 5:9044-9064, 2017.

[MW04] Tony A Meyer and Brendon Whateley. SpamBayes: Effective open-source, Bayesian based, email classification system. In CEAS. Citeseer, 2004.

[OWKG15] Douglas Oard, William Webber, David Kirsch, and Sergey Golitsynskiy. Avocado research email collection. Philadelphia: Linguistic Data Consortium, 2015.
[CL98] Lorrie Faith Cranor and Brian A LaMacchia. Spam! Communications of the ACM, 41(8):74-83, 1998.

[DAY+15] Ammar Yahya Daeef, R Badlishah Ahmad, Yasmin Yacob, Naimah Yaakob, and Mohd Nazri Bin Mohd Warip. Phishing email classifiers evaluation: Email body and header approach. Journal of Theoretical and Applied Information Technology, 80(2):354, 2015.

[EDB+18] Ayman Elaassal, Avisha Das, Shahryar Baki, Luis De Moraes, and Rakesh Verma. IWSPA-AP: Anti-phishing shared task at ACM International Workshop on Security and Privacy Analytics. In Proceedings of the 1st IWSPA Anti-Phishing Shared Task. CEUR, 2018.

[EDMB+18] Ayman Elaassal, Luis De Moraes, Shahryar Baki, Rakesh Verma, and Avisha Das. IWSPA-AP shared task email dataset, 2018.

[PS05] Adam Perer and Ben Shneiderman. Beyond threads: Identifying discussions in email archives. Technical report, Maryland Univ College Park Human Computer Interaction Lab, 2005.

[RH11] Sara Radicati and Quoc Hoang. Email statistics report, 2011-2015. Retrieved May, 25:2011, 2011.

[RMA15] François Rauscher, Nada Matta, and Hassan Atifi. Context aware knowledge zoning: Traceability and business emails. In IFIP International Workshop on Artificial Intelligence for Knowledge Management, pages 66-79. Springer, 2015.

[SHP08] Enrique Puertas Sanz, José María Gómez Hidalgo, and José Carlos Cortizo Pérez. Email spam filtering. Advances in Computers, 74:45-114, 2008.

[VSP18a] R Vinayakumar, KP Soman, and Prabaharan Poornachandran. Detecting malicious domain names using deep learning approaches at scale. Journal of Intelligent & Fuzzy Systems, 34(3):1355-1367, 2018.

[VSP18b] R Vinayakumar, KP Soman, and Prabaharan Poornachandran. Evaluating deep learning approaches to characterize and classify malicious URLs. Journal of Intelligent & Fuzzy Systems, 34(3):1333-1343, 2018.
[VSPSK18] R Vinayakumar, KP Soman, Prabaharan Poornachandran, and S Sachin Kumar. Evaluating deep learning approaches to characterize and classify the DGAs at scale. Journal of Intelligent & Fuzzy Systems, 34(3):1265-1276, 2018.

[YWD05] Chi-Yuan Yeh, Chih-Hung Wu, and Shine-Hwang Doong. Effective spam classification based on meta-heuristics. In Systems, Man and Cybernetics, 2005 IEEE International Conference on, volume 4, pages 3872-3877. IEEE, 2005.