Deep Learning Based Phishing E-mail Detection
CEN-Deepspam

Hiransha M, Nidhin A Unnithan, Vinayakumar R, Soman KP
Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore
Amrita Vishwa Vidyapeetham, India
hiransham5600@gmail.com, nidhinkittu5470@gmail.com

Abstract

Email has become an indispensable communication tool in daily life. In the finance sector especially, communication through email plays an important role in business. It is therefore very important to classify emails based on their behavior. Email phishing is one of the most dangerous Internet phenomena and causes various problems for businesses, mainly in the finance sector. Phishing emails steal valuable information without our permission; moreover, we are often not aware of the theft even after it has occurred. In this paper, we describe how to distinguish phishing emails from legitimate emails. The dataset had two types of email texts, one with header and the other without header. We used Keras Word Embedding and a Convolutional Neural Network to build our model.

Copyright © by the paper's authors. Copying permitted for private and academic purposes.
In: R. Verma, A. Das (eds.): Proceedings of the 1st AntiPhishing Shared Pilot at 4th ACM International Workshop on Security and Privacy Analytics (IWSPA 2018), Tempe, Arizona, USA, 21-03-2018, published at http://ceur-ws.org

1 Introduction

The internet has become an efficient and powerful tool in the present world. The uncontrolled growth of the internet and the abundant use of email have increased insecurity in email communication. Spam is a familiar term whenever email is discussed: spam is junk email that is of no use to the recipient. Among spam emails, however, there is another type called phishing email. Phishing email is dangerous to all internet users, especially multinational companies and the finance sector, and to anyone who uses even a single online account for any purpose.

Phishing can be defined as an act of stealing valuable information such as user IDs, passwords, and debit/credit card details for harmful purposes, in which the attackers disguise themselves as a genuine organization. Phishing relies on fooling users into sharing valuable details such as usernames, passwords, and card details. Phishing can also be defined as a type of cyber-attack that uses electronic communication channels such as SMS, email, and phone calls to convey socially manipulated messages to people, which in turn leads them to provide their credentials, credit card numbers, passwords, etc. for the attacker's benefit. Such activities persuade an ordinary website user to enter his/her details on a fraudulent website that acts as a hidden passage between the user and the attacker. Most phishing attacks rely on emails and websites that are designed to look exactly like those of a genuine organization in order to prompt users into disclosing their financial or personal information. The attacker can then use this sensitive information for his/her own benefit.

Many researchers have worked on the phishing problem and proposed a wide variety of solutions to resist phishing attacks. These solutions fall into two categories. The first category works by detecting phishing emails or messages and warning the user about the attack before the attacker can steal the user's private data. The second category works by securing the login procedure with a secondary login process that prevents the attacker from stealing the credentials.

Word embedding has been an active topic in language identification. Recently, a Convolutional Neural Network with Keras embedding was applied to e-mail phishing detection [EDB+18]. Following this, in this paper we use Keras Word Embedding and a CNN to identify phishing emails within a mix of legitimate and phishing emails. We aim to develop a classifier that can distinguish phishing emails from legitimate emails. Our model uses Keras Word Embedding and a Convolutional Neural Network followed by a pooling layer, a fully connected layer, a non-linear activation function (sigmoid), and one output neuron for classifying the given set of emails as legitimate or phishing.

2 Related Works

Sami Smadi et al. [SAZ+15] proposed a model for detecting phishing emails that relies on a preprocessing technique which extracts different parts of an email as features. The extracted features are fed into a J48 classification algorithm to perform classification. In [SZL+15], the authors considered meaningless tokens and new pages as the feature set and selected, from the initial feature set, the features with better predictability. They provide an evaluation method of O(1) complexity for each feature set to assess its predictive ability. Sukhjeel Kaui et al. [KK15] used a genetic algorithm for the detection of phishing webpages and preferred a filter function for categorizing pages. Lv Fang et al. [FBJ+15] proposed a solution to overcome the time lag in detecting phishing websites; they detect phishing websites by analyzing peculiarities in their WHOIS and URL information. In [VSP18b, VSP18a], deep learning methods were employed to detect malicious URLs and domains. Binay Kumar et al. [KKMK15] used HTML contents for detecting email phishing. Rachna Dhamija et al. concentrated instead on understanding which phishing activities work during an attack and why, using a large set of reported phishing activities. Fergus Toolan et al. [TC09] took a different approach: they used only five features for classification, with a C5.0 algorithm that has higher precision compared to other algorithms. Mayank Pandey et al. [PR12] used different types of classification methods such as Multilayer Perceptron (MLP), Decision Trees (DT), Support Vector Machine (SVM), Group Method of Data Handling (GMDH), Probabilistic Neural Net (PNN), Genetic Programming (GP) and Logistic Regression (LR). Lew May Form et al. [FCT+15] proposed a method which uses hybrid features for detecting phishing emails. The features are called hybrid because they combine URL based, behavior based and content based features. They achieved an overall accuracy of 97.25% with an error percentage of 2.75%.

Justin Zhan et al. [ZT11] used a weak estimator method based on anomaly detection, which detects a system whose behavior deviates from that of a normal system. In [Zen17], a machine learning model for detecting phishing emails was created using predictive analysis to detect the dissimilarity between phishing and legitimate emails through static analysis. Samuel Marchal et al. [MFSE14] developed an automatic phishing detection system that works in real time. This system uses URLs generated from queries to search engines such as Yahoo and Google for feature extraction; the extracted features are then used for classification with machine learning. In [FM15], host based and lexical features were used for classifying URLs. Clusters were created for the entire dataset, which were in turn used as a feature for the classification system. This system achieves an accuracy of 93-98% in detecting phishing emails. Hicham Tout et al. [TH09] took a different approach in which online systems must prove their authenticity before data is transferred between them.

3 Background

3.1 Keras Word Embedding

Word embedding provides dense representations of words that capture their relative meanings. It improves on the sparse representation used by bag-of-words models. In word embedding, the projection of a word into a continuous vector space is represented by a dense vector. Keras provides an embedding layer which can be used on text data. It requires the input data to be integer encoded, so that each word has a unique integer representation. The embedding layer is initialized with random weights, which are then updated by learning an embedding for each word in the training dataset. It is defined as the first hidden layer of a network. Three arguments have to be specified for this layer, namely input dimension, output dimension and input length.

3.2 Convolutional Neural Network

Convolutional Neural Networks consist of several layers of convolutions, each followed by a nonlinear activation function such as ReLU. Unlike a traditional neural network, which uses fully connected layers, a CNN computes its output by convolving over the input, which results in local connections. A large number of filters are applied in each layer and their outputs are combined to get the result. The values of the filters are learned by the CNN during the training phase. For NLP tasks the input to the CNN is a sentence or a document. Each word or character is represented as a row of a matrix, and that row is the vector corresponding to the word, known as its word embedding. The embedding dimension determines the number of columns of the matrix. The main difference between CNNs for images and for NLP lies in choosing the size of the filter. For images the filter slides over a local patch of the input, whereas in NLP it slides over entire rows, since an entire row represents a word. In other words, the width (number of columns) of the filter matrix is the same as that of the input matrix [ZZL15].

4 Experiments

All experiments were run on GPU enabled TensorFlow [ABC+16] in conjunction with the Keras [Cho15] framework. The model was trained using backpropagation. The emails were tokenized and converted to lower case. A dictionary was created which contains a unique id for every word, and unknown words were assigned to the default key 0. A unique vector is formed for each email, and it works together with the CNN layer to give a dense vector. We created a total of five models with Keras embedding and a CNN layer: three models for Task 1, trained for 100, 500 and 1000 epochs, and two models for Task 2, trained for 100 and 500 epochs.

4.1 Description of Data set

The data set consists of emails that are either legitimate or phishing [EDMB+18]. Two data sets were given: one without header for Task 1, i.e., only the body text, and one with header files for Task 2, i.e., including the from and to addresses. For training, a total of 5,700 mails were given for Task 1, of which 5,088 were legitimate and 612 were phishing. For Task 2, a total of 4,583 mails were given, of which 4,082 were legitimate and 501 were phishing. For testing, a total of 4,300 emails were given for Task 1 and 4,195 for Task 2.

4.2 Proposed Architecture

The architecture is composed of the following layers: Keras Embedding, CNN, and Classification. Keras embedding is a built-in layer of the Keras framework which generates vectors for words. A unique vector is formed for each unique word and is then passed to the CNN. The CNN combines the vectors formed by the embedding layer into a much denser vector, which is then passed through a pooling layer to reduce the dimensionality and finally given to a fully connected layer. A schematic diagram of the proposed architecture is shown in Figure 1. The model configuration details for both tasks are given in Table 1. The model has 413,105 parameters in total, all of which are trainable (0 non-trainable parameters).

Figure 1: Proposed Architecture

Table 1: Model Configuration Details

Layer (type)                      Output Shape          Param #
input_1 (InputLayer)              (None, 10000)         0
embedding_1 (Embedding)           (None, 10000, 100)    4400
conv1d_1 (Conv1D)                 (None, 9996, 128)     64128
max_pooling1d_1 (MaxPooling1D)    (None, 1999, 128)     0
conv1d_2 (Conv1D)                 (None, 1995, 128)     82048
max_pooling1d_2 (MaxPooling1D)    (None, 399, 128)      0
conv1d_3 (Conv1D)                 (None, 395, 128)      82048
max_pooling1d_3 (MaxPooling1D)    (None, 11, 128)       0
flatten_1 (Flatten)               (None, 1408)          0
dropout_1 (Dropout)               (None, 1408)          0
dense_1 (Dense)                   (None, 128)           180352
dense_2 (Dense)                   (None, 1)             129

5 Results

The model built using the above architecture was used to classify the data set. For sub task 1, in which the emails did not have header files, our model gave an accuracy of 96.8%. For sub task 2, in which header files were given, our model gave an accuracy of 94.2%. The accuracy was measured using 10 fold cross validation. The results are summarized in Table 2. Our model was tested on the test data by the IWSPA-AP Shared Task committee, and the resulting True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) counts are summarized in Table 3.

Table 2: Cross Validation Results

Method                  Task                      Accuracy
Word Embedding + CNN    Sub task 1, no header     0.968
Word Embedding + CNN    Sub task 2, with header   0.942

Table 3: Statistics of Test Result

Method             Task           TP      TN     FP     FN
CNN 100 epochs     No Header      3646    295    180    179
CNN 500 epochs     No Header      3666    288    187    159
CNN 1000 epochs    No Header      3688    287    188    137
CNN 100 epochs     With Header    3237    496    0      462
CNN 500 epochs     With Header    3618    496    0      81

6 Conclusion

Email phishing is a growing threat to the digital world, and curbing it has become a major goal for every digital platform. Here we proposed a model using Keras Word Embedding and a CNN to classify legitimate and phishing mails. Combining the two gives a dense vector representation for words, which is then used to classify the mails in the data set. Our model performed well on both tasks, with header and without header. Highly imbalanced data sets were given for both sub tasks, and the task itself was unconstrained, i.e., any data sets could be used during training. Even without using any external data sets, we were able to get a good detection rate for phishing emails in both sub tasks. We therefore expect that adding external data sources could considerably increase the detection rate of the proposed methodology.

6.0.1 Acknowledgements

This research was supported in part by Paramount Computer Systems. We are grateful to NVIDIA India for the GPU hardware support to the research grant. We are grateful to the Computational Engineering and Networking (CEN) department for encouraging the research.

References

[ABC+16] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

[Cho15] François Chollet. Keras. https://github.com/fchollet/keras, 2015.

[EDB+18] Ayman Elaassal, Avisha Das, Shahryar Baki, Luis De Moraes, and Rakesh Verma. Iwspa-ap: Anti-phising shared task at acm international workshop on security and privacy analytics. In Proceedings of the 1st IWSPA Anti-Phishing Shared Task. CEUR, 2018.

[EDMB+18] Ayman Elaassal, Luis De Moraes, Shahryar Baki, Rakesh Verma, and Avisha Das. Iwspa-ap shared task email dataset, 2018.

[FBJ+15] Lv Fang, Wang Bailing, Huang Junheng, Sun Yushan, and Wei Yuliang. A proactive discovery and filtering solution on phishing websites. In Big Data (Big Data), 2015 IEEE International Conference on, pages 2348–2355. IEEE, 2015.

[FCT+15] Lew May Form, Kang Leng Chiew, Wei King Tiong, et al. Phishing email detection technique by using hybrid features. In IT in Asia (CITA), 2015 9th International Conference on, pages 1–5. IEEE, 2015.

[FM15] Mohammed Nazim Feroz and Susan Mengel. Phishing url detection using url ranking. In Big Data (BigData Congress), 2015 IEEE International Congress on, pages 635–638. IEEE, 2015.

[KK15] Sukhjeel Kaui and Amrit Kaur. Detection of phishing webpages using weights computed through genetic algorithm. In MOOCs, Innovation and Technology in Education (MITE), 2015 IEEE 3rd International Conference on, pages 331–336. IEEE, 2015.

[KKMK15] Binay Kumar, Pankaj Kumar, Ankit Mundra, and Shikha Kabra. Dc scanner: Detecting phishing attack. In Image Information Processing (ICIIP), 2015 Third International Conference on, pages 271–276. IEEE, 2015.

[MFSE14] Samuel Marchal, Jérôme François, Radu State, and Thomas Engel. Phishstorm: Detecting phishing with streaming analytics. IEEE Transactions on Network and Service Management, 11(4):458–471, 2014.

[PR12] Mayank Pandey and Vadlamani Ravi. Detecting phishing e-mails using text and data mining. In Computational Intelligence & Computing Research (ICCIC), 2012 IEEE International Conference on, pages 1–6. IEEE, 2012.

[SAZ+15] Sami Smadi, Nauman Aslam, Li Zhang, Rafe Alasem, and MA Hossain. Detection of phishing emails using data mining algorithms. In Software, Knowledge, Information Management and Applications (SKIMA), 2015 9th International Conference on, pages 1–8. IEEE, 2015.

[SZL+15] Hongzhou Sha, Zhou Zhou, Qingyun Liu, Tingwen Liu, and Chao Zheng. Limited dictionary builder: An approach to select representative tokens for malicious urls detection. In Communications (ICC), 2015 IEEE International Conference on, pages 7077–7082. IEEE, 2015.

[TC09] Fergus Toolan and Joe Carthy. Phishing detection using classifier ensembles. In eCrime Researchers Summit, 2009. eCRIME'09., pages 1–9. IEEE, 2009.

[TH09] Hicham Tout and William Hafner. Phishpin: An identity-based anti-phishing approach. In Computational Science and Engineering, 2009. CSE'09. International Conference on, volume 3, pages 347–352. IEEE, 2009.

[VSP18a] R Vinayakumar, KP Soman, and Prabaharan Poornachandran. Detecting malicious domain names using deep learning approaches at scale. Journal of Intelligent & Fuzzy Systems, 34(3):1355–1367, 2018.

[VSP18b] R Vinayakumar, KP Soman, and Prabaharan Poornachandran. Evaluating deep learning approaches to characterize and classify malicious urls. Journal of Intelligent & Fuzzy Systems, 34(3):1333–1343, 2018.

[Zen17] Yuanyuan Grace Zeng. Identifying email threats using predictive analysis. In Cyber Security And Protection Of Digital Services (Cyber Security), 2017 International Conference on, pages 1–2. IEEE, 2017.

[ZT11] Justin Zhan and Lijo Thomas. Phishing detection using stochastic learning-based weak estimators. In Computational Intelligence in Cyber Security (CICS), 2011 IEEE Symposium on, pages 55–59. IEEE, 2011.

[ZZL15] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pages 649–657, 2015.
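
The output shapes and parameter counts reported in Table 1 can be cross-checked with simple layer arithmetic. The following is a minimal pure-Python sketch, not the authors' code. Several values are assumptions inferred from the table rather than stated in the paper: an embedding input dimension of 44 (from 4400 / 100), a convolution kernel size of 5 with stride 1 and no padding (from 10000 -> 9996), and pool sizes of 5, 5 and 35 (from the reported sequence lengths). The sequence length of 10,000 is taken from the embedding output shape in Table 1.

```python
# Pure-Python check of the layer arithmetic behind Table 1 (no Keras required).
# ASSUMPTIONS inferred from the table, not stated in the paper:
#   vocabulary size 44, kernel size 5, pool sizes (5, 5, 35),
#   'valid' padding and stride 1 for every convolution.

SEQ_LEN = 10000      # from the embedding output shape in Table 1
EMBED_DIM = 100      # from the embedding output shape in Table 1
VOCAB_SIZE = 44      # assumed: 4400 parameters / 100 dimensions
FILTERS = 128        # from the Conv1D output shapes in Table 1
KERNEL = 5           # assumed: 10000 - 5 + 1 = 9996
POOLS = (5, 5, 35)   # assumed from the reported lengths 1999, 399 and 11
DENSE_UNITS = 128    # from the dense_1 output shape in Table 1


def conv1d_out_len(length, kernel):
    """Output length of a 1D convolution with stride 1 and no padding."""
    return length - kernel + 1


def pool_out_len(length, pool):
    """Output length of non-overlapping max pooling (partial window dropped)."""
    return length // pool


def conv1d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel * in_channels * filters + filters


def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


def table1_summary():
    """Return ([(layer name, output shape without batch axis, params)], total)."""
    rows = [("embedding_1", (SEQ_LEN, EMBED_DIM), VOCAB_SIZE * EMBED_DIM)]
    length, channels = SEQ_LEN, EMBED_DIM
    for i, pool in enumerate(POOLS, start=1):
        params = conv1d_params(KERNEL, channels, FILTERS)
        length, channels = conv1d_out_len(length, KERNEL), FILTERS
        rows.append((f"conv1d_{i}", (length, channels), params))
        length = pool_out_len(length, pool)
        rows.append((f"max_pooling1d_{i}", (length, channels), 0))
    flat = length * channels
    rows.append(("flatten_1", (flat,), 0))
    rows.append(("dense_1", (DENSE_UNITS,), dense_params(flat, DENSE_UNITS)))
    rows.append(("dense_2", (1,), dense_params(DENSE_UNITS, 1)))
    total = sum(params for _, _, params in rows)
    return rows, total


if __name__ == "__main__":
    rows, total = table1_summary()
    for name, shape, params in rows:
        print(f"{name:18s} {str(shape):14s} {params}")
    print("total parameters:", total)
```

Under these assumptions the per-layer counts sum to the 413,105 parameters reported in Section 4.2. The dropout layer is omitted from the sketch because it changes neither the shape nor the parameter count.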
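
The preprocessing described in Section 4 (lower-casing, tokenization, a dictionary with a unique id per word, and id 0 for unknown words) can be illustrated with a short sketch. This is one possible reading of that description, not the authors' implementation. The regular-expression tokenizer, the helper names `tokenize`, `build_vocab` and `encode`, and the zero-padding to a fixed length are all assumptions made for illustration; the paper only states that the embedding layer needs a fixed input length.

```python
import re

UNKNOWN_ID = 0  # the paper assigns unknown words to the default key 0


def tokenize(text):
    """Lower-case the text and split it into word tokens.

    ASSUMPTION: a simple regex tokenizer; the paper does not name one.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocab(emails):
    """Map every distinct training word to a unique id, starting at 1."""
    vocab = {}
    for email in emails:
        for word in tokenize(email):
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode(email, vocab, max_len):
    """Integer-encode one email as a fixed-length list of ids.

    Words missing from the dictionary become UNKNOWN_ID. Truncating and
    right-padding with 0 to max_len is an ASSUMPTION of this sketch.
    """
    ids = [vocab.get(word, UNKNOWN_ID) for word in tokenize(email)][:max_len]
    return ids + [0] * (max_len - len(ids))


if __name__ == "__main__":
    training = ["Verify your account now", "Your meeting is at noon"]
    vocab = build_vocab(training)
    print(encode("Verify your PASSWORD now", vocab, max_len=6))
```

In this sketch padding and unknown words share id 0, which keeps the dictionary aligned with the paper's description; a real pipeline might reserve separate ids for the two cases.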
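
The TP, TN, FP and FN counts in Table 3 are enough to derive standard summary metrics for each submitted model. The sketch below shows how accuracy, precision and recall follow from those counts. The counts are copied from Table 3; the names `Counts` and `TABLE3` are hypothetical, and because the paper does not state which class is treated as positive, the precision and recall values are relative to whatever class the committee counted as positive.

```python
from typing import NamedTuple


class Counts(NamedTuple):
    tp: int
    tn: int
    fp: int
    fn: int


# Counts copied from Table 3, keyed by (task, training epochs).
TABLE3 = {
    ("No Header", 100): Counts(3646, 295, 180, 179),
    ("No Header", 500): Counts(3666, 288, 187, 159),
    ("No Header", 1000): Counts(3688, 287, 188, 137),
    ("With Header", 100): Counts(3237, 496, 0, 462),
    ("With Header", 500): Counts(3618, 496, 0, 81),
}


def accuracy(c):
    """Fraction of all test emails that were classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.tn + c.fp + c.fn)


def precision(c):
    """Fraction of predicted positives that are true positives."""
    return c.tp / (c.tp + c.fp)


def recall(c):
    """Fraction of actual positives that were detected."""
    return c.tp / (c.tp + c.fn)


if __name__ == "__main__":
    for (task, epochs), c in TABLE3.items():
        print(f"{task:12s} {epochs:5d} epochs  "
              f"acc={accuracy(c):.3f}  prec={precision(c):.3f}  rec={recall(c):.3f}")
```

For example, the 500-epoch model on the data with headers reaches (3618 + 496) / 4195, about 0.981 accuracy on the test set. Each row of counts also sums to the size of the corresponding test set, 4,300 emails without headers and 4,195 with headers.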