Deep Learning Based Phishing E-mail Detection
                                                   CEN-Deepspam


                     Hiransha M, Nidhin A Unnithan, Vinayakumar R, Soman KP
                     Center for Computational Engineering and Networking(CEN),
                               Amrita School of Engineering, Coimbatore
                                 Amrita Vishwa Vidyapeetham, India
                        hiransham5600@gmail.com, nidhinkittu5470@gmail.com


Abstract

Email has become an indispensable communication tool in daily life. In the finance sector especially, email plays an important role in business, so it is important to classify emails by their behavior. Email phishing is one of the most dangerous Internet phenomena and causes serious problems for businesses, mainly in the finance sector. Phishing emails steal valuable information without our permission; moreover, we may not be aware of the theft even after it has occurred. In this paper, we describe how to distinguish phishing emails from legitimate emails. The data set contained two types of email text, one with headers and one without. We used Keras Word Embedding and a Convolutional Neural Network to build our model.

Copyright © by the paper's authors. Copying permitted for private and academic purposes.
In: R. Verma, A. Das (eds.): Proceedings of the 1st AntiPhishing Shared Pilot at 4th ACM International Workshop on Security and Privacy Analytics (IWSPA 2018), Tempe, Arizona, USA, 21-03-2018, published at http://ceur-ws.org

1    Introduction

The internet has become an efficient and powerful tool in the present world. The uncontrolled growth of the internet and the abundant use of email have increased insecurity in email communication. Anyone who uses email is familiar with spam: junk email that is of no use. Among these spam emails, however, there is another type called phishing email. Phishing email is dangerous to all internet users, especially multinational companies and the finance sector, and to anyone who uses even a single account on any internet service.

Phishing can be defined as an act of stealing valuable information such as user IDs, passwords, and debit/credit card details for harmful purposes, in which the attackers disguise themselves as a genuine organization. Phishing relies on fooling users into sharing these details. It can also be defined as a type of cyber-attack that uses electronic communication channels such as SMS, email, and phone calls to convey socially engineered messages that lead people to give up their credentials, credit card numbers, passwords, etc. for the attacker's benefit. Such activities persuade an ordinary website user to enter his/her details into a fraudulent website that acts as a hidden passage between the user and the attacker. Most phishing attacks rely on emails and websites designed to look exactly like those of a genuine organization, prompting users to disclose their financial or personal information. The attacker can then use this sensitive information for his/her own benefit.

Many researchers have worked on the phishing problem and proposed a wide variety of solutions to resist phishing attacks. These solutions fall into two categories. The first category works by detecting phishing emails or messages and warning the user about the attack before the attacker can steal the user's private data. The second category works by securing the login procedure with a secondary login process that prevents the attacker from stealing the credentials.

Word embedding has been a hot topic for language
identification. Recently, a Convolutional Neural Network with Keras embedding has been applied to e-mail phishing detection [EDB+ 18]. Following this work, we use Keras Word Embedding and a CNN to develop a classifier that distinguishes phishing emails from legitimate emails. Our model consists of a Keras Word Embedding layer and a Convolutional Neural Network, followed by a pooling layer, a fully connected layer, a non-linear activation function (sigmoid), and one output neuron that classifies each email in the given set as legitimate or phishing.

2    Related Works

In [SAZ+ 15], Sami Smadi et al. proposed a model for detecting phishing emails that relies on a preprocessing technique which extracts different parts of an email as features. The extracted features are fed into a J48 classification algorithm. In [SZL+ 15], the authors considered meaningless tokens and new pages as the feature set and selected, from the initial feature set, the features with better predictive ability. They provide an O(1)-complexity evaluation method for each feature set to assess its predictive ability. In [KK15], Sukhjeel Kaui et al. used a genetic algorithm for the detection of phishing webpages and a filter function for categorizing pages. Lv Fang et al. in [FBJ+ 15] propose a solution to overcome the time lag in detecting phishing websites by analyzing peculiarities in their WHOIS and URL information. In [VSP18b, VSP18a], deep learning methods were employed to detect malicious URLs and domains. Binay Kumar et al. used HTML content for detecting email phishing in [KKMK15]. Rachna Dhamija et al. in [TC09] concentrated on understanding which phishing activities work during an attack and why, using a large data set of reported phishing activities. Fergus Toolan et al. took a different approach: they used only five features for classification, together with a C5.0 algorithm, which has higher precision than other algorithms.

Mayank Pandey et al. in [PR12] used different classification methods, such as Multilayer Perceptron (MLP), Decision Trees (DT), Support Vector Machine (SVM), Group Method of Data Handling (GMDH), Probabilistic Neural Net (PNN), Genetic Programming (GP), and Logistic Regression (LR). Lew May Form et al. in [FCT+ 15] proposed a method that uses hybrid features for detecting phishing emails; the features are called hybrid because they combine URL-based, behavior-based, and content-based features. They achieved an overall accuracy of 97.25 % with an error rate of 2.75 %. Justin Zhan et al. in [ZT11] used a weak-estimator method based on anomaly detection, which detects a system whose behavior deviates from that of the normal system. In [Zen17], a machine learning model for detecting phishing emails was created using predictive analysis to detect the dissimilarity between phishing and legitimate emails through static analysis. Samuel Marchal et al. in [MFSE14] developed an automatic phishing detection system that works in real time. The system uses URLs generated from the queries of search engines such as Yahoo and Google for feature extraction, and the extracted features are then used for classification with machine learning. In [FM15], host-based and lexical features were used for classifying URLs. The authors created clusters over the entire data set, which were in turn used as a feature for the classification system. This system achieves an accuracy of 93-98 % in detecting phishing emails. Hicham Tout in [TH09] took a different approach, in which online systems must prove their authenticity before data is transferred between them.

3    Background

3.1   Keras Word Embedding

Word embedding provides dense representations of words that capture their relative meanings. It improves on the sparse representation used by bag-of-words models: each word is projected into a continuous vector space and represented by a dense vector. Keras provides an embedding layer that can be used on text data. It requires the input data to be integer encoded, so that each word is represented by a unique integer. The embedding layer is initialized with random weights, which are then updated during training so that an embedding is learned for each word in the training data set. It is defined as the first hidden layer of a network. Three arguments must be specified for this layer: the input dimension, the output dimension, and the input length.

3.2   Convolutional Neural Network

A Convolutional Neural Network consists of several layers of convolutions, each followed by a nonlinear activation function such as ReLU. Unlike a traditional neural network, which has fully connected layers, a CNN computes its output by convolving over the input, which results in local connections. A large number of filters are applied in each layer, and their outputs are combined to obtain the result. The filter values are learned by the CNN during the training phase. For NLP tasks, the input to the CNN is sentences or documents. A word or a character is
represented as a row of the matrix, which holds the vector corresponding to that word, known as its word embedding. The embedding dimension determines the number of columns of the matrix. The main difference between CNNs for images and CNNs for NLP lies in the choice of filter size. In images, the filter slides over a local patch of the input, whereas in NLP it slides over entire rows, since an entire row represents a word. In other words, the filter matrix has the same number of columns as the input matrix [ZZL15].

4    Experiments

All experiments were run on GPU-enabled TensorFlow [ABC+ 16] in conjunction with the Keras [Cho15] framework. The model was trained using backpropagation. The emails were tokenized and converted to lower case. A dictionary was created that contains a unique id for every word; unknown words were assigned to the default key 0. A unique vector is formed for each email, and it works together with the CNN layer to give a dense vector. We created a total of five models with Keras embedding and CNN layers: three models for Task 1, trained for 100, 500, and 1000 epochs, and two models for Task 2, trained for 100 and 500 epochs.

4.1   Description of Data set

The data set consists of legitimate and phishing emails [EDMB+ 18]. Two data sets were given: one with header files for Task 1, i.e., including the from and to addresses, and one without headers for Task 2, i.e., only the body. For training, a total of 4,583 emails were given for Task 1, of which 4,082 were legitimate and 501 were phishing. For Task 2, a total of 5,700 emails were given, of which 5,088 were legitimate and 612 were phishing. For testing, a total of 4,195 emails were given for Task 1 and 4,300 for Task 2.

4.2   Proposed Architecture

The architecture is composed of the following layers: Keras embedding, CNN, and classification. Keras embedding is a built-in layer of the Keras framework that generates vectors for words. A unique vector is formed for each unique word and is then passed to the CNN. The CNN combines the vectors produced by the embedding layer into a much denser vector, which is passed through a pooling layer to reduce its dimensionality and is then given to a fully connected layer. A schematic diagram of the proposed architecture is shown in Figure 1. The model configuration details for both tasks are given in Table 1. The model has 413,105 parameters in total, all of which are trainable (0 non-trainable parameters).

Figure 1: Proposed Architecture

5    Results

The model built using the above architecture was used to classify the data set. For sub task 1, in which the emails did not have header files, our model gave an accuracy of 96.8 %. For sub task 2, in which header files were given, our model gave an accuracy of 94.2 %. The accuracy was measured using 10-fold cross validation. The results are summarized in Table 2. Our model was tested on the test data by the IWSPA-AP Shared Task committee, and the resulting True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) counts are summarized in Table 3.
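The TP, TN, FP, and FN counts reported in Table 3 can be turned into the usual summary metrics. The following is a minimal sketch and not part of the original evaluation: the class name `ConfusionCounts` and the dictionary `TABLE_3` are illustrative, the counts are copied from Table 3, and the metrics use the standard definitions without assuming which class (legitimate or phishing) was treated as positive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Raw test-set counts for one model run (illustrative helper)."""
    tp: int
    tn: int
    fp: int
    fn: int

    @property
    def total(self) -> int:
        return self.tp + self.tn + self.fp + self.fn

    def accuracy(self) -> float:
        return (self.tp + self.tn) / self.total

    def precision(self) -> float:
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0

    def recall(self) -> float:
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    def f1(self) -> float:
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if (p + r) else 0.0


# Counts copied from Table 3, keyed by (task, number of epochs).
TABLE_3 = {
    ("No Header", 100): ConfusionCounts(tp=3646, tn=295, fp=180, fn=179),
    ("No Header", 500): ConfusionCounts(tp=3666, tn=288, fp=187, fn=159),
    ("No Header", 1000): ConfusionCounts(tp=3688, tn=287, fp=188, fn=137),
    ("With Header", 100): ConfusionCounts(tp=3237, tn=496, fp=0, fn=462),
    ("With Header", 500): ConfusionCounts(tp=3618, tn=496, fp=0, fn=81),
}

if __name__ == "__main__":
    for (task, epochs), c in TABLE_3.items():
        print(f"{task:<12} {epochs:>5} epochs  n={c.total}  "
              f"acc={c.accuracy():.4f}  prec={c.precision():.4f}  "
              f"rec={c.recall():.4f}  f1={c.f1():.4f}")
```

Each row's counts sum to the size of the corresponding test set (4,300 without headers and 4,195 with headers), which gives a quick consistency check on the table.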
Table 1: Model Configuration Details
Layer (type)                       Output Shape           Param #
input_1 (InputLayer)               (None, 1000)           0
embedding_1 (Embedding)            (None, 10000, 100)     4400
conv1d_1 (Conv1D)                  (None, 9996, 128)      64128
max_pooling1d_1 (MaxPooling1D)     (None, 1999, 128)      0
conv1d_2 (Conv1D)                  (None, 1995, 128)      82048
max_pooling1d_2 (MaxPooling1D)     (None, 399, 128)       0
conv1d_3 (Conv1D)                  (None, 395, 128)       82048
max_pooling1d_3 (MaxPooling1D)     (None, 11, 128)        0
flatten_1 (Flatten)                (None, 1408)           0
dropout_1 (Dropout)                (None, 1408)           0
dense_1 (Dense)                    (None, 128)            180352
dense_2 (Dense)                    (None, 1)              129
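The output shapes and parameter counts in Table 1 follow from standard layer arithmetic. The sketch below reproduces them in plain Python under several assumptions that the paper does not state: a kernel size of 5 for every convolution, pool sizes of 5, 5, and 35, valid padding with stride 1, and an embedding input dimension of 44. These values were chosen only because they are consistent with the numbers in the table; the sequence length of 10000 is taken from the embedding output shape. The dropout layer is omitted because it has no parameters and does not change the shape.

```python
from typing import List, Tuple

# Assumed hyperparameters (inferred from Table 1, not stated in the paper).
SEQ_LEN = 10000      # sequence length, from the embedding output shape
VOCAB_SIZE = 44      # assumed embedding input dimension (4400 / 100)
EMBED_DIM = 100      # embedding output dimension
FILTERS = 128        # filters per convolution layer
KERNEL = 5           # assumed kernel size
POOLS = (5, 5, 35)   # assumed pool sizes
DENSE_UNITS = 128


def conv1d(length: int, in_channels: int, filters: int, kernel: int) -> Tuple[int, int]:
    """Valid-padding, stride-1 Conv1D: return (output length, parameter count)."""
    out_len = length - kernel + 1
    params = (kernel * in_channels + 1) * filters  # weights plus one bias per filter
    return out_len, params


def max_pool1d(length: int, pool: int) -> int:
    """Non-overlapping max pooling with valid padding."""
    return length // pool


def dense_params(inputs: int, units: int) -> int:
    return (inputs + 1) * units


def summarize() -> List[Tuple[str, Tuple[int, ...], int]]:
    """Return (layer, output shape without batch axis, parameter count) rows."""
    rows: List[Tuple[str, Tuple[int, ...], int]] = []
    length, channels = SEQ_LEN, EMBED_DIM
    rows.append(("embedding", (length, channels), VOCAB_SIZE * EMBED_DIM))
    for pool in POOLS:
        length, params = conv1d(length, channels, FILTERS, KERNEL)
        channels = FILTERS
        rows.append(("conv1d", (length, channels), params))
        length = max_pool1d(length, pool)
        rows.append(("max_pooling1d", (length, channels), 0))
    flat = length * channels
    rows.append(("flatten", (flat,), 0))
    rows.append(("dense", (DENSE_UNITS,), dense_params(flat, DENSE_UNITS)))
    rows.append(("dense_out", (1,), dense_params(DENSE_UNITS, 1)))
    return rows


if __name__ == "__main__":
    table = summarize()
    for name, shape, params in table:
        print(f"{name:<15} {str(shape):<14} {params}")
    print("total parameters:", sum(p for _, _, p in table))
```

Under these assumptions the per-layer counts sum to the 413,105 parameters reported for the model.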


                                     Table 2: Cross Validation Results
                                Method                    Task                    Accuracy
                        Word Embedding + CNN            Sub task1 no header         0.968
                        Word Embedding + CNN            Sub task2 with header       0.942
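The 10-fold cross validation behind Table 2 can be sketched as follows. This is an illustrative outline only: the paper does not describe how the folds were drawn, so the shuffled, unstratified split, the helper names `k_fold_indices` and `cross_val_accuracy`, and the majority-class baseline standing in for the CNN are all assumptions.

```python
import random
from typing import Callable, Iterator, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0
                   ) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of k shuffled folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def cross_val_accuracy(labels: Sequence[int],
                       fit_predict: Callable[[List[int], List[int]], List[int]],
                       k: int = 10, seed: int = 0) -> float:
    """Average validation accuracy over k folds.

    fit_predict(train_idx, val_idx) must train on train_idx and return one
    predicted label per entry of val_idx.
    """
    scores = []
    for train, val in k_fold_indices(len(labels), k, seed):
        preds = fit_predict(train, val)
        correct = sum(p == labels[i] for p, i in zip(preds, val))
        scores.append(correct / len(val))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy imbalanced labels (0 = legitimate, 1 = phishing); not the real data.
    toy_labels = [0] * 90 + [1] * 10

    def majority_baseline(train: List[int], val: List[int]) -> List[int]:
        """Placeholder model: always predict the most common training label."""
        ones = sum(toy_labels[i] for i in train)
        majority = 1 if ones * 2 > len(train) else 0
        return [majority] * len(val)

    print(cross_val_accuracy(toy_labels, majority_baseline))
```

Because the data sets are highly imbalanced, a stratified split that preserves the class ratio in every fold would be a reasonable alternative to the plain shuffle used here.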
Table 3: Statistics of Test Result
Method              Task           TP      TN     FP     FN
CNN 100 epochs      No Header      3646    295    180    179
CNN 500 epochs      No Header      3666    288    187    159
CNN 1000 epochs     No Header      3688    287    188    137
CNN 100 epochs      With Header    3237    496    0      462
CNN 500 epochs      With Header    3618    496    0      81

6    Conclusion

Email phishing is a growing threat to the digital world, and curbing it has become a major goal for every digital platform. Here we proposed a model using Keras Word Embedding and a CNN to classify legitimate and phishing emails. Combining the two gives a dense vector representation of words, which is then used to classify the emails in the data set. Our model performed well on both tasks, with and without headers. Highly imbalanced data sets were given for both sub tasks, and the task itself was unconstrained, i.e., any data set could be used during training. Even without using any external data sets, we were able to obtain a good detection rate for phishing emails in both sub tasks. We therefore conclude that adding further data sources could considerably increase the detection rate of phishing emails for the proposed methodology.

Acknowledgements

This research was supported in part by Paramount Computer Systems. We are grateful to NVIDIA India for the GPU hardware support to the research grant. We are grateful to the Computational Engineering and Networking (CEN) department for encouraging the research.

References

[ABC+ 16]   Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

[Cho15]     François Chollet. Keras. https://github.com/fchollet/keras, 2015.

[EDB+ 18]   Ayman Elaassal, Avisha Das, Shahryar Baki, Luis De Moraes, and Rakesh Verma. Iwspa-ap: Anti-phising shared task at acm international workshop on security and privacy analytics. In Proceedings of the 1st IWSPA Anti-Phishing Shared Task. CEUR, 2018.

[EDMB+ 18]  Ayman Elaassal, Luis De Moraes, Shahryar Baki, Rakesh Verma, and Avisha Das. Iwspa-ap shared task email dataset, 2018.

[FBJ+ 15]   Lv Fang, Wang Bailing, Huang Junheng, Sun Yushan, and Wei Yuliang. A proactive discovery and filtering solution on phishing websites. In Big Data (Big Data), 2015 IEEE International Conference on, pages 2348–2355. IEEE, 2015.

[FCT+ 15]   Lew May Form, Kang Leng Chiew, Wei King Tiong, et al. Phishing email detection technique by using hybrid features. In IT in Asia (CITA), 2015 9th International Conference on, pages 1–5. IEEE, 2015.

[FM15]      Mohammed Nazim Feroz and Susan Mengel. Phishing url detection using url ranking. In Big Data (BigData Congress), 2015 IEEE International Congress on, pages 635–638. IEEE, 2015.

[KK15]      Sukhjeel Kaui and Amrit Kaur. Detection of phishing webpages using weights computed through genetic algorithm. In MOOCs, Innovation and Technology in Education (MITE), 2015 IEEE 3rd International Conference on, pages 331–336. IEEE, 2015.

[KKMK15]    Binay Kumar, Pankaj Kumar, Ankit Mundra, and Shikha Kabra. Dc scanner: Detecting phishing attack. In Image Information Processing (ICIIP), 2015 Third International Conference on, pages 271–276. IEEE, 2015.

[MFSE14]    Samuel Marchal, Jérôme François, Radu State, and Thomas Engel. Phishstorm: Detecting phishing with streaming analytics. IEEE Transactions on Network and Service Management, 11(4):458–471, 2014.

[PR12]      Mayank Pandey and Vadlamani Ravi. Detecting phishing e-mails using text and data mining. In Computational Intelligence & Computing Research (ICCIC), 2012 IEEE International Conference on, pages 1–6. IEEE, 2012.

[SAZ+ 15]   Sami Smadi, Nauman Aslam, Li Zhang, Rafe Alasem, and MA Hossain. Detection of phishing emails using data mining algorithms. In Software, Knowledge, Information Management and Applications (SKIMA), 2015 9th International Conference on, pages 1–8. IEEE, 2015.

[SZL+ 15]   Hongzhou Sha, Zhou Zhou, Qingyun Liu, Tingwen Liu, and Chao Zheng. Limited dictionary builder: An approach to select representative tokens for malicious urls detection. In Communications (ICC), 2015 IEEE International Conference on, pages 7077–7082. IEEE, 2015.

[TC09]      Fergus Toolan and Joe Carthy. Phishing detection using classifier ensembles. In eCrime Researchers Summit, 2009. eCRIME'09., pages 1–9. IEEE, 2009.

[TH09]      Hicham Tout and William Hafner. Phishpin: An identity-based anti-phishing approach. In Computational Science and Engineering, 2009. CSE'09. International Conference on, volume 3, pages 347–352. IEEE, 2009.

[VSP18a]    R Vinayakumar, KP Soman, and Prabaharan Poornachandran. Detecting malicious domain names using deep learning approaches at scale. Journal of Intelligent & Fuzzy Systems, 34(3):1355–1367, 2018.

[VSP18b]    R Vinayakumar, KP Soman, and Prabaharan Poornachandran. Evaluating deep learning approaches to characterize and classify malicious urls. Journal of Intelligent & Fuzzy Systems, 34(3):1333–1343, 2018.

[Zen17]     Yuanyuan Grace Zeng. Identifying email threats using predictive analysis. In Cyber Security And Protection Of Digital Services (Cyber Security), 2017 International Conference on, pages 1–2. IEEE, 2017.

[ZT11]      Justin Zhan and Lijo Thomas. Phishing detection using stochastic learning-based weak estimators. In Computational Intelligence in Cyber Security (CICS), 2011 IEEE Symposium on, pages 55–59. IEEE, 2011.

[ZZL15]     Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pages 649–657, 2015.