Machine Learning Based Phishing E-mail detection
                                              Security-CEN@Amrita


           Nidhin A Unnithan, Harikrishnan NB, Akarsh S, Vinayakumar R, Soman KP
                  Center for Computational Engineering and Networking (CEN),
                            Amrita School of Engineering, Coimbatore
                              Amrita Vishwa Vidyapeetham, India
                                  nidhinkittu5470@gmail.com


Copyright © by the paper’s authors. Copying permitted for private and academic
purposes.
In: R. Verma, A. Das (eds.): Proceedings of the 1st AntiPhishing Shared Pilot at
4th ACM International Workshop on Security and Privacy Analytics (IWSPA 2018),
Tempe, Arizona, USA, 21-03-2018, published at http://ceur-ws.org

                                    Abstract

    Phishing email is a significant threat in today’s world. The rate at which
    phishing emails are generated is increasing tremendously day by day. It is
    high time to deploy a self-learning system that provides time-bound
    detection and prevention of phishing email efficiently. This work proposes
    a system which uses the term-document matrix as a feature engineering
    mechanism and classical machine learning techniques for distinguishing
    phishing emails from legitimate ones. The system also incorporates domain
    knowledge and lexical features as part of the feature engineering
    mechanism. The efficiency of the system is compared using different
    classical machine learning techniques. Based on the accuracy, we propose
    the best model that solves the formulated problem efficiently.

1    Introduction
Email plays an important part in everybody’s life. It is one of the easiest and
most effective means of transferring messages and files. Even though there are
many modes of communication, the popularity of e-mail has not diminished, as it
is considered one of the safest and fastest ways to transfer messages over
networks and is an inexpensive method of communication.
   Nowadays e-mail usage has increased tremendously compared to previous
decades. In 2017 there were nearly 4.8 billion persons using email and it is
estimated that by 2021 this will increase to 5.6 billion users, as email is
considered to be the main medium for transferring messages over other apps. But
the main problem in email is the presence of phishing mails. These phishing
mails are unwanted mails which may carry malware, fraud schemes, advertisements
etc. In comparison to previous years, phishing mails have increased and have
caused serious damage to businesses, corporations, individuals and economies.
Detecting fraud/phishing emails precisely is essential; extracting and analyzing
these mails can reveal complex and interesting patterns, and appropriate
decisions can then be made within a company to block phishing mails. During the
early stages of communication via email clear rules were followed. But nowadays,
due to the diversity present in email services like Microsoft Outlook, Mozilla
Thunderbird and Google’s Gmail, mails are grouped into conversations and clients
attempt to hide quoted parts in order to improve readability.
   One type of spam mail which is hazardous to users is phishing mail. A
phishing mail is one which disguises itself as a legitimate mail but once opened
can steal our data without our knowledge. Thus identifying phishing mails among
spam mails is very important. One way to protect our data from phishing mail is
to add a secondary password to the login credentials. Another way is to alert
the user once a phishing mail tries to steal our data.
   In [SAZ+15] Sami Smadi et al. proposed a model for detecting phishing emails
that relies on a preprocessing technique which extracts different parts of an
email as features. These extracted features are fed into a J48 classification
algorithm to perform classification. In [SZL+15], they considered meaningless
tokens and new pages as the feature set. The authors of [SZL+15] selected some
features that have better predictability from the initial feature set. They
provide the O(1) complexity as an evaluation method to each feature set to
evaluate
its predictive ability. In the paper [KK15], Sukhjeel Kaui et al. used a genetic
algorithm for the detection of phishing webpages, and for categorizing pages
they preferred a filter function. Lv Fang et al. in [FBJ+15] propose a solution
to overcome the time lag in detecting phishing websites. They detect phishing
websites by analyzing peculiarities in their WHOIS and URL information. In
[VSP18b, VSP18a] deep learning methods were employed to detect malicious URLs
and domains. Binay Kumar et al. used HTML contents for detecting email phishing
in [KKMK15]. Rachna Dhamija et al. in [TC09] mainly concentrated on
understanding which phishing activity works during the attack and why. For that
they used a large data set containing reported phishing activities. Fergus
Toolan et al. took a different approach. They used only five features for
classification. For classification they used the C5.0 algorithm, which has
higher precision compared to other algorithms. Mayank Pandey et al. in [PR12]
used different types of classification methods such as Multilayer Perceptron
(MLP), Decision Trees (DT), Support Vector Machine (SVM), Group Method of Data
Handling (GMDH), Probabilistic Neural Net (PNN), Genetic Programming (GP) and
Logistic Regression (LR). Lew May Form et al. in [FCT+15] proposed a method
which uses hybrid features for detecting phishing emails. The features are
called hybrid because they combine URL-based, behavior-based and content-based
features. They achieved an overall accuracy of 97.25% with an error percentage
of 2.75%.
   Even though there are different ways to detect phishing, [DAY+15] gives an
overall evaluation of different classifiers used for phishing detection.
Recently, count based representation combined with domain level features
integrated with machine learning techniques has been used for classifying
phishing mails and legitimate mails [EDB+18, BMS08]. The proposed methodology
uses a feature engineering approach combined with deep learning, which is one of
the significant directions in which the world is moving because it has performed
well in most text classification tasks [LBH15] and even in phishing detection
[LNRW, EC].
   The rest of the sections are organized as follows. Section 2 discusses the
background details of email representation and the machine learning algorithms.
Section 3 includes the description of the data set, experiments and proposed
architecture. Section 4 includes results. The conclusion is placed in Section 5.

2   Background
This section discusses the mathematical details of various traditional machine
learning algorithms and details of vector space modeling techniques such as
TF-IDF and Bag of words.

2.1   Logistic Regression
This is a classification algorithm which is used to separate the data into
different classes. It can be binary, ordinal or multinomial. In binary logistic
regression the outcome is classified as 0 or 1, whereas in multinomial logistic
regression the outcome can fall into multiple classes. The activation function
used for this is the sigmoid function. The mathematical representation of the
sigmoid activation function is as follows:

   \[ \sigma(x) = \frac{1}{1 + \exp(-w^{T} x)} \qquad (1) \]

2.2   Naive Bayes
Naive Bayes is a set of supervised learning algorithms which work on the
principle of Bayes’ theorem. This theorem works on conditional probability, by
which the probability of events is calculated. Binary and multi-class
classification are done using different types of algorithms like GaussianNB,
MultinomialNB and BernoulliNB [MN+98]. For this problem we used MultinomialNB
from scikit-learn as our algorithm.

2.3    Support Vector Machine
SVM is a supervised classification algorithm which builds the model by
classifying the data into two classes. Based on the number of classes we will be
defining the SVM. It is of two types: linear SVM and non-linear SVM. The
decision boundary for a linear SVM is formulated as a hyperplane in feature
space, i.e. a linear function of the features. Non-linear SVMs result in
non-linear decision boundaries in the original feature space. From the different
types of kernels available we used the radial basis function (RBF) for our SVM
model.

2.4   TF-IDF
TF-IDF stands for term frequency-inverse document frequency, and its weight can
be considered a statistical measure which evaluates how important a word is to a
document, which can in turn be used for information retrieval and text mining.
Term Frequency gives us an idea about how frequently a term occurs in a
document. This can be mathematically defined by the equation given below,
where f_{t,d} is the number of times term t occurs in document d:

   \[ \mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} \qquad (2) \]
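As a minimal sketch of Equation (2), the snippet below computes term
frequencies for a single tokenized document. The toy email text and the helper
name `term_frequency` are illustrative assumptions and are not taken from the
shared-task data or from the authors' implementation; tokenization here is a
plain whitespace split.

```python
from collections import Counter


def term_frequency(tokens):
    """Return tf(t, d) = f_{t,d} / sum_{t'} f_{t',d} for each term of one document."""
    counts = Counter(tokens)          # raw counts f_{t,d}
    total = sum(counts.values())      # document length, the denominator of Eq. (2)
    return {term: count / total for term, count in counts.items()}


# Hypothetical toy email body (assumption, not from the data set).
email_tokens = "verify your password now to keep your account".split()
tf = term_frequency(email_tokens)

print(tf["your"])      # "your" occurs 2 times among 8 tokens
print(tf["password"])  # "password" occurs once among 8 tokens
```

Because every count is divided by the document length, the term frequencies of
one document sum to 1.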
   Inverse Document Frequency gives us an idea about how important a term is.
When we compute term frequency all the terms are given equal importance, whether
they are stop words or terminology words. Thus we need to weigh up terminology
words, which are less frequent than stop words in a document, by computing the
inverse document frequency given by the equation

   \[ \mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} \qquad (3) \]

where N is the total number of documents in the corpus.
   Now TF-IDF can be calculated as

   \[ \mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D) \qquad (4) \]

   Additionally the domain level features are added. This includes a list of
most commonly appeared words and a list of special characters.

3      Experiments
3.1     Dataset details
Email phishing detection is a task in the anti-phishing shared task at the 4th
ACM International Workshop on Security and Privacy Analytics [EDMB+18]. Let
E = [e1, e2, ..., en] be a set of emails and C = [c1, c2, ..., cn] the
corresponding email types, legitimate or phishing; the task was to classify each
given email sample as either legitimate or phishing. Two data sets were used,
one with headers and one without headers. Data set statistics are given in
Table 1 for training and Table 2 for testing.

           Table 1: Training Dataset details
    Training Dataset    Legitimate    Phishing    Total
    With header         4082          501         4583
    Without header      5088          612         5700

            Table 2: Testing Dataset details
    Testing Dataset     Data Samples
    With header         4195
    Without header      4300

3.2     Proposed Architecture
We used count based representation to create our model. A diagrammatic
representation of our architecture is shown in Figure 1. The email samples from
the data set are first passed through a count based representation, here TF-IDF,
for word representation. It is then combined with domain level features to get
our input word representation for machine learning algorithms. The domain level
features include the most commonly appearing words (40 features), for example
password, fraudulent, business, and special characters like $, #, !, (, [, &,
etc., and all the stop words were removed. These are then passed through
Logistic Regression, Naive Bayes and Support Vector Machine to do the
classification of phishing and legitimate mails.

                 Figure 1: Proposed Architecture

          Table 3: Statistics of 10-fold cross validation
    Method                    Task              Accuracy (%)
    Logistic Regression       Without Header    92.2
    Naive Bayes               Without Header    93.4
    Support Vector Machine    Without Header    94.3
    Logistic Regression       With Header       91.2
    Naive Bayes               With Header       92.2
    Support Vector Machine    With Header       93.3

4    Results
Our model built using the above architecture was trained on the data sets with
headers and without headers for classification of phishing and legitimate mails.
We trained a total of six models, one each for Logistic Regression, Naive Bayes,
Support Vector Machine for mails with
                              Table 4: Statistics of Test Result
    Method                  Task            TP    TN   FP   FN   Accuracy  Precision  Recall  F1 score
    Logistic Regression     Without Header  3784  325  150  41   0.95      0.96       0.98    0.97
    Naive Bayes             Without Header  3807  258  217  18   0.94      0.94       0.99    0.97
    Support Vector Machine  Without Header  3671  337  138  154  0.93      0.96       0.95    0.96
    Logistic Regression     With Header     3612  490  6    87   0.97      0.99       0.97    0.98
    Naive Bayes             With Header     3572  489  7    127  0.96      0.99       0.96    0.98
    Support Vector Machine  With Header     3561  458  38   138  0.95      0.98       0.96    0.97
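The derived columns of Table 4 can be recomputed from the confusion-matrix
counts with the standard definitions of accuracy, precision, recall and F1. The
sketch below does this in Python. The helper names `metrics` and `truncate2`
are illustrative, and truncating (rather than rounding) to two decimals is an
assumption made here because it reproduces the reported values.

```python
import math

# (TP, TN, FP, FN) and the reported (accuracy, precision, recall, F1),
# both copied from Table 4.
TABLE4 = {
    ("Logistic Regression", "Without Header"): ((3784, 325, 150, 41), (0.95, 0.96, 0.98, 0.97)),
    ("Naive Bayes", "Without Header"): ((3807, 258, 217, 18), (0.94, 0.94, 0.99, 0.97)),
    ("Support Vector Machine", "Without Header"): ((3671, 337, 138, 154), (0.93, 0.96, 0.95, 0.96)),
    ("Logistic Regression", "With Header"): ((3612, 490, 6, 87), (0.97, 0.99, 0.97, 0.98)),
    ("Naive Bayes", "With Header"): ((3572, 489, 7, 127), (0.96, 0.99, 0.96, 0.98)),
    ("Support Vector Machine", "With Header"): ((3561, 458, 38, 138), (0.95, 0.98, 0.96, 0.97)),
}


def metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


def truncate2(value):
    """Cut a value off after two decimals (assumed convention, see lead-in)."""
    return math.floor(value * 100) / 100


for (method, task), (counts, reported) in TABLE4.items():
    computed = tuple(truncate2(v) for v in metrics(*counts))
    status = "matches" if computed == reported else "differs from"
    print(f"{method:24s}{task:16s}{computed} {status} Table 4")
```

The counts in each row also sum to the test-set sizes in Table 2 (4300 samples
without header, 4195 with header), which is a quick consistency check on the
table.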
header and without header. We used 10-fold cross validation on our training
data, and the results obtained by our models are consolidated in Table 3. For
the data set without headers SVM gave the highest accuracy with 94.3%, and for
the data set with headers SVM gave the highest accuracy with 93.3%. We didn’t
extract any features from the headers of the with-header data set, but
extracting features from headers may increase the accuracy. Our model was tested
on the test data by the IWSPA-AP Shared Task committee, and the corresponding
results for True Positive, True Negative, False Positive, False Negative,
Accuracy, Precision, Recall and F1 score for our six models are summarized in
Table 4.

5     Conclusion
This paper evaluated the performance of machine learning based classifiers for
distinguishing phishing emails from legitimate ones. We created a model using
count based representation combined with domain level features as word
representation and passed it to various machine learning techniques such as
Logistic Regression, Naive Bayes and Support Vector Machine to classify whether
an email is phishing or legitimate. Both the sub tasks belong to the
unconstrained category, i.e., any data sets can be used during training, and the
data sets for both tasks were highly imbalanced. Even then we have not used any
external data sources and were still able to achieve a good detection rate for
phishing email in both sub tasks. By adding some additional data sources we can
considerably increase the detection rate of phishing emails for the proposed
methodology.

5.0.1    Acknowledgements
This research was supported in part by Paramount Computer Systems. We are
grateful to NVIDIA India for the GPU hardware support to the research grant. We
are grateful to the Computational Engineering and Networking (CEN) department
for encouraging the research.

References
[BMS08]     Ram Basnet, Srinivas Mukkamala, and Andrew H Sung. Detection of
            phishing attacks: A machine learning approach. In Soft Computing
            Applications in Industry, pages 373–383. Springer, 2008.

[DAY+15]    Ammar Yahya Daeef, R Badlishah Ahmad, Yasmin Yacob, Naimah Yaakob,
            and Mohd Nazri Bin Mohd Warip. Phishing email classifiers
            evaluation: Email body and header approach. Journal of Theoretical
            and Applied Information Technology, 80(2):354, 2015.

[EC]        Louis Eugene and Isaac Caswell. Making a manageable email
            experience with deep learning.

[EDB+18]    Ayman Elaassal, Avisha Das, Shahryar Baki, Luis De Moraes, and
            Rakesh Verma. IWSPA-AP: Anti-phishing shared task at ACM
            International Workshop on Security and Privacy Analytics. In
            Proceedings of the 1st IWSPA Anti-Phishing Shared Task. CEUR, 2018.

[EDMB+18]   Ayman Elaassal, Luis De Moraes, Shahryar Baki, Rakesh Verma, and
            Avisha Das. IWSPA-AP shared task email dataset, 2018.

[FBJ+15]    Lv Fang, Wang Bailing, Huang Junheng, Sun Yushan, and Wei Yuliang.
            A proactive discovery and filtering solution on phishing websites.
            In Big Data (Big Data), 2015 IEEE International Conference on,
            pages 2348–2355. IEEE, 2015.

[FCT+15]    Lew May Form, Kang Leng Chiew, Wei King Tiong, et al. Phishing
            email detection technique by using hybrid features. In IT in Asia
            (CITA), 2015 9th International Conference on, pages 1–5. IEEE,
            2015.

[KK15]      Sukhjeel Kaui and Amrit Kaur. Detection of phishing webpages using
            weights computed through genetic algorithm. In MOOCs, Innovation
            and Technology in Education (MITE), 2015 IEEE 3rd International
            Conference on, pages 331–336. IEEE, 2015.

[KKMK15]    Binay Kumar, Pankaj Kumar, Ankit Mundra, and Shikha Kabra. DC
            scanner: Detecting phishing attack. In Image Information Processing
            (ICIIP), 2015 Third International Conference on, pages 271–276.
            IEEE, 2015.

[LBH15]     Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning.
            Nature, 521(7553):436, 2015.

[LNRW]      Christopher Lennan, Bastian Naber, Jan Reher, and Leon Weber.
            End-to-end spam classification with neural networks.

[MN+98]     Andrew McCallum, Kamal Nigam, et al. A comparison of event models
            for naive Bayes text classification. In AAAI-98 workshop on
            learning for text categorization, volume 752, pages 41–48.
            Citeseer, 1998.

[PR12]      Mayank Pandey and Vadlamani Ravi. Detecting phishing e-mails using
            text and data mining. In Computational Intelligence & Computing
            Research (ICCIC), 2012 IEEE International Conference on, pages 1–6.
            IEEE, 2012.

[SAZ+15]    Sami Smadi, Nauman Aslam, Li Zhang, Rafe Alasem, and MA Hossain.
            Detection of phishing emails using data mining algorithms. In
            Software, Knowledge, Information Management and Applications
            (SKIMA), 2015 9th International Conference on, pages 1–8. IEEE,
            2015.

[SZL+15]    Hongzhou Sha, Zhou Zhou, Qingyun Liu, Tingwen Liu, and Chao Zheng.
            Limited dictionary builder: An approach to select representative
            tokens for malicious URLs detection. In Communications (ICC), 2015
            IEEE International Conference on, pages 7077–7082. IEEE, 2015.

[TC09]      Fergus Toolan and Joe Carthy. Phishing detection using classifier
            ensembles. In eCrime Researchers Summit, 2009. eCRIME’09.,
            pages 1–9. IEEE, 2009.

[VSP18a]    R Vinayakumar, KP Soman, and Prabaharan Poornachandran. Detecting
            malicious domain names using deep learning approaches at scale.
            Journal of Intelligent & Fuzzy Systems, 34(3):1355–1367, 2018.

[VSP18b]    R Vinayakumar, KP Soman, and Prabaharan Poornachandran. Evaluating
            deep learning approaches to characterize and classify malicious
            URLs. Journal of Intelligent & Fuzzy Systems, 34(3):1333–1343,
            2018.