       Detecting Phishing E-mail using Machine Learning Techniques
                              CEN-SecureNLP


         Nidhin A Unnithan, Harikrishnan NB, Vinayakumar R, Soman KP
          Center for Computational Engineering and Networking (CEN),
                   Amrita School of Engineering, Coimbatore
                       Amrita Vishwa Vidyapeetham, India
                          nidhinkittu5470@gmail.com

                             Sai Sundarakrishna
                         Caterpillar, Bangalore, India
                        sai.sundarakrishna@gmail.com



                        Abstract

    The number of unsolicited, a.k.a. phishing,
    emails is increasing tremendously day by day.
    This suggests the need to design a reliable
    framework to filter out phishing emails. In the
    proposed work, we develop a supervised classifier
    for distinguishing phishing emails from
    legitimate ones. Term frequency-inverse document
    frequency (tf-idf) and Doc2Vec representations
    are formed for legitimate and phishing emails
    and passed to various traditional machine
    learning classifiers. The classifiers with the
    Doc2Vec representation performed well in
    comparison to the tf-idf representation. Thus we
    conclude that the Doc2Vec representation is more
    appropriate for detecting and classifying
    phishing and legitimate emails.

Copyright © by the paper's authors. Copying permitted for
private and academic purposes.
In: R. Verma, A. Das (eds.): Proceedings of the 1st AntiPhishing
Shared Pilot at 4th ACM International Workshop on Security and
Privacy Analytics (IWSPA 2018), Tempe, Arizona, USA, 21-03-2018,
published at http://ceur-ws.org

1    Introduction

Electronic mail (email) is one of the most effective and easy
ways to transfer messages. It is considered the safest form of
message transfer over networks and is inexpensive. Even though
there are many modes of message transfer, the popularity of
email has hardly declined in business, colleges, and other
private and government sectors, since email is regarded as a
safe way to transfer messages. Email communication plays an
important part in everybody's life, and usage has grown
tremendously compared to earlier days, with a large increase in
users from 2016 to 2017. Nearly 4.8 billion people were using
email in 2017, and estimates show that the number will rise to
5.6 billion users by 2021 [RH11]. The main problem with email,
however, has been phishing mails, which spread malware and are
used in fraud schemes, advertisements, and so on. Email
phishing has increased in recent years, and many security
threats have evolved that cause serious damage to businesses,
individuals, and economies. Especially for business emails,
extracting and analyzing these communication networks can
reveal interesting and complex patterns of processes and
decision making within a company. Detecting these
fraud/phishing emails precisely in communication networks is
therefore essential.
   Phishing mails are a type of spam mail that is hazardous to
users. A phishing mail can steal our data without our knowledge
once it is opened. Thus identifying phishing mails among spam
mails is very important. One way to protect our data from
phishing mail is to add a secondary password to login
credentials; another is to alert the user once a phishing mail
tries to steal data.
   During the infant stages of email communication,
clear rules were followed [SHP08], but recently, due to the
diversity of email programs and formatting standards, users
have the freedom to edit and change quoted text. Despite these
limitations, Symantec Brightmail [SHP08] has shown good
performance, even now, in detecting phishing emails; moreover,
it can keep track of the IP (Internet Protocol) addresses that
sent the phishing mail. Its performance was comparable to
[MW04]. Email services such as Microsoft Outlook, Mozilla
Thunderbird, and online services such as Gmail usually group
emails into conversations and attempt to hide quoted parts in
order to improve readability.
   In 2011, 2.3 billion users were using email, a number which
increased to about 4.3 billion by 2016 [RH11]. TREC has defined
phishing as unwanted email sent indiscriminately [C+08]. Emails
have thus been used for marketing and advertising purposes
[CL98].
   Datasets such as the Enron [KY04] or Avocado [OWKG15]
corpora provide real-world information about business
communication and contain a mix of professional emails,
personal emails, and phishing. [PS05] published parts of a
personal email archive for research. A recent survey shows the
diversity of email classification tasks alone [MSR+17].
Similarly, an interesting analysis of communication networks
based on metadata such as sender, recipients, and time
extracted from emails is discussed in [BCGJ11]. Models based on
the written contents of emails may get confused by
automatically inserted text blocks or quoted messages; thus,
working with real-world data requires normalization of the data
prior to solving the problem at hand. Rauscher et al. [RMA15]
developed an approach to detect zones inside work-related
emails where relevant business knowledge may be found. By
finding overlapping text passages across the corpus, Jamison et
al. managed to resolve email threads of the Enron corpus almost
perfectly [JG13]. It has to be noted that the claimed accuracy
of almost 100% was only tested on 20 email threads. In order to
reassemble email threads, Yeh et al. considered a similar
approach with a more elaborate evaluation, reaching an accuracy
of 98% in separating email conversations into parts [YWD05].
To do so they rely on additional meta information in emails
sent through Microsoft Outlook (thread index) and rules that
match specific client headers; thus, such an approach will not
work on arbitrary emails, nor can it handle different
localizations or edits by the user. Even though there are
different ways to detect phishing, [DAY+15] gives an overall
evaluation of different classifiers used for phishing
detection. Recently, deep learning methods have also been used
extensively for detecting phishing mails, as stated in [BMS08],
and for detecting malicious URLs and domains, as stated in
[VSP18b, VSP18a]. Domain Generation Algorithms, which can be
used by malicious families, were also classified using deep
learning methods, as said in [VSPSK18].
   In this task we propose a machine learning based approach to
extract the underlying structure in email text and overcome the
problems of error-prone rule-based approaches. This enables the
downstream tasks to work with much cleaner data and additional
information by focusing on particular parts. We further show
performance improvements and flexibility over previous work on
similar tasks.

           Table 1: Training Dataset details

    Category          Legitimate    Phishing    Total
    With header             4082         501     4583
    With no header          5088         612     5700

           Table 2: Testing Dataset details

    Category          Email Samples
    With header                4195
    With no header             4300

2    Background

This section discusses the mathematical details of various
traditional machine learning algorithms and vector space
modeling techniques such as tf-idf and Doc2Vec.

2.1    Term frequency-inverse document frequency (tf-idf)

Term frequency-inverse document frequency (tf-idf) is used in
information retrieval to reflect how important a word is to a
document in a corpus. It is also used as a weighting factor in
text mining and user modeling: it gives less importance to
words that appear in many documents of the corpus, and it can
be used to remove stop words. Nowadays tf-idf is also very
important in search engines. It is calculated by the following
equations:

    tf(t, d) = f_{t,d} / Σ_{t' ∈ d} f_{t',d}                 (1)

    idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )                 (2)

where N is the total number of documents in the corpus, and

    tfidf(t, d, D) = tf(t, d) · idf(t, D)                    (3)
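A minimal sketch of equations (1)-(3) in Python; the sample corpus is invented for illustration:

```python
import math
from collections import Counter

def tf(term, doc):
    """Equation (1): raw count of term in doc over total tokens in doc."""
    counts = Counter(doc)
    return counts[term] / sum(counts.values())

def idf(term, corpus):
    """Equation (2): log of corpus size over documents containing term."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tfidf(term, doc, corpus):
    """Equation (3): product of tf and idf."""
    return tf(term, doc) * idf(term, corpus)

# Invented toy corpus of tokenized "emails".
corpus = [
    "verify your account password now".split(),
    "meeting agenda for monday attached".split(),
    "your account statement is attached".split(),
]
print(round(tfidf("password", corpus[0], corpus), 3))  # 0.2 * log(3) ≈ 0.22
```

Note how "your", which appears in two of the three documents, receives a lower idf weight than "password", which appears in only one.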
          Table 3: 10-fold cross validation accuracy of train data without header

    Task         Representation         Algorithm                  Accuracy
    No Header    Doc2Vec                Decision Tree                  81.2
    No Header    Doc2Vec                Naive Bayes                    79.5
    No Header    Doc2Vec                Adaboost                       83.4
    No Header    Doc2Vec                Logistic Regression            80.1
    No Header    Doc2Vec                K-nearest neighbour            76.8
    No Header    Doc2Vec                Support vector machine         88.4
    No Header    Doc2Vec                Random Forest                  87.4
    No Header    tf-idf                 Decision Tree                  74.2
    No Header    tf-idf                 Naive Bayes                    71.4
    No Header    tf-idf                 Adaboost                       75.6
    No Header    tf-idf                 Logistic Regression            70.2
    No Header    tf-idf                 K-nearest neighbour            63.2
    No Header    tf-idf                 Support vector machine         79.4
    No Header    tf-idf                 Random Forest                  78.1
2.2    Doc2Vec

Doc2Vec is an unsupervised learning algorithm which gives a
fixed-length vector representation of a variable-length text.
The text can be a sentence, a paragraph, or a document. It is
an extension of Word2Vec, in which, given the vector
representations of context words as input, the model predicts
the word most likely to accompany those context words. Word2Vec
is appealing because it can predict the next word in a sentence
given the context word vectors, thus capturing the semantics of
the sentence even though the word vectors are randomly
initialized. In Doc2Vec, instead of word vectors alone, we use
a document vector to predict the next word given a context from
a document. Every document is represented by a unique column
vector in a document matrix, and words are represented by
unique vectors in a word matrix. The next word in a context is
predicted by concatenating or averaging the document and word
vectors.
   In Doc2Vec the document vector is the same for all contexts
generated from the same document but differs across documents.
The word vector matrix, however, is shared across documents,
i.e., the same word has the same vector representation in
different documents.

2.3    Machine Learning

2.3.1    Decision Tree

The decision tree is a widely used discrete, supervised
algorithm that represents its output in a graphical (tree)
format. It is an algorithm where each element of the given
domain is mapped to an element of its range, which can be
either discrete or continuous, and it works well for
categorical variables. In this procedure, each split is chosen
in such a way that it reduces the target variable's variance.
The decision tree input is often an object or scenario
described by a set of properties, and the output is usually a
decision, such as YES or NO.
   Trees are built from nodes and leaves. At every node of the
tree a test is conducted which looks for the best possible
outcome. The leaves consist of the numerical or categorical
values of the respective item, which form the outcome after
each test.

2.3.2    Naive Bayes

Naive Bayes uses Bayes' theorem together with a strong (naive)
independence assumption: the presence of any feature in a class
does not affect the other features. A Naive Bayes classifier is
easy to build and tends to perform well when the feature
dimension is high. Although it often performs well when the
independence condition is met, the independence assumption does
not by itself overcome problems related to dimensionality. It
uses a conditional probability model: a problem instance to be
classified is represented by a feature vector
X = (x_1, x_2, ..., x_n), and the model yields probabilities
P(C_k | x_1, x_2, ..., x_n) for each of the k outcomes.
Mathematically it can be expressed as

    P(C_k | x) = P(C_k) P(x | C_k) / P(x)                    (4)

2.3.3    AdaBoost

AdaBoost is a sequential learning algorithm whose main purpose
is to boost the performance of a learning algorithm. It is
mostly used for classification, and it performs this task by
forming a strong classifier from a sequence of many weak
classifiers. When AdaBoost is combined with decision trees, it
is often a best-out-of-the-box classifier. Besides its
swiftness in classifying, it has been used as a feature learner
as well.
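Equation (4) can be illustrated with a hand-rolled multinomial Naive Bayes sketch; the toy word lists, Laplace smoothing, and equal priors are assumptions made for illustration, not part of the paper's setup. Since P(x) is the same for every class, comparing the numerators P(C_k) P(x | C_k) is enough:

```python
import math
from collections import Counter

# Invented toy training data: tokenized "emails" per class.
phishing_docs = [["verify", "password"], ["claim", "prize", "password"]]
legit_docs = [["meeting", "agenda"], ["project", "meeting", "minutes"]]

def log_likelihood(words, docs, vocab):
    """log P(x | C) under a multinomial model with Laplace smoothing."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return sum(math.log((counts[w] + 1) / (total + len(vocab)))
               for w in words)

vocab = {w for d in phishing_docs + legit_docs for w in d}
email = ["verify", "password"]

# Equal priors P(C_k) = 0.5; the evidence P(x) cancels between classes.
scores = {
    "phishing": math.log(0.5) + log_likelihood(email, phishing_docs, vocab),
    "legitimate": math.log(0.5) + log_likelihood(email, legit_docs, vocab),
}
print(max(scores, key=scores.get))  # prints "phishing"
```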
          Table 4: 10-fold cross validation accuracy of train data with header

    Task           Representation         Algorithm                  Accuracy
    With Header    Doc2Vec                Decision Tree                  73.1
    With Header    Doc2Vec                Naive Bayes                    70.1
    With Header    Doc2Vec                Adaboost                       77.4
    With Header    Doc2Vec                Logistic Regression            72.2
    With Header    Doc2Vec                K-nearest neighbour            69.1
    With Header    Doc2Vec                Support vector machine         75.4
    With Header    Doc2Vec                Random Forest                  73.4
    With Header    tf-idf                 Decision Tree                  68.2
    With Header    tf-idf                 Naive Bayes                    64.2
    With Header    tf-idf                 Adaboost                       69.4
    With Header    tf-idf                 Logistic Regression            66.7
    With Header    tf-idf                 K-nearest neighbour            62.2
    With Header    tf-idf                 Support vector machine         72.4
    With Header    tf-idf                 Random Forest                  71.2
     Table 5: Test Data result for SVM combined with Doc2Vec

    Task             TP      TN     FP     FN
    No Header      3825       0    475      0
    With Header    3593       7    489    106

2.3.4    Logistic Regression

Logistic regression is used when the target variable is
categorical. It is based on maximum likelihood estimation (MLE)
and is a qualitative choice model. It can be used to predict
whether a risk factor increases the odds of a given outcome by
a specific factor, and it models binary classification
problems. Its mathematical representation is

    F(x) = 1 / (1 + exp(-w^T x))                             (5)

where F takes values in the range 0 to 1.

2.3.5    k-nearest neighbour (KNN)

KNN is one of the simplest machine learning algorithms. It is
known as lazy learning because it defers all computation to
prediction time and furnishes only approximate values, and it
is affected by the local structure of the data. The procedure
estimates the local posterior probability of each class as the
average of class membership over the K nearest neighbours.

2.3.6    Support Vector Machine (SVM)

The Support Vector Machine (SVM) is a linear classifier based
on supervised learning. It creates a hyperplane boundary with
the maximum margin to separate the classes, and the algorithm
is robust to outliers. The coordinates of the individual
observations are called support vectors, and SVM creates a
hyperplane separating the support vectors with the maximum
possible margin.
   SVMs are among the most popular supervised machine learning
techniques; problems such as linear regression and
classification can be solved with them. The training set is
separated by a hyperplane, where the points nearest to the
hyperplane are the support vectors, which determine the
position of the hyperplane. If the training data set cannot be
linearly separated, it is mapped to a high-dimensional space
where it is assumed to be linearly separable.

2.3.7    Random Forest

Random Forest is a supervised learning algorithm used in both
classification and regression problems. To obtain highly
accurate results, the random forest classifier builds a large
number of decision trees; the prediction obtained from a random
forest tends to be far better than the prediction of an
individual decision tree. Random Forest uses the concept of
bagging to create several minimally correlated decision trees.
Its advantages include the ability to handle missing values and
to avoid overfitting of the model.

       Figure 1: Proposed architecture for email phishing detection
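The 10-fold cross-validation comparison behind Tables 3 and 4 can be sketched with scikit-learn; this is a hedged illustration only, since the random features below stand in for the actual email vectors, so the printed scores are not the paper's results:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))           # stand-in for Doc2Vec/tf-idf vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in for phishing/legitimate labels

# The seven classifiers evaluated in the paper.
models = {
    "Decision Tree": DecisionTreeClassifier(),
    "Naive Bayes": GaussianNB(),
    "Adaboost": AdaBoostClassifier(),
    "Logistic Regression": LogisticRegression(),
    "K-nearest neighbour": KNeighborsClassifier(),
    "Support vector machine": SVC(),
    "Random Forest": RandomForestClassifier(),
}
for name, clf in models.items():
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold CV as in Tables 3-4
    print(f"{name:24s} {scores.mean():.3f}")
```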
3    Experiments

3.1    Description of data set

The anti-phishing shared task is part of the First Security and
Privacy Analytics Anti-Phishing Shared Task (IWSPA-AP 2018) at
the 4th ACM International Workshop on Security and Privacy
Analytics [EDMB+18][EDB+18]. Let E = [e_1, e_2, ..., e_n] be a
set of emails and C = [c_1, c_2, ..., c_n] be a set of email
types, namely legitimate or phishing. The task is to classify
each given email sample as either legitimate or phishing. A
detailed summary of the training and testing data sets is given
in Table 1 and Table 2.

3.2    Proposed Architecture

In our proposed architecture we use both count-based and
distributed word representations. For the count-based method we
use tf-idf, and for the distributed representation we use
Doc2Vec via the gensim library. Once the representations are
created, we apply different machine learning techniques to
classify the data as legitimate or phishing: Naive Bayes,
Logistic Regression, Decision Tree, K-Nearest Neighbour, Random
Forest, AdaBoost, and Support Vector Machine.

4    Results

Our model was trained with seven different machine learning
techniques for two different representations on the two data
sets, i.e., with and without header information. All the
results are consolidated in Table 3 and Table 4. Of all the
models, SVM combined with Doc2Vec gave the highest accuracy on
both data sets, so only that model was submitted, even though
we trained seven different techniques. The submitted models
were evaluated on the test data, and the True Positive, True
Negative, False Positive, and False Negative counts are
consolidated in Table 5.

5    Conclusion

The main objective of this work is to develop a supervised
classifier that can detect phishing and legitimate emails. We
used count-based and distributed word representations and
applied different machine learning techniques, namely Naive
Bayes, Logistic Regression, Decision Tree, K-Nearest Neighbour,
Random Forest, AdaBoost, and Support Vector Machine, to
classify legitimate and phishing emails. The proposed
methodology relies on feature engineering; applying deep
learning to phishing detection is a promising future direction.
Acknowledgements

This research was supported in part by Paramount Computer
Systems. We are grateful to NVIDIA India for the GPU hardware
support to the research grant, and to the Computational
Engineering and Networking (CEN) department for encouraging the
research.

References

[BCGJ11]   Francesco Bonchi, Carlos Castillo, Aristides Gionis,
           and Alejandro Jaimes. Social network analysis and
           mining for business applications. ACM Transactions
           on Intelligent Systems and Technology (TIST),
           2(3):22, 2011.

[BMS08]    Ram Basnet, Srinivas Mukkamala, and Andrew H Sung.
           Detection of phishing attacks: A machine learning
           approach. In Soft Computing Applications in
           Industry, pages 373–383. Springer, 2008.

[C+08]     Gordon V Cormack et al. Email spam filtering: A
           systematic review. Foundations and Trends in
           Information Retrieval, 1(4):335–455, 2008.

[CL98]     Lorrie Faith Cranor and Brian A LaMacchia. Spam!
           Communications of the ACM, 41(8):74–83, 1998.

[DAY+15]   Ammar Yahya Daeef, R Badlishah Ahmad, Yasmin Yacob,
           Naimah Yaakob, and Mohd Nazri Bin Mohd Warip.
           Phishing email classifiers evaluation: Email body
           and header approach. Journal of Theoretical and
           Applied Information Technology, 80(2):354, 2015.

[EDB+18]   Ayman Elaassal, Avisha Das, Shahryar Baki, Luis
           De Moraes, and Rakesh Verma. IWSPA-AP: Anti-phishing
           shared task at ACM International Workshop on
           Security and Privacy Analytics. In Proceedings of
           the 1st IWSPA Anti-Phishing Shared Task. CEUR, 2018.

[EDMB+18]  Ayman Elaassal, Luis De Moraes, Shahryar Baki,
           Rakesh Verma, and Avisha Das. IWSPA-AP shared task
           email dataset, 2018.

[JG13]     Emily Jamison and Iryna Gurevych. Headerless,
           quoteless, but not hopeless? Using pairwise email
           classification to disentangle email threads. In
           Proceedings of the International Conference Recent
           Advances in Natural Language Processing RANLP 2013,
           pages 327–335, 2013.

[KY04]     Bryan Klimt and Yiming Yang. The Enron corpus: A new
           dataset for email classification research. In
           European Conference on Machine Learning, pages
           217–226. Springer, 2004.

[MSR+17]   Ghulam Mujtaba, Liyana Shuib, Ram Gopal Raj, Nahdia
           Majeed, and Mohammed Ali Al-Garadi. Email
           classification research trends: Review and open
           issues. IEEE Access, 5:9044–9064, 2017.

[MW04]     Tony A Meyer and Brendon Whateley. SpamBayes:
           Effective open-source, Bayesian based, email
           classification system. In CEAS. Citeseer, 2004.

[OWKG15]   Douglas Oard, William Webber, David Kirsch, and
           Sergey Golitsynskiy. Avocado research email
           collection. Philadelphia: Linguistic Data
           Consortium, 2015.

[PS05]     Adam Perer and Ben Shneiderman. Beyond threads:
           Identifying discussions in email archives. Technical
           report, University of Maryland, Human-Computer
           Interaction Lab, 2005.

[RH11]     Sara Radicati and Quoc Hoang. Email statistics
           report, 2011–2015. Retrieved May, 25:2011, 2011.

[RMA15]    François Rauscher, Nada Matta, and Hassan Atifi.
           Context aware knowledge zoning: Traceability and
           business emails. In IFIP International Workshop on
           Artificial Intelligence for Knowledge Management,
           pages 66–79. Springer, 2015.

[SHP08]    Enrique Puertas Sanz, José María Gómez Hidalgo, and
           José Carlos Cortizo Pérez. Email spam filtering.
           Advances in Computers, 74:45–114, 2008.

[VSP18a]   R Vinayakumar, KP Soman, and Prabaharan
           Poornachandran. Detecting malicious domain names
           using deep learning approaches at scale. Journal of
           Intelligent & Fuzzy Systems, 34(3):1355–1367, 2018.
[VSP18b]    R Vinayakumar, KP Soman, and Praba-
            haran Poornachandran. Evaluating deep
            learning approaches to characterize and
            classify malicious urls. Journal of Intel-
            ligent & Fuzzy Systems, 34(3):1333–1343,
            2018.
[VSPSK18]  R Vinayakumar, KP Soman, Prabaharan Poornachandran,
           and S Sachin Kumar. Evaluating deep learning
           approaches to characterize and classify the DGAs at
           scale. Journal of Intelligent & Fuzzy Systems,
           34(3):1265–1276, 2018.
[YWD05]     Chi-Yuan Yeh, Chili-Hung Wu, and
            Shine-Hwang Doong. Effective spam
            classification based on meta-heuristics.
            In Systems, Man and Cybernetics, 2005
            IEEE International Conference on, vol-
            ume 4, pages 3872–3877. IEEE, 2005.