<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PED-ML: Phishing Email Detection Using Classical Machine Learning Techniques CENSec@Amrita</article-title>
      </title-group>
      <contrib-group>
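        <contrib contrib-type="author">
          <string-name>Anu Vazhayil</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harikrishnan NB</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vinayakumar R</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soman KP</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Computational Engineering and Networking (CEN), Amrita School of Engineering</institution>
          ,
          <addr-line>Coimbatore, Amrita Vishwa Vidyapeetham</addr-line>
          ,
          <country country="IN">India</country>
        </aff>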
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>62</volume>
      <abstract>
        <p>In the modern era, most services are maintained online and everyone uses them to speed up their day-to-day activities. These include social as well as financial activities, which involve the use of sensitive information to carry out the intended task. The growing use of such facilities makes it important to secure the data used to perform these actions. Over the last decade, phishing has become a serious threat to society: attackers steal sensitive information to gain access to these facilities. It is considered the most profitable cybercrime and, according to statistics from IBM's X-Force researchers, the number of people falling victim to such activities is increasing tremendously. As the risk posed by phishing emails increases steadily, detecting them stands as one of the highest-priority tasks at hand. In the present work, we use a non-sequential representation, the term document matrix, followed by Singular Value Decomposition (SVD) and Non-negative Matrix Factorization (NMF), to model phishing email detection as a supervised classification problem that separates phishing emails from legitimate ones.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Copyright © by the paper's authors. Copying permitted for
private and academic purposes.</p>
      <p>In: R. Verma, A. Das (eds.): Proceedings of the 1st
Anti-Phishing Shared Pilot at 4th ACM International Workshop on
Security and Privacy Analytics (IWSPA 2018), Tempe, Arizona,
USA, 21-03-2018, published at http://ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>The growth of the internet has revolutionized the digital
era. This revolution has entirely changed the way we
communicate, carry out business, advertise and so on.
In fact, in today's world a web presence is mandatory for
establishing a successful business, and in most
cases important communication takes place through
email. At the same time, there are instances where
phishing emails are sent to users with the main goal of
stealing their sensitive information.
Phishing emails do this by claiming to
originate from trusted sources, and they
contain links or attachments that try to obtain
sensitive information from the user. In such a scenario, an
efficient mechanism that detects and classifies phishing
emails is needed. The conventional
techniques used are blacklisting, greylisting and
whitelisting. In blacklisting, the IP and email addresses
of mails that attempt to collect users' private
information are stored in a list, and all emails
arriving from the addresses specified in the list are
marked as phishing scams. A whitelist functions in exactly
the opposite way, allowing emails from the trusted
users specified in the whitelist. The drawback of these
methods is that they require human involvement in
defining and updating the list, and they also fail to
detect new phishing emails or variants of existing
ones. Another popular method is the Bayesian
filter, a heuristic approach; Bayesian filters were
popular detection techniques during the 1990s. With
the increase in computational capability, there has been
a paradigm shift from conventional techniques to
data-driven techniques. Data-driven techniques have considerably
broadened the impact of machine learning in the area of cyber
security [NVK+15].</p>
      <p>A significant amount of research is under way
on phishing email classification.
Researchers have proposed many mathematical
models to detect phishing emails. Some of the commonly
used techniques are the naive Bayes classifier, boosted
decision trees [CM01], SVM [DWV99a] and an LVQ-based
neural network [CXMX05]. These methods need
Bayesian prior knowledge about the nature of
phishing emails [SVKS15], [BVP].</p>
      <p>Recent trends in the fields of computer vision and
Natural Language Processing (NLP) clearly convey
the potential of machine learning techniques to
tackle many significant problems in these areas. In
this context, our research focuses on a
machine learning based solution that classifies emails as
either phishing or legitimate. In this paper, a
Term Document Matrix (TDM) is used as a non-sequential
representation of the corpus. Feature
engineering is an important step in all machine learning tasks;
to extract the important features, SVD and
NMF are applied to the data. The resulting features are then passed
to machine learning algorithms such as Decision Tree,
KNN, Naive Bayes, Random Forest, SVM and logistic
regression.</p>
      <p>The remainder of the paper is organized as
follows: Section 2 reviews related work; Section 3
presents the model, covering the dataset
description, the data representation and the
methodology; Sections 4 and 5 present the results
and the conclusion, followed by
acknowledgements.</p>
    </sec>
    <sec id="sec-4">
      <title>Related Works</title>
      <p>Phishing attacks are serious cyber threats for
multinational companies as well as individual users. These emails
appear legitimate but contain malicious
content that can steal important information such as
bank account numbers and credit card details, and
bring huge losses to individuals and organizations. This
underlines the importance of segregating such emails.
Methods like blacklisting require human intervention to
manually select and classify the emails. On the
other hand, there are feature engineering techniques
that analyse the contents of emails and help in the
classification process. The work in [SDHH98]
conveyed the importance of phishing-specific features for
classification. In [KMAH04], the classification error was
reduced by utilizing the temporal relations in an email
sequence as features. Heuristics-based
feature selection was highlighted in [MW04]. With
the growth of computing facilities, data-driven
methods have become widely used in email classification. In [Faw03]
and [Gee03], data mining techniques were introduced
for filtering non-legitimate emails. In addition, [DWV99b]
used PCA as a preprocessing technique for feature
extraction as well as dimensionality reduction.
The authors of [ANNWN07] used machine learning based
models such as logistic regression, SVM and random forest
to classify emails as either phishing or legitimate.
In this work, we make use of
dimensionality reduction and the TDM representation of data.
For dimensionality reduction we use SVD and NMF.
The representation is then followed by the application of
classical machine learning techniques to the processed
data.</p>
    </sec>
    <sec id="sec-5">
      <title>Proposed Architecture</title>
      <p>The proposed architecture for an anti-phishing
framework that separates phishing emails from legitimate ones
is shown as a flow chart in Figure 1. The same
model is used in both cases, that is, for
emails with and without headers. Each level
is explained in detail below.</p>
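      <p>The flow described above can be sketched as a small scikit-learn pipeline. This is an illustrative sketch under stated assumptions, not the authors' code: the toy emails and labels are hypothetical, CountVectorizer stands in for the TDM step (it produces a documents-by-terms count matrix), TruncatedSVD stands in for the SVD block with 5 components instead of the rank 30 used in the paper, Random Forest is only one of the classifiers considered, and the 33% hold-out follows the split mentioned in the Results section.</p>
      <preformat>
```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; 1 = phishing, 0 = legitimate.
emails = [
    "verify your account password now",
    "urgent update your bank password",
    "click the link to confirm your account",
    "your account is suspended verify now",
    "confirm your bank details urgent",
    "password expired click to update",
    "meeting agenda for monday attached",
    "lunch at noon with the project team",
    "please review the attached report",
    "notes from the monday project meeting",
    "team lunch moved to friday noon",
    "report draft attached for review",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Hold out 33% of the labelled data for validation.
X_train, X_val, y_train, y_val = train_test_split(
    emails, labels, test_size=0.33, stratify=labels, random_state=0
)

model = make_pipeline(
    CountVectorizer(),                              # term counts per email
    TruncatedSVD(n_components=5, random_state=0),   # the paper uses rank 30
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(X_train, y_train)

predictions = model.predict(X_val)
print("validation accuracy:", model.score(X_val, y_val))
```
      </preformat>
      <p>One difference from the description in the paper: the pipeline here fits the vocabulary on the training split only, whereas the paper builds it from both train and test data.</p>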
      <sec id="sec-5-1">
        <title>Dataset description</title>
        <p>As part of the first Security and Privacy Analytics
anti-phishing shared task (IWSPA-AP 2018), two
subtasks were held. Task 1 is the classification of emails with
headers and Task 2 the classification of emails with no headers. The dataset
details [EDMB+18], [EDB+18] are provided in Tables 1
and 2.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Dataset representation</title>
        <p>Data representation is considered the most
important part of any machine learning task and needs
to be chosen according to the nature of the
dataset. The corpus received for the shared task
contains text and special symbols, so the first step is
to produce a meaningful representation of the data. In
this work, TDM is used in all experiments as the
numerical representation of the data for both
subtasks. The second
step involves feature extraction and dimensionality
reduction, carried out using Singular Value
Decomposition (SVD) and Non-negative Matrix
Factorization (NMF). For this, the TDM is passed
to the feature extraction block, where the rank is
taken as 30 in all cases; that is, the train and
test data matrices each have 30 columns after
dimensionality reduction. This numeric representation
is then passed to the different
machine learning algorithms for classification. Figure 1
describes the steps involved in the proposed
architecture, which consists of 5 blocks.
Block 1 represents the raw dataset, i.e. the set of emails
with and without headers. In block 2 the data is
preprocessed by removing special characters and
unnecessary details from the raw data. Block 3
produces the data representation of the emails.
This is followed by the dimensionality
reduction block (block 4), where the SVD and NMF techniques
are applied to the output of block 3. The result is passed
to block 5, where the different classical machine learning
algorithms are applied. Finally, the emails are
classified as either legitimate or phishing. The
mathematical formulation of the task is as follows:
given a set of emails <inline-formula><tex-math>D = [e_1, e_2, \ldots, e_n]</tex-math></inline-formula>
and the corresponding classes <inline-formula><tex-math>C = [c_1, c_2, \ldots, c_n]</tex-math></inline-formula>,
where each class value is either 0 or 1, the machine
learning models learn from the
training data and label accordingly. After the
learning process, the model is used to predict the
classes of unseen test data.
For samples with headers, the data representation proceeds in three steps.
Step 1: the TDM representation of the data is built, with the
vocabulary drawn from both the train and test data.
Step 2: SVD or NMF is used for feature extraction and
dimensionality reduction.
Step 3: Step 2 is followed by applying different classical
ML techniques like Decision Tree, Random Forest,
AdaBoost, KNN and SVM.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Data representation of samples with no headers</title>
        <p>SVD or NMF is applied for feature extraction and
dimensionality reduction.
This step (Step 3) is followed by applying different classical
ML techniques like Decision Tree, Random
Forest, AdaBoost, KNN and SVM to the numeric
representation of the data.</p>
      </sec>
      <sec id="sec-5-4">
        <title>Methodology</title>
        <p>The paper discusses classical machine learning
approaches like Decision Tree, K-Nearest Neighbors,
Logistic Regression, Naive Bayes, Random Forest and
SVM. The metrics used for analyzing the performance
of the model are as follows:</p>
        <sec id="sec-5-4-1">
          <title>1. Accuracy</title>
        </sec>
        <sec id="sec-5-4-2">
          <title>2. Precision</title>
        </sec>
        <sec id="sec-5-4-3">
          <title>3. Recall</title>
        </sec>
        <sec id="sec-5-4-4">
          <title>4. F1-Score</title>
          <p>TDM is used for the numeric representation of the data.
The TDM is passed to SVD and NMF to
extract the best features.</p>
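          <p>SVD decomposes a matrix into the product of three
matrices, which can be interpreted
geometrically as a rotation, a stretching and another
rotation. The mathematical representation of SVD
is <inline-formula><tex-math>A = U \Sigma V^T</tex-math></inline-formula>, where the columns of <inline-formula><tex-math>U</tex-math></inline-formula> are the
orthonormal eigenvectors of <inline-formula><tex-math>A A^T</tex-math></inline-formula> and the rows of <inline-formula><tex-math>V^T</tex-math></inline-formula> are the
orthonormal eigenvectors of <inline-formula><tex-math>A^T A</tex-math></inline-formula>. <inline-formula><tex-math>\Sigma</tex-math></inline-formula> is a diagonal
matrix that holds the singular values. For
extracting features, the product <inline-formula><tex-math>U \Sigma</tex-math></inline-formula> is sufficient. In
all cases the rank is chosen as 30, so the
resulting train and test datasets are reduced
to size (total number of data points) x 30.</p>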
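          <p>The rank-reduction step can be illustrated with NumPy on a small random count matrix. This is a sketch under assumptions: the 6 x 8 matrix is synthetic, the rank is 3 rather than the paper's 30, and the reduced features are taken as the product of the first k columns of U with the first k singular values. The same features are obtained by projecting the data onto the first k right singular vectors, which is one possible way to map new rows into the same space.</p>
          <preformat>
```python
import numpy as np

# Synthetic count matrix: 6 documents x 8 terms (hypothetical data).
rng = np.random.default_rng(0)
A = rng.integers(0, 4, size=(6, 8)).astype(float)

# Thin SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3  # reduced rank; the paper uses 30
features = U[:, :k] * s[:k]    # U_k Sigma_k, shape (6, k)
projected = A @ Vt[:k].T       # same features via projection onto V_k

print(features.shape)
print(np.allclose(features, projected))
```
          </preformat>
          <p>The columns of U are eigenvectors of A times its transpose, with eigenvalues equal to the squared singular values, which matches the description above.</p>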
          <p>The second technique used for feature extraction
is NMF. It factorizes a matrix into the product of
two matrices, W and H, neither of which
contains negative elements. The TDM is
passed as the input to NMF, which generates a
list of topics. These topics act as a basis for
representing the original dataset.</p>
          <p>FP: 443, 470, 475, 447, 475, 475, 474, 437, 495, 477, 466, 496, 496, 461</p>
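          <p>The NMF factorization described in this section can be sketched with scikit-learn. The assumptions are as follows: the count matrix is synthetic, 3 components are used instead of the paper's 30, and the initialization and iteration settings are illustrative choices that the paper does not specify. W holds the per-document topic weights used as features and H holds the topics expressed over the terms.</p>
          <preformat>
```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic non-negative count matrix: 6 documents x 8 terms (hypothetical).
rng = np.random.default_rng(0)
counts = rng.integers(0, 4, size=(6, 8)).astype(float)

# counts is approximated by W @ H with non-negative W and H.
nmf = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(counts)   # document-by-topic features, shape (6, 3)
H = nmf.components_             # topic-by-term basis, shape (3, 8)

print(W.shape, H.shape)
print("smallest entry of W:", W.min())
print("smallest entry of H:", H.min())
```
          </preformat>
          <p>Unseen documents can be mapped onto the same topics with nmf.transform, which keeps H fixed.</p>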
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Results</title>
      <p>Although the datasets provided are highly imbalanced, the models still
give considerably high classification accuracy. The
following tables list the performance of each
classical machine learning technique applied to the
formulated binary classification problem of deciding whether
an email is phishing or legitimate. The results in Tables 3, 4
and 5 are obtained on the training data
using the sklearn train-test split,
where 33% of the training data is held out for validating
the result and the rest is used for training the model. In
these results, Random Forest outperformed
all other techniques on the training data set. Test data
results are provided in Tables 6 and 7: Table 6 gives
the results for classification using TDM with SVD for
both subtasks, and Table 7 gives the results for
classification using TDM with NMF for both subtasks.
The shared task organizers provided the true positive
(TP), true negative (TN), false positive (FP) and false
negative (FN) values for the test dataset; these are listed
in Tables 6 and 7 along with accuracy, precision, recall
and F1-score, which are computed from the TP, TN, FP
and FN values using the following equations:
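        <disp-formula><label>(1)</label><tex-math>\mathrm{accuracy} = \frac{tp + tn}{tp + fp + tn + fn}</tex-math></disp-formula>
        <disp-formula><label>(2)</label><tex-math>\mathrm{precision} = \frac{tp}{tp + fp}</tex-math></disp-formula>
        <disp-formula><label>(3)</label><tex-math>\mathrm{recall} = \frac{tp}{tp + fn}</tex-math></disp-formula>
        <disp-formula><label>(4)</label><tex-math>\mathrm{f1\text{-}score} = \frac{2\,tp}{2\,tp + fp + fn}</tex-math></disp-formula>
      </p>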
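      <p>The four metrics can be computed directly from the confusion counts. The helper below is a small sketch; the function name and the example counts are hypothetical and are not taken from the result tables. It assumes the denominators are non-zero.</p>
      <preformat>
```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score from confusion counts.

    Follows equations (1) to (4); assumes no denominator is zero.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_score = (2 * tp) / (2 * tp + fp + fn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1_score": f1_score,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, tn=5, fp=3, fn=2)
for name, value in metrics.items():
    print(name, round(value, 4))
```
      </preformat>
      <p>On imbalanced data, accuracy alone can look high even when the minority class is poorly detected, which is why precision, recall and F1-score are reported alongside it.</p>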
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>The paper focuses on phishing email detection, which
is a major threat in the present scenario. For both
subtasks, the numeric representation of the data is obtained
using TDM with SVD and TDM
with NMF. These representations are followed by
classical machine learning techniques that
classify an email as phishing or
legitimate. One drawback of the current model is
that the proposed mechanism relies on feature
selection, which requires domain knowledge. To overcome
this issue, deep learning models can be incorporated,
which can learn more complex patterns from the raw
data and use them as more effective features;
this can be considered possible future work.
In addition, both subtasks belong to the
unconstrained category, which allows external datasets to be
used for training. The datasets provided
in the subtasks are highly imbalanced, yet we achieve a
considerably high phishing email detection rate in both
subtasks without using
datasets from any external sources. Thus,
the phishing email detection rate of the proposed
architecture could be enhanced by adding
data from external sources to the data provided in
the shared task. This will be considered one of the
significant directions for future work.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This research was supported in part by Paramount
Computer Systems. We are grateful to NVIDIA
India for the GPU hardware support provided to the research
grant. We are also grateful to the Computational Engineering
and Networking (CEN) department for encouraging
the research.</p>
      <p>[BVP] Barathi Ganesh Hullathy Balakrishnan,
Anand Kumar Madasamy
Vinayakumar, and Soman Kotti Padannayil. NLP
CEN Amrita@SMM4H: Health care text
classification through class embeddings.</p>
      <p>[CM01] Xavier Carreras and Lluis Marquez.
Boosting trees for anti-spam email
filtering. arXiv preprint cs/0109015, 2001.</p>
      <p>[CXMX05] Zhan Chuan, Lu Xianliang, Hou
Mengshu, and Zhou Xu. A LVQ-based
neural network anti-spam email approach.
ACM SIGOPS Operating Systems
Review, 39(1):34–39, 2005.</p>
      <p>[DWV99a] Harris Drucker, Donghui Wu, and
Vladimir N Vapnik. Support
vector machines for spam categorization.
IEEE Transactions on Neural Networks,
10(5):1048–1054, 1999.</p>
      <p>[DWV99b] Harris Drucker, Donghui Wu, and
Vladimir N Vapnik. Support
vector machines for spam categorization.
IEEE Transactions on Neural Networks,
10(5):1048–1054, 1999.</p>
      <p>[EDB+18] Ayman Elaassal, Avisha Das, Shahryar
Baki, Luis De Moraes, and Rakesh
Verma. IWSPA-AP: Anti-phising shared
task at ACM International Workshop on
Security and Privacy Analytics. In
Proceedings of the 1st IWSPA
Anti-Phishing Shared Task. CEUR, 2018.</p>
      <p>[EDMB+18] Ayman Elaassal, Luis De Moraes,
Shahryar Baki, Rakesh Verma, and
Avisha Das. IWSPA-AP shared task email
dataset, 2018.</p>
      <p>[Faw03] Tom Fawcett. In vivo spam
filtering: a challenge problem for KDD.
ACM SIGKDD Explorations Newsletter,
5(2):140–148, 2003.</p>
      <p>[Gee03] Kevin R Gee. Using latent semantic
indexing to filter spam. In Proceedings of
the 2003 ACM Symposium on Applied
Computing, pages 460–464. ACM, 2003.</p>
      <p>[KMAH04] Svetlana Kiritchenko, Stan Matwin, and
Suhayya Abu-Hakima. Email classification
with temporal features. In
Intelligent Information Processing and Web
Mining, pages 523–533. Springer, 2004.</p>
      <p>[MW04] Tony A Meyer and Brendon
Whateley. SpamBayes: Effective open-source,
Bayesian based, email classification
system. In CEAS. Citeseer, 2004.</p>
      <p>[NVK+15] Maryam M Najafabadi, Flavio
Villanustre, Taghi M Khoshgoftaar,
Naeem Seliya, Randall Wald, and
Edin Muharemagic. Deep learning
applications and challenges in big data
analytics. Journal of Big Data, 2(1):1,
2015.</p>
      <p>[SDHH98]</p>
      <p>[SVKS15]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [ANNWN07]
          <string-name>
            <given-names>Saeed</given-names>
            <surname>Abu-Nimeh</surname>
          </string-name>
          , Dario Nappa,
          <string-name>
            <given-names>Xinlei</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Suku</given-names>
            <surname>Nair</surname>
          </string-name>
          .
          <article-title>A comparison of machine learning techniques for phishing detection</article-title>
          .
          <source>In Proceedings of the anti-phishing working groups 2nd annual eCrime researchers summit</source>
          , pages
          <fpage>60</fpage>
          –
          <lpage>69</lpage>
          . ACM,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>