<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Tempe, Arizona,
USA</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Machine Learning Approach towards Phishing Email Detection: CEN-Security@IWSPA 2018</article-title>
      </title-group>
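      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Harikrishnan NB</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vinayakumar R</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soman KP</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Computational Engineering and Networking (CEN), Amrita School of Engineering</institution>
          ,
          <addr-line>Coimbatore, Amrita Vishwa Vidyapeetham</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>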
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>2</volume>
      <fpage>1</fpage>
      <lpage>03</lpage>
      <abstract>
        <p>Email is a platform for communicating and exchanging ideas. Today email plays a key role irrespective of the field, and phishing emails are one of the major threats. These emails appear legitimate but lead users to malicious sites. As a result, the user, organization, or institution falls prey to online predators. Several statistical methods have been applied to tackle this problem. In this paper we use a distributional representation, namely TF-IDF, for the numeric representation of phishing emails. We also present a comparative study of classical machine learning techniques such as Random Forest, AdaBoost, Naive Bayes, Decision Tree, and SVM.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In today's world communication plays a key role in all
aspects of life. Email is a common platform used for fast and
efficient communication and has become an indispensable part of
everyday life. With the advance of digitization, dependency on
email increases day by day. This increasing dependency calls for
a way to manage the huge volume of emails. The emails exchanged
include important emails as well as phishing emails. Phishing
emails often lead to malicious websites and result in users
sharing personal details with attackers. To thwart such
situations, spam and phishing email classifiers are widely used.
Blacklisting, which falls under the category of list-based
filters, is a popular method for thwarting phishing emails. It
works by blocking emails from senders that appear in the
blacklist. The blacklist consists of records of the IP addresses
and email addresses of malicious users. When a new email arrives,
the spam and phishing email filter checks its IP address and
email address against those in the blacklist and decides whether
the email has to be marked as phishing or not. Other list-based
filters include the whitelist, which allows emails from senders
provided by the user. Other popular methods are filters based on
content, including word-based filters, heuristic filters, and
Bayesian filters. Word-based filters block emails that contain
certain specific words. The main drawback of this method is its
failure to classify new malicious emails: human intervention is
required to update the list.</p>
      <p>Phishing email is a common name for spam emails that have
malicious intentions. Phishing emails are a potential danger,
especially to multinational companies, the banking sector, and
even hospitals. Hackers also use phishing emails to inject
malware into a system. The recent ransomware attacks [KRB+15]
are a clear example of this. These phishing emails appear
legitimate but contain malicious content that can steal valuable
details such as account numbers and credit/debit card details.
A model is therefore needed that can detect and classify
phishing emails efficiently. The traditional methods rely on
human intervention, which calls for automation in recognizing
emails as either phishing or not. For this reason research has
moved in the direction of machine learning and deep learning.</p>
      <p>Recent developments in machine learning and deep learning
have shown promising results in computer vision, natural
language processing, cyber security, and other fields. Taking
this into account, we use machine learning models such as
Decision Tree, Logistic Regression, Random Forest, Naive Bayes,
KNN, AdaBoost, and SVM to classify emails as either phishing or
legitimate. The proposed method uses SVD (Singular Value
Decomposition) and NMF (Non-negative Matrix Factorization) for
feature extraction and dimensionality reduction. We use TF-IDF
(Term Frequency Inverse Document Frequency) for the numeric
representation of words.</p>
      <p>The paper is structured as follows: Section 2 presents
related work, Section 3 describes the dataset, Section 4
highlights the methodology used, and Sections 5, 6, and 7
present the results, conclusion, and acknowledgement
respectively.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Phishing email detection can be treated as a sub-problem of
spam detection, which has been a rich area of research for
several years. [AKCS00], [Sch03], and [CL06] are examples of
earlier work on anti-spam filters. Comparatively less work has
been done specifically on phishing email detection. The dataset
most commonly used in phishing email research is PhishingCorpus
[Naz10], [SVKS15], [BVP]. PhishingCorpus consists of a group of
hand-screened emails [GNN11], which makes the dataset
challenging. The existing learning-based approaches are
presented in a structured overview in [BB08]. Currently, various
experts are tackling phishing email classification from the
perspective of text classification [BB08]. The authors of
[CNU06] performed phishing email detection by identifying
structural features in the emails and passing these features to
an SVM. [BCP+08] proposed two methods for classifying emails:
adaptive Dynamic Markov Chains (DMC) and a latent class-topic
model. The adaptive Dynamic Markov Chains gave performance
similar to the standard version while using two-thirds less
memory. [ANNWN07] proposed machine learning models such as
logistic regression, SVM, and random forest for classifying
emails as either spam or legitimate. [AGA+13] describes the
types of phishing attacks and their classification, but does not
explore the available datasets and feature engineering
techniques. Researchers have also analyzed the classification of
emails based on their contents. This paper uses the TF-IDF
representation followed by dimensionality reduction to capture
the major contributing factors in the dataset and to reduce the
computational cost. The result is then passed to classical
machine learning techniques for classifying the data as either
legitimate or phishing. Researchers have also applied deep
learning techniques to classify URLs as benign or malicious
[VSP18b], [VSP18a]. In [VSPSK18] and [VSP17] the authors used
deep learning techniques to classify and evaluate domain
generation algorithms.</p>
    </sec>
    <sec id="sec-3">
      <title>Dataset description</title>
      <p>The shared task consists of two tasks. Task 1 is emails
with headers and Task 2 is emails with no headers. The dataset
details [EDMB+18], [EDB+18] are provided in the table below:</p>
      <p>Given a set of emails represented as
<inline-formula><tex-math>D = [e_1, e_2, \ldots, e_n]</tex-math></inline-formula>
and the corresponding labels
<inline-formula><tex-math>C = [c_1, c_2, \ldots, c_n]</tex-math></inline-formula>,
where each label is either 0 or 1, the machine learning model
learns the patterns that map the training data to the
corresponding labels. After learning, the model is used to
predict the labels for the test data.</p>
      <p>To represent the data in numeric format we used the TF-IDF
(Term Frequency Inverse Document Frequency) representation for
both tasks. TF-IDF represents the importance of a word in a
corpus. The TF-IDF representation is followed by SVD/NMF for
feature extraction and dimensionality reduction. We used a
train-test split and held out 33% of the training data as
validation data for evaluating the performance of the model.</p>
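      <p>The representation and split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the toy corpus, the labels, and the variable names are hypothetical, and the vocabulary is fitted on the training split only, which is an assumption of this sketch rather than a statement about the original pipeline.</p>
      <preformat>
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus; 1 = phishing, 0 = legitimate.
emails = [
    "verify your account now by clicking this link",
    "meeting moved to monday at ten",
    "your password expires today, confirm your login details",
    "please find the quarterly report attached",
    "urgent: update your bank details to avoid suspension",
    "lunch on friday with the project team",
]
labels = [1, 0, 1, 0, 1, 0]

# Hold out 33% of the training data for validation.
train_texts, val_texts, y_train, y_val = train_test_split(
    emails, labels, test_size=0.33, random_state=0, stratify=labels
)

# Fit the TF-IDF vocabulary on the training split (assumption of this
# sketch) and reuse the same vocabulary for the validation split.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_val = vectorizer.transform(val_texts)

print(X_train.shape, X_val.shape)
```
      </preformat>
      <p>Both matrices share the same number of columns because the validation emails are transformed with the vocabulary learned from the training emails.</p>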
      <p>We evaluated the performance of the TF-IDF representation
and the TF-IDF + SVD/NMF representation on the validation data.
For TF-IDF + SVD/NMF the rank is taken as 30, i.e., the train
and test data matrices have 30 columns after dimensionality
reduction. The performance of TF-IDF + SVD/NMF with 30 columns
was similar to the performance of the plain TF-IDF
representation on the validation data. This numeric
representation of the data is passed to different machine
learning algorithms.</p>
      <p>Data representation for emails with headers:</p>
      <list list-type="order">
        <list-item>
          <p>TF-IDF representation of the data. The vocabulary is built using the train and test data.</p>
        </list-item>
        <list-item>
          <p>SVD/NMF for feature extraction and dimensionality reduction.</p>
        </list-item>
        <list-item>
          <p>Step 2 is followed by applying classical ML techniques such as Decision Tree, Random Forest, AdaBoost, KNN, and SVM.</p>
        </list-item>
      </list>
      <p>Data representation for emails with no headers:</p>
      <list list-type="order">
        <list-item>
          <p>Data preprocessing: the number of '@' and '#' symbols in each data sample is counted, and the '@' and '#' symbols are then removed from the original corpus.</p>
        </list-item>
        <list-item>
          <p>TF-IDF representation of the data, followed by appending the '@' count and the '#' count.</p>
        </list-item>
        <list-item>
          <p>SVD/NMF for feature extraction and dimensionality reduction.</p>
        </list-item>
        <list-item>
          <p>Step 3 is followed by applying classical ML techniques such as Decision Tree, Random Forest, AdaBoost, KNN, and SVM.</p>
        </list-item>
      </list>
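      <p>The preprocessing for emails with no headers can be sketched as below. The sample emails and variable names are hypothetical, and removing the symbols by replacing them with spaces is one plausible reading of the description, not a confirmed detail of the original implementation.</p>
      <preformat>
```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical email bodies.
emails = [
    "contact admin@example.com now #urgent #verify",
    "see you at the meeting tomorrow",
    "reset here: help@example.org",
]

# Count the symbols in each sample before removing them from the text.
at_counts = np.array([text.count("@") for text in emails])
hash_counts = np.array([text.count("#") for text in emails])

# Remove the symbols from the corpus (assumed interpretation).
cleaned = [text.replace("@", " ").replace("#", " ") for text in emails]

# TF-IDF representation of the cleaned text.
tfidf = TfidfVectorizer().fit_transform(cleaned)

# Append the two count columns to the TF-IDF matrix.
counts = csr_matrix(np.column_stack([at_counts, hash_counts]))
features = hstack([tfidf, counts]).tocsr()

print(features.shape)
```
      </preformat>
      <p>The resulting matrix has two more columns than the TF-IDF matrix, one for each symbol count, and can be passed on to SVD or NMF.</p>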
      <p>In this paper we used classical machine learning
techniques such as Decision Tree, K-Nearest Neighbors, Logistic
Regression, Naive Bayes, Random Forest, and SVM. The following
metrics are used to assess performance:</p>
      <sec id="sec-3-1">
        <title>1. Accuracy</title>
      </sec>
      <sec id="sec-3-2">
        <title>2. Precision</title>
      </sec>
      <sec id="sec-3-3">
        <title>3. Recall</title>
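        <p>The three metrics listed above (accuracy, precision, recall) can be computed with scikit-learn as in the following sketch. The label vectors are hypothetical and chosen only so that the counts are easy to follow (3 true positives, 3 true negatives, 1 false positive, 1 false negative).</p>
        <preformat>
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 1 = phishing, 0 = legitimate.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all samples
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

print(accuracy, precision, recall)
```
        </preformat>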
        <p>The techniques used for feature extraction and
dimensionality reduction are NMF and SVD. [LS99] describes
Non-negative Matrix Factorization in detail. The TF-IDF matrix
is passed as input to NMF and a group of topics is generated.
Each topic represents a weighted set of co-occurring terms. The
identified topics act as a basis that provides an efficient
representation of the original corpus. NMF is useful when the
data has a large number of attributes, and it is used here as a
feature extraction technique.</p>
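        <p>A minimal sketch of NMF applied to a TF-IDF matrix is given below. The four documents are hypothetical, and two components are used instead of the rank of 30 used in this paper because the toy corpus is tiny; the initialisation and iteration settings are illustrative choices.</p>
        <preformat>
```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus with two rough themes.
docs = [
    "verify your account password now",
    "confirm your bank account password",
    "team meeting agenda for monday",
    "monday meeting notes and agenda",
]

tfidf = TfidfVectorizer().fit_transform(docs)

# The paper uses rank 30; 2 suits this toy corpus.
n_topics = 2
nmf = NMF(n_components=n_topics, init="nndsvda", random_state=0, max_iter=500)

W = nmf.fit_transform(tfidf)  # document-topic weights, (n_docs, n_topics)
H = nmf.components_           # topic-term weights, (n_topics, n_terms)

print(W.shape, H.shape)
```
        </preformat>
        <p>The rows of W are the reduced features that would be passed to a classifier, and each row of H is one topic expressed as non-negative weights over the vocabulary.</p>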
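        <p>Singular value decomposition (SVD) decomposes the TF-IDF
matrix <inline-formula><tex-math>A</tex-math></inline-formula>
into three matrices,
<inline-formula><tex-math>A = U \Sigma V^{T}</tex-math></inline-formula>.
The columns of <inline-formula><tex-math>U</tex-math></inline-formula>
are the orthonormal eigenvectors of
<inline-formula><tex-math>A A^{T}</tex-math></inline-formula>,
<inline-formula><tex-math>\Sigma</tex-math></inline-formula> is a
diagonal matrix whose diagonal entries are the singular values,
and the rows of <inline-formula><tex-math>V^{T}</tex-math></inline-formula>
are the orthonormal eigenvectors of
<inline-formula><tex-math>A^{T} A</tex-math></inline-formula>.
SVD is a powerful tool with many applications in signal
processing and image processing. It is mainly used for
dimensionality reduction and for representing important
features. The product
<inline-formula><tex-math>U \Sigma</tex-math></inline-formula>
is used for extracting the features. In all cases the rank is
taken as 30, so the train and test matrices shrink to (number of
data samples x 30). These extracted features are passed to
different classical machine learning techniques.</p>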
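        <p>The decomposition and the rank truncation can be illustrated with NumPy as follows. The random matrix is only a stand-in for a TF-IDF matrix, and a rank of 2 replaces the rank of 30 used in this paper so that the example stays small; the variable names are hypothetical.</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 5))  # stand-in for a TF-IDF matrix (documents x terms)

# Thin SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)
reconstructed = U @ np.diag(s) @ Vt

# Keep the top-k singular values; the paper uses k = 30.
k = 2
features = U[:, :k] * s[:k]  # first k columns of U times Sigma

# Equivalent view: project A onto the top-k right singular vectors.
projected = A @ Vt[:k].T

print(features.shape)
```
        </preformat>
        <p>Projecting new data onto the same right singular vectors is what allows the test matrix to be reduced to the same number of columns as the training matrix.</p>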
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>This section reports the accuracy, precision, recall, and
F1-score with respect to the training data. The following tables
describe the performance of each classical machine learning
technique on the binary classification problem of detecting
whether an email is phishing or legitimate. We used train-test
split (scikit-learn) to split the training data into training
and validation sets, with 33% of the training data used for
validation. Tables 3 and 4 present the validation metrics for
sub-task 1 (no header) and sub-task 2 (with header). The results
in Tables 3 and 4 correspond to the TF-IDF representation of the
data. Similarly, Tables 5 and 6 present the evaluation metrics
on the validation data for sub-task 1 (no header) and sub-task 2
(with header), respectively, with the TF-IDF + SVD/NMF
representation. In terms of training accuracy, Decision Tree and
Random Forest outperformed the other classifiers in almost all
cases. The results in Tables 3, 4, 5, and 6 show that the TF-IDF
and TF-IDF + SVD/NMF representations perform almost the same,
which motivates the use of dimensionality reduction. Since 30
singular values are used, the pre-processed data set has size
(number of rows, 30). Tables 7 and 8 present the metrics for the
test set. Table 7 presents the metrics for the TF-IDF + SVD
representation on the sub-task 1 and 2 test sets, and Table 8
presents the metrics for the TF-IDF + NMF representation on the
sub-task 1 and 2 test sets.</p>
      <p>F1-Score column of a results table (row labels not
recoverable): 0.835, 0.877, 0.821, 0.733, 0.837, 0.882, 0.936.</p>
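      <p>The comparison of training and validation accuracy discussed in this section can be sketched as follows. The data here is synthetic and imbalanced, generated only to stand in for the 30-column reduced features, so the printed numbers are not the paper's results; the class weights and model settings are assumptions of this sketch.</p>
      <preformat>
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the reduced feature matrix.
X, y = make_classification(
    n_samples=600, n_features=30, weights=[0.9, 0.1], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Record (training accuracy, validation accuracy) for each model.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    val_acc = accuracy_score(y_val, model.predict(X_val))
    scores[name] = (train_acc, val_acc)
    print(f"{name}: train={train_acc:.3f} validation={val_acc:.3f}")
```
      </preformat>
      <p>A training accuracy well above the validation accuracy is the usual sign of overfitting, which is why the validation split is reported alongside the training metrics.</p>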
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper we used TF-IDF + SVD and TF-IDF + NMF
representations followed by ML techniques for classifying emails
as either legitimate or phishing. Decision Tree and Random
Forest achieved the highest training accuracy. However, the test
results for Decision Tree and Random Forest indicate
overfitting, which we attribute to the highly imbalanced
datasets provided for both sub-tasks. Both sub-tasks belong to
the unconstrained category, which means that other data sets may
be used during training. Even though the tasks are
unconstrained, we did not use any external sources. Despite the
highly imbalanced data sets, we achieved a considerable phishing
email detection rate in both sub-tasks. The detection rate of
the proposed methodology could be improved by adding additional
data sources, which is one significant direction for future
work. Due to computational constraints, the authors could not
try deep learning based methods. This can also be taken up as
future work.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgement</title>
      <p>This research was supported in part by Paramount
Computer Systems. We are also grateful to NVIDIA India for the
GPU hardware support to the research grant, and to the
Computational Engineering and Networking (CEN) department for
encouraging the research.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref3">
        <label>[AGA+13]</label>
        <mixed-citation>Ammar Almomani, BB Gupta, Samer Atawneh, A Meulenberg, and Eman Almomani. <article-title>A survey of phishing email filtering techniques</article-title>. <source>IEEE Communications Surveys &amp; Tutorials</source>, 15(4):2070–2090, <year>2013</year>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <label>[AKCS00]</label>
        <mixed-citation>Ion Androutsopoulos, John Koutsias, Konstantinos V Chandrinos, and Constantine D Spyropoulos. <article-title>An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages</article-title>. In <source>Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>, pages 160–167. ACM, <year>2000</year>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <label>[ANNWN07]</label>
        <mixed-citation>Saeed Abu-Nimeh, Dario Nappa, Xinlei Wang, and Suku Nair. <article-title>A comparison of machine learning techniques for phishing detection</article-title>. In <source>Proceedings of the Anti-Phishing Working Groups 2nd Annual eCrime Researchers Summit</source>, pages 60–69. ACM, <year>2007</year>.</mixed-citation>
      </ref>
      <ref id="ref6">
        <label>[EDMB+18]</label>
        <mixed-citation>Ayman Elaassal, Luis De Moraes, Shahryar Baki, Rakesh Verma, and Avisha Das. <article-title>IWSPA-AP shared task email dataset</article-title>, <year>2018</year>.</mixed-citation>
      </ref>
      <ref id="ref7">
        <label>[GNN11]</label>
        <mixed-citation>Hugo Gonzalez, Kara Nance, and Jose Nazario. <article-title>Phishing by form: The abuse of form sites</article-title>. In <source>Malicious and Unwanted Software (MALWARE), 2011 6th International Conference on</source>, pages 95–101. IEEE, <year>2011</year>.</mixed-citation>
      </ref>
      <ref id="ref8">
        <label>[KRB+15]</label>
        <mixed-citation>Amin Kharraz, William Robertson, Davide Balzarotti, Leyla Bilge, and Engin Kirda. <article-title>Cutting the gordian knot: A look under the hood of ransomware attacks</article-title>. In <source>International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment</source>, pages 3–24. Springer, <year>2015</year>.</mixed-citation>
      </ref>
      <ref id="ref9">
        <label>[LS99]</label>
        <mixed-citation>Daniel D Lee and H Sebastian Seung. <article-title>Learning the parts of objects by non-negative matrix factorization</article-title>. <source>Nature</source>, 401(6755):788, <year>1999</year>.</mixed-citation>
      </ref>
      <ref id="ref10">
        <label>[Naz10]</label>
        <mixed-citation>J Nazario. <article-title>Phishingcorpus homepage</article-title>. ed: Retrieved February, <year>2010</year>.</mixed-citation>
      </ref>
      <ref id="ref11">
        <label>[Sch03]</label>
        <mixed-citation>Karl-Michael Schneider. <article-title>A comparison of event models for naive Bayes anti-spam e-mail filtering</article-title>. In <source>Proceedings of the Tenth Conference on European Chapter of the Association for Computational Linguistics, Volume 1</source>, pages 307–314. Association for Computational Linguistics, <year>2003</year>.</mixed-citation>
      </ref>
      <ref id="ref12">
        <label>[SVKS15]</label>
        <mixed-citation>Shriya Se, R Vinayakumar, M Anand Kumar, and KP Soman. <article-title>Amrita-cen@sail2015: sentiment analysis in Indian languages</article-title>. In <source>International Conference on Mining Intelligence and Knowledge Exploration</source>, pages 703–710. Springer, <year>2015</year>.</mixed-citation>
      </ref>
      <ref id="ref13">
        <label>[VSP17]</label>
        <mixed-citation>R Vinayakumar, KP Soman, and Prabaharan Poornachandran. <article-title>Deep encrypted text categorization</article-title>. In <source>Advances in Computing, Communications and Informatics (ICACCI), 2017 International Conference on</source>, pages 364–370. IEEE, <year>2017</year>.</mixed-citation>
      </ref>
      <ref id="ref14">
        <label>[VSP18a]</label>
        <mixed-citation>R Vinayakumar, KP Soman, and Prabaharan Poornachandran. <article-title>Detecting malicious domain names using deep learning approaches at scale</article-title>. <source>Journal of Intelligent &amp; Fuzzy Systems</source>, 34(3):1355–1367, <year>2018</year>.</mixed-citation>
      </ref>
      <ref id="ref1">
        <label>[VSP18b]</label>
        <mixed-citation>R Vinayakumar, KP Soman, and Prabaharan Poornachandran. <article-title>Evaluating deep learning approaches to characterize and classify malicious urls</article-title>. <source>Journal of Intelligent &amp; Fuzzy Systems</source>, 34(3):1333–1343, <year>2018</year>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <label>[VSPSK18]</label>
        <mixed-citation>R Vinayakumar, KP Soman, Prabaharan Poornachandran, and S Sachin Kumar. <article-title>Evaluating deep learning approaches to characterize and classify the dgas at scale</article-title>. <source>Journal of Intelligent &amp; Fuzzy Systems</source>, 34(3):1265–1276, <year>2018</year>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>