<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Machine Learning Based Phishing E-mail detection Security-CEN@Amrita</article-title>
      </title-group>
      <contrib-group>
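        <contrib contrib-type="author"><string-name>Nidhin A Unnithan</string-name><xref ref-type="aff" rid="aff0"/></contrib>
        <contrib contrib-type="author"><string-name>Harikrishnan NB</string-name><xref ref-type="aff" rid="aff0"/></contrib>
        <contrib contrib-type="author"><string-name>Akarsh S</string-name><xref ref-type="aff" rid="aff0"/></contrib>
        <contrib contrib-type="author"><string-name>Vinayakumar R</string-name><xref ref-type="aff" rid="aff0"/></contrib>
        <contrib contrib-type="author"><string-name>Soman KP</string-name><xref ref-type="aff" rid="aff0"/></contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Computational Engineering and Networking (CEN), Amrita School of Engineering</institution>
          ,
          <addr-line>Coimbatore</addr-line>
          ,
          <institution>Amrita Vishwa Vidyapeetham</institution>
          ,
          <country country="IN">India</country>
        </aff>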
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <volume>521</volume>
      <issue>7553</issue>
      <abstract>
        <p>Phishing email is a significant threat in today's world. The rate at which phishing emails are generated is increasing tremendously day by day. It is high time to deploy a self-learning system that detects and prevents phishing email efficiently and within a bounded time. This work proposes a system which uses a term document matrix as the feature engineering mechanism and classical machine learning techniques for distinguishing phishing emails from legitimate ones. The system also incorporates domain knowledge and lexical features as part of the feature engineering mechanism. The efficiency of the system is compared across different classical machine learning techniques. Based on accuracy, we propose the best model that solves the formulated problem efficiently.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Email plays an important part in everybody's life. It
is one of the easiest and most effective means of
transferring messages and files. Even though there are many
modes of communication, the popularity of e-mail has
not diminished, as it is considered one of the safest
and fastest ways to transfer messages over networks and is an
inexpensive method of communication.</p>
      <p>E-mail usage has increased tremendously
compared with previous decades. In 2017 there were
nearly 4.8 billion people using email, and it is
estimated that by 2021 this will rise to 5.6
billion users, as email is considered the main medium
for transferring messages over other apps. The main
problem with email is the presence of phishing mails.
These phishing mails are unwanted mails which may
carry malware, fraud schemes, advertisements, etc.
Compared with previous years, phishing mails have
increased and have caused serious damage to
businesses, corporations, individuals and economies.
Detecting fraudulent/phishing emails precisely is essential;
extracting and analyzing these mails can reveal
complex and interesting patterns, from which a company can make
appropriate decisions to block phishing
mails. During the early stages of communication via
email, clear rules were followed. Nowadays, because
of the diversity of email services such as Microsoft
Outlook, Mozilla Thunderbird and Google's Gmail, mails
are grouped into conversations and clients attempt to hide
quoted parts in order to improve readability.</p>
      <p>Copyright © by the paper's authors. Copying permitted for
private and academic purposes.</p>
      <p>One type of spam mail which is hazardous to users
is phishing mail. A phishing mail disguises
itself as a legitimate mail but, once opened, can
steal our data without our knowledge. Thus
distinguishing phishing mails from other spam mails is very important.
One way to protect our data from phishing mail is to
add a secondary password to the login credentials.
Another way is to alert the user whenever a phishing mail
tries to steal data.</p>
      <p>In [SAZ+15], Sami S. et al. proposed a model for
detecting phishing emails that relies on a preprocessing
technique which extracts different parts of an email as
features. The extracted features are fed into a J48
classification algorithm to perform classification. In
[SZL+15], the authors considered meaningless tokens and new
pages as the feature set and selected, from the initial
feature set, some features that have better predictability.
They provide the O(1) complexity as
an evaluation method for each feature set to evaluate
its predictive ability. In [KK15], Sukhjeel
Kaui et al. used a genetic algorithm for the detection
of phishing webpages, and for categorizing pages they
preferred a filter function. Lv Fang et al. [FBJ+15]
propose a solution to overcome the time lag in
detecting phishing websites: they detect phishing
websites by analyzing peculiarities in their
WHOIS and URL information. In
[VSP18b, VSP18a], deep learning methods were
employed to detect malicious URLs and domains.
Binay Kumar et al. used HTML contents for detecting
email phishing in [KKMK15]. Rachna Dhamija
et al. in [TC09] mainly concentrated on understanding
which phishing activities work during an attack
and why; for that they used a large set of data
containing reported phishing activities. Fergus
Toolan et al. took a different approach: they used
only five features for classification, with
the C5.0 algorithm, which has higher precision
compared to other algorithms. Mayank Pandey et al. in
[PR12] used different types of classification methods
such as Multilayer Perceptron (MLP), Decision Trees
(DT), Support Vector Machine (SVM), Group Method
of Data Handling (GMDH), Probabilistic Neural Net
(PNN), Genetic Programming (GP) and Logistic
Regression (LR). Lew May Form et al. in [FCT+15]
proposed a method which uses hybrid features for
detecting phishing emails. The features are called hybrid
because they combine URL based, behavior
based and content based features. They achieved
an overall accuracy of 97.25% with an error
percentage of 2.75%.</p>
      <p>Even though there are different ways to detect
phishing, [DAY+15] gives an overall evaluation of the
different classifiers used for phishing detection.
Recently, count based representations combined with
domain level features and integrated with machine learning
techniques have been used for classifying phishing mails and
legitimate mails [EDB+18, BMS08]. The proposed
methodology uses a feature engineering approach
combined with deep learning, which is one of the significant
directions in which the field is moving because it has
performed well in most text classification tasks
[LBH15] and even in phishing detection [LNRW, EC].</p>
      <p>The rest of the paper is organized as follows.
Section 2 discusses the background details of email
representation and the machine learning algorithms.
Section 3 includes the description of the data set,
experiments and proposed architecture. Section 4 includes
results. The conclusion is placed in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>This section discusses the mathematical details of
the traditional machine learning algorithms and of
vector space modeling techniques such as
TF-IDF and bag of words.</p>
      <p>Logistic Regression is a classification algorithm which is used to
separate the data into different classes. It can be
binomial, ordinal or multinomial. In binary Logistic
Regression the outcome is classified
into 0 or 1, whereas in multinomial Logistic Regression the
outcome can fall into more than two classes. The
activation function used for this is the sigmoid
function, which is defined as follows:
<disp-formula id="eq-1"><label>(1)</label><tex-math><![CDATA[\sigma(x) = \frac{1}{1 + \exp(-w^{T} x)}]]></tex-math></disp-formula></p>
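      <p>As a minimal sketch of how the sigmoid function turns a weighted sum of features into a class decision, the example below evaluates it for one feature vector. The weights, the feature values and the 0.5 decision threshold are illustrative assumptions, not values from this work.</p>
      <preformat>
```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, x):
    """Probability of class 1 for weight vector w and feature vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x))  # inner product w^T x
    return sigmoid(z)


# Hypothetical weights and features, for illustration only.
w = [0.8, -1.2, 0.5]
x = [1.0, 0.0, 2.0]

p = predict_proba(w, x)
label = 1 if p >= 0.5 else 0  # assumed threshold of 0.5
print(p, label)
```
      </preformat>
      <p>In practice the weight vector is learned from training data; here it is fixed by hand only to show the arithmetic.</p>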
      <sec id="sec-2-1">
        <title>Naive Bayes</title>
        <p>Naive Bayes is a family of supervised learning
algorithms which work on the principle of Bayes'
theorem. This theorem is based on conditional probability, by
which the probability of events is calculated. Binary
and multi-class classification are done using different
variants such as GaussianNB, MultinomialNB and
BernoulliNB [MN+98]. For this problem we used
MultinomialNB from scikit-learn as our algorithm.</p>
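        <p>To make the multinomial variant concrete, the sketch below implements it from scratch on a toy corpus. It is not the scikit-learn implementation used in this work; the Laplace smoothing constant alpha = 1, the function names and the four example mails are assumptions chosen for illustration.</p>
        <preformat>
```python
import math
from collections import Counter


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit class log-priors and smoothed per-token log-likelihoods."""
    vocab = sorted({tok for doc in docs for tok in doc})
    model = {}
    for c in set(labels):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        counts = Counter(tok for d in class_docs for tok in d)
        total = sum(counts.values()) + alpha * len(vocab)
        model[c] = {
            "log_prior": math.log(len(class_docs) / len(docs)),
            "log_lik": {t: math.log((counts[t] + alpha) / total) for t in vocab},
        }
    return model


def predict(model, doc):
    """Return the class with the highest posterior log-score."""
    def score(c):
        params = model[c]
        # Tokens outside the training vocabulary are ignored.
        return params["log_prior"] + sum(
            params["log_lik"][t] for t in doc if t in params["log_lik"]
        )
    return max(model, key=score)


# Hypothetical toy data; each mail is a list of tokens.
docs = [
    "verify your password".split(),
    "urgent password reset".split(),
    "project meeting notes".split(),
    "lunch meeting tomorrow".split(),
]
labels = ["phishing", "phishing", "legitimate", "legitimate"]

model = train_multinomial_nb(docs, labels)
prediction = predict(model, "reset your password".split())
print(prediction)
```
        </preformat>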
      </sec>
      <sec id="sec-2-2">
        <title>Support Vector Machine</title>
        <p>SVM is a supervised classification algorithm which
builds a model that separates the data into two
classes. The SVM is defined according to the number
of classes. It is of two types: linear SVM
and non-linear SVM. The decision boundary for a linear
SVM is formulated as a hyperplane in feature space,
i.e. a linear function of the features. Non-linear SVMs
result in non-linear decision boundaries in the original
feature space. Of the different types of kernels
available, we used the radial basis function (RBF) for our SVM
model.</p>
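        <p>The RBF kernel mentioned above measures the similarity of two feature vectors as exp(-gamma * ||x - y||^2). The sketch below only evaluates this kernel; it does not train an SVM, and the value gamma = 0.5 is an assumed setting, since the kernel width used in this work is not stated.</p>
        <preformat>
```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """RBF similarity exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Identical vectors have distance 0, so the kernel value is 1.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0])

# Vectors further apart give values closer to 0.
far = rbf_kernel([0.0, 0.0], [1.0, 1.0])
print(same, far)
```
        </preformat>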
        <p>TF-IDF stands for term frequency-inverse document
frequency. Its weight is a statistical measure
which evaluates how important a word
is to a document, and it can in turn be used for
information retrieval and text mining. Term frequency
gives us an idea of how frequently a term occurs
in a document. It is defined as
<disp-formula id="eq-2"><label>(2)</label><tex-math><![CDATA[\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}]]></tex-math></disp-formula></p>
        <p>Inverse document frequency gives us an idea of
how important a term is. When we compute term
frequency, all terms are given equal importance,
whether a term is a stop word or a terminology word. We therefore
need to weight up terminology words, which are less
frequent than stop words, by
computing the inverse document frequency, given by
<disp-formula id="eq-3"><label>(3)</label><tex-math><![CDATA[\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}]]></tex-math></disp-formula>
where N is the total number of documents in the
corpus D.</p>
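        <p>TF-IDF can now be calculated as
<disp-formula id="eq-4"><label>(4)</label><tex-math><![CDATA[\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)]]></tex-math></disp-formula></p>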
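        <p>The three definitions above can be worked through on a small example. The sketch below follows them literally (no smoothing or normalisation); the three-document corpus is a made-up assumption, and the idf function assumes the term occurs in at least one document.</p>
        <preformat>
```python
import math
from collections import Counter


def tf(term, doc):
    """Term frequency: occurrences of term divided by all tokens in doc."""
    counts = Counter(doc)
    return counts[term] / len(doc)


def idf(term, corpus):
    """Inverse document frequency: log(N / number of docs containing term)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)


def tfidf(term, doc, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, corpus)


# Hypothetical toy corpus; each document is a list of tokens.
corpus = [
    "verify your password now".split(),
    "meeting agenda for monday".split(),
    "your account password expired".split(),
]

# "password" occurs in two documents, "verify" in only one,
# so "verify" receives the higher weight in the first document.
score_password = tfidf("password", corpus[0], corpus)
score_verify = tfidf("verify", corpus[0], corpus)
print(score_password, score_verify)
```
        </preformat>
        <p>Library implementations such as scikit-learn's TfidfVectorizer apply smoothing and normalisation by default, so their values will differ from this literal version.</p>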
        <p>Additionally, domain level features are added.
These include a list of the most commonly appearing words
and a list of special characters.</p>
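        <p>One plausible way to compute such domain level features is to count how often each listed word and special character occurs in a mail. The sketch below does this; the short word and character lists are a hypothetical subset, and the use of raw counts (rather than, say, binary indicators) is an assumption, since the exact encoding is not specified here.</p>
        <preformat>
```python
# Hypothetical subsets; the full word list is not enumerated in the text.
DOMAIN_WORDS = ["password", "fraudulent", "business"]
SPECIAL_CHARS = ["$", "#", "!", "(", "["]


def domain_features(text):
    """Count occurrences of each domain word and special character."""
    tokens = text.lower().split()
    word_counts = [tokens.count(w) for w in DOMAIN_WORDS]
    char_counts = [text.count(ch) for ch in SPECIAL_CHARS]
    return word_counts + char_counts


email = "Update your password now! Your business account ($) needs a new password"
features = domain_features(email)
print(features)
```
        </preformat>
        <p>The resulting vector can be appended to the TF-IDF vector of the same mail to form the combined input representation.</p>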
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <sec id="sec-3-1">
        <title>Dataset details</title>
        <p>The email phishing detection task is part of the
anti-phishing shared task at the 4th ACM
International Workshop on Security and Privacy
Analytics [EDMB+18]. Let E = [e1, e2, ..., en] be a set of email
samples and C = [c1, c2, ..., cn] the corresponding email types,
legitimate or phishing; the task was to classify each given email
sample as either legitimate or phishing. Two
data sets were used, one with headers and one without
headers. Data set statistics are given in
Table 1 for training and Table 2 for testing.</p>
        <p>We used a count based representation to create our
model. A diagrammatic representation of our
architecture is shown in Figure 1. The email samples from the
data set are first passed through a count based
representation, here TF-IDF, for word representation. This is
then combined with domain level features to obtain the
input representation for the machine learning
algorithms. The domain level features include the most
commonly appearing words (40 features), for example
password, fraudulent and business, and special characters like
$, #, !, (, [, &amp;, etc.; all stop words were
removed. The resulting features are then passed to Logistic
Regression, Naive Bayes and Support Vector Machine to
classify mails as phishing or legitimate.</p>
        <p>The model built using the above architecture was trained
on the data sets with headers and without headers.
We trained
a total of six models, one each for Logistic Regression,
Naive Bayes and Support Vector Machine for mails with
headers and without headers. We used 10-fold cross
validation on our training data, and the results obtained
are consolidated in Table 3. For the
data set without headers SVM gave the highest
accuracy, 94.3%, and for the data set with headers SVM
again gave the highest accuracy, 93.3%. We did not
extract any features from the headers;
doing so may increase the accuracy.
Our models were tested on the test data by the IWSPA-AP
Shared Task committee, and the corresponding results
for True Positive, True Negative, False Positive, False
Negative, Accuracy, Precision, Recall and F1 score for our
six models are summarized in Table 4.</p>
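        <p>The 10-fold cross validation mentioned above splits the training data into ten parts and uses each part once for validation while training on the other nine. The sketch below generates such index splits; contiguous, unshuffled and unstratified folds are an assumption, as is the sample count of 25, which is used only to keep the example small.</p>
        <preformat>
```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if extra > fold else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Hypothetical data set size, for illustration only.
folds = list(k_fold_indices(25, k=10))
for train_idx, test_idx in folds:
    # A model would be fitted on train_idx and scored on test_idx here.
    print(len(train_idx), len(test_idx))
```
        </preformat>
        <p>Averaging the per-fold scores gives the cross-validated estimate. With imbalanced classes a stratified split is often preferred, though the text does not say which variant was used.</p>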
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>This paper evaluated the performance of machine
learning based classifiers for distinguishing phishing
emails from legitimate ones. We created a model
using a count based representation combined with domain
level features as the word representation and passed it to
various machine learning techniques such as Logistic
Regression, Naive Bayes and Support Vector Machine to
classify whether a mail is phishing or legitimate. Both
sub tasks belong to the unconstrained category, i.e., any
data sets can be used during training, and the data sets for
both tasks were highly imbalanced. Even so,
we did not use any external data sources
and were still able to achieve a good detection rate for
phishing email in both sub tasks. By adding
additional data sources, the detection rate of the proposed
methodology could be increased considerably.</p>
      <sec id="sec-4-1">
        <title>Acknowledgements</title>
        <p>This research was supported in part by Paramount
Computer Systems. We are grateful to NVIDIA
India, for the GPU hardware support to the research
grant. We are grateful to Computational Engineering
and Networking (CEN) department for encouraging
the research.</p>
        <sec id="sec-4-1-1">
          <title>References</title>
          <p>[BMS08] Ram Basnet, Srinivas Mukkamala, and Andrew H Sung. Detection of phishing attacks: A machine learning approach. In Soft Computing Applications in Industry, pages 373-383. Springer, 2008.</p>
          <p>[DAY+15] Ammar Yahya Daeef, R Badlishah Ahmad, Yasmin Yacob, Naimah Yaakob, and Mohd Nazri Bin Mohd Warip. Phishing email classifiers evaluation: Email body and header approach. Journal of Theoretical and Applied Information Technology, 80(2):354, 2015.</p>
          <p>[EC] Louis Eugene and Isaac Caswell. Making a manageable email experience with deep learning.</p>
          <p>[EDB+18] Ayman Elaassal, Avisha Das, Shahryar Baki, Luis De Moraes, and Rakesh Verma. Iwspa-ap: Anti-phising shared task at ACM International Workshop on Security and Privacy Analytics. In Proceedings of the 1st IWSPA Anti-Phishing Shared Task. CEUR, 2018.</p>
          <p>[EDMB+18] Ayman Elaassal, Luis De Moraes, Shahryar Baki, Rakesh Verma, and Avisha Das. Iwspa-ap shared task email dataset, 2018.</p>
          <p>[FBJ+15] Lv Fang, Wang Bailing, Huang Junheng, Sun Yushan, and Wei Yuliang. A proactive discovery and filtering solution on phishing websites. In Big Data (Big Data), 2015 IEEE International Conference on, pages 2348-2355. IEEE, 2015.</p>
          <p>[FCT+15] Lew May Form, Kang Leng Chiew, Wei King Tiong, et al. Phishing email detection technique by using hybrid features. In IT in Asia (CITA), 2015 9th International Conference on, pages 1-5. IEEE, 2015.</p>
          <p>[KK15] Sukhjeel Kaui and Amrit Kaur. Detection of phishing webpages using weights computed through genetic algorithm. In MOOCs, Innovation and Technology in Education (MITE), 2015 IEEE 3rd International Conference on, pages 331-336. IEEE, 2015.</p>
          <p>[LNRW] Christopher Lennan, Bastian Naber, Jan Reher, and Leon Weber. End-to-end spam classification with neural networks.</p>
          <p>[MN+98]</p>
          <p>[PR12] Mayank Pandey and Vadlamani Ravi. Detecting phishing e-mails using text and data mining. In Computational Intelligence &amp; Computing Research (ICCIC), 2012 IEEE International Conference on, pages 1-6. IEEE, 2012.</p>
          <p>[SAZ+15]</p>
          <p>[SZL+15]</p>
          <p>[TC09]</p>
          <p>[VSP18a] R Vinayakumar, KP Soman, and Prabaharan Poornachandran. Detecting malicious domain names using deep learning approaches at scale. Journal of Intelligent &amp; Fuzzy Systems, 34(3):1355-1367, 2018.</p>
          <p>[VSP18b] R Vinayakumar, KP Soman, and Prabaharan Poornachandran. Evaluating deep learning approaches to characterize and classify malicious urls. Journal of Intelligent &amp; Fuzzy Systems, 34(3):1333-1343, 2018.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[KKMK15] <string-name><given-names>Binay</given-names> <surname>Kumar</surname></string-name>, <string-name><given-names>Pankaj</given-names> <surname>Kumar</surname></string-name>, <string-name><given-names>Ankit</given-names> <surname>Mundra</surname></string-name>, and <string-name><given-names>Shikha</given-names> <surname>Kabra</surname></string-name>. <article-title>Dc scanner: Detecting phishing attack</article-title>. In <source>Image Information Processing (ICIIP), 2015 Third International Conference on</source>, pages <fpage>271</fpage>-<lpage>276</lpage>. IEEE, <year>2015</year>.</mixed-citation>
      </ref>
      <ref id="ref2">
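        <mixed-citation>[LBH15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. <article-title>Deep learning</article-title>. <source>Nature</source>, <volume>521</volume>(<issue>7553</issue>), <year>2015</year>.</mixed-citation>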
      </ref>
    </ref-list>
  </back>
</article>