<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Tempe, Arizona,
USA</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Detecting Phishing E-mail using Machine learning techniques CEN-SecureNLP</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nidhin A Unnithan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harikrishnan NB</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vinayakumar R</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soman KP</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sai Sundarakrishna</string-name>
          <email>sai.sundarakrishna@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Caterpillar</institution>
          ,
          <addr-line>Bangalore</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>2</volume>
      <fpage>1</fpage>
      <lpage>03</lpage>
      <abstract>
        <p>The number of unsolicited (phishing) emails is increasing tremendously day by day. This calls for a reliable framework to filter out phishing emails. In the proposed work, we develop a supervised classifier for distinguishing phishing emails from legitimate ones. Term frequency-inverse document frequency (tf-idf) and Doc2Vec representations are built for legitimate and phishing emails and passed to various traditional machine learning classifiers for classification. The classifiers performed better with the Doc2Vec representation than with the tf-idf representation. We therefore conclude that the Doc2Vec representation is more appropriate for detecting and classifying phishing and legitimate emails.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Electronic mail (email) is one of the most effective and convenient means of transferring messages. It is inexpensive and is widely considered a safe way to transfer messages over networks. Although many other modes of message transfer exist, the popularity of email has not declined, especially in business, colleges, and other private and government sectors, because email is considered a safe means of message transfer. Email communication plays an important part in everyday life, and email usage has increased tremendously compared to earlier years. Nearly 4.8 billion people were using email in 2017, and estimates show that the number will rise to 5.6 billion users by 2021 [RH11]. The main problem with email, however, is phishing mail, which delivers malware and is used in fraud schemes, advertisements, etc. Email phishing has increased in recent years, and many security threats have evolved that cause serious damage to businesses, individuals, and economies. For business emails in particular, extracting and analyzing these communication networks can reveal interesting and complex patterns of processes and decision making within a company. Detecting fraudulent and phishing emails precisely in communication networks is therefore essential.</p>
      <p>Phishing mails are a type of spam mail that is hazardous to users. A phishing mail can steal a user's data without the user's knowledge once it is opened. Distinguishing phishing mails from other spam mails is therefore very important. One way to protect data from phishing mails is to add a secondary password to the login credentials. Another is to alert the user whenever a phishing mail tries to steal data.</p>
      <p>In the early stages of email communication, clear rules were followed [SHP08], but more recently the diversity of email programs and formatting standards has given users the freedom to edit and change quoted text. Despite these limitations, Symantec Brightmail (Sanz et al. [SHP08]) still shows good performance in detecting phishing emails. Moreover, it can keep track of the IP (Internet Protocol) addresses that sent phishing mails. Its performance was comparable to that of [MW04]. Email clients such as Microsoft Outlook and Mozilla Thunderbird, and online email services such as Gmail, usually group emails into conversations and attempt to hide quoted parts in order to improve readability.</p>
      <p>In 2011, 2.3 billion users were using email, a number that increased to about 4.3 billion by 2016 [RH11]. TREC has defined phishing as unwanted email sent indiscriminately [C+08]. Emails have thus been used for marketing and advertising purposes [CL98].</p>
      <p>Datasets such as the Enron corpus [KY04] or the Avocado corpus [OWKG15] provide real-world information about business communication and contain a mix of professional emails, personal emails, and phishing. Perer and Shneiderman [PS05] published parts of a personal email archive for research. A recent survey shows the diversity of email classification tasks alone [MSR+17]. Similarly, an interesting analysis of communication networks based on metadata such as sender, recipients, and time extracted from emails is discussed in [BCGJ11]. Models based on the written content of emails may be confused by automatically inserted text blocks or quoted messages. Working with real-world data therefore requires normalizing the data before solving the problem at hand. Rauscher et al. [RMA15] developed an approach to detect zones inside work-related emails where relevant business knowledge may be found. By finding overlapping text passages across the corpus, Jamison and Gurevych managed to resolve email threads of the Enron corpus almost perfectly [JG13]. Note that the claimed accuracy of almost 100% was tested on only 20 email threads. To reassemble email threads, Yeh et al. considered a similar approach with a more elaborate evaluation, reaching an accuracy of 98% in separating email conversations into parts [YWD05]. To do so, they rely on additional meta information in emails sent through Microsoft Outlook (thread index) and on rules that match specific client headers. Such an approach will therefore not work on arbitrary emails, nor can it handle different localizations or edits by the user. Although there are different ways to detect phishing, [DAY+15] gives an overall evaluation of different classifiers used for phishing detection. Recently, deep learning methods have also been used extensively for detecting phishing mails, as stated in [BMS08], and for detecting malicious URLs and domains, as stated in [VSP18b, VSP18a]. Domain generation algorithms, which can be used by malware families, were also classified using deep learning methods [VSPSK18].</p>
      <p>In this work, we propose a machine learning based
approach that extracts the underlying structure of email
text in order to overcome the problems of error-prone rule-based
approaches. This enables downstream tasks
to work with much cleaner data and with additional
information by focusing on particular parts. We also
show the performance improvements and flexibility
over previous work on similar tasks.
</p>
      <p>Term frequency-inverse document frequency (tf-idf) is a weighting scheme used in information retrieval. It reflects how important a word is to a document in a corpus. Tf-idf is also used as a weighting factor in text mining and user modeling. It gives less importance to words that occur in many documents of the corpus, which is why it is also used to filter stop words. Tf-idf is widely used in search engines. It can be calculated by the following equations:</p>
      <disp-formula>
        <label>(1)</label>
        <tex-math>\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}</tex-math>
      </disp-formula>
      <p>where N is the total number of documents in the corpus, and</p>
      <disp-formula>
        <label>(2)</label>
        <tex-math>\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)</tex-math>
      </disp-formula>
      <p>
Doc2Vec is an unsupervised learning algorithm that
gives a fixed-length vector representation of a variable-length
text. The text can be a sentence, a paragraph,
or a document. It is an extension of Word2Vec, which,
given vector representations of the context words
as input, predicts the word most likely to
accompany those context words. Word2Vec is motivated
by the observation that word vectors can be used to predict
the next word in a sentence given the context word vectors,
thereby capturing the semantics of the sentence even though
the word vectors are randomly initialized. In Doc2Vec, a
document vector is used in addition to the word vectors to
predict the next word given a context from a document.
Every document is represented by a unique vector, stored as
a column of the document matrix, and every word
is represented by a unique vector, stored as a column of the
word matrix. The next word in a context is predicted from the
concatenation or average of the document and word vectors.</p>
      <p>In Doc2Vec, the document vector is the same for all
contexts generated from the same document but differs across
documents. The word vector matrix, however, is shared across
documents, i.e., the same word has the same vector
representation in every document.</p>
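      <p>The averaging step described above can be sketched with a toy forward pass. This is only an illustration under stated assumptions, not the gensim implementation and not a training procedure: the vocabulary, the vector size, the random seed, and the names word_vecs, doc_vecs, out_weights and predict_next are all hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
import random

# Hypothetical toy setup: vocabulary, sizes and seed are illustrative only.
rng = random.Random(0)
vocab = ["verify", "your", "account", "meeting", "today"]
dim = 4


def rand_vec(n):
    """Randomly initialised vector, as at the start of training."""
    return [rng.uniform(-0.5, 0.5) for _ in range(n)]


# Word matrix: one vector per word, shared by every document.
word_vecs = {w: rand_vec(dim) for w in vocab}
# Document matrix: one vector per document.
doc_vecs = {"email_0": rand_vec(dim), "email_1": rand_vec(dim)}
# Output weights of the softmax layer: one row per vocabulary word.
out_weights = {w: rand_vec(dim) for w in vocab}


def predict_next(doc_id, context):
    """Average the document vector with the context word vectors and
    return a softmax distribution over the vocabulary."""
    vecs = [doc_vecs[doc_id]] + [word_vecs[w] for w in context]
    hidden = [sum(col) / len(vecs) for col in zip(*vecs)]
    scores = {w: sum(h * u for h, u in zip(hidden, out_weights[w]))
              for w in vocab}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


# Same context words, different documents: only the document vector changes.
p0 = predict_next("email_0", ["verify", "your"])
p1 = predict_next("email_1", ["verify", "your"])
```
]]></preformat>
      <p>During training, the document vectors, word vectors, and output weights would be updated so that the observed next word receives a high probability; the sketch shows only how the shared word vectors and the per-document vector are combined.</p>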
    </sec>
    <sec id="sec-2">
      <title>Machine Learning</title>
    </sec>
    <sec id="sec-3">
      <title>Decision Tree</title>
      <p>A decision tree is a supervised learning algorithm that
represents its output in graphical (tree) form. It maps
each element of the input domain to an element
of its range, which may be either discrete or
continuous. It is well suited to categorical variables. Each
split is chosen in such a way that it reduces
the variance of the target variable.</p>
      <p>The Naive Bayes classifier uses Bayes' theorem. Its defining
feature is the independence assumption, i.e., the value of one
feature within a class does not affect the other features.
A Naive Bayes model is easy to build and tends to perform well
when the feature dimension is high. Although it performs well
most of the time when the independence condition
holds, the independence assumption does not overcome the
problems related to dimensionality. It uses a
conditional probability model: an instance to be classified
is represented by a feature vector X = (x1, x2, ..., xn),
and the model yields the probabilities P(Ck | x1, x2, ..., xn)
for each of the k outcomes. Mathematically, it can be expressed as</p>
      <disp-formula>
        <label>(4)</label>
        <tex-math>P(C_k \mid x) = \frac{P(C_k)\, P(x \mid C_k)}{P(x)}</tex-math>
      </disp-formula>
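      <p>As a rough illustration of the Bayes rule stated above, the following sketch classifies tokenized emails with a multinomial Naive Bayes model. The toy corpus, the add-one (Laplace) smoothing, and the names log_posterior and classify are assumptions made for this example; they are not taken from the original experiments.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter

# Hypothetical toy corpus of tokenized emails; words and labels are made up.
train = [
    (["verify", "account", "password"], "phishing"),
    (["urgent", "verify", "bank"], "phishing"),
    (["meeting", "agenda", "today"], "legitimate"),
    (["project", "meeting", "notes"], "legitimate"),
]

class_counts = Counter(label for _, label in train)
word_counts = {c: Counter() for c in class_counts}
for tokens, label in train:
    word_counts[label].update(tokens)
vocab = {w for tokens, _ in train for w in tokens}


def log_posterior(tokens, c):
    """log P(C_k) + sum_i log P(x_i | C_k).

    P(x) is identical for every class, so it can be dropped when only
    the most probable class is needed.
    """
    total = sum(word_counts[c].values())
    score = math.log(class_counts[c] / len(train))
    for w in tokens:
        # Add-one smoothing (an assumption of this sketch) avoids log(0).
        score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    return score


def classify(tokens):
    """Return the class with the highest (unnormalised) log posterior."""
    return max(class_counts, key=lambda c: log_posterior(tokens, c))
```
]]></preformat>
      <p>For example, classify(["verify", "password"]) selects the phishing class because both words occur only in the phishing examples of the toy corpus.</p>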
    </sec>
    <sec id="sec-4">
      <title>AdaBoost</title>
      <p>AdaBoost is an iterative learning algorithm whose main
purpose is to boost the performance of a
learning algorithm. It is used mainly for classification. It
does so by forming a strong classifier from
a sequence of many weak classifiers. When
AdaBoost is combined with decision trees, it is often regarded
as the best out-of-the-box classifier. Apart from its speed
in classification, it has also been used as a feature learner.</p>
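      <p>The idea of combining weak classifiers into a strong one can be sketched as follows. The example uses one-dimensional decision stumps as weak learners on a hypothetical data set that no single stump can separate; the data, the function names, and the choice of three rounds are illustrative assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def stump_predict(x, threshold, polarity):
    """Weak classifier on a scalar feature: `polarity` left of the threshold."""
    return polarity if x < threshold else -polarity


def best_stump(xs, ys, weights):
    """Return (error, threshold, polarity) with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)) + [max(xs) + 1]:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    """Train `rounds` stumps, re-weighting the examples after each one."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = max(err, 1e-10)  # guard against division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Misclassified examples gain weight, correct ones lose weight.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical data: +1 on both ends, -1 in the middle.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
```
]]></preformat>
      <p>With a single round the ensemble is one stump and cannot fit the middle points; after three rounds the weighted vote classifies all six training points correctly.</p>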
    </sec>
    <sec id="sec-5">
      <title>Logistic Regression</title>
      <p>Logistic regression is used when the target variable is
categorical. It is based on maximum likelihood estimation (MLE)
and is a qualitative choice model. It is used to predict whether
a risk factor increases the odds of a given outcome
by a specific factor. Logistic regression can be used
to model binary classification problems. The
mathematical representation is given as</p>
      <disp-formula>
        <label>(5)</label>
        <tex-math>F(x) = \frac{1}{1 + \exp(-w^{T} x)}</tex-math>
      </disp-formula>
      <p>where F takes values in the range 0 to 1.</p>
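      <p>A minimal sketch of the logistic function and of its use as a binary decision rule is given below. The weight vector, the feature values, and the 0.5 decision threshold are hypothetical and chosen only for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def logistic(w, x):
    """F(x) = 1 / (1 + exp(-w^T x)) for weight vector w and feature vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def classify(w, x, threshold=0.5):
    """Map the probability to a label; the 0.5 threshold is an assumption."""
    return "phishing" if logistic(w, x) >= threshold else "legitimate"


# Hypothetical weights, not fitted to any data.
w = [2.0, -1.0]
```
]]></preformat>
      <p>In practice the weights w would be estimated by maximum likelihood on the training set; here they are fixed by hand so that the mapping from w and x to a probability is easy to follow.</p>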
      <p>The k-nearest neighbour (KNN) algorithm is one of the simplest
machine learning algorithms. It is known as a lazy learner because
the function is only approximated locally and all computation is
deferred until classification. It is sensitive to the local structure of
the data. The procedure estimates the local posterior
probability of each class as the average class
membership over the k nearest neighbours.</p>
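      <p>The posterior estimate described above can be sketched in a few lines. The two-dimensional toy points, the Euclidean distance, and the names knn_posterior and knn_classify are assumptions made for this example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def knn_posterior(train, query, k):
    """Estimate the class posterior at `query` as the fraction of the
    k nearest training points that belong to each class."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return {label: count / k for label, count in votes.items()}


def knn_classify(train, query, k):
    """Return the class with the largest estimated posterior."""
    posterior = knn_posterior(train, query, k)
    return max(posterior, key=posterior.get)


# Hypothetical 2-D feature vectors with labels.
train = [
    ((0.0, 0.0), "legitimate"),
    ((0.1, 0.2), "legitimate"),
    ((0.2, 0.1), "legitimate"),
    ((1.0, 1.0), "phishing"),
    ((0.9, 1.1), "phishing"),
    ((1.1, 0.9), "phishing"),
]
```
]]></preformat>
      <p>No model is fitted in advance: every call sorts the training points by distance to the query, which is what makes the method a lazy learner.</p>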
    </sec>
    <sec id="sec-6">
      <title>Support Vector Machine (SVM)</title>
      <p>Support Vector Machine (SVM) is a linear classifier
based on supervised learning. It creates a boundary between
the classes in the form of a hyperplane with
maximum margin. The algorithm
is robust to outliers. The observations that lie closest to
the hyperplane are called support vectors, and SVM
places the hyperplane so that its margin to these support vectors
is as large as possible.</p>
      <p>Support Vector Machines are among the most widely
used methods in supervised machine learning.
Both regression and classification
tasks can be solved with them. The
training set is separated by a hyperplane, and the
points nearest to the hyperplane, the support vectors,
determine the position of the hyperplane.
If the training data set cannot be linearly separated,
it is mapped to a high-dimensional space where it is
assumed to be linearly separable.</p>
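      <p>The notions of margin and support vectors can be made concrete with a small geometric sketch. It does not train an SVM; it only evaluates a hand-picked hyperplane w.x + b = 0 on hypothetical two-dimensional points. All names and values are assumptions made for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def signed_distance(w, b, x):
    """Signed distance of point x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin_and_support_vectors(w, b, points, labels, tol=1e-9):
    """Return the geometric margin of the hyperplane and the points
    that attain it (the candidates for support vectors)."""
    dists = [y * signed_distance(w, b, x) for x, y in zip(points, labels)]
    margin = min(dists)
    support = [x for x, d in zip(points, dists) if d - margin <= tol]
    return margin, support


# Hypothetical linearly separable points with labels -1 and +1.
points = [(1.0, 1.0), (2.0, 0.0), (4.0, 4.0), (5.0, 3.0), (0.0, 0.0)]
labels = [-1, -1, 1, 1, -1]
# Hand-picked hyperplane x1 + x2 = 5 (not the result of an optimiser).
w, b = (1.0, 1.0), -5.0

margin, support = margin_and_support_vectors(w, b, points, labels)
```
]]></preformat>
      <p>A positive margin means that every point lies on the correct side of the hyperplane. Here four points attain the minimum distance, while the point at the origin lies farther away and does not influence the position of the boundary.</p>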
    </sec>
    <sec id="sec-7">
      <title>Random Forest</title>
      <p>Random Forest is a supervised learning algorithm used
for both classification and regression problems. In a
random forest classifier, a large number of decision trees
is built to obtain high accuracy. The
prediction obtained from a random forest tends
to be better than the prediction of an
individual decision tree. Random Forest uses
bagging to create several minimally
correlated decision trees. Its advantages are
the ability to handle missing values and to avoid
overfitting of the model.</p>
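      <p>Bagging and majority voting can be sketched as follows. To keep the example short, each tree is replaced by a one-split stand-in on a single numeric feature, and the random feature selection of a full random forest is omitted. The data, the seed, and all function names are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random
from collections import Counter


def bootstrap(data, rng):
    """Bagging step: sample len(data) items with replacement."""
    return [rng.choice(data) for _ in data]


def fit_split(sample):
    """One-split stand-in for a decision tree: store the mean feature
    value of every class present in the bootstrap sample."""
    values = {}
    for x, label in sample:
        values.setdefault(label, []).append(x)
    return {label: sum(v) / len(v) for label, v in values.items()}


def split_predict(means, x):
    """Predict the class whose mean is closest, which is equivalent to
    a single threshold halfway between the two class means."""
    return min(means, key=lambda label: abs(x - means[label]))


def fit_forest(data, n_trees, seed=0):
    """Train each stand-in tree on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit_split(bootstrap(data, rng)) for _ in range(n_trees)]


def forest_predict(forest, x):
    """Majority vote over all trees."""
    votes = Counter(split_predict(means, x) for means in forest)
    return votes.most_common(1)[0][0]


# Hypothetical one-dimensional score per email.
data = [(0.1, "legitimate"), (0.2, "legitimate"), (0.3, "legitimate"),
        (0.8, "phishing"), (0.9, "phishing"), (1.0, "phishing")]
forest = fit_forest(data, n_trees=25, seed=0)
```
]]></preformat>
      <p>Because every tree sees a different bootstrap sample, individual trees may disagree; the vote averages out these differences, which is the intuition behind the lower variance of the ensemble.</p>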
      <sec id="sec-7-1">
        <title>Experiments</title>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Description of data set</title>
      <p>The anti-phishing shared task is part of the First Security and Privacy Analytics Anti-Phishing Shared Task (IWSPA-AP 2018) at the 4th ACM International Workshop on Security and Privacy Analytics [EDMB+18][EDB+18]. Let E = [e1, e2, ..., en] be a set of emails and C = [c1, c2, c3, ..., cn] be the corresponding set of email types, each either legitimate or phishing. The task is to classify each given email sample as either legitimate or phishing. The training and testing data sets are summarized in Table 1 and Table 2.</p>
    </sec>
    <sec id="sec-9">
      <title>Proposed Architecture</title>
      <p>In our proposed architecture, we used count-based and
distributed representations of the text. For the
count-based representation we used tf-idf, and for the
distributed representation we used
Doc2Vec from the gensim library. Once the
representations were created, we used different machine
learning techniques to classify the data as legitimate
or phishing. The machine learning techniques used
are Naive Bayes, Logistic Regression, Decision Tree,
K-Nearest Neighbour, Random Forest, AdaBoost, and
Support Vector Machine.</p>
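      <p>The count-based representation can be illustrated with a small tf-idf computation that uses the standard definition idf(t, D) = log(N / |{d in D : t in d}|). The toy corpus, the use of the raw count as term frequency, and the function names are assumptions of this sketch; it is not the implementation used in the experiments.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def tf(term, doc):
    """Raw count of `term` in a tokenized document (one common choice of tf)."""
    return Counter(doc)[term]


def idf(term, corpus):
    """log(N / |{d in D : t in d}|), with N the number of documents.

    The term is assumed to occur in at least one document.
    """
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)


def tfidf(term, doc, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, corpus)


# Hypothetical tokenized emails.
corpus = [
    ["verify", "your", "account"],
    ["your", "meeting", "today"],
    ["verify", "your", "password", "verify"],
]
```
]]></preformat>
      <p>The word "your" occurs in every document and therefore receives a weight of zero, whereas "account", which occurs in only one document, receives the largest idf in this corpus.</p>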
      <p>We trained seven different machine
learning techniques with two different representations
on both data sets, i.e., with and without headers. All the
results have been consolidated in Table 3 and Table 4.
Of all the models, SVM
combined with Doc2Vec gave the highest accuracy for
both data sets; thus only that model was
submitted, even though we trained seven different
techniques. The submitted models were evaluated on the
test data, and the resulting true positives, true
negatives, false positives, and false negatives are consolidated
in Table 5.</p>
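      <p>For reference, the usual summary measures can be derived from the four confusion-matrix counts as sketched below. The counts in the example are hypothetical and are not the results of the submitted models; the function name and the convention of returning 0.0 for an empty denominator are likewise assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    Ratios with an empty denominator are reported as 0.0 (a convention
    chosen for this sketch).
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
example = metrics(tp=40, tn=50, fp=5, fn=5)
```
]]></preformat>
      <p>On imbalanced data, accuracy alone can be misleading, which is why precision and recall for the phishing class are commonly inspected alongside it.</p>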
      <sec id="sec-9-1">
        <title>Conclusion</title>
        <p>The main objective of this work is to develop a
supervised classifier that can distinguish phishing from
legitimate emails. We used count-based and distributed
representations of the text and applied
different machine learning techniques, namely Naive
Bayes, Logistic Regression, Decision Tree, K-Nearest
Neighbour, Random Forest, AdaBoost, and
Support Vector Machine, to classify legitimate and
phishing emails. The proposed methodology relies on
feature engineering; applying deep
learning to phishing detection is a direction for
future work.</p>
        <sec id="sec-9-1-1">
          <title>[JG13]</title>
          <p>Emily Jamison and Iryna Gurevych. Headerless, quoteless, but not hopeless? Using pairwise email classification to disentangle email threads. In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, pages 327–335, 2013.</p>
        </sec>
      </sec>
      <sec id="sec-9-2">
        <title>Acknowledgements</title>
        <p>This research was supported in part by Paramount Computer Systems. We are grateful to NVIDIA India for the GPU hardware support to the research grant. We are also grateful to the Computational Engineering and Networking (CEN) department for encouraging the research.</p>
        <p>[OWKG15] Douglas Oard, William Webber, David Kirsch, and Sergey Golitsynskiy. Avocado research email collection. Philadelphia: Linguistic Data Consortium, 2015.</p>
        <sec id="sec-9-2-1">
          <title>[KY04]</title>
          <p>Bryan Klimt and Yiming Yang. The Enron corpus: A new dataset for email classification research. In European Conference on Machine Learning, pages 217–226. Springer, 2004.</p>
        </sec>
        <sec id="sec-9-2-2">
          <title>[MSR+17]</title>
          <p>Ghulam Mujtaba, Liyana Shuib, Ram Gopal Raj, Nahdia Majeed, and Mohammed Ali Al-Garadi. Email classification research trends: Review and open issues. IEEE Access, 5:9044–9064, 2017.</p>
        </sec>
        <sec id="sec-9-2-3">
          <title>[MW04]</title>
          <p>Tony A Meyer and Brendon Whateley. SpamBayes: Effective open-source, Bayesian based, email classification system. In CEAS. Citeseer, 2004.</p>
        </sec>
        <sec id="sec-9-2-4">
          <title>[PS05]</title>
          <p>Adam Perer and Ben Shneiderman. Beyond threads: Identifying discussions in email archives. Technical report, Maryland Univ College Park Human Computer Interaction Lab, 2005.</p>
        </sec>
        <sec id="sec-9-2-5">
          <title>[RH11]</title>
          <p>Sara Radicati and Quoc Hoang. Email statistics report, 2011-2015. Retrieved May, 25:2011, 2011.</p>
        </sec>
        <sec id="sec-9-2-6">
          <title>[RMA15]</title>
          <p>Francois Rauscher, Nada Matta, and Hassan Atifi. Context aware knowledge zoning: Traceability and business emails. In IFIP International Workshop on Artificial Intelligence for Knowledge Management, pages 66–79. Springer, 2015.</p>
        </sec>
        <sec id="sec-9-2-7">
          <title>[SHP08]</title>
          <p>Enrique Puertas Sanz, Jose Maria Gomez Hidalgo, and Jose Carlos Cortizo Perez. Email spam filtering. Advances in Computers, 74:45–114, 2008.</p>
          <p>[VSP18a] R Vinayakumar, KP Soman, and Prabaharan Poornachandran. Detecting malicious domain names using deep learning approaches at scale. Journal of Intelligent &amp; Fuzzy Systems, 34(3):1355–1367, 2018.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [VSP18b]
          <string-name>
            <given-names>R</given-names>
            <surname>Vinayakumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>KP</given-names>
            <surname>Soman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Prabaharan</given-names>
            <surname>Poornachandran</surname>
          </string-name>
          .
          <article-title>Evaluating deep learning approaches to characterize and classify malicious URLs</article-title>
          .
          <source>Journal of Intelligent &amp; Fuzzy Systems</source>
          ,
          <volume>34</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1333</fpage>
          –
          <lpage>1343</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [VSPSK18]
          <string-name>
            <given-names>R</given-names>
            <surname>Vinayakumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>KP</given-names>
            <surname>Soman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Prabaharan</given-names>
            <surname>Poornachandran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S Sachin</given-names>
            <surname>Kumar</surname>
          </string-name>
          .
          <article-title>Evaluating deep learning approaches to characterize and classify the DGAs at scale</article-title>
          .
          <source>Journal of Intelligent &amp; Fuzzy Systems</source>
          ,
          <volume>34</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1265</fpage>
          –
          <lpage>1276</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [YWD05]
          <string-name>
            <given-names>Chi-Yuan</given-names>
            <surname>Yeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chih-Hung</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Shing-Hwang</given-names>
            <surname>Doong</surname>
          </string-name>
          .
          <article-title>Effective spam classification based on meta-heuristics.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          In
          <source>Systems, Man and Cybernetics, 2005 IEEE International Conference on</source>
          , volume
          <volume>4</volume>
          , pages
          <fpage>3872</fpage>
          –
          <lpage>3877</lpage>
          . IEEE,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>