<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Distributed Representation using Target Classes: Bag of Tricks for Security and Privacy Analytics Amrita-NLP@IWSPA-2018</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Barathi Ganesh HB</string-name>
          <email>barathiganesh.hb@arnekt.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vinayakumar R</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anand Kumar M</string-name>
          <email>m_anandkumar@cb.amrita.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soman KP</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Arnekt Solutions Pvt. Ltd.</institution>
          ,
          <addr-line>Pentagon P-3, Magarpatta City Pune, Maharashtra</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Center for Computational Engineering and Networking(CEN), Amrita School of Engineering</institution>
          ,
          <addr-line>Coimbatore Amrita Vishwa Vidyapeetham</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>45</fpage>
      <lpage>53</lpage>
      <abstract>
        <p />
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Abstract</title>
      <p>The extensive growth in the number of internet users gives attackers
more opportunities to intrude on our privacy and security. Phishing is
one such threat that has become a major issue in recent times; it targets
specific groups of people and asks for their credentials and other personal
or sensitive information. This paper describes the module submitted to the
IWSPA-AP Shared Task at IWSPA 2018, which focuses on distinguishing phishing
emails from legitimate ones. Fundamentally, this is a text classification
problem in which the representation is the core component and directly
affects the final performance. This work assesses and reports the performance
of distributed representation in detecting phishing emails as a text
classification problem. Word embeddings and neural bag-of-ngrams help to
capture the syntactic and semantic similarity of emails. The experimented
module obtains promising and consistent performance on both the train and
test corpus in terms of time and accuracy. The model obtains 99% and 97% as
the F1 score on the unseen test corpus.</p>
      <p>Copyright © by the paper’s authors. Copying permitted for
private and academic purposes.</p>
      <p>In: R. Verma, A. Das (eds.): Proceedings of the 1st
Anti-Phishing Shared Pilot at 4th ACM International Workshop on
Security and Privacy Analytics (IWSPA 2018), Tempe, Arizona,
USA, 21-03-2018, published at http://ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>Phishing is a social engineering method used to obtain the personal
and sensitive information of internet users by installing malware on their
computers, thereby exploiting the weaknesses in current web security.
Towards the end of 2017, an anti-phishing system prevented nearly 60 million
attempts to reach phishing pages<sup>1</sup>, which shows the potential
usefulness of anti-phishing systems. Every year, the number of phishing
attacks on unique internet users increases worldwide. This underlines the
need for effective anti-phishing methods and encourages the research
community to integrate Artificial Intelligence (AI) methods with cyber
security modules [ZZJ+17]. The International Workshop on Security and
Privacy Analytics - Anti Phishing (IWSPA-AP<sup>2</sup>) hosted a shared task
to build a classifier that can distinguish phishing emails from legitimate
ones.</p>
      <p>Spam emails are used to deliver all kinds of malicious attacks. An
attack can be delivered in several ways, for example by attaching files with
malicious content or by sending a link to a compromised website. The most
frequently used type of malware attack through spam emails is the blended
attack, which uses more than one method to deliver malware to an internal
network. Blended attacks often start from illegitimate emails, which may
not contain malware themselves but provide links to compromised
websites. Attackers usually craft such emails to look legitimate to a normal
user by mixing authentic links with false links that point to a fake
website. According to the survey produced by IBM’s X-Force research team,
more than half of the emails produced worldwide are spam. The share of spam
email amounted to 55.9% in the first quarter of 2017, which suggests that
the volume of spam emails may increase gradually in the coming years.
<fn><label>1</label><p>securelist.com/spam-and-phishing-in-q3-2017/82901/</p></fn>
<fn><label>2</label><p>dasavisha.github.io/IWSPA-sharedtask/#oc</p></fn></p>
      <p>The task can be formulated as a text classification
problem in which emails are the documents and the
target classes are phishing and legitimate. Any text
classification application contains a representation
(representing text as numeric values), feature
extraction (selecting informative words with respect to the
target classes) and a classifier (mapping features to
target classes) as its base components [AZ12]. Among
them, representation is the most complex and central part of
the module, since it expresses the context of the text
in numbers. The representation determines how effectively
the classification model can make its final predictions.
Hence this experiment focuses mainly on the text
representation.</p>
      <p>Content from phishing sites is semantically very similar to
content from the original sites. It is therefore necessary to
represent the context of the text rather than representing the text
as symbols. The classical Vector Space Models (VSM)
fail to do so [BGAKS17]. The Vector Space Models
of Semantics (VSMs), or distributional representation
methods, are able to include context only to some
extent [GKS16, BGAKS16a]. Unlike images and speech,
text is represented numerically in classical methods by
treating terms (words or phrases) as symbols.
Both kinds of model require heavy computation
because of the well-known "curse of dimensionality".
As a result, these methods cannot be run on the huge
corpora needed for effective representation. Distributed
representation methods were later introduced; they reflect the context
of the text as a low-dimensional dense vector and
give flexibility in choosing the dimension of the
vector. Considering these factors, this experiment
is performed using distributed representation
methods.</p>
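      <p>As an illustration of the limitation described above, the following minimal sketch builds a symbolic bag-of-words representation for two toy emails. The example sentences and the helper names (build_vocabulary, bag_of_words, cosine) are assumptions made here for illustration and are not part of the submitted system.</p>
      <preformat>
```python
import math
from collections import Counter


def build_vocabulary(documents):
    """Collect the sorted set of distinct tokens over all documents."""
    return sorted({token for doc in documents for token in doc.lower().split()})


def bag_of_words(document, vocabulary):
    """Represent a document as term counts, one slot per vocabulary term."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy emails: similar intent, but no shared words.
documents = ["update password now", "change credentials immediately"]
vocabulary = build_vocabulary(documents)
vectors = [bag_of_words(doc, vocabulary) for doc in documents]

print(len(vocabulary))                 # 6: one dimension per distinct term
print(cosine(vectors[0], vectors[1]))  # 0.0: no shared symbols, no similarity
```
      </preformat>
      <p>The vector length grows with the vocabulary, and two emails with related meaning but different words receive zero similarity, which is the gap that dense, low-dimensional representations are meant to close.</p>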
      <p>One well-known distributed representation
method is word2vec (word to vector); later
methods such as doc2vec (document to
vector), GloVe (global vectors) and fastText are
variants of word2vec with notable changes that
enhance the representation. Given a word, word2vec
produces a vector of the desired dimension that
reflects the context of the word [GL14]. To represent
a text with multiple words, either the word vectors
are averaged or the matrix formed by
concatenating them is decomposed to
form a single vector [BGAKS16b]. GloVe improves the
learning of word2vec by combining it with the
co-occurrence matrix, and it gives the flexibility to
train on a small corpus with
promising performance [PSM14]. Both methods
represent word sequences poorly, since averaging
word vectors does not consider the order of the words.
The doc2vec method introduces a way to represent
a sequence of words as a vector [LM14]. Its
architecture is similar to that of word2vec, except that one
more weight matrix is learned along with
the weight matrix of word2vec to represent the
sequence of words. Finally, fastText<sup>3</sup> was
introduced; it learns the vector for a given word from
the class the word belongs to, rather than by predicting
the next word of the given word as the earlier methods
do [JGBM16]. Since the number of classes is always
smaller than the number of words, this method is faster than
the others.</p>
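      <p>The loss of word order under averaging can be seen in a small sketch. The three-dimensional embedding values below are hypothetical; real vectors would be learned from a corpus.</p>
      <preformat>
```python
def average_vectors(words, embeddings):
    """Average the embedding vectors of the given words, dimension by dimension."""
    dimension = len(next(iter(embeddings.values())))
    total = [0.0] * dimension
    for word in words:
        for k, value in enumerate(embeddings[word]):
            total[k] += value
    return [value / len(words) for value in total]


# Hypothetical 3-dimensional embeddings used only for illustration.
embeddings = {
    "boy": [0.5, 0.25, 0.0],
    "chases": [0.0, 1.0, 0.5],
    "cat": [0.25, 0.0, 1.0],
}

forward = average_vectors(["boy", "chases", "cat"], embeddings)
reverse = average_vectors(["cat", "chases", "boy"], embeddings)

print(forward == reverse)  # True: both word orders give the same document vector
```
      </preformat>
      <p>"boy chases cat" and "cat chases boy" receive identical vectors, which is why methods that model the sequence explicitly were proposed.</p>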
      <p>Based on these observations, this work uses fastText to
represent texts as vectors and softmax to make
the final predictions. The given email documents are
normalized by a preprocessor to remove
uninformative features; fastText with hyper-parameter
tuning is then used for document representation and
classification. The remainder of the paper is organized as follows:
Section 2 details the related work on phishing detection,
Section 3 gives the problem formulation and the working principles
of fastText, and Section 4
details the experiments conducted and discusses
the results obtained.</p>
    </sec>
    <sec id="sec-3">
      <title>Related Works</title>
      <p>Alongside the traditional methods long used in text
classification, artificial intelligence (AI) has become a popular
technique over the last few decades. AI-based approaches use supervised
classification algorithms to perform binary classification of spam
emails.</p>
      <p>At present, few methods are designed for
effective detection of phishing emails (existing ones focus on
finding phishing URLs), and most of the methods that show
good performance target the detection of spam mails. Most
researchers rely on manual feature
engineering (number of words, number of domains, URLs,
number of links, number of dots, message hashes)
based on analysing the content of the email, and then apply
a classifier to these features.
<fn><label>3</label><p>fasttext.cc</p></fn></p>
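      <p>A minimal sketch of such hand-crafted features is given below. The feature names, the URL pattern and the sample email are assumptions chosen for illustration; they do not reproduce the feature set of any cited work.</p>
      <preformat>
```python
import re

# Simplified assumption: a URL is "http://" or "https://" followed by non-space characters.
URL_PATTERN = re.compile(r"https?://\S+")


def handcrafted_features(email_text):
    """Compute simple count-based features of the kind listed in the text."""
    urls = URL_PATTERN.findall(email_text)
    # Domain = host part of each URL, e.g. "example.com" in "http://example.com/a".
    domains = {re.sub(r"^https?://", "", url).split("/")[0].lower() for url in urls}
    return {
        "num_words": len(email_text.split()),
        "num_links": len(urls),
        "num_domains": len(domains),
        "num_dots": email_text.count("."),
    }


# Hypothetical email body used only for illustration.
sample = ("Dear user, verify your account at http://example.com/login "
          "or http://example.com/help today.")
features = handcrafted_features(sample)
print(features)
```
      </preformat>
      <p>Each email becomes a short numeric vector that a conventional classifier can consume, at the cost of deciding by hand which counts matter.</p>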
      <p>Email headers play a key role in identifying spam
emails. They identify the recipient of a message and
also record the route of the mail as it passes through the mail
servers. Email headers provide extremely useful
features that machine learning models can use
to classify spam emails efficiently [LT04, S+09, WC07].</p>
      <p>Recently, authors have started developing anti-phishing
models using deep learning algorithms such as Deep Belief
Networks (DBN), Recurrent Neural Networks (RNN) and
Convolutional Neural Networks (CNN) [ZL17,
BBV+17, SAZ18, RJ17, LNRW]. Manual
feature engineering is eliminated, since feature learning is
handled by the intermediate hidden layers of the
neural networks. From the above we can conclude that,
despite the advances in text classification, the
anti-phishing problem has not been addressed adequately and
requires more research. Semantic representation
of text with low computation is suitable for real-world
data; hence this work assesses the performance
of distributed representation of text with respect to
target classes in the anti-phishing task.</p>
    </sec>
    <sec id="sec-4">
      <title>Corpus Statistics</title>
      <p>Only a few benchmark corpora are available for
detecting phishing content in emails. The corpus
for this experiment was provided by the
IWSPA-AP shared task organizers [EDMB+18]. Two
corpora are provided, one with headers and the other
without headers. Detailed statistics of the
training corpus are given in Table 1 and those of the testing
corpus in Table 2.</p>
      <p>The objective of this experiment is to map each
document d<sub>i</sub> to the target class it belongs to, where n refers
to the total number of documents. The first step is to
find the distributed representation of d<sub>i</sub> that reflects
the context of d<sub>i</sub>. This is given as,
<disp-formula><label>(1)</label><tex-math>(R_i)_{1 \times m} = \text{distributed representation}(d_i)</tex-math></disp-formula>
<disp-formula><label>(2)</label><tex-math>(R)_{n \times m} = \text{distributed representation}(D)</tex-math></disp-formula>
where m is the dimension of the representation and D is the collection of all n documents.
After successful representation, each R<sub>i</sub> is mapped to
the target class it belongs to. In this paper we have
experimented with distributed representation of text with respect
to its target class.</p>
      <p>Word2vec is the first distributed representation
method developed to represent the context of a given
word as a vector [GL14]. In word2vec the
representation is learned by feeding word<sub>i</sub> to the
architecture, which in turn has to predict word<sub>c</sub>, i.e. the
words co-occurring with word<sub>i</sub>. For example, take "boy
chases the cat": given the word "boy", the
architecture predicts its co-occurring words "chases", "the"
and "cat". During the learning phase word2vec learns
the inner matrix W<sub>1</sub> and the outer matrix W<sub>2</sub>, which transform
word<sub>i</sub> to word<sub>c</sub> and vice versa. W<sub>1</sub> gives the
representation for the words in the vocabulary. This is given
as,
<disp-formula><label>(3)</label><tex-math>h = W_1^{T} X = V_{W_1}^{T}</tex-math></disp-formula>
<disp-formula><label>(4)</label><tex-math>u_{W_2} = W_2^{T} h</tex-math></disp-formula>
<disp-formula><label>(5)</label><tex-math>p(\mathrm{word}_c \mid \mathrm{word}_i) = \frac{\exp(u_{W_2})}{\sum_{V} \exp(u_{W_2})}</tex-math></disp-formula></p>
      <p>In fastText [JGBM16], x<sub>i</sub> is the normalized bag of features of the
i-th document, y<sub>i</sub> is the target label, and W<sub>1</sub> and W<sub>2</sub> are
the weight matrices. Since the output layer holds only a small,
fixed number of target classes, the computation required
by the softmax is also lower. The words are
represented as vectors with respect to the target
class they belong to. A number of
parameters need to be tuned with respect to
the data and the classification problem. A high-level
common architecture for distributed representation is
given in Figure 1 [JGBM16]. The same architecture is
used for both sub-tasks.</p>
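      <p>The word2vec forward pass described in this section (hidden vector from the inner matrix, output scores from the outer matrix, then a softmax over the vocabulary) can be sketched in plain Python. The matrix values are hypothetical and untrained, and the helper names are assumptions made for this sketch.</p>
      <preformat>
```python
import math


def transpose_times(matrix, vector):
    """Compute matrix^T @ vector for a row-major list-of-lists matrix."""
    rows, cols = len(matrix), len(matrix[0])
    return [sum(matrix[r][c] * vector[r] for r in range(rows)) for c in range(cols)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


vocabulary = ["boy", "chases", "the", "cat"]  # V = 4
# Hypothetical weights: W1 is V x m (inner), W2 is m x V (outer), with m = 2.
W1 = [[0.5, 0.1], [0.2, 0.4], [0.0, 0.3], [0.6, 0.2]]
W2 = [[0.1, 0.7, 0.2, 0.5], [0.3, 0.2, 0.4, 0.1]]

x = [1.0, 0.0, 0.0, 0.0]    # one-hot input selecting "boy"
h = transpose_times(W1, x)  # hidden vector: the row of W1 for "boy"
u = transpose_times(W2, h)  # one score per vocabulary word
p = softmax(u)              # probability of each word co-occurring with "boy"

print(h)
print(vocabulary[p.index(max(p))])
```
      </preformat>
      <p>The softmax normalizes over every output unit, so its cost scales with the size of the output layer; with a handful of target classes instead of a full vocabulary, this step becomes correspondingly cheaper.</p>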
    </sec>
    <sec id="sec-5">
      <title>Experiment and Observations</title>
      <p>The task is to classify each given email sample as
either legitimate or phishing [EDB+18].
Sub-task 1 contains email samples without headers and
sub-task 2 contains email samples with headers. The given
data set is unbalanced, with a phishing-to-legitimate ratio of
nearly 1:8. The word count statistics of the
corpus are detailed in Table 3.</p>
      <p>This experiment was performed on a system
with 16 GB RAM and an i7 processor.
The model was built using Python 2<sup>4</sup> and the fastText
library package<sup>5</sup>, and the code has been made publicly available<sup>6</sup>.</p>
      <p>Considering the computation required for 10-fold
cross-validation, in this work the given corpus was instead
shuffled, to avoid biasing the model towards a
particular subset, and split into 80% for training and 20%
for validation. Before splitting, the
corpus is preprocessed to remove punctuation,
special symbols and empty spaces. The hyper-parameters
are tuned (Dimension: 100 to 1000, Minimum Word
Count: 1 to 5, Epochs: 3 to 10, N-Grams: 2, Loss
Function: Softmax and Learning rate: 0.001, 0.01, 0.1)
to obtain the maximum F1 score on the validation data.
The F1 score is used to make sure that the
system performs well over all the classes, which
inherently means better sensitivity in separating
phishing mails (lower in quantity) from
legitimate mails (higher in quantity). The vector obtained
from distributed representation captures semantic
properties, which helps the natural language processing (NLP)
system to achieve better results than traditional
bag-of-words representations. The neural bag-of-ngrams vector
produced by fastText is a dense, real-valued
representation that also captures the semantics of the
context. It combines bag-of-ngrams with
neural word embedding and is robust, simple and
flexible.
<fn><label>4</label><p>www.python.org</p></fn>
<fn><label>5</label><p>pypi.python.org/pypi/fasttext</p></fn>
<fn><label>6</label><p>github.com/BarathiGanesh-HB/IWSPA-AP</p></fn></p>
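      <p>The data preparation steps described above (cleaning, shuffling, the 80/20 split and word bigrams) can be sketched as follows. This is not the authors' released code: the regular expression, the fixed random seed and the toy corpus are assumptions, and the "__label__" prefix is the label convention used by fastText's supervised input files.</p>
      <preformat>
```python
import random
import re


def preprocess(text):
    """Drop punctuation and special symbols, then collapse runs of whitespace.

    Assumption: only ASCII letters, digits and whitespace are kept.
    """
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def word_ngrams(tokens, n=2):
    """Return the contiguous word n-grams of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def shuffle_and_split(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples and split it into training and validation parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def to_fasttext_line(label, text):
    """Format one sample as '__label__X text', one sample per line."""
    return "__label__{} {}".format(label, preprocess(text))


# Hypothetical miniature corpus of (label, email text) pairs.
corpus = [("phishing", "Verify your account NOW!!!"),
          ("legitimate", "Meeting moved to 3 pm.")] * 5
train, valid = shuffle_and_split(corpus)

print(len(train), len(valid))                          # 8 2
print(to_fasttext_line(*corpus[0]))                    # __label__phishing Verify your account NOW
print(word_ngrams(preprocess(corpus[0][1]).split()))   # ['Verify your', 'your account', 'account NOW']
```
      </preformat>
      <p>Fixing the seed keeps the split reproducible across hyper-parameter settings, so that validation scores for different settings are compared on the same held-out emails.</p>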
      <p>We submitted two models: the first
is developed by treating the data with headers
and the data without headers independently, while the
second is built by combining both data sets into a
single model. The results obtained during the training
phase are given in Table 4. It can be observed that the
combined-data model performs 1% lower than the independent
models. The final model was built using the
hyper-parameters listed in Table 5.</p>
      <p>The model performance on the test corpus was
measured by the task organizers. The performance
they reported is shown in detail in
Table 6. These reports included the
Precision, Recall and F1 measures, as shown in Table
7. From Tables 4 and 7 we can conclude that the
model performed as well on the test corpus as on the
train corpus.</p>
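      <p>For reference, the Precision, Recall and F1 measures can be computed per class as in the sketch below. The label lists are hypothetical and only mimic an unbalanced data set; they are not the organizers' results.</p>
      <preformat>
```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: 8 legitimate and 2 phishing emails, one phishing email missed.
y_true = ["legitimate"] * 8 + ["phishing"] * 2
y_pred = ["legitimate"] * 8 + ["phishing", "legitimate"]

for label in ("legitimate", "phishing"):
    print(label, precision_recall_f1(y_true, y_pred, label))

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.9, although half of the phishing emails were missed
```
      </preformat>
      <p>In this toy case accuracy stays at 0.9 while the recall for the phishing class is only 0.5, which is why a per-class F1 score is more informative than accuracy on unbalanced data.</p>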
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>An anti-phishing system has been built successfully
using a distributed representation method. It attains
good performance during the training phase. The
combined-data model performs 1% lower than the
independent models built with and without header files, which
attain an F1 score of nearly 99% in the training period. On the
test corpus both models gave similar performance.
Semantic representation of text with low
computation and reliable performance is suitable for real-world
data; hence the experimented model is suitable for
real-world applications. The performance of the
system could be enhanced with a more complex deep learning
architecture at the classification stage. In future, this
architecture could be made more effective by training
on a Graphics Processing Unit (GPU).
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <label>[Alt17]</label>
        <mixed-citation>Altyeb Altaher. Phishing websites classification using hybrid SVM and KNN approach. Int J Adv Comput Sc, 421:8, 2017.</mixed-citation>
      </ref>
      <ref id="ref2">
        <label>[AN15]</label>
        <mixed-citation>J Adamkani and K Nirmala. A content filtering scheme in social sites. Indian Journal of Science and Technology, 8(33):1, 2015.</mixed-citation>
      </ref>
      <ref id="ref3">
        <label>[AZ12]</label>
        <mixed-citation>Charu C Aggarwal and ChengXiang Zhai. A survey of text classification algorithms. In Mining text data, pages 163–222. Springer, 2012.</mixed-citation>
      </ref>
      <ref id="ref4">
        <label>[AZZ+15]</label>
        <mixed-citation>Ahmed Abbasi, Fatemeh Mariam Zahedi, Daniel Zeng, Yan Chen, Hsinchun Chen, and Jay F Nunamaker Jr. Enhancing predictive analytics for anti-phishing by exploiting website genre information. Journal of Management Information Systems, 31(4):109–157, 2015.</mixed-citation>
      </ref>
      <ref id="ref5">
        <label>[BBV+17]</label>
        <mixed-citation>Alejandro Correa Bahnsen, Eduardo Contreras Bohorquez, Sergio Villegas, Javier Vargas, and Fabio A González. Classifying phishing URLs using recurrent neural networks. In Electronic Crime Research (eCrime), 2017 APWG Symposium on, pages 1–8. IEEE, 2017.</mixed-citation>
      </ref>
      <ref id="ref6">
        <label>[BGAKS16a]</label>
        <mixed-citation>HB Barathi Ganesh, M Anand Kumar, and KP Soman. Distributional semantic representation for text classification and information retrieval. CEUR, 1737, 2016.</mixed-citation>
      </ref>
      <ref id="ref7">
        <label>[BGAKS16b]</label>
        <mixed-citation>HB Barathi Ganesh, M Anand Kumar, and KP Soman. Semantic relation from word embeddings in higher dimension. Proceedings of SemEval, pages 1290–1295, 2016.</mixed-citation>
      </ref>
      <ref id="ref8">
        <label>[BGAKS17]</label>
        <mixed-citation>HB Barathi Ganesh, M Anand Kumar, and KP Soman. Vector space model as cognitive space for text classification. arXiv preprint arXiv:1708.06068, 2017.</mixed-citation>
      </ref>
      <ref id="ref9">
        <label>[EDB+18]</label>
        <mixed-citation>Ayman Elaassal, Avisha Das, Shahryar Baki, Luis De Moraes, and Rakesh Verma. IWSPA-AP: Anti-phishing shared task at ACM International Workshop on Security and Privacy Analytics. In Proceedings of the 1st IWSPA Anti-Phishing Shared Task. CEUR, 2018.</mixed-citation>
      </ref>
      <ref id="ref10">
        <label>[EDMB+18]</label>
        <mixed-citation>Ayman Elaassal, Luis De Moraes, Shahryar Baki, Rakesh Verma, and Avisha Das. IWSPA-AP shared task email dataset, 2018.</mixed-citation>
      </ref>
      <ref id="ref11">
        <label>[GKS16]</label>
        <mixed-citation>HB Barathi Ganesh, M Anand Kumar, and KP Soman. From vector space models to vector space models of semantics. In Forum for Information Retrieval Evaluation, pages 50–60. Springer, Cham, 2016.</mixed-citation>
      </ref>
      <ref id="ref12">
        <label>[GL14]</label>
        <mixed-citation>Yoav Goldberg and Omer Levy. word2vec explained: Deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722, 2014.</mixed-citation>
      </ref>
      <ref id="ref13">
        <label>[JGBM16]</label>
        <mixed-citation>Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016.</mixed-citation>
      </ref>
      <ref id="ref14">
        <label>[LM14]</label>
        <mixed-citation>Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196, 2014.</mixed-citation>
      </ref>
      <ref id="ref15">
        <label>[LNRW]</label>
        <mixed-citation>Christopher Lennan, Bastian Naber, Jan Reher, and Leon Weber. End-to-end spam classification with neural networks.</mixed-citation>
      </ref>
      <ref id="ref16">
        <label>[LT04]</label>
        <mixed-citation>Chih-Chin Lai and Ming-Chi Tsai. An empirical performance comparison of machine learning methods for spam email categorization. In Hybrid Intelligent Systems, 2004. HIS’04. Fourth International Conference on, pages 44–48. IEEE, 2004.</mixed-citation>
      </ref>
      <ref id="ref17">
        <label>[MW04]</label>
        <mixed-citation>Tony A Meyer and Brendon Whateley. Spambayes: Effective open-source, bayesian based, email classification system. In CEAS. Citeseer, 2004.</mixed-citation>
      </ref>
      <ref id="ref18">
        <label>[PSM14]</label>
        <mixed-citation>Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.</mixed-citation>
      </ref>
      <ref id="ref19">
        <label>[RJ17]</label>
        <mixed-citation>Yafeng Ren and Donghong Ji. Neural networks for deceptive opinion spam detection: An empirical study. Information Sciences, 385:213–224, 2017.</mixed-citation>
      </ref>
      <ref id="ref20">
        <label>[S+09]</label>
        <mixed-citation>Jyh-Jian Sheu et al. An efficient two-phase spam filtering method based on e-mails categorization. IJ Network Security, 9(1):34–43, 2009.</mixed-citation>
      </ref>
      <ref id="ref21">
        <label>[SAZ18]</label>
        <mixed-citation>Sami Smadi, Nauman Aslam, and Li Zhang. Detection of online phishing email using dynamic evolving neural network based on reinforcement learning. Decision Support Systems, 2018.</mixed-citation>
      </ref>
      <ref id="ref22">
        <label>[SMAC17]</label>
        <mixed-citation>Abdulhamit Subasi, Esraa Molah, Fatin Almkallawi, and Touseef J Chaudhery. Intelligent phishing website detection using random forest classifier. In Electrical and Computing Technologies and Applications (ICECTA), 2017 International Conference on, pages 1–5. IEEE, 2017.</mixed-citation>
      </ref>
      <ref id="ref23">
        <label>[TK17]</label>
        <mixed-citation>Fadi Thabtah and Firuz Kamalov. Phishing detection: a case analysis on classifiers with rules using machine learning. Journal of Information &amp; Knowledge Management, 16(04):1750034, 2017.</mixed-citation>
      </ref>
      <ref id="ref24">
        <label>[WC07]</label>
        <mixed-citation>Chih-Chien Wang and Sheng-Yi Chen. Using header session messages to anti-spamming. Computers &amp; Security, 26(5):381–390, 2007.</mixed-citation>
      </ref>
      <ref id="ref25">
        <label>[ZL17]</label>
        <mixed-citation />
      </ref>
      <ref id="ref26">
        <label>[ZZJ+17]</label>
        <mixed-citation>Xi Zhang, Yu Zeng, Xiao-Bo Jin, Zhi-Wei Yan, and Guang-Gang Geng. Boosting the phishing detection performance by semantic analysis. In Big Data (Big Data), 2017 IEEE International Conference on, pages 1063–1070. IEEE, 2017.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>