<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>A Deep Learning Model with Hierarchical LSTMs and Supervised Attention for Anti-Phishing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Minh Nguyen</string-name>
          <email>minh.nv142950@sis.hust.edu.vn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Toan Nguyen</string-name>
          <email>toan.v.nguyen@nyu.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thien Huu Nguyen</string-name>
          <email>thien@cs.uoregon.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hanoi University of Science and Technology</institution>
          ,
          <addr-line>Hanoi</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>New York University</institution>
          ,
          <addr-line>Brooklyn, New York</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Oregon</institution>
          ,
          <addr-line>Eugene, Oregon</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>Anti-phishing aims to detect phishing content/documents in a pool of textual data. This is an important problem in cybersecurity that can help to guard users from fraudulent information. Natural language processing (NLP) offers a natural solution for this problem as it is capable of analyzing the textual content to perform intelligent recognition. In this work, we investigate the state-of-the-art techniques for text categorization in NLP to address the problem of anti-phishing for emails (i.e., predicting if an email is phishing or not). These techniques are based on deep learning models that have attracted much attention from the community recently. In particular, we present a framework with hierarchical long short-term memory networks (H-LSTMs) and attention mechanisms to model the emails simultaneously at the word and the sentence level. Our expectation is to produce an effective model for anti-phishing and demonstrate the effectiveness of deep learning for problems in cybersecurity.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>Despite being one of the oldest tactics, email phishing remains the most common attack used by cybercriminals [phi17a] due to its effectiveness. Phishing attacks exploit users' inability to distinguish legitimate information from fake information sent to them [DTH06, SNM15, SNM17, SNG+17]. In an email phishing campaign, attackers send emails appearing to be from well-known enterprises or organizations directly to their victims or via spoofed emails [Sin05]. These emails try to lure victims into divulging their private information [JJJM07, SNM15, SNG+17] or into visiting an impersonated site (i.e., a fake banking website), on which they will be asked for passwords, credit card numbers or other sensitive information. The recent hack of a high-profile US politician (usually referred to as "John Podesta's hack") is a famous example of this type of attack. It all started with a spoofed email sent to the victim asking him to reset his Gmail password by clicking on a link in the email [pod16]. The technique of email phishing may seem simple, yet the damage it causes is huge. In the US alone, the estimated cost of phishing emails to businesses is half a billion dollars per year [phi17b].</p>
      <p>Numerous methods have been proposed to automatically detect phishing emails [BCP+08, FST07, ANNWN07, GTJA17]. Chandrasekaran et al. proposed to use structural properties of emails and Support Vector Machines (SVM) to classify phishing emails [CNU06]. In [ANNWN07], Abu-Nimeh et al. evaluated six machine learning classifiers on a public phishing email dataset using 43 proposed features. Gupta et al. [GTJA17] presented a survey of recent state-of-the-art research on phishing detection. However, these methods mainly rely on feature engineering efforts to generate characteristics (features) to represent emails, over which machine learning methods can be applied to perform the task. Such feature engineering is often done manually and still requires much labor and domain expertise. This has hindered the portability of the systems to new domains and limited the performance of the current systems.</p>
      <p>In order to overcome this problem, our work focuses on deep learning techniques to solve the problem of phishing email detection. The major benefit of deep learning is its ability to automatically induce effective and task-specific representations from data that can be used as features to recognize phishing emails. As deep learning has been shown to achieve state-of-the-art performance for many natural language processing tasks, including text categorization [GBB11, LXLZ15], information extraction [NG15b, NG15a, NG16], machine translation [BCB14], among others, we expect that it would also help to build effective systems for phishing email detection.</p>
      <p>We present a new deep learning model to solve the problem of email phishing prediction using hierarchical long short-term memory networks (H-LSTMs) augmented with a supervised attention technique. In the hierarchical LSTM model [YYD+16], emails are considered as hierarchical architectures with words in the lower level (the word level) and sentences in the upper level (the sentence level). LSTM models are first applied at the word level, whose results are passed to LSTM models at the sentence level to generate a representation vector for the entire email. The outputs of the LSTM models in the two levels are combined using the attention mechanism [BCB14] that assigns contribution weights to the words and sentences in the emails. A header network is also integrated to model the headers of the emails if they are available. In addition, we propose a novel technique to supervise the attention mechanism [MWI16, LUF+16, LCLZ17] at the word level of the hierarchical LSTMs based on the appearance rank of the words in the vocabulary. Experiments on the datasets for phishing email detection in the First Security and Privacy Analytics Anti-Phishing Shared Task (IWSPA-AP 2018) [EDB+18] demonstrate the benefits of the proposed models, which rank among the top positions among the participating systems of the shared task (in terms of the performance on the unseen test data).</p>
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>Phishing email detection is a classic problem; however, research on this topic often suffers from the same limitation: there is no official and large dataset for it. Most previous works typically used a public set consisting of legitimate or "ham" emails (https://spamassassin.apache.org/old/publiccorpus) and another public set of phishing emails (https://monkey.org/~jose/phishing) for their classification evaluation [FST07, BCP+08, BMS08, HA11, ZY12]. Other works used private but small datasets [CNU06, ANNWN07]. In addition, the ratio between phishing and legitimate emails in these datasets was typically balanced. This is not the case in the real-world scenario where the number of legitimate emails is much larger than that of phishing emails. Our current work relies on larger datasets with unbalanced distributions of phishing and legitimate emails collected for the First Security and Privacy Analytics Anti-Phishing Shared Task (IWSPA-AP 2018) [EDB+18].</p>
      <p>Besides the limitation of small datasets, the previous work has extensively relied on feature engineering to manually find representative features for the problem. Apart from features extracted from emails, [LV12] also uses a blacklist of phishing websites to get an additional feature for the URLs appearing in emails. Some neural network systems have also been introduced to detect such blacklists [MTM14, MAS+11]. This is undesirable because these engineered features need to be updated once new types of phishing emails with new content are presented. Our work differs from the previous work in this area in that we automate the feature engineering process using a deep learning model. This allows us to automatically learn effective features for phishing email detection from data. Deep learning has recently been employed for feature extraction with success on many natural language processing problems [NG15b, NG15a].</p>
    </sec>
    <sec id="sec-4">
      <title>Proposed Model</title>
      <p>Phishing email detection is a binary classification problem that can be formalized as follows.</p>
      <p>Let e = {b, s} be an email in which b and s are the body content and header of the email respectively. Let y be the binary variable indicating whether e is a phishing email or not (y = 1 if e is a phishing email and y = 0 otherwise). In order to predict the legitimacy of the email, our goal is to estimate the probability P(y = 1|e) = P(y = 1|b, s). In the following, we describe our methods to model the body b and header s with the body network and header network respectively to achieve this goal.</p>
      <sec id="sec-4-1">
        <title>Body Network with Hierarchical LSTMs</title>
        <p>For the body b, we view it as a sequence of sentences b = (u_1, u_2, ..., u_L) where u_i is the i-th sentence and L is the number of sentences in the email body b. Each sentence u_i is in turn a sequence of words/tokens u_i = (v_{i,1}, v_{i,2}, ..., v_{i,K}) with v_{i,j} as the j-th token in u_i and K as the length of the sentence. Note that we set L and K to fixed values by padding the sentences u_i and the body b with dummy symbols.</p>
        <p>As there are two levels of information in b (i.e., the word level with the words v_{i,j} and the sentence level with the sentences u_i), we consider a hierarchical network that involves two layers of bidirectional long short-term memory networks (LSTMs) to model such information. In particular, the first layer consumes the words in the sentences via the embedding module, the bidirectional LSTM module and the attention module to obtain representation vectors for every sentence u_i in b (the word level layer). Afterward, the second network layer combines the representation vectors from the first layer with another bidirectional LSTM and attention module, leading to a representation vector for the whole email body b (the sentence level layer). This body representation vector would then be used as features to estimate P(y|b, s) and make the prediction for the initial email e.</p>
      </sec>
      <sec id="sec-4-2">
        <title>The Word Level Layer</title>
      </sec>
      <sec id="sec-4-3">
        <title>Embedding</title>
        <p>In the word level layer, every word v_{i,j} in each sentence u_i in b is first transformed into its embedding vector w_{i,j}. In this paper, w_{i,j} is retrieved by taking the corresponding column vector in the word embedding matrix W_e [MSC+13] that has been pre-trained from a large corpus: w_{i,j} = W_e[v_{i,j}] (each column in the matrix W_e corresponds to a word in the vocabulary). As a result of this embedding step, every sentence u_i = (v_{i,1}, v_{i,2}, ..., v_{i,K}) in b would be converted into a sequence of vectors (w_{i,1}, w_{i,2}, ..., w_{i,K}), constituting the inputs for the bidirectional LSTM model in the next step.</p>
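        <p>To make the lookup concrete, the following is a minimal sketch with a toy vocabulary and random embeddings (the vocabulary, words and values here are illustrative only, not the actual pre-trained matrix):</p>
        <preformat>
import numpy as np

vocab = {"please": 0, "verify": 1, "your": 2, "account": 3}
emb_dim = 300                               # embedding size (Section 4.3)
W_e = np.random.randn(emb_dim, len(vocab))  # one column per vocabulary word

sentence = ["please", "verify", "your", "account"]
w = np.stack([W_e[:, vocab[v]] for v in sentence])  # (K, emb_dim) LSTM inputs
        </preformat>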
      </sec>
      <sec id="sec-4-4">
        <title>Bidirectional LSTMs for the word level</title>
        <p>This module employs two LSTMs [HS97, GS05] that run over each input vector sequence (w_{i,1}, w_{i,2}, ..., w_{i,K}) in two different directions, i.e., forward (from w_{i,1} to w_{i,K}) and backward (from w_{i,K} to w_{i,1}). Along with their operations, the forward LSTM generates the forward hidden vector sequence (h→_{i,1}, h→_{i,2}, ..., h→_{i,K}) while the backward LSTM produces the backward hidden vector sequence (h←_{i,1}, h←_{i,2}, ..., h←_{i,K}). These two hidden vector sequences are then concatenated at each position, resulting in the new hidden vector sequence (h_{i,1}, h_{i,2}, ..., h_{i,K}) for the sentence u_i in b where h_{i,j} = [h→_{i,j}, h←_{i,j}]. The notable characteristic of the hidden vector h_{i,j} is that it encodes the context information over the whole sentence u_i due to the recurrent property of the forward and backward LSTMs, although a greater focus is put on the current word v_{i,j}.</p>
        <p>In this module, the vectors in the hidden vector sequence (h_{i,1}, h_{i,2}, ..., h_{i,K}) are combined to generate a single representation vector for the initial sentence u_i. The attention mechanism [YYD+16] seeks to do this by computing a weighted sum of the vectors in the sequence. Each hidden vector h_{i,j} would be assigned a weight α_{i,j} to estimate its importance/contribution to the representation vector for u_i for the phishing prediction of the email e. In this work, the weight α_{i,j} for h_{i,j} is computed by:</p>
        <p>α_{i,j} = exp(a_{i,j}^T w_a) / Σ_{j'} exp(a_{i,j'}^T w_a)   (1)</p>
        <p>in which</p>
        <p>a_{i,j} = tanh(W_att h_{i,j} + b_att)   (2)</p>
        <p>Here, W_att, b_att and w_a are the model parameters that would be learned during the training process. Consequently, the representation vector û_i for the sentence u_i in b would be:</p>
        <p>û_i = Σ_j α_{i,j} h_{i,j}   (3)</p>
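        <p>A small numerical sketch of Equations (1)-(3), with toy dimensions; W_att, b_att and w_a are randomly initialized here but would be learned in practice:</p>
        <preformat>
import numpy as np

K, d = 4, 120                      # sentence length, BiLSTM output size
h = np.random.randn(K, d)          # hidden vectors h_{i,1..K}
W_att = np.random.randn(d, d)
b_att = np.zeros(d)
w_a = np.random.randn(d)

a = np.tanh(h @ W_att.T + b_att)                 # Eq. (2)
scores = a @ w_a                                 # a_{i,j}^T w_a
alpha = np.exp(scores) / np.exp(scores).sum()    # Eq. (1): softmax over j
u_hat = alpha @ h                                # Eq. (3): sentence vector
        </preformat>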
        <p>After the word level layer completes its operation on every sentence of b = (u_1, u_2, ..., u_L), we obtain a corresponding sequence of sentence representation vectors (û_1, û_2, ..., û_L). This vector sequence would be combined in the next sentence level layer to generate a single vector to represent b for phishing prediction.</p>
      </sec>
      <sec id="sec-4-5">
        <title>The Sentence Level Layer</title>
        <p>The sentence level layer processes the vector sequence (û_1, û_2, ..., û_L) in the same way that the word level layer has employed for the vector sequence (w_{i,1}, w_{i,2}, ..., w_{i,K}) for each sentence u_i. Specifically, (û_1, û_2, ..., û_L) is also first fed into a bidirectional LSTM module (i.e., with a forward and a backward LSTM) whose results are concatenated at each position to produce the corresponding hidden vector sequence (ĥ_1, ĥ_2, ..., ĥ_L). In the next step with the attention module, the vectors in (ĥ_1, ĥ_2, ..., ĥ_L) are weighted and summed to finally generate the representation vector r_b for the email body b of e. Assuming the attention weights for (ĥ_1, ĥ_2, ..., ĥ_L) are (α_1, α_2, ..., α_L) respectively, the body vector r_b is then computed by:</p>
        <p>r_b = Σ_i α_i ĥ_i   (4)</p>
        <p>Note that the model parameters of the bidirectional
LSTM modules (and the attention modules) in the
word level layer and the sentence level layer are
separate and they are both learnt in a single training
process. Figure 1 shows the overview of the body network
with hierarchical LSTMs and attention.</p>
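        <p>For illustration, the following is a compact PyTorch sketch of the body network under our own naming and the dimensions of Section 4.3; it is an approximation of the architecture, not the exact implementation used in the experiments:</p>
        <preformat>
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.att = nn.Linear(dim, dim)            # W_att, b_att
        self.w_a = nn.Linear(dim, 1, bias=False)  # w_a

    def forward(self, h):                         # h: (batch, steps, dim)
        a = torch.tanh(self.att(h))                              # Eq. (2)
        alpha = torch.softmax(self.w_a(a).squeeze(-1), dim=-1)   # Eq. (1)
        return (alpha.unsqueeze(-1) * h).sum(dim=1), alpha       # Eq. (3)

class BodyNetwork(nn.Module):
    def __init__(self, emb, hidden=60):
        super().__init__()
        self.emb = nn.Embedding.from_pretrained(emb, freeze=False)
        self.word_lstm = nn.LSTM(emb.size(1), hidden,
                                 bidirectional=True, batch_first=True)
        self.sent_lstm = nn.LSTM(2 * hidden, hidden,
                                 bidirectional=True, batch_first=True)
        self.word_att = Attention(2 * hidden)     # separate parameters
        self.sent_att = Attention(2 * hidden)     # per layer
        self.out = nn.Linear(2 * hidden, 1)       # W_out, b_out

    def forward(self, body):                      # body: (B, L, K) word ids
        B, L, K = body.size()
        h, _ = self.word_lstm(self.emb(body.view(B * L, K)))
        u_hat, word_alpha = self.word_att(h)      # sentence vectors û_i
        h_sent, _ = self.sent_lstm(u_hat.view(B, L, -1))
        r_b, _ = self.sent_att(h_sent)            # body vector r_b, Eq. (4)
        return torch.sigmoid(self.out(r_b)).squeeze(-1), word_alpha

# usage sketch: 100-word vocabulary, 300-dim embeddings, 2 emails,
# L = 5 sentences of K = 8 padded words each
model = BodyNetwork(torch.randn(100, 300))
p, alpha = model(torch.randint(0, 100, (2, 5, 8)))
        </preformat>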
        <p>Once the body vector r_b has been computed, we can use it as features to estimate the phishing probability via:</p>
        <p>P(y = 1|b, s) = σ(W_out r_b + b_out)   (5)</p>
        <p>where W_out and b_out are the model parameters and σ is the logistic function.</p>
      </sec>
      <sec id="sec-4-6">
        <title>Header Network</title>
        <p>The probability estimation in Equation 5 does not consider the headers of the emails. For the email datasets with headers available, we can model the headers with a separate network and use the resulting representation as additional features to estimate the phishing probability. In this work, we consider the header s of the initial email e as a sequence of words/tokens (x_1, x_2, ..., x_H) where x_i is the i-th word in the header and H is the length of the header. In order to compute the representation vector r_s for s, we employ the same network architecture as the word level layer in the body network, using separate modules for embedding, bidirectional LSTM, and attention (i.e., Section 3.1.1). An overview of this header network is presented in Figure 2.</p>
        <p>Once the header representation vector r_s is generated, we concatenate it with the body representation vector r_b obtained from the body network, leading to the final representation vector r = [r_b, r_s] to compute the probability P(y = 1|b, s) = σ(W_sub r + b_sub) (W_sub and b_sub are model parameters).</p>
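        <p>A minimal sketch of this combination step (names and dimensions are ours):</p>
        <preformat>
import torch
import torch.nn as nn

r_b = torch.randn(120)             # body vector from the body network
r_s = torch.randn(120)             # header vector from the header network
sub = nn.Linear(240, 1)            # W_sub, b_sub
r = torch.cat([r_b, r_s])          # r = [r_b, r_s]
p_phish = torch.sigmoid(sub(r))    # P(y = 1 | b, s)
        </preformat>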
        <p>In order to train the models in this work, we minimize the negative log-likelihood of the models on a training dataset, in which the negative log-likelihood for the email e is computed by:</p>
        <p>L_c = -log(P(y = 1|e))   (6)</p>
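        <p>Equation 6 is the usual binary cross-entropy objective; a minimal sketch with hypothetical predictions:</p>
        <preformat>
import torch
import torch.nn.functional as F

p = torch.tensor([0.9, 0.2])     # predicted P(y = 1 | e) for two emails
y = torch.tensor([1.0, 0.0])     # gold labels: phishing, legitimate
loss_c = F.binary_cross_entropy(p, y)   # mean negative log-likelihood
        </preformat>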
        <p>The model we have described so far is called H-LSTMs for convenience.</p>
      </sec>
      <sec id="sec-4-7">
        <title>Supervised Attention</title>
        <p>The attention mechanism in the body and header networks is expected to assign high weights to the informative words/sentences and downgrade the irrelevant words/sentences for phishing detection in the emails. However, this ideal operation can only be achieved when an enormous training dataset is provided to train the models. In our case of phishing email detection, the size of the training dataset is not large enough and we might not be able to exploit the full advantages of the attention. In this work, we seek useful heuristics for the problem and inject them into the models to facilitate the operation of the attention mechanism. In particular, we would first heuristically decide a score for every word in the sentences so that the words with higher scores are considered as being more important for phishing detection than those with lower scores. Afterward, the models would be encouraged to produce attention weights for words that are close to their heuristic importance scores. The expectation is that this mechanism would help to introduce our intuition into the attention weights to compensate for the small scale of the training dataset, potentially leading to a better performance of the models. Assuming the importance scores for the words in the sentence (v_{i,1}, v_{i,2}, ..., v_{i,K}) are (g_{i,1}, g_{i,2}, ..., g_{i,K}) respectively, we force the attention weights (α_{i,1}, α_{i,2}, ..., α_{i,K}) (Equation 1) to be close to the importance scores by penalizing the models that render a large squared difference between the attention weights and the importance scores. This amounts to adding the squared difference to the objective function in Equation 6:</p>
        <p>L_e = L_c + λ Σ_{i,j} (g_{i,j} - α_{i,j})^2   (7)</p>
        <p>where λ is a trade-off constant.</p>
      </sec>
      <sec id="sec-4-8">
        <title>Importance Score Computation</title>
        <p>In order to compute the importance scores, our intuition is that a word is important for phishing detection if it appears frequently in phishing emails and less frequently in legitimate emails. The fact that an important word does not appear in many legitimate emails helps to eliminate the common words that are used in most documents. Consequently, the frequent words that are specific to the phishing emails would receive higher importance scores in our method. Note that our method to find the important words for phishing emails is different from the prior work that has only considered the most frequent words in the phishing emails and ignored their appearance in the legitimate emails.</p>
        <p>We compute the importance scores as follows. For every word v in the vocabulary, we count the number of the phishing and legitimate emails in a training dataset that contain the word. We call the results the phishing email frequency and the legitimate email frequency respectively for v. In the next step, we sort the words in the vocabulary based on their phishing and legitimate email frequencies in descending order. After that, a word v would have a phishing rank (phishingRank(v)) and a legitimate rank (legitimateRank(v)) in the sorted word sequences based on the phishing and legitimate frequencies (the higher the rank is, the lower the frequency is). Given these ranks, the unnormalized importance score for v is computed by:</p>
        <p>score[v] = legitimateRank[v] / phishingRank[v]   (8)</p>
        <p>The rationale for this formula is that a word would have a high importance score for phishing prediction if its legitimate rank is high and its phishing rank is low. Note that we use the ranks of the words instead of the frequencies because the frequencies are affected by the size of the training dataset, potentially making the scores unstable. The ranks are less affected by the dataset size and provide a more stable measure. (The actual importance scores of the words we use in Equation 7 are normalized for each sentence.) Table 1 shows the top 20 words with the highest unnormalized importance scores in our vocabulary.</p>
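        <p>A sketch of this procedure under our assumptions (emails given as token lists with binary labels; rank 1 corresponds to the most frequent word):</p>
        <preformat>
from collections import Counter

def importance_scores(emails, labels):
    # document frequencies of each word in phishing vs. legitimate emails
    phish_df, legit_df = Counter(), Counter()
    for tokens, y in zip(emails, labels):
        df = phish_df if y == 1 else legit_df
        for v in set(tokens):
            df[v] += 1
    vocab = set(phish_df) | set(legit_df)
    def ranks(df):
        # rank 1 = most frequent word under this frequency
        ordered = sorted(vocab, key=lambda v: -df[v])
        return {v: r + 1 for r, v in enumerate(ordered)}
    p_rank, l_rank = ranks(phish_df), ranks(legit_df)
    return {v: l_rank[v] / p_rank[v] for v in vocab}   # Eq. (8)

scores = importance_scores(
    [["verify", "your", "account"], ["lunch", "tomorrow"]], [1, 0])
        </preformat>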
        <p>The H-LSTMs model augmented with the supervised attention mechanism above is called H-LSTMs+supervised in the experiments.</p>
        <p>Table 1: The top 20 words with the highest unnormalized importance scores: account, your, click, mailbox, cornell, link, verify, customer, access, reserved, dear, log, accounts, paypal, complete, service, protect, secure, mail, clicking.</p>
        <p>We train the models in this work with stochastic gradient descent, shuffled mini-batches and the Adam update rule [KB14]. The gradients are computed via back-propagation while dropout is used for regularization [SHK+14]. We also implement gradient clipping to rescale the Frobenius norms of the non-embedding weights if they exceed a predefined threshold.</p>
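        <p>A minimal sketch of this optimization setup on a stand-in model; note that we use PyTorch's gradient-norm clipping here as an approximation of the weight-norm rescaling described above:</p>
        <preformat>
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 1), nn.Sigmoid())   # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=0.0025)
x = torch.randn(32, 10)                                 # one mini-batch
y = torch.randint(0, 2, (32, 1)).float()

optimizer.zero_grad()
loss = nn.functional.binary_cross_entropy(model(x), y)
loss.backward()                                         # back-propagation
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.3)
optimizer.step()                                        # Adam update
        </preformat>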
        <p>The models in this work are developed to participate in the First Security and Privacy Analytics Anti-Phishing Shared Task (IWSPA-AP 2018) [EDB+18]. The organizers provide two datasets to train the models for email phishing recognition. The first dataset involves emails that only have the body part (called data-no-header) while the second dataset contains emails with both bodies and headers (called data-full-header). These two datasets translate into two shared tasks to be solved by the participants. The statistics of the training data for these two datasets are shown in Table 2.</p>
        <p>Table 2: Statistics of the training data. data-no-header: 5092 legitimate emails (#legit), 629 phishing emails (#phish); data-full-header: 4082 legitimate emails, 503 phishing emails.</p>
        <p>The raw test data (i.e., without labels) for these datasets are released to the participants at a specified time. The participants would have one week to run their systems on such raw test data and submit the results to the organizers for evaluation.</p>
        <p>Regarding the preprocessing procedure for the datasets, we notice that a large part of the text in the email bodies is quite unstructured. The sentences are often short and/or not clearly separated by the ending-sentence symbols (i.e., ".", "!", "?"). In order to split the bodies of the emails into sentences for our models, we develop an in-house sentence splitter specially designed for the datasets. In particular, we determine the beginning of a sentence by considering if the first word of a new line is capitalized or not, or if a capitalized word immediately follows an ending-sentence symbol. The sentences whose lengths (numbers of words) are less than 3 are combined to create a longer sentence. This reduces the number of sentences significantly and expands the context for the words in the sentences as they are processed by the models. Figure 3 shows a phishing email from the datasets.</p>
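        <p>A simplified approximation of this splitter (it implements the capitalization and ending-symbol heuristics and the merging of short sentences, but omits the dataset-specific refinements):</p>
        <preformat>
def split_sentences(body, min_len=3):
    pieces, current = [], []
    for line in body.splitlines():
        tokens = line.split()
        # a new line starting with a capitalized word begins a new sentence
        if tokens and tokens[0][0].isupper() and current:
            pieces.append(current)
            current = []
        for token in tokens:
            current.append(token)
            if token[-1] in '.!?':          # ending-sentence symbol
                pieces.append(current)
                current = []
    if current:
        pieces.append(current)
    # sentences shorter than min_len words are merged into a longer one
    merged = []
    for p in pieces:
        if not merged or len(merged[-1]) >= min_len:
            merged.append(p)
        else:
            merged[-1] += p
    return [' '.join(p) for p in merged]
        </preformat>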
        <p>In order to see how well the proposed deep learning models (i.e., H-LSTMs and H-LSTMs+supervised) perform with respect to the traditional methods for email phishing detection, we compare the proposed models with a baseline model based on Support Vector Machines (SVM) [CNU06]. We use the tf-idf scores of the words in the vocabulary as the features for this baseline [CNU06]. Note that since the email addresses and URLs in the provided datasets have been mostly hidden to protect personal information, we cannot use them as features in our SVM baselines as the previous systems do. In addition, we examine the performance of this baseline when the pre-trained word embeddings are included in its feature set. This allows a fairer comparison of SVM with the deep learning models in this work that take pre-trained word embeddings as the input.</p>
        <p>We employ the implementation of linear and nonlinear (kernel) SVM from the sklearn library [PVG+11], for which the tf-idf representations of the emails are obtained via the gensim toolkit [RS10]. The word embedding features are computed by taking the mean vector of the pre-trained embeddings of the words in the emails [NPG15].</p>
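        <p>A sketch of these baselines (for brevity, tf-idf is computed here with sklearn's vectorizer rather than gensim, and the mean embedding is a random stand-in; the hyper-parameter values follow Section 4.3):</p>
        <preformat>
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC, LinearSVC

emails = ["please verify your account now", "meeting notes attached",
          "click the link to keep mailbox access", "see you at lunch"]
labels = [1, 0, 1, 0]

tfidf = TfidfVectorizer().fit_transform(emails).toarray()
mean_emb = np.random.randn(len(emails), 300)   # stand-in for mean word2vec
features = np.hstack([tfidf, mean_emb])

linear = LinearSVC(C=10.0).fit(features, labels)                     # linear SVM
kernel = SVC(C=50.0, gamma=0.1, kernel='rbf').fit(features, labels)  # C-SVC
        </preformat>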
      </sec>
      <sec id="sec-4-9">
        <title>Hyper-parameter Selection</title>
        <p>As the size of the provided datasets is small and no development data is included, we use a 5-fold stratified cross-validation on the training data of the provided datasets to search for the best hyper-parameters for the models. The hyper-parameters we found are as follows.</p>
        <p>The size of the word embedding vectors is 300 while the cell sizes are set to 60 for all the LSTMs in the body and header networks. The sizes of the attention vectors at the attention modules for the body and header networks are also set to 60. The coefficient λ for the supervised attention is set to 0.1, the threshold for gradient clipping is 0.3 and the drop rate for dropout is 0.5. For the Adam update rule, we use a learning rate of 0.0025. Finally, we set C = 10.0 for the linear SVM baseline. The nonlinear version of SVM we use is C-SVC with the radial basis function kernel and (C, γ) = (50.0, 0.1).</p>
      </sec>
      <sec id="sec-4-10">
        <title>Results</title>
        <p>In the experiments below, we employ the precision,
recall and F1-score to evaluate the performance of the
models for detecting phishing emails. In addition, the
proposed models H-LSTMs and H-LSTMs+supervised
only utilize the header network in the evaluation on
data-full-header.</p>
      </sec>
      <sec id="sec-4-11">
        <title>Data without header</title>
        <p>In the first experiment, we focus on the first shared task where email headers are not considered. We compare the proposed deep learning models with the SVM baselines. In particular, in the first setting, we use data-no-header as the training data and perform a 5-fold stratified cross-validation to evaluate the models. In the second setting, data-no-header is also utilized as the training data, but the bodies extracted from data-full-header (along with the corresponding labels) are employed as the test data. The results of the first setting are shown in Table 3 while the results of the second setting are presented in Table 4. Note that we report the performance of the SVM baselines when different combinations of the two types of features (i.e., tf-idf and word embeddings) are employed in these tables.</p>
        <p>The first observation from the tables is that the effect of the word embedding features on the SVM models is quite mixed. It improves the SVM models with just tf-idf features significantly in the first experiment setting while the effect is somewhat negative in the second experiment setting. Second, we see that the two versions of hierarchical LSTMs (i.e., H-LSTMs and H-LSTMs+supervised) outperform the baseline SVM models in both experiment settings. The performance improvement is significant with large margins (up to 2.7% improvement on the absolute F1 score) in the second experiment setting (i.e., Table 4). The main gain is due to the recall, demonstrating the generalization advantages of the proposed deep learning models over the traditional methods for phishing detection with SVM. Comparing H-LSTMs+supervised and H-LSTMs, we see that H-LSTMs+supervised is consistently better than H-LSTMs with significant improvement in the second setting. This shows the benefits of supervised attention for hierarchical LSTM models for email phishing detection. Finally, we see that the performance in the first setting is in general much better than that in the second setting. We attribute this to the fact that the text data in data-no-header and data-full-header is quite different, leading to the mismatch between the data distributions of the training data and test data in the second experiment setting.</p>
        <p>In the final submission for the first shared task (i.e., without email headers), we combine the training data from data-no-header with the extracted bodies (along with the corresponding labels) from the training data of data-full-header to generate a new training set. As H-LSTMs+supervised is the best model in this development experiment, we train it on the new training set and use the trained model to make predictions for the actual test set of the first shared task.</p>
      </sec>
      <sec id="sec-4-12">
        <title>Data with full header</title>
        <p>In this experiment, we aim to evaluate if the header network can help to improve the performance of H-LSTMs. We take the training dataset from data-full-header to perform a 5-fold cross-validation evaluation. The performance of H-LSTMs when the header network is included or excluded is shown in Table 5.</p>
        <p>Table 5: Performance of H-LSTMs with the header network included or excluded (models: H-LSTMs (only body) and H-LSTMs + headers).</p>
        <p>From the table, we see that the header network is also helpful for H-LSTMs as it helps to improve the performance of H-LSTMs on the dataset with email headers (a 0.7% improvement on the F1 score).</p>
        <p>In the final submission for the second shared task (i.e., with email headers), we simply train our best model in this setting (i.e., H-LSTMs+supervised) on the training dataset of data-full-header.</p>
        <p>The time for the training and test process of the proposed (and submitted) models is shown in Table 6. Note that the training time of H-LSTMs+supervised (for the first shared task) is longer than that of H-LSTMs+headers+supervised (for the second shared task) since the training data of the former model includes both the original training data of the first task and the extracted bodies from the training data of the second task. The test data of the first shared task with H-LSTMs+supervised is also larger than that of the second shared task with H-LSTMs+headers+supervised.</p>
        <p>Table 6: Training and test times of the submitted models. H-LSTMs+supervised: 3.7 hours of training, 4 minutes of test. H-LSTMs+headers+supervised: 1.5 hours of training, 1 minute of test.</p>
        <p>As we can see from the tables, our systems achieve the best performance for the first shared task and the second-best performance for the second shared task. These results are very promising and demonstrate the advantages of the proposed methods in particular and deep learning in general for the problem of email phishing recognition.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We present a deep learning model to detect phishing emails. Our model employs hierarchical attentive LSTMs to model the email bodies at both the word level and the sentence level. A header network with attentive LSTMs is also incorporated to model the headers of the emails. In the models, we propose a novel supervised attention technique to improve the performance using the email frequency ranking of the words in the vocabulary. Several experiments are conducted to demonstrate the effectiveness of the proposed models for phishing email detection.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1"><mixed-citation>[ANNWN07] Saeed Abu-Nimeh, Dario Nappa, Xinlei Wang, and Suku Nair. A comparison of machine learning techniques for phishing detection. In Proceedings of the Anti-Phishing Working Groups 2nd Annual eCrime Researchers Summit, pages 60-69. ACM, 2007.</mixed-citation></ref>
      <ref id="ref2"><mixed-citation>[BCB14] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.</mixed-citation></ref>
      <ref id="ref3"><mixed-citation>[BCP+08] Andre Bergholz, Jeong Ho Chang, Gerhard Paass, Frank Reichartz, and Siehyun Strobel. Improved phishing detection using model-based features. In CEAS, 2008.</mixed-citation></ref>
      <ref id="ref4"><mixed-citation>[BMS08] Ram Basnet, Srinivas Mukkamala, and Andrew H. Sung. Detection of phishing attacks: A machine learning approach. In Soft Computing Applications in Industry, pages 373-383. Springer, 2008.</mixed-citation></ref>
      <ref id="ref5"><mixed-citation>[CNU06] Madhusudhanan Chandrasekaran, Krishnan Narayanan, and Shambhu Upadhyaya. Phishing email detection based on structural properties. In NYS Cyber Security Conference, volume 3, 2006.</mixed-citation></ref>
      <ref id="ref6"><mixed-citation>[DTH06] Rachna Dhamija, J. Doug Tygar, and Marti Hearst. Why phishing works. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 581-590. ACM, 2006.</mixed-citation></ref>
      <ref id="ref7"><mixed-citation>[EDB+18] Ayman Elaassal, Avisha Das, Shahryar Baki, Luis De Moraes, and Rakesh Verma. IWSPA-AP: Anti-phishing shared task at ACM International Workshop on Security and Privacy Analytics. In Proceedings of the 1st IWSPA Anti-Phishing Shared Task. CEUR, 2018.</mixed-citation></ref>
      <ref id="ref8"><mixed-citation>[FST07] Ian Fette, Norman Sadeh, and Anthony Tomasic. Learning to detect phishing emails. Pages 649-656. ACM, 2007.</mixed-citation></ref>
      <ref id="ref9"><mixed-citation>[GBB11] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, 2011.</mixed-citation></ref>
      <ref id="ref10"><mixed-citation>[GS05] Alex Graves and Jurgen Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602-610, 2005.</mixed-citation></ref>
      <ref id="ref11"><mixed-citation>[GTJA17] B. B. Gupta, Aakanksha Tewari, Ankit Kumar Jain, and Dharma P. Agrawal. Fighting against phishing attacks: state of the art and future challenges. Neural Computing and Applications, 28(12):3629-3654, 2017.</mixed-citation></ref>
      <ref id="ref12"><mixed-citation>[HA11] Isredza Rahmi A. Hamid and Jemal Abawajy. Hybrid feature selection for phishing email detection. In International Conference on Algorithms and Architectures for Parallel Processing, pages 266-275. Springer, 2011.</mixed-citation></ref>
      <ref id="ref13"><mixed-citation>[HS97] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. In Neural Computation, 1997.</mixed-citation></ref>
      <ref id="ref14"><mixed-citation>[JJJM07] Tom N. Jagatic, Nathaniel A. Johnson, Markus Jakobsson, and Filippo Menczer. Social phishing. Communications of the ACM, 50(10):94-100, 2007.</mixed-citation></ref>
      <ref id="ref15"><mixed-citation>[KB14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In arXiv:1412.6980, 2014.</mixed-citation></ref>
      <ref id="ref16"><mixed-citation>[LCLZ17] Shulin Liu, Yubo Chen, Kang Liu, and Jun Zhao. Exploiting argument information to improve event detection via supervised attention mechanisms. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1789-1798, 2017.</mixed-citation></ref>
      <ref id="ref17"><mixed-citation>[LUF+16] Lemao Liu, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. Neural machine translation with supervised attention. arXiv preprint arXiv:1609.04186, 2016.</mixed-citation></ref>
      <ref id="ref18"><mixed-citation>[LV12] V. Santhana Lakshmi and M. S. Vijaya. Efficient prediction of phishing websites using supervised learning algorithms. Procedia Engineering, 30:798-805, 2012.</mixed-citation></ref>
      <ref id="ref19"><mixed-citation>[LXLZ15] Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. Recurrent convolutional neural networks for text classification. In AAAI, volume 333, pages 2267-2273, 2015.</mixed-citation></ref>
      <ref id="ref20"><mixed-citation>[MAS+11] Anutthamaa Martin, Na Anutthamaa, M. Sathyavathy, Marie Manjari Saint Francois, V. Prasanna Venkatesan, et al. A framework for predicting phishing websites using neural networks. arXiv preprint arXiv:1109.1074, 2011.</mixed-citation></ref>
      <ref id="ref21"><mixed-citation>[MSC+13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013.</mixed-citation></ref>
      <ref id="ref22"><mixed-citation>[MTM14] Rami M. Mohammad, Fadi Thabtah, and Lee McCluskey. Predicting phishing websites based on self-structuring neural network. Neural Computing and Applications, 25(2):443-458, 2014.</mixed-citation></ref>
      <ref id="ref23"><mixed-citation>[MWI16] Haitao Mi, Zhiguo Wang, and Abe Ittycheriah. Supervised attentions for neural machine translation. arXiv preprint arXiv:1608.00112, 2016.</mixed-citation></ref>
      <ref id="ref24"><mixed-citation>[NG15a] Thien Huu Nguyen and Ralph Grishman. Event detection and domain adaptation with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 365-371, 2015.</mixed-citation></ref>
      <ref id="ref25"><mixed-citation>[NG15b] Thien Huu Nguyen and Ralph Grishman. Relation extraction: Perspective from convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 39-48, 2015.</mixed-citation></ref>
      <ref id="ref26"><mixed-citation>[NG16] Thien Huu Nguyen and Ralph Grishman. Modeling skip-grams for event detection with convolutional neural networks. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016.</mixed-citation></ref>
      <ref id="ref27"><mixed-citation>[NPG15] Thien Huu Nguyen, Barbara Plank, and Ralph Grishman. Semantic representations for domain adaptation: A case study on the tree kernel-based method for relation extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, 2015.</mixed-citation></ref>
      <ref id="ref28"><mixed-citation>[phi17a] 2017 data breach report finds phishing, email attacks still potent. https://digitalguardian.com/blog/2017-data-breach-report-finds-phishing-email-attacks-still-potent, 2017.</mixed-citation></ref>
      <ref id="ref29"><mixed-citation>[phi17b] Phishing scams cost American businesses half a billion dollars a year. Forbes, 2017.</mixed-citation></ref>
      <ref id="ref30"><mixed-citation>[pod16] How John Podesta's emails were hacked and how to prevent it from happening to you. Forbes, 2016.</mixed-citation></ref>
      <ref id="ref31"><mixed-citation>[PVG+11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.</mixed-citation></ref>
      <ref id="ref32"><mixed-citation>[RS10] Radim Rehurek and Petr Sojka. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010.</mixed-citation></ref>
      <ref id="ref33"><mixed-citation>[SHK+14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. In The Journal of Machine Learning Research, 2014.</mixed-citation></ref>
      <ref id="ref34"><mixed-citation>[SNM15] Hossein Siadati, Toan Nguyen, and Nasir Memon. Verification code forwarding attack (short paper). In International Conference on Passwords, pages 65-71. Springer, 2015.</mixed-citation></ref>
      <ref id="ref35"><mixed-citation>[SNM17] Hossein Siadati, Toan Nguyen, and Nasir Memon. X-platform phishing: Abusing trust for targeted attacks (short paper). In International Conference on Financial Cryptography and Data Security, pages 587-596. Springer, 2017.</mixed-citation></ref>
      <ref id="ref36"><mixed-citation>[YYD+16] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480-1489, 2016.</mixed-citation></ref>
      <ref id="ref37"><mixed-citation>[ZY12] Phishing detection using neural network. CS229 lecture notes, 2012.</mixed-citation></ref>
    </ref-list>
  </back>
</article>