A Deep Learning Model with Hierarchical LSTMs and Supervised Attention for Anti-Phishing

Minh Nguyen
Hanoi University of Science and Technology
Hanoi, Vietnam
minh.nv142950@sis.hust.edu.vn

Toan Nguyen
New York University
Brooklyn, New York, USA
toan.v.nguyen@nyu.edu

Thien Huu Nguyen
University of Oregon
Eugene, Oregon, USA
thien@cs.uoregon.edu

Abstract

Anti-phishing aims to detect phishing content/documents in a pool of textual data. This is an important problem in cybersecurity that can help to guard users from fraudulent information. Natural language processing (NLP) offers a natural solution for this problem, as it is capable of analyzing the textual content to perform intelligent recognition. In this work, we investigate state-of-the-art techniques for text categorization in NLP to address the problem of anti-phishing for emails (i.e., predicting whether an email is phishing or not). These techniques are based on deep learning models that have recently attracted much attention from the community. In particular, we present a framework with hierarchical long short-term memory networks (H-LSTMs) and attention mechanisms to model the emails simultaneously at the word and the sentence level. Our expectation is to produce an effective model for anti-phishing and to demonstrate the effectiveness of deep learning for problems in cybersecurity.

Copyright © by the paper's authors. Copying permitted for private and academic purposes.
In: R. Verma, A. Das (eds.): Proceedings of the 1st Anti-Phishing Shared Pilot at the 4th ACM International Workshop on Security and Privacy Analytics (IWSPA 2018), Tempe, Arizona, USA, 21-03-2018, published at http://ceur-ws.org

1 Introduction

Despite being one of the oldest tactics, email phishing remains the most common attack used by cybercriminals [phi17a] due to its effectiveness. Phishing attacks exploit users' inability to distinguish legitimate information from fake information sent to them [DTH06, SNM15, SNM17, SNG+17]. In an email phishing campaign, attackers send emails appearing to be from well-known enterprises or organizations directly to their victims, or they use spoofed emails [Sin05]. These emails try to lure victims into divulging their private information [JJJM07, SNM15, SNG+17] or visiting an impersonated site (e.g., a fake banking website), on which they will be asked for passwords, credit card numbers or other sensitive information. The recent hack of a high-profile US politician (usually referred to as "John Podesta's hack") is a famous example of this type of attack. It all started with a spoofed email sent to the victim asking him to reset his Gmail password by clicking on a link in the email [pod16]. The technique of email phishing may seem simple, yet the damage it causes is huge. In the US alone, the estimated cost of phishing emails to business is half a billion dollars per year [phi17b].

Numerous methods have been proposed to automatically detect phishing emails [BCP+08, FST07, ANNWN07, GTJA17]. Chandrasekaran et al. proposed to use structural properties of emails and Support Vector Machines (SVM) to classify phishing emails [CNU06]. In [ANNWN07], Abu-Nimeh et al. evaluated six machine learning classifiers on a public phishing email dataset using 43 proposed features. Gupta et al. [GTJA17] presented a survey of recent state-of-the-art research on phishing detection. However, these methods mainly rely on feature engineering efforts to generate characteristics (features) to represent emails, over which machine learning methods can be applied to perform the task. Such feature engineering is often done manually and still requires much labor and domain expertise. This has hindered the portability of the systems to new domains and limited the performance of the current systems.

In order to overcome this problem, our work focuses on deep learning techniques to solve the problem of phishing email detection. The major benefit of deep learning is its ability to automatically induce effective and task-specific representations from data that can be used as features to recognize phishing emails. As deep learning has been shown to achieve state-of-the-art performance for many natural language processing tasks, including text categorization [GBB11, LXLZ15], information extraction [NG15b, NG15a, NG16] and machine translation [BCB14], among others, we expect that it would also help to build effective systems for phishing email detection.

We present a new deep learning model to solve the problem of email phishing prediction using hierarchical long short-term memory networks (H-LSTMs) augmented with a supervised attention technique. In the hierarchical LSTM model [YYD+16], emails are treated as hierarchical architectures with words at the lower level (the word level) and sentences at the upper level (the sentence level). LSTM models are first applied at the word level, and their results are passed to LSTM models at the sentence level to generate a representation vector for the entire email. The outputs of the LSTM models at the two levels are combined using the attention mechanism [BCB14], which assigns contribution weights to the words and sentences in the emails. A header network is also integrated to model the headers of the emails when they are available. In addition, we propose a novel technique to supervise the attention mechanism [MWI16, LUF+16, LCLZ17] at the word level of the hierarchical LSTMs based on the appearance rank of the words in the vocabulary. Experiments on the datasets for phishing email detection in the First Security and Privacy Analytics Anti-Phishing Shared Task (IWSPA-AP 2018) [EDB+18] demonstrate the benefits of the proposed models, which are ranked among the top positions of the participating systems of the shared task (in terms of performance on the unseen test data).

2 Related Work

Phishing email detection is a classic problem; however, research on this topic often shares the same limitation: there is no official, large dataset for it. Most previous works typically used a public set consisting of legitimate or "ham" emails¹ and another public set of phishing emails² for their classification evaluation [FST07, BCP+08, BMS08, HA11, ZY12]. Other works used private but small datasets [CNU06, ANNWN07]. In addition, the ratio between phishing and legitimate emails in these datasets was typically balanced. This is not the case in the real-world scenario, where the number of legitimate emails is much larger than that of phishing emails. Our current work relies on larger datasets with unbalanced distributions of phishing and legitimate emails collected for the First Security and Privacy Analytics Anti-Phishing Shared Task (IWSPA-AP 2018) [EDB+18].

Besides the limitation of small datasets, previous work has relied extensively on feature engineering to manually find representative features for the problem. Apart from features extracted from the emails themselves, [LV12] also uses a blacklist of phishing websites to obtain an additional feature for the URLs appearing in emails. Some neural network systems have also been introduced to detect such blacklists [MTM14, MAS+11]. This is undesirable because these engineered features need to be updated once new types of phishing emails with new content appear. Our work differs from the previous work in this area in that we automate the feature engineering process with a deep learning model. This allows us to automatically learn effective features for phishing email detection from data. Deep learning has recently been employed for feature extraction with success on many natural language processing problems [NG15b, NG15a].

¹ https://spamassassin.apache.org/old/publiccorpus
² https://monkey.org/~jose/phishing

3 Proposed Model

Phishing email detection is a binary classification problem that can be formalized as follows. Let e = {b, s} be an email in which b and s are the body content and the header of the email respectively. Let y be the binary variable indicating whether e is a phishing email or not (y = 1 if e is a phishing email and y = 0 otherwise). In order to predict the legitimacy of the email, our goal is to estimate the probability P(y = 1|e) = P(y = 1|b, s). In the following, we describe our methods to model the body b and the header s with the body network and the header network respectively to achieve this goal.

3.1 Body Network with Hierarchical LSTMs

For the body b, we view it as a sequence of sentences b = (u_1, u_2, ..., u_L), where u_i is the i-th sentence and L is the number of sentences in the email body b. Each sentence u_i is in turn a sequence of words/tokens u_i = (v_{i,1}, v_{i,2}, ..., v_{i,K}), with v_{i,j} the j-th token in u_i and K the length of the sentence. Note that we set L and K to fixed values by padding the sentences u_i and the body b with dummy symbols.

As there are two levels of information in b (i.e., the word level with the words v_{i,j} and the sentence level with the sentences u_i), we consider a hierarchical network that involves two layers of bidirectional long short-term memory networks (LSTMs) to model such information. In particular, the first layer consumes the words in the sentences via an embedding module, a bidirectional LSTM module and an attention module to obtain representation vectors for every sentence u_i in b (the word level layer). Afterward, the second network layer combines the representation vectors from the first layer with another bidirectional LSTM and attention module, leading to a representation vector for the whole email body b (the sentence level layer). This body representation vector is then used as features to estimate P(y|b, s) and make the prediction for the initial email e.

Figure 1: Hierarchical LSTMs.

3.1.1 The Word Level Layer

Embedding

In the word level layer, every word v_{i,j} in each sentence u_i in b is first transformed into its embedding vector w_{i,j}. In this paper, w_{i,j} is retrieved by taking the corresponding column vector in the word embedding matrix W_e [MSC+13] that has been pre-trained on a large corpus: w_{i,j} = W_e[v_{i,j}] (each column in the matrix W_e corresponds to a word in the vocabulary). As the result of this embedding step, every sentence u_i = (v_{i,1}, v_{i,2}, ..., v_{i,K}) in b is converted into a sequence of vectors (w_{i,1}, w_{i,2}, ..., w_{i,K}), constituting the input for the bidirectional LSTM model in the next step.

Bidirectional LSTMs for the word level

This module employs two LSTMs [HS97, GS05] that run over each input vector sequence (w_{i,1}, w_{i,2}, ..., w_{i,K}) in two different directions: forward (from w_{i,1} to w_{i,K}) and backward (from w_{i,K} to w_{i,1}). Along their operations, the forward LSTM generates the forward hidden vector sequence (→h_{i,1}, →h_{i,2}, ..., →h_{i,K}) while the backward LSTM produces the backward hidden vector sequence (←h_{i,1}, ←h_{i,2}, ..., ←h_{i,K}). These two hidden vector sequences are then concatenated at each position, resulting in the new hidden vector sequence (h_{i,1}, h_{i,2}, ..., h_{i,K}) for the sentence u_i in b, where h_{i,j} = [→h_{i,j}, ←h_{i,j}]. The notable characteristic of the hidden vector h_{i,j} is that it encodes the context information of the whole sentence u_i due to the recurrent property of the forward and backward LSTMs, although a greater focus is put on the current word v_{i,j}.

Attention

In this module, the vectors in the hidden vector sequence (h_{i,1}, h_{i,2}, ..., h_{i,K}) are combined to generate a single representation vector for the initial sentence u_i. The attention mechanism [YYD+16] does this by computing a weighted sum of the vectors in the sequence. Each hidden vector h_{i,j} is assigned a weight α_{i,j} that estimates its importance/contribution to the representation vector of u_i for the phishing prediction of the email e. In this work, the weight α_{i,j} for h_{i,j} is computed by:

    α_{i,j} = exp(a_{i,j}^T w_a) / Σ_{j'} exp(a_{i,j'}^T w_a)    (1)

in which

    a_{i,j} = tanh(W_att h_{i,j} + b_att)    (2)

Here, W_att, b_att and w_a are model parameters learnt during the training process. Consequently, the representation vector û_i for the sentence u_i in b is:

    û_i = Σ_j α_{i,j} h_{i,j}    (3)

After the word level layer completes its operation on every sentence of b = (u_1, u_2, ..., u_L), we obtain a corresponding sequence of sentence representation vectors (û_1, û_2, ..., û_L). This vector sequence is combined in the next sentence level layer to generate a single vector representing b for phishing prediction.

3.1.2 The Sentence Level Layer

The sentence level layer processes the vector sequence (û_1, û_2, ..., û_L) in the same way that the word level layer processes the vector sequence (w_{i,1}, w_{i,2}, ..., w_{i,K}) for each sentence u_i. Specifically, (û_1, û_2, ..., û_L) is also first fed into a bidirectional LSTM module (i.e., with a forward and a backward LSTM) whose results are concatenated at each position to produce the corresponding hidden vector sequence (ĥ_1, ĥ_2, ..., ĥ_L). In the next step, with the attention module, the vectors in (ĥ_1, ĥ_2, ..., ĥ_L) are weighted and summed to finally generate the representation vector r_b for the email body b of e.
Assuming the attention weights for (ĥ_1, ĥ_2, ..., ĥ_L) are (β_1, β_2, ..., β_L) respectively, the body vector r_b is then computed by:

    r_b = Σ_i β_i ĥ_i    (4)

Note that the model parameters of the bidirectional LSTM modules (and the attention modules) in the word level layer and the sentence level layer are separate, and they are both learnt in a single training process. Figure 1 shows an overview of the body network with hierarchical LSTMs and attention.

Once the body vector r_b has been computed, we can use it as features to estimate the phishing probability via:

    P(y = 1|b, s) = σ(W_out r_b + b_out)    (5)

where W_out and b_out are model parameters and σ is the logistic function.

3.2 Header Network

The probability estimation in Equation 5 does not consider the headers of the emails. For the email datasets with headers available, we can model the headers with a separate network and use the resulting representation as additional features to estimate the phishing probability. In this work, we consider the header s of the initial email e as a sequence of words/tokens (x_1, x_2, ..., x_H), where x_i is the i-th word in the header and H is the length of the header. In order to compute the representation vector r_s for s, we employ the same network architecture as the word level layer in the body network, with separate modules for embedding, bidirectional LSTM and attention (i.e., Section 3.1.1). An overview of this header network is presented in Figure 2.

Figure 2: Hierarchical LSTMs with header network.

Once the header representation vector r_s is generated, we concatenate it with the body representation vector r_b obtained from the body network, leading to the final representation vector r = [r_b, r_s] to compute the probability P(y = 1|b, s) = σ(W_sub r + b_sub) (W_sub and b_sub are model parameters).

In order to train the models in this work, we minimize the negative log-likelihood of the models on a training dataset, in which the negative log-likelihood for the email e is computed by:

    L_c = −log(P(y = 1|e))    (6)

The model we have described so far is called H-LSTMs for convenience.

3.3 Supervised Attention

The attention mechanism in the body and header networks is expected to assign high weights to the informative words/sentences and to downgrade the irrelevant words/sentences for phishing detection in the emails. However, this ideal operation can only be achieved when an enormous training dataset is provided to train the models. In our case of phishing email detection, the training dataset is not large enough, and we might not be able to exploit the full advantages of the attention. In this work, we seek useful heuristics for the problem and inject them into the models to facilitate the operation of the attention mechanism. In particular, we first heuristically decide a score for every word in the sentences so that the words with higher scores are considered more important for phishing detection than those with lower scores. Afterward, the models are encouraged to produce attention weights that are close to these heuristic importance scores. The expectation is that this mechanism introduces our intuition into the attention weights to compensate for the small scale of the training dataset, potentially leading to a better performance of the models. Assuming the importance scores for the words in the sentence (v_{i,1}, v_{i,2}, ..., v_{i,K}) are (g_{i,1}, g_{i,2}, ..., g_{i,K}) respectively, we force the attention weights (α_{i,1}, α_{i,2}, ..., α_{i,K}) (Equation 1) to be close to the importance scores by penalizing models that render a large squared difference between the attention weights and the importance scores. This amounts to adding the squared difference to the objective function in Equation 6:

    L_e = L_c + λ Σ_{i,j} (g_{i,j} − α_{i,j})²    (7)

where λ is a trade-off constant.

Importance Score Computation

In order to compute the importance scores, our intuition is that a word is important for phishing detection if it appears frequently in phishing emails and less frequently in legitimate emails. The fact that an important word does not appear in many legitimate emails helps to eliminate the common words that are used in most documents. Consequently, the frequent words that are specific to the phishing emails receive higher importance scores in our method. Note that our method to find the important words for phishing emails differs from the prior work, which has only considered the most frequent words in the phishing emails and ignored their appearance in the legitimate emails.

We compute the importance scores as follows. For every word v in the vocabulary, we count the number of phishing and legitimate emails in a training dataset that contain the word. We call the results the phishing email frequency and the legitimate email frequency respectively for v. In the next step, we sort the words in the vocabulary by their phishing and legitimate email frequencies in descending order. After that, a word v has a phishing rank (phishingRank(v)) and a legitimate rank (legitimateRank(v)) in the sorted word sequences based on the phishing and legitimate frequencies (the higher the rank, the lower the frequency). Given these ranks, the unnormalized importance score for v is computed by:³

    score[v] = legitimateRank[v] / phishingRank[v]    (8)

The rationale for this formula is that a word has a high importance score for phishing prediction if its legitimate rank is high and its phishing rank is low. Note that we use the ranks of the words instead of the frequencies because the frequencies are affected by the size of the training dataset, potentially making the scores unstable. The ranks are less affected by the dataset size and provide a more stable measure. Table 1 shows the top 20 words with the highest unnormalized importance scores in our vocabulary.

    Word        Score
    account     21.45
    your        15.00
    click       14.11
    mailbox      9.59
    cornell      9.58
    link         9.37
    verify       8.83
    customer     8.63
    access       8.50
    reserved     8.03
    dear         7.85
    log          7.70
    accounts     7.61
    paypal       7.52
    complete     7.37
    service      7.15
    protect      6.95
    secure       6.94
    mail         6.70
    clicking     6.63

Table 1: Top 20 words with the highest scores.

The H-LSTMs model augmented with the supervised attention mechanism above is called H-LSTMs+supervised in the experiments.

³ The actual importance scores of the words we use in Equation 7 are normalized for each sentence.

3.3.1 Training

We train the models in this work with stochastic gradient descent, shuffled mini-batches and the Adam update rule [KB14]. The gradients are computed via back-propagation, while dropout is used for regularization [SHK+14]. We also implement gradient clipping to rescale the Frobenius norms of the non-embedding weights if they exceed a predefined threshold.

4 Evaluation

4.1 Datasets and Preprocessing

The models in this work were developed to participate in the First Security and Privacy Analytics Anti-Phishing Shared Task (IWSPA-AP 2018) [EDB+18]. The organizers provide two datasets to train the models for email phishing recognition. The first dataset involves emails that only have the body part (called data-no-header) while the second dataset contains emails with both bodies and headers (called data-full-header). These two datasets translate into two shared tasks to be solved by the participants. The statistics of the training data for these two datasets are shown in Table 2.

    Datasets          #legit  #phish
    data-no-header     5092     629
    data-full-header   4082     503

Table 2: Statistics of the data-no-header and data-full-header datasets. #legit and #phish are the numbers of legitimate and phishing emails respectively.
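The rank-based importance scores of Equation (8) can be illustrated with a short self-contained sketch. The toy corpus, the helper name `importance_scores` and the tie-breaking rule (ties in frequency broken alphabetically) are our own illustrative choices, not details from the paper; a real implementation would count document frequencies over the shared-task training emails.

```python
from collections import Counter

def importance_scores(phish_emails, legit_emails):
    """Unnormalized importance scores per Eq. (8):
    score[v] = legitimateRank[v] / phishingRank[v].

    Document frequency = number of emails containing the word.
    Ranks are 1-based positions after sorting by frequency in
    descending order, so rank 1 is the most frequent word and a
    higher rank means a lower frequency.
    """
    phish_df, legit_df = Counter(), Counter()
    for email in phish_emails:
        phish_df.update(set(email.lower().split()))
    for email in legit_emails:
        legit_df.update(set(email.lower().split()))
    vocab = set(phish_df) | set(legit_df)

    def ranks(df):
        # Words absent from a corpus have frequency 0 and get the
        # worst ranks; frequency ties are broken alphabetically.
        ordered = sorted(vocab, key=lambda v: (-df[v], v))
        return {v: i + 1 for i, v in enumerate(ordered)}

    phish_rank, legit_rank = ranks(phish_df), ranks(legit_df)
    return {v: legit_rank[v] / phish_rank[v] for v in vocab}

phish = ["verify your account now", "click to verify your mailbox"]
legit = ["meeting notes attached", "lunch at noon today"]
scores = importance_scores(phish, legit)
```

On this toy data, phishing-specific words such as "verify" (frequent in phishing emails, absent from legitimate ones) obtain much higher scores than words like "meeting" that only occur in legitimate emails, matching the intuition behind Equation (8).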
The raw test data (i.e., without labels) for these datasets are released to the participants at a specified time. The participants then have one week to run their systems on the raw test data and submit the results to the organizers for evaluation.

Regarding the preprocessing procedure for the datasets, we notice that a large part of the text in the email bodies is quite unstructured. The sentences are often short and/or not clearly separated by sentence-ending symbols (i.e., {. ! ?}). In order to split the bodies of the emails into sentences for our models, we developed an in-house sentence splitter specially designed for the datasets. In particular, we determine the beginning of a sentence by considering whether the first word of a new line is capitalized, or whether a capitalized word immediately follows a sentence-ending symbol. Sentences whose lengths (numbers of words) are less than 3 are combined to create a longer sentence. This reduces the number of sentences significantly and expands the context for the words in the sentences as they are processed by the models. Figure 3 shows a phishing email from the datasets.

Figure 3: A case in which splitting the body into sentences cannot be done as usual. (Phishing email: 28.txt in data-no-header).

4.2 Baselines

In order to see how well the proposed deep learning models (i.e., H-LSTMs and H-LSTMs+supervised) perform with respect to the traditional methods for email phishing detection, we compare the proposed models with a baseline model based on Support Vector Machines (SVM) [CNU06]. We use the tf-idf scores of the words in the vocabulary as the features for this baseline [CNU06]. Note that since the email addresses and URLs in the provided datasets have been mostly hidden to protect personal information, we cannot use them as features in our SVM baselines as the previous systems do. In addition, we examine the performance of this baseline when the pre-trained word embeddings are included in its feature set. This allows a fairer comparison of SVM with the deep learning models in this work, which take pre-trained word embeddings as input.

We employ the implementation of linear and nonlinear (kernel) SVM from the sklearn library [PVG+11], for which the tf-idf representations of the emails are obtained via the gensim toolkit [ŘS10]. The word embedding features are computed by taking the mean vector of the pre-trained embeddings of the words in the emails [NPG15].

4.3 Hyper-parameter Selection

As the size of the provided datasets is small and no development data is included, we use 5-fold stratified cross-validation on the training data of the provided datasets to search for the best hyper-parameters for the models. The hyper-parameters we found are as follows. The size of the word embedding vectors is 300, while the cell sizes are set to 60 for all the LSTMs in the body and header networks. The sizes of the attention vectors at the attention modules for the body and header networks are also set to 60. The λ coefficient for supervised attention is set to 0.1, the threshold for gradient clipping is 0.3 and the drop rate for dropout is 0.5. For the Adam update rule, we use a learning rate of 0.0025. Finally, we set C = 10.0 for the linear SVM baseline. The nonlinear version of SVM we use is C-SVC with the radial basis function kernel and (C, γ) = (50.0, 0.1).

4.4 Results

In the experiments below, we employ precision, recall and F1 score to evaluate the performance of the models for detecting phishing emails. In addition, the proposed models H-LSTMs and H-LSTMs+supervised only utilize the header network in the evaluation on data-full-header.

Data without header

In the first experiment, we focus on the first shared task, where email headers are not considered. We compare the proposed deep learning models with the SVM baselines. In particular, in the first setting, we use data-no-header as the training data and perform a 5-fold stratified cross-validation to evaluate the models. In the second setting, data-no-header is also utilized as the training data, but the bodies extracted from data-full-header (along with the corresponding labels) are employed as the test data. The results of the first setting are shown in Table 3 while the results of the second setting are presented in Table 4. Note that we report the performance of the SVM baselines with different combinations of the two types of features (i.e., tf-idf and word embeddings) in these tables.

    Models                Precision  Recall  F1
    H-LSTMs+supervised    0.9784     0.9466  0.9621
    H-LSTMs               0.9638     0.9448  0.9542
    Linear SVM
      +tfidf              0.9824     0.8856  0.9313
      +emb                0.9529     0.9206  0.9364
      +tfidf+emb          0.9837     0.9253  0.9536
    Kernel SVM
      +tfidf              0.9684     0.8730  0.9180
      +emb                0.9408     0.9141  0.9273
      +tfidf+emb          0.9714     0.9174  0.9436

Table 3: Performance comparison between the proposed models H-LSTMs and H-LSTMs+supervised and the baseline models Linear and Kernel SVM. tfidf indicates tf-idf features while emb denotes features obtained from the pre-trained word embeddings.

    Models                Precision  Recall  F1
    H-LSTMs+supervised    0.8892     0.7395  0.8075
    H-LSTMs               0.8934     0.7054  0.7883
    Linear SVM
      +tfidf              0.8864     0.6978  0.7809
      +emb                0.8112     0.6918  0.7468
      +tfidf+emb          0.8695     0.7018  0.7767
    Kernel SVM
      +tfidf              0.8698     0.7038  0.7780
      +emb                0.8216     0.6501  0.7259
      +tfidf+emb          0.8564     0.6937  0.7665

Table 4: Performance of all models on the test data (data-full-header).

The first observation from the tables is that the effect of the word embedding features for the SVM models is quite mixed. They improve the SVM models with just tf-idf features significantly in the first experiment setting, while their effect is somewhat negative in the second experiment setting. Second, we see that the two versions of hierarchical LSTMs (i.e., H-LSTMs and H-LSTMs+supervised) outperform the baseline SVM models in both experiment settings. The performance improvement is significant, with large margins (up to 2.7% improvement in absolute F1 score) in the second experiment setting (i.e., Table 4). The main gain is due to recall, demonstrating the generalization advantages of the proposed deep learning models over the traditional SVM methods for phishing detection. Comparing H-LSTMs+supervised and H-LSTMs, we see that H-LSTMs+supervised is consistently better than H-LSTMs, with a significant improvement in the second setting. This shows the benefits of supervised attention for hierarchical LSTM models for email phishing detection. Finally, we see that the performance in the first setting is in general much better than that in the second setting. We attribute this to the fact that the text data in data-no-header and data-full-header is quite different, leading to a mismatch between the data distributions of the training data and the test data in the second experiment setting.

In the final submission for the first shared task (i.e., without email headers), we combine the training data from data-no-header with the extracted bodies (along with the corresponding labels) from the training data of data-full-header to generate a new training set. As H-LSTMs+supervised is the best model in this development experiment, we train it on the new training set and use the trained model to make predictions for the actual test set of the first shared task.

Data with full header

In this experiment, we aim to evaluate whether the header network can help to improve the performance of H-LSTMs. We take the training dataset from data-full-header and perform a 5-fold cross-validation evaluation. The performance of H-LSTMs with the header network included or excluded is shown in Table 5.

    Models                Precision  Recall  F1
    H-LSTMs (only body)   0.9732     0.9534  0.9631
    H-LSTMs + headers     0.9816     0.9596  0.9705

Table 5: Cross-validation performance of H-LSTMs using headers compared to the original version.

From the table, we see that the header network is helpful for H-LSTMs, as it improves the performance of H-LSTMs on the dataset with email headers (a 0.7% improvement in F1 score). In the final submission for the second shared task (i.e., with email headers), we simply train our best model in this setting (i.e., H-LSTMs+supervised) on the training dataset of data-full-header.

The time for the training and test processes of the proposed (and submitted) models is shown in Table 6. Note that the training time of H-LSTMs+supervised (for the first shared task) is longer than that of H-LSTMs+headers+supervised (for the second shared task), since the training data of the former model includes both the original training data of the first task and the extracted bodies from the training data of the second task. The test data of the first shared task with H-LSTMs+supervised is also larger than that of the second shared task with H-LSTMs+headers+supervised.

    Models                        Training Time  Test Time
    H-LSTMs+supervised            3.7 hours      4 minutes
    H-LSTMs+headers+supervised    1.5 hours      1 minute

Table 6: Training and test times of the submitted models. The experiments are run on a single NVIDIA Tesla K80 GPU.

Comparison with the participating systems on the actual test sets

Tables 7 and 8 show the best performance on the actual test data of all the teams that participated in the shared tasks. Table 7 reports the performance for the first shared task (i.e., without email headers) while Table 8 presents the performance for the second shared task (i.e., with email headers). These performances were measured and released by the organizers. The performance of the systems we submitted is shown in the rows with our team name (i.e., TripleN).

    Teams                  Precision  Recall  F1
    TripleN (our team)     0.981      0.978   0.979
    Security-CEN@Amrita    0.962      0.989   0.975
    Amrita-NLP             0.972      0.974   0.973
    CEN-DeepSpam           0.951      0.964   0.958
    CENSec@Amrita          0.914      0.998   0.954
    CEN-SecureNLP          0.890      1.000   0.942
    CEN-AISecurity         0.936      0.910   0.923
    Crypt Coyotes          0.936      0.910   0.923

Table 7: The best performance of all the participating teams in the first shared task with no email headers.

    Teams                  Precision  Recall  F1
    Amrita-NLP             0.998      0.994   0.996
    TripleN (our team)     0.990      0.992   0.991
    CEN-DeepSpam           1.000      0.978   0.989
    Security-CEN@Amrita    0.998      0.976   0.987
    CENSec@Amrita          0.882      1.000   0.937
    CEN-AISecurity         0.957      0.900   0.928
    CEN-SecureNLP          0.880      0.971   0.924
    Crypt Coyotes          0.960      0.863   0.909

Table 8: The best performance of all the participating teams in the second shared task with email headers.

As we can see from the tables, our systems achieve the best performance for the first shared task and the second best performance for the second shared task. These results are very promising and demonstrate the advantages of the proposed methods in particular, and of deep learning in general, for the problem of email phishing recognition.

5 Conclusions

We present a deep learning model to detect phishing emails. Our model employs hierarchical attentive LSTMs to model the email bodies at both the word level and the sentence level. A header network with attentive LSTMs is also incorporated to model the headers of the emails. In the models, we propose a novel supervised attention technique to improve the performance using the email frequency ranking of the words in the vocabulary. Several experiments are conducted to demonstrate the benefits of the proposed deep learning models.

References

[ANNWN07] Saeed Abu-Nimeh, Dario Nappa, Xinlei Wang, and Suku Nair. A comparison of machine learning techniques for phishing detection. In Proceedings of the Anti-Phishing Working Group's 2nd Annual eCrime Researchers Summit, pages 60-69. ACM, 2007.

[BCB14] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[BCP+08] Andre Bergholz, Jeong Ho Chang, Gerhard Paass, Frank Reichartz, and Siehyun Strobel. Improved phishing detection using model-based features. In CEAS, 2008.

[BMS08] Ram Basnet, Srinivas Mukkamala, and Andrew H Sung. Detection of phishing attacks: A machine learning approach. In Soft Computing Applications in Industry, pages 373-383. Springer, 2008.

[CNU06] Madhusudhanan Chandrasekaran, Krishnan Narayanan, and Shambhu Upadhyaya. Phishing email detection based on structural properties. In NYS Cyber Security Conference, volume 3, 2006.

[DTH06] Rachna Dhamija, J Doug Tygar, and Marti Hearst. Why phishing works. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 581-590. ACM, 2006.

[EDB+18] Ayman Elaassal, Avisha Das, Shahryar Baki, Luis De Moraes, and Rakesh Verma. IWSPA-AP: Anti-phishing shared task at ACM International Workshop on Security and Privacy Analytics. In Proceedings of the 1st IWSPA Anti-Phishing Shared Task. CEUR, 2018.

[FST07] Ian Fette, Norman Sadeh, and Anthony Tomasic. Learning to detect phishing emails. pages 649-656. ACM, 2007.

[GBB11] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 513-520, 2011.

[GS05] Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602-610, 2005.

[GTJA17] BB Gupta, Aakanksha Tewari, Ankit Kumar Jain, and Dharma P Agrawal. Fighting against phishing attacks: state of the art and future challenges. Neural Computing and Applications, 28(12):3629-3654, 2017.

[HA11] Isredza Rahmi A Hamid and Jemal Abawajy. Hybrid feature selection for phishing email detection. In International Conference on Algorithms and Architectures for Parallel Processing, pages 266-275. Springer, 2011.

[HS97] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.

[JJJM07] Tom N Jagatic, Nathaniel A Johnson, Markus Jakobsson, and Filippo Menczer. Social phishing. Communications of the ACM, 50(10):94-100, 2007.

[KB14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.

[LXLZ15] Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. Recurrent convolutional neural networks for text classification. In AAAI, volume 333, pages 2267-2273, 2015.

[MAS+11] Anutthamaa Martin, Na Anutthamaa, M Sathyavathy, Marie Manjari Saint Francois, Dr V Prasanna Venkatesan, et al. A framework for predicting phishing websites using neural networks. arXiv preprint arXiv:1109.1074, 2011.

[MSC+13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013.

[MTM14] Rami M Mohammad, Fadi Thabtah, and Lee McCluskey. Predicting phishing websites based on self-structuring neural network. Neural Computing and Applications, 25(2):443-458, 2014.

[MWI16] Haitao Mi, Zhiguo Wang, and Abe Ittycheriah. Supervised attentions for neural machine translation. arXiv preprint arXiv:1608.00112, 2016.

[NG15a] Thien Huu Nguyen and Ralph Grishman. Event detection and domain adaptation with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Compu-
tational Linguistics and the 7th Interna- tional Joint Conference on Natural Lan- [LCLZ17] Shulin Liu, Yubo Chen, Kang Liu, and guage Processing, pages 365–371, 2015. Jun Zhao. Exploiting argument informa- tion to improve event detection via su- [NG15b] Thien Huu Nguyen and Ralph Grish- pervised attention mechanisms. In Pro- man. Relation extraction: Perspec- ceedings of the 55th Annual Meeting of tive from convolutional neural networks. the Association for Computational Lin- In Proceedings of the 1st Workshop on guistics (Volume 1: Long Papers), vol- Vector Space Modeling for Natural Lan- ume 1, pages 1789–1798, 2017. guage Processing, pages 39–48, 2015. [LUF+ 16] Lemao Liu, LemLiu, Masao Utiyama, [NG16] Thien Huu Nguyen and Ralph Grish- Andrew Finch, ao Sumita, Masao man. Modeling skip-grams for event de- Utiyama, Andrew Finch, and Eiichiro tection with convolutional neural net- Sumita. Neural machine transla- works. In Proceedings of the 2016 Con- tion with supervised attention. arXiv ference on Empirical Methods in Natural preprint arXiv:1609.04186, 2016. Language Processing, 2016. [LV12] V Santhana Lakshmi and MS Vijaya. [NPG15] Thien Huu Nguyen, Barbara Plank, and Efficient prediction of phishing websites Ralph Grishman. Semantic represen- using supervised learning algorithms. tations for domain adaptation: A case Procedia Engineering, 30:798–805, 2012. study on the tree kernel-based method for relation extraction. In Proceedings of [SNM15] Hossein Siadati, Toan Nguyen, and the 53rd Annual Meeting of the Associ- Nasir Memon. Verification code for- ation for Computational Linguistics and warding attack (short paper). In In- the 7th International Joint Conference ternational Conference on Passwords, on Natural Language Processing, 2015. pages 65–71. Springer, 2015. [phi17a] 2017 data breach report finds phish- [SNM17] Hossein Siadati, Toan Nguyen, and ing, email attacks still potent. In Nasir Memon. 
X-platform phishing: https://digitalguardian.com/blog/2017- Abusing trust for targeted attacks short data-breach-report-finds-phishing-email- paper. In International Conference on attacks-still-potent, 2017. Financial Cryptography and Data Secu- rity, pages 587–596. Springer, 2017. [phi17b] Phishing scams cost american businesses half a billion dollars a year. In Forbes: [YYD+ 16] Zichao Yang, Diyi Yang, Chris Dyer, Phishing Scams Cost American Busi- Xiaodong He, Alex Smola, and Eduard nesses Half a Billion Dollars a Year, Hovy. Hierarchical attention networks 2017. for document classification. In Pro- ceedings of the 2016 Conference of the [pod16] How john podesta’s emails were hacked. North American Chapter of the Asso- In Forbes: How John Podestas Emails ciation for Computational Linguistics: Were Hacked and How to Prevent it Human Language Technologies, pages from Happening to You, 2016. 1480–1489, 2016. [PVG+ 11] F. Pedregosa, G. Varoquaux, A. Gram- [ZY12] Ningxia Zhang and Yongqing Yuan. fort, V. Michel, B. Thirion, O. Grisel, Phishing detection using neural net- M. Blondel, P. Prettenhofer, R. Weiss, work. CS229 lecture notes, 2012. V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Per- rot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825– 2830, 2011. [ŘS10] Radim Řehůřek and Petr Sojka. Soft- ware Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Chal- lenges for NLP Frameworks, 2010. [SHK+ 14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Rus- lan Salakhutdinov, and Yoshua Bengio. Dropout: A simple way to prevent neu- ral networks from overfitting. In The Journal of Machine Learning Research, 2014. [Sin05] David Singer. Identification of spoofed email. Google Patents, August 25 2005. US Patent App. 10/754,220. [SNG+ 17] Hossein Siadati, Toan Nguyen, Payas Gupta, Markus Jakobsson, and Nasir Memon. 
Mind your smses: Mitigat- ing social engineering in second factor authentication. Computers & Security, 65:14–28, 2017.
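As a supplementary note on the evaluation metrics: the F1 scores in Tables 7 and 8 are the harmonic mean of the reported precision and recall, rounded to three decimals. The sketch below is only an illustrative sanity check (the helper `f1` and the row selection are ours, not part of the shared-task evaluation code):

```python
# Verify that the F1 column in Tables 7 and 8 matches the
# harmonic mean of the reported precision and recall.

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (team/task, precision, recall, reported F1) for a few table rows
rows = [
    ("TripleN, task 1",             0.981, 0.978, 0.979),
    ("Security-CEN@Amrita, task 1", 0.962, 0.989, 0.975),
    ("Amrita-NLP, task 2",          0.998, 0.994, 0.996),
    ("TripleN, task 2",             0.990, 0.992, 0.991),
]

for name, p, r, reported in rows:
    # Scores in the tables are reported to three decimal places.
    assert round(f1(p, r), 3) == reported, name
```

Running the check confirms that the reported F1 values are consistent with their precision/recall pairs.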