A Deep Learning Model with Hierarchical LSTMs and Supervised Attention for Anti-Phishing

Minh Nguyen
Hanoi University of Science and Technology
Hanoi, Vietnam
minh.nv142950@sis.hust.edu.vn

Toan Nguyen
New York University
Brooklyn, New York, USA
toan.v.nguyen@nyu.edu

Thien Huu Nguyen
University of Oregon
Eugene, Oregon, USA
thien@cs.uoregon.edu

Abstract

Anti-phishing aims to detect phishing content/documents in a pool of textual data. This is an important problem in cybersecurity that can help to guard users from fraudulent information. Natural language processing (NLP) offers a natural solution for this problem, as it is capable of analyzing the textual content to perform intelligent recognition. In this work, we investigate state-of-the-art techniques for text categorization in NLP to address the problem of anti-phishing for emails (i.e., predicting whether an email is phishing or not). These techniques are based on deep learning models that have recently attracted much attention from the community. In particular, we present a framework with hierarchical long short-term memory networks (H-LSTMs) and attention mechanisms to model the emails simultaneously at the word and the sentence level. Our expectation is to produce an effective model for anti-phishing and to demonstrate the effectiveness of deep learning for problems in cybersecurity.

Copyright © by the paper's authors. Copying permitted for private and academic purposes.
In: R. Verma, A. Das (eds.): Proceedings of the 1st Anti-Phishing Shared Pilot at the 4th ACM International Workshop on Security and Privacy Analytics (IWSPA 2018), Tempe, Arizona, USA, 21-03-2018, published at http://ceur-ws.org

1 Introduction

Despite being one of the oldest tactics, email phishing remains the most common attack used by cybercriminals [phi17a] due to its effectiveness. Phishing attacks exploit users' inability to distinguish legitimate information from fake information sent to them [DTH06, SNM15, SNM17, SNG+17]. In an email phishing campaign, attackers send emails appearing to be from well-known enterprises or organizations directly to their victims, or they use spoofed emails [Sin05]. These emails try to lure victims into divulging their private information [JJJM07, SNM15, SNG+17] or visiting an impersonated site (e.g., a fake banking website), on which they will be asked for passwords, credit card numbers or other sensitive information. The recent hack of a high-profile US politician (usually referred to as "John Podesta's hack") is a famous example of this type of attack. It all started with a spoofed email sent to the victim asking him to reset his Gmail password by clicking on a link in the email [pod16]. The technique of email phishing may seem simple, yet the damage it causes is huge. In the US alone, the estimated cost of phishing emails to business is half a billion dollars per year [phi17b].

Numerous methods have been proposed to automatically detect phishing emails [BCP+08, FST07, ANNWN07, GTJA17]. Chandrasekaran et al. proposed to use structural properties of emails and Support Vector Machines (SVM) to classify phishing emails [CNU06]. In [ANNWN07], Abu-Nimeh et al. evaluated six machine learning classifiers on a public phishing email dataset using 43 proposed features. Gupta et al. [GTJA17] presented a survey of recent state-of-the-art research on phishing detection. However, these methods mainly rely on feature engineering efforts to generate characteristics (features) to represent emails, over which machine learning methods can be applied to perform the task. Such feature engineering is often done manually and still requires much labor and domain expertise. This has hindered the portability of the systems to new domains and limited the performance of the current systems.

In order to overcome this problem, our work focuses on deep learning techniques to solve the problem of phishing email detection. The major benefit of deep learning is its ability to automatically induce effective and task-specific representations from data that can be used as features to recognize phishing emails. As deep learning has been shown to achieve state-of-the-art performance for many natural language processing tasks, including text categorization [GBB11, LXLZ15], information extraction [NG15b, NG15a, NG16] and machine translation [BCB14], among others, we expect that it would also help to build effective systems for phishing email detection.

We present a new deep learning model to solve the problem of email phishing prediction using hierarchical long short-term memory networks (H-LSTMs) augmented with a supervised attention technique. In the hierarchical LSTM model [YYD+16], emails are treated as hierarchical architectures with words at the lower level (the word level) and sentences at the upper level (the sentence level). LSTM models are first applied at the word level, and their results are passed to LSTM models at the sentence level to generate a representation vector for the entire email. The outputs of the LSTM models at the two levels are combined using the attention mechanism [BCB14], which assigns contribution weights to the words and sentences in the emails. A header network is also integrated to model the headers of the emails when they are available. In addition, we propose a novel technique to supervise the attention mechanism [MWI16, LUF+16, LCLZ17] at the word level of the hierarchical LSTMs based on the appearance rank of the words in the vocabulary. Experiments on the datasets for phishing email detection in the First Security and Privacy Analytics Anti-Phishing Shared Task (IWSPA-AP 2018) [EDB+18] demonstrate the benefits of the proposed models, which are ranked among the top positions of the participating systems of the shared task (in terms of performance on the unseen test data).

2 Related Work

Phishing email detection is a classic problem; however, research on this topic often shares the same limitation: there is no official, large dataset for it. Most previous works typically used a public set consisting of legitimate or "ham" emails¹ and another public set of phishing emails² for their classification evaluation [FST07, BCP+08, BMS08, HA11, ZY12]. Other works used private but small datasets [CNU06, ANNWN07]. In addition, the ratio between phishing and legitimate emails in these datasets was typically balanced. This is not the case in the real-world scenario, where the number of legitimate emails is much larger than that of phishing emails. Our current work relies on larger datasets with unbalanced distributions of phishing and legitimate emails collected for the First Security and Privacy Analytics Anti-Phishing Shared Task (IWSPA-AP 2018) [EDB+18].

Besides the limitation of small datasets, previous work has relied extensively on feature engineering to manually find representative features for the problem. Apart from features extracted from the emails themselves, [LV12] also uses a blacklist of phishing websites to obtain an additional feature for the URLs appearing in emails. Some neural network systems have also been introduced to detect such blacklists [MTM14, MAS+11]. This is undesirable because these engineered features need to be updated once new types of phishing emails with new content appear. Our work differs from the previous work in this area in that we automate the feature engineering process with a deep learning model. This allows us to automatically learn effective features for phishing email detection from data. Deep learning has recently been employed for feature extraction with success on many natural language processing problems [NG15b, NG15a].

¹ https://spamassassin.apache.org/old/publiccorpus
² https://monkey.org/~jose/phishing

3 Proposed Model

Phishing email detection is a binary classification problem that can be formalized as follows. Let e = {b, s} be an email in which b and s are the body content and the header of the email respectively. Let y be the binary variable indicating whether e is a phishing email or not (y = 1 if e is a phishing email and y = 0 otherwise). In order to predict the legitimacy of the email, our goal is to estimate the probability P(y = 1|e) = P(y = 1|b, s). In the following, we describe our methods to model the body b and the header s with the body network and the header network respectively to achieve this goal.

3.1 Body Network with Hierarchical LSTMs

For the body b, we view it as a sequence of sentences b = (u_1, u_2, ..., u_L), where u_i is the i-th sentence and L is the number of sentences in the email body b. Each sentence u_i is in turn a sequence of words/tokens u_i = (v_{i,1}, v_{i,2}, ..., v_{i,K}), with v_{i,j} the j-th token in u_i and K the length of the sentence. Note that we set L and K to fixed values by padding the sentences u_i and the body b with dummy symbols.

As there are two levels of information in b (i.e., the word level with the words v_{i,j} and the sentence level with the sentences u_i), we consider a hierarchical network that involves two layers of bidirectional long short-term memory networks (LSTMs) to model such information. In particular, the first layer consumes the words in the sentences via an embedding module, a bidirectional LSTM module and an attention module to obtain representation vectors for every sentence u_i in b (the word level layer). Afterward, the second network layer combines the representation vectors from the first layer with another bidirectional LSTM and attention module, leading to a representation vector for the whole email body b (the sentence level layer). This body representation vector is then used as features to estimate P(y|b, s) and make the prediction for the initial email e.

Figure 1: Hierarchical LSTMs.

3.1.1 The Word Level Layer

Embedding

In the word level layer, every word v_{i,j} in each sentence u_i in b is first transformed into its embedding vector w_{i,j}. In this paper, w_{i,j} is retrieved by taking the corresponding column vector in the word embedding matrix W_e [MSC+13] that has been pre-trained on a large corpus: w_{i,j} = W_e[v_{i,j}] (each column in the matrix W_e corresponds to a word in the vocabulary). As the result of this embedding step, every sentence u_i = (v_{i,1}, v_{i,2}, ..., v_{i,K}) in b is converted into a sequence of vectors (w_{i,1}, w_{i,2}, ..., w_{i,K}), constituting the input for the bidirectional LSTM model in the next step.

Bidirectional LSTMs for the word level

This module employs two LSTMs [HS97, GS05] that run over each input vector sequence (w_{i,1}, w_{i,2}, ..., w_{i,K}) in two different directions: forward (from w_{i,1} to w_{i,K}) and backward (from w_{i,K} to w_{i,1}). Along their operations, the forward LSTM generates the forward hidden vector sequence (→h_{i,1}, →h_{i,2}, ..., →h_{i,K}) while the backward LSTM produces the backward hidden vector sequence (←h_{i,1}, ←h_{i,2}, ..., ←h_{i,K}). These two hidden vector sequences are then concatenated at each position, resulting in the new hidden vector sequence (h_{i,1}, h_{i,2}, ..., h_{i,K}) for the sentence u_i in b, where h_{i,j} = [→h_{i,j}, ←h_{i,j}]. The notable characteristic of the hidden vector h_{i,j} is that it encodes the context information of the whole sentence u_i due to the recurrent property of the forward and backward LSTMs, although a greater focus is put on the current word v_{i,j}.

Attention

In this module, the vectors in the hidden vector sequence (h_{i,1}, h_{i,2}, ..., h_{i,K}) are combined to generate a single representation vector for the initial sentence u_i. The attention mechanism [YYD+16] does this by computing a weighted sum of the vectors in the sequence. Each hidden vector h_{i,j} is assigned a weight α_{i,j} that estimates its importance/contribution to the representation vector of u_i for the phishing prediction of the email e. In this work, the weight α_{i,j} for h_{i,j} is computed by:

    α_{i,j} = exp(a_{i,j}^T w_a) / Σ_{j'} exp(a_{i,j'}^T w_a)    (1)

in which

    a_{i,j} = tanh(W_att h_{i,j} + b_att)    (2)

Here, W_att, b_att and w_a are model parameters learnt during the training process. Consequently, the representation vector û_i for the sentence u_i in b is:

    û_i = Σ_j α_{i,j} h_{i,j}    (3)

After the word level layer completes its operation on every sentence of b = (u_1, u_2, ..., u_L), we obtain a corresponding sequence of sentence representation vectors (û_1, û_2, ..., û_L). This vector sequence is combined in the next sentence level layer to generate a single vector representing b for phishing prediction.

3.1.2 The Sentence Level Layer

The sentence level layer processes the vector sequence (û_1, û_2, ..., û_L) in the same way that the word level layer processes the vector sequence (w_{i,1}, w_{i,2}, ..., w_{i,K}) for each sentence u_i. Specifically, (û_1, û_2, ..., û_L) is also first fed into a bidirectional LSTM module (i.e., with a forward and a backward LSTM) whose results are concatenated at each position to produce the corresponding hidden vector sequence (ĥ_1, ĥ_2, ..., ĥ_L). In the next step, with the attention module, the vectors in (ĥ_1, ĥ_2, ..., ĥ_L) are weighted and summed to finally generate the representation vector r_b for the email body b of e.
Assuming the attention weights for (ĥ_1, ĥ_2, ..., ĥ_L) are (β_1, β_2, ..., β_L) respectively, the body vector r_b is then computed by:

    r_b = Σ_i β_i ĥ_i    (4)

Note that the model parameters of the bidirectional LSTM modules (and the attention modules) in the word level layer and the sentence level layer are separate, and they are both learnt in a single training process. Figure 1 shows an overview of the body network with hierarchical LSTMs and attention.

Once the body vector r_b has been computed, we can use it as features to estimate the phishing probability via:

    P(y = 1|b, s) = σ(W_out r_b + b_out)    (5)

where W_out and b_out are model parameters and σ is the logistic function.

3.2 Header Network

The probability estimation in Equation 5 does not consider the headers of the emails. For the email datasets with headers available, we can model the headers with a separate network and use the resulting representation as additional features to estimate the phishing probability. In this work, we consider the header s of the initial email e as a sequence of words/tokens (x_1, x_2, ..., x_H), where x_i is the i-th word in the header and H is the length of the header. In order to compute the representation vector r_s for s, we employ the same network architecture as the word level layer in the body network, with separate modules for embedding, bidirectional LSTM and attention (i.e., Section 3.1.1). An overview of this header network is presented in Figure 2.

Figure 2: Hierarchical LSTMs with header network.

Once the header representation vector r_s is generated, we concatenate it with the body representation vector r_b obtained from the body network, leading to the final representation vector r = [r_b, r_s] to compute the probability P(y = 1|b, s) = σ(W_sub r + b_sub) (W_sub and b_sub are model parameters).

In order to train the models in this work, we minimize the negative log-likelihood of the models on a training dataset, in which the negative log-likelihood for the email e is computed by:

    L_c = −log(P(y = 1|e))    (6)

The model we have described so far is called H-LSTMs for convenience.

3.3 Supervised Attention

The attention mechanism in the body and header networks is expected to assign high weights to the informative words/sentences and to downgrade the irrelevant words/sentences for phishing detection in the emails. However, this ideal operation can only be achieved when an enormous training dataset is provided to train the models. In our case of phishing email detection, the training dataset is not large enough, and we might not be able to exploit the full advantages of the attention. In this work, we seek useful heuristics for the problem and inject them into the models to facilitate the operation of the attention mechanism. In particular, we first heuristically decide a score for every word in the sentences so that the words with higher scores are considered more important for phishing detection than those with lower scores. Afterward, the models are encouraged to produce attention weights that are close to these heuristic importance scores. The expectation is that this mechanism introduces our intuition into the attention weights to compensate for the small scale of the training dataset, potentially leading to a better performance of the models. Assuming the importance scores for the words in the sentence (v_{i,1}, v_{i,2}, ..., v_{i,K}) are (g_{i,1}, g_{i,2}, ..., g_{i,K}) respectively, we force the attention weights (α_{i,1}, α_{i,2}, ..., α_{i,K}) (Equation 1) to be close to the importance scores by penalizing models that render a large squared difference between the attention weights and the importance scores. This amounts to adding the squared difference to the objective function in Equation 6:

    L_e = L_c + λ Σ_{i,j} (g_{i,j} − α_{i,j})²    (7)

where λ is a trade-off constant.

Importance Score Computation

In order to compute the importance scores, our intuition is that a word is important for phishing detection if it appears frequently in phishing emails and less frequently in legitimate emails. The fact that an important word does not appear in many legitimate emails helps to eliminate the common words that are used in most documents. Consequently, the frequent words that are specific to the phishing emails receive higher importance scores in our method. Note that our method to find the important words for phishing emails differs from the prior work, which has only considered the most frequent words in the phishing emails and ignored their appearance in the legitimate emails.

We compute the importance scores as follows. For every word v in the vocabulary, we count the number of phishing and legitimate emails in a training dataset that contain the word. We call the results the phishing email frequency and the legitimate email frequency respectively for v. In the next step, we sort the words in the vocabulary by their phishing and legitimate email frequencies in descending order. After that, a word v has a phishing rank (phishingRank(v)) and a legitimate rank (legitimateRank(v)) in the sorted word sequences based on the phishing and legitimate frequencies (the higher the rank, the lower the frequency). Given these ranks, the unnormalized importance score for v is computed by:³

    score[v] = legitimateRank[v] / phishingRank[v]    (8)

The rationale for this formula is that a word has a high importance score for phishing prediction if its legitimate rank is high and its phishing rank is low. Note that we use the ranks of the words instead of the frequencies because the frequencies are affected by the size of the training dataset, potentially making the scores unstable. The ranks are less affected by the dataset size and provide a more stable measure. Table 1 shows the top 20 words with the highest unnormalized importance scores in our vocabulary.

    Word        Score
    account     21.45
    your        15.00
    click       14.11
    mailbox      9.59
    cornell      9.58
    link         9.37
    verify       8.83
    customer     8.63
    access       8.50
    reserved     8.03
    dear         7.85
    log          7.70
    accounts     7.61
    paypal       7.52
    complete     7.37
    service      7.15
    protect      6.95
    secure       6.94
    mail         6.70
    clicking     6.63

Table 1: Top 20 words with the highest scores.

The H-LSTMs model augmented with the supervised attention mechanism above is called H-LSTMs+supervised in the experiments.

³ The actual importance scores of the words we use in Equation 7 are normalized for each sentence.

3.3.1 Training

We train the models in this work with stochastic gradient descent, shuffled mini-batches and the Adam update rule [KB14]. The gradients are computed via back-propagation, while dropout is used for regularization [SHK+14]. We also implement gradient clipping to rescale the Frobenius norms of the non-embedding weights if they exceed a predefined threshold.

4 Evaluation

4.1 Datasets and Preprocessing

The models in this work were developed to participate in the First Security and Privacy Analytics Anti-Phishing Shared Task (IWSPA-AP 2018) [EDB+18]. The organizers provide two datasets to train the models for email phishing recognition. The first dataset involves emails that only have the body part (called data-no-header) while the second dataset contains emails with both bodies and headers (called data-full-header). These two datasets translate into two shared tasks to be solved by the participants. The statistics of the training data for these two datasets are shown in Table 2.

    Datasets          #legit  #phish
    data-no-header     5092     629
    data-full-header   4082     503

Table 2: Statistics of the data-no-header and data-full-header datasets. #legit and #phish are the numbers of legitimate and phishing emails respectively.
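The rank-based importance scores of Equation (8) can be illustrated with a short self-contained sketch. The toy corpus, the helper name `importance_scores` and the tie-breaking rule (ties in frequency broken alphabetically) are our own illustrative choices, not details from the paper; a real implementation would count document frequencies over the shared-task training emails.

```python
from collections import Counter

def importance_scores(phish_emails, legit_emails):
    """Unnormalized importance scores per Eq. (8):
    score[v] = legitimateRank[v] / phishingRank[v].

    Document frequency = number of emails containing the word.
    Ranks are 1-based positions after sorting by frequency in
    descending order, so rank 1 is the most frequent word and a
    higher rank means a lower frequency.
    """
    phish_df, legit_df = Counter(), Counter()
    for email in phish_emails:
        phish_df.update(set(email.lower().split()))
    for email in legit_emails:
        legit_df.update(set(email.lower().split()))
    vocab = set(phish_df) | set(legit_df)

    def ranks(df):
        # Words absent from a corpus have frequency 0 and get the
        # worst ranks; frequency ties are broken alphabetically.
        ordered = sorted(vocab, key=lambda v: (-df[v], v))
        return {v: i + 1 for i, v in enumerate(ordered)}

    phish_rank, legit_rank = ranks(phish_df), ranks(legit_df)
    return {v: legit_rank[v] / phish_rank[v] for v in vocab}

phish = ["verify your account now", "click to verify your mailbox"]
legit = ["meeting notes attached", "lunch at noon today"]
scores = importance_scores(phish, legit)
```

On this toy data, phishing-specific words such as "verify" (frequent in phishing emails, absent from legitimate ones) obtain much higher scores than words like "meeting" that only occur in legitimate emails, matching the intuition behind Equation (8).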
The raw test data (i.e., without labels) for these datasets are released to the participants at a specified time. The participants then have one week to run their systems on the raw test data and submit the results to the organizers for evaluation.

Regarding the preprocessing procedure for the datasets, we notice that a large part of the text in the email bodies is quite unstructured. The sentences are often short and/or not clearly separated by sentence-ending symbols (i.e., {. ! ?}). In order to split the bodies of the emails into sentences for our models, we developed an in-house sentence splitter specially designed for the datasets. In particular, we determine the beginning of a sentence by considering whether the first word of a new line is capitalized, or whether a capitalized word immediately follows a sentence-ending symbol. Sentences whose lengths (numbers of words) are less than 3 are combined to create a longer sentence. This reduces the number of sentences significantly and expands the context for the words in the sentences as they are processed by the models. Figure 3 shows a phishing email from the datasets.

Figure 3: A case in which splitting the body into sentences cannot be done as usual. (Phishing email: 28.txt in data-no-header).

4.2 Baselines

In order to see how well the proposed deep learning models (i.e., H-LSTMs and H-LSTMs+supervised) perform with respect to the traditional methods for email phishing detection, we compare the proposed models with a baseline model based on Support Vector Machines (SVM) [CNU06]. We use the tf-idf scores of the words in the vocabulary as the features for this baseline [CNU06]. Note that since the email addresses and URLs in the provided datasets have been mostly hidden to protect personal information, we cannot use them as features in our SVM baselines as the previous systems do. In addition, we examine the performance of this baseline when the pre-trained word embeddings are included in its feature set. This allows a fairer comparison of SVM with the deep learning models in this work, which take pre-trained word embeddings as input.

We employ the implementation of linear and nonlinear (kernel) SVM from the sklearn library [PVG+11], for which the tf-idf representations of the emails are obtained via the gensim toolkit [ŘS10]. The word embedding features are computed by taking the mean vector of the pre-trained embeddings of the words in the emails [NPG15].

4.3 Hyper-parameter Selection

As the size of the provided datasets is small and no development data is included, we use 5-fold stratified cross-validation on the training data of the provided datasets to search for the best hyper-parameters for the models. The hyper-parameters we found are as follows. The size of the word embedding vectors is 300, while the cell sizes are set to 60 for all the LSTMs in the body and header networks. The sizes of the attention vectors at the attention modules for the body and header networks are also set to 60. The λ coefficient for supervised attention is set to 0.1, the threshold for gradient clipping is 0.3 and the drop rate for dropout is 0.5. For the Adam update rule, we use a learning rate of 0.0025. Finally, we set C = 10.0 for the linear SVM baseline. The nonlinear version of SVM we use is C-SVC with the radial basis function kernel and (C, γ) = (50.0, 0.1).

4.4 Results

In the experiments below, we employ precision, recall and F1 score to evaluate the performance of the models for detecting phishing emails. In addition, the proposed models H-LSTMs and H-LSTMs+supervised only utilize the header network in the evaluation on data-full-header.

Data without header

In the first experiment, we focus on the first shared task, where email headers are not considered. We compare the proposed deep learning models with the SVM baselines. In particular, in the first setting, we use data-no-header as the training data and perform a 5-fold stratified cross-validation to evaluate the models. In the second setting, data-no-header is also utilized as the training data, but the bodies extracted from data-full-header (along with the corresponding labels) are employed as the test data. The results of the first setting are shown in Table 3 while the results of the second setting are presented in Table 4. Note that we report the performance of the SVM baselines with different combinations of the two types of features (i.e., tf-idf and word embeddings) in these tables.

    Models                Precision  Recall  F1
    H-LSTMs+supervised    0.9784     0.9466  0.9621
    H-LSTMs               0.9638     0.9448  0.9542
    Linear SVM
      +tfidf              0.9824     0.8856  0.9313
      +emb                0.9529     0.9206  0.9364
      +tfidf+emb          0.9837     0.9253  0.9536
    Kernel SVM
      +tfidf              0.9684     0.8730  0.9180
      +emb                0.9408     0.9141  0.9273
      +tfidf+emb          0.9714     0.9174  0.9436

Table 3: Performance comparison between the proposed models H-LSTMs and H-LSTMs+supervised and the baseline models Linear and Kernel SVM. tfidf indicates tf-idf features while emb denotes features obtained from the pre-trained word embeddings.

    Models                Precision  Recall  F1
    H-LSTMs+supervised    0.8892     0.7395  0.8075
    H-LSTMs               0.8934     0.7054  0.7883
    Linear SVM
      +tfidf              0.8864     0.6978  0.7809
      +emb                0.8112     0.6918  0.7468
      +tfidf+emb          0.8695     0.7018  0.7767
    Kernel SVM
      +tfidf              0.8698     0.7038  0.7780
      +emb                0.8216     0.6501  0.7259
      +tfidf+emb          0.8564     0.6937  0.7665

Table 4: Performance of all models on the test data (data-full-header).

The first observation from the tables is that the effect of the word embedding features for the SVM models is quite mixed. They improve the SVM models with just tf-idf features significantly in the first experiment setting, while their effect is somewhat negative in the second experiment setting. Second, we see that the two versions of hierarchical LSTMs (i.e., H-LSTMs and H-LSTMs+supervised) outperform the baseline SVM models in both experiment settings. The performance improvement is significant, with large margins (up to 2.7% improvement in absolute F1 score) in the second experiment setting (i.e., Table 4). The main gain is due to recall, demonstrating the generalization advantages of the proposed deep learning models over the traditional SVM methods for phishing detection. Comparing H-LSTMs+supervised and H-LSTMs, we see that H-LSTMs+supervised is consistently better than H-LSTMs, with a significant improvement in the second setting. This shows the benefits of supervised attention for hierarchical LSTM models for email phishing detection. Finally, we see that the performance in the first setting is in general much better than that in the second setting. We attribute this to the fact that the text data in data-no-header and data-full-header is quite different, leading to a mismatch between the data distributions of the training data and the test data in the second experiment setting.

In the final submission for the first shared task (i.e., without email headers), we combine the training data from data-no-header with the extracted bodies (along with the corresponding labels) from the training data of data-full-header to generate a new training set. As H-LSTMs+supervised is the best model in this development experiment, we train it on the new training set and use the trained model to make predictions for the actual test set of the first shared task.

Data with full header

In this experiment, we aim to evaluate whether the header network can help to improve the performance of H-LSTMs. We take the training dataset from data-full-header and perform a 5-fold cross-validation evaluation. The performance of H-LSTMs with the header network included or excluded is shown in Table 5.

    Models                Precision  Recall  F1
    H-LSTMs (only body)   0.9732     0.9534  0.9631
    H-LSTMs + headers     0.9816     0.9596  0.9705

Table 5: Cross-validation performance of H-LSTMs using headers compared to the original version.

From the table, we see that the header network is helpful for H-LSTMs, as it improves the performance of H-LSTMs on the dataset with email headers (a 0.7% improvement in F1 score). In the final submission for the second shared task (i.e., with email headers), we simply train our best model in this setting (i.e., H-LSTMs+supervised) on the training dataset of data-full-header.

The time for the training and test processes of the proposed (and submitted) models is shown in Table 6. Note that the training time of H-LSTMs+supervised (for the first shared task) is longer than that of H-LSTMs+headers+supervised (for the second shared task), since the training data of the former model includes both the original training data of the first task and the extracted bodies from the training data of the second task. The test data of the first shared task with H-LSTMs+supervised is also larger than that of the second shared task with H-LSTMs+headers+supervised.

    Models                        Training Time  Test Time
    H-LSTMs+supervised            3.7 hours      4 minutes
    H-LSTMs+headers+supervised    1.5 hours      1 minute

Table 6: Training and test times of the submitted models. The experiments are run on a single NVIDIA Tesla K80 GPU.

Comparison with the participating systems on the actual test sets

Tables 7 and 8 show the best performance on the actual test data of all the teams that participated in the shared tasks. Table 7 reports the performance for the first shared task (i.e., without email headers) while Table 8 presents the performance for the second shared task (i.e., with email headers). These performances were measured and released by the organizers. The performance of the systems we submitted is shown in the rows with our team name (i.e., TripleN).

    Teams                  Precision  Recall  F1
    TripleN (our team)     0.981      0.978   0.979
    Security-CEN@Amrita    0.962      0.989   0.975
    Amrita-NLP             0.972      0.974   0.973
    CEN-DeepSpam           0.951      0.964   0.958
    CENSec@Amrita          0.914      0.998   0.954
    CEN-SecureNLP          0.890      1.000   0.942
    CEN-AISecurity         0.936      0.910   0.923
    Crypt Coyotes          0.936      0.910   0.923

Table 7: The best performance of all the participating teams in the first shared task with no email headers.

    Teams                  Precision  Recall  F1
    Amrita-NLP             0.998      0.994   0.996
    TripleN (our team)     0.990      0.992   0.991
    CEN-DeepSpam           1.000      0.978   0.989
    Security-CEN@Amrita    0.998      0.976   0.987
    CENSec@Amrita          0.882      1.000   0.937
    CEN-AISecurity         0.957      0.900   0.928
    CEN-SecureNLP          0.880      0.971   0.924
    Crypt Coyotes          0.960      0.863   0.909

Table 8: The best performance of all the participating teams in the second shared task with email headers.

As we can see from the tables, our systems achieve the best performance for the first shared task and the second best performance for the second shared task. These results are very promising and demonstrate the advantages of the proposed methods in particular, and of deep learning in general, for the problem of email phishing recognition.

5 Conclusions

We present a deep learning model to detect phishing emails. Our model employs hierarchical attentive LSTMs to model the email bodies at both the word level and the sentence level. A header network with attentive LSTMs is also incorporated to model the headers of the emails. In the models, we propose a novel supervised attention technique to improve the performance using the email frequency ranking of the words in the vocabulary. Several experiments are conducted to demonstrate the benefits of the proposed deep learning models.

References

[ANNWN07] Saeed Abu-Nimeh, Dario Nappa, Xinlei Wang, and Suku Nair. A comparison of machine learning techniques for phishing detection. In Proceedings of the Anti-Phishing Working Group's 2nd Annual eCrime Researchers Summit, pages 60-69. ACM, 2007.

[BCB14] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[BCP+08] Andre Bergholz, Jeong Ho Chang, Gerhard Paass, Frank Reichartz, and Siehyun Strobel. Improved phishing detection using model-based features. In CEAS, 2008.

[BMS08] Ram Basnet, Srinivas Mukkamala, and Andrew H Sung. Detection of phishing attacks: A machine learning approach. In Soft Computing Applications in Industry, pages 373-383. Springer, 2008.

[CNU06] Madhusudhanan Chandrasekaran, Krishnan Narayanan, and Shambhu Upadhyaya. Phishing email detection based on structural properties. In NYS Cyber Security Conference, volume 3, 2006.

[DTH06] Rachna Dhamija, J Doug Tygar, and Marti Hearst. Why phishing works. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 581-590. ACM, 2006.

[EDB+18] Ayman Elaassal, Avisha Das, Shahryar Baki, Luis De Moraes, and Rakesh Verma. IWSPA-AP: Anti-phishing shared task at ACM International Workshop on Security and Privacy Analytics. In Proceedings of the 1st IWSPA Anti-Phishing Shared Task. CEUR, 2018.

[FST07] Ian Fette, Norman Sadeh, and Anthony Tomasic. Learning to detect phishing emails. pages 649-656. ACM, 2007.

[GBB11] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 513-520, 2011.

[GS05] Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602-610, 2005.

[GTJA17] BB Gupta, Aakanksha Tewari, Ankit Kumar Jain, and Dharma P Agrawal. Fighting against phishing attacks: state of the art and future challenges. Neural Computing and Applications, 28(12):3629-3654, 2017.

[HA11] Isredza Rahmi A Hamid and Jemal Abawajy. Hybrid feature selection for phishing email detection. In International Conference on Algorithms and Architectures for Parallel Processing, pages 266-275. Springer, 2011.

[HS97] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.

[JJJM07] Tom N Jagatic, Nathaniel A Johnson, Markus Jakobsson, and Filippo Menczer. Social phishing. Communications of the ACM, 50(10):94-100, 2007.

[KB14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.

[LXLZ15] Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. Recurrent convolutional neural networks for text classification. In AAAI, volume 333, pages 2267-2273, 2015.

[MAS+11] Anutthamaa Martin, Na Anutthamaa, M Sathyavathy, Marie Manjari Saint Francois, Dr V Prasanna Venkatesan, et al. A framework for predicting phishing websites using neural networks. arXiv preprint arXiv:1109.1074, 2011.

[MSC+13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013.

[MTM14] Rami M Mohammad, Fadi Thabtah, and Lee McCluskey. Predicting phishing websites based on self-structuring neural network. Neural Computing and Applications, 25(2):443-458, 2014.

[MWI16] Haitao Mi, Zhiguo Wang, and Abe Ittycheriah. Supervised attentions for neural machine translation. arXiv preprint arXiv:1608.00112, 2016.

[NG15a] Thien Huu Nguyen and Ralph Grishman. Event detection and domain adaptation with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Compu-
tational Linguistics and the 7th Interna- tional Joint Conference on Natural Lan- [LCLZ17] Shulin Liu, Yubo Chen, Kang Liu, and guage Processing, pages 365–371, 2015. Jun Zhao. Exploiting argument informa- tion to improve event detection via su- [NG15b] Thien Huu Nguyen and Ralph Grish- pervised attention mechanisms. In Pro- man. Relation extraction: Perspec- ceedings of the 55th Annual Meeting of tive from convolutional neural networks. the Association for Computational Lin- In Proceedings of the 1st Workshop on guistics (Volume 1: Long Papers), vol- Vector Space Modeling for Natural Lan- ume 1, pages 1789–1798, 2017. guage Processing, pages 39–48, 2015. [LUF+ 16] Lemao Liu, LemLiu, Masao Utiyama, [NG16] Thien Huu Nguyen and Ralph Grish- Andrew Finch, ao Sumita, Masao man. Modeling skip-grams for event de- Utiyama, Andrew Finch, and Eiichiro tection with convolutional neural net- Sumita. Neural machine transla- works. In Proceedings of the 2016 Con- tion with supervised attention. arXiv ference on Empirical Methods in Natural preprint arXiv:1609.04186, 2016. Language Processing, 2016. [LV12] V Santhana Lakshmi and MS Vijaya. [NPG15] Thien Huu Nguyen, Barbara Plank, and Efficient prediction of phishing websites Ralph Grishman. Semantic represen- using supervised learning algorithms. tations for domain adaptation: A case Procedia Engineering, 30:798–805, 2012. study on the tree kernel-based method for relation extraction. In Proceedings of [SNM15] Hossein Siadati, Toan Nguyen, and the 53rd Annual Meeting of the Associ- Nasir Memon. Verification code for- ation for Computational Linguistics and warding attack (short paper). In In- the 7th International Joint Conference ternational Conference on Passwords, on Natural Language Processing, 2015. pages 65–71. Springer, 2015. [phi17a] 2017 data breach report finds phish- [SNM17] Hossein Siadati, Toan Nguyen, and ing, email attacks still potent. In Nasir Memon. 
X-platform phishing: https://digitalguardian.com/blog/2017- Abusing trust for targeted attacks short data-breach-report-finds-phishing-email- paper. In International Conference on attacks-still-potent, 2017. Financial Cryptography and Data Secu- rity, pages 587–596. Springer, 2017. [phi17b] Phishing scams cost american businesses half a billion dollars a year. In Forbes: [YYD+ 16] Zichao Yang, Diyi Yang, Chris Dyer, Phishing Scams Cost American Busi- Xiaodong He, Alex Smola, and Eduard nesses Half a Billion Dollars a Year, Hovy. Hierarchical attention networks 2017. for document classification. In Pro- ceedings of the 2016 Conference of the [pod16] How john podesta’s emails were hacked. North American Chapter of the Asso- In Forbes: How John Podestas Emails ciation for Computational Linguistics: Were Hacked and How to Prevent it Human Language Technologies, pages from Happening to You, 2016. 1480–1489, 2016. [PVG+ 11] F. Pedregosa, G. Varoquaux, A. Gram- [ZY12] Ningxia Zhang and Yongqing Yuan. fort, V. Michel, B. Thirion, O. Grisel, Phishing detection using neural net- M. Blondel, P. Prettenhofer, R. Weiss, work. CS229 lecture notes, 2012. V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Per- rot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825– 2830, 2011. [ŘS10] Radim Řehůřek and Petr Sojka. Soft- ware Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Chal- lenges for NLP Frameworks, 2010. [SHK+ 14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Rus- lan Salakhutdinov, and Yoshua Bengio. Dropout: A simple way to prevent neu- ral networks from overfitting. In The Journal of Machine Learning Research, 2014. [Sin05] David Singer. Identification of spoofed email. Google Patents, August 25 2005. US Patent App. 10/754,220. [SNG+ 17] Hossein Siadati, Toan Nguyen, Payas Gupta, Markus Jakobsson, and Nasir Memon. 
Mind your smses: Mitigat- ing social engineering in second factor authentication. Computers & Security, 65:14–28, 2017.
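As a supplementary note on the evaluation metrics: the F1 scores in Tables 7 and 8 are the harmonic mean of the reported precision and recall, rounded to three decimals. The sketch below is only an illustrative sanity check (the helper `f1` and the row selection are ours, not part of the shared-task evaluation code):

```python
# Verify that the F1 column in Tables 7 and 8 matches the
# harmonic mean of the reported precision and recall.

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (team/task, precision, recall, reported F1) for a few table rows
rows = [
    ("TripleN, task 1",             0.981, 0.978, 0.979),
    ("Security-CEN@Amrita, task 1", 0.962, 0.989, 0.975),
    ("Amrita-NLP, task 2",          0.998, 0.994, 0.996),
    ("TripleN, task 2",             0.990, 0.992, 0.991),
]

for name, p, r, reported in rows:
    # Scores in the tables are reported to three decimal places.
    assert round(f1(p, r), 3) == reported, name
```

Running the check confirms that the reported F1 values are consistent with their precision/recall pairs.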