<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>A Deep Learning Model with Hierarchical LSTMs and Supervised Attention for Anti-Phishing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Minh Nguyen</string-name>
          <email>minh.nv142950@sis.hust.edu.vn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Toan Nguyen</string-name>
          <email>toan.v.nguyen@nyu.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thien Huu Nguyen</string-name>
          <email>thien@cs.uoregon.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hanoi University of Science and Technology</institution>
          ,
          <addr-line>Hanoi</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>New York University</institution>
          ,
          <addr-line>Brooklyn, New York</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Oregon</institution>
          ,
          <addr-line>Eugene, Oregon</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>Anti-phishing aims to detect phishing content/documents in a pool of textual data. This is an important problem in cybersecurity that can help to guard users from fraudulent information. Natural language processing (NLP) offers a natural solution for this problem as it is capable of analyzing the textual content to perform intelligent recognition. In this work, we investigate the state-of-the-art techniques for text categorization in NLP to address the problem of anti-phishing for emails (i.e., predicting if an email is phishing or not). These techniques are based on deep learning models that have attracted much attention from the community recently. In particular, we present a framework with hierarchical long short-term memory networks (H-LSTMs) and attention mechanisms to model the emails simultaneously at the word and the sentence level. Our expectation is to produce an effective model for anti-phishing and demonstrate the effectiveness of deep learning for problems in cybersecurity.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>Despite being one of the oldest tactics, email phishing remains the most common attack used by cybercriminals [phi17a] due to its effectiveness. Phishing attacks exploit users' inability to distinguish legitimate information from fake information sent to them [DTH06, SNM15, SNM17, SNG+17]. In an email phishing campaign, attackers send emails appearing to be from well-known enterprises or organizations directly to their victims or via spoofed emails [Sin05]. These emails try to lure victims into divulging their private information [JJJM07, SNM15, SNG+17] or into visiting an impersonated site (i.e., a fake banking website), on which they will be asked for passwords, credit card numbers or other sensitive information. The recent hack of a high-profile US politician (usually referred to as "John Podesta's hack") is a famous example of this type of attack. It all started with a spoofed email sent to the victim asking him to reset his Gmail password by clicking on a link in the email [pod16]. The technique of email phishing may seem simple, yet the damage it causes is huge. In the US alone, the estimated cost of phishing emails to businesses is half a billion dollars per year [phi17b].</p>
      <p>Numerous methods have been proposed to automatically detect phishing emails [BCP+08, FST07, ANNWN07, GTJA17]. Chandrasekaran et al. proposed to use structural properties of emails and Support Vector Machines (SVM) to classify phishing emails [CNU06]. In [ANNWN07], Abu-Nimeh et al. evaluated six machine learning classifiers on a public phishing email dataset using 43 proposed features. Gupta et al. [GTJA17] presented a survey of recent state-of-the-art research on phishing detection. However, these methods mainly rely on feature engineering efforts to generate characteristics (features) to represent emails, over which machine learning methods can be applied to perform the task. Such feature engineering is often done manually and still requires much labor and domain expertise. This has hindered the portability of the systems to new domains and limited the performance of the current systems.</p>
      <p>In order to overcome this problem, our work focuses on deep learning techniques to solve the problem of phishing email detection. The major benefit of deep learning is its ability to automatically induce effective and task-specific representations from data that can be used as features to recognize phishing emails. As deep learning has been shown to achieve state-of-the-art performance for many natural language processing tasks, including text categorization [GBB11, LXLZ15], information extraction [NG15b, NG15a, NG16], machine translation [BCB14], among others, we expect that it would also help to build effective systems for phishing email detection.</p>
      <p>We present a new deep learning model to solve the problem of email phishing prediction using hierarchical long short-term memory networks (H-LSTMs) augmented with a supervised attention technique. In the hierarchical LSTM model [YYD+16], emails are considered as hierarchical architectures with words in the lower level (the word level) and sentences in the upper level (the sentence level). LSTM models are first applied at the word level, whose results are passed to LSTM models at the sentence level to generate a representation vector for the entire email. The outputs of the LSTM models in the two levels are combined using the attention mechanism [BCB14] that assigns contribution weights to the words and sentences in the emails. A header network is also integrated to model the headers of the emails if they are available. In addition, we propose a novel technique to supervise the attention mechanism [MWI16, LUF+16, LCLZ17] at the word level of the hierarchical LSTMs based on the appearance rank of the words in the vocabulary. Experiments on the datasets for phishing email detection in the First Security and Privacy Analytics Anti-Phishing Shared Task (IWSPA-AP 2018) [EDB+18] demonstrate the benefits of the proposed models, which rank among the top positions among the participating systems of the shared task (in terms of the performance on the unseen test data).</p>
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>Phishing email detection is a classic problem; however, research on this topic often suffers from the same limitation: there is no official and large dataset for it. Most previous works typically used a public set consisting of legitimate or "ham" emails (https://spamassassin.apache.org/old/publiccorpus) and another public set of phishing emails (https://monkey.org/~jose/phishing) for their classification evaluation [FST07, BCP+08, BMS08, HA11, ZY12]. Other works used private but small datasets [CNU06, ANNWN07]. In addition, the ratio between phishing and legitimate emails in these datasets was typically balanced. This is not the case in the real-world scenario where the number of legitimate emails is much larger than that of phishing emails. Our current work relies on larger datasets with unbalanced distributions of phishing and legitimate emails collected for the First Security and Privacy Analytics Anti-Phishing Shared Task (IWSPA-AP 2018) [EDB+18].</p>
      <p>Besides the limitation of small datasets, the previous work has extensively relied on feature engineering to manually find representative features for the problem. Apart from features extracted from emails, [LV12] also uses a blacklist of phishing websites to get an additional feature for the URLs appearing in emails. Some neural network systems have also been introduced to detect such blacklists [MTM14, MAS+11]. This is undesirable because these engineered features need to be updated once new types of phishing emails with new content are presented. Our work differs from the previous work in this area in that we automate the feature engineering process using a deep learning model. This allows us to automatically learn effective features for phishing email detection from data. Deep learning has recently been employed for feature extraction with success on many natural language processing problems [NG15b, NG15a].</p>
    </sec>
    <sec id="sec-4">
      <title>Proposed Model</title>
      <p>Phishing email detection is a binary classification problem that can be formalized as follows.</p>
      <p>Let e = {b, s} be an email in which b and s are the body content and header of the email respectively. Let y be the binary variable indicating whether e is a phishing email or not (y = 1 if e is a phishing email and y = 0 otherwise). In order to predict the legitimacy of the email, our goal is to estimate the probability P(y = 1|e) = P(y = 1|b, s). In the following, we describe our methods to model the body b and header s with the body network and header network respectively to achieve this goal.</p>
      <sec id="sec-4-1">
        <title>Body Network with Hierarchical LSTMs</title>
        <p>For the body b, we view it as a sequence of sentences b = (u_1, u_2, ..., u_L) where u_i is the i-th sentence and L is the number of sentences in the email body b. Each sentence u_i is in turn a sequence of words/tokens u_i = (v_{i,1}, v_{i,2}, ..., v_{i,K}) with v_{i,j} as the j-th token in u_i and K as the length of the sentence. Note that we set L and K to fixed values by padding the sentences u_i and the body b with dummy symbols.</p>
        <p>As there are two levels of information in b (i.e., the word level with the words v_{i,j} and the sentence level with the sentences u_i), we consider a hierarchical network that involves two layers of bidirectional long short-term memory networks (LSTMs) to model such information. In particular, the first layer consumes the words in the sentences via the embedding module, the bidirectional LSTM module and the attention module to obtain representation vectors for every sentence u_i in b (the word level layer). Afterward, the second network layer combines the representation vectors from the first layer with another bidirectional LSTM and attention module, leading to a representation vector for the whole email body b (the sentence level layer). This body representation vector would then be used as features to estimate P(y|b, s) and make the prediction for the initial email e.</p>
      </sec>
      <sec id="sec-4-2">
        <title>The Word Level Layer</title>
      </sec>
      <sec id="sec-4-3">
        <title>Embedding</title>
        <p>In the word level layer, every word v_{i,j} in each sentence u_i in b is first transformed into its embedding vector w_{i,j}. In this paper, w_{i,j} is retrieved by taking the corresponding column vector in the word embedding matrix W_e [MSC+13] that has been pre-trained from a large corpus: w_{i,j} = W_e[v_{i,j}] (each column in the matrix W_e corresponds to a word in the vocabulary). As a result of this embedding step, every sentence u_i = (v_{i,1}, v_{i,2}, ..., v_{i,K}) in b would be converted into a sequence of vectors (w_{i,1}, w_{i,2}, ..., w_{i,K}), constituting the inputs for the bidirectional LSTM model in the next step.</p>
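        <p>To make the lookup concrete, the following is a minimal sketch with a toy vocabulary and random embeddings (the vocabulary, words and values here are illustrative only, not the actual pre-trained matrix):</p>
        <preformat>
import numpy as np

vocab = {"please": 0, "verify": 1, "your": 2, "account": 3}
emb_dim = 300                               # embedding size (Section 4.3)
W_e = np.random.randn(emb_dim, len(vocab))  # one column per vocabulary word

sentence = ["please", "verify", "your", "account"]
w = np.stack([W_e[:, vocab[v]] for v in sentence])  # (K, emb_dim) LSTM inputs
        </preformat>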
      </sec>
      <sec id="sec-4-4">
        <title>Bidirectional LSTMs for the word level</title>
        <p>This module employs two LSTMs [HS97, GS05] that run over each input vector sequence (w_{i,1}, w_{i,2}, ..., w_{i,K}) in two different directions, i.e., forward (from w_{i,1} to w_{i,K}) and backward (from w_{i,K} to w_{i,1}). Along with their operations, the forward LSTM generates the forward hidden vector sequence (h→_{i,1}, h→_{i,2}, ..., h→_{i,K}) while the backward LSTM produces the backward hidden vector sequence (h←_{i,1}, h←_{i,2}, ..., h←_{i,K}). These two hidden vector sequences are then concatenated at each position, resulting in the new hidden vector sequence (h_{i,1}, h_{i,2}, ..., h_{i,K}) for the sentence u_i in b where h_{i,j} = [h→_{i,j}, h←_{i,j}]. The notable characteristic of the hidden vector h_{i,j} is that it encodes the context information over the whole sentence u_i due to the recurrent property of the forward and backward LSTMs, although a greater focus is put on the current word v_{i,j}.</p>
        <p>In this module, the vectors in the hidden vector sequence (h_{i,1}, h_{i,2}, ..., h_{i,K}) are combined to generate a single representation vector for the initial sentence u_i. The attention mechanism [YYD+16] seeks to do this by computing a weighted sum of the vectors in the sequence. Each hidden vector h_{i,j} would be assigned a weight α_{i,j} to estimate its importance/contribution to the representation vector for u_i for the phishing prediction of the email e. In this work, the weight α_{i,j} for h_{i,j} is computed by:</p>
        <p>α_{i,j} = exp(a_{i,j}^T w_a) / Σ_{j'} exp(a_{i,j'}^T w_a)   (1)</p>
        <p>in which</p>
        <p>a_{i,j} = tanh(W_att h_{i,j} + b_att)   (2)</p>
        <p>Here, W_att, b_att and w_a are the model parameters that would be learned during the training process. Consequently, the representation vector û_i for the sentence u_i in b would be:</p>
        <p>û_i = Σ_j α_{i,j} h_{i,j}   (3)</p>
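        <p>A small numerical sketch of Equations (1)-(3), with toy dimensions; W_att, b_att and w_a are randomly initialized here but would be learned in practice:</p>
        <preformat>
import numpy as np

K, d = 4, 120                      # sentence length, BiLSTM output size
h = np.random.randn(K, d)          # hidden vectors h_{i,1..K}
W_att = np.random.randn(d, d)
b_att = np.zeros(d)
w_a = np.random.randn(d)

a = np.tanh(h @ W_att.T + b_att)                 # Eq. (2)
scores = a @ w_a                                 # a_{i,j}^T w_a
alpha = np.exp(scores) / np.exp(scores).sum()    # Eq. (1): softmax over j
u_hat = alpha @ h                                # Eq. (3): sentence vector
        </preformat>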
        <p>After the word level layer completes its operation on every sentence of b = (u_1, u_2, ..., u_L), we obtain a corresponding sequence of sentence representation vectors (û_1, û_2, ..., û_L). This vector sequence would be combined in the next sentence level layer to generate a single vector to represent b for phishing prediction.</p>
      </sec>
      <sec id="sec-4-5">
        <title>The Sentence Level Layer</title>
        <p>The sentence level layer processes the vector sequence (û_1, û_2, ..., û_L) in the same way that the word level layer has employed for the vector sequence (w_{i,1}, w_{i,2}, ..., w_{i,K}) for each sentence u_i. Specifically, (û_1, û_2, ..., û_L) is also first fed into a bidirectional LSTM module (i.e., with a forward and a backward LSTM) whose results are concatenated at each position to produce the corresponding hidden vector sequence (ĥ_1, ĥ_2, ..., ĥ_L). In the next step with the attention module, the vectors in (ĥ_1, ĥ_2, ..., ĥ_L) are weighted and summed to finally generate the representation vector r_b for the email body b of e. Assuming the attention weights for (ĥ_1, ĥ_2, ..., ĥ_L) are (α_1, α_2, ..., α_L) respectively, the body vector r_b is then computed by:</p>
        <p>r_b = Σ_i α_i ĥ_i   (4)</p>
        <p>Note that the model parameters of the bidirectional
LSTM modules (and the attention modules) in the
word level layer and the sentence level layer are
separate and they are both learnt in a single training
process. Figure 1 shows the overview of the body network
with hierarchical LSTMs and attention.</p>
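        <p>For illustration, the following is a compact PyTorch sketch of the body network under our own naming and the dimensions of Section 4.3; it is an approximation of the architecture, not the exact implementation used in the experiments:</p>
        <preformat>
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.att = nn.Linear(dim, dim)            # W_att, b_att
        self.w_a = nn.Linear(dim, 1, bias=False)  # w_a

    def forward(self, h):                         # h: (batch, steps, dim)
        a = torch.tanh(self.att(h))                              # Eq. (2)
        alpha = torch.softmax(self.w_a(a).squeeze(-1), dim=-1)   # Eq. (1)
        return (alpha.unsqueeze(-1) * h).sum(dim=1), alpha       # Eq. (3)

class BodyNetwork(nn.Module):
    def __init__(self, emb, hidden=60):
        super().__init__()
        self.emb = nn.Embedding.from_pretrained(emb, freeze=False)
        self.word_lstm = nn.LSTM(emb.size(1), hidden,
                                 bidirectional=True, batch_first=True)
        self.sent_lstm = nn.LSTM(2 * hidden, hidden,
                                 bidirectional=True, batch_first=True)
        self.word_att = Attention(2 * hidden)     # separate parameters
        self.sent_att = Attention(2 * hidden)     # per layer
        self.out = nn.Linear(2 * hidden, 1)       # W_out, b_out

    def forward(self, body):                      # body: (B, L, K) word ids
        B, L, K = body.size()
        h, _ = self.word_lstm(self.emb(body.view(B * L, K)))
        u_hat, word_alpha = self.word_att(h)      # sentence vectors û_i
        h_sent, _ = self.sent_lstm(u_hat.view(B, L, -1))
        r_b, _ = self.sent_att(h_sent)            # body vector r_b, Eq. (4)
        return torch.sigmoid(self.out(r_b)).squeeze(-1), word_alpha

# usage sketch: 100-word vocabulary, 300-dim embeddings, 2 emails,
# L = 5 sentences of K = 8 padded words each
model = BodyNetwork(torch.randn(100, 300))
p, alpha = model(torch.randint(0, 100, (2, 5, 8)))
        </preformat>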
        <p>Once the body vector r_b has been computed, we can use it as features to estimate the phishing probability via:</p>
        <p>P(y = 1|b, s) = σ(W_out r_b + b_out)   (5)</p>
        <p>where W_out and b_out are the model parameters and σ is the logistic function.</p>
      </sec>
      <sec id="sec-4-6">
        <title>Header Network</title>
        <p>The probability estimation in Equation 5 does not consider the headers of the emails. For the email datasets with headers available, we can model the headers with a separate network and use the resulting representation as additional features to estimate the phishing probability. In this work, we consider the header s of the initial email e as a sequence of words/tokens (x_1, x_2, ..., x_H) where x_i is the i-th word in the header and H is the length of the header. In order to compute the representation vector r_s for s, we employ the same network architecture as the word level layer in the body network, using separate modules for embedding, bidirectional LSTM, and attention (i.e., Section 3.1.1). An overview of this header network is presented in Figure 2.</p>
        <p>Once the header representation vector r_s is generated, we concatenate it with the body representation vector r_b obtained from the body network, leading to the final representation vector r = [r_b, r_s] to compute the probability P(y = 1|b, s) = σ(W_sub r + b_sub) (W_sub and b_sub are model parameters).</p>
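        <p>A minimal sketch of this combination step (names and dimensions are ours):</p>
        <preformat>
import torch
import torch.nn as nn

r_b = torch.randn(120)             # body vector from the body network
r_s = torch.randn(120)             # header vector from the header network
sub = nn.Linear(240, 1)            # W_sub, b_sub
r = torch.cat([r_b, r_s])          # r = [r_b, r_s]
p_phish = torch.sigmoid(sub(r))    # P(y = 1 | b, s)
        </preformat>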
        <p>In order to train the models in this work, we minimize the negative log-likelihood of the models on a training dataset, in which the negative log-likelihood for the email e is computed by:</p>
        <p>L_c = -log(P(y = 1|e))   (6)</p>
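        <p>Equation 6 is the usual binary cross-entropy objective; a minimal sketch with hypothetical predictions:</p>
        <preformat>
import torch
import torch.nn.functional as F

p = torch.tensor([0.9, 0.2])     # predicted P(y = 1 | e) for two emails
y = torch.tensor([1.0, 0.0])     # gold labels: phishing, legitimate
loss_c = F.binary_cross_entropy(p, y)   # mean negative log-likelihood
        </preformat>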
        <p>The model we have described so far is called H-LSTMs for convenience.</p>
      </sec>
      <sec id="sec-4-7">
        <title>Supervised Attention</title>
        <p>The attention mechanism in the body and header networks is expected to assign high weights to the informative words/sentences and downgrade the irrelevant words/sentences for phishing detection in the emails. However, this ideal operation can only be achieved when an enormous training dataset is provided to train the models. In our case of phishing email detection, the size of the training dataset is not large enough and we might not be able to exploit the full advantages of the attention. In this work, we seek useful heuristics for the problem and inject them into the models to facilitate the operation of the attention mechanism. In particular, we would first heuristically decide a score for every word in the sentences so that the words with higher scores are considered as being more important for phishing detection than those with lower scores. Afterward, the models would be encouraged to produce attention weights for words that are close to their heuristic importance scores. The expectation is that this mechanism would help to introduce our intuition into the attention weights to compensate for the small scale of the training dataset, potentially leading to a better performance of the models. Assuming the importance scores for the words in the sentence (v_{i,1}, v_{i,2}, ..., v_{i,K}) are (g_{i,1}, g_{i,2}, ..., g_{i,K}) respectively, we force the attention weights (α_{i,1}, α_{i,2}, ..., α_{i,K}) (Equation 1) to be close to the importance scores by penalizing the models that render a large squared difference between the attention weights and the importance scores. This amounts to adding the squared difference to the objective function in Equation 6:</p>
        <p>L_e = L_c + λ Σ_{i,j} (g_{i,j} - α_{i,j})^2   (7)</p>
        <p>where λ is a trade-off constant.</p>
      </sec>
      <sec id="sec-4-8">
        <title>Importance Score Computation</title>
        <p>In order to compute the importance scores, our intuition is that a word is important for phishing detection if it appears frequently in phishing emails and less frequently in legitimate emails. The fact that an important word does not appear in many legitimate emails helps to eliminate the common words that are used in most documents. Consequently, the frequent words that are specific to the phishing emails would receive higher importance scores in our method. Note that our method to find the important words for phishing emails is different from the prior work that has only considered the most frequent words in the phishing emails and ignored their appearance in the legitimate emails.</p>
        <p>We compute the importance scores as follows. For every word v in the vocabulary, we count the number of the phishing and legitimate emails in a training dataset that contain the word. We call the results the phishing email frequency and the legitimate email frequency respectively for v. In the next step, we sort the words in the vocabulary based on their phishing and legitimate email frequencies in descending order. After that, a word v would have a phishing rank (phishingRank(v)) and a legitimate rank (legitimateRank(v)) in the sorted word sequences based on the phishing and legitimate frequencies (the higher the rank is, the lower the frequency is). Given these ranks, the unnormalized importance score for v is computed by:</p>
        <p>score[v] = legitimateRank[v] / phishingRank[v]   (8)</p>
        <p>The rationale for this formula is that a word would have a high importance score for phishing prediction if its legitimate rank is high and its phishing rank is low. Note that we use the ranks of the words instead of the frequencies because the frequencies are affected by the size of the training dataset, potentially making the scores unstable. The ranks are less affected by the dataset size and provide a more stable measure. (The actual importance scores of the words we use in Equation 7 are normalized for each sentence.) Table 1 shows the top 20 words with the highest unnormalized importance scores in our vocabulary.</p>
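        <p>A sketch of this procedure under our assumptions (emails given as token lists with binary labels; rank 1 corresponds to the most frequent word):</p>
        <preformat>
from collections import Counter

def importance_scores(emails, labels):
    # document frequencies of each word in phishing vs. legitimate emails
    phish_df, legit_df = Counter(), Counter()
    for tokens, y in zip(emails, labels):
        df = phish_df if y == 1 else legit_df
        for v in set(tokens):
            df[v] += 1
    vocab = set(phish_df) | set(legit_df)
    def ranks(df):
        # rank 1 = most frequent word under this frequency
        ordered = sorted(vocab, key=lambda v: -df[v])
        return {v: r + 1 for r, v in enumerate(ordered)}
    p_rank, l_rank = ranks(phish_df), ranks(legit_df)
    return {v: l_rank[v] / p_rank[v] for v in vocab}   # Eq. (8)

scores = importance_scores(
    [["verify", "your", "account"], ["lunch", "tomorrow"]], [1, 0])
        </preformat>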
        <p>The H-LSTMs model augmented with the supervised attention mechanism above is called H-LSTMs+supervised in the experiments.</p>
        <p>Table 1: The top 20 words with the highest unnormalized importance scores: account, your, click, mailbox, cornell, link, verify, customer, access, reserved, dear, log, accounts, paypal, complete, service, protect, secure, mail, clicking.</p>
        <p>We train the models in this work with stochastic gradient descent, shuffled mini-batches and the Adam update rule [KB14]. The gradients are computed via back-propagation while dropout is used for regularization [SHK+14]. We also implement gradient clipping to rescale the Frobenius norms of the non-embedding weights if they exceed a predefined threshold.</p>
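        <p>A minimal sketch of this optimization setup on a stand-in model; note that we use PyTorch's gradient-norm clipping here as an approximation of the weight-norm rescaling described above:</p>
        <preformat>
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 1), nn.Sigmoid())   # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=0.0025)
x = torch.randn(32, 10)                                 # one mini-batch
y = torch.randint(0, 2, (32, 1)).float()

optimizer.zero_grad()
loss = nn.functional.binary_cross_entropy(model(x), y)
loss.backward()                                         # back-propagation
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.3)
optimizer.step()                                        # Adam update
        </preformat>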
        <p>The models in this work are developed to participate in the First Security and Privacy Analytics Anti-Phishing Shared Task (IWSPA-AP 2018) [EDB+18]. The organizers provide two datasets to train the models for email phishing recognition. The first dataset involves emails that only have the body part (called data-no-header) while the second dataset contains emails with both bodies and headers (called data-full-header). These two datasets translate into two shared tasks to be solved by the participants. The statistics of the training data for these two datasets are shown in Table 2.</p>
        <p>Table 2: Statistics of the training data. data-no-header: 5092 legitimate emails (#legit), 629 phishing emails (#phish); data-full-header: 4082 legitimate emails, 503 phishing emails.</p>
        <p>The raw test data (i.e., without labels) for these datasets are released to the participants at a specified time. The participants would have one week to run their systems on such raw test data and submit the results to the organizers for evaluation.</p>
        <p>Regarding the preprocessing procedure for the datasets, we notice that a large part of the text in the email bodies is quite unstructured. The sentences are often short and/or not clearly separated by the ending-sentence symbols (i.e., ".", "!", "?"). In order to split the bodies of the emails into sentences for our models, we develop an in-house sentence splitter specially designed for the datasets. In particular, we determine the beginning of a sentence by considering if the first word of a new line is capitalized or not, or if a capitalized word immediately follows an ending-sentence symbol. The sentences whose lengths (numbers of words) are less than 3 are combined to create a longer sentence. This reduces the number of sentences significantly and expands the context for the words in the sentences as they are processed by the models. Figure 3 shows a phishing email from the datasets.</p>
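        <p>A simplified approximation of this splitter (it implements the capitalization and ending-symbol heuristics and the merging of short sentences, but omits the dataset-specific refinements):</p>
        <preformat>
def split_sentences(body, min_len=3):
    pieces, current = [], []
    for line in body.splitlines():
        tokens = line.split()
        # a new line starting with a capitalized word begins a new sentence
        if tokens and tokens[0][0].isupper() and current:
            pieces.append(current)
            current = []
        for token in tokens:
            current.append(token)
            if token[-1] in '.!?':          # ending-sentence symbol
                pieces.append(current)
                current = []
    if current:
        pieces.append(current)
    # sentences shorter than min_len words are merged into a longer one
    merged = []
    for p in pieces:
        if not merged or len(merged[-1]) >= min_len:
            merged.append(p)
        else:
            merged[-1] += p
    return [' '.join(p) for p in merged]
        </preformat>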
        <p>In order to see how well the proposed deep learning models (i.e., H-LSTMs and H-LSTMs+supervised) perform with respect to the traditional methods for email phishing detection, we compare the proposed models with a baseline model based on Support Vector Machines (SVM) [CNU06]. We use the tf-idf scores of the words in the vocabulary as the features for this baseline [CNU06]. Note that since the email addresses and URLs in the provided datasets have been mostly hidden to protect personal information, we cannot use them as features in our SVM baselines as the previous systems do. In addition, we examine the performance of this baseline when the pre-trained word embeddings are included in its feature set. This allows a fairer comparison of SVM with the deep learning models in this work that take pre-trained word embeddings as the input.</p>
        <p>We employ the implementation of linear and nonlinear (kernel) SVM from the sklearn library [PVG+11], for which the tf-idf representations of the emails are obtained via the gensim toolkit [RS10]. The word embedding features are computed by taking the mean vector of the pre-trained embeddings of the words in the emails [NPG15].</p>
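        <p>A sketch of these baselines (for brevity, tf-idf is computed here with sklearn's vectorizer rather than gensim, and the mean embedding is a random stand-in; the hyper-parameter values follow Section 4.3):</p>
        <preformat>
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC, LinearSVC

emails = ["please verify your account now", "meeting notes attached",
          "click the link to keep mailbox access", "see you at lunch"]
labels = [1, 0, 1, 0]

tfidf = TfidfVectorizer().fit_transform(emails).toarray()
mean_emb = np.random.randn(len(emails), 300)   # stand-in for mean word2vec
features = np.hstack([tfidf, mean_emb])

linear = LinearSVC(C=10.0).fit(features, labels)                     # linear SVM
kernel = SVC(C=50.0, gamma=0.1, kernel='rbf').fit(features, labels)  # C-SVC
        </preformat>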
      </sec>
      <sec id="sec-4-9">
        <title>Hyper-parameter Selection</title>
        <p>As the size of the provided datasets is small and no development data is included, we use a 5-fold stratified cross-validation on the training data of the provided datasets to search for the best hyper-parameters for the models. The hyper-parameters we found are as follows.</p>
        <p>The size of the word embedding vectors is 300 while the cell sizes are set to 60 for all the LSTMs in the body and header networks. The sizes of the attention vectors at the attention modules for the body and header networks are also set to 60. The coefficient λ for the supervised attention is set to 0.1, the threshold for gradient clipping is 0.3 and the drop rate for dropout is 0.5. For the Adam update rule, we use a learning rate of 0.0025. Finally, we set C = 10.0 for the linear SVM baseline. The nonlinear version of SVM we use is C-SVC with the radial basis function kernel and (C, γ) = (50.0, 0.1).</p>
      </sec>
      <sec id="sec-4-10">
        <title>Results</title>
        <p>In the experiments below, we employ the precision,
recall and F1-score to evaluate the performance of the
models for detecting phishing emails. In addition, the
proposed models H-LSTMs and H-LSTMs+supervised
only utilize the header network in the evaluation on
data-full-header.</p>
      </sec>
      <sec id="sec-4-11">
        <title>Data without header</title>
        <p>In the first experiment, we focus on the first shared task where email headers are not considered. We compare the proposed deep learning models with the SVM baselines. In particular, in the first setting, we use data-no-header as the training data and perform a 5-fold stratified cross-validation to evaluate the models. In the second setting, data-no-header is also utilized as the training data, but the bodies extracted from data-full-header (along with the corresponding labels) are employed as the test data. The results of the first setting are shown in Table 3 while the results of the second setting are presented in Table 4. Note that we report the performance of the SVM baselines when different combinations of the two types of features (i.e., tf-idf and word embeddings) are employed in these tables.</p>
        <p>The first observation from the tables is that the effect of the word embedding features on the SVM models is quite mixed. It improves the SVM models with just tf-idf features significantly in the first experiment setting while the effect is somewhat negative in the second experiment setting. Second, we see that the two versions of hierarchical LSTMs (i.e., H-LSTMs and H-LSTMs+supervised) outperform the baseline SVM models in both experiment settings. The performance improvement is significant with large margins (up to 2.7% improvement on the absolute F1 score) in the second experiment setting (i.e., Table 4). The main gain is due to the recall, demonstrating the generalization advantages of the proposed deep learning models over the traditional methods for phishing detection with SVM. Comparing H-LSTMs+supervised and H-LSTMs, we see that H-LSTMs+supervised is consistently better than H-LSTMs with significant improvement in the second setting. This shows the benefits of supervised attention for hierarchical LSTM models for email phishing detection. Finally, we see that the performance in the first setting is in general much better than that in the second setting. We attribute this to the fact that the text data in data-no-header and data-full-header is quite different, leading to the mismatch between the data distributions of the training data and test data in the second experiment setting.</p>
        <p>In the final submission for the first shared task (i.e., without email headers), we combine the training data from data-no-header with the extracted bodies (along with the corresponding labels) from the training data of data-full-header to generate a new training set. As H-LSTMs+supervised is the best model in this development experiment, we train it on the new training set and use the trained model to make predictions for the actual test set of the first shared task.</p>
      </sec>
      <sec id="sec-4-12">
        <title>Data with full header</title>
        <p>In this experiment, we aim to evaluate if the header network can help to improve the performance of H-LSTMs. We take the training dataset from data-full-header to perform a 5-fold cross-validation evaluation. The performance of H-LSTMs when the header network is included or excluded is shown in Table 5.</p>
        <p>Table 5: Performance of H-LSTMs with the header network included or excluded (models: H-LSTMs (only body) and H-LSTMs + headers).</p>
        <p>From the table, we see that the header network is also helpful for H-LSTMs as it helps to improve the performance of H-LSTMs on the dataset with email headers (a 0.7% improvement on the F1 score).</p>
        <p>In the final submission for the second shared task (i.e., with email headers), we simply train our best model in this setting (i.e., H-LSTMs+supervised) on the training dataset of data-full-header.</p>
        <p>The time for the training and test process of the proposed (and submitted) models is shown in Table 6. Note that the training time of H-LSTMs+supervised (for the first shared task) is longer than that of H-LSTMs+headers+supervised (for the second shared task) since the training data of the former model includes both the original training data of the first task and the extracted bodies from the training data of the second task. The test data of the first shared task with H-LSTMs+supervised is also larger than that of the second shared task with H-LSTMs+headers+supervised.</p>
        <p>Table 6: Training and test times of the submitted models. H-LSTMs+supervised: 3.7 hours of training, 4 minutes of test. H-LSTMs+headers+supervised: 1.5 hours of training, 1 minute of test.</p>
        <p>As we can see from the tables, our systems achieve the best performance for the first shared task and the second-best performance for the second shared task. These results are very promising and demonstrate the advantages of the proposed methods in particular and deep learning in general for the problem of email phishing recognition.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We present a deep learning model to detect phishing emails. Our model employs hierarchical attentive LSTMs to model the email bodies at both the word level and the sentence level. A header network with attentive LSTMs is also incorporated to model the headers of the emails. In the models, we propose a novel supervised attention technique to improve the performance using the email frequency ranking of the words in the vocabulary. Several experiments are conducted to demonstrate the effectiveness of the proposed models for phishing email detection.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1"><mixed-citation>[ANNWN07] Saeed Abu-Nimeh, Dario Nappa, Xinlei Wang, and Suku Nair. A comparison of machine learning techniques for phishing detection. In Proceedings of the Anti-Phishing Working Groups 2nd Annual eCrime Researchers Summit, pages 60-69. ACM, 2007.</mixed-citation></ref>
      <ref id="ref2"><mixed-citation>[BCB14] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.</mixed-citation></ref>
      <ref id="ref3"><mixed-citation>[BCP+08] Andre Bergholz, Jeong Ho Chang, Gerhard Paass, Frank Reichartz, and Siehyun Strobel. Improved phishing detection using model-based features. In CEAS, 2008.</mixed-citation></ref>
      <ref id="ref4"><mixed-citation>[BMS08] Ram Basnet, Srinivas Mukkamala, and Andrew H. Sung. Detection of phishing attacks: A machine learning approach. In Soft Computing Applications in Industry, pages 373-383. Springer, 2008.</mixed-citation></ref>
      <ref id="ref5"><mixed-citation>[CNU06] Madhusudhanan Chandrasekaran, Krishnan Narayanan, and Shambhu Upadhyaya. Phishing email detection based on structural properties. In NYS Cyber Security Conference, volume 3, 2006.</mixed-citation></ref>
      <ref id="ref6"><mixed-citation>[DTH06] Rachna Dhamija, J. Doug Tygar, and Marti Hearst. Why phishing works. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 581-590. ACM, 2006.</mixed-citation></ref>
      <ref id="ref7"><mixed-citation>[EDB+18] Ayman Elaassal, Avisha Das, Shahryar Baki, Luis De Moraes, and Rakesh Verma. IWSPA-AP: Anti-phishing shared task at ACM International Workshop on Security and Privacy Analytics. In Proceedings of the 1st IWSPA Anti-Phishing Shared Task. CEUR, 2018.</mixed-citation></ref>
      <ref id="ref8"><mixed-citation>[FST07] Ian Fette, Norman Sadeh, and Anthony Tomasic. Learning to detect phishing emails. Pages 649-656. ACM, 2007.</mixed-citation></ref>
      <ref id="ref9"><mixed-citation>[GBB11] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, 2011.</mixed-citation></ref>
      <ref id="ref10"><mixed-citation>[GS05] Alex Graves and Jurgen Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602-610, 2005.</mixed-citation></ref>
      <ref id="ref11"><mixed-citation>[GTJA17] B. B. Gupta, Aakanksha Tewari, Ankit Kumar Jain, and Dharma P. Agrawal. Fighting against phishing attacks: state of the art and future challenges. Neural Computing and Applications, 28(12):3629-3654, 2017.</mixed-citation></ref>
      <ref id="ref12"><mixed-citation>[HA11] Isredza Rahmi A. Hamid and Jemal Abawajy. Hybrid feature selection for phishing email detection. In International Conference on Algorithms and Architectures for Parallel Processing, pages 266-275. Springer, 2011.</mixed-citation></ref>
      <ref id="ref13"><mixed-citation>[HS97] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. In Neural Computation, 1997.</mixed-citation></ref>
      <ref id="ref14"><mixed-citation>[JJJM07] Tom N. Jagatic, Nathaniel A. Johnson, Markus Jakobsson, and Filippo Menczer. Social phishing. Communications of the ACM, 50(10):94-100, 2007.</mixed-citation></ref>
      <ref id="ref15"><mixed-citation>[KB14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In arXiv:1412.6980, 2014.</mixed-citation></ref>
      <ref id="ref16"><mixed-citation>[LCLZ17] Shulin Liu, Yubo Chen, Kang Liu, and Jun Zhao. Exploiting argument information to improve event detection via supervised attention mechanisms. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1789-1798, 2017.</mixed-citation></ref>
      <ref id="ref17"><mixed-citation>[LUF+16] Lemao Liu, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. Neural machine translation with supervised attention. arXiv preprint arXiv:1609.04186, 2016.</mixed-citation></ref>
      <ref id="ref18"><mixed-citation>[LV12] V. Santhana Lakshmi and M. S. Vijaya. Efficient prediction of phishing websites using supervised learning algorithms. Procedia Engineering, 30:798-805, 2012.</mixed-citation></ref>
      <ref id="ref19"><mixed-citation>[LXLZ15] Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. Recurrent convolutional neural networks for text classification. In AAAI, volume 333, pages 2267-2273, 2015.</mixed-citation></ref>
      <ref id="ref20"><mixed-citation>[MAS+11] Anutthamaa Martin, Na Anutthamaa, M. Sathyavathy, Marie Manjari Saint Francois, V. Prasanna Venkatesan, et al. A framework for predicting phishing websites using neural networks. arXiv preprint arXiv:1109.1074, 2011.</mixed-citation></ref>
      <ref id="ref21"><mixed-citation>[MSC+13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013.</mixed-citation></ref>
      <ref id="ref22"><mixed-citation>[MTM14] Rami M. Mohammad, Fadi Thabtah, and Lee McCluskey. Predicting phishing websites based on self-structuring neural network. Neural Computing and Applications, 25(2):443-458, 2014.</mixed-citation></ref>
      <ref id="ref23"><mixed-citation>[MWI16] Haitao Mi, Zhiguo Wang, and Abe Ittycheriah. Supervised attentions for neural machine translation. arXiv preprint arXiv:1608.00112, 2016.</mixed-citation></ref>
      <ref id="ref24"><mixed-citation>[NG15a] Thien Huu Nguyen and Ralph Grishman. Event detection and domain adaptation with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 365-371, 2015.</mixed-citation></ref>
      <ref id="ref25"><mixed-citation>[NG15b] Thien Huu Nguyen and Ralph Grishman. Relation extraction: Perspective from convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 39-48, 2015.</mixed-citation></ref>
      <ref id="ref26"><mixed-citation>[NG16] Thien Huu Nguyen and Ralph Grishman. Modeling skip-grams for event detection with convolutional neural networks. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016.</mixed-citation></ref>
      <ref id="ref27"><mixed-citation>[NPG15] Thien Huu Nguyen, Barbara Plank, and Ralph Grishman. Semantic representations for domain adaptation: A case study on the tree kernel-based method for relation extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, 2015.</mixed-citation></ref>
      <ref id="ref28"><mixed-citation>[phi17a] 2017 data breach report finds phishing, email attacks still potent. https://digitalguardian.com/blog/2017-data-breach-report-finds-phishing-email-attacks-still-potent, 2017.</mixed-citation></ref>
      <ref id="ref29"><mixed-citation>[phi17b] Phishing scams cost American businesses half a billion dollars a year. Forbes, 2017.</mixed-citation></ref>
      <ref id="ref30"><mixed-citation>[pod16] How John Podesta's emails were hacked and how to prevent it from happening to you. Forbes, 2016.</mixed-citation></ref>
      <ref id="ref31"><mixed-citation>[PVG+11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.</mixed-citation></ref>
      <ref id="ref32"><mixed-citation>[RS10] Radim Rehurek and Petr Sojka. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010.</mixed-citation></ref>
      <ref id="ref33"><mixed-citation>[SHK+14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. In The Journal of Machine Learning Research, 2014.</mixed-citation></ref>
      <ref id="ref34"><mixed-citation>[SNM15] Hossein Siadati, Toan Nguyen, and Nasir Memon. Verification code forwarding attack (short paper). In International Conference on Passwords, pages 65-71. Springer, 2015.</mixed-citation></ref>
      <ref id="ref35"><mixed-citation>[SNM17] Hossein Siadati, Toan Nguyen, and Nasir Memon. X-platform phishing: Abusing trust for targeted attacks (short paper). In International Conference on Financial Cryptography and Data Security, pages 587-596. Springer, 2017.</mixed-citation></ref>
      <ref id="ref36"><mixed-citation>[YYD+16] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480-1489, 2016.</mixed-citation></ref>
      <ref id="ref37"><mixed-citation>[ZY12] Phishing detection using neural network. CS229 lecture notes, 2012.</mixed-citation></ref>
    </ref-list>
  </back>
</article>