A.R.E.S: Automatic Rogue Email Spotter

Crypt Coyotes

Vysakh S Mohan, Naveen J R, Vinayakumar R, Soman KP
Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore
Amrita Vishwa Vidyapeetham, India
vsmo92@gmail.com, naveenaksharam@gmail.com

Abstract

Be it formal or casual, email is undoubtedly the most popular means of communication in modern times. Its popularity owes to the fact that it is reliable, fast and, moreover, free to use. One issue that plagues this otherwise solid technology is phishing emails received by users. Phishing emails have always bothered users because they waste storage, time, money and other resources. Many previous attempts to eradicate or at least block phishing emails have proved futile. This work uses word embeddings as the text representation in a supervised classification approach to identify phishing emails. Rule-based and machine learning models with feature engineering have been attempted, but they failed due to the ever-increasing variety of threats and the models' lack of scalability. Deep learning based models have been shown to surpass the older techniques in spam email detection. This work attempts the same using CNN/RNN/MLP networks with Word2vec embeddings on a phishing email corpus, where Word2vec helps to capture the syntactic and semantic similarity of phishing and legitimate emails. It aims to show the ability of word embeddings to solve problems in cybersecurity use cases.

Copyright © by the paper's authors. Copying permitted for private and academic purposes.
In: R. Verma, A. Das (eds.): Proceedings of the 1st AntiPhishing Shared Pilot at 4th ACM International Workshop on Security and Privacy Analytics (IWSPA 2018), Tempe, Arizona, USA, 21-03-2018, published at http://ceur-ws.org

1 Introduction

The internet, and staying connected through it, is what distinguishes this era from the previous one. More and more people rely on the internet for their communication and data transaction requirements. Email has revolutionized the way people communicate over the web. Since its inception, electronic mail has outgrown its real-world counterpart to become mainstream, serving as both a casual and an official way of passing a message. Several service providers now offer email platforms for free and with a plethora of features, so the number of people taking advantage of these services has grown dramatically. This mass adoption is one aspect any malicious adversary could use to his benefit. Such malicious emails are called spam [CM01]; they are unsolicited junk information that the user usually does not want. They are commonly characterized as follows: they are mass mailed and may contain explicit content, useless advertisements, fraudulent content, or hidden links to phishing websites. On a personal front, the user could face annoyance due to irrelevant information, unwanted use of bandwidth, wasted storage, a less productive communication channel because of time lost sorting junk mail, unnecessary use of computing power, the spread of viruses, and loss of money via phishing.

These issues have brought immense focus on the safety of users against spam emails. The massive pool of users on these platforms is one reason they are targeted so often: email is an inexpensive means of reaching millions of people, which draws adversaries to it. The most dangerous type of emails are spam emails [KRA+07]. They may arrive via a spam email server or from personal servers, and may contain malicious URLs that direct users to phishing sites.

Detecting them is a challenging task. Many solutions have been devised over the past few years, but they all come with some downsides. One reason the task is challenging is the variety of ways in which an attacker can serve a spam email. A frequently used method is the blended attack. Malware delivery through such attacks may vary. Usually the email itself does not contain the malware, but it may contain a link to a compromised website. These emails may look normal while containing a mix of legitimate and malicious content. Earlier research by IBM's X-Force team found that more than 50% of the emails produced worldwide are fraudulent. These figures are expected to increase in subsequent years.

One reason such attacks succeed is carelessness on the part of the typical user. Most internet users are illiterate when it comes to cybersecurity and simply ignore the safety precautions that need to be exercised online. There is no sure way to check whether a person has been a victim of such an attack, but attacks can be prevented by being a bit cautious: one can inspect the email headers and look for grammatical mistakes. These checks may not be sufficient when the scale of such attacks escalates. Such situations require an automated solution to detect spam email.

Email headers can help to a certain extent. They can be used as features for machine learning based classifiers [LT04, S+09]. The advantage of header features over body features is detailed in [ZZY04]. Header features such as sender address and message ID were used for detection in [WC07].

Most popular machine learning techniques consist of two steps: obtain a proper feature representation from the data, and use these features for learning and prediction. The first step focuses on extracting useful information from the given URL, which is stored as a vector so that different machine learning models can be fitted to it. Different categories of features have been used [SLH17]; lexical features, content features, host based features and context features are some of the popular ones. An algorithm requires some form of mathematical representation to work with. This work uses Word2vec embeddings for effective representation of the data.

Spam filtering is a supervised classification problem, treated as a binary classification task with two classes: legitimate (good) emails and spam emails. Tretyakov [Tre04] used naive Bayes and K-NN for spam detection; that work does not deal with feature selection but is useful for beginners. Spam detection, or automatic email filtering, started primarily with statistical approaches. The development began with the popular naive Bayes approaches, which reduce the problem to a space where dependencies and correlations between the data are ignored [SDHH98]; that is, the multivariate nature of the problem breaks down to a univariate one without compromising accuracy. Different authors have tried to incorporate modifications on top of the naive Bayes pipeline, but the approach was unable to find the correlation between words and the algorithm failed in certain tasks. In 2004, Lai and Tsai [LT04] used TF-IDF, K-NN and SVM to overcome the issues in the email filtering task. SVM and TF-IDF gave satisfactory results, while K-NN gave the worst result among them. Blanzieri and Bryl [BB08] came up with feature extraction methods along with SVM. During this time, unsupervised machine learning techniques were also developed, in which data were clustered into spam and ham. In 2011, Whissell and Clarke [WC11] presented novel research on spam clustering, which attained state-of-the-art results compared to all previous methods. Since spam filtering is a diverse area, ensemble methods (combining different algorithms on the same problem), such as boosting and bagging [GGWM+10], are applied to get effective classification. Caruana and Li (2012) [CL12] focused on distributed computing paradigms using SVM and ANN, removing interoperability and implementation issues.

Machine learning models usually rely on some sort of engineered features generated from the data, and they have been shown to surpass the accuracy of their predecessors in spam email classification [FRID+07, AAY11]. In contrast, very few machine learning models for phishing emails exist today, and most of them are in their infancy. With acquired domain knowledge, various feature engineering strategies are employed on the data to build the model [SAZ18], [PHS18], [VH13], [VSH12], [MG18], [HDC+18], [MBA18]. A main advantage of this method is the reduced effort needed to train a classifier compared with developing complex rules for a filter. However, feature engineering can also leave the system vulnerable to manipulation, and the model may not scale well to newer threats. Deep learning models can be used to overcome this issue, as they learn the features themselves and adapt them to newer inputs. On top of that, these models are comparatively more accurate and scalable. Deep learning models combined with word embeddings have recently given good performance in various cybersecurity use cases [VSP18a], [VSP18b], [VSP17], [LF17], [SKP18]. This motivated the use of word embeddings with deep learning models such as the Multi-Layer Perceptron (MLP), Recurrent Neural Network (RNN), Convolutional Neural Network (CNN) and Long Short Term Memory (LSTM).

2 Background

This section details the theory behind the deep learning models used.

2.1 Word2vec

Word2vec is a model proposed by Mikolov [MSC+13] to learn word embeddings. It is inspired by the distributed representation introduced by Hinton [HMR+86], but in the Word2vec framework the word representation is learned using a shallow neural network. The fundamental assumption in word embedding, or distributional, methods is that words with similar senses tend to occur in similar contexts, so the embeddings capture the similarity between words [BG17], [BG18]. Word2vec is a popular model for generating word embeddings on text data. It reproduces the linguistic context of words by training a shallow two-layer architecture. The input to the Word2vec model may be a huge corpus, and the outputs are vectors in a multi-dimensional space, with each unique word in the corpus having a corresponding vector. This makes learning the word representation significantly faster than previous methods. In the Word2vec framework, the distributed representation of the words in the vocabulary is learned in an unsupervised way. Learning can be done via two architectures: skip-gram and continuous bag of words (CBOW).

\[
\frac{1}{N} \sum_{n=1}^{N} \sum_{-s \le k \le s,\; k \ne 0} \log p(Q_{n+k} \mid Q_n) \tag{1}
\]

The skip-gram method maximizes the average log probability in (1) over the word sequence Q_1, Q_2, ..., Q_N. Here s is the training context size around the center word Q_n, and p(Q_{n+k} | Q_n) is given by a softmax function. In the skip-gram model, the context (surrounding) words are predicted given the center word as input; in the CBOW model, the center word is predicted given the surrounding words.

Table 1: Hyperparameters for the Word2vec model

Hyperparameter   Value   Description
Batch-Size       250     The number of training samples required
Embedding-Size   300     Word vector dimension
Skip-Window      7       Context window, five words before and after each word
Num-skips        12      How many prediction pairs are selected from the window
Num-sampled      128     Number of negative samples
Learning rate    0.1     Determines how quickly or slowly the model updates the parameters
n-epoch          50      Number of epochs (forward + backward passes)

2.2 Convolutional Neural Nets (CNN)

CNNs are commonly used for computer vision tasks, where their local receptive fields are advantageous for feature learning in images. CNN models are also used for text classification tasks. A CNN can be thought of as an artificial neural network that is able to pick out or detect patterns and make sense of them. This pattern detection makes CNNs useful for data analysis. A CNN has hidden layers called convolutional layers, which differ somewhat from the layers of an MLP. For each convolutional layer, the number of filters must be specified; each filter then slides over the rows and columns of the input matrix. Each row of this matrix is a vector representing one word; more precisely, the rows are word embeddings produced by models such as GloVe (https://nlp.stanford.edu/projects/glove) or Word2vec (https://www.tensorflow.org/tutorials/Word2vec). This work used the Word2vec model before applying the CNN. CNNs perform well on sequential data, train quickly and are exceptional for predictive analysis. A CNN normally consists of an input layer followed by convolutional layers, maxpooling layers for dimensionality reduction, and fully connected layers with a specific nonlinear activation function (ReLU in this work). In this text-based phishing email detection task, one-dimensional maxpooling layers and fully connected layers are used. The filters in this network slide over the embedding vectors and output a continuous value at each step, which yields better representations of the word vectors. For text-based applications, a 1D CNN is used.

2.3 Multi-layer perceptron (MLP)

Rosenblatt introduced the concept of a single perceptron. A multi-layer perceptron (MLP) is typically a network of perceptrons, or simple neurons. An MLP has one input layer and one output layer. The dimensions of the input and output nodes depend on the number of sample vectors and the number of label vectors present in the input data. Between these two layers there are many hidden layers. The output of each layer is fed as input to the following hidden layer, and each unit does a relatively straightforward computation: it takes an input X, multiplies it by a weight W, performs a summation and passes the result through an activation function to yield the output. Perceptrons compute a score, or a single output, from a set of inputs that are usually real valued. This score is used in the backward pass, where the cost function is calculated by comparing the predicted output with the true label value and is expressed as a root mean square (RMS) error. The RMS error is minimized using gradient descent, which determines the optimum weight and bias values for the network. The MLP uses activation functions such as sigmoid or tanh to produce the output. One characteristic of the MLP is the fully connected architecture within its deep layers.

2.4 Recurrent neural network (RNN)

The problem with the MLP and CNN models is that they treat every input and output vector as independent. In other words, these models cannot capture the sequential information between words. In the phishing email detection task, it is highly useful to identify associated words for classification purposes. The RNN model is popular in time series and sequence data analysis. It can take variable-size inputs and return a variable-size output. The state of a recurrent network at time T is a function of its previous state and the input at time T. Since it stores the previous state of the system, an RNN can be said to have a 'memory' that captures sequential information between words. A recurrent neural net is a variation of feed-forward nets: the cyclic connections between the neurons let results from the previous time step contribute to the current state, in a way remembering the temporal information about the input data.
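
The state update described above can be sketched in a few lines. This is a minimal illustration, assuming a plain (Elman-style) RNN with a tanh activation; the function name `rnn_forward`, the weight names, the toy dimensions and the random inputs are all hypothetical, and this is not the Keras model used in this work.

```python
import numpy as np


def rnn_forward(inputs, W_x, W_h, b):
    """Run a plain tanh RNN over a sequence and return every hidden state."""
    hidden = np.zeros(W_h.shape[0])  # initial state before any word is seen
    states = []
    for x_t in inputs:  # one word vector per time step
        # The new state depends on the current input and the previous state.
        hidden = np.tanh(W_x @ x_t + W_h @ hidden + b)
        states.append(hidden)
    return np.stack(states)


rng = np.random.default_rng(0)
embed_dim, hidden_dim = 4, 3  # hypothetical toy sizes
W_x = rng.normal(scale=0.5, size=(hidden_dim, embed_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

# Two stand-in "emails" of different lengths, as sequences of word vectors.
short_email = rng.normal(size=(2, embed_dim))
long_email = rng.normal(size=(5, embed_dim))

short_states = rnn_forward(short_email, W_x, W_h, b)
long_states = rnn_forward(long_email, W_x, W_h, b)
# Same words in reverse order, to show that the final state is order-sensitive.
reversed_states = rnn_forward(long_email[::-1], W_x, W_h, b)
```

Sequences of different lengths produce one state per word with the same state size, and feeding the same words in reverse order changes the final state, which is the sense in which the state carries sequential information.
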
This memory of earlier inputs lets an RNN learn well on data with long-term dependencies, as in natural language processing and speech processing applications.

3 Dataset Description

The dataset [EDMB+18] used is provided by the shared task [EDB+18] at the 4th ACM International Workshop on Security and Privacy Analytics. The task was to detect phishing emails. Details of the dataset are shown in Tables 2 and 3.

Table 2: Training dataset details

Category         Legitimate   Phishing   Total
With No header   5088         612        5700
With header      4082         501        4583

Table 3: Testing dataset details

Category         Data Samples
With No header   4300
With header      4195

4 Experiments and Results

The proposed tool is christened A.R.E.S, which stands for Automatic Rogue Email Spotter. A detailed visualization of the model is shown in Fig. 1. The architecture is a combination of word embeddings with a CNN, an RNN and an MLP. The task is divided into two sub tasks: emails with 'no header' and emails 'with header'. We did not extract any other features from the header, and the methodology used to convert raw email samples to feature vectors is the same for both sub tasks. In both sub tasks, the raw email corpus is fed to the embedding layer, which uses the Word2vec model to generate distributed word embeddings. The learned word embedding model is used to represent the input data, which is then fed to the deep learning models. The hyperparameters used to create the Word2vec model are detailed in Table 1.

Figure 1: Proposed Architecture

The deep learning models learn additional features, which are passed to the fully connected layer. Previous work on similar problems suggests using an RNN for such tasks, but in order to better analyse the performance of different models we also incorporated a CNN and an MLP in this work. Finally, given the binary nature of the task, we used a sigmoid output with a threshold to separate legitimate emails from phishing emails, and binary cross entropy as the loss function.

From the statistics shown in Tables 4 and 5, the word embedding model along with an MLP network gives a commendable score for both sub tasks. Further, when the same word embedding model is passed through the CNN and the RNN, the scores improve overall on the previous MLP model. Specifically, over the validation set, the CNN gave the highest score for sub task 1, whereas the RNN gave the best score for sub task 2. An MLP with 6 hidden layers of size 300 is used primarily for building the base model. The activation function is ReLU and the dropout is 0.01. The model is implemented in Keras, keeping the best validation score among 500 epochs. The model structure is then extended into the CNN and RNN models. The CNN is implemented with 256 filters, and maxpooling is used for dimensionality reduction between the dense layers. All experiments were performed on GPU-enabled TensorFlow [ABC+16] in conjunction with the Keras framework [C+15]. All models are trained using backpropagation.

Table 4: Statistics of training results

Method                 Task         10-fold cross validation accuracy
Word embedding + MLP   Sub task 1   0.921
Word embedding + CNN   Sub task 1   0.952
Word embedding + RNN   Sub task 1   0.951
Word embedding + MLP   Sub task 2   0.901
Word embedding + CNN   Sub task 2   0.912
Word embedding + RNN   Sub task 2   0.931

Table 5: Statistics of test results

Method                 Task         TP     TN    FP    FN
Word embedding + CNN   Sub task 1   3479   237   238   346
Word embedding + RNN   Sub task 1   3446   224   251   379
Word embedding + RNN   Sub task 2   3193   363   133   506

5 Conclusion

Phishing emails have always plagued even the average user, and classifying them properly is a challenging task. Where former machine learning techniques failed, deep learning models have provided state-of-the-art performance. The CNN/RNN/MLP architecture along with the Word2vec embeddings used in this work has outperformed former rule based and machine learning based models. During training the model gave high accuracy, while the test accuracy was comparatively low due to the highly unbalanced nature of the dataset. In the proposed system, no external data was provided to train the model. The CNN performed slightly better than the RNN on sub task 1, and the RNN performed well on sub task 2, on the test data. For sub task 1, the CNN managed a score of 95.2%, almost comparable to the RNN, and for sub task 2, the RNN managed a score of 93.1% (10-fold cross validation accuracy, Table 4), making the RNN a better and more versatile overall performer. More accuracy can be achieved with these trained models by extrapolating the training corpus and by adding deeper layers to infuse more feature learning capabilities into the model. This work also demonstrates the possibilities of amalgamating techniques from text analytics and deep learning for cybersecurity use cases.

Acknowledgements

This research was supported in part by Paramount Computer Systems. We are grateful to NVIDIA India for the GPU hardware support to the research grant. We are grateful to the Computational Engineering and Networking (CEN) department for encouraging the research.

References

[AAY11] Tiago A Almeida, Jurandy Almeida, and Akebo Yamakami. Spam filtering: how the dimensionality reduction affects the accuracy of naive bayes classifiers. Journal of Internet Services and Applications, 1(3):183–200, 2011.

[ABC+16] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

[BB08] Enrico Blanzieri and Anton Bryl. A survey of learning-based techniques of email spam filtering. Artificial Intelligence Review, 29(1):63–92, 2008.

[BG17] Barathi Ganesh H.B., Reshma U., Anand Kumar M., and Soman K.P. Representation of target classes for text classification-amrita-cen-nlp@rusprofiling pan 2017. In CEUR Workshop Proceedings, pages 25–27, 2017.

[BG18] Barathi Ganesh H.B., Anand Kumar M., and Soman K.P. From vector space models to vector space models of semantics. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 10478 LNCS, pages 50–60, 2018.

[C+15] François Chollet et al. Keras, 2015.

[CL12] Godwin Caruana and Maozhen Li. A survey of emerging approaches to spam filtering. ACM Computing Surveys (CSUR), 44(2):9, 2012.

[CM01] Xavier Carreras and Lluis Marquez. Boosting trees for anti-spam email filtering. arXiv preprint cs/0109015, 2001.

[EDB+18] Ayman Elaassal, Avisha Das, Shahryar Baki, Luis De Moraes, and Rakesh Verma. IWSPA-AP: Anti-phishing shared task at ACM international workshop on security and privacy analytics. In Proceedings of the 1st IWSPA Anti-Phishing Shared Task. CEUR, 2018.

[EDMB+18] Ayman Elaassal, Luis De Moraes, Shahryar Baki, Rakesh Verma, and Avisha Das. IWSPA-AP shared task email dataset, 2018.

[FRID+07] Florentino Fdez-Riverola, Eva Lorenzo Iglesias, Fernando Díaz, José Ramon Méndez, and Juan M Corchado. Applying lazy learning algorithms to tackle concept drift in spam filtering. Expert Systems with Applications, 33(1):36–48, 2007.

[GGWM+10] Pedro H Calais Guerra, Dorgival Guedes, J Wagner Meira, Cristine Hoepers, MHPC Chaves, and Klaus Steding-Jessen. Exploring the spam arms race to characterize spam evolution. In Proceedings of the 7th Collaboration, Electronic messaging, Anti-Abuse and Spam Conference (CEAS), Redmond, WA, 2010.

[HDC+18] Reza Hassanpour, Erdogan Dogdu, Roya Choupani, Onur Goker, and Nazli Nazli. Phishing e-mail detection by using deep learning algorithms. In Proceedings of the ACMSE 2018 Conference, page 45. ACM, 2018.

[HMR+86] Geoffrey E Hinton, James L McClelland, David E Rumelhart, et al. Distributed representations. Parallel distributed processing: Explorations in the microstructure of cognition, 1(3):77–109, 1986.

[KRA+07] Ponnurangam Kumaraguru, Yong Rhee, Alessandro Acquisti, Lorrie Faith Cranor, Jason Hong, and Elizabeth Nunge. Protecting people from phishing: the design and evaluation of an embedded training email system. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 905–914. ACM, 2007.

[LF17] Ruidan Li and Errin W Fulp. Evolutionary approaches for resilient surveillance management. In 2017 IEEE Security and Privacy Workshops (SPW), pages 23–28. IEEE, 2017.

[LT04] Chih-Chin Lai and Ming-Chi Tsai. An empirical performance comparison of machine learning methods for spam e-mail categorization. In Hybrid Intelligent Systems, 2004. HIS'04. Fourth International Conference on, pages 44–48. IEEE, 2004.

[MBA18] Youness Mourtaji, Mohammed Bouhorma, and Daniyal Alghazzawi. New phishing hybrid detection framework. Journal of Theoretical & Applied Information Technology, 96(6), 2018.

[MG18] Ankur Mishra and BB Gupta. Intelligent phishing detection system using similarity matching algorithms. International Journal of Information and Communication Technology, 12(1-2):51–73, 2018.

[MSC+13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

[PHS18] Tianrui Peng, Ian Harris, and Yuki Sawa. Detecting phishing attacks using natural language processing and machine learning. In Semantic Computing (ICSC), 2018 IEEE 12th International Conference on, pages 300–301. IEEE, 2018.

[S+09] Jyh-Jian Sheu et al. An efficient two-phase spam filtering method based on e-mails categorization. IJ Network Security, 9(1):34–43, 2009.

[SAZ18] Sami Smadi, Nauman Aslam, and Li Zhang. Detection of online phishing email using dynamic evolving neural network based on reinforcement learning. Decision Support Systems, 2018.

[SDHH98] Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz. A bayesian approach to filtering junk e-mail. In Learning for Text Categorization: Papers from the 1998 workshop, volume 62, pages 98–105, 1998.

[SKP18] Vysakh S Mohan Soman Kp, Vinayakumar R and Prabaharan Poornachandran. S.p.o.o.f net: Syntactic patterns for identification of ominous online factors. In 2018 IEEE Security and Privacy Workshops (SPW). IEEE, [In-Press], 2018.

[SLH17] Doyen Sahoo, Chenghao Liu, and Steven CH Hoi. Malicious URL detection using machine learning: A survey. arXiv preprint arXiv:1701.07179, 2017.

[Tre04] Konstantin Tretyakov. Machine learning techniques in spam filtering. In Data Mining Problem-oriented Seminar, MTAT, volume 3, pages 60–79, 2004.

[VH13] Rakesh Verma and Nabil Hossain. Semantic feature selection for text with application to phishing email detection. In International Conference on Information Security and Cryptology, pages 455–468. Springer, 2013.

[VSH12] Rakesh Verma, Narasimha Shashidhar, and Nabil Hossain. Detecting phishing emails the natural language way. In European Symposium on Research in Computer Security, pages 824–841. Springer, 2012.

[VSP17] R Vinayakumar, KP Soman, and Prabaharan Poornachandran. Deep encrypted text categorization. In Advances in Computing, Communications and Informatics (ICACCI), 2017 International Conference on, pages 364–370. IEEE, 2017.

[VSP18a] R Vinayakumar, KP Soman, and Prabaharan Poornachandran. Detecting malicious domain names using deep learning approaches at scale. Journal of Intelligent & Fuzzy Systems, 34(3):1355–1367, 2018.

[VSP18b] R Vinayakumar, KP Soman, and Prabaharan Poornachandran. Evaluating deep learning approaches to characterize and classify malicious URLs. Journal of Intelligent & Fuzzy Systems, 34(3):1333–1343, 2018.

[WC07] Chih-Chien Wang and Sheng-Yi Chen. Using header session messages to anti-spamming. Computers & Security, 26(5):381–390, 2007.

[WC11] John S Whissell and Charles LA Clarke. Clustering for semi-supervised spam filtering. In Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference, pages 125–134. ACM, 2011.

[ZZY04] Le Zhang, Jingbo Zhu, and Tianshun Yao. An evaluation of statistical spam filtering techniques. ACM Transactions on Asian Language Information Processing (TALIP), 3(4):243–269, 2004.