=Paper=
{{Paper
|id=Vol-1353/paper_06
|storemode=property
|title=Massively Parallel kNN using CUDA on Spam-Classification
|pdfUrl=https://ceur-ws.org/Vol-1353/paper_06.pdf
|volume=Vol-1353
|dblpUrl=https://dblp.org/rec/conf/maics/SmithrudMA15
}}
==Massively Parallel kNN using CUDA on Spam-Classification==
Joshua M. Smithrud, Patrick McElroy, and Răzvan Andonie
Computer Science Department, Central Washington University, Ellensburg, WA, USA
(R. Andonie is also with the Electronics and Computers Department, Transilvania University, Braşov, Romania)
Email: jsmithrud37@gmail.com, mcelrp@gmail.com, andonie@cwu.edu

Abstract

Email Spam-classification is a fundamental, unseen element of everyday life. As email communication becomes more prolific, and email systems become more robust, it becomes increasingly necessary for Spam-classification systems to run accurately and efficiently while remaining all but invisible to the user. We propose a massively parallel implementation of Spam-classification using the k-Nearest Neighbors (kNN) algorithm on nVIDIA GPUs using CUDA. Although the kNN search is very simple and straightforward, its performance degrades dramatically for large data sets, since the task is computationally intensive. By utilizing the benefits of GPUs and CUDA, we seek to overcome that cost.

1 Introduction

A study estimated that over 70% of today's business emails are Spam (see (Blanzieri and Bryl 2008)). The purpose of email Spam is advertising, promotion, and spreading backdoors or malicious programs. The cost of sending Spam emails is much lower than the cost of automatically detecting and removing these emails. The processes of Spam-detection and filtration have become everyday occurrences within the average person's life. Though unbeknownst to most, each and every instance of digital communiqué goes through a series of automated filters to determine whether or not the user would have any interest in that particular item.

Because Spam-filtering systems are expected to be transparent to the user, there are two key aspects a good Spam-filter must provide: accuracy and efficiency. As the volume of email correspondence increases, so too does the amount of data required to properly classify emails into the categories "Spam" and "Ham" (non-Spam). The larger these training-sets become, the more time is consumed by the process. As a result, in order to prevent the process from becoming a burden to the user, Spam-filters must also be implemented more efficiently in order to compensate for the increased workload. No one wants to have to deal with Spam; the email client should handle it on its own. Our goal is to demonstrate the feasibility of an email client with highly accurate and highly efficient Spam-filtration capabilities, such that the standard email-user need not worry about the process.

A Spam-filter needs to sort incoming mail into wanted and unwanted, and it needs to do so accurately, which can be difficult. Machine Learning can help solve this problem: the email client can be trained to learn where to put each email. There are many Machine Learning techniques available to filter emails, including Bayesian methods, Neural Networks, Support Vector Machines, Deep Belief Networks, and kNN; see (Cormack 2008), (Tretyakov 2004), (Panigrahi 2012), (Blanzieri and Bryl 2008), (Sallab and Rashwan 2012). Indeed, the most popular and well-developed approaches to anti-Spam are Machine Learning-based.

Firte et al. (Firte, Lemnaru, and Potolea 2010) proposed a kNN-based Spam-detection filter. In this scheme, messages are classified with the algorithm based on a set of features extracted from the properties and content of the emails. The training set is resampled to the most appropriate size and positive class distribution, determined by several experiments. The system performs a constant update of the data set and of the list of most frequent words that appear in Spam messages. The main reason for selecting the kNN algorithm was that it does not require a training step for each query. However, the method is offline, and the authors focus on accuracy rather than execution time. In fact, the kNN method can run quite slowly, especially for larger values of k and when the input dataset is very large.

But how can we classify incoming emails in real time (online)? Do we have to start the whole process from scratch each time? This is practically impossible when we have to deal with very large sets of emails. One solution is to use incremental and parallel processing. Good scalability can be achieved by efficiently using parallel computer architectures such as GPUs.

Presently, GPUs are widely available and relatively inexpensive. In this context, the high parallelism inherent to the GPU makes this device especially well-suited to Machine Learning problems with prohibitively computationally intensive tasks. An increasing number of machine learning algorithms are implemented on GPUs (Lopes and Ribeiro 2011). At this moment, one of the universally accepted programming languages for GPUs is CUDA (Compute Unified Device Architecture), created by nVIDIA. Since its introduction in 2006, CUDA has been widely deployed through thousands of applications and published research papers, and is supported by an installed base of over 500 million CUDA-enabled GPUs in notebooks, workstations, compute clusters, and supercomputers (https://developer.nvidia.com/).

Our goal was to design an email client with a highly-accurate, highly-efficient Spam-filtration capability. We propose a massively parallel implementation of Spam-classification using the kNN algorithm on nVIDIA GPUs using CUDA.
The system we created can incrementally add new email samples to improve its accuracy. The proposed approach is not only fast but also scalable to large-scale instances. It also has the potential to scale for future devices with an increasing number of compute units.

The rest of the paper is structured as follows. Section 2 describes how we extract features from emails. In Section 3, we introduce our parallel implementation of the kNN algorithm. Section 4 presents the architecture of the Spam filter, both from the designer and user perspectives. In Section 5 we describe experiments, discussing execution time and accuracy. Section 6 concludes with final remarks.

2 Feature Extraction from Email

An email message does not fall neatly into the domain of text classification. Email messages are much more than simple text, and must be handled accordingly. In addition to raw text, emails contain various elements of formatting (HTML tags being a notable example), meta-data, etc. In order to properly classify and filter emails, all of these elements must be taken into account.

In order for an incoming email to be classified as either Spam or Ham, it is first processed into quantifiable attributes. The attributes used are based on previous work in Spam classification (Firte, Lemnaru, and Potolea 2010). Using MailSystem.NET's open source libraries, we extract the different components of the email message, including:

• The number of recipients in the "To", "CC", and "BCC" fields
• The validity of the addresses in the "To" and "From" fields
• The validity of the "Message-ID" field
• The length of the "Subject" field
• The overuse of capitalization
• The length of the message body
• The number of attachments
• The number of phony/suspicious HTML tags present in the email text
• The number of occurrences of the 50 most common Spam keywords in the message

These account for a total of 63 attributes. After being calculated, these attributes are normalized to weighted values between 0 and 1 based on the min and max values present for each attribute, such that no individual attribute contributes more to classification than another (this could be adjusted to fit specific environments if certain attributes are identified to be more useful than others).

The 50 keywords used for matching are determined based on the 50 most common terms found in emails classified as Spam within the user's account. In order to maintain an accurate list, the training set must be rebased with some regularity. Depending on the environment in which the application is being used, this can be done on an absolute schedule, or as specified by the user.

The rebasing process accomplishes two things. First, the list of 50 keywords is re-evaluated. Using Lucene.NET's open source text analytics libraries (http://lucenenet.apache.org/), we determine the 50 most common keywords found within the user's Spam folder. Once the new 50 keywords are found, the keyword counts for preexisting elements in the training set are recalculated. Second, the training set values are normalized again (as new emails are added to the training set, it is possible for attribute values to exceed 1, so regular rebasing becomes important to maintain consistency).

Basing the 50 keywords on the emails within the user's account allows for better personal accuracy. One man's Ham is another man's Spam (and vice-versa), so personalizing the training set is a valuable tool. There is a notable downside, however. Without a general consensus defining the difference between Ham and Spam, this system requires that the initial training set be adequately comprehensive, while being small enough that the user's preferences take precedence quickly. This makes the structure of the initial training-set used by new accounts critically important.

3 kNN Implemented in CUDA

Once an incoming email has been processed into its quantifiable attributes, it is used as the query point in our kNN classifier.

The kNN algorithm has been implemented on almost all parallel architectures. A CUDA implementation of the kNN algorithm is described in (Arefin et al. 2012). Under specific conditions, speed-ups of 50 to 60 times compared with a CPU implementation were achieved, but this largely depends on the dataset and the GPU characteristics. There are also parallel kNN implementations which use the MapReduce paradigm (Yokoyama, Ishikawa, and Suzuki 2012). We use here our own kNN CUDA implementation, specifically designed for the Spam-classification problem.

For our work, we used sample emails from the SpamAssassin dataset (http://csmining.org/index.php/spam-assassin-datasets.html) and stored them, after being processed, in an SQLite database (http://www.sqlite.org/).

Our implementation is divided into three phases: a distance calculation phase, a sorting phase, and a voting phase. Though the second phase is the primary focus of this work, we will examine each phase in detail.

The first phase determines the distance between each point in the training set and the query point. Since this step is embarrassingly parallel, the process is simple and efficient. The query point and training set are transferred to the GPU, where the distance calculation kernel allocates a single thread for each data point in the training set. Each thread then simply calculates the Euclidean distance between its assigned point and the query point.
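The distance phase just described can be modeled on the CPU. The following is an illustrative Python/NumPy sketch, not the authors' CUDA kernel: each row of the training matrix corresponds to the work one CUDA thread would do, and NumPy's vectorization stands in for the GPU's data parallelism. All names here are ours.

```python
import numpy as np

def distances_to_query(train: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Euclidean distance from every training point to the query point.

    Each row of `train` is one 63-attribute email vector; in the CUDA
    kernel, one thread would compute one row's distance.
    """
    diff = train - query                 # broadcast the query across all rows
    return np.sqrt((diff * diff).sum(axis=1))

# Tiny worked example: three 2-attribute "emails" and a query at the origin.
train = np.array([[0.0, 0.0],
                  [3.0, 4.0],
                  [1.0, 1.0]])
query = np.array([0.0, 0.0])
dists = distances_to_query(train, query)  # distances 0, 5, and sqrt(2)
```

Because every distance is independent of the others, this step parallelizes trivially, which is why the paper treats it only briefly.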
The results of each calculation are stored in a struct within the GPU's memory containing just the distance from the query point and the item's own classification. Since the attribute data itself is no longer of any use, we free it, so that the next phase of the algorithm can work on smaller data points. Additionally, the k data points eventually returned to the CPU are smaller, reducing the overhead produced by transmitting them over the PCIe bus.

The next phase is the sorting phase. To improve the efficiency of the sorting process, we implemented a parallel reduction to determine the k smallest distances. Initially, the array of distances stored in the GPU's memory is used as input for the reduction. The array is divided among the blocks, and the threads in each block perform a parallel reduction to find the k smallest distances within that block. Since we do not know where the smallest values are located among the blocks, and because threads cannot communicate between blocks, each block must be reduced to its own minimal k. Once a block has determined its k elements of minimal distance, the results are stored in a secondary array on the GPU, of size equal to k times the number of blocks used. After all blocks have stored their results in the secondary array, the reduction is run again with the output of the previous reduction as the new input, juggling the reduced quantity of data points between the two previously allocated storage arrays (Fig. 1). This process is repeated until only k elements remain, at which point those elements are returned to the CPU.

Figure 1: The input is divided among the executed blocks, with each block executing a reduction and returning the k smallest distances. Once all outputs have been collected, they are used as new inputs and the process repeats.

The final phase is the voting phase. Since the number of data points used in the voting phase is k, which is generally set to a small value in order to maintain accuracy and avoid over-fitting, the computation for this step is negligible. For this reason, we execute this step on the CPU. The weighted distance for each of the k data points is added to its classification's total. Once each of the k data points has contributed its vote, the classification with the majority would generally be determined to be the correct classification. However, since our implementation is designed specifically for use in Spam-classification systems, a simple majority vote is not ideal. Generally speaking, it is preferable to have Spam emails incorrectly classified as Ham rather than the opposite (we prefer false-negatives over false-positives). As a result, our implementation utilizes a threshold value (θ) that the vote must exceed in order for an email to be classified as Spam.

4 System Architecture

To demonstrate the functionality of the email classification scheme we developed, we created a lightweight email client, which we dubbed "SpamSystem.NET". (For this implementation stage, we gratefully acknowledge the contribution of our student colleague Leonard Patterson.) The application provides both full email functionality with Gmail accounts and our own customizable kNN email classification scheme for Spam-filtration.

To achieve email client functionality, we relied on some aspects of the MailSystem.NET open source libraries (http://mailsystem.codeplex.com/). MailSystem.NET provided functionality that allowed us to easily open connections to Gmail's servers. For our purposes, we modified the Gmail account settings so that Gmail automatically sent all messages to the inbox, even those messages identified as Spam by Google's filters. This allowed us to test our own implementation without outside interference. It is worth noting that this is not necessary, and both filters could be used in tandem; we did this strictly for the purpose of testing. MailSystem.NET's libraries allowed us to send and receive email messages, create our own functions to manage the user's Gmail account, and maintain bidirectional synchronization with Google's servers. It also allowed the system to detect changes to the user's account which happen outside of our application. The main user interface can be seen in Fig. 3, with the user's different email folders and options for standard email client actions. MailSystem.NET also provided functionality for creating email objects, which allowed us to more easily parse emails for the classification metrics we required.

Additionally, our application provides easy options for managing the kNN Spam-filter. Fig. 4 shows the options screen of our SpamSystem.NET application. In addition to managing the user's email account, there are options for maintaining a Black-List and a White-List, which specify keywords, phrases, domain names, and email addresses that, should they appear in an email, indicate that the email should automatically go to the Spam folder (for Black-List items) or the inbox (for White-List items). There are also necessary features for maintaining the kNN Spam-filter itself: options for determining the time interval between training-set rebasings, as well as options for determining which GPU(s) should be used to run the kNN classification, should more than one GPU be present in the system.

To summarize how the system operates: first, an unclassified incoming email is received from the Gmail server in the form of an IMAP message. It is then parsed by our application into quantifiable attributes. Next, it is sent to the kNN classifier, which determines whether it is Ham or Spam. Finally, the email is moved to the appropriate email folder in our application, and the application notifies the Gmail server to do the same. A detailed diagram representing the components and how they interact can be seen in Fig. 2.

Our system runs locally on the user's machine. Our goal in developing this local implementation was to demonstrate its efficiency and scalability.

Figure 2: The overall system architecture for SpamSystem.NET.

Figure 3: Main screen of SpamSystem.NET's user interface.
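The sorting and voting phases described in Section 3 can be sketched as a sequential CPU model. This is an illustrative Python sketch under our own assumptions, not the authors' CUDA code: each loop iteration over a "block" stands in for one CUDA block's reduction, and the inverse-distance vote weighting is our reading of "weighted distance" (the paper does not specify the weighting function). All function names are ours.

```python
import numpy as np

def k_smallest_blockwise(dist, k, block_size=8):
    """Indices of the k smallest distances, found by repeated per-block
    reduction: each block keeps its own k smallest (as one CUDA block
    would), survivors are pooled, and the pass repeats until only k
    candidates remain."""
    assert block_size > k, "each pass must shrink the candidate set"
    idx = np.arange(len(dist))
    while len(idx) > k:
        survivors = []
        for start in range(0, len(idx), block_size):
            block = idx[start:start + block_size]
            # one "block" keeps its own k smallest candidates
            survivors.extend(block[np.argsort(dist[block])[:k]])
        idx = np.array(survivors)
    return idx

def classify(dist, labels, k, theta, block_size=8):
    """Voting phase with a Spam threshold theta: each of the k nearest
    neighbours casts a distance-weighted vote for its own class, and the
    email is Spam only if Spam's share of the total vote exceeds theta.
    (Inverse-distance weighting is our assumption.)"""
    nearest = k_smallest_blockwise(dist, k, block_size)
    w = 1.0 / (dist[nearest] + 1e-9)      # closer neighbours weigh more
    spam_share = w[labels[nearest] == 1].sum() / w.sum()
    return "Spam" if spam_share > theta else "Ham"
```

Keeping the k smallest per block is safe because every one of the globally k smallest distances is necessarily among the k smallest of its own block, so no pass can discard a true nearest neighbour.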
Figure 4: Options screen of SpamSystem.NET's user interface.

5 Experiments

A significant amount of previous research has shown that the kNN algorithm can be highly accurate (see, for example, (Firte, Lemnaru, and Potolea 2010)). The algorithm's real deficit lies in the cost of its computation. As a result, the primary focus of our research and the resulting benchmarks is demonstrating the efficiency gained by implementing kNN in CUDA. That being said, one cannot discuss a classification algorithm without discussing accuracy. This section details both time and accuracy benchmarks.

Our implementation was designed and tested on a GeForce GTX 260 (PCIe 2.0) in a Windows 7 PC with an Intel i7 950 CPU. All of our tests were run on this configuration.

Efficiency Benchmarks

All efficiency benchmarks are the result of 10 trials on a dummy dataset. Fig. 5 shows the efficiency in terms of time as the size of the training set (N) increases. Here we compare a naïve, sequential implementation of kNN against our implementation on the GTX 260. For these tests we used a k value of 10. Fig. 6 shows the speedup gained by the CUDA implementation on the GTX 260 over the sequential implementation as N increases.

Figure 5: Performance of the CUDA implementation on the GTX GPU versus the sequential implementation on the CPU.

Figure 6: Speedup of the CUDA implementation on the GTX GPU versus the sequential implementation on the CPU.

Accuracy Benchmarks

All accuracy benchmarks are the result of a 10-fold cross-validation (10:90 test-set to training-set ratio) on the SpamAssassin dataset. Fig. 7 shows the accuracy (in terms of false-negatives and false-positives) for varying threshold values (θ). For this experiment, a k-value of 10 was used.
The results show that while higher θ-values yield fewer false-positive classifications, they also yield more false-negative classifications (as one would expect). Depending on the environment in which the application is used, a proper θ could be determined via training to fit the desired false-positive to false-negative ratio, but a single "best" value is not necessarily apparent.

Figure 7: Accuracy of classification for varying θ values.

Fig. 8 shows the accuracy (in terms of false-negatives and false-positives) for varying values of k. For this experiment, a θ-value of 0.7 was used. Based on our results, we determined that k = 3 was optimal for our purposes, based on its false-positive to false-negative ratio. Additionally, since smaller k-values tend to yield faster computational performance, this choice was obvious. That being said, an optimal k value could also be determined via training for any environment using this application, based on the specific needs of that environment.
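The tuning-via-training idea mentioned above can be sketched in a few lines. This is a hypothetical illustration, not part of the paper's system: given the per-email Spam vote shares produced by the voting phase and their true labels on a validation set, scan candidate θ values and keep the smallest one whose false-positive rate meets a target. All names and the selection rule are ours.

```python
def pick_theta(spam_shares, labels, candidates, max_fp_rate):
    """Return the smallest candidate theta whose false-positive rate
    (Ham marked as Spam) stays within max_fp_rate, or None if no
    candidate qualifies. `spam_shares` are per-email Spam vote
    fractions; `labels` are the true classes (1 = Spam, 0 = Ham)."""
    ham_total = sum(1 for y in labels if y == 0)
    for theta in sorted(candidates):
        # count Ham emails that this threshold would send to Spam
        fp = sum(1 for s, y in zip(spam_shares, labels)
                 if s > theta and y == 0)
        if ham_total and fp / ham_total <= max_fp_rate:
            # smallest acceptable theta also keeps false negatives low
            return theta
    return None
```

Preferring the smallest acceptable θ reflects the paper's stated bias: push the false-positive rate under a hard cap first, then concede as little as possible on false negatives.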
Figure 8: Accuracy of classification for varying k values.

6 Conclusions

Although many email Spam-filtering tools exist in the world, due to the persistence of spammers and their adoption of new techniques, email Spam-filtering remains a challenging problem for researchers. Automatic email filtering seems to be the most effective method for countering Spam at the moment, and a tight competition between spammers and spam-filtering methods is going on: as anti-Spam methods become more refined, so too do those of the spammers (Tretyakov 2004).

In this work, we have demonstrated an efficient implementation of the kNN algorithm utilizing CUDA-enabled GPUs. Under our scheme, an incoming email is parsed into its quantifiable attributes and is then sent to the classification system. Here, the distances between the newly-processed email and each email in the training set are quickly calculated on the GPU. Afterwards, a parallel reduction process is used to identify the k training-set emails with the smallest distances from the query email. These k emails cast weighted votes for their own classification. If the Spam vote exceeds the required threshold (θ), then the query email is classified as Spam. Otherwise, it is classified as Ham. Our experimental results show that our implementation is highly efficient and scalable. This efficiency, coupled with kNN's previously demonstrated accuracy for classification, makes it potentially beneficial for large-scale Spam-filtration, and it could easily be adapted to tackle other similar problems.

References

Arefin, A. S.; Riveros, C.; Berretta, R.; and Moscato, P. 2012. GPU-FS-kNN: A software tool for fast and scalable kNN computation using GPUs. PLoS ONE 7(8):e44000.

Blanzieri, E., and Bryl, A. 2008. A survey of learning-based techniques of email spam filtering. Artif. Intell. Rev. 29(1):63–92.

Cormack, G. V. 2008. Email spam filtering: A systematic review. Found. Trends Inf. Retr. 1(4):335–455.

Firte, L.; Lemnaru, C.; and Potolea, R. 2010. Spam detection filter using kNN algorithm and resampling. In Intelligent Computer Communication and Processing (ICCP), 2010 IEEE International Conference on, 27–33.

Lopes, L., and Ribeiro, B. 2011. GPUMLib: An efficient open-source GPU machine learning library. International Journal of Computer Information Systems and Industrial Management Applications 2150–7988.

Panigrahi, P. 2012. A comparative study of supervised machine learning techniques for spam e-mail filtering. In Computational Intelligence and Communication Networks (CICN), 2012 Fourth International Conference on, 506–512.

Sallab, A. A. A., and Rashwan, M. A. 2012. E-mail classification using deep networks. Journal of Theoretical and Applied Information Technology 37:241–251.

Tretyakov, K. 2004. Machine learning techniques in spam filtering. Technical report, Institute of Computer Science, University of Tartu.

Yokoyama, T.; Ishikawa, Y.; and Suzuki, Y. 2012. Processing all k-nearest neighbor queries in Hadoop. In Gao, H.; Lim, L.; Wang, W.; Li, C.; and Chen, L., eds., Web-Age Information Management, volume 7418 of Lecture Notes in Computer Science. Springer Berlin Heidelberg. 346–351.