=Paper= {{Paper |id=Vol-1353/paper_06 |storemode=property |title=Massively Parallel kNN using CUDA on Spam-Classification |pdfUrl=https://ceur-ws.org/Vol-1353/paper_06.pdf |volume=Vol-1353 |dblpUrl=https://dblp.org/rec/conf/maics/SmithrudMA15 }} ==Massively Parallel kNN using CUDA on Spam-Classification== https://ceur-ws.org/Vol-1353/paper_06.pdf
                  Massively Parallel kNN using CUDA on Spam-Classification
         Joshua M. Smithrud                                Patrick McElroy                       Răzvan Andonie
      Computer Science Department                    Computer Science Department            Computer Science Department
      Central Washington University                  Central Washington University          Central Washington University
          Ellensburg, WA, USA                            Ellensburg, WA, USA                    Ellensburg, WA, USA
      Email: jsmithrud37@gmail.com                    Email: mcelrp@gmail.com                             and
                                                                                        Electronics and Computers Department
                                                                                                Transilvania University
                                                                                                   Braşov, Romania
                                                                                                  andonie@cwu.edu
                         Abstract

Email Spam-classification is a fundamental, unseen element of everyday life. As email communication becomes more prolific, and email systems become more robust, it becomes increasingly necessary for Spam-classification systems to run accurately and efficiently while remaining all but invisible to the user. We propose a massively parallel implementation of Spam-classification using the k-Nearest Neighbors (kNN) algorithm on nVIDIA GPUs using CUDA. Though very simple and straightforward, the kNN search degrades dramatically in performance for large data sets, since the task is computationally intensive. By utilizing the benefits of GPUs and CUDA, we seek to overcome that cost.

                      1    Introduction

A study estimated that over 70% of today's business emails are Spam (see (Blanzieri and Bryl 2008)). The purpose of email Spam is advertising, promotion, and spreading backdoors or malicious programs. The cost of sending Spam emails is much lower than the cost of automatically detecting and removing them. The processes of Spam-detection and filtration have become everyday occurrences in the average person's life. Though unbeknownst to most, each and every instance of digital communiqué goes through a series of automated filters to determine whether or not the user would have any interest in that particular item.

Because Spam-filtering systems are expected to be transparent to the user, there are two key aspects a good Spam-filter must provide: accuracy and efficiency. As the volume of email correspondence increases, so too does the amount of data required to properly classify emails into the categories "Spam" and "Ham" (non-Spam). The larger these training sets become, the more time the process consumes. As a result, to prevent the process from becoming a burden to the user, Spam-filters must also be implemented more efficiently to compensate for the increased workload. No one wants to deal with Spam; the email client should handle it on its own. Our goal is to demonstrate the feasibility of an email client with highly accurate and highly efficient Spam-filtration capabilities, such that the standard email user need not worry about the process.

A Spam-filter needs to sort incoming mail into wanted and unwanted, and it needs to do so accurately, which can be difficult. Machine Learning can help solve this problem: the email client can be trained to learn where to put each email. Many Machine Learning techniques are available for filtering email, including Bayesian methods, Neural Networks, Support Vector Machines, Deep Belief Networks, and kNN; see (Cormack 2008), (Tretyakov 2004), (Panigrahi 2012), (Blanzieri and Bryl 2008), (Sallab and Rashwan 2012). Indeed, the most popular and well-developed approaches to anti-Spam are Machine Learning-based.

Firte et al. (Firte, Lemnaru, and Potolea 2010) proposed a kNN-based Spam-detection filter. In this scheme, messages are classified with the algorithm based on a set of features extracted from the properties and content of the emails. The training set is resampled to the most appropriate size and positive class distribution, determined by several experiments. The system performs a constant update of the data set and of the list of most frequent words that appear in Spam messages. The main reason for selecting the kNN algorithm was that it does not require a training step for each query. However, the method is offline, and the authors focus on accuracy rather than execution time. In fact, the kNN method can run quite slowly, especially for larger values of k and when the input dataset is very large.

But how can we classify incoming emails in real time (online)? Do we have to start the whole process from scratch each time? This is practically impossible when we have to deal with very large sets of emails. One solution is incremental and parallel processing. Good scalability can be achieved by efficiently using parallel computer architectures such as GPUs. Presently, GPUs are widely available and relatively inexpensive. In this context, the high parallelism inherent to the GPU makes this device especially well-suited to Machine Learning problems with prohibitively computationally intensive tasks. An increasing number of machine learning algorithms are implemented on GPUs (Lopes and Ribeiro 2011). At this moment, one of the universally accepted programming languages for GPUs is CUDA (Compute Unified Device Architecture), created by nVIDIA. Since its introduction in 2006, CUDA has been widely deployed through thousands of applications and published research papers, and is supported by an installed base of over 500 million CUDA-enabled GPUs in notebooks, workstations, compute clusters and supercomputers¹.

Our goal was to design an email client with a highly accurate, highly efficient Spam-filtration capability. We propose a massively parallel implementation of Spam-classification using the kNN algorithm on nVIDIA GPUs using CUDA. The system we created can incrementally add new email samples to improve its accuracy. The proposed approach is not only fast but also scalable to large-scale instances. It also has the potential to scale for future devices with increasing numbers of compute units.

The rest of the paper is structured as follows. Section 2 describes how we extract features from emails. In Section 3, we introduce our parallel implementation of the kNN algorithm. Section 4 presents the architecture of the Spam filter, both from the designer and user perspectives. In Section 5 we describe experiments, discussing execution time and accuracy. Section 6 concludes with final remarks.

            2    Feature Extraction from Email

An email message does not fall neatly into the domain of text classification. Email messages are much more than simple text, and must be handled accordingly. In addition to raw text, emails contain various elements of formatting (HTML tags being a notable example), meta-data, etc. In order to properly classify and filter emails, all of these elements must be taken into account.

In order for an incoming email to be classified as either Spam or Ham, it is first processed into quantifiable attributes. The attributes used are based on previous work in Spam classification (Firte, Lemnaru, and Potolea 2010). Using MailSystem.NET's open source libraries, we extract the different components of the email message, including:

• The number of recipients in the "To", "CC", and "BCC" fields
• The validity of the addresses in the "To" and "From" fields
• The validity of the "Message-ID" field
• The length of the "Subject" field
• The overuse of capitalization
• The length of the message body
• The number of attachments
• The number of phony/suspicious HTML tags present in the email text
• The number of occurrences of the 50 most common Spam keywords in the message

These account for a total of 63 attributes. After being calculated, these attributes are normalized to weighted values between 0 and 1 based on the min and max values present for that attribute, such that no individual attribute contributes more to classification than another (this could be adjusted to fit specific environments if certain attributes are identified to be more useful than others).

The 50 keywords used for matching are determined based on the 50 most common terms found in emails classified as Spam within the user's account. In order to maintain an accurate list, the training set must be rebased with some regularity. Depending on the environment in which the application is being used, this can be done on an absolute schedule, or as specified by the user.

The rebasing process accomplishes two things. First, the list of 50 keywords is re-evaluated. Using Lucene.NET's open source text analytics libraries², we determine the 50 most common keywords found within the user's Spam folder. Once the new 50 keywords are found, the keyword counts for preexisting elements in the training set are recalculated. Second, the training set values are normalized again (as new emails are added to the training set, it is possible for attribute values to exceed 1, so regular rebasing becomes important to maintain consistency).

Basing the 50 keywords on the emails within the user's account allows for better personal accuracy. One man's Ham is another man's Spam (and vice versa), so personalizing the training set is a valuable tool. There is a notable downside, however. Without a general consensus defining the difference between Ham and Spam, this system requires that the initial training set be adequately comprehensive, while being small enough that the user's preferences take precedence quickly. This makes the structure of the initial training set used by new accounts critically important.

            3    kNN Implemented in CUDA

Once an incoming email has been processed into its quantifiable attributes, it is used as the query point in our kNN classifier.

The kNN algorithm has been implemented on almost all parallel architectures. A CUDA implementation of the kNN algorithm is described in (Arefin et al. 2012). Under specific conditions, speed-ups of 50 to 60 times compared with a CPU implementation were achieved, but this largely depends on the dataset and the GPU characteristics. There are also parallel kNN implementations which use the MapReduce paradigm (Yokoyama, Ishikawa, and Suzuki 2012). We use here our own kNN CUDA implementation, specifically designed for the Spam-classification problem.

For our work, we used sample emails from the SpamAssassin dataset³ and stored them, after processing, in an SQLite database⁴. Our implementation is divided into three phases: a distance calculation phase, a sorting phase, and a voting phase. Though the second phase is the primary focus of this work, we will examine each phase in detail.

The first phase determines the distance between each point in the training set and the query point. Since this step is embarrassingly parallel, the process is simple and efficient. The query point and training set are transferred to the GPU, where the distance calculation kernel allocates a single thread for each data point in the training set. Each thread then simply calculates the Euclidean distance between its assigned point and the query point. The results of each calculation are stored in a struct.

¹ https://developer.nvidia.com/
² http://lucenenet.apache.org/
³ http://csmining.org/index.php/spam-assassin-datasets.html
⁴ http://www.sqlite.org/
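The normalization of Section 2 and the per-thread distance computation of this first phase can be illustrated with a minimal CPU sketch, with plain Python standing in for the CUDA kernel. The two-attribute example data in the accompanying test is purely illustrative; it is not the actual 63-attribute layout used by the system.

```python
import math

def normalize(training, query):
    """Min-max normalize each attribute column to [0, 1], as in Section 2.

    `training` is a list of (attribute_vector, label) pairs; `query` is an
    attribute vector. Columns with no spread are left at 0."""
    dims = len(query)
    lo = [min(min(row[d] for row, _ in training), query[d]) for d in range(dims)]
    hi = [max(max(row[d] for row, _ in training), query[d]) for d in range(dims)]
    span = [h - l or 1.0 for l, h in zip(lo, hi)]  # avoid division by zero
    norm = lambda row: [(v - l) / s for v, l, s in zip(row, lo, span)]
    return [(norm(row), label) for row, label in training], norm(query)

def distances(training, query):
    """Phase 1: conceptually one 'thread' per training point computes its
    Euclidean distance to the query; only (distance, label) pairs are kept,
    mirroring the struct of distance and classification described above."""
    return [(math.sqrt(sum((a - b) ** 2 for a, b in zip(row, query))), label)
            for row, label in training]
```

On the GPU, each list element of `distances` would instead be produced by its own thread; the sequential loop here only mirrors the per-thread arithmetic.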
Each of these structs resides in the GPU's memory and contains just the distance from the query point and the item's own classification. Since the attribute data itself is no longer of any use, we free it so that the next phase of the algorithm can work on smaller data points. Additionally, the k data points we eventually return to the CPU are smaller, reducing the overhead of transmission over the PCIe bus.

The next phase is the sorting phase. To improve the efficiency of the sorting process, we implemented a parallel reduction to determine the k smallest distances. Initially, the array of distances stored in the GPU's memory is used as input for the reduction. The array is divided among the blocks, and then the threads in each block perform a parallel reduction to find the k smallest distances within that block. Since we do not know where the smallest values are located among the blocks, and because threads cannot communicate between blocks, each block must be reduced to its own minimal k. Once a block has determined its k elements of minimal distance, the results are stored in a secondary array on the GPU, of size equal to k times the number of blocks used. After all blocks have stored their results into the secondary array, the reduction is run again with the output of the previous reduction as the new input, juggling the reduced quantity of data points between the two previously allocated storage arrays (Fig. 1). This process is repeated until only k elements remain, at which point those elements are returned to the CPU.

Figure 1: The input is divided among the executed blocks, with each block executing a reduction and returning the k smallest distances. Once all outputs have been collected, they are used as new inputs and the process repeats.

The final phase is the voting phase. Since the number of data points used in the voting phase is k, which is generally set to a small value in order to maintain accuracy and avoid over-fitting, the computation for this step is negligible. For this reason, we execute this step on the CPU. The weighted distance for each of the k data points is added to its classification's total. Once each of the k data points has contributed its vote, the classification with the majority would generally be determined to be the correct classification. However, since our implementation is designed specifically for use in Spam-classification systems, a simple majority vote is not ideal. Generally speaking, it is preferable to have Spam emails incorrectly classified as Ham rather than the opposite (we prefer false negatives over false positives). As a result, our implementation utilizes a threshold value (θ) that the vote must exceed in order for an email to be classified as Spam.

                4    System architecture

To demonstrate the functionality of the email classification scheme we developed, we created a lightweight email client, which we dubbed "SpamSystem.NET"⁵. The application provides full email functionality with Gmail accounts and uses our own customizable kNN email classification scheme for Spam-filtration.

To achieve email client functionality, we relied on some aspects of the MailSystem.NET open source libraries⁶. MailSystem.NET provided functionality that allowed us to easily open connections to Gmail's servers. For our purposes, we modified the Gmail account settings so that Gmail automatically sent all messages to the inbox, even those messages identified as Spam by Google's filters. This allowed us to test our own implementation without outside interference. It is worth noting that this is not necessary, and both filters could be used in tandem; we did this strictly for the purpose of testing. MailSystem.NET's libraries allowed us to send and receive email messages, create our own functions to manage the user's Gmail account, and maintain bidirectional synchronization with Google's servers. It also allowed the system to detect changes to the user's account which happen outside of our application. The main user interface can be seen in Fig. 3, with the user's different email folders and options for standard email client actions. MailSystem.NET also provided functionality for creating email objects, which allowed us to more easily parse emails for the classification metrics we required.

Additionally, our application provides easy options for managing the kNN Spam-filter. Fig. 4 shows the options screen of our SpamSystem.NET application. In addition to managing the user's email account, there are options for maintaining a Black-List and a White-List, which specify keywords, phrases, domain names, and email addresses that, should they appear in an email, indicate that the email should automatically go to the Spam folder (for Black-List items) or the inbox (for White-List items). There are also necessary features for maintaining the kNN Spam-filter itself. There are options for setting the time interval between training-set rebasings, as well as options for determining which GPU(s) should be used to run the kNN classification, should more than one GPU be present in the system.

To summarize how the system operates: first, an unclassified incoming email is received from the Gmail server in the form of an IMAP message. It is then parsed by our application into quantifiable attributes. Next, it is sent to the kNN classifier, which determines whether it is Ham or Spam. Finally, the email is moved to the appropriate email folder in our application, and the application notifies the Gmail server to do the same. A detailed diagram representing the components and how they interact can be seen in Fig. 2.

Our system runs locally on the user's machine. Our goal in developing this local implementation was to demonstrate its efficiency and scalability.

⁵ For this implementation stage, we gratefully acknowledge the contribution of our student colleague Leonard Patterson.
⁶ http://mailsystem.codeplex.com/
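Returning to the classifier of Section 3, the block-wise reduction and the threshold vote can be mimicked sequentially. In the following Python sketch the block size and the inverse-distance vote weighting are our own illustrative assumptions; in the GPU version the per-block reductions run concurrently rather than in a loop.

```python
def k_smallest_reduction(dist_label_pairs, k, block_size=4):
    """Simulate the GPU reduction: split the array into blocks, reduce each
    block to its k smallest (distance, label) pairs, then repeat on the
    concatenated per-block results until only k elements remain."""
    data = list(dist_label_pairs)
    while len(data) > k:
        if block_size <= k:          # ensure each round can actually shrink the data
            block_size = k + 1
        blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
        data = [pair for block in blocks
                for pair in sorted(block, key=lambda p: p[0])[:k]]
    return sorted(data, key=lambda p: p[0])

def classify(neighbors, theta):
    """Weighted vote with a Spam threshold: each neighbor votes for its own
    label with weight 1/(1 + distance) (an assumed weighting); the email is
    Spam only if the Spam share of the total vote exceeds theta."""
    weights = {"spam": 0.0, "ham": 0.0}
    for dist, label in neighbors:
        weights[label] += 1.0 / (1.0 + dist)
    total = weights["spam"] + weights["ham"]
    return "spam" if total and weights["spam"] / total > theta else "ham"
```

Note how a θ above 0.5 biases the decision toward Ham, matching the stated preference for false negatives over false positives.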
Figure 2: The overall system architecture for SpamSystem.NET.

Figure 3: Main screen of SpamSystem.NET's user interface.

Figure 4: Option screen of SpamSystem.NET's user interface.

                     5   Experiments

A significant amount of previous research has shown that the kNN algorithm can be highly accurate (see, for example, (Firte, Lemnaru, and Potolea 2010)). The algorithm's real deficit lies in the cost of its computation. As a result, the primary focus of our research and the resulting benchmarks is demonstrating the efficiency gained by implementing kNN in CUDA. That being said, one cannot discuss a classification algorithm without discussing accuracy. This section details both time and accuracy benchmarks.

Our implementation was designed and tested on a GeForce GTX 260 (PCIe 2.0) in a Windows 7 PC with an Intel i7 950 CPU. All of our tests were run on this configuration.
Figure 5: Performance of the CUDA implementation on the GTX GPU versus the sequential implementation on the CPU.

Figure 6: Speedup of the CUDA implementation on the GTX GPU versus the sequential implementation on the CPU.

Efficiency Benchmarks

All efficiency benchmarks are the result of 10 trials on a dummy dataset.

Fig. 5 shows the efficiency in terms of time as the size of the training set (N) increases. Here we compare a naïve, sequential implementation of kNN against our implementation on the GTX 260. For these tests we used a k value of 10.

Fig. 6 shows the speedup gained by the CUDA implementation on the GTX 260 over the sequential implementation as N increases.

Accuracy Benchmarks

All accuracy benchmarks are the result of a 10-fold cross-validation (10:90 test-set to training-set ratio) on the SpamAssassin dataset.

Fig. 7 shows the accuracy (in terms of false negatives and false positives) for varying threshold values (θ). For this experiment, a k value of 10 was used. The results show that while higher θ values yield fewer false-positive classifications, they also yield more false-negative classifications (as one would expect). Depending on the environment in which this application is used, a proper θ could easily be determined via training to fit the desired false-positive to false-negative ratio, but a "best" value to use is not necessarily apparent.

Figure 7: Accuracy of classification on varying θ values.

Fig. 8 shows the accuracy (in terms of false negatives and false positives) for varying values of k. For this experiment, a θ value of 0.7 was used. Based on our results, we determined that k = 3 was optimal for our purposes based on its false-positive to false-negative ratio. Additionally, since smaller k values tend to yield faster computational performance, this choice was obvious. That being said, an optimal k value could also be determined via training for any environment using this application, based on the specific needs of that environment.
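As noted above, a suitable θ can be chosen empirically for a given environment. A minimal sketch of such a selection, assuming per-θ false-positive and false-negative counts have already been collected from cross-validation runs like those behind Fig. 7 (the helper name and the results layout are hypothetical, not part of SpamSystem.NET):

```python
def pick_theta(results, max_fp_rate):
    """Return the theta with the fewest false negatives among those whose
    false-positive rate is acceptable, or None if no theta qualifies.

    `results` maps theta -> (false_positives, false_negatives, total)."""
    candidates = [(fn / total, theta)
                  for theta, (fp, fn, total) in results.items()
                  if fp / total <= max_fp_rate]
    if not candidates:
        return None
    return min(candidates)[1]  # smallest false-negative rate wins
```

The same pattern applies to choosing k, with the additional tiebreaker that smaller k values run faster.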
Figure 8: Accuracy of classification on varying k values.

                        6    Conclusions

Although many email Spam-filtering tools exist in the world, due to the persistence of spammers and their adoption of new techniques, email Spam-filtering remains a challenging problem for researchers. Automatic email filtering seems to be the most effective method for countering Spam at the moment, and a tight competition between spammers and spam-filtering methods is going on: as anti-Spam methods become more refined, so too do those of the spammers (Tretyakov 2004).

In this work, we have demonstrated an efficient implementation of the kNN algorithm utilizing CUDA-enabled GPUs. Under our scheme, an incoming email is parsed into its quantifiable attributes and is then sent to the classification system. Here, the distances between the newly-processed email and each email in the training set are quickly calculated on the GPU. Afterwards, a parallel reduction process is used to identify the k training-set emails with the smallest distances from the query email. These k emails cast weighted votes for their own classification. If the Spam vote exceeds the required threshold (θ), the query email is classified as Spam. Otherwise, it is classified as Ham. Our experimental results show that our implementation is highly efficient and scalable. This efficiency, coupled with kNN's previously demonstrated accuracy for classification, makes it potentially beneficial for large-scale Spam-filtration, and it could easily be adapted to tackle other similar problems.

                          References

Arefin, A. S.; Riveros, C.; Berretta, R.; and Moscato, P. 2012. GPU-FS-kNN: A software tool for fast and scalable kNN computation using GPUs. PLoS ONE 7(8):e44000.

Blanzieri, E., and Bryl, A. 2008. A survey of learning-based techniques of email spam filtering. Artif. Intell. Rev. 29(1):63–92.

Cormack, G. V. 2008. Email spam filtering: A systematic review. Found. Trends Inf. Retr. 1(4):335–455.

Firte, L.; Lemnaru, C.; and Potolea, R. 2010. Spam detection filter using kNN algorithm and resampling. In Intelligent Computer Communication and Processing (ICCP), 2010 IEEE International Conference on, 27–33.

Lopes, L., and Ribeiro, B. 2011. GPUMLib: An efficient open-source GPU machine learning library. International Journal of Computer Information Systems and Industrial Management Applications 2150–7988.

Panigrahi, P. 2012. A comparative study of supervised machine learning techniques for spam e-mail filtering. In Computational Intelligence and Communication Networks (CICN), 2012 Fourth International Conference on, 506–512.

Sallab, A. A. A., and Rashwan, M. A. 2012. E-mail classification using deep networks. Journal of Theoretical and Applied Information Technology 37:241–251.

Tretyakov, K. 2004. Machine learning techniques in spam filtering. Technical report, Institute of Computer Science, University of Tartu.

Yokoyama, T.; Ishikawa, Y.; and Suzuki, Y. 2012. Processing all k-nearest neighbor queries in Hadoop. In Gao, H.; Lim, L.; Wang, W.; Li, C.; and Chen, L., eds., Web-Age Information Management, volume 7418 of Lecture Notes in Computer Science. Springer Berlin Heidelberg. 346–351.