Multi-Stage Malicious Click Detection on Large Scale Web Advertising Data

Leyi Song, Xueqing Gong*, Xiaofeng He, Rong Zhang, Aoying Zhou
Center for Cloud Computing and Big Data, East China Normal University
3663 North Zhongshan Road, Shanghai, China
songleyi@ecnu.cn, {xqgong,xfhe,rzhang,ayzhou}@sei.ecnu.cn

* Corresponding Author

ABSTRACT

The healthy development of the Internet largely depends on online advertising, which provides its financial support. Click fraud, however, poses a serious threat to the Internet ecosystem. It not only harms advertisers, but also damages the mutual trust between advertiser and ad agency. Click fraud prediction is a typical big data application: malicious clicks normally have to be identified in massive click logs, so efficient detection methods in a big data framework are much needed to combat this fraudulent behavior. In this paper, we propose a three-stage filtering system to attack click fraud. The serialized filters effectively detect malicious clicks with decreasing confidence, which can satisfy both advertisers and content providers.

1. INTRODUCTION

The fast development of the Internet depends not only on the growth of rich content, but also on online advertising, which provides financial support to the Internet ecosystem. Online advertising tends to benefit all involved parties, including content providers, advertisers, ad agencies and ad networks, and it requires mutual trust among all of them. This trust is put at risk, however, by fraudulent clicks. Click fraud (also called click spam or malicious clicking) is the act of intentional clicking with the purpose of making undue profit or of doing harm to competitors. Click fraud is becoming a serious problem for the World Wide Web [5]. Failing to deter such behavior will discourage advertisers from actively engaging in online advertising, resulting in less revenue for content providers and ad agencies, and ultimately endangering the Internet ecosystem as a whole. It is therefore of high importance to detect malicious clicks.

Ad agencies and networks often deploy different filters to identify malicious clicks [10]. By setting a proper threshold or training a classifier, these methods can handle certain types of anomalous click behavior generated by humans or bots [7, 12, 13]. Despite such efforts, detecting malicious clicks remains a great challenge. In this paper, we focus on detecting malicious clicks in large scale web advertising data on the ad agency side. The ad agency plays an intermediate role between advertisers and publishers in the online advertising system. In order to identify malicious clicks of different categories, for instance suspicious users, stealthy click-bots and cheating publishers, we build a series of filters at different stages in a big data computation framework. The filters are ordered by decreasing confidence in predicting malicious clicks. At the first stage, a rule-based filter identifies malicious clicks with high confidence, since its rules catch strong signals of abnormal clicks. At the second stage, a supervised classification approach detects malicious clicks whose prediction results are of lower confidence than the rule-based ones. Finally, at stage 3 we cluster the clicks into groups, with the hope that fraudulent clicks generated by one publisher will be grouped together. The clustering method is an unsupervised learning process and hence yields the lowest prediction confidence. Sequentially organizing the filters in decreasing confidence order gives us flexibility and scalability to add extra filters. For instance, we can replace one classifier with another at stage 2, or add more classifiers to stage 2 to form an ensemble. Users are free to use the results corresponding to different confidence levels.

We use precision as the measure of confidence in this paper. We prefer precision over other metrics such as recall or F-measure because 1) precision focuses on singling out bad clicks while minimizing the chance of predicting valid clicks as fraudulent; and 2) it is well known that judging whether a click is fraudulent is very subjective in many cases. Pushing for high recall can classify a large number of valid clicks as bad ones unless we have high quality classifiers, which are hard to obtain, especially when even human judgement is difficult.

In this paper, we propose a click fraud detection architecture which organizes filters sequentially in decreasing confidence order. The architecture was applied to a large scale advertising log obtained from an ad agency. Specifically,

• we address the malicious click problem on the ad agency side; in addition to rule-based and supervised methods, we propose a clustering-based method to analyse traffic quality;

• we design a click detection mechanism in a big data computation framework, i.e., Hadoop, to detect malicious clicks efficiently by sequentially organizing three types of filters with different confidence levels;

• we carefully design and analyse important features, and verify our approach by evaluating the dissimilarity between predicted fraudulent clicks and valid clicks.

The rest of the paper is organized as follows. In Section 2, we introduce previous work related to click fraud detection. We present our detection architecture in Section 3. In Section 4, we comprehensively analyze the results of our approach by applying it to over one month of real click log data from an ad agency, and we conclude the paper in Section 5.

Figure 1: Architecture Overview. (Stage 1: rule-based filtering — blacklist, heavy hitter, frequent click; Stage 2: classification-based filtering — feature extraction, prediction model, malicious user validation; Stage 3: clustering-based detection — malicious publishers; Stage 4: validation, analysis and ensemble. Confidence decreases along the stages, separating trusted users from malicious clicks.)

2. RELATED WORK

Many participants play in the online advertising ecosystem. The ad agency usually acts as a matchmaker between publishers and advertisers. Publishers own the website pages containing ad slots. Advertisers propose their ad creatives to attract users to buy their products or to make other kinds of profit. The ad agency buys inventory from publishers and sells advertising traffic to advertisers. Thus, a reputable ad agency is responsible for filtering out malicious clicks before charging.

Undoubtedly there have been great efforts in the area of fraud detection, including the click fraud problem, but most solutions are not easily available. The report by Tuzhilin [10] introduced some of the approaches Google adopted to fight click fraud, in which both rule-based and anomaly-based filters were incorporated into the search engine. Since search advertising is one of the major forms of online advertising, this work inspired us to take a similar approach in our fraud detection system.

Another type of malicious click detection is based on click stream analysis techniques which identify patterns of fraudulent traffic. Metwally et al. proposed fraud detection solutions for data streams by combining association rules and duplicate detection methods [8]. Effective data structures such as a modified Bloom filter have been used in this setting [13]. However, it is often difficult to implement data mining-based detection methods over streams, which limits the adaptation of such techniques to click fraud detection.

Supervised learning approaches detect malicious clicks by training a classifier. Data collection is one of the most important steps in this approach. Haddadi [3] proposed the idea of bluff ads, which are unrelated to the user's searches, profile and recent activity. If a high ratio of such ads is clicked, the user can be flagged as suspicious. In other work, CAPTCHAs were used for training data generation and useful data collection [6]. Ad agencies have various advertiser and publisher sources, so the CAPTCHA approach is hardly applicable to them. The next key step is to extract features. The work in [1] investigated query attributes that distinguish human from robot traffic. Different types of features can be extracted from click attributes and user attributes, such as click count or geographic origin [1, 4].

Bot-generated click traffic is a big part of malicious clicks. State-of-the-art bot detection work mostly aims at clicks in search engine logs [6, 12]. Yu et al. proposed SBotMiner, a system which automatically identifies bot-generated search traffic in query logs using history-based as well as matrix-based unsupervised methods [12].

Previous approaches to identifying and understanding malicious clicks focused on one specific method or one specific problem. Researchers adopted click-through-rate related methods for web spam in search engines [11]. However, it is difficult for an ad agency to access the click-through rate, so some fraudulent patterns cannot be easily identified by traditional methods. In our framework, we design a more general data collection and filtering strategy for malicious click detection, implemented in a stage-wise architecture.

3. STAGE-WISE CLICK FRAUD FILTERING ARCHITECTURE

3.1 Architecture Overview

In order to accommodate different filtering methods with different confidence in predicting fraudulent clicks, we argue that it is advantageous to take a stage-wise filtering approach. The filters are sequentially connected such that a filter whose results have higher prediction confidence is placed in the chain before one with lower confidence. The intuition is that we first want to identify the most confidently malicious clicks and reduce the size of the data that needs further processing, hence reducing the complexity of the problem. Furthermore, the stage-wise structure offers prediction results at different confidence levels, which lets users utilize the results with more flexibility.

The stage-wise filtering system architecture is illustrated in Figure 1. The major components are the following three stages. 1) Rule-based filters at stage 1 detect obvious invalid clicks with high confidence; they identify two types of malicious clicks: heavy hitters and frequent clickers. 2) Classification-based filters at stage 2 determine more complicated clicks using a human-judged training set. 3) The clustering-based filter at stage 3 identifies cheating groups from similar publisher websites; we use intra-cluster distance and query diversity to separate malicious groups.

We observe that, for the agency's customers (advertisers), some of the suspicious ad traffic from the same website shows high similarity. We therefore try to group similar traffic and analyze it as a group, in order to detect abusive clicks from the publisher side. This approach can also be applied to search engine traffic, since query diversity can be used to separate suspicious groups in the search ad log. Considering the result confidence and the filtering cost, this stage can be an optional choice for each advertiser.
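The stage ordering described above can be sketched as a chain of filters applied in decreasing confidence order. The sketch below is illustrative only: the `Stage` layout, the confidence values and the simple heavy-hitter rule are our own simplifications, not the paper's implementation.

```python
from typing import Callable, Dict, List, Tuple

# A stage = (name, confidence, function). The function takes the clicks that
# survived earlier stages and splits them into (malicious, remaining).
Stage = Tuple[str, float, Callable[[List[dict]], Tuple[List[dict], List[dict]]]]

def run_pipeline(clicks: List[dict], stages: List[Stage]) -> Dict[str, List[dict]]:
    """Apply stages in decreasing confidence order; tag flagged clicks by stage."""
    flagged: Dict[str, List[dict]] = {}
    remaining = clicks
    for name, _conf, stage_fn in sorted(stages, key=lambda s: -s[1]):
        malicious, remaining = stage_fn(remaining)
        flagged[name] = malicious
    flagged["valid"] = remaining  # clicks that no stage flagged
    return flagged

def heavy_hitter_stage(threshold: int) -> Callable:
    """Crude stand-in for a stage-1 rule: flag every click of any user whose
    total click count exceeds `threshold` (hypothetical, for illustration)."""
    def stage(clicks: List[dict]) -> Tuple[List[dict], List[dict]]:
        counts: Dict[str, int] = {}
        for c in clicks:
            counts[c["user"]] = counts.get(c["user"], 0) + 1
        bad = [c for c in clicks if counts[c["user"]] > threshold]
        ok = [c for c in clicks if counts[c["user"]] <= threshold]
        return bad, ok
    return stage
```

Later stages (classification, clustering) would plug in as further `Stage` entries, each seeing only the clicks that survived the higher-confidence filters, which is exactly what makes the later, more expensive stages cheaper to run.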
Figure 2: Click statistics of a trusted-user click log dataset over one month. (a) Click density: number of users vs. clicks in one interval, for 3/6/12/24-hour intervals. (b) Click frequency: number of users vs. number of periods, for 10/30/60/180-minute periods.

3.2 Rule-based Filtering

To fight click fraud, maintaining a blacklist in the filtering system is the most reliable method. It is easy to compile a blacklist of violating entities such as user agents (UA) or IP addresses, but the coverage of a blacklist is limited. Setting rules is an efficient way to exclude more malicious clicks from entering the accounting system. In this paper we focus on setting rules for the heavy hitter and frequent click problems [7, 8, 13].

A heavy hitter in a click log is a user whose click rate within a specific time interval is higher than a threshold θ1, while the frequent click problem refers to the situation where a user's clicks appear in more periods¹ than a predefined threshold θ2.

¹ We use "period" to denote the counting window of the frequent click problem, to distinguish it from the interval of the heavy hitter problem. Multiple clicks of a user within the same period are counted once.

To obtain reasonable filtering thresholds, we note from Figure 2 that both the number of clicks and the number of periods follow Zipfian distributions. We set the maximum for normal user behavior as the lower of the two values: the number of clicks in one interval and the number of periods in which the user clicks. For better accuracy, we can also determine the thresholds from the p-quantile values over the entire log dataset.

3.3 Classification-based Filtering

Classification-based methods are widely used in fraud and spam detection [9]. Classification is an effective way to address malicious clicks, especially stealthy clicks that are hard to capture with the rule-based approach, and it has been applied to real data with success [4, 10]. One of the biggest advantages of the classification approach is that once a model has been built, predicting a new instance is usually quite fast. There is also research on the attributes that distinguish human and bot traffic [1]. In our work, we use the traditional features from previous work, and also engineer new features useful for training classifiers; we discuss the features in more detail later. The result of this stage is a set of malicious users from the click log.

3.4 Clustering-based Filtering

The results of supervised methods obviously depend on the accuracy, coverage and labeling scheme of the labeled corpus: the classified malicious clicks are limited to the fraudulent types present in the training set. Hence we develop a clustering-based method, based on our observation that suspicious ad traffic from the same website often shows high similarity. For valuable customers, this per-advertiser group analysis is an attractive feature which differs from previous work.

For clustering, we define the dissimilarity between two log entries x, y as

    D(x, y) = Σ_{f ∈ Fields} w_f · d_f(x, y)    (1)

where each log entry is represented by a vector of click attributes ⟨user, IP, referrer, UA, area, query⟩, Fields is the set of attributes, d_f(x, y) is a distance measure defined on attribute f and normalized to [0, 1], and w_f is the weight of d_f(x, y).

For attributes like the referrer URL and the user agent, we care about the longest matching prefix, and the distance measure is defined as

    d_url(x, y) = 1 − LCP(x, y) / max(|x|, |y|)    (2)

where LCP(x, y) is the length of the longest common prefix of the two strings.

Even though network address translation (NAT) may put many users behind a single IP address, cheating groups always show strong similarity in their IP addresses. To amplify the importance of the tail part of an IP address, we take the 32-bit IPv4 address as an example to define the distance. If LCB(x, y) ≥ 16, where LCB is the number of longest common (prefix) bits, then

    d_IP(x, y) = 1 − LCB(x, y) / total_bits,    (3)

otherwise the distance between the two IP addresses is 1.

For other attributes such as area, we simply treat the dissimilarity as a binary value (0 or 1). Furthermore, we can easily prove that D(x, y) is a metric, since it satisfies: (1) D(x, y) = 0 iff x = y; (2) D(x, y) ≥ 0; (3) D(x, y) = D(y, x); and (4) D(x, y) ≤ D(x, z) + D(z, y) (triangle inequality). Due to space limits, we skip the proof here.

After obtaining the dissimilarity matrix of a log set, we use the k-medoids algorithm to produce the clustering at this stage. K-medoids is an adaptation of k-means: rather than computing the mean of the items in each cluster, which is not applicable in our situation, a representative item, or medoid, is chosen for each cluster. The medoid of each cluster is the object i minimizing

    J̃ = Σ_{j ∈ C_i} D(i, j)    (4)

where C_i is the cluster containing object i and D(i, j) is the dissimilarity function defined in Equation 1. Since the algorithm simply looks up the dissimilarity matrix, the matrix only needs to be computed once at the beginning.

The next step is to distinguish cheating groups from the other clusters. Obviously, if one group agrees on most fields, it indicates with high probability that this group of clicks comes from a botnet using similar terminals and browsers, or from a single genuinely interested user. From the click statistics we can infer that, if the group size is sizable, it is unlikely to be a single interested user. In other words, the lower the intra-cluster dissimilarity of a group, the higher the probability that the traffic in the group is malicious. Moreover, high similarity of the referrers in a suspicious group indicates low traffic quality of the publisher, except for search engines. In particular, we add a query diversity factor for search engine traffic. We define the Scatter (SC) score of each group as

    SC = intra_distance × query_diversity    (5)

where query diversity is defined as the ratio of distinct search phrases to total search phrases; it is set to 1 for non-search-engine traffic. By adding query diversity, we give more weight to website traffic groups, since it is more difficult to locate the root cause of a problem in search engine groups. Thus, we use the intra-cluster distance times the query diversity to measure the scatter-ness of the ad clicks in a group. Finally, groups with a low SC score are regarded as suspicious groups. Figure 3 shows the distribution of the Scatter score across our test groups. The major steps are summarized in Algorithm 1.

Figure 3: Fraction of groups vs. Scatter scores.

Algorithm 1: Major steps of finding cheating groups.
  input : clicks on each advertiser
  output: groups of malicious clicks
  initialize the dissimilarity matrix;
  while not at the end of the advertisers' traffic set do
      step 1: apply k-medoids with distance D(x, y) to obtain traffic groups with similar features;
      step 2: select suspicious groups with lower SC scores;
  end

3.5 Feature Extraction

In order to define useful features, we need to analyze the differences between normal and malicious click behavior. Below, we discuss some of the features we identified to represent a user.

Number of clicked advertisers. This feature counts the number of distinct advertisers each user clicked. Malicious users show extreme patterns; for instance, most have an empty cookie, and others click densely on a single advertiser.

Click ratio on advertisers. This feature takes into account both the total clicks and the distinct clicked advertisers, defined as total clicks / total clicked advertisers. For each user, it represents the average number of clicks per advertiser. Comparing the histograms, we observe that trusted users show higher diversity.

We also define features that characterize the attributes of a user. For instance, a fraudulent user might carry out malicious clicks from one device but with many dynamically allocated IP addresses. Thus, we derive features from the user agent, IP and cookie. Short cookies are more suspicious than normal cookies; the same goes for user agents. Some details of these features are shown below.

Click/IP ratio. This feature is defined as the total clicks of a user / the total number of unique IP addresses the clicks come from.

Variance of IP clicks. After counting the number of clicks from each IP address, we calculate the variance of these counts. It is suspicious if one user launches clicks with consistent frequency from a set of IP addresses. Besides these, we also use features such as the number of user agents, the number of referrers, the length of the user agent, and the length of the cookie.

Other features are derived from geographic or temporal attributes. For the click timestamp, we divide a day into four six-hour periods: night (0:00–5:59), morning (6:00–11:59), afternoon (12:00–17:59) and evening (18:00–23:59). With this information, the features below are also extracted: most frequent area, most frequent period, number of clicks in each period, mean/standard deviation of clicks across periods, and so on. For space reasons we do not list all 17 features used in classification.

Table 1: Examples of Labeled Malicious Clicks
User | Advertiser | Area | IP | Referer | Query | User agent
1 | c1 | 1 | x.x.25.177 | none | none | Mozilla/5.0
2 | c2 | 1 | x.x.25.178 | none | none | Mozilla/5.0
3 | c3 | 2 | y.y.51.137 | http://r1.com/s?wd=wvihv | wvihv | Mozilla/4.0 (compatible; MSIE 7.0;)
3 | c3 | 2 | y.y.51.142 | http://r1.com/s?wd=jmfitxm | jmfitxm | Mozilla/4.0 (compatible; MSIE 7.0;)
4 | c3 | 2 | y.y.51.145 | http://r2.com/s?wd=qyfsoc | qyfsoc | Mozilla/4.0 (compatible; MSIE 7.0;)

4. EXPERIMENTS

In this section we present the experimental results of the stage-wise filtering framework.

4.1 Dataset

The dataset used in the experiments is over one month of click logs obtained from an ad agency company. It contains about 35 million ad click records.
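Each such record can be viewed as the attribute vector ⟨user, IP, referrer, UA, area, query⟩ on which the clustering stage operates. As a concrete illustration, the attribute-wise dissimilarity of Section 3.4 (Equations 1–3) can be sketched as follows; the field names, the per-field distance assignment and the weights are our own illustrative choices, not the production configuration:

```python
# Sketch of the dissimilarity measure of Section 3.4 (Equations 1-3).
# Field names and weights are illustrative, not the paper's exact values.

def lcp_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two strings (LCP)."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

def d_url(x: str, y: str) -> float:
    """Equation 2: prefix distance for referrer URLs / user agents."""
    if not x and not y:
        return 0.0
    return 1.0 - lcp_len(x, y) / max(len(x), len(y))

def d_ip(x: str, y: str) -> float:
    """Equation 3: bit-prefix distance for IPv4 addresses; 1.0 if LCB < 16."""
    xb = int.from_bytes(bytes(int(p) for p in x.split(".")), "big")
    yb = int.from_bytes(bytes(int(p) for p in y.split(".")), "big")
    diff = xb ^ yb
    lcb = 32 - diff.bit_length() if diff else 32  # longest common bits
    if lcb < 16:
        return 1.0
    return 1.0 - lcb / 32

def d_binary(x, y) -> float:
    """Binary dissimilarity for attributes such as area or user ID."""
    return 0.0 if x == y else 1.0

# Per-attribute distance functions (our own assignment for illustration).
DISTANCES = {"user": d_binary, "ip": d_ip, "referrer": d_url,
             "ua": d_url, "area": d_binary, "query": d_binary}

def dissimilarity(x: dict, y: dict, weights: dict) -> float:
    """Equation 1: weighted sum of the normalized attribute distances."""
    return sum(w * DISTANCES[f](x[f], y[f]) for f, w in weights.items())
```

Because each d_f is normalized to [0, 1], two clicks sharing an IP /16 prefix, a common referrer prefix and a common UA prefix end up close under D, which is what lets k-medoids pull botnet-like traffic into tight groups.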
The advertising log data normally contains attributes such as user ID, click timestamp, the user's IP, cookie, query phrase (if the ad traffic comes from a search engine), user agent, and referrer (the URL of the page where the user clicked the ad link).

To protect privacy, users who click the ads are identified only temporarily, by cookie. We assume that some user actions, such as purchasing or registering, are benign; in this way, we extract non-fraudulent clicks from benign users for training. On the other hand, malicious clicks were labeled using domain knowledge. Examples of malicious clicks are shown in Table 1; we notice that attacks may come from similar IP addresses with fake user agents, fake referrers, or meaningless query phrases. Besides, the training data includes the complete clicks of three advertisers, which can be used to evaluate clustering performance. A brief description of the training dataset is given in Table 2.

Table 2: Brief description of training set.
Description | # of Clicks/Users
Fraudulent clicks | 174,642
Non-fraudulent clicks | 779,980
Malicious users | 83,061
Benign users | 649,204

4.2 Experimental Setup

We run our stage-wise method on an 8-node Hadoop cluster using Pig Latin [2]. We split each stage into Map/Reduce modules arranged in a pipeline; each module can be converted into one or several rounds of Map/Reduce tasks. For example, in the filtering phase of the rule-based stage, users are partitioned across the mappers, and the counting filter UDFs (User Defined Functions) for each user run in the reducers. Minimizing I/O cost is a major concern in analysis of large scale datasets; taking this into consideration, we design our UDFs to compute as many results as possible in one pass over the data.

In the first stage of our framework, the bounding rules for click density and click frequency are determined using p-quantiles of the entire distribution. From our training data, we find the log contains over 0.5% heavy hitters. Since there is a certain percentage of noisy clicks in the log, 0.995 or a similar value (consider the distributions in Figure 2) can be chosen as the value of p for rule setting in our experiment. The large click logs are filtered by passing through the click counting, period counting, quantile calculation and filtering phases.

Next, the reduced dataset is passed to the feature extraction modules and the classifiers of the second stage. Features that work in web page spam domains may not work in ad log analysis; we inspected the click data and introduced several features specifically designed for classifiers to predict malicious clicks, created by studying the labeled users' activity patterns (Section 3.5). In our experiment, the 17 features discussed above are created. We then use the training set of Table 2 to train classification models and compare the performance of the following classifiers: Naïve Bayes, Decision Table, Bayes Net, REP Tree and Random Forest (10 trees, 5 random features each). The results of 5-fold cross-validation over the training set are presented in Table 3. Since the labeled malicious clicks are only part of the real malicious clicks, recall performance is less important here than precision. Moreover, the seemingly strong performance may partly be a result of the simple labeling scheme, which is why we also need unsupervised methods.

Table 3: Precision performance for classifiers.
Classifier | Precision (%) | Run Time (s)
Random Forest | 97.6 | 322
REP Tree | 96.8 | 98
Bayes Net | 96.7 | 115
Decision Table | 96.1 | 199.5
Naïve Bayes | 52.9 | 27

Finally, three advertisers with different click volumes (100K, 10K and 1K) are chosen as the evaluation dataset for the clustering stage. We set equal weights for the different features in this experiment. According to Figure 3, we use 2.0 as the cutoff for the SC score of groups; a score below it means the group is quite dense. The number of clusters K for k-medoids is determined experimentally.

4.3 Results Analysis and Validation

Figure 4 illustrates the result of our implementation of the stage-wise approach applied to one month of click logs.

Figure 4: Analysis of the click log dataset: total clicks per day, and the malicious clicks filtered by stages 1–3.

From the daily results, we find that a high percentage of clicks follows the format http://search-engine.com/s?word with the same short UA. Further study suggests that these clicks come from a botnet which directly injects noise into our logs using fake referrers, user agents and IP addresses. We also find four suspicious publisher groups, which generate high-density clicks with similar IP addresses and user agents from their websites. Moreover, part of the suspicious clicks were generated by download managers, visible in the UA.

Figure 4 shows that the percentage of malicious clicks caught in the first stage is distributed evenly over time. We can reasonably assume that all the heavy hitters and frequent clickers are malicious, since the upper bounds were set based on the properties of the distribution. Thus, the rule-based filter can serve as the basic filter, with high confidence, to meet the basic requirements of an ad agency.

For the classification stage, we choose Random Forest as our model. This stage can discover stealthy clicks with suspicious patterns derived from the domain knowledge-based labeling. However, due to the limitations of supervised methods, false positives are inevitable; our aim is to obtain the classifier with maximum precision. As for the clustering stage, we evaluate precision on the three different advertisers' click logs. If the number of final result groups is n, the precision metric in this case is

    Precision = (1/n) Σ_{i=1}^{n} |C_i ∩ L| / |C_i|

where C_i is the i-th result group and L is the set of clicks labeled as cheating. Figure 5 shows our clustering test results. The three advertisers' datasets achieve different precision performance, which indicates that the confidence partially depends on the distribution of the real click data. Therefore, we place the clustering-based filter in the last stage. Even though this filter has lower confidence, it is important for the ad agency, which can assess the traffic quality of publishers by evaluating the malicious groups.

Figure 5: Precision performance of the clustering method on evaluation data from three advertisers.

It is worth pointing out that our framework, built on top of the Hadoop platform, achieves high efficiency in processing click logs. Due to the compute-as-much-as-possible, one-pass design, each stage finishes in minutes for both daily and monthly filtering.

To further validate the stage-wise precision, we picked 4 days of predicted malicious clicks for human judgement. We invited domain experts to inspect these clicks; the comparison results are shown in Figure 6. We see that the rule-based methods achieve high precision, which verifies our assumption. However, false positives are inevitable in any unsupervised learning algorithm. Interestingly, most of the falsely classified results come from mobile applications, especially on the 10th day. The missing referrer field in these clicks is the root cause, which makes their patterns similar to those of malicious clicks.

Figure 6: Precision performance of the three stages (rule-based, classification-based, clustering-based) on sampled results.

As a comparison, we check the diversity ratio on three features — hash of cookie, hash of user agent, and hash of referrer — by sampling clicks from the positive and negative results respectively. The diversity is defined as the ratio of distinct items to total samples. Figure 7 shows the result: the diversity of normal clicks is relatively higher than that of the malicious groups. For example, a group of intentional clicks with similar UAs may come from one commander. The obvious difference between the malicious groups and the normal groups suggests that the identified clicks are indeed very suspicious.

Figure 7: Diversity comparison on three features.

5. CONCLUSION

Fraudulent clicking is a malicious behavior which threatens the healthy development of the Internet ecosystem. In this work, we propose a stage-wise click fraud filtering architecture which effectively identifies fraudulent clicks for an ad agency at different prediction confidence levels. The stages in this work can be further divided into sets of modules, each consisting of one or several rounds of Map/Reduce for parallel computation. We performed an in-depth analysis of one month of click logs using the proposed framework and evaluated the results with different metrics.

6. ACKNOWLEDGMENTS

This work is partially supported by the Key Program of National Natural Science Foundation of China grant No. 61232002, National Science Foundation of China grants No. 60925008, No. 61103039 and No. 61021004, and the Key Lab Project of Wuhan University.

7. REFERENCES

[1] O. Duskin and D. G. Feitelson. Distinguishing humans from robots in web search logs: preliminary results using query rates and intervals. In Proceedings of the 2009 Workshop on Web Search Click Data, WSCD '09, pages 15–19, 2009.
[2] A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a high-level dataflow system on top of MapReduce: the Pig experience. PVLDB, 2(2), 2009.
[3] H. Haddadi. Fighting online click-fraud using bluff ads. Computer Communication Review, 40(2):21–25, 2010.
[4] M. Hager and T. Landergren. Implementing best practices for fraud detection on an online advertising platform. Master's thesis, Chalmers University of Technology, 2010.
[5] B. J. Jansen. Click fraud. Computer, 40(7):85–86, July 2007.
[6] H. Kang, K. Wang, D. Soukal, F. Behr, and Z. Zheng. Large-scale bot detection for search engines. In WWW, pages 501–510, 2010.
[7] B. Lahiri, J. Chandrashekar, and S. Tirthapura. Space-efficient tracking of persistent items in a massive data stream. In DEBS, pages 255–266, 2011.
[8] A. Metwally, D. Agrawal, and A. El Abbadi. Duplicate detection in click streams. In WWW, pages 12–21, 2005.
[9] C. Phua, D. Alahakoon, and V. C. S. Lee. Minority report in fraud detection: classification of skewed data. SIGKDD Explorations, 6(1):50–59, 2004.
[10] A. Tuzhilin. The Lane's Gifts v. Google report. http://googleblog.blogspot.in/pdf/TuzhilinReport.pdf, 2007.
[11] C. Wei, Y. Liu, M. Zhang, S. Ma, L. Ru, and K. Zhang. Fighting against web spam: a novel propagation method based on click-through data. In SIGIR, pages 395–404, 2012.
[12] F. Yu, Y. Xie, and Q. Ke. SBotMiner: large scale search bot detection. In WSDM, pages 421–430, 2010.
[13] L. Zhang and Y. Guan. Detecting click fraud in pay-per-click streams of online advertising networks. In ICDCS, pages 77–84, 2008.