Multi-Stage Malicious Click Detection on Large Scale Web Advertising Data

Leyi Song, Xueqing Gong*, Xiaofeng He, Rong Zhang, Aoying Zhou
Center for Cloud Computing and Big Data, East China Normal University
3663 North Zhongshan Road, Shanghai, China
songleyi@ecnu.cn, {xqgong,xfhe,rzhang,ayzhou}@sei.ecnu.cn

* Corresponding Author

ABSTRACT

The healthy development of the Internet largely depends on online advertising, which provides its financial support. Click fraud, however, poses a serious threat to the Internet ecosystem. It not only harms advertisers, but also damages the mutual trust between advertiser and ad agency. Click fraud prediction is a typical big data application: malicious clicks normally have to be identified in massive click logs, so efficient detection methods in a big data framework are much needed to combat this fraudulent behavior. In this paper, we propose a three-stage filtering system to attack click fraud. The serialized filters effectively detect malicious clicks with decreasing confidence, which can satisfy both advertisers and content providers.

1. INTRODUCTION

The fast development of the Internet depends not only on the growth of rich content, but also on online advertising, which provides financial support to the Internet ecosystem. Online advertising tends to benefit all involved parties, including content providers, advertisers, ad agencies and ad networks, and it requires mutual trust among all of them. This trust is put at risk, however, by fraudulent clicks. Click fraud (also called click spam or malicious clicking) is the act of intentional clicking with the purpose of making undue profit or of doing harm to competitors. Click fraud is becoming a serious problem for the World Wide Web [5]. Failing to deter such behavior will discourage advertisers from actively engaging in online advertising, resulting in less revenue for content providers and ad agencies, and ultimately endangering the Internet ecosystem as a whole. It is therefore of high importance to detect malicious clicks.

Ad agencies and networks often deploy different filters to identify malicious clicks [10]. By setting a proper threshold or training a classifier, these methods can handle certain types of anomalous click behavior generated by humans or bots [7, 12, 13]. Despite such efforts, detecting malicious clicks remains a great challenge. In this paper, we focus on detecting malicious clicks in large scale web advertising data on the ad agency side. The ad agency plays an intermediate role between advertisers and publishers in the online advertising system. In order to identify malicious clicks of different categories, for instance suspicious users, stealthy click-bots and cheating publishers, we build a series of filters at different stages in a big data computation framework. The filters are ordered by decreasing confidence in predicting malicious clicks. At the first stage, a rule-based filter identifies malicious clicks with high confidence, since its rules catch strong signals of abnormal clicks. At the second stage, a supervised classification approach detects malicious clicks whose prediction results are of lower confidence than the rule-based ones. Finally, at stage 3 we cluster the clicks into groups, with the hope that fraudulent clicks generated by one publisher will be grouped together. The clustering method is an unsupervised learning process and hence yields the lowest prediction confidence. Sequentially organizing the filters in decreasing confidence order gives us flexibility and scalability to add extra filters. For instance, we can replace one classifier with another at stage 2, or add more classifiers to stage 2 to form an ensemble. Users are free to use the results corresponding to different confidence levels.

We use precision as the measure of confidence in this paper. We prefer precision over other metrics such as recall or F-measure because 1) precision focuses on singling out bad clicks while minimizing the chance of predicting valid clicks as fraudulent; and 2) it is well known that judging whether a click is fraudulent is very subjective in many cases. Pushing for high recall can classify a large number of valid clicks as bad ones unless we have high quality classifiers, which are hard to obtain, especially when even human judgement is difficult.

In this paper, we propose a click fraud detection architecture which organizes filters sequentially in decreasing confidence order. The architecture was applied to a large scale advertising log obtained from an ad agency. Specifically,

• we address the malicious click problem on the ad agency side; in addition to rule-based and supervised methods, we propose a clustering-based method to analyse traffic quality;

• we design a click detection mechanism in a big data computation framework, i.e., Hadoop, to detect malicious clicks efficiently by sequentially organizing three types of filters with different confidence levels;

• we carefully design and analyse important features, and verify our approach by evaluating the dissimilarity between predicted fraudulent clicks and valid clicks.

The rest of the paper is organized as follows. In Section 2, we introduce previous work related to click fraud detection. We present our detection architecture in Section 3. In Section 4, we comprehensively analyze the results of our approach by applying it to over one month of real click log data from an ad agency, and we conclude the paper in Section 5.

Figure 1: Architecture Overview. (Stage 1: rule-based filtering — blacklist, heavy hitter, frequent click; Stage 2: classification-based filtering — feature extraction, prediction model, malicious user validation; Stage 3: clustering-based detection — malicious publishers; Stage 4: validation, analysis and ensemble. Confidence decreases along the stages, separating trusted users from malicious clicks.)

2. RELATED WORK

Many participants play in the online advertising ecosystem. The ad agency usually acts as a matchmaker between publishers and advertisers. Publishers own the website pages containing ad slots. Advertisers propose their ad creatives to attract users to buy their products or to make other kinds of profit. The ad agency buys inventory from publishers and sells advertising traffic to advertisers. Thus, a reputable ad agency is responsible for filtering out malicious clicks before charging.

Undoubtedly there have been great efforts in the area of fraud detection, including the click fraud problem, but most solutions are not easily available. The report by Tuzhilin [10] introduced some of the approaches Google adopted to fight click fraud, in which both rule-based and anomaly-based filters were incorporated into the search engine. Since search advertising is one of the major forms of online advertising, this work inspired us to take a similar approach in our fraud detection system.

Another type of malicious click detection is based on click stream analysis techniques which identify patterns of fraudulent traffic. Metwally et al. proposed fraud detection solutions for data streams by combining association rules and duplicate detection methods [8]. Effective data structures such as a modified Bloom filter have been used in this setting [13]. However, it is often difficult to implement data mining-based detection methods over streams, which limits the adaptation of such techniques to click fraud detection.

Supervised learning approaches detect malicious clicks by training a classifier. Data collection is one of the most important steps in this approach. Haddadi [3] proposed the idea of bluff ads, which are unrelated to the user's searches, profile and recent activity. If a high ratio of such ads is clicked, the user can be flagged as suspicious. In other work, CAPTCHAs were used for training data generation and useful data collection [6]. Ad agencies have various advertiser and publisher sources, so the CAPTCHA approach is hardly applicable to them. The next key step is to extract features. The work in [1] investigated query attributes that distinguish human from robot traffic. Different types of features can be extracted from click attributes and user attributes, such as click count or geographic origin [1, 4].

Bot-generated click traffic is a big part of malicious clicks. State-of-the-art bot detection work mostly aims at clicks in search engine logs [6, 12]. Yu et al. proposed SBotMiner, a system which automatically identifies bot-generated search traffic in query logs using history-based as well as matrix-based unsupervised methods [12].

Previous approaches to identifying and understanding malicious clicks focused on one specific method or one specific problem. Researchers adopted click-through-rate related methods for web spam in search engines [11]. However, it is difficult for an ad agency to access the click-through rate, so some fraudulent patterns cannot be easily identified by traditional methods. In our framework, we design a more general data collection and filtering strategy for malicious click detection, implemented in a stage-wise architecture.

3. STAGE-WISE CLICK FRAUD FILTERING ARCHITECTURE

3.1 Architecture Overview

In order to accommodate different filtering methods with different confidence in predicting fraudulent clicks, we argue that it is advantageous to take a stage-wise filtering approach. The filters are sequentially connected such that a filter whose results have higher prediction confidence is placed in the chain before one with lower confidence. The intuition is that we first want to identify the most confidently malicious clicks and reduce the size of the data that needs further processing, hence reducing the complexity of the problem. Furthermore, the stage-wise structure offers prediction results at different confidence levels, which lets users utilize the results with more flexibility.

The stage-wise filtering system architecture is illustrated in Figure 1. The major components are the following three stages. 1) Rule-based filters at stage 1 detect obvious invalid clicks with high confidence; they identify two types of malicious clicks: heavy hitters and frequent clickers. 2) Classification-based filters at stage 2 determine more complicated clicks using a human-judged training set. 3) The clustering-based filter at stage 3 identifies cheating groups from similar publisher websites; we use intra-cluster distance and query diversity to separate malicious groups.

We observe that, for the agency's customers (advertisers), some of the suspicious ad traffic from the same website shows high similarity. We therefore try to group similar traffic and analyze it as a group, in order to detect abusive clicks from the publisher side. This approach can also be applied to search engine traffic, since query diversity can be used to separate suspicious groups in the search ad log. Considering the result confidence and the filtering cost, this stage can be an optional choice for each advertiser.
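The stage ordering described above can be sketched as a chain of filters applied in decreasing confidence order. The sketch below is illustrative only: the `Stage` layout, the confidence values and the simple heavy-hitter rule are our own simplifications, not the paper's implementation.

```python
from typing import Callable, Dict, List, Tuple

# A stage = (name, confidence, function). The function takes the clicks that
# survived earlier stages and splits them into (malicious, remaining).
Stage = Tuple[str, float, Callable[[List[dict]], Tuple[List[dict], List[dict]]]]

def run_pipeline(clicks: List[dict], stages: List[Stage]) -> Dict[str, List[dict]]:
    """Apply stages in decreasing confidence order; tag flagged clicks by stage."""
    flagged: Dict[str, List[dict]] = {}
    remaining = clicks
    for name, _conf, stage_fn in sorted(stages, key=lambda s: -s[1]):
        malicious, remaining = stage_fn(remaining)
        flagged[name] = malicious
    flagged["valid"] = remaining  # clicks that no stage flagged
    return flagged

def heavy_hitter_stage(threshold: int) -> Callable:
    """Crude stand-in for a stage-1 rule: flag every click of any user whose
    total click count exceeds `threshold` (hypothetical, for illustration)."""
    def stage(clicks: List[dict]) -> Tuple[List[dict], List[dict]]:
        counts: Dict[str, int] = {}
        for c in clicks:
            counts[c["user"]] = counts.get(c["user"], 0) + 1
        bad = [c for c in clicks if counts[c["user"]] > threshold]
        ok = [c for c in clicks if counts[c["user"]] <= threshold]
        return bad, ok
    return stage
```

Later stages (classification, clustering) would plug in as further `Stage` entries, each seeing only the clicks that survived the higher-confidence filters, which is exactly what makes the later, more expensive stages cheaper to run.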
Figure 2: Click statistics of a trusted-user click log dataset over one month. (a) Click density: number of users vs. clicks in one interval, for 3/6/12/24-hour intervals. (b) Click frequency: number of users vs. number of periods, for 10/30/60/180-minute periods.

3.2 Rule-based Filtering

To fight click fraud, maintaining a blacklist in the filtering system is the most reliable method. It is easy to compile a blacklist of violating entities such as user agents (UA) or IP addresses, but the coverage of a blacklist is limited. Setting rules is an efficient way to exclude more malicious clicks from entering the accounting system. In this paper we focus on setting rules for the heavy hitter and frequent click problems [7, 8, 13].

A heavy hitter in a click log is a user whose click rate within a specific time interval is higher than a threshold θ1, while the frequent click problem refers to the situation where a user's clicks appear in more periods¹ than a predefined threshold θ2.

¹ We use "period" to denote the counting window of the frequent click problem, to distinguish it from the interval of the heavy hitter problem. Multiple clicks of a user within the same period are counted once.

To obtain reasonable filtering thresholds, we note from Figure 2 that both the number of clicks and the number of periods follow Zipfian distributions. We set the maximum for normal user behavior as the lower of the two values: the number of clicks in one interval and the number of periods in which the user clicks. For better accuracy, we can also determine the thresholds from the p-quantile values over the entire log dataset.

3.3 Classification-based Filtering

Classification-based methods are widely used in fraud and spam detection [9]. Classification is an effective way to address malicious clicks, especially stealthy clicks that are hard to capture with the rule-based approach, and it has been applied to real data with success [4, 10]. One of the biggest advantages of the classification approach is that once a model has been built, predicting a new instance is usually quite fast. There is also research on the attributes that distinguish human and bot traffic [1]. In our work, we use the traditional features from previous work, and also engineer new features useful for training classifiers; we discuss the features in more detail later. The result of this stage is a set of malicious users from the click log.

3.4 Clustering-based Filtering

The results of supervised methods obviously depend on the accuracy, coverage and labeling scheme of the labeled corpus: the classified malicious clicks are limited to the fraudulent types present in the training set. Hence we develop a clustering-based method, based on our observation that suspicious ad traffic from the same website often shows high similarity. For valuable customers, this per-advertiser group analysis is an attractive feature which differs from previous work.

For clustering, we define the dissimilarity between two log entries x, y as

    D(x, y) = Σ_{f ∈ Fields} w_f · d_f(x, y)    (1)

where each log entry is represented by a vector of click attributes ⟨user, IP, referrer, UA, area, query⟩, Fields is the set of attributes, d_f(x, y) is a distance measure defined on attribute f and normalized to [0, 1], and w_f is the weight of d_f(x, y).

For attributes like the referrer URL and the user agent, we care about the longest matching prefix, and the distance measure is defined as

    d_url(x, y) = 1 − LCP(x, y) / max(|x|, |y|)    (2)

where LCP(x, y) is the length of the longest common prefix of the two strings.

Even though network address translation (NAT) may put many users behind a single IP address, cheating groups always show strong similarity in their IP addresses. To amplify the importance of the tail part of an IP address, we take the 32-bit IPv4 address as an example to define the distance. If LCB(x, y) ≥ 16, where LCB is the number of longest common (prefix) bits, then

    d_IP(x, y) = 1 − LCB(x, y) / total_bits,    (3)

otherwise the distance between the two IP addresses is 1.

For other attributes such as area, we simply treat the dissimilarity as a binary value (0 or 1). Furthermore, we can easily prove that D(x, y) is a metric, since it satisfies: (1) D(x, y) = 0 iff x = y; (2) D(x, y) ≥ 0; (3) D(x, y) = D(y, x); and (4) D(x, y) ≤ D(x, z) + D(z, y) (triangle inequality). Due to space limits, we skip the proof here.

After obtaining the dissimilarity matrix of a log set, we use the k-medoids algorithm to produce the clustering at this stage. K-medoids is an adaptation of k-means: rather than computing the mean of the items in each cluster, which is not applicable in our situation, a representative item, or medoid, is chosen for each cluster. The medoid of each cluster is the object i minimizing

    J̃ = Σ_{j ∈ C_i} D(i, j)    (4)

where C_i is the cluster containing object i and D(i, j) is the dissimilarity function defined in Equation 1. Since the algorithm simply looks up the dissimilarity matrix, the matrix only needs to be computed once at the beginning.

The next step is to distinguish cheating groups from the other clusters. Obviously, if one group agrees on most fields, it indicates with high probability that this group of clicks comes from a botnet using similar terminals and browsers, or from a single genuinely interested user. From the click statistics we can infer that, if the group size is sizable, it is unlikely to be a single interested user. In other words, the lower the intra-cluster dissimilarity of a group, the higher the probability that the traffic in the group is malicious. Moreover, high similarity of the referrers in a suspicious group indicates low traffic quality of the publisher, except for search engines. In particular, we add a query diversity factor for search engine traffic. We define the Scatter (SC) score of each group as

    SC = intra_distance × query_diversity    (5)

where query diversity is defined as the ratio of distinct search phrases to total search phrases; it is set to 1 for non-search-engine traffic. By adding query diversity, we give more weight to website traffic groups, since it is more difficult to locate the root cause of a problem in search engine groups. Thus, we use the intra-cluster distance times the query diversity to measure the scatter-ness of the ad clicks in a group. Finally, groups with a low SC score are regarded as suspicious groups. Figure 3 shows the distribution of the Scatter score across our test groups. The major steps are summarized in Algorithm 1.

Figure 3: Fraction of groups vs. Scatter scores.

Algorithm 1: Major steps of finding cheating groups.
  input : clicks on each advertiser
  output: groups of malicious clicks
  initialize the dissimilarity matrix;
  while not at the end of the advertisers' traffic set do
      step 1: apply k-medoids with distance D(x, y) to obtain traffic groups with similar features;
      step 2: select suspicious groups with lower SC scores;
  end

3.5 Feature Extraction

In order to define useful features, we need to analyze the differences between normal and malicious click behavior. Below, we discuss some of the features we identified to represent a user.

Number of clicked advertisers. This feature counts the number of distinct advertisers each user clicked. Malicious users show extreme patterns; for instance, most have an empty cookie, and others click densely on a single advertiser.

Click ratio on advertisers. This feature takes into account both the total clicks and the distinct clicked advertisers, defined as total clicks / total clicked advertisers. For each user, it represents the average number of clicks per advertiser. Comparing the histograms, we observe that trusted users show higher diversity.

We also define features that characterize the attributes of a user. For instance, a fraudulent user might carry out malicious clicks from one device but with many dynamically allocated IP addresses. Thus, we derive features from the user agent, IP and cookie. Short cookies are more suspicious than normal cookies; the same goes for user agents. Some details of these features are shown below.

Click/IP ratio. This feature is defined as the total clicks of a user / the total number of unique IP addresses the clicks come from.

Variance of IP clicks. After counting the number of clicks from each IP address, we calculate the variance of these counts. It is suspicious if one user launches clicks with consistent frequency from a set of IP addresses. Besides these, we also use features such as the number of user agents, the number of referrers, the length of the user agent, and the length of the cookie.

Other features are derived from geographic or temporal attributes. For the click timestamp, we divide a day into four six-hour periods: night (0:00–5:59), morning (6:00–11:59), afternoon (12:00–17:59) and evening (18:00–23:59). With this information, the features below are also extracted: most frequent area, most frequent period, number of clicks in each period, mean/standard deviation of clicks across periods, and so on. For space reasons we do not list all 17 features used in classification.

Table 1: Examples of Labeled Malicious Clicks
User | Advertiser | Area | IP | Referer | Query | User agent
1 | c1 | 1 | x.x.25.177 | none | none | Mozilla/5.0
2 | c2 | 1 | x.x.25.178 | none | none | Mozilla/5.0
3 | c3 | 2 | y.y.51.137 | http://r1.com/s?wd=wvihv | wvihv | Mozilla/4.0 (compatible; MSIE 7.0;)
3 | c3 | 2 | y.y.51.142 | http://r1.com/s?wd=jmfitxm | jmfitxm | Mozilla/4.0 (compatible; MSIE 7.0;)
4 | c3 | 2 | y.y.51.145 | http://r2.com/s?wd=qyfsoc | qyfsoc | Mozilla/4.0 (compatible; MSIE 7.0;)

4. EXPERIMENTS

In this section we present the experimental results of the stage-wise filtering framework.

4.1 Dataset

The dataset used in the experiments is over one month of click logs obtained from an ad agency company. It contains about 35 million ad click records.
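Each such record can be viewed as the attribute vector ⟨user, IP, referrer, UA, area, query⟩ on which the clustering stage operates. As a concrete illustration, the attribute-wise dissimilarity of Section 3.4 (Equations 1–3) can be sketched as follows; the field names, the per-field distance assignment and the weights are our own illustrative choices, not the production configuration:

```python
# Sketch of the dissimilarity measure of Section 3.4 (Equations 1-3).
# Field names and weights are illustrative, not the paper's exact values.

def lcp_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two strings (LCP)."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

def d_url(x: str, y: str) -> float:
    """Equation 2: prefix distance for referrer URLs / user agents."""
    if not x and not y:
        return 0.0
    return 1.0 - lcp_len(x, y) / max(len(x), len(y))

def d_ip(x: str, y: str) -> float:
    """Equation 3: bit-prefix distance for IPv4 addresses; 1.0 if LCB < 16."""
    xb = int.from_bytes(bytes(int(p) for p in x.split(".")), "big")
    yb = int.from_bytes(bytes(int(p) for p in y.split(".")), "big")
    diff = xb ^ yb
    lcb = 32 - diff.bit_length() if diff else 32  # longest common bits
    if lcb < 16:
        return 1.0
    return 1.0 - lcb / 32

def d_binary(x, y) -> float:
    """Binary dissimilarity for attributes such as area or user ID."""
    return 0.0 if x == y else 1.0

# Per-attribute distance functions (our own assignment for illustration).
DISTANCES = {"user": d_binary, "ip": d_ip, "referrer": d_url,
             "ua": d_url, "area": d_binary, "query": d_binary}

def dissimilarity(x: dict, y: dict, weights: dict) -> float:
    """Equation 1: weighted sum of the normalized attribute distances."""
    return sum(w * DISTANCES[f](x[f], y[f]) for f, w in weights.items())
```

Because each d_f is normalized to [0, 1], two clicks sharing an IP /16 prefix, a common referrer prefix and a common UA prefix end up close under D, which is what lets k-medoids pull botnet-like traffic into tight groups.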
The advertising log data normally contains attributes such as user ID, click timestamp, the user's IP, cookie, query phrase (if the ad traffic comes from a search engine), user agent, and referrer (the URL of the page where the user clicked the ad link).

To protect privacy, users who click the ads are identified only temporarily, by cookie. We assume that some user actions, such as purchasing or registering, are benign; in this way, we extract non-fraudulent clicks from benign users for training. On the other hand, malicious clicks were labeled using domain knowledge. Examples of malicious clicks are shown in Table 1; we notice that attacks may come from similar IP addresses with fake user agents, fake referrers, or meaningless query phrases. Besides, the training data includes the complete clicks of three advertisers, which can be used to evaluate clustering performance. A brief description of the training dataset is given in Table 2.

Table 2: Brief description of training set.
Description | # of Clicks/Users
Fraudulent clicks | 174,642
Non-fraudulent clicks | 779,980
Malicious users | 83,061
Benign users | 649,204

4.2 Experimental Setup

We run our stage-wise method on an 8-node Hadoop cluster using Pig Latin [2]. We split each stage into Map/Reduce modules arranged in a pipeline; each module can be converted into one or several rounds of Map/Reduce tasks. For example, in the filtering phase of the rule-based stage, users are partitioned across the mappers, and the counting filter UDFs (User Defined Functions) for each user run in the reducers. Minimizing I/O cost is a major concern in analysis of large scale datasets; taking this into consideration, we design our UDFs to compute as many results as possible in one pass over the data.

In the first stage of our framework, the bounding rules for click density and click frequency are determined using p-quantiles of the entire distribution. From our training data, we find the log contains over 0.5% heavy hitters. Since there is a certain percentage of noisy clicks in the log, 0.995 or a similar value (consider the distributions in Figure 2) can be chosen as the value of p for rule setting in our experiment. The large click logs are filtered by passing through the click counting, period counting, quantile calculation and filtering phases.

Next, the reduced dataset is passed to the feature extraction modules and the classifiers of the second stage. Features that work in web page spam domains may not work in ad log analysis; we inspected the click data and introduced several features specifically designed for classifiers to predict malicious clicks, created by studying the labeled users' activity patterns (Section 3.5). In our experiment, the 17 features discussed above are created. We then use the training set of Table 2 to train classification models and compare the performance of the following classifiers: Naïve Bayes, Decision Table, Bayes Net, REP Tree and Random Forest (10 trees, 5 random features each). The results of 5-fold cross-validation over the training set are presented in Table 3. Since the labeled malicious clicks are only part of the real malicious clicks, recall performance is less important here than precision. Moreover, the seemingly strong performance may partly be a result of the simple labeling scheme, which is why we also need unsupervised methods.

Table 3: Precision performance for classifiers.
Classifier | Precision (%) | Run Time (s)
Random Forest | 97.6 | 322
REP Tree | 96.8 | 98
Bayes Net | 96.7 | 115
Decision Table | 96.1 | 199.5
Naïve Bayes | 52.9 | 27

Finally, three advertisers with different click volumes (100K, 10K and 1K) are chosen as the evaluation dataset for the clustering stage. We set equal weights for the different features in this experiment. According to Figure 3, we use 2.0 as the cutoff for the SC score of groups; a score below it means the group is quite dense. The number of clusters K for k-medoids is determined experimentally.

4.3 Results Analysis and Validation

Figure 4 illustrates the result of our implementation of the stage-wise approach applied to one month of click logs.

Figure 4: Analysis of the click log dataset: total clicks per day, and the malicious clicks filtered by stages 1–3.

From the daily results, we find that a high percentage of clicks follows the format http://search-engine.com/s?word with the same short UA. Further study suggests that these clicks come from a botnet which directly injects noise into our logs using fake referrers, user agents and IP addresses. We also find four suspicious publisher groups, which generate high-density clicks with similar IP addresses and user agents from their websites. Moreover, part of the suspicious clicks were generated by download managers, visible in the UA.

Figure 4 shows that the percentage of malicious clicks caught in the first stage is distributed evenly over time. We can reasonably assume that all the heavy hitters and frequent clickers are malicious, since the upper bounds were set based on the properties of the distribution. Thus, the rule-based filter can serve as the basic filter, with high confidence, to meet the basic requirements of an ad agency.

For the classification stage, we choose Random Forest as our model. This stage can discover stealthy clicks with suspicious patterns derived from the domain knowledge-based labeling. However, due to the limitations of supervised methods, false positives are inevitable; our aim is to obtain the classifier with maximum precision. As for the clustering stage, we evaluate precision on the three different advertisers' click logs. If the number of final result groups is n, the precision metric in this case is

    Precision = (1/n) Σ_{i=1}^{n} |C_i ∩ L| / |C_i|

where C_i is the i-th result group and L is the set of clicks labeled as cheating. Figure 5 shows our clustering test results. The three advertisers' datasets achieve different precision performance, which indicates that the confidence partially depends on the distribution of the real click data. Therefore, we place the clustering-based filter in the last stage. Even though this filter has lower confidence, it is important for the ad agency, which can assess the traffic quality of publishers by evaluating the malicious groups.

Figure 5: Precision performance of the clustering method on evaluation data from three advertisers.

It is worth pointing out that our framework, built on top of the Hadoop platform, achieves high efficiency in processing click logs. Due to the compute-as-much-as-possible, one-pass design, each stage finishes in minutes for both daily and monthly filtering.

To further validate the stage-wise precision, we picked 4 days of predicted malicious clicks for human judgement. We invited domain experts to inspect these clicks; the comparison results are shown in Figure 6. We see that the rule-based methods achieve high precision, which verifies our assumption. However, false positives are inevitable in any unsupervised learning algorithm. Interestingly, most of the falsely classified results come from mobile applications, especially on the 10th day. The missing referrer field in these clicks is the root cause, which makes their patterns similar to those of malicious clicks.

Figure 6: Precision performance of the three stages (rule-based, classification-based, clustering-based) on sampled results.

As a comparison, we check the diversity ratio on three features — hash of cookie, hash of user agent, and hash of referrer — by sampling clicks from the positive and negative results respectively. The diversity is defined as the ratio of distinct items to total samples. Figure 7 shows the result: the diversity of normal clicks is relatively higher than that of the malicious groups. For example, a group of intentional clicks with similar UAs may come from one commander. The obvious difference between the malicious groups and the normal groups suggests that the identified clicks are indeed very suspicious.

Figure 7: Diversity comparison on three features.

5. CONCLUSION

Fraudulent clicking is a malicious behavior which threatens the healthy development of the Internet ecosystem. In this work, we propose a stage-wise click fraud filtering architecture which effectively identifies fraudulent clicks for an ad agency at different prediction confidence levels. The stages in this work can be further divided into sets of modules, each consisting of one or several rounds of Map/Reduce for parallel computation. We performed an in-depth analysis of one month of click logs using the proposed framework and evaluated the results with different metrics.

6. ACKNOWLEDGMENTS

This work is partially supported by the Key Program of National Natural Science Foundation of China grant No. 61232002, National Science Foundation of China grants No. 60925008, No. 61103039 and No. 61021004, and the Key Lab Project of Wuhan University.

7. REFERENCES

[1] O. Duskin and D. G. Feitelson. Distinguishing humans from robots in web search logs: preliminary results using query rates and intervals. In Proceedings of the 2009 Workshop on Web Search Click Data, WSCD '09, pages 15–19, 2009.
[2] A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a high-level dataflow system on top of MapReduce: the Pig experience. PVLDB, 2(2), 2009.
[3] H. Haddadi. Fighting online click-fraud using bluff ads. Computer Communication Review, 40(2):21–25, 2010.
[4] M. Hager and T. Landergren. Implementing best practices for fraud detection on an online advertising platform. Master's thesis, Chalmers University of Technology, 2010.
[5] B. J. Jansen. Click fraud. Computer, 40(7):85–86, July 2007.
[6] H. Kang, K. Wang, D. Soukal, F. Behr, and Z. Zheng. Large-scale bot detection for search engines. In WWW, pages 501–510, 2010.
[7] B. Lahiri, J. Chandrashekar, and S. Tirthapura. Space-efficient tracking of persistent items in a massive data stream. In DEBS, pages 255–266, 2011.
[8] A. Metwally, D. Agrawal, and A. El Abbadi. Duplicate detection in click streams. In WWW, pages 12–21, 2005.
[9] C. Phua, D. Alahakoon, and V. C. S. Lee. Minority report in fraud detection: classification of skewed data. SIGKDD Explorations, 6(1):50–59, 2004.
[10] A. Tuzhilin. The Lane's Gifts v. Google report. http://googleblog.blogspot.in/pdf/TuzhilinReport.pdf, 2007.
[11] C. Wei, Y. Liu, M. Zhang, S. Ma, L. Ru, and K. Zhang. Fighting against web spam: a novel propagation method based on click-through data. In SIGIR, pages 395–404, 2012.
[12] F. Yu, Y. Xie, and Q. Ke. SBotMiner: large scale search bot detection. In WSDM, pages 421–430, 2010.
[13] L. Zhang and Y. Guan. Detecting click fraud in pay-per-click streams of online advertising networks. In ICDCS, pages 77–84, 2008.