Detecting Fraudulent Advertisements on a Large E-Commerce Platform

Tim Zimmermann1, Timo Djürken1, Arne Mayer1, Michael Janke1, Martin Boissier2, Christian Schwarz2, Rainer Schlosser2, Matthias Uflacker2
1,2 Hasso Plattner Institute, University of Potsdam, Germany
{firstname.lastname}@{1 student.hpi.de, 2 hpi.de}

ABSTRACT
E-commerce platforms face the challenge of efficiently and accurately detecting fraudulent activity every day. Manually checking every advertisement for fraud does not scale and is financially unviable. By using automated learning algorithms, we can drastically reduce the number of advertisements that need to be checked by humans. In this paper, we present the results of a joint project with a large e-commerce company selling used goods. Using our partner's advertisement data, we implemented several classification approaches to automatically recognize fraudulent activity. With the help of the proposed fraud detection, customer service agents only need to check about 8% of all advertisements manually for fraud. Simultaneously, we detect more than 93% of all fraudulent advertisements.

Keywords
Fraud Detection; Random Forests; XGBoost; Logistic Regression; Boosted Trees; E-Commerce

2017, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1. DETECTING ONLINE FRAUD
The success of e-commerce platforms strongly depends on the trust that customers have in them. If a platform is home to fraudulent offerings, customers are less likely to interact with that platform. Hence, minimizing the number of fraudulent advertisements is crucial for every e-commerce company. In 2012, online retailers lost an estimated revenue of USD 3.5 billion due to fraud [2].
This figure does not even include the expenses that retailers face to fight crime on their platforms. Technology already helps to reduce the number of people required by automatically suggesting a classification for a given offering. However, many companies still have to employ a large number of dedicated staff to make the final decision whether an offering is fraudulent or not.
Fraudulent activity cannot always be easily identified. It is crucial to take the dependencies between various characteristics of the offering into account. While these dependencies can be simple in some cases, they are often complex and not at all obvious to humans. Approaches using hand-crafted rule sets pose several problems. On the one hand, it is hard for humans to recognize fraud patterns and maintain them over time as fraudsters adapt their behavior. To create such rules, highly skilled experts are needed, who are both rare and expensive. On the other hand, fraudsters and companies are in constant competition to come up with ever-improving strategies to carry out fraud and to fight it, respectively. Therefore, complex models comprised of rules that humans once developed are likely to prove ineffective once fraudsters evolve their strategy. Fraud detection needs to be both adaptable and verifiable.
In general, there is always a trade-off between correctly identifying fraudulent advertisements on the one hand and incorrectly classifying actually legitimate advertisements as fraudulent on the other. For every e-commerce company it is expensive to have fraud on the platform. Consequently, we want to identify as many of the fraudulent advertisements as possible. We can accept incorrectly classifying some of the legitimate advertisements as fraudulent, because the automatic classification is solely used as a filter for the customer support agents, who then check those advertisements manually. Therefore, if a fraudulent advertisement is not classified as such, it will not be checked by an agent unless a user files a complaint. On the other hand, if an advertisement is classified as fraudulent but is actually legitimate, the customer agent can overrule the classification of the algorithm. Another challenge of hand-crafted rules for classification is to determine how likely a given offering is fraudulent based on the rule set. Having that functionality can reduce the workload of the agents reviewing the classification, as a prediction probability above a certain threshold might already be sufficient; in that case, the advertisement's classification could be determined without any human looking at it.
In this paper, we compare three approaches to classify fraud: logistic regression and decision tree-based models in the form of Random Forests and XGBoost. We compare their effectiveness to recognize fraud with two different feature sets: one is based on the characteristics and values of the offerings (see Section 4), the other on the average fraud probability for each of these values.

1.1 Columnar In-Memory Databases
For our implementation, we decided on an architecture using the statistical language R and a columnar in-memory database (in our case SAP HANA). The decision for R was rather straightforward, as R is open source and the de-facto standard for fast prototyping of machine learning algorithms. The decision for a columnar in-memory database was made for several reasons. Databases with analytical capabilities allow us to select a new data set for training on the fly without long-running MapReduce or ETL (extract, transform, and load) jobs. At the same time, we can exploit the fact that we always have access to the most recent data to train our models on [4]. This is particularly interesting for identifying the steadily changing behaviour of fraudsters. Whenever a major shift in fraudulent behaviour is observed, models can be retrained within minutes.
During this project, we have used the columnar database to select data on the fly and a stand-alone R installation for iterative work on our models. Should our proposed approach be put into production, there are two ways to proceed, both offering high performance. First, R and HANA can be linked using the system's shared memory as described in [3]. This way, data can be directly shared between the two processes without any serialization or network transmission, while still having all capabilities of R at hand. Second, SAP HANA provides the so-called predictive analysis library (PAL), offering a large variety of machine learning algorithms that are directly executed on the actual data inside the database engine.

2. PRELIMINARIES
For our research, we worked with data from a production system of our project partner. It includes a subset of advertisements along with some meta information (see Figure 1).

[Figure 1: Data schema of our project partner's data set, comprising the tables COMPLAINTS, AD_HISTORY, TRAIN_HISTORY, ADVERTISEMENT, TRAINING_SET, and ATTRIBUTE_VALIDATIONS.]

The data set contains a total of ∼1.4 million advertisements. For some of these advertisements there is more than one version, which means that the advertisement has changed over the observation period. We treat these versions as separate advertisements and can therefore work with about 1.9 million advertisements. For around 1.5 million (≈ 80%) of those advertisements we know the correct classification, i.e., whether a given advertisement is fraudulent. Apart from the current states of advertisements, the data also includes so-called attribute validations as well as historical information. Attribute validations are checks that are performed whenever an advertisement is created or changed, e.g., comparing the price to the average price for the offered item. The history table contains the different versions of the advertisements.

3. STRATEGIES FOR MODELING FRAUD
In order to successfully reduce the number of advertisements that have to be checked manually, there are two major aspects we need to consider. First, the algorithm should produce as few false negatives as possible. That is, it should not misclassify as legitimate advertisements that are in fact fraudulent. At the same time, the pool of possibly fraudulent advertisements should be as small as possible. Second, the algorithm needs to keep up with the latest strategies of fraudsters. Fraudsters react to defensive mechanisms employed by companies and improve the techniques they use to fool and exploit both users and companies.

3.1 Dynamic Adaption
Traditional approaches are based on domain experts. They use their knowledge and experience to create a set of rules. A rule consists of a number of checks that an advertisement is tested against. The result is a classification as fraudulent or legitimate, depending on whether the rule matched or not. Results of multiple rules are aggregated to make a final decision on the classification of the advertisement.
This approach has two major drawbacks. First, it takes a lot of time to develop a sophisticated set of rules that is tuned to the individual use case. Second, these rules constantly need to be updated by the experts, which again takes time. Besides the cost of paying the experts, this time span can already cause significant damage and result in big losses to the company.
In contrast, machine learning offers techniques with which the criteria to classify advertisements can be generated semi- or fully automatically, while at the same time adapting to the ever-changing deceptions of fraudsters. We compare three different techniques to classify advertisements: logistic regression, Random Forest, and XGBoost. The process of engineering the criteria is described in Section 4.

3.2 Logistic Regression
Logistic regression is a form of supervised learning. A function is fitted to a labeled data set; in our case the advertisements are labeled with fraud (1) or no fraud (0). The function is then used to predict whether new advertisements are fraudulent or not. The result is always a value between 0 and 1; the higher it is, the more likely that advertisement is fraudulent. Due to the nature of regression, it is not able to work with categorical (non-numerical) features. One work-around is to transform categorical features into binary ones, so-called one-hot encoding. Assume the brand attribute only had three distinct values: Nike, Reebok, and Adidas. Instead of having a single feature brand, we would have three features, specifically is_nike, is_reebok, and is_adidas. The value of each of these features is either 0 or 1 and can thus be used by logistic regression. The main disadvantage of this approach is that it adds a lot of features to the model and therefore increases the computation time required to generate the model. At the same time, many classification algorithms are limited in the number of features they can efficiently handle.
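As an illustration of this work-around, the following R sketch one-hot encodes a toy brand column and fits a logistic regression on it. The data frame and column names are made up for the example and do not reflect our partner's actual schema.

    # Toy data: five advertisements with a categorical brand
    # attribute and a binary fraud label.
    ads <- data.frame(
      brand = c("Nike", "Nike", "Reebok", "Adidas", "Nike"),
      fraud = c(1, 1, 1, 0, 0)
    )

    # model.matrix() performs the one-hot expansion: the single
    # brand column becomes one binary indicator column per value.
    one_hot <- model.matrix(~ brand - 1, data = ads)
    colnames(one_hot)  # "brandAdidas" "brandNike" "brandReebok"

    # glm() with a binomial family fits the logistic regression;
    # it applies the same expansion internally.
    model <- glm(fraud ~ brand, data = ads, family = binomial)

    # The prediction is a value between 0 and 1, interpreted as
    # the probability that an advertisement is fraudulent.
    probs <- predict(model, newdata = ads, type = "response")

On such tiny toy data the fit is degenerate (R warns about fitted probabilities of 0 or 1), but the mechanics carry over to a realistic training set.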
3.3 Random Forest
Random Forests are a collection of decision trees that are trained by repeatedly taking random samples from the training set. As the technique also relies on labeled data, it is a supervised learning technique as well. In contrast to logistic regression, Random Forests can natively handle categorical features with an arbitrary number of distinct values. However, the R implementation we used is only able to handle features with a maximum of 53 distinct values. While this is enough for some of the features we used, it does not solve the problem in general. The brand attribute, for example, contains more than 53 different brands. Oftentimes it is not necessary to know the exact brand; it might be sufficient to know the segment, e.g., sports shoes or running shoes. The segment can then be used as a "reduced" feature instead.
We built a forest of 4,000 trees. This forest was assembled by 20 parallel workers building forests of 200 trees each, which were combined into a single model in the end.
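A minimal sketch of this parallel construction with the randomForest package is shown below; training_set and the fraud label are placeholders for our partner's actual data.

    # Sketch: 20 workers build 200 trees each; the partial forests
    # are then merged into a single 4,000-tree model.
    library(randomForest)
    library(parallel)

    build_subforest <- function(worker_id) {
      # Each call trains an independent forest of 200 trees on the
      # (placeholder) training set.
      randomForest(as.factor(fraud) ~ ., data = training_set, ntree = 200)
    }

    # Distribute the 20 sub-forests across cores; on Windows,
    # parLapply() with a cluster would replace mclapply().
    sub_forests <- mclapply(1:20, build_subforest, mc.cores = 20)

    # randomForest::combine() merges multiple forests into one model.
    forest <- do.call(combine, sub_forests)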
3.4 XGBoost
XGBoost – short for extreme gradient-boosted trees – is a recent algorithm that has attracted many researchers over the past year [1]. Similar to Random Forests, XGBoost creates (boosted) decision trees and belongs to the group of supervised learning techniques. One of the key advantages of XGBoost – besides the pure learning results – is its performance and scalability, allowing for very fast training and thus faster parameter tuning.
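The following sketch shows how such a model could be trained with the xgboost R package; the feature matrix, labels, and hyperparameter values are placeholders rather than the exact settings we used.

    library(xgboost)

    # xgboost works on numeric matrices; the averages feature set
    # (Section 4.2) satisfies this requirement directly.
    dtrain <- xgb.DMatrix(data = feature_matrix, label = labels)

    model <- xgb.train(
      params = list(
        objective = "binary:logistic",  # predict a fraud probability
        eta       = 0.1,                # learning rate (placeholder)
        max_depth = 6                   # tree depth (placeholder)
      ),
      data    = dtrain,
      nrounds = 100                     # number of boosting rounds
    )

    # Predictions are probabilities in [0, 1] that are later
    # compared against the cutoff value (see Section 5.2).
    probs <- predict(model, feature_matrix)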
3.5 Visualizing the Decision Process
One of our main objectives was an easy-to-understand and digestible visualization of the automated fraud detection. Understanding why a – from a customer agent's perspective – black-box algorithm classified a particular advertisement as legitimate or fraudulent is essential to build up trust between the system and the customer agent. Additionally, it helps the users to quickly decide which characteristics of the advertisement to look at.

3.5.1 Logistic Regression
In the training stage, logistic regression builds a model which consists of an intercept value and feature coefficients that are used to predict the fraud probability. We built two visualizations in order to help the customer agent better understand the learned model. The first visualization is supposed to give the customer agent an overview of the features that the model uses to predict. Here, we simply show the highest/lowest coefficients of the model (note that normalization is required for this step). This very simple approach helps to make the results interpretable, as the customer agent gains trust in the model when it (hopefully) detects fraud characteristics that conform with the agent's experience.
The second visualization is rather simple and optimized to help the agent decide whether a particular advertisement is legitimate or fraudulent (for an example see Figure 2). For all historical legitimate and fraudulent advertisements, we calculate average explanatory variables that are put into the regression model. For the advertisement that is currently examined, we then compare against the advertisement's explanatory variables and show the ratio (i.e., the 'fraud influence'). We do not aim to give the customer agent a full understanding of the model's decision, as even simple models are usually too complex to re-enact mentally in a reasonable time frame. Instead, we want to give hints which features should be examined and allow the agent to check for plausibility.

[Figure 2: Visualization of the relevant features (user information, browser, average ad count, operating system, ad creation time, letter code, IP letter code, operating system cluster, browser cluster) and their influence (from −1.0 to 1.0) on the fraud probability for an exemplary advertisement.]

3.5.2 Decision Tree-Based Models
While it is possible to visualize decision trees, it is difficult to condense the process into a few simple but meaningful figures. On the one hand, providing a figure for each tree of a larger forest is impractical for the number of trees in our models. On the other hand, using a random sample of trees can be non-representative and might confuse the user. Hence, we decided to solely show the relevance results of the logistic regression to the customer agents.

4. FEATURE ENGINEERING
To detect fraud, we worked with two very different feature sets. The first feature set relies on the attributes and features provided by our project partner as well as additional features we manually created; we call it the value-based feature set. The second feature set is based on the average fraud probability for a given value, and we therefore call it the averages feature set.

4.1 Advertisement Characteristics
The value-based feature set consists of multiple features of varying complexity. This includes basic attributes like the geographical origin of the user's IP address, but also more complex combinations such as the IP Usage Counter. A fraudster might have numerous advertisements online at the same time to increase the chances of success. For this case, we calculate the number of advertisements that were created with the same IP address and try to find IP addresses with a large number of advertisements.
IP addresses can also be analyzed with regard to their usage and connections to fraud in the past. However, it is not easy to identify users based on the IP, since for most customers the IP changes every day. Nevertheless, the IP can in many cases be connected to an internet service provider simply by using its sub ranges. Users typically only log in with IP addresses of a small number of different providers, which in turn means that the number of different providers can be used as a feature. The current fraud detection system generates attribute validations (see Section 2). We have included the most useful of these attribute validations in our feature set as well. In total, we are using 24 (partially hand-crafted) features in our value-based feature set.
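To make this feature construction concrete, here is a small R sketch of the IP Usage Counter and the provider-count feature. The column names (ip_address, provider, customer_id) are hypothetical and chosen only for illustration.

    # IP Usage Counter: for each advertisement, count how many
    # advertisements were created from the same IP address.
    ads$ip_usage_counter <- ave(rep(1, nrow(ads)), ads$ip_address, FUN = sum)

    # Number of distinct providers per customer, derived from a
    # (hypothetical) provider column mapped from the IP sub ranges.
    provider_count <- tapply(ads$provider, ads$customer_id,
                             function(p) length(unique(p)))
    ads$provider_count <- provider_count[as.character(ads$customer_id)]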
4.2 Using Historical Fraud Information
The averages feature set makes use of the average fraud probability for all values in the data set. As a first step, for every value in (almost) every column, we calculate the share of fraudulent advertisements with that value compared to all advertisements with that value. Specifically, for each of these columns we calculate the conditional probability of fraud P(fraud | v) for every possible value v. We then use these probabilities as features. Table 1 illustrates this process.

(a) Initial data:
    ID  BRAND   BROWSER  FRAUD
    1   Nike    Chrome   1
    2   Nike    Chrome   1
    3   Reebok  Firefox  1
    4   Adidas  Firefox  0
    5   Nike    Firefox  0

(b) Generated features:
    ID  BRAND  BROWSER  FRAUD
    1   0.67   1        1
    2   0.67   1        1
    3   1      0.33     1
    4   0      0.33     0
    5   0.67   0.33     0

Table 1: Initial data (a) and generated features (b).

Let us assume that in the learning data the brand Nike appears a total of three times and that two of these advertisements are fraudulent. The fraud probability for Nike then calculates as:

    P(fraud | brand = Nike) = P(fraud ∩ brand = Nike) / P(brand = Nike) = 0.4 / 0.6 ≈ 0.67

Thus, all values of Nike in the brand column are replaced with 0.67 in the feature set. Accordingly, Google's Chrome as a browser has two advertisements in the table, both of which are fraudulent, yielding a 1.0.
In the case of numerical values (e.g., price, sales rank on the platform), we group them into buckets and determine the share of fraudulent advertisements for the buckets instead. We do not transform columns with unique values (e.g., advertisement ids) and do not use them as features.
The main advantage of using average fraud probabilities is the fact that the features themselves already contain direct information about the variable to be determined, namely fraud. Each value is a representation of the historical likelihood of fraud for that specific characteristic of the advertisement. Another advantage of this feature set is that all features used by the models are numerical. Therefore, we do not have to create more features to deal with categorical values, which would make the model both more complex and computationally more expensive.
One disadvantage of the feature set is that for values which have not been seen before, there is no average fraud value. One way to address this is to initialize that value with the average fraud probability over all advertisements. Alternatively, these advertisements could be flagged as suspicious so that a customer agent has to decide how to continue. Another disadvantage is the added calculation effort during testing: either we have to calculate the averaged values before we can test an advertisement, or we have to materialize the averages. With columnar in-memory databases, calculating the averages on the fly is comparatively fast, and we thus decided for this approach. However, if the added load of these aggregations tends to harm system performance, the averages can be materialized and updated from time to time with only negligible accuracy losses.
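In R, these conditional probabilities can be computed per column with a grouped mean, as the following sketch of Table 1's example shows. The column names mirror the toy data, not the production schema.

    ads <- data.frame(
      brand   = c("Nike", "Nike", "Reebok", "Adidas", "Nike"),
      browser = c("Chrome", "Chrome", "Firefox", "Firefox", "Firefox"),
      fraud   = c(1, 1, 1, 0, 0)
    )

    # The mean of the binary fraud label within each group of equal
    # values is exactly P(fraud | value).
    ads$brand_avg   <- ave(ads$fraud, ads$brand,   FUN = mean)
    ads$browser_avg <- ave(ads$fraud, ads$browser, FUN = mean)

    # brand_avg: 0.67 for Nike, 1 for Reebok, 0 for Adidas;
    # browser_avg: 1 for Chrome, 0.33 for Firefox (cf. Table 1 (b)).

In production, the same aggregation maps naturally to a grouped average that the columnar database computes on the fly.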
5. EVALUATION
In this section, we compare and evaluate the six combinations of models (Section 3) and feature sets (Section 4). As mentioned in Section 2, the data set contains correct classifications for 80% of the advertisements. This part of the data set is used for training the models and for cross-validation. For the remaining 20% of unclassified advertisements, we predict the classification with each of the models. The results are checked with a dedicated validation service that judges the predicted classification. It knows the correct classification for every advertisement and returns the number of true/false positives/negatives for our prediction.

5.1 Fβ-Score Measures
To measure the accuracy of our classifications, we calculate different Fβ-scores:

    Fβ = (1 + β²) · (precision · recall) / (β² · precision + recall)

Fβ-scores offer a way to weigh recall higher than precision by choosing β > 1. As outlined in Section 1, we prioritize maximizing the share of fraudulent advertisements found (i.e., recall). We chose β = 3 for a significantly higher importance of recall while not disregarding precision altogether. Additionally, we provide the F1-scores for the interested reader for better comparability.

5.2 Results
We achieved the best F3-score (0.892) using the averages feature set in combination with XGBoost (cf. Table 2). An overview of the results is shown in Figure 3 and the confusion matrix is shown in Table 3.

                         Value-Based  Averages
    Logistic regression  0.73         0.82
    Random Forest        0.80         0.87
    XGBoost              0.82         0.89

Table 2: F3-scores by feature set and model.

                            Actual
    Predicted     Legitimate   Fraud
    Legitimate    280,289      1,180
    Fraud         9,162        16,343

Table 3: Confusion matrix for XGBoost with the averages feature set.

[Figure 3: Comparison of (a) true and false positive rates and (b) F1/F3-scores over the cutoff value for logistic regression, Random Forest, and XGBoost using the averages feature set.]

The recall was 0.93, meaning 93% of all fraudulent advertisements have been correctly classified as such. Over 64% of the advertisements that customer support agents deal with are actually fraudulent. Our project partner adapts that ratio based on the absolute number of advertisements that have to be checked and the agents' workload. Our approach allows us to flexibly change the number of advertisements that have to be checked by adjusting the cutoff value in the model. The cutoff value is the threshold that all three models use to decide whether a particular advertisement is classified as fraudulent or not. If the predicted value is below the cutoff threshold, the advertisement is classified as legitimate, whereas it is classified as fraudulent if the value is larger than the cutoff value. Therefore, a higher value means that fewer advertisements are classified as fraudulent, which in turn reduces the workload of the agents. The cutoff value that maximizes the Fβ-score can be calculated by iteratively narrowing down its range (similarly to gradient descent).
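A simple grid-scan variant of this search is sketched below; probs and labels stand for the predicted probabilities and true classifications of a validation set.

    # F-beta score computed from binary predictions and true labels.
    fbeta_score <- function(labels, predicted, beta = 3) {
      tp <- sum(predicted == 1 & labels == 1)
      fp <- sum(predicted == 1 & labels == 0)
      fn <- sum(predicted == 0 & labels == 1)
      precision <- tp / (tp + fp)
      recall    <- tp / (tp + fn)
      (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
    }

    # Scan a grid of cutoffs and keep the best one; the range could
    # then be narrowed iteratively around the maximum, as described
    # above.
    cutoffs <- seq(0.01, 0.99, by = 0.01)
    scores  <- sapply(cutoffs, function(cut)
      fbeta_score(labels, as.numeric(probs > cut)))
    best_cutoff <- cutoffs[which.max(scores)]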
An advantage of decision forests over logistic regression is that they are able to model dependencies between multiple features. As a simple example, consider the price of a tennis racket by a large tennis brand and the exact racket model. Fraudulent actors often try to lower the prices in their advertisements to lure people into buying. Which price is to be considered low strongly depends on the model. Logistic regression is not able to differentiate between a simple entry-level tennis racket and a professional top-class tennis racket, both offered for USD 50. Decision forests, on the other hand, are able to recognize the dependency between them and can decide that this might be a low price for a top-class racket, but not for an entry-level one. We believe this to be the major reason why decision forests perform better for both feature sets.
One of the reasons why the averages feature set performs better is that the numerical features are not necessarily monotonic in their influence on fraud. Logistic regression is not directly capable of modeling these ranges appropriately without further feature engineering. In contrast, Random Forest is (to some extent) able to recognize particularly interesting ranges of numerical features, but there is no guarantee that it recognizes all of these ranges for every feature [5]. When using averages instead, all features become monotonic with regard to their influence on fraud: a higher value always means a higher probability of fraud.
Looking at recent data mining challenges, the superiority of XGBoost over Random Forest was somewhat expected. Especially XGBoost's out-of-the-box performance allows for fast parameter tuning.

6. CONCLUSION
In this paper, we presented the results of various approaches we implemented in order to detect fraud on a real-world data set of a large e-commerce company. The most promising approach to detect fraud was XGBoost in combination with the averages feature set, yielding an F3-score of 0.89. We identified over 93% of all fraudulent advertisements and incorrectly classified 3% of all non-fraudulent advertisements as fraudulent. Besides the Random Forest model, we used logistic regression to improve the understanding and interpretability of the fraud detection for the customer agents. The proposed models can be easily adjusted to control the number of items that are forwarded to the customer agents. Having that kind of control enables our partner to adjust the agents' workload, e.g., during peaks when customer agents could be overloaded with too many items to process.
For our implementation, we used a columnar in-memory database (i.e., SAP HANA) in conjunction with R. Using these technologies, we can iterate on our models quickly while being highly flexible at the same time. For large real-world setups, this setup has the advantage of not requiring a new IT system to be added to the landscape in order to train our models. The analytical capabilities of column stores allow us to selectively train on a subset of the data without expensive ETL pre-processing and to work on the most recent data, which allows fast reactions to changing fraudulent behaviours.

7. REFERENCES
[1] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proc. 22nd ACM SIGKDD, pages 785–794, 2016.
[2] CyberSource. 2013 online fraud report: Online payment fraud trends, merchant practices, and benchmarks. Technical Report, 14th Annual Edition, 2013.
[3] P. Große et al. Bridging two worlds with RICE: Integrating R into the SAP in-memory computing engine. PVLDB, 4(12):1307–1317, 2011.
[4] H. Plattner. The impact of columnar in-memory databases on enterprise systems. PVLDB, 7(13):1722–1729, 2014.
[5] J. R. Quinlan. Induction of decision trees. Mach. Learn., 1(1):81–106, Mar. 1986.