Detecting Fraudulent Advertisements on a Large E-Commerce Platform

Tim Zimmermann1, Timo Djürken1, Arne Mayer1, Michael Janke1, Martin Boissier2, Christian Schwarz2, Rainer Schlosser2, Matthias Uflacker2
1,2 Hasso Plattner Institute, University of Potsdam, Germany
{firstname.lastname}@{1 student.hpi.de, 2 hpi.de}

ABSTRACT
E-commerce platforms face the challenge of efficiently and accurately detecting fraudulent activity every day. Manually checking every advertisement for fraud does not scale and is financially unviable. By using automated learning algorithms, we can drastically reduce the number of advertisements that need to be checked by humans. In this paper, we present the results of a joint project with a large e-commerce company selling used goods. Using our partner's advertisement data, we implemented several classification approaches to automatically recognize fraudulent activity. With the help of the proposed fraud detection, customer service agents only need to check about 8% of all advertisements manually for fraud. Simultaneously, we detect more than 93% of all fraudulent advertisements.

Keywords
Fraud Detection; Random Forests; XGBoost; Logistic Regression; Boosted Trees; E-Commerce

2017, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1. DETECTING ONLINE FRAUD
The success of e-commerce platforms strongly depends on the trust that customers have in them. If a platform is home to fraudulent offerings, customers are less likely to interact with that platform. Hence, minimizing the number of fraudulent advertisements is crucial for every e-commerce company. In 2012, online retailers lost an estimated revenue of USD 3.5 billion due to fraud [2].
This figure does not even include the expenses that retailers face to fight crime on their platforms. Technology already helps to reduce the number of people required by automatically suggesting a classification for a given offering. However, many companies still have to employ a large number of dedicated staff to make the final decision whether an offering is fraudulent or not.
Fraudulent activity cannot always be easily identified. It is crucial to take the dependencies between various characteristics of the offering into account. While these dependencies can be simple in some cases, they are often complex and not at all obvious to humans. Approaches using hand-crafted rule sets pose several problems. On the one hand, it is hard for humans to recognize fraud patterns and maintain them over time as fraudsters adapt their behavior. To create such rules, highly skilled experts are needed, who are both rare and expensive. On the other hand, fraudsters and companies are in constant competition to come up with ever-improving strategies to carry out fraud and to fight it, respectively. Therefore, complex models comprised of rules that humans once developed are likely to prove ineffective once fraudsters evolve their strategy. Fraud detection needs to be both adaptable and verifiable.
In general, there is always a trade-off between correctly identifying fraudulent advertisements on the one hand and incorrectly classifying actually legitimate advertisements as fraudulent on the other. For every e-commerce company it is expensive to have fraud on the platform. Consequently, we want to identify as many of the fraudulent advertisements as possible. We can accept incorrectly classifying some of the legitimate advertisements as fraudulent, because the automatic classification is solely used as a filter for the customer support agents, who then check those advertisements manually. Therefore, if a fraudulent advertisement is not classified as such, it will not be checked by an agent unless a user files a complaint. On the other hand, if an advertisement is classified as fraudulent but is actually legitimate, the customer agent can overrule the classification of the algorithm. Another challenge of hand-crafted rules for classification is to determine how likely a given offering is fraudulent based on the rule set. Having that functionality can reduce the workload of the agents reviewing the classification, as a prediction probability above a certain threshold might already be sufficient; in that case, the advertisement's classification could be determined without any human looking at it.
In this paper, we compare three approaches to classify fraud: logistic regression and decision tree-based models in the form of Random Forests and XGBoost. We compare their effectiveness to recognize fraud with two different feature sets: one is based on the characteristics and values of the offerings (see Section 4), the other on the average fraud probability for each of these values.

1.1 Columnar In-Memory Databases
For our implementation, we decided on an architecture using the statistical language R and a columnar in-memory database (in our case SAP HANA). The decision for R was rather straightforward, as R is open source and the de-facto standard for fast prototyping of machine learning algorithms. The decision for a columnar in-memory database was made for several reasons. Databases with analytical capabilities allow us to select a new data set for training on the fly without long-running MapReduce or ETL (extract, transform, and load) jobs. At the same time, we can exploit the fact that we always have access to the most recent data to train our models on [4]. This is particularly interesting for identifying the steadily changing behaviour of fraudsters. Whenever a major shift in fraudulent behaviour is observed, models can be retrained within minutes.
During this project, we have used the columnar database to select data on the fly and a stand-alone R installation for iterative work on our models. Should our proposed approach be put into production, there are two ways to proceed, both offering high performance. First, R and HANA can be linked using the system's shared memory as described in [3]. This way, data can be directly shared between the two processes without any serialization or network transmission, while still having all capabilities of R at hand. Second, SAP HANA provides the so-called predictive analysis library (PAL), offering a large variety of machine learning algorithms that are directly executed on the actual data inside the database engine.

2. PRELIMINARIES
For our research, we worked with data from a production system of our project partner. It includes a subset of advertisements along with some meta information (see Figure 1).

[Figure 1: Data schema of our project partner's data set, comprising the tables COMPLAINTS, AD_HISTORY, TRAIN_HISTORY, ADVERTISEMENT, TRAINING_SET, and ATTRIBUTE_VALIDATIONS.]

The data set contains a total of ∼1.4 million advertisements. For some of these advertisements there is more than one version, which means that the advertisement has changed over the observation period. We treat these versions as separate advertisements and can therefore work with about 1.9 million advertisements. For around 1.5 million (≈ 80%) of those advertisements we know the correct classification, i.e., whether a given advertisement is fraudulent. Apart from the current states of advertisements, the data also includes so-called attribute validations as well as historical information. Attribute validations are checks that are performed whenever an advertisement is created or changed, e.g., comparing the price to the average price for the offered item. The history table contains the different versions of the advertisements.

3. STRATEGIES FOR MODELING FRAUD
In order to successfully reduce the number of advertisements that have to be checked manually, there are two major aspects we need to consider. First, the algorithm should produce as few false negatives as possible. That is, it should not misclassify as legitimate advertisements that are in fact fraudulent. At the same time, the pool of possibly fraudulent advertisements should be as small as possible. Second, the algorithm needs to keep up with the latest strategies of fraudsters. Fraudsters react to defensive mechanisms employed by companies and improve the techniques they use to fool and exploit both users and companies.

3.1 Dynamic Adaption
Traditional approaches are based on domain experts. They use their knowledge and experience to create a set of rules. A rule consists of a number of checks that an advertisement is tested against. The result is a classification as fraudulent or legitimate, depending on whether the rule matched or not. Results of multiple rules are aggregated to make a final decision on the classification of the advertisement.
This approach has two major drawbacks. First, it takes a lot of time to develop a sophisticated set of rules that is tuned to the individual use case. Second, these rules constantly need to be updated by the experts, which again takes time. Besides the cost of paying the experts, this time span can already cause significant damage and result in big losses to the company.
In contrast, machine learning offers techniques with which the criteria to classify advertisements can be generated semi- or fully automatically, while at the same time adapting to the ever-changing deceptions of fraudsters. We compare three different techniques to classify advertisements: logistic regression, Random Forest, and XGBoost. The process of engineering the criteria is described in Section 4.

3.2 Logistic Regression
Logistic regression is a form of supervised learning. A function is fitted to a labeled data set; in our case the advertisements are labeled with fraud (1) or no fraud (0). The function is then used to predict whether new advertisements are fraudulent or not. The result is always a value between 0 and 1; the higher it is, the more likely that advertisement is fraudulent. Due to the nature of regression, it is not able to work with categorical (non-numerical) features. One work-around is to transform categorical features into binary ones, so-called one-hot encoding. Assume the brand attribute only had three distinct values: Nike, Reebok, and Adidas. Instead of having a single feature brand, we would have three features, specifically is_nike, is_reebok, and is_adidas. The value of each of these features is either 0 or 1 and can thus be used by logistic regression. The main disadvantage of this approach is that it adds a lot of features to the model and therefore increases the computation time required to generate the model. At the same time, many classification algorithms are limited in the number of features they can efficiently handle.
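As an illustration of this work-around, the following R sketch one-hot encodes a toy brand column and fits a logistic regression on it. The data frame and column names are made up for the example and do not reflect our partner's actual schema.

    # Toy data: five advertisements with a categorical brand
    # attribute and a binary fraud label.
    ads <- data.frame(
      brand = c("Nike", "Nike", "Reebok", "Adidas", "Nike"),
      fraud = c(1, 1, 1, 0, 0)
    )

    # model.matrix() performs the one-hot expansion: the single
    # brand column becomes one binary indicator column per value.
    one_hot <- model.matrix(~ brand - 1, data = ads)
    colnames(one_hot)  # "brandAdidas" "brandNike" "brandReebok"

    # glm() with a binomial family fits the logistic regression;
    # it applies the same expansion internally.
    model <- glm(fraud ~ brand, data = ads, family = binomial)

    # The prediction is a value between 0 and 1, interpreted as
    # the probability that an advertisement is fraudulent.
    probs <- predict(model, newdata = ads, type = "response")

On such tiny toy data the fit is degenerate (R warns about fitted probabilities of 0 or 1), but the mechanics carry over to a realistic training set.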
3.3 Random Forest
Random Forests are a collection of decision trees that are trained by repeatedly taking random samples from the training set. As the technique also relies on labeled data, it is a supervised learning technique as well. In contrast to logistic regression, Random Forests can natively handle categorical features with an arbitrary number of distinct values. However, the R implementation we used is only able to handle features with a maximum of 53 distinct values. While this is enough for some of the features we used, it does not solve the problem in general. The brand attribute, for example, contains more than 53 different brands. Oftentimes it is not necessary to know the exact brand; it might be sufficient to know the segment, e.g., sports shoes or running shoes. The segment can then be used as a "reduced" feature instead.
We built a forest of 4,000 trees. This forest was assembled by 20 parallel workers building forests of 200 trees each, which were combined into a single model in the end.
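A minimal sketch of this parallel construction with the randomForest package is shown below; training_set and the fraud label are placeholders for our partner's actual data.

    # Sketch: 20 workers build 200 trees each; the partial forests
    # are then merged into a single 4,000-tree model.
    library(randomForest)
    library(parallel)

    build_subforest <- function(worker_id) {
      # Each call trains an independent forest of 200 trees on the
      # (placeholder) training set.
      randomForest(as.factor(fraud) ~ ., data = training_set, ntree = 200)
    }

    # Distribute the 20 sub-forests across cores; on Windows,
    # parLapply() with a cluster would replace mclapply().
    sub_forests <- mclapply(1:20, build_subforest, mc.cores = 20)

    # randomForest::combine() merges multiple forests into one model.
    forest <- do.call(combine, sub_forests)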
3.4 XGBoost
XGBoost – short for extreme gradient-boosted trees – is a recent algorithm that has attracted many researchers over the past year [1]. Similar to Random Forests, XGBoost creates (boosted) decision trees and belongs to the group of supervised learning techniques. One of the key advantages of XGBoost – besides the pure learning results – is its performance and scalability, allowing for very fast training and thus faster parameter tuning.
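The following sketch shows how such a model could be trained with the xgboost R package; the feature matrix, labels, and hyperparameter values are placeholders rather than the exact settings we used.

    library(xgboost)

    # xgboost works on numeric matrices; the averages feature set
    # (Section 4.2) satisfies this requirement directly.
    dtrain <- xgb.DMatrix(data = feature_matrix, label = labels)

    model <- xgb.train(
      params = list(
        objective = "binary:logistic",  # predict a fraud probability
        eta       = 0.1,                # learning rate (placeholder)
        max_depth = 6                   # tree depth (placeholder)
      ),
      data    = dtrain,
      nrounds = 100                     # number of boosting rounds
    )

    # Predictions are probabilities in [0, 1] that are later
    # compared against the cutoff value (see Section 5.2).
    probs <- predict(model, feature_matrix)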
3.5 Visualizing the Decision Process
One of our main objectives was an easy-to-understand and digestible visualization of the automated fraud detection. Understanding why a – from a customer agent's perspective – black-box algorithm classified a particular advertisement as legitimate or fraudulent is essential to build up trust between the system and the customer agent. Additionally, it helps the users to quickly decide which characteristics of the advertisement to look at.

3.5.1 Logistic Regression
In the training stage, logistic regression builds a model which consists of an intercept value and feature coefficients that are used to predict the fraud probability. We built two visualizations in order to help the customer agent better understand the learned model. The first visualization is supposed to give the customer agent an overview of the features that the model uses to predict. Here, we simply show the highest/lowest coefficients of the model (note that normalization is required for this step). This very simple approach helps to make the results interpretable, as the customer agent gains trust in the model when it (hopefully) detects fraud characteristics that conform with the agent's experience.
The second visualization is rather simple and optimized to help the agent decide whether a particular advertisement is legitimate or fraudulent (for an example see Figure 2). For all historical legitimate and fraudulent advertisements, we calculate average explanatory variables that are put into the regression model. For the advertisement that is currently examined, we then compare against the advertisement's explanatory variables and show the ratio (i.e., the 'fraud influence'). We do not aim to give the customer agent a full understanding of the model's decision, as even simple models are usually too complex to re-enact mentally in a reasonable time frame. Instead, we want to give hints which features should be examined and allow the agent to check for plausibility.

[Figure 2: Visualization of the relevant features (user information, browser, average ad count, operating system, ad creation time, letter code, IP letter code, operating system cluster, browser cluster) and their influence (from −1.0 to 1.0) on the fraud probability for an exemplary advertisement.]

3.5.2 Decision Tree-Based Models
While it is possible to visualize decision trees, it is difficult to condense the process into a few simple but meaningful figures. On the one hand, providing a figure for each tree of a larger forest is impractical for the number of trees in our models. On the other hand, using a random sample of trees can be non-representative and might confuse the user. Hence, we decided to solely show the relevance results of the logistic regression to the customer agents.

4. FEATURE ENGINEERING
To detect fraud, we worked with two very different feature sets. The first feature set relies on the attributes and features provided by our project partner as well as additional features we manually created; we call it the value-based feature set. The second feature set is based on the average fraud probability for a given value, and we therefore call it the averages feature set.

4.1 Advertisement Characteristics
The value-based feature set consists of multiple features of varying complexity. This includes basic attributes like the geographical origin of the user's IP address, but also more complex combinations such as the IP Usage Counter. A fraudster might have numerous advertisements online at the same time to increase the chances of success. For this case, we calculate the number of advertisements that were created with the same IP address and try to find IP addresses with a large number of advertisements.
IP addresses can also be analyzed with regard to their usage and connections to fraud in the past. However, it is not easy to identify users based on the IP, since for most customers the IP changes every day. Nevertheless, the IP can in many cases be connected to an internet service provider simply by using its sub ranges. Users typically only log in with IP addresses of a small number of different providers, which in turn means that the number of different providers can be used as a feature. The current fraud detection system generates attribute validations (see Section 2). We have included the most useful of these attribute validations in our feature set as well. In total, we are using 24 (partially hand-crafted) features in our value-based feature set.
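To make this feature construction concrete, here is a small R sketch of the IP Usage Counter and the provider-count feature. The column names (ip_address, provider, customer_id) are hypothetical and chosen only for illustration.

    # IP Usage Counter: for each advertisement, count how many
    # advertisements were created from the same IP address.
    ads$ip_usage_counter <- ave(rep(1, nrow(ads)), ads$ip_address, FUN = sum)

    # Number of distinct providers per customer, derived from a
    # (hypothetical) provider column mapped from the IP sub ranges.
    provider_count <- tapply(ads$provider, ads$customer_id,
                             function(p) length(unique(p)))
    ads$provider_count <- provider_count[as.character(ads$customer_id)]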
4.2 Using Historical Fraud Information
The averages feature set makes use of the average fraud probability for all values in the data set. As a first step, for every value in (almost) every column, we calculate the share of fraudulent advertisements with that value compared to all advertisements with that value. Specifically, for each of these columns we calculate the conditional probability of fraud P(fraud | v) for every possible value v. We then use these probabilities as features. Table 1 illustrates this process.

(a) Initial data:
    ID  BRAND   BROWSER  FRAUD
    1   Nike    Chrome   1
    2   Nike    Chrome   1
    3   Reebok  Firefox  1
    4   Adidas  Firefox  0
    5   Nike    Firefox  0

(b) Generated features:
    ID  BRAND  BROWSER  FRAUD
    1   0.67   1        1
    2   0.67   1        1
    3   1      0.33     1
    4   0      0.33     0
    5   0.67   0.33     0

Table 1: Initial data (a) and generated features (b).

Let us assume that in the learning data the brand Nike appears a total of three times and that two of these advertisements are fraudulent. The fraud probability for Nike then calculates as:

    P(fraud | brand = Nike) = P(fraud ∩ brand = Nike) / P(brand = Nike) = 0.4 / 0.6 ≈ 0.67

Thus, all values of Nike in the brand column are replaced with 0.67 in the feature set. Accordingly, Google's Chrome as a browser has two advertisements in the table, both of which are fraudulent, yielding a 1.0.
In the case of numerical values (e.g., price, sales rank on the platform), we group them into buckets and determine the share of fraudulent advertisements for the buckets instead. We do not transform columns with unique values (e.g., advertisement ids) and do not use them as features.
The main advantage of using average fraud probabilities is the fact that the features themselves already contain direct information about the variable to be determined, namely fraud. Each value is a representation of the historical likelihood of fraud for that specific characteristic of the advertisement. Another advantage of this feature set is that all features used by the models are numerical. Therefore, we do not have to create more features to deal with categorical values, which would make the model both more complex and computationally more expensive.
One disadvantage of the feature set is that for values which have not been seen before, there is no average fraud value. One way to address this is to initialize that value with the average fraud probability over all advertisements. Alternatively, these advertisements could be flagged as suspicious so that a customer agent has to decide how to continue. Another disadvantage is the added calculation effort during testing: either we have to calculate the averaged values before we can test an advertisement, or we have to materialize the averages. With columnar in-memory databases, calculating the averages on the fly is comparatively fast, and we thus decided for this approach. However, if the added load of these aggregations tends to harm system performance, the averages can be materialized and updated from time to time with only negligible accuracy losses.
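In R, these conditional probabilities can be computed per column with a grouped mean, as the following sketch of Table 1's example shows. The column names mirror the toy data, not the production schema.

    ads <- data.frame(
      brand   = c("Nike", "Nike", "Reebok", "Adidas", "Nike"),
      browser = c("Chrome", "Chrome", "Firefox", "Firefox", "Firefox"),
      fraud   = c(1, 1, 1, 0, 0)
    )

    # The mean of the binary fraud label within each group of equal
    # values is exactly P(fraud | value).
    ads$brand_avg   <- ave(ads$fraud, ads$brand,   FUN = mean)
    ads$browser_avg <- ave(ads$fraud, ads$browser, FUN = mean)

    # brand_avg: 0.67 for Nike, 1 for Reebok, 0 for Adidas;
    # browser_avg: 1 for Chrome, 0.33 for Firefox (cf. Table 1 (b)).

In production, the same aggregation maps naturally to a grouped average that the columnar database computes on the fly.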
5. EVALUATION
In this section, we compare and evaluate the six combinations of models (Section 3) and feature sets (Section 4). As mentioned in Section 2, the data set contains correct classifications for 80% of the advertisements. This part of the data set is used for training the models and for cross-validation. For the remaining 20% of unclassified advertisements, we predict the classification with each of the models. The results are checked with a dedicated validation service that judges the predicted classification. It knows the correct classification for every advertisement and returns the number of true/false positives/negatives for our prediction.

5.1 Fβ-Score Measures
To measure the accuracy of our classifications, we calculate different Fβ-scores:

    Fβ = (1 + β²) · (precision · recall) / (β² · precision + recall)

Fβ-scores offer a way to weigh recall higher than precision by choosing β > 1. As outlined in Section 1, we prioritize maximizing the share of fraudulent advertisements found (i.e., recall). We chose β = 3 for a significantly higher importance of recall while not disregarding precision altogether. Additionally, we provide the F1-scores for the interested reader for better comparability.

5.2 Results
We achieved the best F3-score (0.892) using the averages feature set in combination with XGBoost (cf. Table 2). An overview of the results is shown in Figure 3 and the confusion matrix is shown in Table 3.

                         Value-Based  Averages
    Logistic regression  0.73         0.82
    Random Forest        0.80         0.87
    XGBoost              0.82         0.89

Table 2: F3-scores by feature set and model.

                            Actual
    Predicted     Legitimate   Fraud
    Legitimate    280,289      1,180
    Fraud         9,162        16,343

Table 3: Confusion matrix for XGBoost with the averages feature set.

[Figure 3: Comparison of (a) true and false positive rates and (b) F1/F3-scores over the cutoff value for logistic regression, Random Forest, and XGBoost using the averages feature set.]

The recall was 0.93, meaning 93% of all fraudulent advertisements have been correctly classified as such. Over 64% of the advertisements that customer support agents deal with are actually fraudulent. Our project partner adapts that ratio based on the absolute number of advertisements that have to be checked and the agents' workload. Our approach allows us to flexibly change the number of advertisements that have to be checked by adjusting the cutoff value in the model. The cutoff value is the threshold that all three models use to decide whether a particular advertisement is classified as fraudulent or not. If the predicted value is below the cutoff threshold, the advertisement is classified as legitimate, whereas it is classified as fraudulent if the value is larger than the cutoff value. Therefore, a higher value means that fewer advertisements are classified as fraudulent, which in turn reduces the workload of the agents. The cutoff value that maximizes the Fβ-score can be calculated by iteratively narrowing down its range (similarly to gradient descent).
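A simple grid-scan variant of this search is sketched below; probs and labels stand for the predicted probabilities and true classifications of a validation set.

    # F-beta score computed from binary predictions and true labels.
    fbeta_score <- function(labels, predicted, beta = 3) {
      tp <- sum(predicted == 1 & labels == 1)
      fp <- sum(predicted == 1 & labels == 0)
      fn <- sum(predicted == 0 & labels == 1)
      precision <- tp / (tp + fp)
      recall    <- tp / (tp + fn)
      (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
    }

    # Scan a grid of cutoffs and keep the best one; the range could
    # then be narrowed iteratively around the maximum, as described
    # above.
    cutoffs <- seq(0.01, 0.99, by = 0.01)
    scores  <- sapply(cutoffs, function(cut)
      fbeta_score(labels, as.numeric(probs > cut)))
    best_cutoff <- cutoffs[which.max(scores)]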
An advantage of decision forests over logistic regression is that they are able to model dependencies between multiple features. As a simple example, consider the price of a tennis racket by a large tennis brand and the exact racket model. Fraudulent actors often try to lower the prices in their advertisements to lure people into buying. Which price is to be considered low strongly depends on the model. Logistic regression is not able to differentiate between a simple entry-level tennis racket and a professional top-class tennis racket, both offered for USD 50. Decision forests, on the other hand, are able to recognize the dependency between them and can decide that this might be a low price for a top-class racket, but not for an entry-level one. We believe this to be the major reason why decision forests perform better for both feature sets.
One of the reasons why the averages feature set performs better is that the numerical features are not necessarily monotonic in their influence on fraud. Logistic regression is not directly capable of modeling these ranges appropriately without further feature engineering. In contrast, Random Forest is (to some extent) able to recognize particularly interesting ranges of numerical features, but there is no guarantee that it recognizes all of these ranges for every feature [5]. When using averages instead, all features become monotonic with regard to their influence on fraud: a higher value always means a higher probability of fraud.
Looking at recent data mining challenges, the superiority of XGBoost over Random Forest was somewhat expected. Especially XGBoost's out-of-the-box performance allows for fast parameter tuning.

6. CONCLUSION
In this paper, we presented the results of various approaches we implemented in order to detect fraud on a real-world data set of a large e-commerce company. The most promising approach to detect fraud was XGBoost in combination with the averages feature set, yielding an F3-score of 0.89. We identified over 93% of all fraudulent advertisements and incorrectly classified 3% of all non-fraudulent advertisements as fraudulent. Besides the Random Forest model, we used logistic regression to improve the understanding and interpretability of the fraud detection for the customer agents. The proposed models can be easily adjusted to control the number of items that are forwarded to the customer agents. Having that kind of control enables our partner to adjust the agents' workload, e.g., during peaks when customer agents could be overloaded with too many items to process.
For our implementation, we used a columnar in-memory database (i.e., SAP HANA) in conjunction with R. Using these technologies, we can iterate on our models quickly while being highly flexible at the same time. For large real-world setups, this setup has the advantage of not requiring a new IT system to be added to the landscape in order to train our models. The analytical capabilities of column stores allow us to selectively train on a subset of the data without expensive ETL pre-processing and to work on the most recent data, which allows fast reactions to changing fraudulent behaviours.

7. REFERENCES
[1] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proc. 22nd ACM SIGKDD, pages 785–794, 2016.
[2] CyberSource. 2013 online fraud report: Online payment fraud trends, merchant practices, and benchmarks. Technical Report, 14th Annual Edition, 2013.
[3] P. Große et al. Bridging two worlds with RICE: Integrating R into the SAP in-memory computing engine. PVLDB, 4(12):1307–1317, 2011.
[4] H. Plattner. The impact of columnar in-memory databases on enterprise systems. PVLDB, 7(13):1722–1729, 2014.
[5] J. R. Quinlan. Induction of decision trees. Mach. Learn., 1(1):81–106, Mar. 1986.