=Paper= {{Paper |id=Vol-3052/paper15 |storemode=property |title=Machine Learning Methods for Detecting Fraud in Online Marketplaces |pdfUrl=https://ceur-ws.org/Vol-3052/paper15.pdf |volume=Vol-3052 |authors=Raoul Dekou,,Sabljic Savo,,Simon Kufeld,,Diana Francesca,,Ricardo Kawase |dblpUrl=https://dblp.org/rec/conf/cikm/DekouSKFK21 }} ==Machine Learning Methods for Detecting Fraud in Online Marketplaces== https://ceur-ws.org/Vol-3052/paper15.pdf
       Machine Learning Methods for Detecting Fraud in
                    Online Marketplaces

    Raoul Dekou1, * , Sabljic Savo2 , Simon Kufeld3 , Diana Francesca2 , and Ricardo Kawase1
               1
                    Mobile.de, Marktplatz 1, Europarc Dreilinden, 14532, Berlin, Germany
                         2
                           Codecentric AG, Hochstraße 11, 42697, Solingen, Germany
                   3
                      Inovex GmbH, Ludwig-Erhard-Allee 6, 76131, Karlsruhe, Germany
                       *
                         Corresponding author: Raoul Dekou, rdekou@team.mobile.de




Abstract

Connecting buyers and sellers in a safe and secure environment is one of the biggest challenges in online marketplaces. Probabilistic models built upon user-item databases address the challenge, but often encounter issues such as a lack of stability and robustness. These issues are magnified in fraud scenarios, where datasets are highly imbalanced and noisy and malicious users deliberately adapt their behaviors to avoid detection. In this context, we leveraged the power of the existing open-source machine learning libraries H2O and Catboost and designed a pipeline to collect, process and predict the likelihood of a private seller's listing being fraudulent. We found that the stacked ensemble model provides the best performance (F1=0.73) when compared to other commonly used models in the field. Further, our models are benchmarked on a public Kaggle dataset, the TalkingData AdTracking Fraud Detection Challenge, where we compared them to other studies and highlighted their generalizability and effectiveness at handling online fraud.

Copyright © by the paper's authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
In: RWTH Aachen, CEUR-WS: Proceedings of The 2021 International Workshop on Privacy, Security, and Trust in Computational Intelligence, Gold Coast, Queensland, Australia, 01-11-2021, published at https://xuyun-zhang.github.io/pstci2021/

1    Introduction

As reported in [12], retail e-commerce sales worldwide accounted for 1.86 trillion USD in 2016 and are expected to rise to 4.48 trillion USD in 2021. In the meantime, a recent report on fraud attack trends in the first quarter of 2021¹ confirmed the shift of attacks towards retail websites and estimated that 25% of this traffic is malicious. Such an increase in activity has put considerable pressure on marketplaces, which need to ensure the reliability and security of their services while inspiring trust among buyers.
    Unfortunately, the success of online marketplaces attracts unwanted attention from malicious users who try to abuse the platforms for personal monetary gain. mobile.de does not control transactions between buyers and sellers. It is a "matchmaking" platform that bridges the gap between the two sets of entities. Once a user with malicious intent creates an account, he/she also creates an attractive vehicle listing (the goal is to get as many leads as possible). To achieve this, fraudsters take a series of lead-boosting steps. They upload listings of high-demand vehicles to the platform and set very low yet plausible prices for the vehicles. Since every aspect of the listing looks legitimate (the website, the seller and the vehicle), buyers lower their guard and contact the fraudster. Through a series of interactions, the fraudster is able to convince the buyer (now a victim) to send a pre-payment money transfer, usually as a "reservation" fee. Once this happens, and the damage is done, the victims realize their mistake; they contact mobile.de's Customer Service and report the case. There are very few cases

¹ https://securityboulevard.com/2021/07/top-industry-specific-fraud-attack-trends-from-q1-2021/ (accessed on July 2021).
that reach this point; however, the total monthly loss can soar to thousands of Euros.
    Satisfied customers (buyers and sellers) are the foundation of a valuable and successful marketplace. Thus, providing a secure environment and a safe experience to our customers is a top priority at mobile.de, and it is the motivation for this work, which aims at preventing and detecting fraudulent activity. To achieve our goals, we tackled the fraud detection problem by leveraging user-generated data and building machine learning models which are able to identify fraudulent activities. It is also essential to design robust models of high precision which also generalise well. This paper describes our approach to mitigating fraudulent activity by fraudsters posing as private sellers. Our contribution is twofold. First, we describe a production pipeline to collect, process and score sellers' listings using the open-source machine learning libraries Catboost² and H2O³. We briefly highlight how to efficiently use these libraries to pre-select relevant candidate models and tune their hyper-parameters. Second, we demonstrate that our approach could potentially inspire other use cases by verifying our detection methods on a sample of a large dataset publicly available at Kaggle.com⁴.
    The remainder of this paper is structured as follows. In Section 2, we discuss existing work in the field. In Section 3, we provide a deeper understanding of the problem and formalize it. In Sections 4 and 5, we describe our methodology to tackle the problem. Section 6 contains our results, followed by the conclusion and prospects.

2    Related Work

Techniques used to detect fraud can be divided into two groups: expertise-based and data-driven. In the first, experts use their knowledge to build a set of rules that are tested and refined to filter out fraudulent activities. However, contrary to machine learning solutions, traditional expert techniques sometimes lack the ability to model non-trivial online connections [24]. The second set of techniques, data-driven, i.e. machine learning solutions, overcomes this issue but yields different challenges. While the increase of activity in marketplaces generates massive datasets which require model scalability, the low occurrence of fraudulent events produces imbalanced datasets. Maintaining both a high precision and a high recall is often a challenge, and many models produce significant misclassification errors [2] which result in genuine customers being flagged as fraudulent. Finally, there is also a need for dynamic solutions, given that fraudsters adapt their behaviors to the point where they are able to bypass detection by machine learning models.
    The literature offers various examples of machine learning methods which aim at detecting fraud. Najem and Kadeem's [16] recent survey on fraud detection techniques in e-commerce provides a broad view of the performance of several models on various datasets. It highlights that Random Forest (RF) is the most used and usually the most accurate of all methods. Though Naive Bayes algorithms are easy to implement, they are limited compared to decision trees when it comes to modelling non-linear problems. Such information was taken into consideration when selecting candidate models for our pipeline, which consists essentially of decision tree ensembles (RF, Xgboost and Catboost). For instance, Kanei et al. [10] trained a Random Forest model for detecting fraudulent ad requests. In their study, they demonstrated that the model robustness challenge could be addressed by means of features which could not be controlled by fraudsters, such as the network statistics from clients and publishers. This set-up allowed them to improve their recall rate by 10%. Renjith [20] described a pipeline using a Support Vector Machine (SVM) to detect fraudulent sellers in an online marketplace. The author specifically pointed out that a cold start problem may arise for new users when using predictive models with seller or transaction information as features. In our approach, the cold start effect was mitigated by removing these types of features. Gupta et al. [8] benchmarked ensemble models for predicting the likelihood of a click on a mobile phone advertisement being fraudulent on a publicly available Kaggle dataset. They tested two configurations: traditional and Big Data. In the traditional configuration, they combined different sampling techniques (SMOTE, stratified sampling, etc.) to reduce the data size and handle the imbalanced training set. This dataset, which has been widely used in previous studies [8, 14, 22], is employed in our study, and the results from Gupta et al. [8] are used as our baseline. In our work, we applied the same preprocessing techniques and compared our results to their best model, a Two Class Decision Forest⁵ with an F1 score of 0.944. Using a sample of the same dataset, Minastireanu and Mesnita [14] trained a Lightgbm model to detect fraudulent clicks and reported an accuracy of 98%. The authors specifically described an example of how feature engineering on the original feature set (click time, device,

² https://www.catboost.ai/ (accessed on July 2021).
³ https://www.h2o.ai/ (accessed on 16 July 2021).
⁴ https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data (downloaded on 16 July 2021).
⁵ https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-module-reference/two-class-decision-forest (accessed on July 2021).
channel, etc.) and K-fold cross validation are combined to enable high performance. Besides, by testing their model on a large data sample (18 million user clicks), they proved the robustness of the boosting machine for the case study. In the same context, Mohammed et al. [15] investigated the scalability of Random Forest, Balanced Bagging Ensemble and Gaussian Naive Bayes on massive and highly imbalanced credit card fraud datasets. They found that random undersampling is effective at handling imbalanced datasets and, combined with RF, is suitable for real-time applications on large datasets. In their study, the Random Forest model provided the highest recall of 91%. Rajora et al. [19] benchmarked the performance of various machine learning algorithms on a credit card transaction dataset with 31 attributes. They used a random undersampling technique to address the data imbalance and Principal Component Analysis (PCA) [1] as a dimensionality reduction technique. On top of the PCA features, a time feature corresponding to the time delay from the first transaction is part of the training set. Furthermore, the authors illustrated how the inclusion of this feature can impact the performance: RF provided better performance without the time feature, while the Gradient Boosting Regression Tree performance was constant. Meng et al. [13] also used a real-world credit card transactions dataset and combined Xgboost and sampling techniques to achieve great performance. The SMOTE technique allowed an increase of the recall from 0.8062 to 0.9 and of the AUC from 0.9795 to 0.9853. Mohammed et al. [15] reported that Neural Networks tend to overfit on fraud datasets and struggle to handle imbalanced datasets. Nevertheless, as illustrated by Adewumi and Akinyelu [2] in their survey, such techniques are also commonly used for credit card fraud detection. Najem and Kadeem [16] pointed out that hybrid methods, which combine several methods to build a robust learner, provide better performance than individual learners. For example, Wang et al. [23] built a hybrid mixed model consisting of Xgboost and Logistic Regression (LR) and benchmarked it against common baseline models such as Xgboost, RF, SVM, Naive Bayes and Logistic Regression on the German Credit dataset published by UCI⁶. In the hybrid model, an effective feature combination was obtained by using Xgboost leaf nodes as features for the LR model. This set-up provided an AUC of 0.8321, which is far beyond the value of 0.7321 obtained with LR, the best individual model. Other studies such as [18] and [21] use meta-learning techniques to enhance the performance on credit card fraud datasets. However, combining the output of different classifiers to build a model reduces the classification speed [2], which might be an issue on big datasets.

3    Problem statement

mobile.de supports two different types of sellers, namely dealers and private sellers. Dealers are registered dealerships in Germany and neighbouring countries who are paying customers of mobile.de. These are professional sellers who make a living out of buying and selling vehicles. Private sellers are regular citizens who own a vehicle and use a classified market to sell it (not registered as a business). Internally, at mobile.de a private seller is labelled and named as FSBO (For Sale By Owner), and for the rest of this paper, we will address a private seller with the same terminology. Although there are several malicious activities which can be classified as fraud, such as account takeover, falsification of documents, etc., our objective in this study is focused on a single type of user (FSBOs) that creates fraudulent (fake) listings. Our pipeline overview is depicted in Figure 1. When a listing is created (or updated), our machine learning models generate a fraud probability prediction and, in case the result is above a certain threshold, the listing is manually evaluated by a Customer Service (CS) agent, who reviews the content of the listing and assigns a rating (ground truth). In addition to listings flagged by our ML models, Customer Service agents extend their reviewing process to listings which might have received users' complaints. Eventually, one way or another, every fraudulent listing is flagged in our dataset, the vast majority before damage is done, and in very few cases the reports come from scam victims. The main classification task is binary in the sense that the target variable to predict has two possible outcomes, OK or FRAUD. The goal is to detect when a vehicle listing is (or becomes) fraudulent. This can happen at insertion time (version 1 of the listing) or at any time later due to a modification of the data.

Figure 1: mobile.de in-house data collection and pipeline overview.

⁶ https://archive.ics.uci.edu/ml/index.php (accessed on July 2021).
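The score-and-threshold routing described above can be sketched as follows. All names and the threshold value are purely illustrative; mobile.de's production pipeline and its internal threshold are not disclosed in the paper:

```python
from dataclasses import dataclass

@dataclass
class Listing:
    listing_id: int
    fraud_probability: float  # output of the ML models for this listing version

# Illustrative review threshold, not the production value.
REVIEW_THRESHOLD = 0.8

def route(listing: Listing) -> str:
    """Send high-scoring listings to a Customer Service agent for manual
    review (producing the ground-truth rating); publish the rest."""
    if listing.fraud_probability >= REVIEW_THRESHOLD:
        return "manual_review"
    return "published"
```

Because a listing is re-scored on every update, the same routing applies whether fraud appears in version 1 or in a later modification.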
4    Datasets

In this study, we used two different datasets to train and test our machine learning models: the mobile.de in-house dataset and a tailored sample of the TalkingData AdTracking Fraud Detection Challenge dataset obtained from the machine learning competition platform Kaggle.
    At mobile.de, FRAUD cases (positive cases) are less frequent than OK cases, leading to a highly imbalanced dataset. The in-house dataset consists of 27 categorical variables and 10 continuous ones. To maintain the confidentiality of our data points, and to eliminate the risk of giving any clues that could lead to learnings on how to bypass our fraud detection models, we refrain from disclosing the exact names of the attributes and features.
    The public dataset is taken from China's largest independent big data service platform, which covers 70% of active mobile devices in the country and handles 3 billion clicks per day, out of which 90% are potentially fraudulent. Contrary to the mobile.de case, here click fraud is the most frequent class (negative class) and occurs when a person or an automated bot acting as a legitimate user clicks on an app ad without downloading the app afterwards. The raw dataset contains 200 million clicks over a 4-day period. It includes 7 data fields (IP, app, device, OS, channel, click time, attributed time) and a binary target to predict (is attributed). The target variable is imbalanced, with 99.8% of negative cases.
    Tables 1 and 2 summarize the preprocessing steps applied to the mobile.de and TalkingData datasets respectively. For our in-house dataset, the testing set corresponds to samples recorded in the 7 days prior to the day the model was trained. The training set corresponds to the 28 days of data prior to the start date of the testing set. The time-based split was done to prevent the model from learning from future observations. In order to reduce the imbalance and increase the performance, we applied random undersampling and kept 10% of the majority class in the training set. This resulted in around 200,000 training samples and 240,000 testing ones. We kept raw missing entries within the sets; the H2O and Catboost models handled them as separate categories⁷,⁸.
    For the Kaggle dataset, we borrowed the preprocessing steps from [8] and engineered two additional features: click hour of the day and day of the week. First, we reduced the data size by randomly sampling 15% of unique IP addresses and retaining a stratified sample of 8% of the remaining set. To handle the imbalance, we applied the Synthetic Minority Oversampling Technique (SMOTE) [5] with 5 neighbours and oversampled the positive class up to 11%. We then applied a stratified split, keeping 70% of the set for training. The final set has 1,706,481 training samples and 731,349 testing ones, without any missing values.

Table 1: In-house dataset preprocessing steps.
    Step                              Details
    non-overlapping time-based split  test: latest week; train: 28 days
    undersampling                     random undersampling of the training set; 10% of negative cases kept
    missing values                    kept; processed by the machine learning models
    feature engineering               yes (confidential)

Table 2: TalkingData dataset preprocessing steps.
    Step                 Details
    subsampling          15% random sample of unique IPs, then 8% stratified sample of the remaining set
    oversampling         SMOTE with k=5 neighbours; positive class raised to 11%
    missing values       absent
    stratified split     test: 30%; train: 70%
    feature engineering  click hour and day of the week added; attributed time removed

⁷ https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm-faq/missing_values.html (accessed on 16 July 2021).
⁸ https://catboost.ai/docs/concepts/algorithm-missing-values-processing.html (accessed on 16 July 2021).

5    Training Machine Learning Models

In this section, we briefly summarize the theoretical concepts behind the models used in our study, provide an overview of the machine learning libraries in which the models were implemented and finally describe the hyper-parameter tuning steps and our performance metrics.
    As stated in [4], Random Forest is an ensemble machine learning algorithm consisting of a collection of decision trees, each built from random samples. In each tree, thresholds are applied to the input features to maximize information gain while minimizing an impurity function (e.g. Cross Entropy, Mean Squared Error, etc.). The final score is given by the average of the scores of all trees. Besides, RF provides maximum depth and minimum sample split parameters to prevent the decision trees from overfitting on the training set.
    Xgboost [6] is another ensemble method which belongs to the large family of boosting algorithms. In
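SMOTE, used above on the TalkingData set, creates synthetic minority samples by interpolating between a minority point and one of its k nearest minority neighbours [5]. The following is a minimal self-contained illustration of that idea; in practice a library implementation (e.g. imbalanced-learn) would be used rather than this sketch:

```python
import numpy as np

def smote(minority, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: each synthetic sample is a random
    interpolation between a minority point and one of its k nearest
    minority neighbours."""
    rng = np.random.default_rng(seed)
    out = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        p = minority[rng.integers(len(minority))]
        # k nearest minority neighbours of p (index 0 is p itself).
        d = np.linalg.norm(minority - p, axis=1)
        nn = minority[np.argsort(d)[1:k + 1]]
        q = nn[rng.integers(len(nn))]
        out[i] = p + rng.uniform() * (q - p)  # point on the segment p -> q
    return out

minority = np.random.default_rng(1).normal(size=(20, 3))
synthetic = smote(minority, n_new=10)
```

Because every synthetic point lies on a segment between two real minority points, the generated samples stay inside the minority region instead of duplicating existing rows.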
general, boosting models combine shallow decision trees (also called weak learners), each built sequentially considering the errors of the previous trees, to reduce bias and variance at the same time. Xgboost in particular is an advanced implementation of gradient boosting which includes additional features such as parallel processing and regularization techniques for handling overfitting.
    Introduced in [17], Catboost is a boosting model designed to handle and process categorical data efficiently. By default, the Catboost implementation uses a one-hot encoding technique on categorical variables, except for the ones with high cardinality. In such a case, ordered target statistics [17] are used to maximize information gain. Contrary to other machine learning techniques, which require preprocessing steps to convert categorical data into numbers, Catboost requires only the indices of the categorical features [7].
    The meta-learning technique aims at combining the output of several base learners to improve the prediction accuracy, utilizing the strength of one learner to complement the weaknesses of others [18]. In this study, we used H2O AutoML [11] to build a stacked ensemble. AutoML provides a simple wrapper function optimized for training and combining a large number of models in a short amount of time. This module evaluates single machine learning models (GBM⁹, Xgboost, RF, Extremely Randomized Trees¹⁰, Artificial Neural Networks¹¹ and Generalised Linear Models¹²) and their stacked ensembles on validation sets using relevant metrics (e.g. AUC, logloss, etc.). The best performing model is then retained for deployment.
    H2O is an open-source distributed software library for machine learning and deep learning applications. Its frame and cluster abstractions allow tabular data of various types to be processed easily in a distributed fashion. The H2O platform supports various interfaces, including R, Python and Java, making it easier to complete analytic workflows [3]. In our case, we used the H2O Python interface to train and optimize Distributed Random Forest (DRF), Xgboost and AutoML models. The trained models are saved in the MOJO (Model Object, Optimized) format and later embedded in a Java environment for real-time predictions.
    The Catboost library is another high-performance open-source framework for gradient boosting on decision trees. Similar to H2O, the Catboost library supports Python, R and Java interfaces. For this study, we combined Catboost's Python and Java interfaces for model training and deployment.

5.1    Hyperparameters tuning

The parameter optimization described in this section is limited to our in-house dataset. In fact, because of TalkingData's large sample size (1,706,481 entries), carrying out extensive hyper-parameter tuning is daunting. Therefore, for this dataset, we applied a full parameter optimization only for the Catboost model and kept similar parameters for the H2O counterparts.
    For H2O, 3-, 5- and 10-fold Cross Validation (CV) provided the best performance for RF, AutoML and Xgboost respectively. These models' hyperparameters are depicted in Table 3. However, on the public dataset, we set the maximum number of models to 10 and the number of folds to 3 to circumvent memory limitations for AutoML.
    For Catboost, the Python library Hyperopt¹³ was used for hyperparameter optimization. Hyperopt provides custom functions for hyperparameter search. Each parameter value is retrieved from a list of candidates taken from a specific "quantized" continuous distribution such as qloguniform and quniform (see Table 4). Besides, the models are trained for 500 iterations, using 3-fold CV, the logarithmic loss function and the Area Under the Receiver Operating Characteristic Curve (AUC) as evaluation metric.

Table 3: H2O models' hyperparameters (in-house dataset).
    Parameter                         RF       Xgb      AutoML
    maximum number of models          -        -        20
    number of trees                   100      1000     -
    maximum depth                     50       35       -
    number of columns for a DT split  9        -        -
    columns sample rate               -        0.8      -
    sample rate                       -        0.8      -
    learning rate                     -        0.009    -
    early stopping metric             logloss  logloss  logloss
    early stopping rounds             -        25       3

Table 4: Catboost hyperparameters and the minimum and maximum values of the Hyperopt "quantized" continuous distributions used for optimisation.
    Parameter          Hyperopt function  min    max
    l2_leaf_reg        qloguniform        0      2
    learning_rate      qloguniform        0.001  0.5
    subsample          quniform           0.5    1
    colsample_bylevel  quniform           0.5    1

Table 5: Area Under the Receiver Operating Characteristic Curve of the best single learner of each model family, derived from the H2O AutoML leaderboard() method (in-house dataset).
    Model                                   AUC
    Stacked Ensemble (all models)           0.9850
    Stacked Ensemble (best of each family)  0.9848
    Gradient Boosting Machine               0.9826
    Extreme Gradient Boosting               0.9821
    Random Forest                           0.9790
    Extremely Randomized Trees              0.9719
    Generalized Linear Model                0.9690
    Artificial Neural Network               0.9200

Table 6: Machine learning models' performance summary (in-house dataset).
    Model     F1      Precision  Recall  AUC
    AutoML    0.7293  0.7206     0.7833  0.9850
    Xgb       0.7134  0.7104     0.7165  0.9794
    Catboost  0.7127  0.7375     0.6895  0.9809
    RF        0.6810  0.7274     0.6401  0.9786

house test dataset but limited to the best algorithms of each family (GBM, Xgboost, RF, Extremely Randomized Trees, Artificial Neural Networks and Generalised Linear Models). Tree-based models outperform Artificial Neural Networks and Generalised Linear Models; they are well suited to complex non-linear problems [16]. In particular, GBM and Xgboost yield the best AUC of 0.982, followed by Random Forest with an AUC of 0.9790. Besides, Najem and Kadeem's [16] survey on fraud detection techniques in e-commerce showed that RF has the highest usage frequency and is the best performing model across various use cases. Based on these

⁹ https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm.html (accessed on 16 July 2021).
¹⁰ https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/drf.html#extremely-randomized-trees (accessed on 16 July 2021).
¹¹ https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/deep-learning.html (accessed on 16 July 2021).
¹² https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html (accessed on 16 July 2021).
¹³ https://github.com/hyperopt/hyperopt (accessed on 16 July 2021).
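The ordered target statistics mentioned above can be illustrated with a simplified single-permutation version: each row's categorical value is encoded using only the target values of rows that precede it, which avoids target leakage. Catboost's actual implementation (multiple random permutations, per-iteration handling) is more involved than this sketch:

```python
def ordered_target_stat(categories, targets, prior=0.5, a=1.0):
    """Simplified ordered target statistic over one permutation:
    encode row i as (sum of earlier targets for its category + a * prior)
    / (count of earlier rows with its category + a)."""
    sums, counts, encoded = {}, {}, []
    for c, y in zip(categories, targets):
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded.append((s + a * prior) / (n + a))
        sums[c] = s + y      # update running statistics AFTER encoding,
        counts[c] = n + 1    # so the row never sees its own target
    return encoded

enc = ordered_target_stat(["a", "a", "b", "a"], [1, 0, 1, 1])
```

The first occurrence of each category falls back to the prior, which is also how a previously unseen (cold-start) category would be handled.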
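The stacking idea can be illustrated with a toy meta-learner. H2O's stacked ensembles fit a metalearner (a GLM by default) on the base models' cross-validated predictions; as a minimal stand-in, the sketch below searches for the convex combination of two hypothetical base models' validation probabilities that minimizes log loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy validation labels and probabilities from two hypothetical base
# learners; in H2O these would be cross-validated predictions.
y_val = rng.integers(0, 2, size=200)
p1 = np.clip(y_val * 0.7 + rng.normal(0.15, 0.20, 200), 0.0, 1.0)
p2 = np.clip(y_val * 0.6 + rng.normal(0.20, 0.25, 200), 0.0, 1.0)

def log_loss(y, p, eps=1e-7):
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Grid search over convex combinations: a minimal stand-in for fitting
# a metalearner on the base models' outputs.
weights = np.linspace(0.0, 1.0, 101)
best_w = min(weights, key=lambda w: log_loss(y_val, w * p1 + (1 - w) * p2))
ensemble = best_w * p1 + (1 - best_w) * p2
```

Because the grid contains the pure weights 0 and 1, the combined model can never be worse on the validation set than either base model alone, which is the basic appeal of stacking.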
5.2   Performance metrics                                        observations, we initially retained AutoML, Xgboot
                                                                 and RF for our benchmark. Catboost model, which
In an imbalanced classification task, the positive class
                                                                 is not part of H2O was benchmarked separately and
denotes the less frequent value of the target and the
                                                                 added later for the comparison.
negative class is its complement. When scoring a
model, an optimal solution can be derived from the                   Tables 6 and 7 illustrate performance metrics ob-
confusion matrix [9]. True positive (TP) and True neg-           tained from the different models on mobile.de and
ative values (TN) occur when the output of the model             TalkingData datasets respectively. On the first one,
matches with the ground truth label on positive and              AutoML best model (stacked ensemble) yields an F1
negative classes respectively. Conversely, False Pos-            score of 0.73 which is higher than the one of 0.71 ob-
itive (FP) and False Negative (FN) occur when the                tained with Xgboost and Catboost and of 0.68 with
model provides predictions which mismatch with the               Random Forest. It has been reported in [11] that
true labels. To convert model probabilities into classes,        stacked ensemble models usually produce better per-
we chose a threshold in order to maximize the F1 score           formance than individual models (Xgboost, Random
on the testing set accordingly. F1 score is the harmonic         Forest, etc) used in an AutoML run in accordance with
mean between the precision and recall and evaluates              our findings. On Talking Dataset, Catboost model
the accuracy of the model at predicting the positive             yields the best performance with an F1 score of 0.988.
class. Another popular evaluation metric is the Area             Catboost model is designed to process heterogeneous
Under the Receiver Operating Characteristic Curve.               data with categorical variables efficiently [17]. The
Contrary, to the previous metrics, it is used to assess          features cardinality is highlighted in Table 8. One hot
the ability of a classifier to distinguish between classes       encoding on one side and ordered targeted statistic ap-
independently of any selected threshold.                         plied on variables of high cardinality have a significant
                                                                 impact on the model performance. Catboost also pro-
                                                                 vides get feature importance() method which gives the
6     Results                                                    contribution of each feature to the ensemble model.
In order to retain candidate models for our evalua-              The output of this method is summarized in Figure 2,
tion, we first benchmarked a large pool of machine               the app id for marketing and the IP address of click
learning models. For this purpose, H2O AutoML ob-                are the most important features.
jects provide leaderboard() method which allows to                   In order to assess the generalizability of our mod-
rank the models trained to build the stacked ensemble            elling approach at detecting fraud, we compared our
on chosen dataset and metric. These models are op-               models with the work of Gupta et al. [8]. Their best
timised with AutoML predefined random grid param-                model, Two Class Decision Forest classifier provides
eter searches which are different from our production            a precision of 0.992 and a recall of 0.902 correspond-
hyper-parameters tuning described in the previous sec-           ing to an F1 score of 0.9442. All the models used in
tion. Table 5 summarizes the AUC obtained on our in-             our experiment outperform their results in terms of




                                                             6
Table 7: Machine learning models performance summary (TalkingData dataset).
 Model     F1      Precision  Recall  AUC
 Catboost  0.9888  0.9902     0.9873  0.9994
 AutoML    0.9800  0.9848     0.9752  0.9987
 Xgb       0.9787  0.9804     0.9771  0.9982
 RF        0.9780  0.9801     0.9758  0.9985
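The F1 values reported in the tables can be cross-checked from their precision and recall columns, and the threshold selection described in Section 5.2 (choosing the probability cutoff that maximizes F1 on the test set) can be sketched in a few lines. The following is a minimal illustration in plain Python; the helper names are ours, not the authors' production code:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def best_f1_threshold(y_true, y_prob):
    """Scan candidate probability cutoffs and keep the one that
    maximizes F1, mirroring the selection step of Section 5.2."""
    best_t, best_score = 0.5, -1.0
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 0)
        fn = sum(1 for y, p in zip(y_true, y_prob) if p < t and y == 1)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        score = f1(prec, rec)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Cross-check the Catboost row of Table 7 (precision 0.9902, recall 0.9873):
print(round(f1(0.9902, 0.9873), 4))  # → 0.9887
```

The result agrees with the reported F1 of 0.9888 up to rounding of the published precision and recall.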


F1 (see Table 7). In particular, our best model, Catboost, demonstrates a comparable precision and a better recall. Relying on the F1 score alone to compare our models would be problematic, since in the TalkingData context the positive class corresponds to the non-fraudulent clicks. In the TalkingData AdTracking Fraud Detection Challenge, Kaggle competitors' machine learning models were evaluated based on AUC. Using such a metric, our Catboost model yields an AUC of 0.9994, compared to 0.997 from Gupta et al. [8].

Table 8: Count of distinct values per column in the TalkingData training set.
 feature    count of unique values
 IP         123099
 device     1450
 OS         558
 channel    496
 app        383
 hour       24
 dayofweek  4

Figure 2: Catboost model feature importance (TalkingData dataset).

7   Conclusions

We presented a case study describing the application of ensemble methods to detect fraud in a large-scale online marketplace (mobile.de). The business value of such an investigation is twofold: first, to enable a trustworthy customer experience and enhance customer satisfaction; second, to reduce the Customer Service operational cost of resolving fraudulent cases.
   To achieve our goals, we designed a Machine Learning pipeline based on sellers' listings data and optimized it to address common challenges in fighting fraud (fraudsters' adaptability, dataset imbalance, high false positive rate, etc.). The main contribution of this study is a pipeline using open-source data science libraries to collect, process and score sellers' listings to efficiently detect fraud. Our best model, AutoML, provided an F1 score of 0.73, outperforming Catboost, Xgboost and Random Forest. These models were later tested on a public TalkingData dataset from the Kaggle competition platform, showed great robustness at detecting fraud, and outperformed previously proposed models. The best model on this set, Catboost, provides an F1 score of 0.9888, which is significantly higher than the value of 0.9442 reported in [8].
   With regard to the prospects of the study, we will first explore dimensionality reduction techniques [19] and encoding methods in order to improve the performance of the classifiers. Second, we will leverage the power of Big Data tools (e.g., Spark) to train and optimize the models on larger samples of data. In addition, we aim at investigating different meta-learning techniques combining Catboost and H2O models to build robust classifiers and further prevent fraud on our website.
   Furthermore, in our future work we will tackle the problem of detecting fraud "as soon as possible": it is crucial that fraudulent listings are detected before they reach the audience. To this end we plan to include further features such as buyers' and sellers' user activity. Finally, we would like to highlight that the work presented in this paper is currently in production, protecting buyers and sellers at mobile.de, and for that reason we refrain from disclosing more technical details that could help malicious users bypass our detection system.

8   Acknowledgements

We would like to thank the Customer Service team at mobile.de for their countless hours of manual work in detecting fraud, and for providing us the ground truth to start our work. We would also like to thank the members of the TnS and Data teams at mobile.de who have been directly and indirectly involved in this work, with special thanks to Moritz Aschoff and Matthias Radtke.

References

[1] Abdi, H. and Williams, L. J. (2010). Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):433–459.

[2] Adewumi, A. O. and Akinyelu, A. A. (2017). A survey of machine-learning and nature-inspired based credit
card fraud detection techniques. International Journal of System Assurance Engineering and Management, 8(2):937–953.

[3] Aiello, S., Click, C., Roark, H., Rehak, L., and Stetsenko, P. (2016). Machine learning with Python and H2O. Edited by Lanford, J., Published by H, 20:2016.

[4] Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

[5] Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357.

[6] Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794.

[7] Ghori, K. M., Abbasi, R. A., Awais, M., Imran, M., Ullah, A., and Szathmary, L. (2019). Performance analysis of different types of machine learning classifiers for non-technical loss detection. IEEE Access, 8:16033–16048.

[8] Gupta, N., Le, H., Boldina, M., and Woo, J. (2019). Predicting fraud of ad click using traditional and Spark ML. In KSII The 14th Asia Pacific International Conference on Information Science and Technology (APIC-IST), pages 24–28.

[9] Hossin, M. and Sulaiman, M. (2015). A review on evaluation metrics for data classification evaluations. International Journal of Data Mining & Knowledge Management Process, 5(2):1.

[10] Kanei, F., Chiba, D., Hato, K., Yoshioka, K., Matsumoto, T., and Akiyama, M. (2020). Detecting and understanding online advertising fraud in the wild. IEICE Transactions on Information and Systems, 103(7):1512–1523.

[11] LeDell, E. and Poirier, S. (2020). H2O AutoML: Scalable automatic machine learning. 7th ICML Workshop on Automated Machine Learning (AutoML).

[12] Lee, S.-J., Ahn, C., Song, K. M., and Ahn, H. (2018). Trust and distrust in e-commerce. Sustainability, 10(4):1015.

[13] Meng, C., Zhou, L., and Liu, B. (2020). A case study in credit fraud detection with SMOTE and XGBoost. In Journal of Physics: Conference Series, volume 1601, page 052016. IOP Publishing.

[14] Minastireanu, E.-A. and Mesnita, G. (2019). Light GBM machine learning algorithm to online click fraud detection. J. Inform. Assur. Cybersecur., 2019.

[15] Mohammed, R. A., Wong, K.-W., Shiratuddin, M. F., and Wang, X. (2018). Scalable machine learning techniques for highly imbalanced credit card fraud detection: a comparative study. In Pacific Rim International Conference on Artificial Intelligence, pages 237–246. Springer.

[16] Najem, S. M. and Kadeem, S. M. (2021). A survey on fraud detection techniques in e-commerce.

[17] Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A. (2018). CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, pages 6638–6648.

[18] Pun, J. and Lawryshyn, Y. (2012). Improving credit card fraud detection using a meta-classification strategy. International Journal of Computer Applications, 56(10).

[19] Rajora, S., Li, D.-L., Jha, C., Bharill, N., Patel, O. P., Joshi, S., Puthal, D., and Prasad, M. (2018). A comparative study of machine learning techniques for credit card fraud detection based on time variance. In 2018 IEEE Symposium Series on Computational Intelligence (SSCI), pages 1958–1963. IEEE.

[20] Renjith, S. (2018). Detection of fraudulent sellers in online marketplaces using support vector machine approach. arXiv preprint arXiv:1805.00464.

[21] Suganya, S. and Kamalra, M. (2016). Meta classification technique for improving credit card fraud detection. International Journal of Scientific and Technical Advancements, 2(1):101–105.

[22] Thejas, G., Dheeshjith, S., Iyengar, S., Sunitha, N., and Badrinath, P. (2021). A hybrid and effective learning approach for click fraud detection. Machine Learning with Applications, 3:100016.

[23] Wang, M., Yu, J., and Ji, Z. (2018). Credit fraud risk detection based on XGBoost-LR hybrid model. In Proc. Int. Conf. Electron. Bus., volume 2, pages 336–343.

[24] Zhang, Z., Zhou, X., Zhang, X., Wang, L., and Wang, P. (2018). A model based on convolutional neural network for online transaction fraud detection. Security and Communication Networks, 2018.