=Paper=
{{Paper
|id=Vol-3052/paper15
|storemode=property
|title=Machine Learning Methods for Detecting Fraud in Online Marketplaces
|pdfUrl=https://ceur-ws.org/Vol-3052/paper15.pdf
|volume=Vol-3052
|authors=Raoul Dekou,Sabljic Savo,Simon Kufeld,Diana Francesca,Ricardo Kawase
|dblpUrl=https://dblp.org/rec/conf/cikm/DekouSKFK21
}}
==Machine Learning Methods for Detecting Fraud in Online Marketplaces==
Raoul Dekou¹*, Sabljic Savo², Simon Kufeld³, Diana Francesca², and Ricardo Kawase¹

¹ Mobile.de, Marktplatz 1, Europarc Dreilinden, 14532 Berlin, Germany
² Codecentric AG, Hochstraße 11, 42697 Solingen, Germany
³ inovex GmbH, Ludwig-Erhard-Allee 6, 76131 Karlsruhe, Germany
* Corresponding author: Raoul Dekou, rdekou@team.mobile.de
Abstract

Connecting buyers and sellers in a safe and secure environment is one of the biggest challenges in online marketplaces. Probabilistic models built upon user-item databases address the challenge, but often encounter issues such as lack of stability and robustness. These issues are magnified in fraud scenarios where datasets are highly imbalanced and noisy, and malicious users deliberately adapt their behaviors to avoid detection. In this context, we leveraged the power of the existing open source machine learning libraries H2O and Catboost and designed a pipeline to collect, process and predict the likelihood of a private seller's listing data being fraudulent. We found that the stacked ensemble model provides the best performance (F1=0.73) when compared to other commonly used models in the field. Further, our models are benchmarked on a public Kaggle dataset, the TalkingData AdTracking Fraud Detection Challenge, where we compared them to other studies and highlighted their generalizability and effectiveness at handling online fraud.

Copyright © by the paper's authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: RWTH Aachen, CEUR-WS: Proceedings of The 2021 International Workshop on Privacy, Security, and Trust in Computational Intelligence, Gold Coast, Queensland, Australia, 01-11-2021, published at https://xuyun-zhang.github.io/pstci2021/

1 Introduction

As reported in [12], retail e-commerce sales worldwide accounted for 1.86 trillion USD in 2016 and are expected to rise to 4.48 trillion USD in 2021. In the meantime, a recent report on fraud attack trends in the first quarter of 2021¹ confirmed the shift of attacks towards retail websites and estimated that 25% of this traffic is malicious. Such an increase in activity has put considerable pressure on marketplaces, which need to ensure the reliability and security of their services while inspiring trust towards buyers.

¹ https://securityboulevard.com/2021/07/top-industry-specific-fraud-attack-trends-from-q1-2021/ (accessed on July 2021).

Unfortunately, the success of online marketplaces attracts unwanted attention from malicious users who try to abuse the platforms for personal monetary gain. mobile.de does not control transactions between buyers and sellers. It is a "matchmaking" platform that bridges the gap between the two sets of entities. Once a user with malicious intent creates an account, he/she also creates an attractive vehicle listing (the goal is to get as many leads as possible). To achieve this, fraudsters take a series of lead-boosting steps. They upload listings of high-demand vehicles to the platform and set very low yet reasonable prices for the vehicles. Since every aspect of the listing looks legitimate (the website, the seller and the vehicle), buyers lower their guard and contact the fraudster. Through a series of interactions, the fraudster is able to convince the buyer (now a victim) to send a pre-payment money transfer, usually as a "reservation" fee. Once this happens, and the damage is done, the victims realize their mistake, contact mobile.de's Customer Service and report the case. There are very few cases
that reach this point; however, the total monthly loss can soar to thousands of Euros.

Satisfied customers (buyers and sellers) are the foundation of a valuable and successful marketplace. Thus, providing a secure environment and a safe experience to our customers is a top priority at mobile.de, and the motivation of this work, which aims at preventing and detecting fraudulent activity. To achieve our goals, we tackled the fraud detection problem by leveraging user-generated data and building machine learning models which are able to identify fraudulent activities. It is also essential to design robust, high-precision models which can generalise well. This paper describes our approach to mitigating fraudulent activity by fraudsters posing as private sellers. Our contribution is twofold. First, we describe a production pipeline to collect, process and score sellers' listings using the open source machine learning libraries Catboost² and H2O³. We briefly highlight how to efficiently use these libraries to pre-select relevant candidate models and tune their hyper-parameters. Second, we demonstrate that our approach could potentially inspire other use cases by verifying our detection methods on a sample of a large dataset publicly available at Kaggle.com⁴.

² https://www.catboost.ai/ (accessed on July 2021).
³ https://www.h2o.ai/ (accessed on 16 July 2021).
⁴ https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data (downloaded on 16 July 2021).

The remainder of this paper is structured as follows. In Section 2, we discuss existing work in the field. In Section 3, we provide a deeper understanding of the problem and formalize it. In Sections 4 and 5, we describe our methodology to tackle the problem. Section 6 contains our results, followed by the conclusion and prospects.

2 Related Work

Techniques used to detect fraud can be divided into two groups: expertise based and data driven. In the first, experts use their knowledge to build a set of rules that are tested and refined to filter out fraudulent activities. However, contrary to machine learning solutions, traditional expert techniques sometimes lack the ability to model non-trivial online connections [24]. The second set of techniques, data driven, i.e. machine learning solutions, overcomes this issue but raises different challenges. While the increase of activity in marketplaces generates massive datasets which require model scalability, the low occurrence of fraudulent events produces imbalanced datasets. Maintaining both high precision and high recall is often a challenge, and many models produce significant misclassification errors [2] which result in genuine customers being flagged as fraudulent. Finally, there is also the need for dynamic solutions, given that fraudsters adapt their behaviors to a point where they are able to bypass detection by machine learning models.

The literature offers various examples of machine learning methods which aim at detecting fraud. Najem and Kadeem's [16] recent survey on fraud detection techniques in e-commerce provides a broad view of the performance of several models on various datasets. It highlights that Random Forest (RF) is the most used and usually the most accurate of all methods. Though Naive Bayes algorithms are easy to implement, they are limited compared to decision trees when it comes to modelling non-linear problems. Such information was taken into consideration when selecting candidate models for our pipeline, which consists essentially of decision tree ensembles (RF, Xgboost and Catboost). For instance, Kanei et al. [10] trained a Random Forest model for detecting fraudulent ad requests. In their study, they demonstrated that the model robustness challenge could be addressed by means of features which could not be controlled by fraudsters, such as network statistics from clients and publishers. This set-up allowed them to improve their recall rate by 10%. Renjith [20] described a pipeline using Support Vector Machines (SVM) to detect fraudulent sellers in an online marketplace. The authors specifically pointed out that a cold start problem may arise for new users when using predictive models with seller or transaction information as features. In our approach, the cold start effect was mitigated by removing these types of features. Gupta et al. [8] benchmarked ensemble models for predicting the likelihood of a click on a mobile phone advertisement being fraudulent on a publicly available Kaggle dataset. They tested two configurations: traditional and Big Data. In the traditional configuration, they combined different sampling techniques (SMOTE, stratified sampling, etc.) to reduce the data size and handle the imbalanced training set. This dataset, which has been widely used in previous studies [8, 14, 22], is employed in our study, and the results from Gupta et al. [8] are used as our baseline. In our work, we applied the same preprocessing techniques and compared our results to their best model, Two Class Decision Forest⁵, with an F1 score of 0.944.

⁵ https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-module-reference/two-class-decision-forest (accessed on July 2021).

Using a sample of the same dataset, Minastireanu and Mesnita [14] trained a Lightgbm model to detect fraudulent clicks and reported an accuracy of 98%. The authors specifically described an example of how feature engineering on the original feature set (click_time, device,
channel, etc.) and K-fold cross validation are combined to enable high performance. Moreover, by testing their model on a large data sample (18 million user clicks), they demonstrated the robustness of the boosting machine for the case study. In the same context, Mohammed et al. [15] investigated the scalability of Random Forest, Balanced Bagging Ensemble and Gaussian Naive Bayes on massive and highly imbalanced credit card fraud datasets. They found that random undersampling is effective at handling imbalanced datasets and that, combined with RF, it is suitable for real-time applications on large datasets. In their study, the Random Forest model provided the highest recall of 91%. Rajora et al. [19] benchmarked the performance of various machine learning algorithms on a credit card transaction dataset with 31 attributes. They used random undersampling to address the data imbalance and Principal Component Analysis (PCA) [1] as a dimensionality reduction technique. On top of the PCA features, a time feature corresponding to the time delay from the first transaction is part of the training set. Furthermore, the authors illustrated how the inclusion of this feature can impact performance: RF performed better without the time feature, while Gradient Boosting Regression Tree performance remained constant. Meng et al. [13] also used a real-world credit card transactions dataset and combined Xgboost with sampling techniques to achieve strong performance. The SMOTE technique increased the recall from 0.8062 to 0.9 and the AUC from 0.9795 to 0.9853. Mohammed et al. [15] reported that Neural Networks tend to overfit on fraud datasets and struggle to handle imbalanced datasets. Nevertheless, as illustrated by Adewumi and Akinyelu [2] in their survey, such techniques are also commonly used for credit card fraud detection. Najem and Kadeem [16] pointed out that hybrid methods, which combine several methods to build a robust learner, provide better performance than individual learners. For example, Wang et al. [23] built a hybrid mixed model consisting of Xgboost and Logistic Regression (LR) and benchmarked it against common baseline models such as Xgboost, RF, SVM, Naive Bayes and Logistic Regression on the German Credit dataset published by UCI⁶. In the hybrid model, an effective feature combination was obtained by using Xgboost leaf nodes as features for the LR model. This set-up provided an AUC of 0.8321, far beyond the value of 0.7321 obtained with LR, the best individual model. Other studies such as [18] and [21] use meta learning techniques to enhance performance on credit card fraud datasets. However, combining the output of different classifiers to build a model reduces classification speed [2], which might be an issue on big datasets.

⁶ https://archive.ics.uci.edu/ml/index.php (accessed on July 2021).

3 Problem statement

mobile.de supports two different types of sellers, namely dealers and private sellers. Dealers are registered dealerships in Germany and neighbouring countries who are paying customers of mobile.de. These are professional sellers who make a living out of buying and selling vehicles. Private sellers are regular citizens who own a vehicle and use a classified market to sell it (they are not registered as a business). Internally at mobile.de, a private seller is labelled FSBO (For Sale By Owner), and for the rest of this paper we will refer to private sellers with the same terminology. Although there are several malicious activities which can be classified as fraud, such as account takeover, falsification of documents, etc., our objective in this study focuses on a single type of user (FSBOs) that creates fraudulent (fake) listings.

Our pipeline overview is depicted in Figure 1. When a listing is created (or updated), our machine learning models generate a fraud probability prediction and, in case the result is above a certain threshold, the listing is manually evaluated by a Customer Service (CS) agent, who reviews the content of the listing and assigns a rating (ground truth). In addition to the listings flagged by our ML models, Customer Service agents extend their reviewing process to listings which have received user complaints. Eventually, one way or another, every fraudulent listing is flagged in our dataset, the vast majority before any damage is done; in very few cases, reports come from scam victims. The main classification task is binary in the sense that the target variable to predict has two possible outcomes: OK or FRAUD. The goal is to detect when a vehicle listing is (or becomes) fraudulent. This can happen at insertion time (version 1 of the listing) or at any later time due to a modification of the data.

Figure 1: mobile.de in-house data collection and pipeline overview.
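The flow described above (listing event, fraud probability, manual Customer Service review) can be sketched as follows. Everything in this sketch is an illustrative stand-in: the scoring function, the field names and the threshold value are invented for the example, since the production models, features and thresholds are deliberately not disclosed.

```python
# Hypothetical stand-in for the trained classifier; the real pipeline
# scores listings with H2O / Catboost models and undisclosed features.
def predict_fraud_probability(listing: dict) -> float:
    # Toy heuristic for illustration only: cheap, high-demand vehicles
    # are treated as more suspicious (mirrors the fraud pattern in Section 1).
    score = 0.0
    if listing.get("price_eur", float("inf")) < 5000:
        score += 0.5
    if listing.get("high_demand_model", False):
        score += 0.4
    return min(score, 1.0)

REVIEW_THRESHOLD = 0.7  # assumed value; the production threshold is not disclosed

def route_listing(listing: dict, review_queue: list) -> str:
    """Score a created/updated listing; above the threshold it goes to a
    Customer Service agent, whose rating becomes the ground-truth label."""
    p = predict_fraud_probability(listing)
    if p >= REVIEW_THRESHOLD:
        review_queue.append(listing)  # manual evaluation by a CS agent
        return "FLAGGED_FOR_REVIEW"
    return "OK"
```

The same routing runs on every insertion and every later modification of a listing, matching the "is (or becomes) fraudulent" formulation above.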
4 Datasets

In this study, we used two different datasets to train and test our machine learning models: the mobile.de in-house dataset and a tailored sample of the TalkingData AdTracking Fraud Detection Challenge dataset obtained from the machine learning competition platform Kaggle.

At mobile.de, FRAUD cases (the positive cases) are less frequent than OK cases, leading to a highly imbalanced dataset. The in-house dataset consists of 27 categorical variables and 10 continuous ones. To maintain the confidentiality of our data points, and to eliminate the risk of giving any clues that could lead to learnings on how to bypass our fraud detection models, we refrain from disclosing the exact names of the attributes and features.

The public dataset is taken from China's largest independent big data service platform, which covers 70% of the active mobile devices in the country and handles 3 billion clicks per day, of which 90% are potentially fraudulent. Contrary to the mobile.de case, here click fraud is the most frequent class (the negative class) and occurs when a person or an automated bot acting as a legitimate user clicks on an app ad without downloading the app afterwards. The raw dataset contains 200 million clicks over a 4-day period. It includes 7 data fields (ip, app, device, os, channel, click_time, attributed_time) and a binary target to predict (is_attributed). The target variable is imbalanced, with 99.8% negative cases.

Tables 1 and 2 summarize the preprocessing steps applied to the mobile.de and TalkingData datasets respectively. For our in-house dataset, the testing set corresponds to samples recorded in the 7 days prior to the day the model was trained. The training set corresponds to the 28 days of data prior to the start date of the testing set. The time-based split was done to prevent the model from learning from future observations. In order to reduce the imbalance and increase performance, we applied random undersampling and kept 10% of the majority class in the training set. This resulted in around 200,000 training samples and 240,000 testing ones. We kept raw missing entries within the sets; the H2O and Catboost models handled them as separate categories⁷,⁸.

⁷ https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm-faq/missing_values.html (accessed on 16 July 2021).
⁸ https://catboost.ai/docs/concepts/algorithm-missing-values-processing.html (accessed on 16 July 2021).

For the Kaggle dataset, we borrowed the preprocessing steps from [8] and engineered two additional features: click hour of the day and day of the week. First, we reduced the data size by randomly sampling 15% of the unique IP addresses and retaining a stratified sample of 8% of the remaining set. To handle the imbalance, we applied the Synthetic Minority Over-sampling Technique (SMOTE) [5] with 5 neighbours and oversampled the positive class up to 11%. We then applied a stratified split, keeping 70% of the set for training. The final set has 1,706,481 training samples and 731,349 testing ones, without any missing values.

Table 1: In-house dataset preprocessing steps.

  time-based split     non-overlapping; test (latest week), train (28 days)
  undersampling        random undersampling of the training set, 10% of negative cases kept
  missing values       kept and processed by the machine learning models
  feature engineering  yes (confidential)

Table 2: TalkingData dataset preprocessing steps.

  subsampling          15% random sample of unique IPs, then 8% stratified sample from the remaining set
  oversampling         SMOTE with k=5 neighbours, positive class up to 11%
  missing values       absent
  stratified split     test (30%), train (70%)
  feature engineering  click hour & day of the week; attributed_time is removed

5 Training Machine Learning Models

In this section, we briefly summarize the theoretical concepts behind the models used in our study, provide an overview of the machine learning libraries in which the models are implemented and, finally, describe the hyper-parameter tuning steps and our performance metrics.

As stated in [4], Random Forest is an ensemble machine learning algorithm consisting of a collection of decision trees, each built from random samples. In each tree, thresholds are applied to the input features to maximize information gain while minimizing an impurity function (e.g. Cross Entropy, Mean Squared Error, etc.). The final score is given by the average of the scores of all trees. Besides, RF provides maximum depth and minimum sample split parameters to prevent the decision trees from overfitting on the training set.
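The in-house preprocessing summarized in Table 1 (a non-overlapping time-based split followed by random undersampling of the majority class down to 10%) can be sketched in plain Python as follows. The `day` and `label` field names and the toy record layout are assumptions for illustration, since the real attribute names are confidential.

```python
import random
from datetime import date, timedelta

def time_based_split(rows, train_days=28, test_days=7):
    """Split rows (each carrying a 'day' date field) so that the last
    `test_days` form the test set and the `train_days` before them form
    the training set, preventing leakage from future observations."""
    last_day = max(r["day"] for r in rows)
    test_start = last_day - timedelta(days=test_days - 1)
    train_start = test_start - timedelta(days=train_days)
    train = [r for r in rows if train_start <= r["day"] < test_start]
    test = [r for r in rows if r["day"] >= test_start]
    return train, test

def undersample_majority(train, label_key="label", majority="OK",
                         keep=0.10, seed=42):
    """Keep all minority (FRAUD) rows and a random `keep` fraction of the
    majority (OK) rows, as in Table 1."""
    rng = random.Random(seed)
    minority = [r for r in train if r[label_key] != majority]
    kept_majority = [r for r in train
                     if r[label_key] == majority and rng.random() < keep]
    return minority + kept_majority
```

Note that undersampling is applied to the training set only; the test week keeps its natural class ratio so that the reported metrics reflect production conditions.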
Xgboost [6] is another ensemble method which belongs to the large family of boosting algorithms. In general, boosting models combine shallow decision trees (also called weak learners), each built sequentially, taking into account the errors of the previous trees to reduce bias and variance at the same time. Xgboost in particular is an advanced implementation of gradient boosting which includes additional features such as parallel processing and regularization techniques for handling overfitting.

Introduced in [17], Catboost is a boosting model designed to handle and process categorical data efficiently. By default, the Catboost implementation uses one-hot encoding on categorical variables, except for those with high cardinality. In such cases, ordered target statistics [17] are used to maximize information gain. Contrary to other machine learning techniques, which require preprocessing steps to convert categorical data into numbers, Catboost requires only the indices of the categorical features [7].

Meta learning aims at combining the output of several base learners to improve prediction accuracy and to use the strengths of one learner to complement the weaknesses of others [18]. In this study, we used H2O AutoML [11] to build a stacked ensemble. AutoML provides a simple wrapper function optimized for training and combining a large number of models in a short amount of time. This module evaluates single machine learning models (GBM⁹, Xgboost, RF, Extremely Randomized Trees¹⁰, Artificial Neural Networks¹¹ and Generalised Linear Models¹²) and their stacked ensembles on validation sets using relevant metrics (e.g. AUC, logloss, etc.). The best performing model is then retained for deployment.

⁹ https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm.html (accessed on 16 July 2021).
¹⁰ https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/drf.html#extremely-randomized-trees (accessed on 16 July 2021).
¹¹ https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/deep-learning.html (accessed on 16 July 2021).
¹² https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html (accessed on 16 July 2021).

H2O is an open source distributed library for machine learning and deep learning applications. Its frame and cluster abstractions allow tabular data of various types to be processed easily in a distributed fashion. The H2O platform supports various interfaces, including R, Python and Java, making it easier to complete analytic workflows [3]. In our case, we used the H2O Python interface to train and optimize Distributed Random Forest (DRF), Xgboost and AutoML models. The trained models are saved in the MOJO (Model Object Optimized) format, which is later embedded in a JAVA environment for real-time predictions.

The Catboost library is another high-performance open source framework for gradient boosting on decision trees. Similar to H2O, the Catboost library supports Python, R and JAVA interfaces. For this study, we combined Catboost's Python and JAVA interfaces for model training and deployment.

Table 3: H2O models hyperparameters (in-house dataset).

  parameter                         RF       Xgb      AutoML
  maximum number of models          -        -        20
  number of trees                   100      1000     -
  maximum depth                     50       35       -
  number of columns for a DT split  9        -        -
  columns sample rate               -        0.8      -
  sample rate                       -        0.8      -
  learning rate                     -        0.009    -
  early stopping metric             logloss  logloss  logloss
  early stopping rounds             -        25       3

Table 4: Catboost hyperparameters and the minimum and maximum values of the Hyperopt "quantized" continuous distributions used for optimisation.

  Parameter          Hyperopt function  min    max
  l2_leaf_reg        qloguniform        0      2
  learning_rate      qloguniform        0.001  0.5
  subsample          quniform           0.5    1
  colsample_bylevel  quniform           0.5    1

5.1 Hyperparameters tuning

The parameter optimization described in this section is limited to our in-house dataset. Because of TalkingData's large sample size (1,706,481 entries), carrying out extensive hyper-parameter tuning is daunting. Therefore, for this dataset, we applied a full parameter optimization only for the Catboost model and kept similar parameters for its H2O counterparts.

For H2O, 3-, 5- and 10-fold Cross Validation (CV) provided the best performance for RF, AutoML and Xgboost respectively. These models' hyperparameters are shown in Table 3. However, on the public dataset, we set the maximum number of models to 10 and the number of folds to 3 to circumvent memory limitations for AutoML.

For Catboost, the Python library Hyperopt¹³ was used for hyperparameter optimization. Hyperopt provides custom functions for hyperparameter search.

¹³ https://github.com/hyperopt/hyperopt (accessed on 16 July 2021).
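As a rough illustration of the two distribution types listed in Table 4, the following stand-in draws quantized values for a plain random search. The quantization step `q` is an assumed value, and unlike Hyperopt's actual `qloguniform` (whose bounds are given in log space), the bounds here are on the original scale; this is a sketch, not Hyperopt's API.

```python
import math
import random

rng = random.Random(0)

def quniform(low, high, q):
    """Uniform draw on [low, high], rounded to a multiple of q
    (mirrors the behaviour of Hyperopt's quniform)."""
    return round(rng.uniform(low, high) / q) * q

def qloguniform(low, high, q):
    """Draw uniformly on a log scale between low and high, then quantize.
    A floor of q guards against log(0), since Table 4 lists a minimum of 0."""
    lo, hi = math.log(max(low, q)), math.log(high)
    return round(math.exp(rng.uniform(lo, hi)) / q) * q

def sample_params():
    # Ranges taken from Table 4; the q steps are assumptions.
    return {
        "l2_leaf_reg": qloguniform(0.0, 2.0, 0.01),
        "learning_rate": qloguniform(0.001, 0.5, 0.001),
        "subsample": quniform(0.5, 1.0, 0.05),
        "colsample_bylevel": quniform(0.5, 1.0, 0.05),
    }
```

In the actual set-up, each sampled configuration would be scored with 3-fold CV on logloss/AUC, as described below, and the best one kept.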
Each parameter value is retrieved from a list of candidates taken from a specific "quantized" continuous distribution such as qloguniform or quniform (see Table 4). The models are trained for 500 iterations, using 3-fold CV, the logarithmic loss function and the Area Under the Receiver Operating Characteristic Curve (AUC) as the evaluation metric.

5.2 Performance metrics

In an imbalanced classification task, the positive class denotes the less frequent value of the target and the negative class is its complement. When scoring a model, an optimal solution can be derived from the confusion matrix [9]. True Positives (TP) and True Negatives (TN) occur when the output of the model matches the ground truth label on the positive and negative classes respectively. Conversely, False Positives (FP) and False Negatives (FN) occur when the model's predictions mismatch the true labels. To convert model probabilities into classes, we chose the threshold that maximizes the F1 score on the testing set. The F1 score is the harmonic mean of precision and recall and evaluates the accuracy of the model at predicting the positive class. Another popular evaluation metric is the Area Under the Receiver Operating Characteristic Curve. Contrary to the previous metrics, it assesses the ability of a classifier to distinguish between classes independently of any selected threshold.

6 Results

In order to retain candidate models for our evaluation, we first benchmarked a large pool of machine learning models. For this purpose, H2O AutoML objects provide the leaderboard() method, which ranks the models trained to build the stacked ensemble on a chosen dataset and metric. These models are optimised with AutoML's predefined random grid parameter searches, which differ from the production hyper-parameter tuning described in the previous section. Table 5 summarizes the AUC obtained on our in-house test dataset, limited to the best algorithm of each family (GBM, Xgboost, RF, Extremely Randomized Trees, Artificial Neural Networks and Generalised Linear Models). Tree-based models outperform Artificial Neural Networks and Generalised Linear Models; they are well suited to complex non-linear problems [16]. In particular, GBM and Xgboost yield the best AUC of 0.982, followed by Random Forest with an AUC of 0.9790. Besides, Najem and Kadeem's [16] survey on fraud detection techniques in e-commerce showed that RF is the most frequently used method and the best performing one across various use cases. Based on these observations, we initially retained AutoML, Xgboost and RF for our benchmark. The Catboost model, which is not part of H2O, was benchmarked separately and added later for the comparison.

Table 5: Area Under the Receiver Operating Characteristic Curve of the best single learner of each model family derived from the H2O AutoML leaderboard() method (in-house dataset).

  Model                                   AUC
  Stacked Ensemble (all models)           0.9850
  Stacked Ensemble (best of each family)  0.9848
  Gradient Boosting Machine               0.9826
  Extreme Gradient Boosting               0.9821
  Random Forest                           0.9790
  Extremely Randomized Trees              0.9719
  Generalized Linear Model                0.9690
  Artificial Neural Network               0.9200

Table 6: Machine learning models performance summary (in-house dataset).

  Model     F1      Precision  Recall  AUC
  AutoML    0.7293  0.7206     0.7833  0.9850
  Xgb       0.7134  0.7104     0.7165  0.9794
  Catboost  0.7127  0.7375     0.6895  0.9809
  RF        0.6810  0.7274     0.6401  0.9786

Tables 6 and 7 show the performance metrics obtained from the different models on the mobile.de and TalkingData datasets respectively. On the former, AutoML's best model (a stacked ensemble) yields an F1 score of 0.73, higher than the 0.71 obtained with Xgboost and Catboost and the 0.68 obtained with Random Forest. It has been reported in [11] that stacked ensemble models usually perform better than the individual models (Xgboost, Random Forest, etc.) used in an AutoML run, in accordance with our findings. On the TalkingData dataset, the Catboost model yields the best performance, with an F1 score of 0.988. The Catboost model is designed to process heterogeneous data with categorical variables efficiently [17]. The feature cardinalities are shown in Table 8. One-hot encoding on the one hand, and ordered target statistics applied to variables of high cardinality on the other, have a significant impact on model performance. Catboost also provides the get_feature_importance() method, which gives the contribution of each feature to the ensemble model. The output of this method is summarized in Figure 2: the app id for marketing and the IP address of the click are the most important features.

In order to assess the generalizability of our modelling approach at detecting fraud, we compared our models with the work of Gupta et al. [8]. Their best model, a Two Class Decision Forest classifier, provides a precision of 0.992 and a recall of 0.902, corresponding to an F1 score of 0.9442.
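The F1 figures compared here depend on the threshold selection of Section 5.2: probabilities are turned into OK/FRAUD classes at the threshold that maximizes F1 on the held-out set. A minimal sketch of that selection (plain Python, binary labels with 1 as the positive class):

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """Precision, recall and F1 for the positive class at a given threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def best_f1_threshold(y_true, y_prob):
    """Scan candidate thresholds (the predicted probabilities themselves)
    and return the one that maximizes F1 on the held-out set."""
    candidates = sorted(set(y_prob))
    return max(candidates,
               key=lambda t: f1_at_threshold(y_true, y_prob, t)[2])
```

The AUC, by contrast, integrates over all thresholds, which is why it is reported alongside F1 throughout the tables.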
Table 7: Machine learning models performance sum-
mary (TalkingData dataset).
Model F1 Precision Recall AUC
Catboost 0.9888 0.9902 0.9873 0.9994
AutoML 0.9800 0.9848 0.9752 0.9987
Xgb 0.9787 0.9804 0.9771 0.9982
RF 0.9780 0.9801 0.9758 0.9985
Table 8: Count of distinct values per columns in Talk- Figure 2: Catboost model feature importance (Talk-
ing data training set. ingData dataset).
count of unique
feature formed previously proposed models. The best model
values
on this set, Catboost provides an F1 score of 0.9888
IP 123099 which is significantly higher than the value of 0.9442
device 1450 reported in [8].
OS 558 With regard to the prospects of the study, we will
channel 496 first explore dimensionality reduction techniques [19]
app 383 and encoding methods in order to improve the per-
hour 24 formance of the classifiers. Second, we will leverage
dayofweek 4 the power of Big Data tools (for e.g Spark) to train
and optimize the models on larger samples of data.
F1 (see Table 7). Especially, our best model Catboost In addition to that, we aim at investigating differ-
demonstrates a comparable precision and a better re- ent meta learning techniques combining Catboost and
call. Relying on F1 score alone to compare our models H2O models to build robust classifiers and further pre-
would be problematic since in the TalkingData’s con- vent fraud in our website.
text the positive class correponds to the non fraudulent Furthermore, in our future work we will tackle the
clicks. In the TalkingData adTracking Fraud Detec- problem of detecting fraud “as soon as possible”. It
tion Challenge, Kaggle competitors’ machine learning is crucial that fraudulent listings are detected before
models were evaluated based on AUC. Using such a it reaches the audience. To this end we plan to in-
metric, our Catboost model yields an AUC of 0.9994 clude further features such as buyers’ and sellers’ user
compared to 0.997 from Gupta et al. [8]. activity. Finally, we would like to highlight that the
work present in this paper is currently in production,
7 Conclusions protecting buyers and sellers at mobile.de, and due to
We presented a case study which described the appli- that we refrain from disclosing more technical details
cation of ensemble methods to detect fraud in a large that could help malicious users to bypass our detection
scale online marketplace (mobile.de). The business system.
value of such an investigation is twofold. First, to en-
Furthermore, in future work we will tackle the problem of detecting fraud "as soon as possible": it is crucial that fraudulent listings are detected before they reach the audience. To this end, we plan to include further features such as buyers' and sellers' user activity. Finally, we would like to highlight that the work presented in this paper is currently in production, protecting buyers and sellers at mobile.de; for that reason, we refrain from disclosing more technical details that could help malicious users bypass our detection system.

8 Acknowledgements

We would like to thank the Customer Service team at mobile.de for their countless hours of manual work in detecting fraud, and for providing us the ground truth to start this work. We would also like to thank the members of the TnS and Data teams at mobile.de who have directly and indirectly been involved in this work, with special thanks to Moritz Aschoff and Matthias Radtke.

References

[1] Abdi, H. and Williams, L. J. (2010). Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):433–459.

[2] Adewumi, A. O. and Akinyelu, A. A. (2017). A survey of machine-learning and nature-inspired based credit card fraud detection techniques. International Journal of System Assurance Engineering and Management, 8(2):937–953.
[3] Aiello, S., Click, C., Roark, H., Rehak, L., and Stetsenko, P. (2016). Machine Learning with Python and H2O. Edited by Lanford, J., published by H2O.ai.

[4] Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

[5] Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357.

[6] Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794.

[7] Ghori, K. M., Abbasi, R. A., Awais, M., Imran, M., Ullah, A., and Szathmary, L. (2019). Performance analysis of different types of machine learning classifiers for non-technical loss detection. IEEE Access, 8:16033–16048.

[8] Gupta, N., Le, H., Boldina, M., and Woo, J. (2019). Predicting fraud of ad click using traditional and Spark ML. In KSII The 14th Asia Pacific International Conference on Information Science and Technology (APIC-IST), pages 24–28.

[9] Hossin, M. and Sulaiman, M. (2015). A review on evaluation metrics for data classification evaluations. International Journal of Data Mining & Knowledge Management Process, 5(2):1.

[10] Kanei, F., Chiba, D., Hato, K., Yoshioka, K., Matsumoto, T., and Akiyama, M. (2020). Detecting and understanding online advertising fraud in the wild. IEICE Transactions on Information and Systems, 103(7):1512–1523.

[11] LeDell, E. and Poirier, S. (2020). H2O AutoML: Scalable automatic machine learning. 7th ICML Workshop on Automated Machine Learning (AutoML).

[12] Lee, S.-J., Ahn, C., Song, K. M., and Ahn, H. (2018). Trust and distrust in e-commerce. Sustainability, 10(4):1015.

[13] Meng, C., Zhou, L., and Liu, B. (2020). A case study in credit fraud detection with SMOTE and XGBoost. In Journal of Physics: Conference Series, volume 1601, page 052016. IOP Publishing.

[14] Minastireanu, E.-A. and Mesnita, G. (2019). LightGBM machine learning algorithm to online click fraud detection. Journal of Information Assurance & Cybersecurity, 2019.

[15] Mohammed, R. A., Wong, K.-W., Shiratuddin, M. F., and Wang, X. (2018). Scalable machine learning techniques for highly imbalanced credit card fraud detection: a comparative study. In Pacific Rim International Conference on Artificial Intelligence, pages 237–246. Springer.

[16] Najem, S. M. and Kadeem, S. M. (2021). A survey on fraud detection techniques in e-commerce.

[17] Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A. (2018). CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, pages 6638–6648.

[18] Pun, J. and Lawryshyn, Y. (2012). Improving credit card fraud detection using a meta-classification strategy. International Journal of Computer Applications, 56(10).

[19] Rajora, S., Li, D.-L., Jha, C., Bharill, N., Patel, O. P., Joshi, S., Puthal, D., and Prasad, M. (2018). A comparative study of machine learning techniques for credit card fraud detection based on time variance. In 2018 IEEE Symposium Series on Computational Intelligence (SSCI), pages 1958–1963. IEEE.

[20] Renjith, S. (2018). Detection of fraudulent sellers in online marketplaces using support vector machine approach. arXiv preprint arXiv:1805.00464.

[21] Suganya, S. and Kamalra, M. (2016). Meta classification technique for improving credit card fraud detection. International Journal of Scientific and Technical Advancements, 2(1):101–105.

[22] Thejas, G., Dheeshjith, S., Iyengar, S., Sunitha, N., and Badrinath, P. (2021). A hybrid and effective learning approach for click fraud detection. Machine Learning with Applications, 3:100016.

[23] Wang, M., Yu, J., and Ji, Z. (2018). Credit fraud risk detection based on XGBoost-LR hybrid model. In Proc. Int. Conf. Electron. Bus., volume 2, pages 336–343.

[24] Zhang, Z., Zhou, X., Zhang, X., Wang, L., and Wang, P. (2018). A model based on convolutional neural network for online transaction fraud detection. Security and Communication Networks, 2018.