Machine Learning Methods for Detecting Fraud in Online Marketplaces

Raoul Dekou1,*, Sabljic Savo2, Simon Kufeld3, Diana Francesca2, and Ricardo Kawase1

1 Mobile.de, Marktplatz 1 Europarc Dreilinden, 14532, Berlin, Germany
2 Codecentric AG, Hochstraße 11, 42697, Solingen, Germany
3 Inovex GmbH, Ludwig-Erhard-Allee 6, 76131, Karlsruhe, Germany
* Corresponding author: Raoul Dekou, rdekou@team.mobile.de

Abstract

Connecting buyers and sellers in a safe and secure environment is one of the biggest challenges in online marketplaces. Probabilistic models built upon user-item databases address the challenge, but often encounter issues such as lack of stability and robustness. These issues are magnified in fraud scenarios, where datasets are highly imbalanced and noisy, and malicious users deliberately adapt their behaviors to avoid detection. In this context, we leveraged the power of the existing open source machine learning libraries H2O and Catboost and designed a pipeline to collect, process and predict the likelihood of a private seller's listing data being fraudulent. We found that the stacked ensemble model provides the best performance (F1=0.73) when compared to other commonly used models in the field. Further, our models are benchmarked on a public Kaggle dataset, the TalkingData AdTracking Fraud Detection Challenge, where we compared them to other studies and highlighted their generalizability and effectiveness at handling online fraud.

1 Introduction

As reported in [12], retail e-commerce sales worldwide accounted for 1.86 trillion USD in 2016 and are expected to rise to 4.48 trillion USD in 2021. In the meantime, a recent report on fraud attack trends in the first quarter of 2021^1 confirmed the shift of attacks towards retail websites and estimated that 25% of this traffic is malicious. Such an increase in activity has put considerable pressure on marketplaces, which need to ensure the reliability and security of their services while inspiring trust in buyers.

Unfortunately, the success of online marketplaces attracts unwanted attention from malicious users who try to abuse the platforms for personal monetary gain. mobile.de does not control transactions between buyers and sellers. It is a "matchmaking" platform that bridges the gap between the two sets of entities. Once a user with malicious intent creates an account, he/she also creates an attractive vehicle listing (the goal is to get as many leads as possible). To achieve this, fraudsters take a series of lead-boosting steps. They upload listings of high-demand vehicles onto the platform and set very low yet plausible prices for the vehicles. Since every aspect of the listing looks legitimate (the website, the seller and the vehicle), buyers lower their guard and contact the fraudster. Through a series of interactions, the fraudster is able to convince the buyer (now a victim) to send a pre-payment money transfer, usually as a "reservation" fee. Once this happens, and the damage is done, the victims realize their mistake, contact mobile.de's Customer Service and report the case. Very few cases reach this point; however, the total monthly loss can soar to thousands of Euros.

Copyright © by the paper's authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
In: RWTH Aachen, CEUR-WS: Proceedings of The 2021 International Workshop on Privacy, Security, and Trust in Computational Intelligence, Gold Coast, Queensland, Australia, 01-11-2021, published at https://xuyun-zhang.github.io/pstci2021/

^1 https://securityboulevard.com/2021/07/top-industry-specific-fraud-attack-trends-from-q1-2021/ (accessed on July 2021).
Satisfied customers (buyers and sellers) are the foundation of a valuable and successful marketplace. Thus, providing a secure environment and a safe experience to our customers is a top priority at mobile.de, and the motivation of this work, which aims at preventing and detecting fraudulent activity. To achieve our goals, we tackled the fraud detection problem by leveraging user generated data and building machine learning models which are able to identify fraudulent activities. It is also essential to design robust models of high precision which can generalise well. This paper describes our approach to mitigating fraudulent activity by fraudsters posing as private sellers. Our contribution is twofold. First, we describe a production pipeline to collect, process and score sellers' listings using the open source machine learning libraries Catboost^2 and H2O^3. We briefly highlight how to efficiently use these libraries to pre-select relevant candidate models and tune their hyper-parameters. Second, we demonstrate that our approach could potentially inspire other use cases by verifying our detection methods on a sample of a large dataset publicly available at Kaggle.com^4.

The remainder of this paper is structured as follows. In Section 2, we discuss existing work in the field. In Section 3, we provide a deeper understanding of the problem and formalize it. In Sections 4 and 5, we describe our methodology to tackle the problem. Section 6 contains our results, followed by the conclusion and prospects.

2 Related Work

Techniques used to detect fraud can be divided into two groups: expertise based and data driven. In the first group, experts use their knowledge to build a set of rules that are tested and refined to filter out fraudulent activities. However, contrary to machine learning solutions, traditional expert techniques sometimes lack the ability to model non trivial online connections [24]. The second set of techniques, data driven, i.e. machine learning solutions, overcome this issue but yield different challenges. While the increase of activity in marketplaces generates massive datasets which require model scalability, the low occurrence of fraudulent events produces imbalanced datasets. Maintaining both a high precision and recall is often a challenge, and many models produce significant misclassification errors [2] which result in genuine customers being flagged as fraudulent. Finally, there is also the need for dynamic solutions, given that fraudsters adapt their behaviors to a point where they are able to bypass detection by machine learning models.

The literature offers various examples of machine learning methods which aim at detecting fraud. Najem and Kadeem's [16] recent survey on fraud detection techniques in e-commerce provides a broad view of the performance of several models on various datasets. It highlights that Random Forest (RF) is the most used and usually the most accurate of all methods. Though Naive Bayes algorithms are easy to implement, they are limited compared to decision trees when it comes to modelling non linear problems. Such information was taken into consideration when selecting candidate models for our pipeline, which consists essentially of decision tree ensembles (RF, Xgboost and Catboost). For instance, Kanei et al. [10] trained a Random Forest model for detecting fraudulent ad requests. In their study, they demonstrated that the model robustness challenge could be addressed by means of features which could not be controlled by fraudsters, such as network statistics from clients and publishers. This set-up allowed them to improve their recall rate by 10%. Renjith [20] described a pipeline using a Support Vector Machine (SVM) to detect fraudulent sellers in an online marketplace. The author specifically pointed out that a cold start problem may arise for new users when using predictive models with seller or transaction information as features. In our approach, the cold start effect was mitigated by removing these types of features. Gupta et al. [8] benchmarked ensemble models for predicting the likelihood of a click on a mobile phone advertisement being fraudulent, on a publicly available Kaggle dataset. They tested two configurations: traditional and Big Data. In the traditional configuration, they combined different sampling techniques (SMOTE, stratified sampling, etc.) to reduce the data size and handle the imbalanced training set. This dataset, which has been widely used in previous studies [8, 14, 22], is employed in our study, and results from Gupta et al. [8] are used as our baseline. In our work, we applied the same preprocessing techniques and compared our results to their best model, Two Class Decision Forest^5, with an F1 score of 0.944. Using a sample of the same dataset, Minastireanu and Mesnita [14] trained a Lightgbm model to detect fraudulent clicks and reported an accuracy of 98%. The authors specifically described an example of how feature engineering on the original feature set (click time, device, channel, etc.) and K fold cross validation are combined to enable high performance. Besides, by testing their model on a large data sample (18 million user clicks), they proved the robustness of the boosting machine for the case study. In the same context, Mohammed et al. [15] investigated the scalability of Random Forest, Balanced Bagging Ensemble and Gaussian Naive Bayes on massive and highly imbalanced credit card fraud datasets. They found that random undersampling is effective at handling imbalanced datasets and, combined with RF, suitable for real time applications on large datasets. In their study, the Random Forest model provided the highest recall of 91%. Rajora et al. [19] benchmarked the performance of various machine learning algorithms on a credit card transaction dataset with 31 attributes. They used a random undersampling technique to address the data imbalance and Principal Component Analysis (PCA) [1] as a dimensionality reduction technique. On top of the PCA features, a time feature corresponding to the time delay from the first transaction is part of the training set. Furthermore, the authors illustrated how the inclusion of this feature can impact the performance: RF performed better without the time feature, while Gradient Boosting Regression Tree performance was constant. Meng et al. [13] also used a real world credit card transactions dataset and combined Xgboost and sampling techniques to achieve great performance. The SMOTE technique allowed an increase of the recall from 0.8062 to 0.9 and of the AUC from 0.9795 to 0.9853. Mohammed et al. [15] reported that Neural Networks tend to overfit on fraud datasets and struggle to handle imbalanced datasets. Nevertheless, as illustrated by Adewumi and Akinyelu [2] in their survey, such techniques are also commonly used for credit card fraud detection. Najem and Kadeem [16] pointed out that hybrid methods, which combine several methods to build a robust learner, provide better performance than individual learners. For example, Wang et al. [23] built a hybrid mixed model consisting of Xgboost and Logistic Regression (LR) and benchmarked it against common baseline models such as Xgboost, RF, SVM, Naive Bayes and Logistic Regression on the German Credit dataset published by UCI^6. In the hybrid model, an effective feature combination was obtained by using Xgboost leaf nodes as features for the LR model. This set-up provided an AUC of 0.8321, far beyond the value of 0.7321 obtained with LR, the best individual model. Other studies such as [18] and [21] use meta learning techniques to enhance the performance on credit card fraud datasets. However, combining the output of different classifiers to build a model reduces the classification speed [2], which might be an issue on big datasets.

3 Problem statement

mobile.de supports two different types of sellers, namely dealers and private sellers. Dealers are registered dealerships in Germany and neighbouring countries who are paying customers of mobile.de. These are professional sellers who make a living out of buying and selling vehicles. Private sellers are regular citizens who own a vehicle and use a classified market to sell it (not registered as a business). Internally, at mobile.de a private seller is labelled and named FSBO (For Sale By Owner), and for the rest of this paper, we will address a private seller with the same terminology. Although there are several malicious activities which can be classified as fraud, such as account take over, falsification of documents, etc., our objective in this study is focused on a single type of user (FSBOs) that creates fraudulent (fake) listings.

Our pipeline overview is depicted in Figure 1. When a listing is created (or updated), our machine learning models generate a fraud probability prediction and, in case the result is above a certain threshold, the listing is manually evaluated by a Customer Service (CS) agent, who reviews the content of the listing and assigns a rating (ground truth). In addition to listings flagged by our ML models, Customer Service agents extend their reviewing process to listings which might have received users' complaints. Eventually, one way or another, every fraudulent listing is flagged in our dataset, the vast majority before damage is done, and in very few cases through reports from scam victims. The main classification task is binary, in the sense that the target variable to predict has two possible outcomes, OK or FRAUD. The goal is to detect when a vehicle listing is (or becomes) fraudulent. This can happen at insertion time (version 1 of the listing) or at any time later due to a modification in the data.

Figure 1: mobile.de in-house data collection and pipeline overview.

4 Datasets

In this study, we used two different datasets to train and test our machine learning models: the mobile.de in-house dataset and a tailored sample of the TalkingData AdTracking Fraud Detection Challenge dataset obtained from the machine learning competition platform Kaggle.

At mobile.de, FRAUD cases are less frequent (positive cases) than OK cases, leading to a highly imbalanced dataset. The in-house dataset consists of 27 categorical variables and 10 continuous ones. To maintain the confidentiality of our data points, and to eliminate the risk of giving any clues that could lead to learnings on how to bypass our fraud detection models, we refrain from disclosing the exact names of the attributes and features.

Table 1: In-house dataset preprocessing steps.
  time based split     - test (latest week), non overlapping
                       - train (28 days)
  undersampling        random undersampling of the training set,
                       10% of negative cases kept
  missing values       kept and processed by machine learning models
  feature engineering  yes (confidential)

Table 2: TalkingData dataset preprocessing steps.
  subsampling          15% random sample of unique IPs, then 8%
                       stratified sample from the remaining set
  oversampling         SMOTE with k=5 neighbours, positive class
                       up to 11%
  missing values       absent
  stratified split     - test (30%)
                       - train (70%)
  feature engineering  - click hour & day of the week
                       - attributed time is removed

^2 https://www.catboost.ai/ (accessed on July 2021).
^3 https://www.h2o.ai/ (accessed on 16 July 2021).
^4 https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data (downloaded on 16 July 2021).
^5 https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-module-reference/two-class-decision-forest (accessed on July 2021).
^6 https://archive.ics.uci.edu/ml/index.php (accessed on July 2021).
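The sampling recipes in Tables 1 and 2 reduce to three primitives: random undersampling of the majority class, SMOTE-style interpolation between minority neighbours, and a stratified train/test split. The sketch below illustrates the three on a synthetic toy dataset with plain numpy; the function names and the toy data are ours, for illustration only, and are not the production pipeline or the SMOTE implementation of [5].

```python
import numpy as np

rng = np.random.default_rng(0)

def undersample_majority(X, y, keep_frac=0.10):
    """Keep all positive (minority) rows and a random keep_frac of the majority class."""
    majority = np.flatnonzero(y == 0)
    minority = np.flatnonzero(y == 1)
    kept = rng.choice(majority, size=int(len(majority) * keep_frac), replace=False)
    idx = np.concatenate([minority, kept])
    return X[idx], y[idx]

def smote_like(X_min, n_new, k=5):
    """SMOTE-style oversampling: interpolate between a minority point
    and one of its k nearest minority neighbours."""
    d = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]              # k nearest neighbours per row
    base = rng.integers(0, len(X_min), n_new)      # random seed points
    nb = nn[base, rng.integers(0, k, n_new)]       # one random neighbour each
    gap = rng.random((n_new, 1))                   # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])

def stratified_split(X, y, train_frac=0.7):
    """Per-class shuffle-and-cut so both classes keep the same train share."""
    train_idx, test_idx = [], []
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        cut = int(len(idx) * train_frac)
        train_idx.append(idx[:cut])
        test_idx.append(idx[cut:])
    tr, te = np.concatenate(train_idx), np.concatenate(test_idx)
    return X[tr], y[tr], X[te], y[te]

# toy imbalanced dataset: 2,000 negatives, 40 positives
X = rng.normal(size=(2040, 5))
y = np.r_[np.zeros(2000, dtype=int), np.ones(40, dtype=int)]

X_under, y_under = undersample_majority(X, y)   # majority shrunk to 10%
X_synth = smote_like(X[y == 1], n_new=200)      # 200 synthetic positives
X_tr, y_tr, X_te, y_te = stratified_split(X, y)
```

Note that undersampling and SMOTE are applied to the training portion only, so that the test set keeps the natural class balance.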
The public dataset is taken from China's largest independent big data service platform, which covers 70% of active mobile devices in the country and handles up to 3 billion clicks per day, of which 90% are potentially fraudulent. Contrary to the mobile.de case, here click fraud is the most frequent class (negative class) and occurs when a person or an automated bot acting as a legitimate user clicks on an app ad without downloading the app afterwards. The raw dataset contains 200 million clicks over a 4 day period. It includes 7 data fields (IP, app, device, OS, channel, click time, attributed time) and a binary target to predict (is_attributed). The target variable is imbalanced, with 99.8% of negative cases.

Tables 1 and 2 summarize the preprocessing steps applied to the mobile.de and TalkingData datasets respectively. For our in-house dataset, the testing set corresponds to samples recorded in the 7 days prior to the day the model was trained. The training set corresponds to the 28 days of data prior to the start date of the testing set. The time-based split was done to prevent the model from learning from future observations. In order to reduce the imbalance and increase the performance, we applied random undersampling and kept 10% of the majority class in the training set. This resulted in around 200,000 training samples and 240,000 testing ones. We kept raw missing entries within the sets; the H2O and Catboost models handled them as separate categories^7,8.

^7 https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm-faq/missing_values.html (accessed on 16 July 2021).
^8 https://catboost.ai/docs/concepts/algorithm-missing-values-processing.html (accessed on 16 July 2021).

For the Kaggle dataset, we borrowed the preprocessing steps from [8] and engineered two additional features: click hour of the day and day of the week. First, we reduced the data size by randomly sampling 15% of unique IP addresses and retaining a stratified sample of 8% of the remaining set. To handle the imbalance, we applied the Synthetic Minority Oversampling Technique (SMOTE) [5] with 5 neighbours and oversampled the positive class up to 11%. We then applied a stratified split, keeping 70% of the set for training. The final set has 1,706,481 training samples and 731,349 testing ones, without any missing values.

5 Training Machine Learning Models

In this section, we briefly summarize the theoretical concepts behind the models used in our study, provide an overview of the machine learning libraries in which the models were implemented, and finally describe the hyper-parameter tuning steps and our performance metrics.

As stated in [4], Random Forest is an ensemble machine learning algorithm consisting of a collection of decision trees, each built from random samples. In each tree, thresholds are applied to the input features to maximize information gain while minimizing an impurity function (e.g. Cross Entropy, Mean Squared Error, etc.). The final score is given by the average of the scores of all trees. Besides, RF provides maximum depth and minimum sample split parameters to prevent the decision trees from overfitting on the training set.

Xgboost [6] is another ensemble method which belongs to the large family of boosting algorithms. In general, boosting models combine shallow decision trees (also called weak learners), each built sequentially considering the errors of the previous trees, to reduce bias and variance at the same time.
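The sequential error-correction idea behind these boosting ensembles can be illustrated in a few lines: each new weak learner is fit to the residuals of the ensemble built so far, and its prediction is added with a shrinkage factor (the learning rate). The following is a generic squared-loss sketch with depth-1 threshold rules on synthetic data, not the actual Xgboost or Catboost internals:

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_stump(x, residual):
    """Weak learner: a single-feature threshold rule (depth-1 tree)
    minimizing squared error against the current residuals."""
    best = None
    for j in range(x.shape[1]):
        for t in np.quantile(x[:, j], [0.25, 0.5, 0.75]):
            left = x[:, j] <= t
            if left.all() or (~left).all():
                continue
            pred = np.where(left, residual[left].mean(), residual[~left].mean())
            err = ((residual - pred) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, t, residual[left].mean(), residual[~left].mean())
    _, j, t, left_value, right_value = best
    return lambda z: np.where(z[:, j] <= t, left_value, right_value)

def boost(x, y, n_rounds=50, lr=0.1):
    """Sequentially add stumps, each fit to the errors of the ensemble so far."""
    pred = np.full(len(y), y.mean())
    for _ in range(n_rounds):
        stump = fit_stump(x, y - pred)   # fit the current residual errors
        pred = pred + lr * stump(x)      # shrink the correction and add it
    return pred

x = rng.normal(size=(300, 3))
y = (x[:, 0] > 0).astype(float) + 0.1 * rng.normal(size=300)
pred = boost(x, y)
mse_before = ((y - y.mean()) ** 2).mean()   # constant-prediction baseline
mse_after = ((y - pred) ** 2).mean()        # shrinks as rounds accumulate
```

Because each stump is the least-squares fit to the residuals, every round is guaranteed not to increase the training error; regularization (as in Xgboost and Catboost) is what keeps this from turning into overfitting.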
Xgboost in particular is an advanced implementation of gradient boosting which includes additional features such as parallel processing and regularization techniques for handling overfitting.

Introduced in [17], Catboost is a boosting model designed to handle and process categorical data efficiently. By default, the Catboost implementation uses a one hot encoding technique on categorical variables, except for the ones with high cardinality. In such a case, ordered target statistics [17] are used to maximize information gain. Contrary to other machine learning techniques, which require preprocessing steps to convert categorical data into numbers, Catboost requires only the indices of the categorical features [7].

Meta learning techniques aim at combining the output of several base learners to improve the prediction accuracy and utilize the strength of one learner to complement the weaknesses of others [18]. In this study, we used H2O AutoML [11] to build a stacked ensemble. AutoML brings a simple wrapper function optimized for training and combining a large number of models in a short amount of time. This module evaluates single machine learning models (GBM^9, Xgboost, RF, Extremely Randomized Trees^10, Artificial Neural Networks^11 and Generalised Linear Models^12) and their stacked ensembles on validation sets using relevant metrics (e.g. AUC, logloss, etc.). The best performing model is then retained for deployment.

H2O is an open source distributed software library for machine learning and deep learning applications. Its frame and cluster abstractions allow to easily process tabular data of various types in a distributed fashion. The H2O platform supports various interfaces, including R, Python and Java, making it easier to complete analytic workflows [3]. In our case, we used the H2O Python interface to train and optimize Distributed Random Forest (DRF), Xgboost and AutoML models. The trained models are saved in MOJO (Model Object Optimized) format and later embedded in a JAVA environment for real time predictions.

The Catboost library is another high performance open source framework for gradient boosting on decision trees. Similar to H2O, the Catboost library supports Python, R and JAVA interfaces. For this study, we combined Catboost's Python and JAVA interfaces for model training and deployment.

5.1 Hyperparameters tuning

The parameter optimization described in this section is limited to our in-house dataset. In fact, because of TalkingData's large sample size (1,706,481 entries), carrying out an extensive hyper-parameter tuning is daunting. Therefore, for this dataset, we applied a full parameter optimization only for the Catboost model and kept similar parameters for its H2O counterparts.

For H2O, 3, 5 and 10 fold Cross Validation (CV) provided the best performance for RF, AutoML and Xgboost respectively. These models' hyperparameters are depicted in Table 3. However, on the public dataset, we set the maximum number of models to 10 and the number of folds to 3 to circumvent memory limitations for AutoML.

Table 3: H2O models hyperparameters (in-house dataset).
  parameter                         RF       Xgb      AutoML
  maximum number of models          -        -        20
  number of trees                   100      1000     -
  maximum depth                     50       35       -
  number of columns for a DT split  9        -        -
  columns sample rate               -        0.8      -
  sample rate                       -        0.8      -
  learning rate                     -        0.009    -
  early stopping metric             logloss  logloss  logloss
  early stopping rounds             -        25       3

Table 4: Catboost hyperparameters and Hyperopt "quantized" continuous distributions minimum and maximum values used for optimisation.
  Parameter          Hyperopt function  min    max
  l2_leaf_reg        qloguniform        0      2
  learning_rate      qloguniform        0.001  0.5
  subsample          quniform           0.5    1
  colsample_bylevel  quniform           0.5    1

For Catboost, the Python library Hyperopt^13 allowed hyperparameter optimization.

^9 https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm.html (accessed on 16 July 2021).
^10 https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/drf.html#extremely-randomized-trees (accessed on 16 July 2021).
^11 https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/deep-learning.html (accessed on 16 July 2021).
^12 https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html (accessed on 16 July 2021).
^13 https://github.com/hyperopt/hyperopt (accessed on 16 July 2021).
Hyperopt provides custom functions for hyperparameter search. Each parameter value is retrieved from a list of candidates taken from a specific "quantized" continuous distribution such as qloguniform and quniform (see Table 4). Besides, the models are trained for 500 iterations, using 3 fold CV, the logarithmic loss function and the Area Under the Receiver Operating Characteristic Curve (AUC) evaluation metric.

5.2 Performance metrics

In an imbalanced classification task, the positive class denotes the less frequent value of the target and the negative class is its complement. When scoring a model, an optimal solution can be derived from the confusion matrix [9]. True Positives (TP) and True Negatives (TN) occur when the output of the model matches the ground truth label on the positive and negative classes respectively. Conversely, False Positives (FP) and False Negatives (FN) occur when the model provides predictions which mismatch the true labels. To convert model probabilities into classes, we chose the threshold which maximizes the F1 score on the testing set. The F1 score is the harmonic mean of precision and recall and evaluates the accuracy of the model at predicting the positive class. Another popular evaluation metric is the Area Under the Receiver Operating Characteristic Curve. Contrary to the previous metrics, it assesses the ability of a classifier to distinguish between classes independently of any selected threshold.

6 Results

In order to retain candidate models for our evaluation, we first benchmarked a large pool of machine learning models. For this purpose, H2O AutoML objects provide the leaderboard() method, which allows to rank the models trained to build the stacked ensemble on a chosen dataset and metric. These models are optimised with AutoML's predefined random grid parameter searches, which are different from our production hyper-parameter tuning described in the previous section. Table 5 summarizes the AUC obtained on our in-house test dataset, limited to the best algorithm of each family (GBM, Xgboost, RF, Extremely Randomized Trees, Artificial Neural Networks and Generalised Linear Models). Tree based models outperform Artificial Neural Networks and Generalised Linear Models; they suit complex non linear problems well [16]. In particular, GBM and Xgboost yield the best AUC of 0.982, followed by Random Forest with an AUC of 0.9790. Besides, Najem and Kadeem's [16] survey on fraud detection techniques in e-commerce demonstrated that RF has the highest frequency of usage and is the best performing one across various use cases. Based on these observations, we initially retained AutoML, Xgboost and RF for our benchmark. The Catboost model, which is not part of H2O, was benchmarked separately and added later for the comparison.

Table 5: Area Under the Receiver Operating Characteristic Curve of the best single learner of each model family derived from the H2O AutoML leaderboard() method (in-house dataset).
  Metric                                  AUC
  Stacked Ensemble (all models)           0.9850
  Stacked Ensemble (best of each family)  0.9848
  Gradient Boosting Machine               0.9826
  Extreme Gradient Boosting               0.9821
  Random Forest                           0.9790
  Extremely Randomized Trees              0.9719
  Generalized Linear Model                0.9690
  Artificial Neural Network               0.9200

Tables 6 and 7 illustrate the performance metrics obtained from the different models on the mobile.de and TalkingData datasets respectively. On the first one, AutoML's best model (stacked ensemble) yields an F1 score of 0.73, which is higher than the 0.71 obtained with Xgboost and Catboost and the 0.68 obtained with Random Forest. It has been reported in [11] that stacked ensemble models usually produce better performance than the individual models (Xgboost, Random Forest, etc.) used in an AutoML run, in accordance with our findings. On the TalkingData dataset, the Catboost model yields the best performance with an F1 score of 0.988. The Catboost model is designed to process heterogeneous data with categorical variables efficiently [17]. The feature cardinality is highlighted in Table 8. One hot encoding on one side and ordered target statistics applied to variables of high cardinality on the other have a significant impact on the model performance. Catboost also provides the get_feature_importance() method, which gives the contribution of each feature to the ensemble model. The output of this method is summarized in Figure 2: the app id for marketing and the IP address of the click are the most important features.

Table 6: Machine learning models performance summary (in-house dataset).
  Model     F1      Precision  Recall  AUC
  AutoML    0.7293  0.7206     0.7833  0.9850
  Xgb       0.7134  0.7104     0.7165  0.9794
  Catboost  0.7127  0.7375     0.6895  0.9809
  RF        0.6810  0.7274     0.6401  0.9786

Table 7: Machine learning models performance summary (TalkingData dataset).
  Model     F1      Precision  Recall  AUC
  Catboost  0.9888  0.9902     0.9873  0.9994
  AutoML    0.9800  0.9848     0.9752  0.9987
  Xgb       0.9787  0.9804     0.9771  0.9982
  RF        0.9780  0.9801     0.9758  0.9985

Table 8: Count of distinct values per column in the TalkingData training set.
  feature    count of unique values
  IP         123099
  device     1450
  OS         558
  channel    496
  app        383
  hour       24
  dayofweek  4

Figure 2: Catboost model feature importance (TalkingData dataset).

In order to assess the generalizability of our modelling approach at detecting fraud, we compared our models with the work of Gupta et al. [8]. Their best model, a Two Class Decision Forest classifier, provides a precision of 0.992 and a recall of 0.902, corresponding to an F1 score of 0.9442. All the models used in our experiment outperform their results in terms of F1 (see Table 7). In particular, our best model, Catboost, demonstrates a comparable precision and a better recall. Relying on the F1 score alone to compare our models would be problematic, since in the TalkingData context the positive class corresponds to the non fraudulent clicks. In the TalkingData AdTracking Fraud Detection Challenge, Kaggle competitors' machine learning models were evaluated based on AUC. Using such a metric, our Catboost model yields an AUC of 0.9994, compared to 0.997 from Gupta et al. [8].
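The threshold selection of Section 5.2 — scan the distinct held-out scores as candidate cut-offs and keep the one maximizing F1 — can be written directly from the confusion matrix definitions. The scores and labels below are a toy example, not our data:

```python
import numpy as np

def f1_at(threshold, scores, labels):
    """F1 of the positive (FRAUD) class when probabilities >= threshold are flagged."""
    pred = (scores >= threshold).astype(int)
    tp = int(((pred == 1) & (labels == 1)).sum())
    fp = int(((pred == 1) & (labels == 0)).sum())
    fn = int(((pred == 0) & (labels == 1)).sum())
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)  # harmonic mean

def best_f1_threshold(scores, labels):
    """Evaluate every distinct score as a cut-off and keep the F1-maximizing one."""
    candidates = np.unique(scores)
    f1s = [f1_at(t, scores, labels) for t in candidates]
    i = int(np.argmax(f1s))
    return float(candidates[i]), float(f1s[i])

# toy example: fraudulent cases tend to score higher
scores = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.70, 0.80, 0.90])
labels = np.array([0, 0, 0, 1, 0, 1, 1, 1])
t, f1 = best_f1_threshold(scores, labels)
```

Because only the distinct scores can change the confusion matrix, scanning them is sufficient; on large score vectors the same scan is usually done on the sorted arrays with cumulative sums instead of a per-threshold pass.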
7 Conclusions

We presented a case study which described the application of ensemble methods to detect fraud in a large scale online marketplace (mobile.de). The business value of such an investigation is twofold: first, to enable a trustworthy customer experience and enhance customer satisfaction; second, to reduce the Customer Service operational cost of resolving fraudulent cases.

To achieve our goals, we designed a machine learning pipeline based on sellers' listing data and optimized a way to address common challenges in fighting fraud (fraudster adaptability, dataset imbalance, high false positive rate, etc.). The main contribution of this study is a pipeline using open source data science libraries to collect, process and score sellers' listings to efficiently detect fraud. Our best model, the AutoML stacked ensemble, provided an F1 score of 0.73, outperforming Catboost, Xgboost and Random Forest. These models were later tested on the public TalkingData dataset from the Kaggle competition platform, showed great robustness at detecting fraud and outperformed previously proposed models: on this set, the best model, Catboost, provides an F1 score of 0.9888, which is significantly higher than the value of 0.9442 reported in [8].

With regard to the prospects of the study, we will first explore dimensionality reduction techniques [19] and encoding methods in order to improve the performance of the classifiers. Second, we will leverage the power of Big Data tools (e.g. Spark) to train and optimize the models on larger samples of data. In addition to that, we aim at investigating different meta learning techniques combining Catboost and H2O models to build robust classifiers and further prevent fraud on our website.

Furthermore, in our future work we will tackle the problem of detecting fraud "as soon as possible". It is crucial that fraudulent listings are detected before they reach the audience. To this end, we plan to include further features such as buyers' and sellers' user activity. Finally, we would like to highlight that the work presented in this paper is currently in production, protecting buyers and sellers at mobile.de, and for that reason we refrain from disclosing more technical details that could help malicious users bypass our detection system.

8 Acknowledgements

We would like to thank the Customer Service team at mobile.de for their countless hours of manual work in detecting fraud, and for providing us the ground truth to start our work. We would also like to thank the members of the TnS and Data teams at mobile.de who have directly and indirectly been involved in this work, with special thanks to Moritz Aschoff and Matthias Radtke.

References

[1] Abdi, H. and Williams, L. J. (2010). Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):433–459.

[2] Adewumi, A. O. and Akinyelu, A. A. (2017). A survey of machine-learning and nature-inspired based credit card fraud detection techniques. International Journal of System Assurance Engineering and Management, 8(2):937–953.

[3] Aiello, S., Click, C., Roark, H., Rehak, L., and Stetsenko, P. (2016). Machine Learning with Python and H2O. Edited by Lanford, J. H2O.ai.

[4] Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

[5] Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357.

[6] Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794.

[7] Ghori, K. M., Abbasi, R. A., Awais, M., Imran, M., Ullah, A., and Szathmary, L. (2019). Performance analysis of different types of machine learning classifiers for non-technical loss detection. IEEE Access, 8:16033–16048.

[8] Gupta, N., Le, H., Boldina, M., and Woo, J. (2019). Predicting fraud of ad click using traditional and Spark ML. In KSII The 14th Asia Pacific International Conference on Information Science and Technology (APIC-IST), pages 24–28.

[9] Hossin, M. and Sulaiman, M. (2015). A review on evaluation metrics for data classification evaluations. International Journal of Data Mining & Knowledge Management Process, 5(2):1.

[10] Kanei, F., Chiba, D., Hato, K., Yoshioka, K., Matsumoto, T., and Akiyama, M. (2020). Detecting and understanding online advertising fraud in the wild. IEICE Transactions on Information and Systems, 103(7):1512–1523.

[11] LeDell, E. and Poirier, S. (2020). H2O AutoML: Scalable automatic machine learning. 7th ICML Workshop on Automated Machine Learning (AutoML).

[12] Lee, S.-J., Ahn, C., Song, K. M., and Ahn, H. (2018). Trust and distrust in e-commerce.

[15] Mohammed, R. A., Wong, K.-W., Shiratuddin, M. F., and Wang, X. (2018). Scalable machine learning techniques for highly imbalanced credit card fraud detection: a comparative study. In Pacific Rim International Conference on Artificial Intelligence, pages 237–246. Springer.

[16] Najem, S. M. and Kadeem, S. M. (2021). A survey on fraud detection techniques in e-commerce.

[17] Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A. (2018). CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, pages 6638–6648.

[18] Pun, J. and Lawryshyn, Y. (2012). Improving credit card fraud detection using a meta-classification strategy. International Journal of Computer Applications, 56(10).

[19] Rajora, S., Li, D.-L., Jha, C., Bharill, N., Patel, O. P., Joshi, S., Puthal, D., and Prasad, M. (2018). A comparative study of machine learning techniques for credit card fraud detection based on time variance. In 2018 IEEE Symposium Series on Computational Intelligence (SSCI), pages 1958–1963. IEEE.

[20] Renjith, S. (2018). Detection of fraudulent sellers in online marketplaces using support vector machine approach. arXiv preprint arXiv:1805.00464.

[21] Suganya, S. and Kamalra, M. (2016). Meta classification technique for improving credit card fraud detection. International Journal of Scientific and Technical Advancements, 2(1):101–105.

[22] Thejas, G., Dheeshjith, S., Iyengar, S., Sunitha, N., and Badrinath, P. (2021). A hybrid and effective learning approach for click fraud detection. Machine Learning with Applications, 3:100016.

[23] Wang, M., Yu, J., and Ji, Z. (2018). Credit fraud risk detection based on XGBoost-LR hybrid model. In Proc. Int. Conf. Electron. Bus., volume 2, pages 336–343.

[24] Zhang, Z., Zhou, X., Zhang, X., Wang, L., and Wang, P. (2018). A model based on convolutional neural network for online transaction fraud detection. Security and Communication Networks, 2018.
Sustainability, 10(4):1015. [13] Meng, C., Zhou, L., and Liu, B. (2020). A case study in credit fraud detection with smote and xgboost. In Journal of Physics: Conference Series, volume 1601, page 052016. IOP Publishing. [14] Minastireanu, E.-A. and Mesnita, G. (2019). Light gbm machine learning algorithm to online click fraud detection. J. Inform. Assur. Cybersecur, 2019. 8
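As an illustrative aside, the metric caveat discussed above (F1 changes depending on which class is designated as positive, while AUC does not) can be demonstrated with a minimal sketch using toy data and scikit-learn; the numbers below are made up for illustration and are not from the paper's experiments:

```python
# Sketch (toy data, not the paper's code): F1 depends on the choice of
# positive class, while ROC AUC is invariant to it and threshold-free.
from sklearn.metrics import f1_score, roc_auc_score

# Imbalanced toy ground truth: 1 = non-fraudulent click, 0 = fraudulent.
y_true  = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_score = [0.9, 0.8, 0.85, 0.7, 0.95, 0.6, 0.4, 0.9, 0.3, 0.55]
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]

f1_majority = f1_score(y_true, y_pred, pos_label=1)  # non-fraud as positive
f1_minority = f1_score(y_true, y_pred, pos_label=0)  # fraud as positive
auc = roc_auc_score(y_true, y_score)                 # label-choice invariant
```

Here `f1_majority` (0.875) is much higher than `f1_minority` (0.5) on the same predictions, which is why comparing models by F1 alone is misleading when the positive class is the majority (non-fraudulent) class, as in TalkingData.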
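The per-column cardinalities reported in Table 8 are simple distinct-value counts over the training set. A standard-library sketch of how such counts are computed, on a hypothetical four-row sample in the TalkingData schema (the values are invented for illustration):

```python
import csv
import io

# Hypothetical mini-sample with the TalkingData columns from Table 8;
# the rows are made up for illustration only.
sample = io.StringIO(
    "ip,device,os,channel,app,hour,dayofweek\n"
    "1,0,13,497,3,4,1\n"
    "1,0,13,280,3,4,2\n"
    "2,0,19,280,12,5,1\n"
    "3,1,13,497,3,23,3\n"
)
rows = list(csv.DictReader(sample))

# Distinct-value count per column, analogous to Table 8.
cardinalities = {col: len({row[col] for row in rows}) for col in rows[0]}
```

On the real dataset the same computation yields the counts in Table 8 (e.g. 123099 distinct IPs), which is what makes high-cardinality-aware encoders such as CatBoost's categorical handling attractive here.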
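The meta-learning direction mentioned in the conclusions (a meta-classifier combining several base learners) can be sketched with scikit-learn's `StackingClassifier`. This is an illustrative stand-in on synthetic data, not the production ensemble; the base models and hyperparameters are arbitrary choices:

```python
# Sketch of a stacked ensemble: two tree-based base learners whose
# out-of-fold predictions are combined by a logistic-regression meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data standing in for listing features (~90% legitimate).
X, y = make_classification(n_samples=300, n_features=10, weights=[0.9],
                           random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=3,  # out-of-fold predictions feed the meta-learner
)
stack.fit(X, y)
scores = stack.predict_proba(X)[:, 1]  # per-row fraud-likelihood scores
```

The meta-learner sees only the base models' cross-validated predictions, which is the same principle behind the H2O AutoML stacked ensembles evaluated in the paper.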