<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Machine Learning Methods for Detecting Fraud in Online Marketplaces</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Raoul</forename><surname>Dekou</surname></persName>
							<email>rdekou@team.mobile.de</email>
							<affiliation key="aff0">
								<orgName type="department">Mobile.de</orgName>
								<address>
									<addrLine>Marktplatz 1, Europarc Dreilinden</addrLine>
									<postCode>14532</postCode>
									<settlement>Berlin</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Savo</forename><surname>Sabljic</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Codecentric AG</orgName>
								<address>
									<addrLine>Hochstraße 11</addrLine>
									<postCode>42697</postCode>
									<settlement>Solingen</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Simon</forename><surname>Kufeld</surname></persName>
							<affiliation key="aff2">
							<orgName type="institution">Inovex GmbH</orgName>
								<address>
									<addrLine>Ludwig-Erhard-Allee 6</addrLine>
									<postCode>76131</postCode>
									<settlement>Karlsruhe</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Diana</forename><surname>Francesca</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Codecentric AG</orgName>
								<address>
									<addrLine>Hochstraße 11</addrLine>
									<postCode>42697</postCode>
									<settlement>Solingen</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ricardo</forename><surname>Kawase</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Mobile.de</orgName>
								<address>
									<addrLine>Marktplatz 1, Europarc Dreilinden</addrLine>
									<postCode>14532</postCode>
									<settlement>Berlin</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Machine Learning Methods for Detecting Fraud in Online Marketplaces</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">4E59389F3E171D690C8C94FBED1F052A</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T02:59+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Connecting buyers and sellers in a safe and secure environment is one of the biggest challenges in online marketplaces. Probabilistic models built upon user-item databases address the challenge, but often encounter issues such as a lack of stability and robustness. These issues are magnified in fraud scenarios, where datasets are highly imbalanced and noisy, and malicious users deliberately adapt their behaviors to avoid detection. In this context, we leveraged the power of the existing open source machine learning libraries H2O and Catboost and designed a pipeline to collect, process and predict the likelihood of a private seller's listing being fraudulent. We found that a stacked ensemble model provides the best performance (F1=0.73) when compared to other commonly used models in the field. Further, our models are benchmarked on a public Kaggle dataset, the TalkingData AdTracking Fraud Detection Challenge, where we compared them to other studies and highlighted their generalizability and effectiveness at handling online fraud.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>As reported in <ref type="bibr" target="#b11">[12]</ref>, retail e-commerce sales worldwide accounted for 1.86 trillion USD in 2016 and are expected to rise to 4.48 trillion USD in 2021. In the meantime, a recent report on fraud attack trends in the first quarter of 2021 1 confirmed the shift of attacks towards retail websites and estimated that 25% of this traffic is malicious. This growth puts considerable pressure on marketplaces, which must ensure the reliability and security of their services while inspiring trust in buyers.</p><p>Unfortunately, the success of online marketplaces attracts unwanted attention from malicious users who try to abuse the platforms for personal monetary gain. mobile.de does not control transactions between buyers and sellers. It is a "matchmaking" platform that bridges the gap between the two sets of entities. Once a user with malicious intent creates an account, he/she also creates an attractive vehicle listing (the goal is to get as many leads as possible). To achieve this, fraudsters take a series of lead-boosting steps. They upload listings of high-demand vehicles to the platform and set very low yet plausible prices for the vehicles. Since every aspect of the listing looks legitimate (the website, the seller and the vehicle), buyers lower their guard and contact the fraudster. Through a series of interactions, the fraudster is able to convince the buyer (now a victim) to send a pre-payment money transfer, usually as a "reservation" fee. Once this happens, and the damage is done, the victims realize their mistake, contact mobile.de's Customer Service and report the case. Very few cases reach this point; nevertheless, the total monthly loss can soar to thousands of euros.</p><p>Satisfied customers (buyers and sellers) are the foundation for a valuable and successful marketplace. 
Thus, providing a secure environment and a safe experience to our customers is a top priority at mobile.de, and the motivation of this work, which aims at preventing and detecting fraudulent activity. To achieve our goals, we tackled the fraud detection problem by leveraging user-generated data and building machine learning models that are able to identify fraudulent activities. It is also essential to design robust, high-precision models that generalise well. This paper describes our approach to mitigating fraudulent activity by fraudsters posing as private sellers. Our contribution is twofold. First, we describe a production pipeline to collect, process and score sellers' listings using the open source machine learning libraries Catboost<ref type="foot" target="#foot_0">2</ref> and H2O<ref type="foot" target="#foot_1">3</ref>. We briefly highlight how to efficiently use these libraries to pre-select relevant candidate models and tune their hyper-parameters. Second, we demonstrate that our approach could potentially inspire other use cases by verifying our detection methods on a sample of a large dataset publicly available at Kaggle.com<ref type="foot" target="#foot_2">4</ref>.</p><p>The remainder of this paper is structured as follows. In Section 2, we discuss existing work in the field. In Section 3, we provide a deeper understanding of the problem and formalize it. In Sections 4 and 5, we describe our methodology to tackle the problem. Section 6 contains our results, followed by the conclusion and future prospects.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Techniques used to detect fraud can be divided into two groups: expertise-based and data-driven. In the first group, experts use their knowledge to build a set of rules that are tested and refined to filter out fraudulent activities. However, contrary to machine learning solutions, traditional expert techniques sometimes lack the ability to model non-trivial online connections <ref type="bibr" target="#b23">[24]</ref>. The second group, data-driven (i.e. machine learning) solutions, overcomes this issue but brings different challenges. While the increase of activity in marketplaces generates massive datasets which require model scalability, the low occurrence of fraudulent events produces imbalanced datasets. Maintaining both high precision and high recall is often a challenge, and many models produce significant misclassification errors <ref type="bibr" target="#b1">[2]</ref> which result in genuine customers being flagged as fraudulent. Finally, there is also the need for dynamic solutions, given that fraudsters adapt their behaviors to a point where they are able to bypass detection by machine learning models.</p><p>The literature offers various examples of machine learning methods applied to fraud detection. Najem and Kadeem's <ref type="bibr" target="#b15">[16]</ref> recent survey on fraud detection techniques in e-commerce provides a broad view of the performance of several models on various datasets. It highlights that Random Forest (RF) is the most used and usually the most accurate of all methods. Though Naive Bayes algorithms are easy to implement, they are limited compared to decision trees when it comes to modelling non-linear problems. This information was taken into consideration when selecting candidate models for our pipeline, which consists essentially of decision tree ensembles (RF, Xgboost and Catboost). For instance, Kanei et al. 
<ref type="bibr" target="#b9">[10]</ref> trained a Random Forest model for detecting fraudulent ad requests. In their study, they demonstrated that the model robustness challenge could be addressed by means of features which cannot be controlled by fraudsters, such as network statistics from clients and publishers. This set-up allowed them to improve their recall rate by 10%. Renjith <ref type="bibr" target="#b19">[20]</ref> described a pipeline using a Support Vector Machine (SVM) to detect fraudulent sellers in an online marketplace, and specifically pointed out that a cold start problem may arise for new users when predictive models use seller or transaction information as features. In our approach, the cold start effect was mitigated by removing these types of features. Gupta et al. <ref type="bibr" target="#b7">[8]</ref> benchmarked ensemble models for predicting the likelihood of a click on a mobile phone advertisement being fraudulent on a publicly available Kaggle dataset. They tested two configurations: traditional and Big Data. In the traditional configuration, they combined different sampling techniques (SMOTE, stratified sampling, etc.) to reduce the data size and handle the imbalanced training set. This dataset, which has been widely used in previous studies <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b21">22]</ref>, is employed in our study, and the results from Gupta et al. <ref type="bibr" target="#b7">[8]</ref> are used as our baseline. In our work, we applied the same preprocessing techniques and compared our results to their best model, a Two Class Decision Forest<ref type="foot" target="#foot_3">5</ref> with an F1 score of 0.944. Using a sample of the same dataset, Minastireanu and Mesnita <ref type="bibr" target="#b13">[14]</ref> trained a Lightgbm model to detect fraudulent clicks and reported an accuracy of 98%. 
The authors specifically described how feature engineering on the original feature set (click time, device, channel, etc.) and K-fold cross-validation were combined to enable high performance. Moreover, by testing their model on a large data sample (18 million user clicks), they proved the robustness of the boosting machine for this case study. In the same context, Mohammed et al. <ref type="bibr" target="#b14">[15]</ref> investigated the scalability of Random Forest, Balanced Bagging Ensemble and Gaussian Naive Bayes on massive and highly imbalanced credit card fraud datasets. They found that random undersampling is effective at handling imbalanced datasets and, combined with RF, is suitable for real-time applications on large datasets. In their study, the Random Forest model provided the highest recall of 91%. Rajora et al. <ref type="bibr" target="#b18">[19]</ref> benchmarked the performance of various machine learning algorithms on a credit card transaction dataset with 31 attributes. They used the random undersampling technique to address the data imbalance and Principal Component Analysis (PCA) <ref type="bibr" target="#b0">[1]</ref> as a dimensionality reduction technique. On top of the PCA features, a time feature corresponding to the time delay from the first transaction was part of the training set, and the authors illustrated how the inclusion of this feature can impact performance: RF performed better without the time feature, while the performance of the Gradient Boosting Regression Tree remained constant. Meng et al. <ref type="bibr" target="#b12">[13]</ref> also used a real-world credit card transaction dataset and combined Xgboost with sampling techniques to achieve strong performance. The SMOTE technique increased the recall from 0.8062 to 0.9 and the AUC from 0.9795 to 0.9853. Mohammed et al. 
<ref type="bibr" target="#b14">[15]</ref> reported that Neural Networks tend to overfit on fraud datasets and struggle to handle imbalanced datasets. Nevertheless, as illustrated by Adewumi and Akinyelu <ref type="bibr" target="#b1">[2]</ref> in their survey, such techniques are also commonly used for credit card fraud detection. Najem and Kadeem <ref type="bibr" target="#b15">[16]</ref> pointed out that hybrid methods, which combine several methods to build a robust learner, provide better performance than individual learners. For example, Wang et al. <ref type="bibr" target="#b22">[23]</ref> built a hybrid model consisting of Xgboost and Logistic Regression (LR) and benchmarked it against common baseline models such as Xgboost, RF, SVM, Naive Bayes and Logistic Regression on the German Credit dataset published by UCI<ref type="foot" target="#foot_4">6</ref>. In the hybrid model, an effective feature combination was obtained by using Xgboost leaf nodes as features for the LR model. This setup provided an AUC of 0.8321, well above the value of 0.7321 obtained with LR, the best individual model. Other studies such as <ref type="bibr" target="#b17">[18]</ref> and <ref type="bibr" target="#b20">[21]</ref> use meta-learning techniques to enhance performance on credit card fraud datasets. However, combining the output of different classifiers to build a model reduces the classification speed <ref type="bibr" target="#b1">[2]</ref>, which might be an issue on big datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Problem statement</head><p>mobile.de supports two different types of sellers, namely dealers and private sellers. Dealers are registered dealerships in Germany and neighbouring countries who are paying customers of mobile.de. These are professional sellers who make a living out of buying and selling vehicles. Private sellers are regular citizens who own a vehicle and use a classified market to sell it (they are not registered as a business). 
Internally, at mobile.de a private seller is labelled as an FSBO (For Sale By Owner), and for the rest of this paper we will refer to private sellers using this terminology. Although several malicious activities can be classified as fraud, such as account takeover, falsification of documents, etc., our objective in this study is focused on a single type of user: FSBOs who create fraudulent (fake) listings. Our pipeline overview is depicted in Figure <ref type="figure" target="#fig_0">1</ref>. When a listing is created (or updated), our machine learning models generate a fraud probability prediction and, in case the result is above a certain threshold, the listing is manually evaluated by a Customer Service (CS) agent, who reviews the content of the listing and assigns a rating (ground truth). In addition to listings flagged by our ML models, Customer Service agents extend their reviewing process to listings which have received users' complaints. Eventually, one way or another, every fraudulent listing is flagged in our dataset, in the vast majority of cases before damage is done; only in very few cases do reports come from scam victims. The main classification task is binary in the sense that the target variable to predict has two possible outcomes: OK or FRAUD. The goal is to detect when a vehicle listing is (or becomes) fraudulent. This can happen at insertion time (version 1 of the listing) or at any later time due to a modification of the data.</p></div>
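The thresholded routing step described above can be sketched in a few lines of Python; the function name, labels and default threshold are illustrative only, not mobile.de's production logic.

```python
def route_listing(fraud_probability: float, threshold: float = 0.5) -> str:
    """Route a scored listing: predictions at or above the threshold go to a
    Customer Service agent for manual review (which produces the ground-truth
    label), while the rest are published normally."""
    return "manual_review" if fraud_probability >= threshold else "publish"
```

In production, the threshold would be tuned to balance agent workload against detection recall, as discussed in the performance-metrics section.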
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Datasets</head><p>In this study, we used two different datasets to train and test our machine learning models: mobile.de's in-house dataset and a tailored sample of the TalkingData AdTracking Fraud Detection Challenge dataset obtained from the machine learning competition platform Kaggle.</p><p>At mobile.de, FRAUD cases (positive cases) are far less frequent than OK cases, leading to a highly imbalanced dataset. The in-house dataset consists of 27 categorical variables and 10 continuous ones. To maintain the confidentiality of our data points, and to eliminate the risk of giving any clues that could lead to learnings on how to bypass our fraud detection models, we refrain from disclosing the exact names of the attributes and features.</p><p>The public dataset comes from TalkingData, China's largest independent big data service platform, which covers 70% of active mobile devices in the country and handles 3 billion clicks per day, of which 90% are potentially fraudulent. Contrary to the mobile.de case, here click fraud is the most frequent class (negative class); it occurs when a person or an automated bot acting as a legitimate user clicks on an app ad without downloading the app afterwards. The raw dataset contains 200 million clicks over a 4-day period. It includes 7 data fields (IP, app, device, OS, channel, click time, attributed time) and a binary target to predict (is attributed). The target variable is imbalanced, with 99.8% negative cases.</p><p>Tables <ref type="table" target="#tab_2">1 and 2</ref> summarize the preprocessing steps applied to the mobile.de and TalkingData datasets respectively. For our in-house dataset, the testing set corresponds to samples recorded in the 7 days prior to the day the model was trained. The training set corresponds to the 28 days of data prior to the start date of the testing set. This time-based split was done to prevent the model from learning from future observations. 
In order to reduce the imbalance and increase performance, we applied random undersampling and kept 10% of the majority class in the training set. This resulted in around 200,000 training samples and 240,000 testing ones. We kept raw missing entries within the sets; H2O and Catboost models handled them as separate categories 7,8 .</p><p>For the Kaggle dataset, we borrowed the preprocessing steps from <ref type="bibr" target="#b7">[8]</ref> and engineered two additional features: click hour of the day and day of the week. First, we reduced the data size by randomly sampling 15% of unique IP addresses and retaining a stratified sample of 8% of the remaining set. To handle the imbalance, we applied the Synthetic Minority Over Sampling Technique (SMOTE) [5] with 5 neighbours and oversampled the positive class up to 11%. We then applied a stratified split, keeping 70% of the set for training. The final set has 1,706,481 training samples and 731,349 testing ones, without any missing values.</p></div>
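The in-house preprocessing (non-overlapping time-based split followed by random undersampling of the majority class) can be sketched with pandas. The column names and parameter defaults here are illustrative; the real mobile.de schema is confidential.

```python
import pandas as pd

def time_split_and_undersample(df, date_col="created_at", label_col="is_fraud",
                               test_days=7, train_days=28, majority_keep=0.10,
                               seed=42):
    """Non-overlapping time-based split (test = latest week, train = the
    28 days before it), then random undersampling keeping only a fraction
    of the majority (OK, label 0) class in the training set."""
    end = df[date_col].max()
    test_start = end - pd.Timedelta(days=test_days)
    train_start = test_start - pd.Timedelta(days=train_days)
    test = df[df[date_col] > test_start]
    train = df[(df[date_col] > train_start) & (df[date_col] <= test_start)]
    # Keep a random fraction of the majority class, all of the minority class.
    majority = train[train[label_col] == 0].sample(frac=majority_keep,
                                                   random_state=seed)
    minority = train[train[label_col] == 1]
    train = pd.concat([majority, minority]).sample(frac=1, random_state=seed)
    return train, test
```

Splitting on time before undersampling ensures the model never sees observations recorded after its test window, mirroring the leakage-prevention argument in the text.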
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Training Machine Learning Models</head><p>In this section, we briefly summarize the theoretical concepts behind the models used in our study, provide an overview of the machine learning libraries in which the models are implemented, and finally describe the hyper-parameter tuning steps and our performance metrics.</p><p>As stated in <ref type="bibr" target="#b3">[4]</ref>, Random Forest is an ensemble machine learning algorithm consisting of a collection of decision trees, each built from random samples. In each tree, thresholds are applied to the input features to maximize information gain while minimizing an impurity function (e.g. cross entropy, mean squared error). The final score is given by the average of the scores of all trees. In addition, RF provides maximum depth and minimum sample split parameters to prevent the decision trees from overfitting on the training set.</p><p>Xgboost <ref type="bibr" target="#b5">[6]</ref> is another ensemble method, belonging to the large family of boosting algorithms. In general, boosting models combine shallow decision trees (also called weak learners), each built sequentially on the errors of the previous trees, to reduce bias and variance at the same time. Xgboost in particular is an advanced implementation of gradient boosting which includes additional features such as parallel processing and regularization techniques for handling overfitting.</p><p>Introduced in <ref type="bibr" target="#b16">[17]</ref>, Catboost is a boosting model designed to handle and process categorical data efficiently. By default, the Catboost implementation uses one-hot encoding on categorical variables, except for those with high cardinality. In such a case, ordered target statistics <ref type="bibr" target="#b16">[17]</ref> are used to maximize information gain. 
Contrary to other machine learning techniques, which require preprocessing steps to convert categorical data into numbers, Catboost requires only the indices of the categorical features <ref type="bibr" target="#b6">[7]</ref>.</p><p>Meta-learning techniques aim at combining the output of several base learners to improve prediction accuracy, using the strengths of one learner to complement the weaknesses of others <ref type="bibr" target="#b17">[18]</ref>. In this study, we used H2O AutoML <ref type="bibr" target="#b10">[11]</ref> to build a stacked ensemble. AutoML provides a simple wrapper function optimized for training and combining a large number of models in a short amount of time. This module evaluates single machine learning models (GBM<ref type="foot" target="#foot_5">9</ref>, Xgboost, RF, Extremely Randomized Trees<ref type="foot" target="#foot_6">10</ref>, Artificial Neural Networks<ref type="foot" target="#foot_7">11</ref> and Generalised Linear Models<ref type="foot" target="#foot_8">12</ref>) and their stacked ensembles on validation sets using relevant metrics (e.g. AUC, logloss). The best performing model is then retained for deployment.</p><p>H2O is an open source distributed software library for machine learning and deep learning applications. Its frame and cluster abstractions make it easy to process tabular data of various types in a distributed fashion. The H2O platform supports various interfaces, including R, Python and Java, making it easier to complete analytic workflows <ref type="bibr" target="#b2">[3]</ref>. In our case, we used the H2O Python interface to train and optimize Distributed Random Forest (DRF), Xgboost and AutoML models. The trained models are saved in MOJO (Model Object Optimized) format and later embedded in a Java environment for real-time predictions.</p><p>The Catboost library is another high-performance open source framework for gradient boosting on decision trees. 
Similar to H2O, Catboost library supports </p></div>
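The stacked-ensemble idea behind H2O AutoML (base learners whose cross-validated predictions feed a meta-learner) can be illustrated with scikit-learn's `StackingClassifier`. This is a minimal stand-in on synthetic data, not the paper's actual H2O stack or feature set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy imbalanced stand-in for a fraud dataset; real features are confidential.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Out-of-fold predictions of the base learners become inputs to a
# logistic-regression meta-learner, mirroring the stacked-ensemble idea.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("gbm", GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)
stack.fit(X_tr, y_tr)
print(round(stack.score(X_te, y_te), 3))
```

H2O automates the equivalent search-and-stack process and additionally exports the winner as a MOJO artifact for serving from Java.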
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Hyperparameters tuning</head><p>The parameter optimization described in this section is limited to our in-house dataset. In fact, because of TalkingData's large sample size (1,706,481 entries), carrying out extensive hyperparameter tuning is daunting. Therefore, for this dataset, we applied a full parameter optimization only for the Catboost model and kept similar parameters for its H2O counterparts. For H2O, 3-, 5- and 10-fold Cross Validation (CV) provided the best performance for RF, AutoML and Xgboost respectively. These models' hyperparameters are depicted in Table <ref type="table" target="#tab_3">3</ref>. However, on the public dataset, we set the maximum number of models to 10 and the number of folds to 3 to circumvent memory limitations for AutoML.</p><p>For Catboost, the Python library Hyperopt<ref type="foot" target="#foot_9">13</ref> allowed hyperparameter optimization. Hyperopt provides custom functions for hyperparameter search. Each parameter value is retrieved from a list of candidates taken from a specific "quantized" continuous distribution (Table <ref type="table" target="#tab_4">4</ref>). Besides, models are trained for 500 iterations, using 3-fold CV, the logarithmic loss function and the Area Under the Receiver Operating Characteristic Curve (AUC) evaluation metric.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Performance metrics</head><p>In an imbalanced classification task, the positive class denotes the less frequent value of the target and the negative class is its complement. When scoring a model, an optimal solution can be derived from the confusion matrix <ref type="bibr" target="#b8">[9]</ref>. True Positives (TP) and True Negatives (TN) occur when the output of the model matches the ground truth label on the positive and negative classes respectively. Conversely, False Positives (FP) and False Negatives (FN) occur when the model provides predictions which mismatch the true labels. To convert model probabilities into classes, we chose a threshold that maximizes the F1 score on the testing set. The F1 score is the harmonic mean of precision and recall and evaluates the accuracy of the model at predicting the positive class. Another popular evaluation metric is the Area Under the Receiver Operating Characteristic Curve. Contrary to the previous metrics, it is used to assess the ability of a classifier to distinguish between classes independently of any selected threshold.</p></div>
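The threshold selection described above, choosing the probability cut-off that maximizes F1, can be sketched with scikit-learn's precision-recall curve; the helper name is ours, not from the paper.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_prob):
    """Return the probability threshold maximizing F1, and that F1 value."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    # precision/recall have one more entry than thresholds; drop the last
    # point (recall 0) so the arrays align with the thresholds.
    denom = np.clip(precision[:-1] + recall[:-1], 1e-12, None)
    f1 = 2 * precision[:-1] * recall[:-1] / denom
    i = int(np.argmax(f1))
    return thresholds[i], f1[i]
```

Unlike AUC, which integrates over all thresholds, this procedure commits to a single operating point tuned for the positive (FRAUD) class.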
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Results</head><p>In order to retain candidate models for our evaluation, we first benchmarked a large pool of machine learning models. For this purpose, H2O AutoML objects provide a leaderboard() method which ranks the models trained to build the stacked ensemble on a chosen dataset and metric. These models are optimised with AutoML's predefined random grid parameter searches, which are different from our production hyper-parameter tuning described in the previous section. Table <ref type="table" target="#tab_5">5</ref> summarizes the AUC obtained on our in-house dataset. Tables <ref type="table" target="#tab_9">6 and 7</ref> illustrate the performance metrics obtained from the different models on the mobile.de and TalkingData datasets respectively. On the first one, AutoML's best model (stacked ensemble) yields an F1 score of 0.73, which is higher than the 0.71 obtained with Xgboost and Catboost and the 0.68 obtained with Random Forest. It has been reported in <ref type="bibr" target="#b10">[11]</ref> that stacked ensemble models usually produce better performance than the individual models (Xgboost, Random Forest, etc.) used in an AutoML run, in accordance with our findings. On the TalkingData dataset, the Catboost model yields the best performance with an F1 score of 0.988. The Catboost model is designed to process heterogeneous data with categorical variables efficiently <ref type="bibr" target="#b16">[17]</ref>. The features' cardinality is highlighted in Table <ref type="table" target="#tab_8">8</ref>. One-hot encoding on one side and ordered target statistics applied to variables of high cardinality have a significant impact on the model performance. Catboost also provides a get_feature_importance() method which gives the contribution of each feature to the ensemble model. 
The output of this method is summarized in Figure <ref type="figure" target="#fig_1">2</ref>: the marketing app ID and the click IP address are the most important features.</p><p>In order to assess the generalizability of our modelling approach at detecting fraud, we compared our models with the work of Gupta et al. <ref type="bibr" target="#b7">[8]</ref>. Their best model, a Two Class Decision Forest classifier, provides a precision of 0.992 and a recall of 0.902, corresponding to an F1 score of 0.9442. All the models used in our experiment outperform their results in terms of </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusions</head><p>We presented a case study describing the application of ensemble methods to detect fraud in a large-scale online marketplace (mobile.de). The business value of such an investigation is twofold: first, to enable a trustworthy customer experience and enhance customer satisfaction; second, to reduce the Customer Service operational cost of resolving fraudulent cases.</p><p>To achieve our goals, we designed a machine learning pipeline based on sellers' listing data and optimized it to address common challenges in fighting fraud (fraudsters' adaptability, dataset imbalance, high false positive rates, etc.). The main contribution of this study is a pipeline using open source data science libraries to collect, process and score sellers' listings to efficiently detect fraud. Our best model, the AutoML stacked ensemble, provided an F1 score of 0.73, outperforming Catboost, Xgboost and Random Forest. These models were later tested on the public TalkingData dataset from the Kaggle competition platform, showed great robustness at detecting fraud, and outperformed previously proposed models. The best model on this set, Catboost, provides an F1 score of 0.9888, significantly higher than the value of 0.9442 reported in <ref type="bibr" target="#b7">[8]</ref>.</p><p>With regard to the prospects of the study, we will first explore dimensionality reduction techniques <ref type="bibr" target="#b18">[19]</ref> and encoding methods in order to improve the performance of the classifiers. Second, we will leverage the power of Big Data tools (e.g. Spark) to train and optimize the models on larger samples of data. In addition, we aim to investigate different meta-learning techniques combining Catboost and H2O models to build robust classifiers and further prevent fraud on our website.</p><p>Furthermore, in our future work we will tackle the problem of detecting fraud "as soon as possible". 
It is crucial that fraudulent listings are detected before they reach the audience. To this end, we plan to include further features such as buyers' and sellers' user activity. Finally, we would like to highlight that the work presented in this paper is currently in production, protecting buyers and sellers at mobile.de; for that reason, we refrain from disclosing further technical details that could help malicious users bypass our detection system.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: mobile.de in-house data collection and pipeline overview.</figDesc><graphic coords="3,320.41,569.56,226.77,96.79" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Catboost model feature importance (Talk-ingData dataset).</figDesc><graphic coords="7,320.41,54.07,226.77,101.97" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 :</head><label>1</label><figDesc>In-house dataset preprocessing steps.</figDesc><table><row><cell>non-overlapping time-based split</cell><cell>-test (latest week)</cell></row><row><cell></cell><cell>-train (28 days)</cell></row><row><cell>undersampling</cell><cell>random undersampling of the training set, 10% negative cases kept</cell></row><row><cell>missing values</cell><cell>kept and processed by machine learning models</cell></row><row><cell>feature engineering</cell><cell>yes (confidential)</cell></row></table></figure>
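The split-and-undersample recipe in Table 1 can be sketched in a few lines of Python. This is an illustrative sketch only: the field names `ts` and `fraud` and the reference date are hypothetical, since the real schema and features are confidential.

```python
import random
from datetime import timedelta

def time_based_split(rows, now):
    # Non-overlapping time-based split: the latest week is the test set,
    # the 28 days before it are the training set.
    test_start = now - timedelta(days=7)
    train_start = test_start - timedelta(days=28)
    test = [r for r in rows if r["ts"] >= test_start]
    train = [r for r in rows if r["ts"] >= train_start and test_start > r["ts"]]
    return train, test

def undersample(train, keep_negative=0.10, seed=42):
    # Random undersampling: keep all positive (fraud) cases and a
    # 10% random sample of the negatives.
    rng = random.Random(seed)
    pos = [r for r in train if r["fraud"]]
    neg = [r for r in train if not r["fraud"]]
    kept = rng.sample(neg, int(len(neg) * keep_negative))
    return pos + kept
```

Splitting on time rather than at random avoids leaking future listings into the training set, which matters when fraud patterns drift.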
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 :</head><label>2</label><figDesc>TalkingData dataset preprocessing steps.</figDesc><table><row><cell>subsampling</cell><cell>15% random sample of unique IPs, then 8% stratified sample from the remaining set</cell></row><row><cell>oversampling</cell><cell>SMOTE with k=5 neighbours, positive class up to 11%</cell></row><row><cell>missing values</cell><cell>absent</cell></row><row><cell>stratified split</cell><cell>-test (30%)</cell></row><row><cell></cell><cell>-train (70%)</cell></row><row><cell>feature engineering</cell><cell>-click hour and day of the week</cell></row><row><cell></cell><cell>-attributed time is removed</cell></row><row><cell cols="2">sample of 8% of the remaining set. To handle the imbalance, we applied Synthetic Minority Over Sampling Technique (SMOTE) [5] with 5 neighbours and oversampled the positive class up to 11%. We then applied a stratified split, keeping 70% of the set for training. The final set has 1,706,481 training samples and 731,349 testing ones without any missing values.</cell></row></table></figure>
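SMOTE, used above with k=5 neighbours, synthesizes minority points by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified pure-Python illustration, not the established library implementation; the helper `n_to_reach` merely solves for how many synthetic positives raise the minority share to a target fraction such as 11%.

```python
import math
import random

def smote(minority, n_synthetic, k=5, seed=0):
    # Minimal SMOTE sketch: for each synthetic point, pick a minority
    # sample, find its k nearest minority neighbours, and interpolate
    # between the sample and one randomly chosen neighbour.
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        synthetic.append(tuple(xi + rng.random() * (ni - xi)
                               for xi, ni in zip(x, nb)))
    return synthetic

def n_to_reach(n_pos, n_neg, target=0.11):
    # Synthetic positives needed so that positives make up `target` of
    # the final set: (n_pos + n) / (n_pos + n + n_neg) = target.
    return max(0, round(target * n_neg / (1 - target) - n_pos))
```

Because every synthetic point lies on a segment between two real minority points, SMOTE densifies the minority region instead of duplicating samples verbatim.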
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3 :</head><label>3</label><figDesc>H2O models hyperparameters (in-house dataset).</figDesc><table><row><cell>parameter</cell><cell>RF</cell><cell>Xgb</cell><cell>AutoML</cell></row><row><cell>maximum number of models</cell><cell>-</cell><cell>-</cell><cell>20</cell></row><row><cell>number of trees</cell><cell>100</cell><cell>1000</cell><cell>-</cell></row><row><cell>maximum depth</cell><cell>50</cell><cell>35</cell><cell>-</cell></row><row><cell>number of columns for a DT split</cell><cell>9</cell><cell>-</cell><cell>-</cell></row><row><cell>columns sample rate</cell><cell>-</cell><cell>0.8</cell><cell>-</cell></row><row><cell>sample rate</cell><cell>-</cell><cell>0.8</cell><cell>-</cell></row><row><cell>learning rate</cell><cell>-</cell><cell>0.009</cell><cell>-</cell></row><row><cell>early stopping metric</cell><cell>logloss</cell><cell>logloss</cell><cell>logloss</cell></row><row><cell>early stopping rounds</cell><cell>-</cell><cell>25</cell><cell>3</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 4 :</head><label>4</label><figDesc>Catboost hyperparameters and Hyperopt "quantized" continuous distributions minimum and maximum values used for optimisation.</figDesc><table><row><cell>Parameter</cell><cell>Hyperopt function</cell><cell>min</cell><cell>max</cell></row><row><cell>l2 leaf reg</cell><cell>qloguniform</cell><cell>0</cell><cell>2</cell></row><row><cell>learning rate</cell><cell>qloguniform</cell><cell>0.001</cell><cell>0.5</cell></row><row><cell>subsample</cell><cell>quniform</cell><cell>0.5</cell><cell>1</cell></row><row><cell>colsample bylevel</cell><cell>quniform</cell><cell>0.5</cell><cell>1</cell></row><row><cell cols="4">Python, R and JAVA interfaces. For this study, we combined Catboost's Python and JAVA interfaces for model training and deployment.</cell></row></table></figure>
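The "quantized" Hyperopt distributions of Table 4 can be mimicked with the standard library. In this sketch the min/max bounds are taken on the natural scale and the quantization steps q are chosen for illustration; Hyperopt itself parameterizes these distributions on the log scale, and l2 leaf reg is omitted because a log-uniform draw needs a strictly positive lower bound.

```python
import math
import random

rng = random.Random(7)

def quniform(lo, hi, q):
    # Uniform draw on [lo, hi], rounded to the nearest multiple of q.
    return round(rng.uniform(lo, hi) / q) * q

def qloguniform(lo, hi, q):
    # Log-uniform draw on [lo, hi] (small values are as likely as large
    # ones on the log scale), rounded to the nearest multiple of q.
    return round(math.exp(rng.uniform(math.log(lo), math.log(hi))) / q) * q

# Hypothetical Catboost search space mirroring Table 4.
space = {
    "learning_rate": qloguniform(0.001, 0.5, 0.001),
    "subsample": quniform(0.5, 1.0, 0.05),
    "colsample_bylevel": quniform(0.5, 1.0, 0.05),
}
```

The log scale matters for a learning rate spanning 0.001 to 0.5: a plain uniform draw would almost never propose values near 0.001.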
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 5 :</head><label>5</label><figDesc>Area Under the Receiver Operating Characteristic Curve of the best single learner of each model family derived from H2O AutoML leaderboard() method (in-house dataset).</figDesc><table><row><cell>Metric</cell><cell>AUC</cell></row><row><cell>Stacked Ensemble (all models)</cell><cell>0.9850</cell></row><row><cell>Stacked Ensemble (best of each family)</cell><cell>0.9848</cell></row><row><cell>Gradient Boosting Machine</cell><cell>0.9826</cell></row><row><cell>Extreme Gradient Boosting</cell><cell>0.9821</cell></row><row><cell>Random Forest</cell><cell>0.9790</cell></row><row><cell>Extremely Randomized Trees</cell><cell>0.9719</cell></row><row><cell>Generalized Linear Model</cell><cell>0.9690</cell></row><row><cell>Artificial Neural Network</cell><cell>0.9200</cell></row><row><cell cols="2">tion such as qloguniform and quniform (see Table</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 6 :</head><label>6</label><figDesc>Machine learning models performance summary (in-house dataset).</figDesc><table><row><cell>Model</cell><cell>F1</cell><cell>Precision</cell><cell>Recall</cell><cell>AUC</cell></row><row><cell>AutoML</cell><cell>0.7293</cell><cell>0.7206</cell><cell>0.7833</cell><cell>0.9850</cell></row><row><cell>Xgb</cell><cell>0.7134</cell><cell>0.7104</cell><cell>0.7165</cell><cell>0.9794</cell></row><row><cell>Catboost</cell><cell>0.7127</cell><cell>0.7375</cell><cell>0.6895</cell><cell>0.9809</cell></row><row><cell>RF</cell><cell>0.6810</cell><cell>0.7274</cell><cell>0.6401</cell><cell>0.9786</cell></row><row><cell cols="5">house test dataset but limited to the best algorithms of each family (GBM, Xgboost, RF, Extremely Randomized Trees, Artificial Neural Networks and Generalised Linear Models). Tree based models outperform Artificial Neural Networks and Generalised Linear Models; they are well suited to complex non-linear problems [16]. In particular, GBM and Xgboost yield the best AUC of 0.982, followed by Random Forest with an AUC of 0.9790. Besides, Najem and Kadeem's [16] survey on fraud detection techniques in e-commerce demonstrated that RF has the highest frequency of usage and is the best performing model across various use cases. Based on these observations, we initially retained AutoML, Xgboost</cell></row><row><cell cols="5">and RF for our benchmark. 
Catboost model, which</cell></row><row><cell cols="5">is not part of H2O, was benchmarked separately and</cell></row><row><cell cols="3">added later for the comparison.</cell><cell></cell><cell></cell></row><row><cell>Tables</cell><cell></cell><cell></cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 7 :</head><label>7</label><figDesc>Machine learning models performance summary (TalkingData dataset).</figDesc><table><row><cell>Model</cell><cell>F1</cell><cell cols="2">Precision Recall</cell><cell>AUC</cell></row><row><cell cols="2">Catboost 0.9888</cell><cell>0.9902</cell><cell>0.9873</cell><cell>0.9994</cell></row><row><cell cols="2">AutoML 0.9800</cell><cell>0.9848</cell><cell>0.9752</cell><cell>0.9987</cell></row><row><cell>Xgb</cell><cell>0.9787</cell><cell>0.9804</cell><cell>0.9771</cell><cell>0.9982</cell></row><row><cell>RF</cell><cell>0.9780</cell><cell>0.9801</cell><cell>0.9758</cell><cell>0.9985</cell></row></table></figure>
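The model comparisons in Tables 6 and 7 rest on precision, recall and F1, all of which follow directly from confusion-matrix counts; a minimal reference helper:

```python
def classification_metrics(tp, fp, fn):
    # Precision: share of flagged listings that are truly fraudulent.
    # Recall: share of fraudulent listings that were flagged.
    # F1: harmonic mean of precision and recall.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because F1 ignores true negatives, it is far more informative than accuracy on a dataset where fraud is a small minority.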
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Table 8 :</head><label>8</label><figDesc>Count of distinct values per column in the TalkingData training set.</figDesc><table><row><cell>feature</cell><cell>count of unique values</cell></row><row><cell>IP</cell><cell>123099</cell></row><row><cell>device</cell><cell>1450</cell></row><row><cell>OS</cell><cell>558</cell></row><row><cell>channel</cell><cell>496</cell></row><row><cell>app</cell><cell>383</cell></row><row><cell>hour</cell><cell>24</cell></row><row><cell>dayofweek</cell><cell>4</cell></row><row><cell>F1 (see</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_9"><head>Table 7 )</head><label>7</label><figDesc>. In particular, our best model, Catboost, demonstrates comparable precision and better recall. Relying on the F1 score alone to compare our models would be problematic since, in TalkingData's context, the positive class corresponds to the non-fraudulent clicks. In the TalkingData AdTracking Fraud Detection Challenge, Kaggle competitors' machine learning models were evaluated based on AUC. Using this metric, our Catboost model yields an AUC of 0.9994 compared to 0.997 from Gupta et al. <ref type="bibr" target="#b7">[8]</ref>.</figDesc><table /></figure>
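Since the Kaggle challenge ranks models by AUC, it helps to recall its probabilistic reading: the chance that a randomly chosen positive is scored above a randomly chosen negative, with ties counting one half. A naive quadratic-time sketch, for illustration only, not for a 731,349-row test set:

```python
def auc(pos_scores, neg_scores):
    # Probability that a random positive outranks a random negative
    # (equivalent to the area under the ROC curve).
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

Unlike F1, this ranking view is threshold-free and unaffected by which class is labelled positive, which is why it is the safer metric for the comparison above.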
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">https://www.catboost.ai/ (accessed on July 2021).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">https://www.h2o.ai/ (accessed on 16 July 2021).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data (downloaded on 16 July 2021).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-module-reference/two-class-decision-forest (accessed on July 2021).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4">https://archive.ics.uci.edu/ml/index.php (accessed on July 2021).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_5">https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm.html (accessed on 16 July 2021).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_6">https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/drf.html#extremely-randomized-trees (accessed on 16 July 2021).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_7">https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/deep-learning.html (accessed on 16 July 2021).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="12" xml:id="foot_8">https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html (accessed on 16 July 2021).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="13" xml:id="foot_9">https://github.com/hyperopt/hyperopt (accessed on 16 July 2021).</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Acknowledgements</head><p>We would like to thank the Customer Service team at mobile.de for their countless hours of manual work in detecting fraud, and for providing us the ground truth to start our work. We would also like to thank members of TnS and Data teams at mobile.de who have directly and indirectly been involved in this work, with special thanks to Moritz Aschoff and Matthias Radtke.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Principal component analysis</title>
		<author>
			<persName><forename type="first">H</forename><surname>Abdi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">J</forename><surname>Williams</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Wiley interdisciplinary reviews: computational statistics</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="433" to="459" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A survey of machine-learning and nature-inspired based credit card fraud detection techniques</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">O</forename><surname>Adewumi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Akinyelu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of System Assurance Engineering and Management</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="937" to="953" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Machine learning with python and h2o</title>
		<author>
			<persName><forename type="first">S</forename><surname>Aiello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Click</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Roark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Rehak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Stetsenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lanford</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="volume">20</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Random forests</title>
		<author>
			<persName><forename type="first">L</forename><surname>Breiman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<date type="published" when="2001">2001</date>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="page" from="5" to="32" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Smote: synthetic minority over-sampling technique</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">V</forename><surname>Chawla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">W</forename><surname>Bowyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">O</forename><surname>Hall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">P</forename><surname>Kegelmeyer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of artificial intelligence research</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page" from="321" to="357" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Xgboost: A scalable tree boosting system</title>
		<author>
			<persName><forename type="first">T</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Guestrin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining</title>
				<meeting>the 22nd acm sigkdd international conference on knowledge discovery and data mining</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="785" to="794" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Performance analysis of different types of machine learning classifiers for nontechnical loss detection</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">M</forename><surname>Ghori</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">A</forename><surname>Abbasi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Awais</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Imran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ullah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Szathmary</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="16033" to="16048" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Predicting fraud of ad click using traditional and spark ml</title>
		<author>
			<persName><forename type="first">N</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Boldina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Woo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">KSII The 14th Asia Pacific International Conference on Information Science and Technology (APIC-IST)</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="24" to="28" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A review on evaluation metrics for data classification evaluations</title>
		<author>
			<persName><forename type="first">M</forename><surname>Hossin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sulaiman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Data Mining &amp; Knowledge Management Process</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page">1</biblScope>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Detecting and understanding online advertising fraud in the wild</title>
		<author>
			<persName><forename type="first">F</forename><surname>Kanei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chiba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Hato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Yoshioka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Matsumoto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Akiyama</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEICE Transactions on Information and Systems</title>
		<imprint>
			<biblScope unit="volume">103</biblScope>
			<biblScope unit="issue">7</biblScope>
			<biblScope unit="page" from="1512" to="1523" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">H2O AutoML: Scalable automatic machine learning</title>
		<author>
			<persName><forename type="first">E</forename><surname>LeDell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Poirier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">7th ICML Workshop on Automated Machine Learning (AutoML)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Trust and distrust in e-commerce</title>
		<author>
			<persName><forename type="first">S.-J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ahn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">M</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ahn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Sustainability</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page">1015</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">A case study in credit fraud detection with smote and xgboost</title>
		<author>
			<persName><forename type="first">C</forename><surname>Meng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Physics: Conference Series</title>
		<imprint>
			<biblScope unit="volume">1601</biblScope>
			<biblScope unit="page">52016</biblScope>
			<date type="published" when="2020">2020</date>
			<publisher>IOP Publishing</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Light gbm machine learning algorithm to online click fraud detection</title>
		<author>
			<persName><forename type="first">E.-A</forename><surname>Minastireanu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Mesnita</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Inform. Assur. Cybersecur</title>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Scalable machine learning techniques for highly imbalanced credit card fraud detection: a comparative study</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">A</forename><surname>Mohammed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K.-W</forename><surname>Wong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">F</forename><surname>Shiratuddin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Pacific Rim International Conference on Artificial Intelligence</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="237" to="246" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">A survey on fraud detection techniques in e-commerce</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Najem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Kadeem</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Catboost: unbiased boosting with categorical features</title>
		<author>
			<persName><forename type="first">L</forename><surname>Prokhorenkova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gusev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vorobev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">V</forename><surname>Dorogush</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gulin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in neural information processing systems</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="6638" to="6648" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Improving credit card fraud detection using a meta-classification strategy</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lawryshyn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer Applications</title>
		<imprint>
			<biblScope unit="volume">56</biblScope>
			<biblScope unit="issue">10</biblScope>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">A comparative study of machine learning techniques for credit card fraud detection based on time variance</title>
		<author>
			<persName><forename type="first">S</forename><surname>Rajora</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D.-L</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Jha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bharill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><forename type="middle">P</forename><surname>Patel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Puthal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Prasad</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Symposium Series on Computational Intelligence (SSCI)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1958" to="1963" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">Detection of fraudulent sellers in online marketplaces using support vector machine approach</title>
		<author>
			<persName><forename type="first">S</forename><surname>Renjith</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1805.00464</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Meta classification technique for improving credit card fraud detection</title>
		<author>
			<persName><forename type="first">S</forename><surname>Suganya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kamalra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Scientific and Technical Advancements</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="101" to="105" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">A hybrid and effective learning approach for click fraud detection</title>
		<author>
			<persName><forename type="first">G</forename><surname>Thejas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dheeshjith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Iyengar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Sunitha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Badrinath</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning with Applications</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page">100016</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Credit fraud risk detection based on xgboost-lr hybrid model</title>
		<author>
			<persName><forename type="first">M</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ji</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Int. Conf. Electron. Bus</title>
				<meeting>Int. Conf. Electron. Bus</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="336" to="343" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<title level="m" type="main">A model based on convolutional neural network for online transaction fraud detection</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wang</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2018">2018</date>
			<publisher>Security and Communication Networks</publisher>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
