Botcha: Detecting Malicious Non-Human Traffic in the Wild

Sunny Dhamnani (a), Ritwik Sinha (b), Vishwa Vinay (a), Lilly Kumari (d) and Margarita Savova (e)

(a) Adobe Research, India
(b) Adobe Research, United States
(d) University of Washington, Seattle, United States
(e) Adobe Systems, United States

OHARS'20: Workshop on Online Misinformation- and Harm-Aware Recommender Systems, September 25, 2020, Virtual Event
email: dhamnani.sunny@gmail.com (S. Dhamnani); risinha@adobe.com (R. Sinha); vinay@adobe.com (V. Vinay); lkumari@uw.edu (L. Kumari); mgalabov@adobe.com (M. Savova)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Malicious bots make up about a quarter of all traffic on the web and degrade the performance of personalization and recommendation algorithms that operate on e-commerce websites. Positive-Unlabeled learning (PU learning) provides the ability to train a binary classifier using only positive (P) and unlabeled (U) instances. The unlabeled data comprises both positive and negative classes. It is possible to find labels for strict subsets of non-malicious actors, e.g., by assuming that only humans make purchases during web sessions or clear CAPTCHAs. However, finding signals of malicious behavior is almost impossible due to the ever-evolving and adversarial nature of bots. Such a set-up naturally lends itself to PU learning. Unfortunately, standard PU learning approaches assume that the labeled set of positives is a random sample of all positives, an assumption that is unlikely to hold in practice. In this work, we propose two modifications to PU learning that make it more robust to violations of the selected-completely-at-random assumption, leading to a system that can filter out malicious bots. On one public and one proprietary dataset, we show that the proposed approaches are better at identifying humans in web data than standard PU learning methods.

Keywords
Positive unlabeled learning, biased sampling, non-human agents, unlabeled data, malicious bot, web traffic

1. Introduction

Non-Human Traffic, or traffic generated by robots (bots), is estimated to constitute close to half of all web traffic [1]. Some bots have a legitimate purpose (e.g. web crawlers) while others try to intrude into systems with malicious intent. It is estimated that half of all bot traffic has malicious intent [1]. Good bots identify themselves, but malicious bots have an incentive to spoof their user agents and behave like humans. Malicious bots may be designed to generate fake reviews, scrape prices or content, crack credentials, infiltrate payment systems, defraud advertisers, or spam online forums. Recommendation and personalization systems are particularly vulnerable to bot activity [2].

Figure 1: Problem set-up: circles denote humans and crosses denote bots. Only the blue circles are known to be humans; the true label of all grey instances is unknown. Larger circles are more likely to be blue, i.e., an observation's attributes determine its likelihood of being selected (labeled). Under violation of SCAR (Selected Completely at Random), the classification goal is to identify the dashed dividing line between the two classes.

The major challenge in building machine learning (ML) models to detect bad bots is getting labeled data.
In this context, ML methods that aim to learn from positive and unlabeled data (PU learning) show promise [3]. PU learning learns from data where only a subset of one class is labeled. We explore an application of PU learning to malicious non-human traffic detection on the web. Considering humans as the positive class, we can identify positive instances by assuming that only humans purchase on e-commerce websites, clear CAPTCHAs, or visit from validated IP addresses. Current PU learning frameworks assume that the labeled subset of the positive class is Selected Completely at Random (SCAR), i.e., the labeling mechanism does not depend on the attributes of the instance [3]. That is, the labeled subset of humans is not influenced by the features of the observations. Unfortunately, such an assumption is hard to justify in practice. For example, it is reasonable to expect that not all human visitors to an e-commerce website are equally likely to make a purchase. This requires us to revisit the PU framework to handle problems where the random sampling assumption is violated. Figure 1 describes the problem we are addressing.

In this work, we address the question of classifying a web session as originating from a human surfer or a robot, using PU learning. Our contribution includes two novel models to handle biased sampling within the positive class, one of which is a scalable version of the proposals in [4]. In our experiments, positive-unlabeled scenarios are artificially created in a publicly available intrusion detection dataset. We observe that the proposed approaches perform better than existing PU learning models [3, 5]. On a proprietary e-commerce dataset, our methods work well in distinguishing humans from bots. We call our framework "Botcha". Given the limited need for labeled data, it can be readily applied in the wild. Filtering out all bot traffic allows recommendation and personalization systems to learn from unbiased data reflecting real human activity.

2. Related Work

Malicious non-human activity on the web has been observed in the context of fake reviews, information theft, the spread of misinformation, spam on social networks, and click fraud in advertising [6, 7]. Given the diverse and often adversarial nature of web fraud, it is imperative to find new strategies to detect bots. In such dynamic circumstances, data-driven strategies hold promise.

While there has been some work on building recommendation systems that are robust to adversarial attacks [2, 8], in this work we aim to filter out all bot traffic so as to provide unbiased data for recommendation and personalization systems to learn from. To classify a visitor as a bot or human, the standard machine learning strategy requires representative instances from both classes and a supervised learning model that can differentiate between them. Due to limited labeled data for bot detection, alternative data-efficient strategies have also been investigated. Semi-supervised learning has been applied to the bot detection problem [9]. Unfortunately, while it is reasonable to expect that we have a reliable subset of known humans, bots on the web are adversarial, ever-evolving and hard to sample from. This renders semi-supervised learning limited in scope. PU learning requires only a subset of one of the two classes to be labeled. Hence, PU learning is appealing for the bot detection problem, where we can assume that a subset of humans is labeled.
Early work in [10] and [3] has shown how PU learning can match the effectiveness of standard supervised learning. We believe that the PU learning framework is natural for use in a variety of fraud detection applications on the web. Empirical success in a variety of scenarios has led to a recent focus on this class of PU learning algorithms [11]. Unfortunately, most prior work in this area assumes that the labeled points are randomly sampled from the positive class. This assumption is referred to as Selected Completely at Random [3, 4]. That is to say, the positively labeled instances in the dataset are a random, unbiased sample of the universe of positive instances, and are not a function of the attributes of the data point. To allow the building of PU learning-based models in scenarios where this is an unrealistic assumption, we build on prior work by Bekker et al. (2019) [4]. However, that approach has two primary challenges. First, the model strategy presented in [4] requires the analyst to decide on a set of features with which to compute the propensity score. Second, the proposal requires optimization using an Expectation-Maximization (EM) algorithm. Unfortunately, the EM algorithm is known to be slow to converge [12]. Given that we would like to apply this to a scenario with tens of millions of data points and hundreds of features, this makes direct application to our work challenging.

To test our proposed algorithms, we first conduct a series of simulation experiments on standard supervised learning datasets representing different fraud-like setups. We artificially hide the true labels, which we then hope to recover via the learning algorithm, thereby showing the viability of our methods.

3. Models

We first describe the notation and then briefly review the PU learning work of [3] (Section 3.2). In Sections 3.3 and 3.4, we describe the proposed approaches, which are the main contributions of the paper.

3.1. Notation & Prerequisites

To distinguish humans from bots we need to learn a classifier that generates the probabilities 𝑝(𝑦|𝑥). Here 𝑦 ∈ {0, 1} denotes whether the observation was generated by a human (𝑦 = 1) or a bot, and 𝑥 is the corresponding feature vector. The dataset for PU learning consists of instances (𝑥, 𝑦, 𝑠) from a space 𝒳 × 𝒴 × 𝒮, where 𝒳 and 𝒴 denote the feature and label space respectively. The binary variable 𝑠 indicates whether the instance is labeled. Since only positive instances (humans) are labeled, 𝑝(𝑦 = 1|𝑠 = 1) = 1. Marginalizing 𝑝(𝑠 = 1|𝑥) over 𝑦, we get:

𝑝(𝑠 = 1|𝑥) = 𝑝(𝑠 = 1|𝑦 = 1, 𝑥) × 𝑝(𝑦 = 1|𝑥) + 𝑝(𝑠 = 1|𝑦 = 0, 𝑥) × 𝑝(𝑦 = 0|𝑥)

Now, 𝑝(𝑠 = 1|𝑦 = 0, 𝑥) = 0 since only positive instances are labeled. This leads to

𝑝(𝑦 = 1|𝑥) = 𝑝(𝑠 = 1|𝑥) / 𝑝(𝑠 = 1|𝑦 = 1, 𝑥)    (1)

Equation (1) forms the basis of all the models we describe next.

3.2. Vanilla Model (EAM)

The work by Elkan and Noto is based on the SCAR assumption. The approach assumes that the labeled positive instances were chosen uniformly at random from the universe of positive instances [3]. Formally this means 𝑝(𝑠 = 1|𝑦 = 1, 𝑥) = 𝑝(𝑠 = 1|𝑦 = 1), i.e., the sampling process is independent of 𝑥. We can rewrite equation (1) as

𝑝(𝑦 = 1|𝑥) = 𝑝(𝑠 = 1|𝑥) / 𝑐,  with  𝑐 = (1/𝑛) ∑⟨𝑥, 𝑦=1⟩ 𝑝(𝑠 = 1|𝑥)    (2)

The constant 𝑐 represents the fraction of positive points that are labeled and 𝑛 is the size of the labeled set. Note that the numerator can be obtained by training a classifier that separates the labeled (𝑠 = 1) points from the unlabeled (𝑠 = 0). Similarly, 𝑐 can be estimated using this trained classifier and a validation set: averaging the predicted scores of the known positives in the validation set gives an estimate of 𝑐. We refer to this model as Elkan's Assumption Model (EAM) in our experiments; it forms the baseline for our methods. For a detailed discussion, we refer the reader to [3].
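To make the EAM recipe concrete, the following is a minimal sketch of how the two steps — training the 𝑝(𝑠 = 1|𝑥) classifier and estimating 𝑐 on a validation set — fit together. This is not the implementation used in our experiments; the use of scikit-learn, the random forest base classifier, and all names are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def fit_eam(X, s, random_state=0):
    """Sketch of EAM. X: feature matrix; s: 1 for labeled positives, 0 for unlabeled."""
    X_tr, X_val, s_tr, s_val = train_test_split(
        X, s, test_size=0.2, stratify=s, random_state=random_state)

    # Numerator model: p(s=1 | x), trained to separate labeled from unlabeled points.
    g = RandomForestClassifier(n_estimators=200, random_state=random_state)
    g.fit(X_tr, s_tr)

    # c = p(s=1 | y=1): average score of the known positives in the validation set.
    c = g.predict_proba(X_val[s_val == 1])[:, 1].mean()

    def predict_proba_human(X_new):
        # Equation (2): p(y=1 | x) = p(s=1 | x) / c, clipped to a valid probability.
        return np.clip(g.predict_proba(X_new)[:, 1] / c, 0.0, 1.0)

    return predict_proba_human, g, c
```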
3.3. Modified Assumption Model (MAM)

The SCAR assumption above enables the building of PU learning models for a range of scenarios. However, we believe that this is an unrealistic assumption and argue that explicitly accounting for selection bias within the known positives allows us to build models that are better aligned with the data. We propose the Modified Assumption Model (MAM), geared towards practical cases where labeling is performed via a stratified procedure. Instead of using the SCAR assumption, we make the more lenient assumption that the known positives come from two sub-groups: for one, the sampling depends on 𝑥, while for the other it is independent of 𝑥.

We introduce a new binary variable 𝑏 ∈ {0, 1} that indicates which of the two sub-groups a given labeled instance (𝑠 = 1) comes from. So, 𝑏 = 0 indicates that the value of 𝑠 is independent of 𝑥, whereas 𝑏 = 1 implies that the value of 𝑠 depends on 𝑥. Marginalizing over 𝑏, we get:

𝑝(𝑠 = 1|𝑦 = 1, 𝑥) = 𝑝(𝑠 = 1|𝑦 = 1, 𝑏 = 1, 𝑥) × 𝑝(𝑏 = 1|𝑦 = 1, 𝑥) + 𝑝(𝑠 = 1|𝑦 = 1, 𝑏 = 0, 𝑥) × 𝑝(𝑏 = 0|𝑦 = 1, 𝑥)

Since 𝑠 is independent of 𝑥 when 𝑏 = 0, we have 𝑝(𝑠 = 1|𝑦 = 1, 𝑏 = 0, 𝑥) = 𝑐. Treating the feature-dependent sub-group as labeled by construction, i.e., 𝑝(𝑠 = 1|𝑦 = 1, 𝑏 = 1, 𝑥) = 1, and given that 𝑝(𝑏 = 0|𝑦 = 1, 𝑥) = 1 − 𝑝(𝑏 = 1|𝑦 = 1, 𝑥), we can rewrite the above equation and substitute it into (1) to obtain

𝑝(𝑦 = 1|𝑥) = 𝑝(𝑠 = 1|𝑥) / [𝑐 + 𝑝(𝑏 = 1|𝑦 = 1, 𝑥) × (1 − 𝑐)],  with  𝑐 = (1/𝑛) ∑⟨𝑥, 𝑦=1, 𝑏=0⟩ 𝑝(𝑠 = 1|𝑥)    (3)

Similar to EAM, the numerator can be obtained by training a classifier that separates the labeled (𝑠 = 1) points from the unlabeled (𝑠 = 0). The denominator model can be trained using the 𝑏 = 1 and 𝑏 = 0 sets (note that points in these sets are labeled and positive, i.e., 𝑠 = 1 and 𝑦 = 1). The constant 𝑐 can be estimated by averaging the scores predicted by the numerator model for instances with 𝑏 = 0 in the validation set. If 𝑝(𝑏 = 1|𝑦 = 1, 𝑥) = 0 for all data points, i.e., sampling is independent of 𝑥, we recover EAM from MAM. Our MAM proposal closely relates to the proposals made in [4]; however, the algorithm in [4] does not scale to large datasets.

3.4. Relaxed Assumption Model (RAM)

The most general model, referred to as the Relaxed Assumption Model (RAM), does not make any assumption about 𝑠 being independent of 𝑥. Instead, we attempt to model this process explicitly, i.e., we build a model for 𝑝(𝑠 = 1|𝑦 = 1, 𝑥), the denominator in equation (1). We first acquire a set of unlabeled instances presumed to be positive (𝑦 = 1, 𝑠 = 0) and then utilize standard binary classification methods to distinguish 𝑠 = 0 from 𝑠 = 1 amongst the positive instances. We propose the use of a nearest-neighbor based method that finds points in the dataset that are close to the known positives but are not in the sampled set (𝑠 = 1). Since any point outside the sampled set has 𝑠 = 0, the nearest neighbor of a (𝑦 = 1, 𝑠 = 1) point that is not in this set is implicitly taken to be (𝑦 = 1, 𝑠 = 0). Note that this assumption may not always be true. As with the other models, we aim to find techniques that are robust even when the modeling assumption may be wrong. It is important to note that we do not alter the numerator in equation (1), and hence training the classifier for the numerator remains identical to EAM.
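The sketch below illustrates one way the RAM denominator model could be built with the nearest-neighbor heuristic described above, reusing the labeled-vs-unlabeled numerator model g from the earlier EAM sketch. The neighbor count, classifier choice, and clipping are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors

def fit_ram_denominator(X, s, k=1):
    """Sketch of RAM's denominator model p(s=1 | y=1, x); k and the classifier are illustrative."""
    X_lab, X_unlab = X[s == 1], X[s == 0]

    # Unlabeled points closest to the known positives are implicitly treated as (y=1, s=0).
    nn = NearestNeighbors(n_neighbors=k).fit(X_unlab)
    _, idx = nn.kneighbors(X_lab)
    X_implied = X_unlab[np.unique(idx.ravel())]

    # Classifier separating known positives (s=1) from implied positives (s=0).
    X_pos = np.vstack([X_lab, X_implied])
    s_pos = np.concatenate([np.ones(len(X_lab)), np.zeros(len(X_implied))])
    return RandomForestClassifier(n_estimators=200, random_state=0).fit(X_pos, s_pos)

def predict_ram(g, d, X_new, eps=1e-6):
    # Equation (1): p(y=1 | x) = p(s=1 | x) / p(s=1 | y=1, x).
    num = g.predict_proba(X_new)[:, 1]                    # numerator model, identical to EAM
    den = np.maximum(d.predict_proba(X_new)[:, 1], eps)   # denominator model from above
    return np.clip(num / den, 0.0, 1.0)
```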
4. Experiments and Results

4.1. Simulated Experiments on a Public Dataset

In our first set of experiments, we artificially create PU learning datasets by hiding the ground truth labels of a labeled dataset during training. We then evaluate the trained model on a labeled test set. The simulations primarily involve controlling the subset of positive data points that are labeled for training; all other instances are unlabeled. The simulated datasets have varying degrees of "randomness": one extreme is a completely random subset of positive samples (satisfying SCAR perfectly), the other extreme is a carefully crafted subset of positive samples for which the SCAR assumption is violated.

Table 1
Test-set performance on the public dataset. RAM and MAM perform significantly better when the SCAR assumption is violated (low randomness). EAM provides only a marginal improvement over RAM when the known positives are a random subset of the positive class.

topper 𝑡 = 0.90
Method         | 𝑚 = 0: AUC | Pr@Recall99 | 𝑚 = 30: AUC | Pr@Recall99 | 𝑚 = 70: AUC | Pr@Recall99 | 𝑚 = 100: AUC | Pr@Recall99
Biased SVM [5] | 0.705 | 0.524 | 0.705 | 0.576 | 0.688 | 0.535 | 0.689 | 0.560
EAM [3]        | 0.757 | 0.719 | 0.760 | 0.751 | 0.776 | 0.751 | 0.792 | 0.697
MAM (proposed) | 0.811 | 0.724 | 0.761 | 0.736 | 0.778 | 0.737 | 0.701 | 0.636
RAM (proposed) | 0.897 | 0.724 | 0.837 | 0.756 | 0.770 | 0.743 | 0.765 | 0.669

topper 𝑡 = 0.925
Method         | 𝑚 = 0: AUC | Pr@Recall99 | 𝑚 = 30: AUC | Pr@Recall99 | 𝑚 = 70: AUC | Pr@Recall99 | 𝑚 = 100: AUC | Pr@Recall99
Biased SVM [5] | 0.624 | 0.517 | 0.691 | 0.512 | 0.666 | 0.519 | 0.669 | 0.513
EAM [3]        | 0.761 | 0.730 | 0.761 | 0.751 | 0.774 | 0.747 | 0.791 | 0.701
MAM (proposed) | 0.831 | 0.737 | 0.792 | 0.752 | 0.743 | 0.717 | 0.721 | 0.682
RAM (proposed) | 0.906 | 0.773 | 0.812 | 0.767 | 0.764 | 0.745 | 0.748 | 0.700

Public Dataset: We use the KDDCUP'99 dataset (NSL-KDD dataset), a widely adopted labeled dataset for network intrusion detection. The train and test datasets have a total of 148,517 records with 43 features each. To get around known problems with the dataset [13], we merge the given train and test records, which we then re-split into train, validation and test sets in an 80:10:10 proportion. Overall, the dataset contains 71,463 intrusive sessions (all intrusions are bot-generated) while the rest are legitimate sessions.

Data Simulations: The process of creating artificial datasets involves hiding the labels of all negative points and of a proportion of the positive points. We sample a labeled subset of positive data points to create a known subset of positives. We first build a supervised classifier to score each data point; the classification task here is to distinguish intrusive from legitimate sessions, and the score is the predicted class probability. We use this score to introduce sampling bias when creating the known subset of positives. Using a Random Forest classifier, we achieve an AUC (area under the ROC curve) of 0.9921 on the training data and 0.9911 on the test data. We then curate different PU learning datasets by sampling over the scored data points while controlling the two parameters described below.

1. Topper: This parameter introduces sampling bias by selecting only those positive points whose prediction score (using the supervised model) is higher than the 𝑡-th quantile over all positively labeled points. This selection of the top fraction of positives introduces a sampling bias since we only select points with a high score. The idea is to capture spread within the positive class, and one meaningful scale is the estimated probability that a point is positive, given its features. Note that sampling is only done for the positive class; the labels of all negative points are hidden.

2. Mixing: This parameter controls the "randomness" of the known subset of positives. After creating a sample of known positives based on the topper parameter, at value 𝑚 we swap 𝑚% of the selected points with points from the positive set; the swapping is done with replacement. As we move from 𝑚 = 0 to 𝑚 = 100 we decrease the sampling bias in the set and correspondingly increase the randomness. A mixing of 100% means SCAR is completely satisfied.
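A small sketch of how the topper and mixing parameters could be combined to produce the biased labeled set is given below. This is our reading of the procedure (the exact bookkeeping of the with-replacement swap is an assumption); scores_pos is assumed to hold the supervised model's scores for all true positives.

```python
import numpy as np

def sample_known_positives(scores_pos, t=0.90, m=30, seed=0):
    """Return indices (into the positive set) chosen as the 'known' labeled positives."""
    rng = np.random.default_rng(seed)
    pos_idx = np.arange(len(scores_pos))

    # Topper: keep only positives scoring above the t-th quantile (a biased sample).
    selected = pos_idx[scores_pos > np.quantile(scores_pos, t)]

    # Mixing: swap m% of the selected points with random positives, drawn with replacement.
    n_swap = int(round(len(selected) * m / 100))
    kept = rng.choice(selected, size=len(selected) - n_swap, replace=False)
    swapped_in = rng.choice(pos_idx, size=n_swap, replace=True)
    return np.concatenate([kept, swapped_in])
```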
The subset obtained at a particular value of 𝑡 and 𝑚 is the known labeled subset of positives, and the remaining points (all negatives and the unsampled positives) are treated as unlabeled. With distinct values of 𝑡 and 𝑚 we obtain different simulated datasets. At a particular value of the topper parameter (𝑡), 𝑚 = 100 gives a completely random sample of the positive class (satisfying SCAR), while at the other end 𝑚 = 0 gives an extremely biased sample containing only high-scoring points. When 𝑚 < 100 the sampling is not completely random and depends on the score of the supervised model that uses all the features 𝑥. Consequently, the sampling variable 𝑠 is not independent of 𝑥 and the dataset does not align with the assumption of Elkan and Noto. We show that in cases of biased sampling, the proposed methods outperform the baseline approaches that rely on the SCAR assumption.

Results on simulated datasets: We train MAM and RAM and compare against the baselines, EAM [3] and Biased SVM [5], on simulated datasets with varying degrees of randomness. For uniformity, we use Random Forest as the base classifier for the three methods EAM, MAM and RAM; Biased SVM uses an SVM formulation [5]. The performance metrics are AUC (area under the ROC curve) and Precision@Recall99, the precision when 99% of known positives in the validation set are classified correctly. Unlike the standard 0.5 threshold for classification, we set the classification threshold such that 99% of the legitimate sessions (positives) are correctly classified as legitimate. This is particularly important since in real systems we do not wish to interrupt legitimate users with any scrutiny, and so Precision@Recall99 is an important metric to consider.

The results for the simulated experiments are shown in Table 1. When sampling is extreme (towards the left, with a smaller mixing parameter), RAM and MAM perform significantly better than EAM on both evaluation metrics. With more randomness (increasing mixing), EAM beats the other methods, but our proposed RAM still has competitive performance. Biased SVM performs poorly throughout. This shows that in extremely biased situations the proposed models MAM and RAM provide significant improvements by explicitly accounting for the sampling bias. On the other hand, EAM provides a slight improvement at high mixing (random sample) since it is tailored specifically for scenarios where the SCAR assumption holds.

4.2. Application to Real E-Commerce Data

This section describes the application of RAM to a proprietary dataset drawn from the traffic logs of an e-commerce website.

Data Description: The data contains a record for every page request, here referred to as a "hit". We consider one week of data and collapse these records into "sessions" for each user. A session combines a series of hits made by a user; a session ends after 30 minutes of inactivity. Overall, we identify 3.6 million unique visitors from 6 million sessions, and more than 100 million hits.
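As an illustration of the sessionization step described above, here is a minimal sketch assuming a hit-level log with a visitor identifier and a timestamp; the column names and the use of pandas are our assumptions, not a description of the production pipeline.

```python
import pandas as pd

def sessionize(hits: pd.DataFrame, timeout_minutes: int = 30) -> pd.DataFrame:
    """Assign a session_id to each hit; a session ends after `timeout_minutes` of inactivity."""
    hits = hits.sort_values(["visitor_id", "timestamp"]).copy()
    gap = hits.groupby("visitor_id")["timestamp"].diff()
    # A new session starts on a visitor's first hit or after the inactivity timeout.
    new_session = gap.isna() | (gap > pd.Timedelta(minutes=timeout_minutes))
    hits["session_id"] = new_session.cumsum()
    return hits
```

Session-level behavioral features such as time between hits can then be computed by grouping on the resulting session identifier.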
The task is to label a session as arising from a human or a bot. Sessions from legitimate bots are filtered out using user-agent strings. The feature representation of a session uses a standard set of technology (e.g. browser and device types), behavioral (e.g. time between hits) and session-related (e.g. timezone and time-of-day) attributes. Since this is an e-commerce website, we also know whether a particular session resulted in a purchase. This information is leveraged to build our partial set of positives; the details are presented next.

Known subset of positives: Out of the 6 million sessions, 36k (0.6%) are purchase sessions and 360k (6%) belong to an identified purchaser. We label this 6% of sessions as positive (the Human class). The dataset is then split into train, test, and validation sets in an 80:10:10 ratio for modeling purposes.

Partially labeled test dataset: To validate our approach, we split the test data into three groups of points and observe the distribution of prediction scores across these classes. This split is based on heuristics which we describe next.

Positive data points: The subset of sessions whose user corresponds to a purchase session in the training dataset.

Negative data points: The subset of sessions that originated from AWS/Azure servers, the assumption being that browsing sessions originating from these cloud environments are unlikely to be initiated by humans. The sets of AWS/Azure IPs are publicly available [14, 15].

Unlabeled data points: The set of sessions which are tagged neither as positive nor as negative.

It is important to note that all points during model training carried the label 'Positive' or 'Unlabeled'; the 'Negative' label is only used for validation. Also notice that the set of known positives is neither complete (not all humans purchase) nor an unbiased sample (different users have varying propensities to purchase).

Using the validation set, we identify a threshold that captures 99% of positive labels. The output score of the RAM model is converted into a boolean is-human label using this threshold. Table 2 shows the break-up of the traffic in the dataset and how RAM classifies points from each of these classes. As seen in the table, we misclassify only a few negatively labeled sessions (<3%), and in total close to 82% of the traffic is reported as human by this model. We expect a high share of human traffic since the website has strict login requirements for accessing its content. Additionally, we observe a stark separation in the prediction scores of the positive and negative classes: most positive samples had a score close to 1, while negatives were scored close to 0.

Table 2
Test-set observations for the e-commerce dataset.

Class of sessions | No. of sessions | No. predicted human | % predicted human
Positive          | 74k             | 73k                 | ∼99%
Negative          | 24k             | 608                 | ∼2.5%
Unlabeled         | 1.08M           | 890k                | ∼82%
Total             | 1.18M           | 965k                | ∼82%

5. Conclusions

In this paper, we have addressed the problem of detecting non-human traffic using positive and unlabeled data. Providing recommendation and personalization systems with unbiased data to learn from leads to a better experience for the end customer. We specifically accounted for violations of the selected-completely-at-random assumption in standard PU learning methods and conducted simulation studies for validation. We also evaluated our most general model, RAM, on a large real-world e-commerce dataset.
Given the scale of fraud due to bots, such bot detection systems have clear utility. The methods described in this paper show promising results in addressing the endemic bot problem.

References

[1] D. Networks, 2020: Bad Bot Report | IT Security's Most In-Depth Analysis on Bad Bots, https://bit.ly/2Azqx3d, 2020. Accessed: 2020-05-15.
[2] S. Zhang, Y. Ouyang, J. Ford, F. Makedon, Analysis of a low-dimensional linear model under recommendation attacks, in: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006, pp. 517–524.
[3] C. Elkan, K. Noto, Learning classifiers from only positive and unlabeled data, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2008, pp. 213–220.
[4] J. Bekker, P. Robberechts, J. Davis, Beyond the selected completely at random assumption for learning from positive and unlabeled data, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2019, pp. 71–85.
[5] B. Liu, Y. Dai, X. Li, W. S. Lee, P. S. Yu, Building text classifiers using positive and unlabeled examples, in: Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, IEEE, 2003, pp. 179–186.
[6] H. Gao, M. Tang, Y. Liu, P. Zhang, X. Liu, Research on the security of Microsoft's two-layer CAPTCHA, IEEE Transactions on Information Forensics and Security 12 (2017) 1671–1685.
[7] O. Stitelman, C. Perlich, B. Dalessandro, R. Hook, T. Raeder, F. Provost, Using co-visitation networks for detecting large scale online display advertising exchange fraud, in: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '13, 2013, pp. 1240–1248.
[8] X. He, Z. He, X. Du, T.-S. Chua, Adversarial personalized ranking for recommendation, in: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 355–364.
[9] Y. Li, O. Martinez, X. Chen, Y. Li, J. E. Hopcroft, In a world that counts: Clustering and detecting fake social engagement at scale, in: Proceedings of the 25th International Conference on World Wide Web, WWW '16, 2016, pp. 111–120.
[10] W. S. Lee, B. Liu, Learning with positive and unlabeled examples using weighted logistic regression, in: Proceedings of the 20th International Conference on Machine Learning, ICML '03, 2003, pp. 448–455.
[11] J. Bekker, J. Davis, Learning from positive and unlabeled data: A survey, Machine Learning 109 (2020) 719–760.
[12] F.-X. Jollois, M. Nadif, Speed-up for the expectation-maximization algorithm for clustering categorical data, Journal of Global Optimization 37 (2007) 513–525.
[13] M. Tavallaee, E. Bagheri, W. Lu, A. A. Ghorbani, A detailed analysis of the KDD Cup 99 data set, in: Proceedings of the 2009 IEEE Symposium on Computational Intelligence in Security and Defense Applications (CISDA), 2009, pp. 1–6.
[14] AWS, AWS IP address ranges, https://amzn.to/2z2Ql7h, 2020. Accessed: 2020-04-27.
[15] Microsoft, Microsoft Azure datacenter IP ranges, https://bit.ly/36aMDon, 2017. Accessed: 2020-04-27.