=Paper=
{{Paper
|id=Vol-2600/paper19
|storemode=property
|title=Threshold Optimization in Multiple Binary Classifiers for Extreme Rare Events using Predicted Positive Data
|pdfUrl=https://ceur-ws.org/Vol-2600/paper19.pdf
|volume=Vol-2600
|authors=Edgar Robles,Fatima Zaidouni,Aliki Mavromoustaki,Payam Refael
|dblpUrl=https://dblp.org/rec/conf/aaaiss/RoblesZMR20
}}
==Threshold Optimization in Multiple Binary Classifiers for Extreme Rare Events using Predicted Positive Data==
Edgar Robles*, Fatima Zaidouni*, Aliki Mavromoustaki, Payam Refael

University of Costa Rica, San José, Costa Rica; University of Rochester, Rochester, NY, USA; Institute of Pure and Applied Mathematics, University of California Los Angeles, CA, USA; Google Inc., Los Angeles, CA, USA

{edgar.roblesarias@ucr.ac.cr, fzaidoun@u.rochester.edu, aliki.mavromoustaki@gmail.com, payamiam@google.com}

* Co-first authors: both authors contributed equally to this work.

Copyright © 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto, California, USA, March 23-25, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Abstract===
Binary classification is challenging when dealing with imbalanced data sets where the important class is made of extremely rare events, usually with a prevalence of around 0.1%. Such data sets are common in various real-world problems. Examples are medical diagnostics, where the disease is considered a rare event, or multiple types of fraud detection, where regular transactions are the most prevalent. The events are categorized as either predicted positive or predicted negative against a certain threshold. In large imbalanced data sets, it is expensive to verify, through human raters, all the predictions made by the classifier. As a result, only predicted positive events are rated. Additionally, in most industrial applications, it is useful to combine multiple classifiers to make a decision. We developed a machine learning pipeline which, combined with expert knowledge, decides on an optimal threshold. ROC curves are reformulated to true positive (TP) versus false positive (FP) curves. We propose two solutions to select an optimal threshold by maximizing the area under the curve (AUC): a graph-based approach and an analytic approach. The graph-based approach constructs a graph to select an optimal path in the threshold space, and the analytic approach minimizes an energy function. Our results agree with the Google team's manual attempts to choose the operating point in their private data sets and binary classifiers, while providing a rigorous mathematical framework. Our solutions build on the Google team's expert knowledge efforts, which identified the marginal precisions used in our methods. To further evaluate the performance of our algorithm, we split three public data sets into training and testing sets and used them to train five different classifiers. The results show an improvement of the F1 measure by 1.5%; the precision was improved by an average of 5.1%, and the recall was reduced by 1.6% on average. Depending on the purpose of the classification, we show how to reverse this trade-off.

===Introduction===
Our study focuses on data sets with extreme rare events, where the minority class accounts for 0.01% to 0.19% of the population. Such data is very common in many applications, including fraud detection where fraudsters constitute a small percentage of users, medical diagnostics, face detection, money laundering, and intrusion detection. In such data sets, the distribution of positive to negative classes is extremely imbalanced, and it is typically more important to correctly detect the positive class (Aggarwal 2014), which corresponds to the rare events. For instance, one of the examples used in the result section is a public data set on credit card fraud detection where the positive class accounts for 0.17% of all transactions.

In the rest of this paper, we provide some background about binary classifiers and their performance, summarize related work, and describe the problem statement in detail. We then propose solutions for single-classifier thresholding and for an ensemble of classifiers, where two methods are proposed: the analytic and the graph approach. Finally, we explain two experiments to benchmark our results and evaluate the performance of our algorithm, one using public data sets and the other using private Google data sets.
===Background===

====Binary Classifiers====
A binary classifier is a model that classifies its input into two classes: a positive and a negative class. In this paper, we call the classifier's input a document, and its output a score. The higher the score, the more confident the classifier is that the document belongs to the positive class, and vice versa. In the threshold binary model, a threshold number is picked and compared to the output score of the classifier to decide which class is predicted. If the score is larger than the selected threshold, the document is classified in the positive class. If the score is lower than the threshold, the document is classified in the negative class.

====Classifier performance====
To evaluate the performance of the classifier, the documents are often presented to human raters to determine the true positives, the false positives, and the true and false negatives. A higher threshold will cause the predicted class to become more precise on average while losing some recall, and a lower threshold will cause the predicted class to become less precise on average while gaining some recall. Note that

<math>\mathrm{Precision} = \frac{TP}{TP + FP}</math>

and

<math>\mathrm{Recall} = \frac{TP}{TP + FN}.</math>

This results in a trade-off where a very low threshold might yield many false positives and a high threshold might yield many false negatives. This trade-off is usually visualized with a receiver operating characteristic (ROC) curve that plots the true positive rate (TPR), defined as

<math>TPR = \frac{TP}{TP + FN},</math>

versus the false positive rate (FPR), defined as

<math>FPR = \frac{FP}{FP + TN},</math>

as suggested in (Bradley 1997) and (Provost and Fawcett 2001). A random classifier would be located along the main diagonal (Fawcett 2006), and a good classifier would have points in the upper left corner. This is visualized in Figure 1. A widely used performance measurement is the area under the curve (AUC) of the ROC. An optimal threshold is needed to minimize the false positives and negatives by maximizing the AUC.

Figure 1: Typical ROC curve plotting TPR vs FPR. A random classifier produces a line along the diagonal; the closer the points are to the upper left corner, the better the classifier performance.
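To make the threshold trade-off concrete, the following minimal Python sketch (our own illustration, not code from the paper; function names such as <code>confusion_counts</code> are hypothetical) sweeps a threshold over classifier scores with known labels and computes the quantities defined above.

<syntaxhighlight lang="python">
import numpy as np

def confusion_counts(scores, labels, t):
    """Counts TP, FP, FN, TN for the rule: predict positive when score >= t."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pred = scores >= t
    tp = int(np.sum(pred & (labels == 1)))
    fp = int(np.sum(pred & (labels == 0)))
    fn = int(np.sum(~pred & (labels == 1)))
    tn = int(np.sum(~pred & (labels == 0)))
    return tp, fp, fn, tn

def roc_points(scores, labels, thresholds):
    """Returns one (FPR, TPR) pair per threshold, tracing the ROC curve."""
    pts = []
    for t in thresholds:
        tp, fp, fn, tn = confusion_counts(scores, labels, t)
        tpr = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        pts.append((fpr, tpr))
    return pts
</syntaxhighlight>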
===Related Work===
ROC and precision-recall analysis are widely used in imbalanced data set classification, as shown for various types of rare event data sets in (Kubat, Holte, and Matwin 1998), (Weiss 1999), and (Carvalho and Freitas 2002). However, choosing an optimal threshold, also called the "optimal operating point", is only well understood for balanced data sets where there are approximately as many positives as negatives. A standard method is to take the intersection of the iso-performance tangent line in the ROC as the optimal cut-off (Hong 2009). The challenge arises when dealing with imbalanced data sets where rare events constitute the positive class. The iso-performance line method maximizes the classification accuracy, which places more weight on the majority class, therefore overlooking the minority class in which we are interested the most. This method is therefore not applicable in our case; its pitfalls are discussed in detail in (Calabrese 2014) and (Fawcett and Provost 1996). Optimizing the threshold as a method to improve classification accuracy was explored in (Weiss 2004) and (Provost 2000), where rarity and the challenges that it implies are investigated. In (Weiss and Provost 2003), the authors show that adjusting decision thresholds is an efficient way to deal with this bias. In (Calabrese 2014), the author suggests a quality measure called the cost-sensitive unweighted accuracy, which uses the true rate as defined in (Hong 2009) as a quality criterion. It finds the optimal threshold that minimizes the difference between the simple mean of the Type 1 error, associated with false positives, and the Type 2 error, associated with false negatives. To correct for the imbalanced misclassification costs, the errors are weighted by their respective costs, for which the ratio is usually known. An example using this method is found in (Roelens et al. 2018) for flood prevention data. In the context of credit card assessment, an MCDM (Multiple Criteria Decision Making) approach was proposed in (Beulah Jeba Jaya .Y 2015) to select an optimal probability cut-off using hybrid data mining approaches.

Ensemble classifiers are common methods to combine multiple classifiers, each one looking at different features of a dataset. In (Breiman 2001), a method is shown where a group of poor classifiers can be harnessed together to obtain much better performance. (Dietterich 2000) shows how different methods for voting in these ensembles of classifiers, from Bayesian averaging to error-correcting output coding and boosting, can often perform better than a single classifier. While different thresholding methods are suitable for a range of applications, we identified a few very important challenges, described below, that are faced when dealing with rare event data sets and that, to our knowledge, had not been addressed before.

===Problem Description===
In an extreme rare event problem, if one positive event happens in every ten thousand samples, one would have to label a million samples to detect 100 positive events. Like the data we are dealing with from Google, many datasets will lack information about the true and false negatives, because it is very expensive to have humans spend time rating such a large amount of data. Therefore, it makes sense that the only data classified by humans is the data that scored higher than a given score on the classifier, i.e. the predicted positive class. Not knowing the true and false negatives therefore calls for analysis methods that do not require them.

The second important challenge to address occurs when using multiple classifiers to evaluate a single document.
It often happens that we need to look at different aspects of the document using different classifiers before making a decision. For instance, one classifier can be assigned images and another classifier can look for text. Each classifier outputs a different score; after a threshold is chosen for each, the scores can be compared against the thresholds and a decision can be made about whether the document belongs to the positive or negative class. The difficulty arises in finding a threshold that optimizes all the models jointly. This takes us from a single-dimensional optimization problem to a multi-dimensional one. In the following sections, we explore the methods developed to address these challenges in the context of extreme rare event data sets.

===Proposed Methods===

====For a single classifier====
For the sake of completeness, we begin with the simple case of one classifier, where we start defining the framework of our methods. Let X_i be the independent and identically distributed (iid) random variable representing the score of positive documents. The number of true positives at a threshold t is the number of positive documents with a score greater than t. Hence, for each threshold t we have the TPR as

<math>\frac{1}{n}\sum_{i=1}^{n} I_{X_i \ge t},</math>

where n is the total number of positive samples and I_{X_i ≥ t} is the indicator function. Similarly, let Y_i be the iid random variable of the score of negative documents; then the FPR is given as

<math>\frac{1}{m}\sum_{i=1}^{m} I_{Y_i \ge t},</math>

where m is the total number of negative samples. The ROC curve is then parametrized as

<math>\left(\frac{1}{m}\sum_{i=1}^{m} I_{Y_i \ge t},\ \frac{1}{n}\sum_{i=1}^{n} I_{X_i \ge t}\right),\quad t \in \mathbb{R},</math>

and is expected to converge to (P(Y_i ≥ t), P(X_i ≥ t)). The trade-off optimization problem then becomes one of finding a threshold t that maximizes P(X_i ≥ t) and minimizes P(Y_i ≥ t). However, as mentioned in the introduction, we are addressing the situation where only the portion of the documents that scored higher than a certain threshold are rated. In other words, the distributions are left-censored: we only have I_{X_i ≥ t}, I_{Y_i ≥ t} for t ≥ 0, and we only know the total number of samples m + n but do not know either m or n. Hence, we are unable to plot an ROC curve.

Since we lack information about the true and false negatives, we use a reformulation of the ROC curve that plots TP instead of TPR and FP instead of FPR; this removes the normalization to probabilities that a normal ROC performs by dividing by the number of positives or negatives. The reformulation is referred to as the positive-negative (PN) curve as defined in [8]. In our paper, we also refer to it as a partial ROC curve. The appearance is the same between the PN and ROC curve, but the chance line (diagonal) in the ROC becomes the break-even line as known in information retrieval; a more detailed explanation is given in [8]. Mathematically,

<math>\left(\sum_{i=1}^{m^*} I_{Y_i \ge t},\ \sum_{i=1}^{n^*} I_{X_i \ge t}\right),\quad t \in \mathbb{R}_{\ge 0},</math>

where m* and n* are the total number of negative and positive documents, respectively, having scores greater than 0, for instance.

Given a classifier, a low threshold T can be set, the documents with a score higher than T can be sent to human raters to determine the TP and FP, and then, in the resulting data set, the threshold is raised to see how the TP and FP change. Note that as the threshold keeps getting higher, we have fewer data points in the sub-sample and therefore a wider confidence interval. Using the partial ROC curve implies that we cannot use the rated data to simulate the TP and FP counts had the model threshold been set even lower than T.

With the partial ROC curve, we can formulate our optimization problem as follows: given any point on the curve, if this point were the operating point and we wanted to get one additional true positive document, we can determine the cost that we need to pay in terms of a raw number of false positives. In many situations, this formulation can relate to a business goal. For example, the positive class could be unwanted documents that a company needs to filter out without losing the wanted documents. Thus, the company can provide a number quantifying how many good documents they are willing to give up in order to filter out one additional bad document. This can be determined by analyzing monetary costs, for instance. Based on this, a target derivative dTP/dFP can be chosen; it represents the number of true positives gained per additional false positive at a certain threshold. The target derivative is based on expert knowledge, which was provided for us by Google in our application.

A partial ROC curve can be described as a parametric function

<math>f(t) = (FP(t), TP(t)),\quad f : T \to \mathbb{R}^2,</math>

where T is the threshold space. A fit can be applied to find the points where the derivative has the desired value. In the result section, we present the fitting methods that worked best for typical partial ROC curves.
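As an illustration of the single-classifier setting, the sketch below (our own illustrative code, not from the paper; the name <code>partial_roc</code> is hypothetical) counts TP and FP among the rated predicted-positive documents for a grid of candidate thresholds above the rating cut-off T, which yields the points of the partial ROC (PN) curve. Raising the threshold only removes already-rated documents, so the curve can be traced upward from T but never below it, in line with the left-censoring described above.

<syntaxhighlight lang="python">
import numpy as np

def partial_roc(scores, rater_labels, thresholds):
    """Builds the PN / partial ROC curve from rated predicted positives.

    scores:        classifier scores of the documents sent to raters
                   (i.e. documents scoring above the rating cut-off T)
    rater_labels:  1 if the human rater marked the document positive, else 0
    thresholds:    candidate thresholds, all >= T
    Returns a list of (FP, TP) raw counts, one pair per threshold.
    """
    scores = np.asarray(scores)
    rater_labels = np.asarray(rater_labels)
    curve = []
    for t in thresholds:
        kept = scores >= t                      # documents still predicted positive at t
        tp = int(np.sum(kept & (rater_labels == 1)))
        fp = int(np.sum(kept & (rater_labels == 0)))
        curve.append((fp, tp))
    return curve
</syntaxhighlight>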
====For an ensemble of classifiers====
In this case, the PN curve (partial ROC) becomes an N-dimensional manifold, and dTP/dFP becomes a more complicated notion, since a local direction must be picked for a unique derivative to exist. In general, one can define the function as

<math>f(\vec{t}) = (FP(\vec{t}),\, TP(\vec{t})),\quad f : \hat{T} \to \mathbb{R}^2, \qquad (1)</math>

where T̂ is the threshold space described by the cross product of all the threshold values for each classifier. We can consider the derivative at a point to be

<math>\lim_{\Delta x \to 0} \frac{\Delta TP \cdot u}{\Delta FP \cdot u},</math>

where u is a vector that describes the direction in which the derivative is being taken. We describe a path

<math>\phi(s) = (g(s), h(s)),\quad 0 \le s \le 1,</math>

over the T̂ space, that is, a function φ : [0, 1] → T̂. We impose the following restrictions on φ for the resulting curve to behave like a partial ROC curve:
* φ(0) = (inf T_1, inf T_2, ..., inf T_n)
* φ(1) = (sup T_1, sup T_2, ..., sup T_n)
* g(s) and h(s) must be monotonic

The first two conditions ensure that the curve starts and ends at the correct points, and the last one ensures that the number of true positives and false positives decreases with respect to our new "threshold". To pick the best curve, φ should maximize the area under the curve:

<math>\operatorname*{argmax}_{\phi} \int_{\phi([0,1])} TP\, dFP.</math>

We propose two algorithms to find φ: one is based on minimizing a function (the analytic approach) and the other on building a graph (the graph approach). Before we give a step-by-step example implementation of both methods, we introduce the following simple base algorithm, shared by both, to find the optimal operating point.

 Algorithm 1: Base algorithm for finding the optimal operating point
 Input: set of points P representing (TP, FP, T1, T2, ...); target derivative d
 Output: optimal operating point (T1, T2, ...)
 1: C ← FindOptimalCurve(P)
 2: Find the point q where (d/dt) C(q) = d
 3: return q

In what follows, we explain the FindOptimalCurve function used in Algorithm 1 in the context of both the analytic and the graph approach.
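Step 2 of the base algorithm can be illustrated with a short sketch (our own, with a hypothetical helper name; a finite-difference slope estimate stands in for whatever derivative estimate is used in practice): given a discretized curve of (FP, TP) points produced by FindOptimalCurve, it picks the point whose local slope dTP/dFP is closest to the target derivative.

<syntaxhighlight lang="python">
import numpy as np

def pick_operating_point(curve, target_derivative):
    """Illustrative version of step 2 of Algorithm 1.

    curve: array of shape (k, 2) holding (FP, TP) points, ordered along the
           path parameter t (e.g. FindOptimalCurve's output sampled on a grid).
    target_derivative: desired marginal precision dTP/dFP.
    Returns the index of the point whose local slope is closest to the target.
    """
    curve = np.asarray(curve, dtype=float)
    dfp = np.gradient(curve[:, 0])
    dtp = np.gradient(curve[:, 1])
    with np.errstate(divide="ignore", invalid="ignore"):
        slope = np.where(dfp != 0, dtp / dfp, np.inf)   # finite-difference estimate of dTP/dFP
    return int(np.argmin(np.abs(slope - target_derivative)))
</syntaxhighlight>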
====Analytic approach====
The core of this approach is the use of a family of functions whose parameters are optimized so as to maximize the area under the corresponding PN curve. The approach estimates the integral yielding the area under the curve using the simple trapezoidal method. The implementation is summarized in Algorithm 2.

 Algorithm 2: Analytic implementation of FindOptimalCurve
 Input: set of points P representing (TP, FP, T1, T2, ...); a family of curves f_p defined on [0, 1] → T
 Output: a curve C
 1: Fit a function F̂(T1, T2, ...) as the best fit for the points (FP, T1, T2, ...)
 2: Fit a function T̂(T1, T2, ...) as the best fit for the points (TP, T1, T2, ...)
 3: Let g_p ← (T̂, F̂) ∘ f_p
 4: q ← argmax_p AUC(g_p)
 5: return {(t, f_q(t)) ∀ t ∈ [0, 1]}

We describe the implementation of this algorithm in more detail, using two thresholds as an example, in the following steps:

1. Normalizing the thresholds.

2. Plotting the three-dimensional plots (TP, T1, T2) and (FP, T1, T2) and fitting a surface through each one of them using an interpolation technique. We create an n × n grid on the threshold plane, with each point on the grid corresponding to a pair of thresholds and a value of TP and FP. This step allows finding the estimators T̂P and F̂P at each point of the surface.

3. Turning the grid into a linear space; since the data is normalized at this point, the domain will be from 0 to 1. We can also specify the granularity, i.e. the size of the partitions in this domain. In the next steps, the granularity will also indicate the number of AUC samples that are calculated in the process. This step is the start of the parametrization used in this approach. If s denotes the parameter indicating the order of the partitions, then both thresholds are a function of s, which is then used as an index in a separate parametric function to output the corresponding TP and FP values separately.

4. Defining an energy function as follows:

<math>E(\vec{p}) = AUC\bigl(\widehat{FP} \circ \phi_{\vec{p}},\ \widehat{TP} \circ \phi_{\vec{p}}\bigr),</math>

where φ_p⃗ is a family of parametric functions defined from [0, 1] to T1 × T2, with p⃗ being the parameter. In our algorithm, we used a parametric function that is graphically close to the resulting discrete path and that passes through the bounds (0, 0) and (1, 1):

<math>\phi_{\vec{p}}(x) = x^{p}. \qquad (2)</math>

AUC is a function that calculates the area under the curve using the trapezoidal formula; it is an estimate of the integral form of the area under the curve, described by <math>\int_{\{(x,\,\phi_{\vec{p}}(x))\}} TP\, dFP</math>.

5. Minimizing the function below using the Nelder-Mead minimization method:

<math>-\log(\log(E)).</math>

Note that we made the function negative so that a minimization method maximizes the AUC. We also composed E with two logarithms in order to make the optimization converge faster; otherwise, it might get stuck in local minima.

6. Creating a function that determines the curve in the TP-FP space that corresponds to the resulting path within the threshold space, using the parameterization mentioned in Step 3.

7. Finding the point s where dTP/dFP with respect to the parameterization s equals the target marginal precision, by estimating the derivative of f ∘ φ.

8. Finding the optimal operating point (T1, T2, TP, FP) given a target marginal precision, i.e. the threshold values where φ(t) = s. It can be interpolated at the value of s given by the marginal precision closest to the target. Ridders' method can be used for root finding to get the best parameter s.
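A minimal sketch of the analytic approach for two thresholds, assuming thresholds already normalized to [0, 1] and TP/FP given as raw counts on a regular grid (so that E > 1 and log(log(E)) is defined). SciPy interpolation and Nelder-Mead are our implementation choices; the helper names (<code>fit_surfaces</code>, <code>energy</code>, <code>find_optimal_curve</code>) are hypothetical, not the authors' code.

<syntaxhighlight lang="python">
import numpy as np
from scipy.interpolate import RegularGridInterpolator
from scipy.optimize import minimize

def fit_surfaces(t1_grid, t2_grid, tp_grid, fp_grid):
    """Step 2: interpolate the TP and FP surfaces over the (T1, T2) grid."""
    tp_hat = RegularGridInterpolator((t1_grid, t2_grid), tp_grid)
    fp_hat = RegularGridInterpolator((t1_grid, t2_grid), fp_grid)
    return tp_hat, fp_hat

def energy(p, tp_hat, fp_hat, n_samples=100):
    """Steps 3-4: trapezoidal AUC (raw counts) along the path phi_p(x) = (x, x**p)."""
    pw = max(float(p[0]), 1e-3)                    # keep the exponent positive so the path stays in [0, 1]^2
    x = np.linspace(0.0, 1.0, n_samples)
    pts = np.column_stack([x, x ** pw])            # path through (0, 0) and (1, 1)
    return abs(np.trapz(tp_hat(pts), fp_hat(pts)))  # area under the TP-vs-FP curve

def find_optimal_curve(t1_grid, t2_grid, tp_grid, fp_grid, p0=1.0):
    """Step 5: maximize the AUC by minimizing -log(log(E)) with Nelder-Mead."""
    tp_hat, fp_hat = fit_surfaces(t1_grid, t2_grid, tp_grid, fp_grid)
    objective = lambda p: -np.log(np.log(energy(p, tp_hat, fp_hat)))
    res = minimize(objective, x0=[p0], method="Nelder-Mead")
    return res.x  # optimal exponent p of phi_p
</syntaxhighlight>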
====Graph-based approach====
The goal of the graph-based approach is again to find the continuous path in the threshold space that maximizes the area under the curve in the TP-FP space, now with the constraint that every point is only allowed to move through its neighbors. We defined the node, edge, and weight of the graph shown in Figure 2, for two dimensions, as follows:
* Node v: represents a pair of thresholds, having attributes (T1, T2, TP, FP).
* Edge e: a connection between two nodes is established if and only if one of their thresholds is equal and the other threshold is consecutive in the discrete set of thresholds.
* Weight of edge e = (v1, v2): equal to the negative of the area under the segment (trapezoid) between (FP(v1), TP(v1)) and (FP(v2), TP(v2)), as schematized in Figure 3. Mathematically, the weight is defined as

<math>w(e) = -\tfrac{1}{2}\,\bigl|FP(v_1) - FP(v_2)\bigr|\,\bigl(TP(v_1) + TP(v_2)\bigr). \qquad (3)</math>

Note that the predicted positives have to be non-negative, hence the segment is above the x-axis. The weight is non-positive, therefore a lower weight indicates a larger area under the segment.

Figure 2: The graph used in this approach; each circle represents a node, which is a 4-tuple from the data (T1, T2, TP, FP), and each arrow represents an edge.

Figure 3: The trapezoidal area under a single segment between two nodes; the weight is defined to be the negative of the highlighted area.

The total weight of the edges in a path is then the negative of the area under the curve swept out in the TP vs FP plot. In other words, by fixing the end points, we can find a better TP vs FP curve by finding a path that has minimal weight. So, it suffices to find the shortest path from v_0 to v_∞, with (t1(v_0), t2(v_0)) = (0, 0) and (t1(v_∞), t2(v_∞)) = (∞, ∞). Intuitively, this finds a shortest path from the bottom left corner to the top right corner of the threshold space T1 × T2. The edges are directed up or to the right to avoid bent paths in the corresponding TP vs FP curve. After we find the set of nodes in the shortest path, we connect the corresponding points and re-parametrize the path. Note that the resulting curve is continuous but not smooth. The algorithm for this method is the following:

 Algorithm 3: Graph-based implementation of FindOptimalCurve
 Input: set of points P representing (TP, FP, T1, T2, ...) sampled on a grid with respect to the thresholds
 Output: a curve C
 1: Create a graph G connecting points to neighbors with a greater coordinate than them
 2: Let the origin be A_0 and the point with the biggest coordinates be A_∞
 3: P ← FindOptimalPath(G, A_0, A_∞)
 4: Let f(t) : [0, 1] → FP × TP be such that it passes through the path P
 5: return {(t, f(t)) ∀ t ∈ [0, 1]}

Below we explain this approach in detail, starting from step 3 of the analytic approach for a two-dimensional example, since the first two steps are similar.

1. Transforming the threshold space into an n × n grid, with each point on the grid corresponding to a pair of thresholds and a value of TP and FP.

2. Calculating the edge weights as the negative area under the segment formed by two points v1, v2 in the TP vs FP curve, using Equation 3.

3. Determining the weight of the edge connecting the nodes of two consecutive points.

4. Storing the (TP, FP, T1, T2) tuple in every node as extra information. An edge can then be defined as a 3-tuple (n_k, n_l, w_{k,l}), where n_k and n_l are the two nodes and w_{k,l} is the negative area under the segment connecting them. The nodes are connected if they are within the same neighborhood, that is, if one of the thresholds is the same between two neighbors and the other thresholds are consecutive in the discrete set of thresholds. The path selected is therefore just a collection of nodes from the bottom left to the top right corner in threshold space, and the total weight of the path corresponds to the negative area under the curve swept out by it in the TP-FP space. Note that in order to compute the shortest path between the lower-left corner (source node) and the top-right corner node, the Bellman-Ford algorithm can be used, since, unlike the usual Dijkstra's algorithm, it accepts edges with negative weights.

5. Connecting the nodes to obtain a continuous path in the threshold space and determining the curve in the TP-FP space that corresponds to the selected path, using the same parametrization explained in step 3 of the analytic approach.

6. Getting the marginal precision; refer to step 7 of the analytic approach.

7. Determining the optimal operating point given a target marginal precision; refer to step 8 of the analytic approach.
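A compact sketch of the graph construction and shortest-path step, using networkx as an implementation choice of ours (the paper only specifies the Bellman-Ford algorithm); <code>tp</code> and <code>fp</code> are assumed to be n × n arrays indexed by the two threshold grids, and the function name is hypothetical.

<syntaxhighlight lang="python">
import networkx as nx

def graph_optimal_path(tp, fp):
    """Builds the grid graph of Figure 2 and returns the minimal-weight
    (i.e. maximal-AUC) path of grid indices from (0, 0) to (n-1, n-1).

    tp, fp: n x n arrays of true/false positive counts, one entry per
            (T1, T2) threshold pair on the grid.
    """
    n1, n2 = tp.shape
    G = nx.DiGraph()
    for i in range(n1):
        for j in range(n2):
            # connect each node only to its "up" and "right" neighbors
            for di, dj in ((1, 0), (0, 1)):
                k, l = i + di, j + dj
                if k < n1 and l < n2:
                    # Equation 3: negative trapezoid area under the segment
                    w = -0.5 * abs(fp[i, j] - fp[k, l]) * (tp[i, j] + tp[k, l])
                    G.add_edge((i, j), (k, l), weight=w)
    # Negative weights on a directed acyclic grid: Bellman-Ford instead of Dijkstra
    return nx.bellman_ford_path(G, (0, 0), (n1 - 1, n2 - 1), weight="weight")
</syntaxhighlight>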
===Experiments and Results===

====Evaluating the performance using public datasets====
To evaluate the performance of the method, we use rare event data sets where the number of false negatives and true positives can be calculated, in order to compare the precision and recall of two standard classifiers with our methods applied to the resulting ensemble of classifiers. This was done using three rare event datasets whose true and false positive and negative values are known: a credit card fraud data set (https://www.kaggle.com/mlg-ulb/creditcardfraud), a bank marketing data set (https://www.kaggle.com/henriqueyamahata/bank-marketing), and the IEEE fraud detection data set (https://www.kaggle.com/c/ieee-fraud-detection).

Each dataset was cleaned by replacing empty values with the average value of the column, turning categorical features into one-hot encoded vectors, and encoding cyclical values, such as days of the week, into points on a unit circle. These datasets were then passed through a PCA (Principal Component Analysis) procedure to reduce the dimensionality. The data sets were then split into a training set T and a holdout testing set H, where T represents 75% of the data set and H represents the other 25%.

Each classifier was trained using the first 20% of the components of the PCA. We need to simulate a joint optimization of two classifiers, where each one makes decisions over a separate but related data set. Thus, one of the classifiers was assigned the remaining even components and a second classifier of the same type was assigned the remaining odd components.

The hyperparameters for each classifier were tuned by maximizing the average accuracy in a cross-validation test. The cross-validation test was done by splitting the training set T into 5 different cross-validation groups T_1, T_2, ..., T_5, where for each group T_i a classifier is trained using the set ∪_{j≠i} T_j, and the score for the i-th fold, denoted s_i, is obtained by evaluating the accuracy of the classifier on T_i. The hyperparameters with the highest average value of s_1, ..., s_5 are picked.

The classifiers used for the testing were linear SVMs, random forests, decision trees, k-nearest neighbors, XGBoost, and Naive Bayes. We explored multiple classification techniques to determine whether our methods improve their performance. We relied on (Pedregosa et al. 2011) for a compilation of classification algorithms. In all classifiers, we estimated the hyperparameters through a grid search with cross-validation. For the k-nearest neighbors classifier, we tested all values of k from 1 to √n, where n is the number of points in the training data, as well as the L1 and L2 metrics; the default behavior for classification is to pick the class with the largest number of nearest neighbors assigned to it. The decision tree classifier used the grid search coupled with the Gini and entropy criteria; we optimized the maximum depth parameter over all integer values between 1 and 500, as well as without setting a maximum depth, and the default classification method is to follow the decisions of the tree. The random forest classifier also used the Gini and entropy criteria with the number of estimators between 50 and 500, in steps of 10; the default behavior for classification is to let each decision tree in the forest vote, and the class with the most votes wins.
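The per-classifier searches described here can be reproduced with scikit-learn's grid search with cross-validation. The sketch below shows the pattern for the random forest just described (the parameter ranges follow the text; the 75/25 split, the function name, and everything else are illustrative assumptions, not the authors' code).

<syntaxhighlight lang="python">
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

def tune_random_forest(X, y):
    """5-fold grid search over the criteria and estimator counts given in the text."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)   # 75% train / 25% holdout
    param_grid = {
        "criterion": ["gini", "entropy"],
        "n_estimators": list(range(50, 501, 10)),
    }
    search = GridSearchCV(RandomForestClassifier(), param_grid,
                          scoring="accuracy", cv=5)
    search.fit(X_train, y_train)
    return search.best_estimator_, (X_test, y_test)
</syntaxhighlight>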
For the extreme gradient boosting classifier (XGBoost), we used a maximum depth from 1 to 20, a learning rate from 0.1 to 1 in steps of 0.1, and a number of estimators from 50 to 500 in steps of 10; the default classification behavior is the same as for random forests. For the linear SVMs, we again used the grid search with cross-validation to choose between the L1 and L2 norms for the penalty; the default threshold used was 0. For the SVMs, we used the hinge or squared hinge loss functions to train the classifier, with the dual formulation set to true or false, and all values of the regularization parameter λ from 0 to 1 in steps of 0.1; the default threshold is 0. We also used a Naive Bayes classifier with no optimized parameters, whose default classification behavior is to choose the class with the highest probability.

Figure 4: The F1 measure of the best performing method to create the ensemble against the average F1 measure of the two classifiers that compose it, for each pair of dataset and classifier type. The classifier types are abbreviated as follows: KNN is k-nearest neighbors, SVM is a linear SVM, NB is Naive Bayes, RF is Random Forests and XGB is XGBoost.

Figure 5: The precision of the best performing method to create the ensemble against the average F1 measure of the two classifiers that compose it, for each pair of dataset and classifier type. Abbreviations as in Figure 4.

A grid was made from the cross product of equally spaced threshold values in each dimension. An input is considered positive if either of the scores produced by the two classifiers surpasses its own threshold; this means that we are taking the union of the classifications. Using the scores output by the classifiers, we calculated the number of false positives and true positives for each pair of thresholds in the grid. This was the input for Algorithm 1. We then calculated the precision and recall on the testing set. To evaluate performance, the F1 measure was used instead of the accuracy score. The latter is not suitable for rare event data sets, where a difference in accuracy of 0.01% can make a huge difference in both the precision and the recall. The F1 measure is calculated as

<math>F_1 = 2\,\frac{p\,r}{p + r},</math>

where p is the precision and r is the recall.
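The construction of the threshold grid and the "union" decision rule described above can be sketched as follows (our own illustration with hypothetical names; the grid resolution is an assumption). The rows it returns are the (T1, T2, TP, FP) points that Algorithm 1 expects as input.

<syntaxhighlight lang="python">
import numpy as np

def ensemble_grid(scores1, scores2, labels, n_points=50):
    """Counts TP and FP of the 'union' ensemble on an n x n threshold grid.

    scores1, scores2: scores of the two classifiers on the same documents
    labels:           ground-truth labels (1 = positive)
    Returns rows (t1, t2, TP, FP), i.e. the input expected by Algorithm 1.
    """
    scores1, scores2 = np.asarray(scores1), np.asarray(scores2)
    labels = np.asarray(labels)
    t1_grid = np.linspace(scores1.min(), scores1.max(), n_points)
    t2_grid = np.linspace(scores2.min(), scores2.max(), n_points)
    rows = []
    for t1 in t1_grid:
        for t2 in t2_grid:
            pred = (scores1 >= t1) | (scores2 >= t2)   # union of the two decisions
            tp = int(np.sum(pred & (labels == 1)))
            fp = int(np.sum(pred & (labels == 0)))
            rows.append((t1, t2, tp, fp))
    return rows

def f1(precision, recall):
    """F1 measure as defined in the text."""
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
</syntaxhighlight>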
Figure 6: The recall of the best performing method to create the ensemble against the average F1 measure of the two classifiers that compose it, for each pair of dataset and classifier type. Abbreviations as in Figure 4.

Figures 4, 5 and 6 show the F1 measure, precision, and recall, respectively, for each experiment. An experiment was defined as a pair of a dataset and a classifier type. When either the precision or the recall was extremely low, the other quantity is reported as undefined; in this case, taking the union of the classifiers' decisions results in the ensemble classifying every document as positive. Overall, the precision was increased on average by 5.1%, the recall was reduced by an average of 1.6%, and the F1 measure increased by an average of 1.5%.

Table 1 and Table 2 of the appendix show the results for all the simulated experiments with all the mentioned classifiers. Undefined values are shown as "-" in the tables. As the F1 measure of the classifiers that do not use the proposed methods increases, the F1 measure of the ensemble resulting from the proposed methods also increases. This trend is shown in Figure 7, which also plots the identity line: the proposed methods improve the F1 measure for all the points above this line.

====Benchmark using Google data sets====
The Google data contains 22 data sets, each consisting of about 10,000 rows and 4 columns. The first two columns represent two thresholds of two different classifiers, and the next two represent the number of true and false positives for each pair of thresholds as determined by human raters, assuming all documents over the threshold are considered part of the positive class. These datasets have a prevalence of the positive class of about 0.01% of the total number of rows.

The result of running Algorithms 2 and 3 is shown in Figure 10 for an example data set. The figure represents the threshold paths obtained using both the analytic and graph approach, as well as the location of the operating point. Using the parameterization from Equation 1, we can use the threshold paths to determine the optimal partial ROC curve that was selected by each of these algorithms.
Next, we can determine the optimal operating point given a target marginal precision, as determined by a business goal set by Google's expert knowledge efforts prior to this work. The result of running Algorithm 1 is shown for three different methods in Figure 8 in the threshold space and, correspondingly, in Figure 9 in the TP vs FP space.

Figure 7: The average F1 measure of the classifiers without the proposed method compared to the F1 measure of the ensemble with the proposed method.

Figure 8: The operating point picked by each algorithm in the threshold space.

Figure 9: The operating point picked by each algorithm in the TP vs FP space.

The analytic and graph methods shown in Figures 8 and 9 are the proposed methods, while the benchmark is a fine-tuned operating point used by Google, based on separating the false positives into small intervals called buckets, picking the highest TP for each bucket, and then adjusting this manually to fit the goals of the optimal operating point. While the latter is good as a benchmark for our algorithms, it cannot be used as another optimization method, since it lacks mathematical rigor and produces erratic behavior in the threshold space which, without fine-tuning, would imply the non-existence or non-uniqueness of the derivative of the partial ROC curve needed to find the operating point. Our methods closely approximate the operating point selected by the benchmark method. These points are also plotted on their respective paths in Figure 10.

Figure 10: The path picked by the analytic (top) and graph method (bottom), using the function in Equation 2 for the analytic method.

===Discussion===
We used an "or" (union) when combining the decisions of the classifiers, but it can be substituted with an "and" operation, depending on whether the application prioritizes positive or negative classification. Our methods also assume that the ROC curve of each classifier has a unique decreasing derivative, which can be expected from a tuned classifier.

Both the analytic and graph methods were presented because, although both are useful, each has different disadvantages. The analytic method is heavily dependent on the family of functions used to optimize over, and is prone to getting stuck in a local minimum. On the other hand, it is uncertain whether the graph method is prone to overfitting the data; unlike with the analytic method, however, there is no way to mitigate it. A possible alteration to mitigate overfitting with the graph method would be to run a series of simulations in which some connections between the nodes are deleted, and then pick the average path over the resulting set of paths.

Both algorithms scale reasonably well with respect to the amount of data. The graph algorithm scales at a rate of O(n^k), where n is the number of rows and k is the number of classifiers. This is good when the number of dimensions is low; however, it scales exponentially with the number of classifiers in the ensemble. The analytic approach works the opposite way: since it depends on an optimization problem, it does not scale very well with respect to the number of rows, but it scales linearly with the number of dimensions. We compared the running times of the algorithms using a computer with an i7-6700 and 32 GB of RAM for all the Google data sets combined. The analytic approach ran on average 10 times faster than the graph approach; for n = 10,000 and k = 2, the analytic method takes about 6 seconds while the graph method takes about 60 seconds under these conditions.

Finally, the experiments were done using an automated method to tune the classifiers, notably grid search with cross-validation, in order to obtain a large number of samples across different accuracies. However, this could be improved, since it does not reflect a real-world application of this process, in which the hyperparameters would be estimated with the help of additional fine-tuning methods that depend on the specific application of the data set.
===Conclusion===
The proposed algorithm finds optimal thresholds for the output of binary classifiers used on rare event datasets. The method works by finding the best possible path in the TP-FP space, namely the one that maximizes the area under the curve of the projected partial ROC curve. This curve is then used to find the optimal operating point corresponding to a given target derivative. The latter represents the trade-off of lost false positives to gained true positives, and was obtained from the Google team's expert knowledge efforts prior to this work. Two optimization methods are proposed: an analytical method, which optimizes the parameters of an energy function, and a graph method, which performs the optimization by finding the path in a graph that maximizes the AUC.

Our results were very close to the optimal operating points that were picked manually by the Google team. We were therefore able to automate their process of finding an optimal threshold while providing a mathematical framework and using their target derivatives (marginal precisions). Moreover, we compared the F1 measure, precision, and recall of combining five classifiers and three datasets, either relying on our thresholding methods and taking the union of the decisions, or using the classifiers individually with their default classification behavior. The result improved the precision while lowering the recall less significantly. This trade-off could be reversed, if desired, by simply changing the ensemble decision between the individual classifiers from "or" to "and". We discussed important limitations and possibilities for future work. Notably, optimizing metrics other than the AUC, individually or in combination, would be worth exploring. Moreover, we optimized two binary classifiers jointly, but the theoretical framework described in the methods section extends to N dimensions. Using it in practice will require optimizing over multiple parameters in the analytic approach and extending the graph to an N-dimensional lattice. Exploring this possibility through practical applications might open great avenues for the general optimization of ensemble classifiers.

===Acknowledgements===
We would like to thank our teammates Ching Pui Wan and Joanne Beckford, NSF grant DMS-0931852, Payam Refael and his team from Google, our academic mentor Aliki Mavromoustaki, and all of the IPAM staff at UCLA for making this work possible.

===References===
Aggarwal, C. C. 2014. Data classification: algorithms and applications. CRC Press, Taylor & Francis Group.

Beulah Jeba Jaya .Y, J. J. T. 2015. Multiple criteria decision making based credit risk prediction using optimal cut-off point approach. International Journal of Applied Engineering Research 10:20041–20054.

Bradley, A. P. 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30(7):1145–1159.

Breiman, L. 2001. Random forests. Machine Learning 45(1):5–32.

Calabrese, R. 2014. Optimal cut-off for rare events and unbalanced misclassification costs. Journal of Applied Statistics 41(8):1678–1693.

Carvalho, D. R., and Freitas, A. A. 2002. A genetic-algorithm for discovering small-disjunct rules in data mining. Applied Soft Computing 2(2):75–88.

Dietterich, T. G. 2000. Ensemble methods in machine learning. In Multiple Classifier Systems, 1–15. Berlin, Heidelberg: Springer.

Fawcett, T., and Provost, F. 1996. Combining data mining and machine learning for effective user profiling. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 8–13. AAAI Press.

Fawcett, T. 2006. An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874.

Hong, C. S. 2009. Optimal threshold from ROC and CAP curves. Communications in Statistics - Simulation and Computation 38(10):2060–2072.

Kubat, M.; Holte, R. C.; and Matwin, S. 1998. Machine learning for the detection of oil spills in satellite radar images. Machine Learning 30(2-3):195–215.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.

Provost, F., and Fawcett, T. 2001. Robust classification for imprecise environments. Machine Learning 42(3):203–231.

Provost, F. 2000. Machine learning from imbalanced data sets 101. In Proceedings of the AAAI 2000 Workshop on Imbalanced Data Sets, volume 68, 1–3. AAAI Press.

Roelens, J.; Rosier, I.; Dondeyne, S.; Van Orshoven, J.; and Diels, J. 2018. Extracting drainage networks and their connectivity using LiDAR data. Hydrological Processes 32(8):1026–1037.

Weiss, G. M., and Provost, F. 2003. Learning when training data are costly: The effect of class distribution on tree induction. Journal of Artificial Intelligence Research 19:315–354.

Weiss, G. M. 1999. Timeweaver: A genetic algorithm for identifying predictive patterns in sequences of events. In Proceedings of the 1st Annual Conference on Genetic and Evolutionary Computation, Volume 1, 718–725. Morgan Kaufmann Publishers Inc.

Weiss, G. M. 2004. Mining with rarity: a unifying framework. ACM SIGKDD Explorations Newsletter 6(1):7–19.

===Appendix===
Table 1: Results of the experiment performed on multiple datasets with different types of classifiers. The empty values ("-") in precision or F1 measure occur when every data point is classified as negative, so that precision is undefined and recall is 0.
 Dataset      Classifier           Method                 Precision  Recall   F1 measure
 CreditCards  Linear SVM           Analytical             0.88       0.76     0.82
 CreditCards  Linear SVM           Graph                  0.897      0.76     0.82
 CreditCards  Linear SVM           Linear SVM 1           0.88       0.66     0.75
 CreditCards  Linear SVM           Linear SVM 2           0.88       0.56     0.68
 CreditCards  Random Forest        Analytical             0.90       0.76     0.82
 CreditCards  Random Forest        Graph                  -          0        -
 CreditCards  Random Forest        Random Forest 1        0.94       0.73     0.82
 CreditCards  Random Forest        Random Forest 2        0.92       0.73     0.81
 CreditCards  Decision Tree        Analytical             -          0        -
 CreditCards  Decision Tree        Graph                  -          0        -
 CreditCards  Decision Tree        Decision Tree 1        0.6696     0.63     0.65
 CreditCards  Decision Tree        Decision Tree 2        0.7818     0.72     0.75
 CreditCards  k-nearest neighbors  Analytical             -          0        -
 CreditCards  k-nearest neighbors  Graph                  -          0        -
 CreditCards  k-nearest neighbors  k-nearest neighbors 1  0.8191     0.6417   0.7196
 CreditCards  k-nearest neighbors  k-nearest neighbors 2  0.9342     0.5917   0.7245
 CreditCards  XGBoost              Analytical             0.9231     0.7000   0.7962
 CreditCards  XGBoost              Graph                  1.0000     0.0083   0.0165
 CreditCards  XGBoost              XGBoost 1              0.9140     0.7083   0.7981
 CreditCards  XGBoost              XGBoost 2              0.9000     0.7500   0.8182
 CreditCards  Naive Bayes          Analytical             0.0836     0.8000   0.1513
 CreditCards  Naive Bayes          Graph                  -          0.0000   -
 CreditCards  Naive Bayes          Naive Bayes 1          0.0671     0.8417   0.1243
 CreditCards  Naive Bayes          Naive Bayes 2          0.0746     0.8167   0.1368
 Bank         Linear SVM           Analytical             0.1125     1.0000   0.2022
 Bank         Linear SVM           Graph                  -          0.0000   -
 Bank         Linear SVM           Linear SVM 1           0.6633     0.3368   0.4467
 Bank         Linear SVM           Linear SVM 2           0.6582     0.3359   0.4448
 Bank         Random Forest        Analytical             0.6529     0.4905   0.5602
 Bank         Random Forest        Graph                  -          0.0000   -
 Bank         Random Forest        Random Forest 1        0.6477     0.4827   0.5532
 Bank         Random Forest        Random Forest 2        0.6516     0.4732   0.5483
 Bank         Decision Tree        Analytical             -          0.0000   -
 Bank         Decision Tree        Graph                  -          0.0000   -
 Bank         Decision Tree        Decision Tree 1        -          0.0000   -
 Bank         Decision Tree        Decision Tree 2        -          0.0000   -
 Bank         k-nearest neighbors  Analytical             0.6766     0.3903   0.4951
 Bank         k-nearest neighbors  Graph                  -          0.0000   -
 Bank         k-nearest neighbors  k-nearest neighbors 1  0.6528     0.4482   0.5315
 Bank         k-nearest neighbors  k-nearest neighbors 2  0.6508     0.4473   0.5302
 Bank         XGBoost              Analytical             0.6458     0.4724   0.5456
 Bank         XGBoost              Graph                  0.6832     0.3929   0.4989
 Bank         XGBoost              XGBoost 1              0.6359     0.4870   0.5516
 Bank         XGBoost              XGBoost 2              0.6333     0.5026   0.5604
 Bank         Naive Bayes          Analytical             0.5724     0.3584   0.4408
 Bank         Naive Bayes          Graph                  0.5694     0.3083   0.4000
 Bank         Naive Bayes          Naive Bayes 1          0.5284     0.5147   0.5214
 Bank         Naive Bayes          Naive Bayes 2          0.5206     0.5121   0.5163

Table 2: Continuation of Table 1.

 Dataset      Classifier           Method                 Precision  Recall   F1 measure
 IEEEFraud    k-nearest neighbors  Analytical             -          0.0000   -
 IEEEFraud    k-nearest neighbors  Graph                  -          0.0000   -
 IEEEFraud    k-nearest neighbors  k-nearest neighbors 1  0.6380     0.0897   0.1573
 IEEEFraud    k-nearest neighbors  k-nearest neighbors 2  0.6476     0.0942   0.1645
 IEEEFraud    Naive Bayes          Analytical             0.0342     1.0000   0.0662
 IEEEFraud    Naive Bayes          Graph                  -          0.0000   -
 IEEEFraud    Naive Bayes          Naive Bayes 1          0.0873     0.1356   -
 IEEEFraud    Naive Bayes          Naive Bayes 2          0.0840     0.0954   0.0004
 IEEEFraud    Linear SVM           Analytical             0.0342     1.0000   0.0662
 IEEEFraud    Linear SVM           Graph                  0.0000     0.0000   -
 IEEEFraud    Linear SVM           Linear SVM 1           0.0175     0.0002   0.0004
 IEEEFraud    Linear SVM           Linear SVM 2           0.0870     0.0008   0.0016
 IEEEFraud    Decision Tree        Analytical             -          0.0000   -
 IEEEFraud    Decision Tree        Graph                  -          0.0000   -
 IEEEFraud    Decision Tree        Decision Tree 1        -          0.0000   -
 IEEEFraud    Decision Tree        Decision Tree 2        -          0.0000   -
 IEEEFraud    Random Forest        Analytical             0.7605     0.2647   0.3927
 IEEEFraud    Random Forest        Graph                  -          0.0000   -
 IEEEFraud    Random Forest        Random Forest 1        0.8652     0.1855   0.3055
 IEEEFraud    Random Forest        Random Forest 2        0.8641     0.1851   0.3049