=Paper=
{{Paper
|id=Vol-2600/paper19
|storemode=property
|title=Threshold Optimization in Multiple Binary Classifiers for Extreme Rare Events using Predicted Positive Data
|pdfUrl=https://ceur-ws.org/Vol-2600/paper19.pdf
|volume=Vol-2600
|authors=Edgar Robles,Fatima Zaidouni,Aliki Mavromoustaki,Payam Refael
|dblpUrl=https://dblp.org/rec/conf/aaaiss/RoblesZMR20
}}
==Threshold Optimization in Multiple Binary Classifiers for Extreme Rare Events using Predicted Positive Data==
Edgar Robles*, Fatima Zaidouni*, Aliki Mavromoustaki, Payam Refael

University of Costa Rica, San José, Costa Rica; University of Rochester, Rochester, NY, USA; Institute of Pure and Applied Mathematics, University of California Los Angeles, CA, USA; Google Inc., Los Angeles, CA, USA

{edgar.roblesarias@ucr.ac.cr, fzaidoun@u.rochester.edu, aliki.mavromoustaki@gmail.com, payamiam@google.com}

* Co-first authors: both authors contributed equally to this work.

Copyright © 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto, California, USA, March 23-25, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Abstract===
Binary classification is challenging when dealing with imbalanced data sets where the important class is made of extremely rare events, usually with a prevalence of around 0.1%. Such data sets are common in various real-world problems. Examples are medical diagnostics, where the disease is considered a rare event, or multiple types of fraud detection, where regular transactions are the most prevalent. The events are categorized as either predicted positive or predicted negative against a certain threshold. In large imbalanced data sets, it is expensive to verify, through human raters, all the predictions made by the classifier. As a result, only predicted positive events are rated. Additionally, in most industrial applications, it is useful to combine multiple classifiers to make a decision. We developed a machine learning pipeline which, combined with expert knowledge, decides on an optimal threshold. ROC curves are reformulated to true positive (TP) versus false positive (FP) curves. We propose two solutions to select an optimal threshold by maximizing the area under the curve (AUC): a graph-based approach and an analytic approach. The graph-based approach constructs a graph to select an optimal path in the threshold space, and the analytic approach minimizes an energy function. Our results agree with the Google team's manual attempts to choose the operating point in their private data sets and binary classifiers, while providing a rigorous mathematical framework. Our solutions build on the Google team's expert knowledge efforts, which identified the marginal precisions used in our methods. To further evaluate the performance of our algorithm, we split three public data sets into training and testing sets and used them to train five different classifiers. The results show an improvement of the F1 measure by 1.5%; the precision was improved by an average of 5.1%, and the recall was reduced by 1.6% on average. Depending on the purpose of the classification, we show how to reverse this trade-off.

===Introduction===
Our study focuses on data sets with extreme rare events, where the minority class accounts for 0.01% to 0.19% of the population. Such data is very common in many applications, including fraud detection where fraudsters constitute a small percentage of users, medical diagnostics, face detection, money laundering, and intrusion detection. In such data sets, the distribution of positive to negative classes is extremely imbalanced, and it is typically more important to correctly detect the positive class (Aggarwal 2014), which corresponds to the rare events. For instance, one of the examples used in the result section is a public data set on credit card fraud detection where the positive class accounts for 0.17% of all transactions.

In the rest of this paper, we provide some background about binary classifiers and their performance, summarize related work, and describe the problem statement in detail. We then propose solutions for single-classifier thresholding and for an ensemble of classifiers, where two methods are proposed: the analytic and the graph approach. Finally, we explain two experiments to benchmark our results and evaluate the performance of our algorithm, one using public data sets and the other using private Google data sets.
===Background===

====Binary Classifiers====
A binary classifier is a model that classifies its input into two classes: a positive and a negative class. In this paper, we call the classifier's input a document, and its output a score. The higher the score, the more confident the classifier is that the document belongs to the positive class, and vice versa. In the threshold binary model, a threshold number is picked and compared to the output score of the classifier to decide which class is predicted. If the score is larger than the selected threshold, the document is classified in the positive class. If the score is lower than the threshold, the document is classified in the negative class.

====Classifier performance====
To evaluate the performance of the classifier, the documents are often presented to human raters to determine the true positives, the false positives, and the true and false negatives. A higher threshold will cause the predicted class to become more precise on average while losing some recall, and a lower threshold will cause the predicted class to become less precise on average while gaining some recall. Note that

<math>\mathrm{Precision} = \frac{TP}{TP + FP}</math>

and

<math>\mathrm{Recall} = \frac{TP}{TP + FN}.</math>

This results in a trade-off where a very low threshold might yield many false positives and a high threshold might yield many false negatives. This trade-off is usually visualized with a receiver operating characteristic (ROC) curve that plots the true positive rate (TPR), defined as

<math>TPR = \frac{TP}{TP + FN},</math>

versus the false positive rate (FPR), defined as

<math>FPR = \frac{FP}{FP + TN},</math>

as suggested in (Bradley 1997) and (Provost and Fawcett 2001). A random classifier would be located along the main diagonal (Fawcett 2006), and a good classifier would have points in the upper left corner. This is visualized in Figure 1. A widely used performance measurement is the area under the curve (AUC) of the ROC. An optimal threshold is needed to minimize the false positives and negatives by maximizing the AUC.

Figure 1: Typical ROC curve plotting TPR vs FPR. A random classifier produces a line along the diagonal; the closer the points are to the upper left corner, the better the classifier performance.
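To make the threshold trade-off concrete, the following minimal Python sketch (our own illustration, not code from the paper; function names such as <code>confusion_counts</code> are hypothetical) sweeps a threshold over classifier scores with known labels and computes the quantities defined above.

<syntaxhighlight lang="python">
import numpy as np

def confusion_counts(scores, labels, t):
    """Counts TP, FP, FN, TN for the rule: predict positive when score >= t."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pred = scores >= t
    tp = int(np.sum(pred & (labels == 1)))
    fp = int(np.sum(pred & (labels == 0)))
    fn = int(np.sum(~pred & (labels == 1)))
    tn = int(np.sum(~pred & (labels == 0)))
    return tp, fp, fn, tn

def roc_points(scores, labels, thresholds):
    """Returns one (FPR, TPR) pair per threshold, tracing the ROC curve."""
    pts = []
    for t in thresholds:
        tp, fp, fn, tn = confusion_counts(scores, labels, t)
        tpr = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        pts.append((fpr, tpr))
    return pts
</syntaxhighlight>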
===Related Work===
ROC and precision-recall analysis are widely used in imbalanced data set classification, as shown for various types of rare event data sets in (Kubat, Holte, and Matwin 1998), (Weiss 1999), and (Carvalho and Freitas 2002). However, choosing an optimal threshold, also called the "optimal operating point", is only well understood for balanced data sets where there are approximately as many positives as negatives. A standard method is to take the intersection of the iso-performance tangent line in the ROC as the optimal cut-off (Hong 2009). The challenge arises when dealing with imbalanced data sets where rare events constitute the positive class. The iso-performance line method maximizes the classification accuracy, which places more weight on the majority class, therefore overlooking the minority class in which we are interested the most. This method is therefore not applicable in our case; its pitfalls are discussed in detail in (Calabrese 2014) and (Fawcett and Provost 1996). Optimizing the threshold as a method to improve classification accuracy was explored in (Weiss 2004) and (Provost 2000), where rarity and the challenges that it implies are investigated. In (Weiss and Provost 2003), the authors show that adjusting decision thresholds is an efficient way to deal with this bias. In (Calabrese 2014), the author suggests a quality measure called the cost-sensitive unweighted accuracy, which uses the true rate as defined in (Hong 2009) as a quality criterion. It finds the optimal threshold that minimizes the difference between the simple mean of the Type 1 error, associated with false positives, and the Type 2 error, associated with false negatives. To correct for the imbalanced misclassification costs, the errors are weighted by their respective costs, for which the ratio is usually known. An example using this method is found in (Roelens et al. 2018) for flood prevention data. In the context of credit card assessment, an MCDM (Multiple Criteria Decision Making) approach was proposed in (Beulah Jeba Jaya .Y 2015) to select an optimal probability cut-off using hybrid data mining approaches.

Ensemble classifiers are common methods to combine multiple classifiers, each one looking at different features of a dataset. In (Breiman 2001), a method is shown where a group of poor classifiers can be harnessed together to obtain much better performance. (Dietterich 2000) shows how different methods for voting in these ensembles of classifiers, from Bayesian averaging to error-correcting output coding and boosting, can often perform better than a single classifier. While different thresholding methods are suitable for a range of applications, we identified a few very important challenges, described below, that are faced when dealing with rare event data sets and that, to our knowledge, had not been addressed before.

===Problem Description===
In an extreme rare event problem, if one positive event happens in every ten thousand samples, one would have to label a million samples to detect 100 positive events. Like the data we are dealing with from Google, many datasets will lack information about the true and false negatives, because it is very expensive to have humans spend time rating such a large amount of data. Therefore, it makes sense that the only data classified by humans is the data that scored higher than a given score on the classifier, i.e. the predicted positive class. Not knowing the true and false negatives therefore calls for analysis methods that do not require them.

The second important challenge to address occurs when using multiple classifiers to evaluate a single document.
It often happens that we need to look at different aspects of the document using different classifiers before making a decision. For instance, one classifier can be assigned images and another classifier can look for text. Each classifier outputs a different score; after a threshold is chosen for each, the scores can be compared against the thresholds and a decision can be made about whether the document belongs to the positive or negative class. The difficulty arises in finding a threshold that optimizes all the models jointly. This takes us from a single-dimensional optimization problem to a multi-dimensional one. In the following sections, we explore the methods developed to address these challenges in the context of extreme rare event data sets.

===Proposed Methods===

====For a single classifier====
For the sake of completeness, we begin with the simple case of one classifier, where we start defining the framework of our methods. Let X_i be the independent and identically distributed (iid) random variable representing the score of positive documents. The number of true positives at a threshold t is the number of positive documents with a score greater than t. Hence, for each threshold t we have the TPR as

<math>\frac{1}{n}\sum_{i=1}^{n} I_{X_i \ge t},</math>

where n is the total number of positive samples and I_{X_i ≥ t} is the indicator function. Similarly, let Y_i be the iid random variable of the score of negative documents; then the FPR is given as

<math>\frac{1}{m}\sum_{i=1}^{m} I_{Y_i \ge t},</math>

where m is the total number of negative samples. The ROC curve is then parametrized as

<math>\left(\frac{1}{m}\sum_{i=1}^{m} I_{Y_i \ge t},\ \frac{1}{n}\sum_{i=1}^{n} I_{X_i \ge t}\right),\quad t \in \mathbb{R},</math>

and is expected to converge to (P(Y_i ≥ t), P(X_i ≥ t)). The trade-off optimization problem then becomes one of finding a threshold t that maximizes P(X_i ≥ t) and minimizes P(Y_i ≥ t). However, as mentioned in the introduction, we are addressing the situation where only the portion of the documents that scored higher than a certain threshold are rated. In other words, the distributions are left-censored: we only have I_{X_i ≥ t}, I_{Y_i ≥ t} for t ≥ 0, and we only know the total number of samples m + n but do not know either m or n. Hence, we are unable to plot an ROC curve.

Since we lack information about the true and false negatives, we use a reformulation of the ROC curve that plots TP instead of TPR and FP instead of FPR; this removes the normalization to probabilities that a normal ROC performs by dividing by the number of positives or negatives. The reformulation is referred to as the positive-negative (PN) curve as defined in [8]. In our paper, we also refer to it as a partial ROC curve. The appearance is the same between the PN and ROC curve, but the chance line (diagonal) in the ROC becomes the break-even line as known in information retrieval; a more detailed explanation is given in [8]. Mathematically,

<math>\left(\sum_{i=1}^{m^*} I_{Y_i \ge t},\ \sum_{i=1}^{n^*} I_{X_i \ge t}\right),\quad t \in \mathbb{R}_{\ge 0},</math>

where m* and n* are the total number of negative and positive documents, respectively, having scores greater than 0, for instance.

Given a classifier, a low threshold T can be set, the documents with a score higher than T can be sent to human raters to determine the TP and FP, and then, in the resulting data set, the threshold is raised to see how the TP and FP change. Note that as the threshold keeps getting higher, we have fewer data points in the sub-sample and therefore a wider confidence interval. Using the partial ROC curve implies that we cannot use the rated data to simulate the TP and FP counts had the model threshold been set even lower than T.

With the partial ROC curve, we can formulate our optimization problem as follows: given any point on the curve, if this point were the operating point and we wanted to get one additional true positive document, we can determine the cost that we need to pay in terms of a raw number of false positives. In many situations, this formulation can relate to a business goal. For example, the positive class could be unwanted documents that a company needs to filter out without losing the wanted documents. Thus, the company can provide a number quantifying how many good documents they are willing to give up in order to filter out one additional bad document. This can be determined by analyzing monetary costs, for instance. Based on this, a target derivative dTP/dFP can be chosen; it represents the number of true positives gained per additional false positive at a certain threshold. The target derivative is based on expert knowledge, which was provided for us by Google in our application.

A partial ROC curve can be described as a parametric function

<math>f(t) = (FP(t), TP(t)),\quad f : T \to \mathbb{R}^2,</math>

where T is the threshold space. A fit can be applied to find the points where the derivative has the desired value. In the result section, we present the fitting methods that worked best for typical partial ROC curves.
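As an illustration of the single-classifier setting, the sketch below (our own illustrative code, not from the paper; the name <code>partial_roc</code> is hypothetical) counts TP and FP among the rated predicted-positive documents for a grid of candidate thresholds above the rating cut-off T, which yields the points of the partial ROC (PN) curve. Raising the threshold only removes already-rated documents, so the curve can be traced upward from T but never below it, in line with the left-censoring described above.

<syntaxhighlight lang="python">
import numpy as np

def partial_roc(scores, rater_labels, thresholds):
    """Builds the PN / partial ROC curve from rated predicted positives.

    scores:        classifier scores of the documents sent to raters
                   (i.e. documents scoring above the rating cut-off T)
    rater_labels:  1 if the human rater marked the document positive, else 0
    thresholds:    candidate thresholds, all >= T
    Returns a list of (FP, TP) raw counts, one pair per threshold.
    """
    scores = np.asarray(scores)
    rater_labels = np.asarray(rater_labels)
    curve = []
    for t in thresholds:
        kept = scores >= t                      # documents still predicted positive at t
        tp = int(np.sum(kept & (rater_labels == 1)))
        fp = int(np.sum(kept & (rater_labels == 0)))
        curve.append((fp, tp))
    return curve
</syntaxhighlight>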
====For an ensemble of classifiers====
In this case, the PN curve (partial ROC) becomes an N-dimensional manifold, and dTP/dFP becomes a more complicated notion, since a local direction must be picked for a unique derivative to exist. In general, one can define the function as

<math>f(\vec{t}) = (FP(\vec{t}),\, TP(\vec{t})),\quad f : \hat{T} \to \mathbb{R}^2, \qquad (1)</math>

where T̂ is the threshold space described by the cross product of all the threshold values for each classifier. We can consider the derivative at a point to be

<math>\lim_{\Delta x \to 0} \frac{\Delta TP \cdot u}{\Delta FP \cdot u},</math>

where u is a vector that describes the direction in which the derivative is being taken. We describe a path

<math>\phi(s) = (g(s), h(s)),\quad 0 \le s \le 1,</math>

over the T̂ space, that is, a function φ : [0, 1] → T̂. We impose the following restrictions on φ for the resulting curve to behave like a partial ROC curve:
* φ(0) = (inf T_1, inf T_2, ..., inf T_n)
* φ(1) = (sup T_1, sup T_2, ..., sup T_n)
* g(s) and h(s) must be monotonic

The first two conditions ensure that the curve starts and ends at the correct points, and the last one ensures that the number of true positives and false positives decreases with respect to our new "threshold". To pick the best curve, φ should maximize the area under the curve:

<math>\operatorname*{argmax}_{\phi} \int_{\phi([0,1])} TP\, dFP.</math>

We propose two algorithms to find φ: one is based on minimizing a function (the analytic approach) and the other on building a graph (the graph approach). Before we give a step-by-step example implementation of both methods, we introduce the following simple base algorithm, shared by both, to find the optimal operating point.

 Algorithm 1: Base algorithm for finding the optimal operating point
 Input: set of points P representing (TP, FP, T1, T2, ...); target derivative d
 Output: optimal operating point (T1, T2, ...)
 1: C ← FindOptimalCurve(P)
 2: Find the point q where (d/dt) C(q) = d
 3: return q

In what follows, we explain the FindOptimalCurve function used in Algorithm 1 in the context of both the analytic and the graph approach.
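Step 2 of the base algorithm can be illustrated with a short sketch (our own, with a hypothetical helper name; a finite-difference slope estimate stands in for whatever derivative estimate is used in practice): given a discretized curve of (FP, TP) points produced by FindOptimalCurve, it picks the point whose local slope dTP/dFP is closest to the target derivative.

<syntaxhighlight lang="python">
import numpy as np

def pick_operating_point(curve, target_derivative):
    """Illustrative version of step 2 of Algorithm 1.

    curve: array of shape (k, 2) holding (FP, TP) points, ordered along the
           path parameter t (e.g. FindOptimalCurve's output sampled on a grid).
    target_derivative: desired marginal precision dTP/dFP.
    Returns the index of the point whose local slope is closest to the target.
    """
    curve = np.asarray(curve, dtype=float)
    dfp = np.gradient(curve[:, 0])
    dtp = np.gradient(curve[:, 1])
    with np.errstate(divide="ignore", invalid="ignore"):
        slope = np.where(dfp != 0, dtp / dfp, np.inf)   # finite-difference estimate of dTP/dFP
    return int(np.argmin(np.abs(slope - target_derivative)))
</syntaxhighlight>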
====Analytic approach====
The core of this approach is the use of a family of functions whose parameters are optimized so as to maximize the area under the corresponding PN curve. The approach estimates the integral yielding the area under the curve using the simple trapezoidal method. The implementation is summarized in Algorithm 2.

 Algorithm 2: Analytic implementation of FindOptimalCurve
 Input: set of points P representing (TP, FP, T1, T2, ...); a family of curves f_p defined on [0, 1] → T
 Output: a curve C
 1: Fit a function F̂(T1, T2, ...) as the best fit for the points (FP, T1, T2, ...)
 2: Fit a function T̂(T1, T2, ...) as the best fit for the points (TP, T1, T2, ...)
 3: Let g_p ← (T̂, F̂) ∘ f_p
 4: q ← argmax_p AUC(g_p)
 5: return {(t, f_q(t)) ∀ t ∈ [0, 1]}

We describe the implementation of this algorithm in more detail, using two thresholds as an example, in the following steps:

1. Normalizing the thresholds.

2. Plotting the three-dimensional plots (TP, T1, T2) and (FP, T1, T2) and fitting a surface through each one of them using an interpolation technique. We create an n × n grid on the threshold plane, with each point on the grid corresponding to a pair of thresholds and a value of TP and FP. This step allows finding the estimators T̂P and F̂P at each point of the surface.

3. Turning the grid into a linear space; since the data is normalized at this point, the domain will be from 0 to 1. We can also specify the granularity, i.e. the size of the partitions in this domain. In the next steps, the granularity will also indicate the number of AUC samples that are calculated in the process. This step is the start of the parametrization used in this approach. If s denotes the parameter indicating the order of the partitions, then both thresholds are a function of s, which is then used as an index in a separate parametric function to output the corresponding TP and FP values separately.

4. Defining an energy function as follows:

<math>E(\vec{p}) = AUC\bigl(\widehat{FP} \circ \phi_{\vec{p}},\ \widehat{TP} \circ \phi_{\vec{p}}\bigr),</math>

where φ_p⃗ is a family of parametric functions defined from [0, 1] to T1 × T2, with p⃗ being the parameter. In our algorithm, we used a parametric function that is graphically close to the resulting discrete path and that passes through the bounds (0, 0) and (1, 1):

<math>\phi_{\vec{p}}(x) = x^{p}. \qquad (2)</math>

AUC is a function that calculates the area under the curve using the trapezoidal formula; it is an estimate of the integral form of the area under the curve, described by <math>\int_{\{(x,\,\phi_{\vec{p}}(x))\}} TP\, dFP</math>.

5. Minimizing the function below using the Nelder-Mead minimization method:

<math>-\log(\log(E)).</math>

Note that we made the function negative so that a minimization method maximizes the AUC. We also composed E with two logarithms in order to make the optimization converge faster; otherwise, it might get stuck in local minima.

6. Creating a function that determines the curve in the TP-FP space that corresponds to the resulting path within the threshold space, using the parameterization mentioned in Step 3.

7. Finding the point s where dTP/dFP with respect to the parameterization s equals the target marginal precision, by estimating the derivative of f ∘ φ.

8. Finding the optimal operating point (T1, T2, TP, FP) given a target marginal precision, i.e. the threshold values where φ(t) = s. It can be interpolated at the value of s given by the marginal precision closest to the target. Ridders' method can be used for root finding to get the best parameter s.
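A minimal sketch of the analytic approach for two thresholds, assuming thresholds already normalized to [0, 1] and TP/FP given as raw counts on a regular grid (so that E > 1 and log(log(E)) is defined). SciPy interpolation and Nelder-Mead are our implementation choices; the helper names (<code>fit_surfaces</code>, <code>energy</code>, <code>find_optimal_curve</code>) are hypothetical, not the authors' code.

<syntaxhighlight lang="python">
import numpy as np
from scipy.interpolate import RegularGridInterpolator
from scipy.optimize import minimize

def fit_surfaces(t1_grid, t2_grid, tp_grid, fp_grid):
    """Step 2: interpolate the TP and FP surfaces over the (T1, T2) grid."""
    tp_hat = RegularGridInterpolator((t1_grid, t2_grid), tp_grid)
    fp_hat = RegularGridInterpolator((t1_grid, t2_grid), fp_grid)
    return tp_hat, fp_hat

def energy(p, tp_hat, fp_hat, n_samples=100):
    """Steps 3-4: trapezoidal AUC (raw counts) along the path phi_p(x) = (x, x**p)."""
    pw = max(float(p[0]), 1e-3)                    # keep the exponent positive so the path stays in [0, 1]^2
    x = np.linspace(0.0, 1.0, n_samples)
    pts = np.column_stack([x, x ** pw])            # path through (0, 0) and (1, 1)
    return abs(np.trapz(tp_hat(pts), fp_hat(pts)))  # area under the TP-vs-FP curve

def find_optimal_curve(t1_grid, t2_grid, tp_grid, fp_grid, p0=1.0):
    """Step 5: maximize the AUC by minimizing -log(log(E)) with Nelder-Mead."""
    tp_hat, fp_hat = fit_surfaces(t1_grid, t2_grid, tp_grid, fp_grid)
    objective = lambda p: -np.log(np.log(energy(p, tp_hat, fp_hat)))
    res = minimize(objective, x0=[p0], method="Nelder-Mead")
    return res.x  # optimal exponent p of phi_p
</syntaxhighlight>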
====Graph-based approach====
The goal of the graph-based approach is again to find the continuous path in the threshold space that maximizes the area under the curve in the TP-FP space, now with the constraint that every point is only allowed to move through its neighbors. We defined the node, edge, and weight of the graph shown in Figure 2, for two dimensions, as follows:
* Node v: represents a pair of thresholds, having attributes (T1, T2, TP, FP).
* Edge e: a connection between two nodes is established if and only if one of their thresholds is equal and the other threshold is consecutive in the discrete set of thresholds.
* Weight of edge e = (v1, v2): equal to the negative of the area under the segment (trapezoid) between (FP(v1), TP(v1)) and (FP(v2), TP(v2)), as schematized in Figure 3. Mathematically, the weight is defined as

<math>w(e) = -\tfrac{1}{2}\,\bigl|FP(v_1) - FP(v_2)\bigr|\,\bigl(TP(v_1) + TP(v_2)\bigr). \qquad (3)</math>

Note that the predicted positives have to be non-negative, hence the segment is above the x-axis. The weight is non-positive, therefore a lower weight indicates a larger area under the segment.

Figure 2: The graph used in this approach; each circle represents a node, which is a 4-tuple from the data (T1, T2, TP, FP), and each arrow represents an edge.

Figure 3: The trapezoidal area under a single segment between two nodes; the weight is defined to be the negative of the highlighted area.

The total weight of the edges in a path is then the negative of the area under the curve swept out in the TP vs FP plot. In other words, by fixing the end points, we can find a better TP vs FP curve by finding a path that has minimal weight. So, it suffices to find the shortest path from v_0 to v_∞, with (t1(v_0), t2(v_0)) = (0, 0) and (t1(v_∞), t2(v_∞)) = (∞, ∞). Intuitively, this finds a shortest path from the bottom left corner to the top right corner of the threshold space T1 × T2. The edges are directed up or to the right to avoid bent paths in the corresponding TP vs FP curve. After we find the set of nodes in the shortest path, we connect the corresponding points and re-parametrize the path. Note that the resulting curve is continuous but not smooth. The algorithm for this method is the following:

 Algorithm 3: Graph-based implementation of FindOptimalCurve
 Input: set of points P representing (TP, FP, T1, T2, ...) sampled on a grid with respect to the thresholds
 Output: a curve C
 1: Create a graph G connecting points to neighbors with a greater coordinate than them
 2: Let the origin be A_0 and the point with the biggest coordinates be A_∞
 3: P ← FindOptimalPath(G, A_0, A_∞)
 4: Let f(t) : [0, 1] → FP × TP be such that it passes through the path P
 5: return {(t, f(t)) ∀ t ∈ [0, 1]}

Below we explain this approach in detail, starting from step 3 of the analytic approach for a two-dimensional example, since the first two steps are similar.

1. Transforming the threshold space into an n × n grid, with each point on the grid corresponding to a pair of thresholds and a value of TP and FP.

2. Calculating the edge weights as the negative area under the segment formed by two points v1, v2 in the TP vs FP curve, using Equation 3.

3. Determining the weight of the edge connecting the nodes of two consecutive points.

4. Storing the (TP, FP, T1, T2) tuple in every node as extra information. An edge can then be defined as a 3-tuple (n_k, n_l, w_{k,l}), where n_k and n_l are the two nodes and w_{k,l} is the negative area under the segment connecting them. The nodes are connected if they are within the same neighborhood, that is, if one of the thresholds is the same between two neighbors and the other thresholds are consecutive in the discrete set of thresholds. The path selected is therefore just a collection of nodes from the bottom left to the top right corner in threshold space, and the total weight of the path corresponds to the negative area under the curve swept out by it in the TP-FP space. Note that in order to compute the shortest path between the lower-left corner (source node) and the top-right corner node, the Bellman-Ford algorithm can be used, since, unlike the usual Dijkstra's algorithm, it accepts edges with negative weights.

5. Connecting the nodes to obtain a continuous path in the threshold space and determining the curve in the TP-FP space that corresponds to the selected path, using the same parametrization explained in step 3 of the analytic approach.

6. Getting the marginal precision; refer to step 7 of the analytic approach.

7. Determining the optimal operating point given a target marginal precision; refer to step 8 of the analytic approach.
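A compact sketch of the graph construction and shortest-path step, using networkx as an implementation choice of ours (the paper only specifies the Bellman-Ford algorithm); <code>tp</code> and <code>fp</code> are assumed to be n × n arrays indexed by the two threshold grids, and the function name is hypothetical.

<syntaxhighlight lang="python">
import networkx as nx

def graph_optimal_path(tp, fp):
    """Builds the grid graph of Figure 2 and returns the minimal-weight
    (i.e. maximal-AUC) path of grid indices from (0, 0) to (n-1, n-1).

    tp, fp: n x n arrays of true/false positive counts, one entry per
            (T1, T2) threshold pair on the grid.
    """
    n1, n2 = tp.shape
    G = nx.DiGraph()
    for i in range(n1):
        for j in range(n2):
            # connect each node only to its "up" and "right" neighbors
            for di, dj in ((1, 0), (0, 1)):
                k, l = i + di, j + dj
                if k < n1 and l < n2:
                    # Equation 3: negative trapezoid area under the segment
                    w = -0.5 * abs(fp[i, j] - fp[k, l]) * (tp[i, j] + tp[k, l])
                    G.add_edge((i, j), (k, l), weight=w)
    # Negative weights on a directed acyclic grid: Bellman-Ford instead of Dijkstra
    return nx.bellman_ford_path(G, (0, 0), (n1 - 1, n2 - 1), weight="weight")
</syntaxhighlight>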
===Experiments and Results===

====Evaluating the performance using public datasets====
To evaluate the performance of the method, we use rare event data sets where the number of false negatives and true positives can be calculated, in order to compare the precision and recall of two standard classifiers with our methods applied to the resulting ensemble of classifiers. This was done using three rare event datasets whose true and false positive and negative values are known: a credit card fraud data set (https://www.kaggle.com/mlg-ulb/creditcardfraud), a bank marketing data set (https://www.kaggle.com/henriqueyamahata/bank-marketing), and the IEEE fraud detection data set (https://www.kaggle.com/c/ieee-fraud-detection).

Each dataset was cleaned by replacing empty values with the average value of the column, turning categorical features into one-hot encoded vectors, and encoding cyclical values, such as days of the week, into points on a unit circle. These datasets were then passed through a PCA (Principal Component Analysis) procedure to reduce the dimensionality. The data sets were then split into a training set T and a holdout testing set H, where T represents 75% of the data set and H represents the other 25%.

Each classifier was trained using the first 20% of the components of the PCA. We need to simulate a joint optimization of two classifiers, where each one makes decisions over a separate but related data set. Thus, one of the classifiers was assigned the remaining even components and a second classifier of the same type was assigned the remaining odd components.

The hyperparameters for each classifier were tuned by maximizing the average accuracy in a cross-validation test. The cross-validation test was done by splitting the training set T into 5 different cross-validation groups T_1, T_2, ..., T_5, where for each group T_i a classifier is trained using the set ∪_{j≠i} T_j, and the score for the i-th fold, denoted s_i, is obtained by evaluating the accuracy of the classifier on T_i. The hyperparameters with the highest average value of s_1, ..., s_5 are picked.

The classifiers used for the testing were linear SVMs, random forests, decision trees, k-nearest neighbors, XGBoost, and Naive Bayes. We explored multiple classification techniques to determine whether our methods improve their performance. We relied on (Pedregosa et al. 2011) for a compilation of classification algorithms. In all classifiers, we estimated the hyperparameters through a grid search with cross-validation. For the k-nearest neighbors classifier, we tested all values of k from 1 to √n, where n is the number of points in the training data, as well as the L1 and L2 metrics; the default behavior for classification is to pick the class with the largest number of nearest neighbors assigned to it. The decision tree classifier used the grid search coupled with the Gini and entropy criteria; we optimized the maximum depth parameter over all integer values between 1 and 500, as well as without setting a maximum depth, and the default classification method is to follow the decisions of the tree. The random forest classifier also used the Gini and entropy criteria with the number of estimators between 50 and 500, in steps of 10; the default behavior for classification is to let each decision tree in the forest vote, and the class with the most votes wins.
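The per-classifier searches described here can be reproduced with scikit-learn's grid search with cross-validation. The sketch below shows the pattern for the random forest just described (the parameter ranges follow the text; the 75/25 split, the function name, and everything else are illustrative assumptions, not the authors' code).

<syntaxhighlight lang="python">
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

def tune_random_forest(X, y):
    """5-fold grid search over the criteria and estimator counts given in the text."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)   # 75% train / 25% holdout
    param_grid = {
        "criterion": ["gini", "entropy"],
        "n_estimators": list(range(50, 501, 10)),
    }
    search = GridSearchCV(RandomForestClassifier(), param_grid,
                          scoring="accuracy", cv=5)
    search.fit(X_train, y_train)
    return search.best_estimator_, (X_test, y_test)
</syntaxhighlight>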
For the extreme gradient boosting classifier (XGBoost), we used a maximum depth from 1 to 20, a learning rate from 0.1 to 1 in steps of 0.1, and a number of estimators from 50 to 500 in steps of 10; the default classification behavior is the same as for random forests. For the linear SVMs, we again used the grid search with cross-validation to choose between the L1 and L2 norms for the penalty; the default threshold used was 0. For the SVMs, we used the hinge or squared hinge loss functions to train the classifier, with the dual formulation set to true or false, and all values of the regularization parameter λ from 0 to 1 in steps of 0.1; the default threshold is 0. We also used a Naive Bayes classifier with no optimized parameters, whose default classification behavior is to choose the class with the highest probability.

Figure 4: The F1 measure of the best performing method to create the ensemble against the average F1 measure of the two classifiers that compose it, for each pair of dataset and classifier type. The classifier types are abbreviated as follows: KNN is k-nearest neighbors, SVM is a linear SVM, NB is Naive Bayes, RF is Random Forests and XGB is XGBoost.

Figure 5: The precision of the best performing method to create the ensemble against the average F1 measure of the two classifiers that compose it, for each pair of dataset and classifier type. Abbreviations as in Figure 4.

A grid was made from the cross product of equally spaced threshold values in each dimension. An input is considered positive if either of the scores produced by the two classifiers surpasses its own threshold; this means that we are taking the union of the classifications. Using the scores output by the classifiers, we calculated the number of false positives and true positives for each pair of thresholds in the grid. This was the input for Algorithm 1. We then calculated the precision and recall on the testing set. To evaluate performance, the F1 measure was used instead of the accuracy score. The latter is not suitable for rare event data sets, where a difference in accuracy of 0.01% can make a huge difference in both the precision and the recall. The F1 measure is calculated as

<math>F_1 = 2\,\frac{p\,r}{p + r},</math>

where p is the precision and r is the recall.
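The construction of the threshold grid and the "union" decision rule described above can be sketched as follows (our own illustration with hypothetical names; the grid resolution is an assumption). The rows it returns are the (T1, T2, TP, FP) points that Algorithm 1 expects as input.

<syntaxhighlight lang="python">
import numpy as np

def ensemble_grid(scores1, scores2, labels, n_points=50):
    """Counts TP and FP of the 'union' ensemble on an n x n threshold grid.

    scores1, scores2: scores of the two classifiers on the same documents
    labels:           ground-truth labels (1 = positive)
    Returns rows (t1, t2, TP, FP), i.e. the input expected by Algorithm 1.
    """
    scores1, scores2 = np.asarray(scores1), np.asarray(scores2)
    labels = np.asarray(labels)
    t1_grid = np.linspace(scores1.min(), scores1.max(), n_points)
    t2_grid = np.linspace(scores2.min(), scores2.max(), n_points)
    rows = []
    for t1 in t1_grid:
        for t2 in t2_grid:
            pred = (scores1 >= t1) | (scores2 >= t2)   # union of the two decisions
            tp = int(np.sum(pred & (labels == 1)))
            fp = int(np.sum(pred & (labels == 0)))
            rows.append((t1, t2, tp, fp))
    return rows

def f1(precision, recall):
    """F1 measure as defined in the text."""
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
</syntaxhighlight>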
Figure 6: The recall of the best performing method to create the ensemble against the average F1 measure of the two classifiers that compose it, for each pair of dataset and classifier type. Abbreviations as in Figure 4.

Figures 4, 5 and 6 show the F1 measure, precision, and recall, respectively, for each experiment. An experiment was defined as a pair of a dataset and a classifier type. When either the precision or the recall was extremely low, the other quantity is reported as undefined; in this case, taking the union of the classifiers' decisions results in the ensemble classifying every document as positive. Overall, the precision was increased on average by 5.1%, the recall was reduced by an average of 1.6%, and the F1 measure increased by an average of 1.5%.

Table 1 and Table 2 of the appendix show the results for all the simulated experiments with all the mentioned classifiers. Undefined values are shown as "-" in the tables. As the F1 measure of the classifiers that do not use the proposed methods increases, the F1 measure of the ensemble resulting from the proposed methods also increases. This trend is shown in Figure 7, which also plots the identity line: the proposed methods improve the F1 measure for all the points above this line.

====Benchmark using Google data sets====
The Google data contains 22 data sets, each consisting of about 10,000 rows and 4 columns. The first two columns represent two thresholds of two different classifiers, and the next two represent the number of true and false positives for each pair of thresholds as determined by human raters, assuming all documents over the threshold are considered part of the positive class. These datasets have a prevalence of the positive class of about 0.01% of the total number of rows.

The result of running Algorithms 2 and 3 is shown in Figure 10 for an example data set. The figure represents the threshold paths obtained using both the analytic and graph approach, as well as the location of the operating point. Using the parameterization from Equation 1, we can use the threshold paths to determine the optimal partial ROC curve that was selected by each of these algorithms.
Next, we can determine the optimal operating point given a target marginal precision, as determined by a business goal set by Google's expert knowledge efforts prior to this work. The result of running Algorithm 1 is shown for three different methods in Figure 8 in the threshold space and, correspondingly, in Figure 9 in the TP vs FP space.

Figure 7: The average F1 measure of the classifiers without the proposed method compared to the F1 measure of the ensemble with the proposed method.

Figure 8: The operating point picked by each algorithm in the threshold space.

Figure 9: The operating point picked by each algorithm in the TP vs FP space.

The analytic and graph methods shown in Figures 8 and 9 are the proposed methods, while the benchmark is a fine-tuned operating point used by Google, based on separating the false positives into small intervals called buckets, picking the highest TP for each bucket, and then adjusting this manually to fit the goals of the optimal operating point. While the latter is good as a benchmark for our algorithms, it cannot be used as another optimization method, since it lacks mathematical rigor and produces erratic behavior in the threshold space which, without fine-tuning, would imply the non-existence or non-uniqueness of the derivative of the partial ROC curve needed to find the operating point. Our methods closely approximate the operating point selected by the benchmark method. These points are also plotted on their respective paths in Figure 10.

Figure 10: The path picked by the analytic (top) and graph method (bottom), using the function in Equation 2 for the analytic method.

===Discussion===
We used an "or" (union) when combining the decisions of the classifiers, but it can be substituted with an "and" operation, depending on whether the application prioritizes positive or negative classification. Our methods also assume that the ROC curve of each classifier has a unique decreasing derivative, which can be expected from a tuned classifier.

Both the analytic and graph methods were presented because, although both are useful, each has different disadvantages. The analytic method is heavily dependent on the family of functions used to optimize over, and is prone to getting stuck in a local minimum. On the other hand, it is uncertain whether the graph method is prone to overfitting the data; unlike with the analytic method, however, there is no way to mitigate it. A possible alteration to mitigate overfitting with the graph method would be to run a series of simulations in which some connections between the nodes are deleted, and then pick the average path over the resulting set of paths.

Both algorithms scale reasonably well with respect to the amount of data. The graph algorithm scales at a rate of O(n^k), where n is the number of rows and k is the number of classifiers. This is good when the number of dimensions is low; however, it scales exponentially with the number of classifiers in the ensemble. The analytic approach works the opposite way: since it depends on an optimization problem, it does not scale very well with respect to the number of rows, but it scales linearly with the number of dimensions. We compared the running times of the algorithms using a computer with an i7-6700 and 32 GB of RAM for all the Google data sets combined. The analytic approach ran on average 10 times faster than the graph approach; for n = 10,000 and k = 2, the analytic method takes about 6 seconds while the graph method takes about 60 seconds under these conditions.

Finally, the experiments were done using an automated method to tune the classifiers, notably grid search with cross-validation, in order to obtain a large number of samples across different accuracies. However, this could be improved, since it does not reflect a real-world application of this process, in which the hyperparameters would be estimated with the help of additional fine-tuning methods that depend on the specific application of the data set.
===Conclusion===
The proposed algorithm finds optimal thresholds for the output of binary classifiers used on rare event datasets. The method works by finding the best possible path in the TP-FP space, namely the one that maximizes the area under the curve of the projected partial ROC curve. This curve is then used to find the optimal operating point corresponding to a given target derivative. The latter represents the trade-off of lost false positives to gained true positives, and was obtained from the Google team's expert knowledge efforts prior to this work. Two optimization methods are proposed: an analytical method, which optimizes the parameters of an energy function, and a graph method, which performs the optimization by finding the path in a graph that maximizes the AUC.

Our results were very close to the optimal operating points that were picked manually by the Google team. We were therefore able to automate their process of finding an optimal threshold while providing a mathematical framework and using their target derivatives (marginal precisions). Moreover, we compared the F1 measure, precision, and recall of combining five classifiers and three datasets, either relying on our thresholding methods and taking the union of the decisions, or using the classifiers individually with their default classification behavior. The result improved the precision while lowering the recall less significantly. This trade-off could be reversed, if desired, by simply changing the ensemble decision between the individual classifiers from "or" to "and". We discussed important limitations and possibilities for future work. Notably, optimizing metrics other than the AUC, individually or in combination, would be worth exploring. Moreover, we optimized two binary classifiers jointly, but the theoretical framework described in the methods section extends to N dimensions. Using it in practice will require optimizing over multiple parameters in the analytic approach and extending the graph to an N-dimensional lattice. Exploring this possibility through practical applications might open great avenues for the general optimization of ensemble classifiers.

===Acknowledgements===
We would like to thank our teammates Ching Pui Wan and Joanne Beckford, NSF grant DMS-0931852, Payam Refael and his team from Google, our academic mentor Aliki Mavromoustaki, and all of the IPAM staff at UCLA for making this work possible.

===References===
Aggarwal, C. C. 2014. Data classification: algorithms and applications. CRC Press, Taylor & Francis Group.

Beulah Jeba Jaya .Y, J. J. T. 2015. Multiple criteria decision making based credit risk prediction using optimal cut-off point approach. International Journal of Applied Engineering Research 10:20041–20054.

Bradley, A. P. 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30(7):1145–1159.

Breiman, L. 2001. Random forests. Machine Learning 45(1):5–32.

Calabrese, R. 2014. Optimal cut-off for rare events and unbalanced misclassification costs. Journal of Applied Statistics 41(8):1678–1693.

Carvalho, D. R., and Freitas, A. A. 2002. A genetic-algorithm for discovering small-disjunct rules in data mining. Applied Soft Computing 2(2):75–88.

Dietterich, T. G. 2000. Ensemble methods in machine learning. In Multiple Classifier Systems, 1–15. Berlin, Heidelberg: Springer.

Fawcett, T., and Provost, F. 1996. Combining data mining and machine learning for effective user profiling. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 8–13. AAAI Press.

Fawcett, T. 2006. An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874.

Hong, C. S. 2009. Optimal threshold from ROC and CAP curves. Communications in Statistics - Simulation and Computation 38(10):2060–2072.

Kubat, M.; Holte, R. C.; and Matwin, S. 1998. Machine learning for the detection of oil spills in satellite radar images. Machine Learning 30(2-3):195–215.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.

Provost, F., and Fawcett, T. 2001. Robust classification for imprecise environments. Machine Learning 42(3):203–231.

Provost, F. 2000. Machine learning from imbalanced data sets 101. In Proceedings of the AAAI 2000 Workshop on Imbalanced Data Sets, volume 68, 1–3. AAAI Press.

Roelens, J.; Rosier, I.; Dondeyne, S.; Van Orshoven, J.; and Diels, J. 2018. Extracting drainage networks and their connectivity using LiDAR data. Hydrological Processes 32(8):1026–1037.

Weiss, G. M., and Provost, F. 2003. Learning when training data are costly: The effect of class distribution on tree induction. Journal of Artificial Intelligence Research 19:315–354.

Weiss, G. M. 1999. Timeweaver: A genetic algorithm for identifying predictive patterns in sequences of events. In Proceedings of the 1st Annual Conference on Genetic and Evolutionary Computation, Volume 1, 718–725. Morgan Kaufmann Publishers Inc.

Weiss, G. M. 2004. Mining with rarity: a unifying framework. ACM SIGKDD Explorations Newsletter 6(1):7–19.

===Appendix===
Table 1: Results of the experiment performed on multiple datasets with different types of classifiers. The empty values ("-") in precision or F1 measure occur when every data point is classified as negative, so that precision is undefined and recall is 0.
 Dataset      Classifier           Method                 Precision  Recall   F1 measure
 CreditCards  Linear SVM           Analytical             0.88       0.76     0.82
 CreditCards  Linear SVM           Graph                  0.897      0.76     0.82
 CreditCards  Linear SVM           Linear SVM 1           0.88       0.66     0.75
 CreditCards  Linear SVM           Linear SVM 2           0.88       0.56     0.68
 CreditCards  Random Forest        Analytical             0.90       0.76     0.82
 CreditCards  Random Forest        Graph                  -          0        -
 CreditCards  Random Forest        Random Forest 1        0.94       0.73     0.82
 CreditCards  Random Forest        Random Forest 2        0.92       0.73     0.81
 CreditCards  Decision Tree        Analytical             -          0        -
 CreditCards  Decision Tree        Graph                  -          0        -
 CreditCards  Decision Tree        Decision Tree 1        0.6696     0.63     0.65
 CreditCards  Decision Tree        Decision Tree 2        0.7818     0.72     0.75
 CreditCards  k-nearest neighbors  Analytical             -          0        -
 CreditCards  k-nearest neighbors  Graph                  -          0        -
 CreditCards  k-nearest neighbors  k-nearest neighbors 1  0.8191     0.6417   0.7196
 CreditCards  k-nearest neighbors  k-nearest neighbors 2  0.9342     0.5917   0.7245
 CreditCards  XGBoost              Analytical             0.9231     0.7000   0.7962
 CreditCards  XGBoost              Graph                  1.0000     0.0083   0.0165
 CreditCards  XGBoost              XGBoost 1              0.9140     0.7083   0.7981
 CreditCards  XGBoost              XGBoost 2              0.9000     0.7500   0.8182
 CreditCards  Naive Bayes          Analytical             0.0836     0.8000   0.1513
 CreditCards  Naive Bayes          Graph                  -          0.0000   -
 CreditCards  Naive Bayes          Naive Bayes 1          0.0671     0.8417   0.1243
 CreditCards  Naive Bayes          Naive Bayes 2          0.0746     0.8167   0.1368
 Bank         Linear SVM           Analytical             0.1125     1.0000   0.2022
 Bank         Linear SVM           Graph                  -          0.0000   -
 Bank         Linear SVM           Linear SVM 1           0.6633     0.3368   0.4467
 Bank         Linear SVM           Linear SVM 2           0.6582     0.3359   0.4448
 Bank         Random Forest        Analytical             0.6529     0.4905   0.5602
 Bank         Random Forest        Graph                  -          0.0000   -
 Bank         Random Forest        Random Forest 1        0.6477     0.4827   0.5532
 Bank         Random Forest        Random Forest 2        0.6516     0.4732   0.5483
 Bank         Decision Tree        Analytical             -          0.0000   -
 Bank         Decision Tree        Graph                  -          0.0000   -
 Bank         Decision Tree        Decision Tree 1        -          0.0000   -
 Bank         Decision Tree        Decision Tree 2        -          0.0000   -
 Bank         k-nearest neighbors  Analytical             0.6766     0.3903   0.4951
 Bank         k-nearest neighbors  Graph                  -          0.0000   -
 Bank         k-nearest neighbors  k-nearest neighbors 1  0.6528     0.4482   0.5315
 Bank         k-nearest neighbors  k-nearest neighbors 2  0.6508     0.4473   0.5302
 Bank         XGBoost              Analytical             0.6458     0.4724   0.5456
 Bank         XGBoost              Graph                  0.6832     0.3929   0.4989
 Bank         XGBoost              XGBoost 1              0.6359     0.4870   0.5516
 Bank         XGBoost              XGBoost 2              0.6333     0.5026   0.5604
 Bank         Naive Bayes          Analytical             0.5724     0.3584   0.4408
 Bank         Naive Bayes          Graph                  0.5694     0.3083   0.4000
 Bank         Naive Bayes          Naive Bayes 1          0.5284     0.5147   0.5214
 Bank         Naive Bayes          Naive Bayes 2          0.5206     0.5121   0.5163

Table 2: Continuation of Table 1.

 Dataset      Classifier           Method                 Precision  Recall   F1 measure
 IEEEFraud    k-nearest neighbors  Analytical             -          0.0000   -
 IEEEFraud    k-nearest neighbors  Graph                  -          0.0000   -
 IEEEFraud    k-nearest neighbors  k-nearest neighbors 1  0.6380     0.0897   0.1573
 IEEEFraud    k-nearest neighbors  k-nearest neighbors 2  0.6476     0.0942   0.1645
 IEEEFraud    Naive Bayes          Analytical             0.0342     1.0000   0.0662
 IEEEFraud    Naive Bayes          Graph                  -          0.0000   -
 IEEEFraud    Naive Bayes          Naive Bayes 1          0.0873     0.1356   -
 IEEEFraud    Naive Bayes          Naive Bayes 2          0.0840     0.0954   0.0004
 IEEEFraud    Linear SVM           Analytical             0.0342     1.0000   0.0662
 IEEEFraud    Linear SVM           Graph                  0.0000     0.0000   -
 IEEEFraud    Linear SVM           Linear SVM 1           0.0175     0.0002   0.0004
 IEEEFraud    Linear SVM           Linear SVM 2           0.0870     0.0008   0.0016
 IEEEFraud    Decision Tree        Analytical             -          0.0000   -
 IEEEFraud    Decision Tree        Graph                  -          0.0000   -
 IEEEFraud    Decision Tree        Decision Tree 1        -          0.0000   -
 IEEEFraud    Decision Tree        Decision Tree 2        -          0.0000   -
 IEEEFraud    Random Forest        Analytical             0.7605     0.2647   0.3927
 IEEEFraud    Random Forest        Graph                  -          0.0000   -
 IEEEFraud    Random Forest        Random Forest 1        0.8652     0.1855   0.3055
 IEEEFraud    Random Forest        Random Forest 2        0.8641     0.1851   0.3049