Is 𝐹1 Score Suboptimal for Cybersecurity Models? Introducing 𝐶𝑠𝑐𝑜𝑟𝑒, a Cost-Aware Alternative for Model Assessment

Manish Marwah¹*, Asad Narayanan², Stephan Jou², Martin Arlitt² and Maria Pospelova²

¹ OpenText, USA
² OpenText, Canada


Abstract
The costs of errors related to machine learning classifiers, namely false positives and false negatives, are not equal and are application dependent. For example, in cybersecurity applications, the cost of not detecting an attack is very different from that of marking a benign activity as an attack. Various design choices during machine learning model building, such as hyperparameter tuning and model selection, allow a data scientist to trade off between these two errors. However, most of the commonly used metrics to evaluate model quality, such as 𝐹1 score, which is defined in terms of model precision and recall, treat both of these errors equally, making it difficult for users to optimize for the actual cost of these errors. In this paper, we propose a new cost-aware metric, 𝐶𝑠𝑐𝑜𝑟𝑒, based on precision and recall, that can replace 𝐹1 score for model evaluation and selection. It includes a cost ratio that takes into account the differing costs of handling false positives and false negatives. We derive and characterize the new cost metric and compare it to 𝐹1 score. Further, we use this metric for model thresholding on five cybersecurity-related datasets at multiple cost ratios. The results show an average cost savings of 49%.

Keywords
machine learning, cybersecurity, 𝐹1 score, 𝐶𝑠𝑐𝑜𝑟𝑒, misclassification, cost-sensitive machine learning, false positive, false negative




                         1. Introduction
                         Applications of machine learning in cybersecurity are widespread and rapidly growing, with models
                         being deployed to prevent, detect, and respond to threats such as malware, intrusion, fraud, and phishing.
                         The main metric for assessing the performance of classification models is 𝐹1 score [1], also known as
                         𝐹1 measure, which is the harmonic mean of precision and recall.
                            While 𝐹1 score is used for assessing models, it is not directly used as a loss function since it is not
                         differentiable (or convex). A simpler and commonly used approach is a two-stage optimization process.
                         First, a model is trained using a conventional loss function such as cross-entropy, and then an optimal
                         threshold is selected based on 𝐹1 score [2].
                            𝐹1 score works particularly well in highly imbalanced settings prevalent in cybersecurity. However,
it treats both kinds of errors a machine learning classifier can make – false positives (FPs) and false negatives (FNs) – equally. Usually, the cost of these errors is unequal and depends on the application
                         context. For example, in cybersecurity applications, while both these errors can have severe negative
                         consequences, one error might be preferred over the other. Specifically, false positives lead to alarm
                         fatigue, a phenomenon where a high frequency of false alarms causes operators to ignore or dismiss
                         all alarms. This problem is often exacerbated by base rate fallacy, where people underestimate the
                         potential volume of false positives due to a high true positive rate while ignoring a low base rate.1 False
                         negatives, on the other hand, imply that a vulnerability or attack has gone undetected. While ideally

CAMLIS'24: Conference on Applied Machine Learning for Information Security, October 24–25, 2024, Arlington, VA
* Corresponding author.
mmarwah@opentext.com (M. Marwah); anarayanan@opentext.com (A. Narayanan); sjou@opentext.com (S. Jou); marlitt@opentext.com (M. Arlitt); mpospelova@opentext.com (M. Pospelova)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 Even when the true positive rate (TPR), that is, 𝑃(𝐴|𝑉), is high, the probability that an alarm corresponds to a real threat or vulnerability, that is, 𝑃(𝑉|𝐴), is usually very low. This follows directly from Bayes rule: 𝑃(𝑉|𝐴) ∝ 𝑃(𝐴|𝑉) · 𝑃(𝑉) and the fact that the base rate, 𝑃(𝑉), is usually very low.

one would want to minimize both these errors, in practice there is a trade-off between the number of
FPs and FNs. An organization, based on its goals and requirements, may assign differing costs to these errors. For example, for a ransomware detection model, the damage caused by a FN may be several orders of magnitude greater than the cost of a security analyst handling a FP; while for other models, e.g., port scanning detection, the costs may be similar, or handling a FP may even be the more expensive error. By using 𝐹1 score, such cost considerations are usually ignored.2 So a natural question to ask is: given the cost
difference (or ratio) between the consequences of a FP and a FN for a particular use case, how can an
organization incorporate that information while building machine learning models for that application?
There is considerable prior work on cost-sensitive learning [3, 4, 5, 6, 7, 8]. These approaches aim to modify the model learning process, e.g., by altering the loss function to incorporate cost, adding weights to the training samples, or readjusting class frequencies in the training set, such that the trained model intrinsically produces results that are cost sensitive. In this paper, we do not change the underlying learning process and instead propose a new cost-aware metric as a replacement for 𝐹1 score that can be used for model thresholding, comparison, and selection. It is defined in terms of recall, precision, and a cost ratio, and can be used, for example, to determine the minimum cost point on a precision-recall curve. We applied the new metric, called cost score, 𝐶𝑠𝑐𝑜𝑟𝑒, to several cybersecurity-related datasets and found significant cost differences between using 𝐹1 score and 𝐶𝑠𝑐𝑜𝑟𝑒.
   While cost score applies to any classification problem, it is especially relevant in cybersecurity since
the mismatch in the costs of misclassification can be significant. The main purpose of cost score is
to make it easier for practitioners to incorporate cost during model thresholding and selection. It is
an easy replacement for 𝐹1 score since 𝐶𝑠𝑐𝑜𝑟𝑒 is also defined in terms of precision and recall (and an
additional cost ratio).
The key contributions of the paper are:

      • Introduction of a new cost-based metric, 𝐶𝑠𝑐𝑜𝑟𝑒, defined as (1/Precision − 1 − 𝑟𝑐) · Recall + 𝑟𝑐, where 𝑟𝑐 is the cost ratio, which incorporates the differing costs of misclassification and can be used as a cost-aware alternative to 𝐹1 score.
      • Characterization and derivation of the new metric, and its comparison with 𝐹1 score.
      • Application of 𝐶𝑠𝑐𝑜𝑟𝑒 to five cybersecurity-related datasets, four of which are publicly available and one of which is private, for multiple values of the cost ratio. The results show a cost saving of up to 86%, with an average saving of 49% over using 𝐹1 score in situations where costs are unequal.


2. Related Work
2.1. Drawbacks of 𝐹1 score and alternatives
While 𝐹1 score is preferred to any one of accuracy, precision, or recall, especially for an imbalanced
dataset, its primary drawback in our context is that all misclassifications are considered equal [9]. The
other drawbacks [10, 11] include 1) lack of symmetry with respect to class labels, e.g., changing the
positive class in a binary classifier produces a different result; and, 2) no dependence on true negatives.
A more robust though not as popular alternative addressing some of these problems while still working
well for imbalanced datasets is the Matthews Correlation Coefficient (MCC) [12], which in many cases
is preferred to 𝐹1 score [13]. It is symmetric and produces a high score only if all four confusion matrix
entries (see Table 2) show good results [13]. However, it treats FPs and FNs equally. Unlike 𝐹1 score and MCC, our proposed metric is not symmetric with respect to FNs and FPs, taking their distinct impacts into consideration through a cost ratio. Further, like MCC but unlike 𝐹1 score, our metric is symmetric in its treatment of true positives and true negatives. Unlike MCC and 𝐹1 score, our metric is not normalized and varies between 0 (best) and ∞ (worst). This does not affect model thresholding or comparison; however, while the value of the cost metric is not very meaningful in itself, it can be converted to the corresponding recall and precision values. Since neither MCC nor 𝐹1 score considers
2 A weighted version of 𝐹1 score exists; however, it is rarely used, since it is not obvious how precision and recall should be weighted to incorporate the differing costs of FNs and FPs.
differing costs of errors, and the latter is more widely used, we compare our proposed metric with 𝐹1
score in the rest of the paper.

2.2. Cost sensitive learning and assessment
Since in real-world applications cost differences between types of errors can be large, cost-sensitive
machine learning has been an active area of research in the past few decades [3, 4, 5, 7], especially in
areas such as security [14, 15] and medicine [16]. For example, Lee et al. [14] proposed cost models
for intrusion detection; Liu et al. [15] incorporate cost considerations both in feature selection and
classification for software defect prediction. Some of this and similar work could be used to estimate
cost ratios for our proposed cost metric.
At a high level, cost-sensitive machine learning [8] can be categorized into two different approaches: 1) where the machine learning methods are modified to incorporate the unequal costs of errors; and, 2) where existing machine learning models – trained with cost-oblivious methods – are converted into cost-sensitive ones using a wrapper [5, 7]. In this paper, we focus on the second approach, which is also referred to as cost-sensitive meta-learning. While there are various methods to implement this approach, we will focus on thresholding or threshold adjusting [7], where the decision threshold of a probabilistic model is selected based on a cost function. Sheng et al. [7] showed that thresholding outperforms several other cost-sensitive meta-learning methods such as MetaCost [5].
   In the most general case, the cost function for thresholding can be constructed from the entries
of a confusion matrix with a weight attached to each of them, that is, FPs, FNs, TPs and TNs [6].
Our proposed cost metric uses a similar formulation; however, it is expressed in terms of precision and recall, metrics that data scientists already know and understand well. We are not aware of any existing cost metric defined in terms of precision, recall, and a cost ratio. Unlike 𝐹1 score or MCC, the proposed metric is directly proportional to the total cost of misclassification. We believe it can serve as a cost-aware replacement for 𝐹1 score or MCC.


3. Proposed Metric: Cost Score
While the proposed metric is applicable to any machine learning classification model, including multi-
class and multilabel settings, for simplicity we will assume a binary classification task in the following
discussion. The notation used is summarized in Table 1.
Starting with the cost of misclassifications, we derive expressions for cost score that can replace 𝐹1 score. In particular, we derive two equivalent expressions – one in terms of TPR (recall) and FPR; the other in terms of precision and recall. They both include an error cost ratio (𝑟𝑐), which is the ratio of the cost of a FN to that of a FP. The first expression depends on the base rate (𝑃(𝑉)), while the second, like 𝐹1 score, does not directly depend on it.
   The basic evaluation metrics for a binary classifier can be defined from a confusion matrix, shown
in Table 2. One can also look at a confusion matrix from a probabilistic perspective, where the four
possible outcomes define a probabilistic space, with each outcome a joint probability, as shown in
Table 3. The total probability along a row or a column gives the corresponding marginal probability.

Conditional Probabilistic Definitions of Classifier Metrics

False positive rate (𝐹𝑃/𝑛): 𝑃(𝐴|¬𝑉)

False negative rate (𝐹𝑁/𝑝): 𝑃(¬𝐴|𝑉)

True positive rate (recall) (𝑇𝑃/𝑝): 𝑃(𝐴|𝑉)

True negative rate (𝑇𝑁/𝑛): 𝑃(¬𝐴|¬𝑉)
Table 1
Notation

    Symbol    Description
    ¬         logical not
    𝑉         vulnerability or threat, or in general the positive class
    𝐴         positive classification by a detector, which may result in an alarm
    𝑇𝑃        true positive
    𝑇𝑁        true negative
    𝐹𝑃        false positive
    𝐹𝑁        false negative
    𝑁         total number of data points
    𝑁𝐹𝑃       number of false positives
    𝑁𝐹𝑁       number of false negatives
    𝑝         total number of positives
    𝑝̂         total number of predicted positives
    𝑛         total number of negatives
    𝑛̂         total number of predicted negatives
    𝐶𝐹𝑃       cost of a false positive
    𝐶𝐹𝑁       cost of a false negative
    𝐶         total cost of misclassification
    𝑟𝑐        error cost ratio, defined as 𝐶𝐹𝑁/𝐶𝐹𝑃
    𝑅         recall
    𝑃𝑟𝑒𝑐      precision

Table 2
Confusion Matrix

                              Ground Truth
                          𝑉 (or 𝑇)     ¬𝑉 (or 𝐹)
              𝐴 (or 𝑇)       TP            FP         𝑝̂
Prediction
             ¬𝐴 (or 𝐹)       FN            TN         𝑛̂
                              p             n

Table 3
Confusion Matrix – probabilistic view

                              Ground Truth
                          𝑉               ¬𝑉
              𝐴       𝑃(𝐴, 𝑉)         𝑃(𝐴, ¬𝑉)        𝑃(𝐴)
Prediction
             ¬𝐴       𝑃(¬𝐴, 𝑉)        𝑃(¬𝐴, ¬𝑉)       𝑃(¬𝐴)
                       𝑃(𝑉)             𝑃(¬𝑉)


Precision (𝑇𝑃/𝑝̂): 𝑃(𝑉|𝐴)

False discovery rate (or 1 − precision) (𝐹𝑃/𝑝̂): 𝑃(¬𝑉|𝐴)


3.1. Cost function
The cost incurred as a result of misclassification is composed of the cost of false positives and that of false negatives. 𝑃(𝐴, ¬𝑉) and 𝑃(¬𝐴, 𝑉) represent the probabilities of a false positive and a false negative, respectively. Thus, their numbers can be expressed as:

𝑁𝐹𝑃 = 𝑁 · 𝑃(𝐴, ¬𝑉)
𝑁𝐹𝑁 = 𝑁 · 𝑃(¬𝐴, 𝑉)

Multiplying by the corresponding costs gives us the total cost of errors:

𝐶 = 𝐶𝐹𝑃 · 𝑁𝐹𝑃 + 𝐶𝐹𝑁 · 𝑁𝐹𝑁
  = 𝐶𝐹𝑃 · 𝑁 · 𝑃(𝐴, ¬𝑉) + 𝐶𝐹𝑁 · 𝑁 · 𝑃(¬𝐴, 𝑉)

Factoring out the common terms and introducing the cost ratio, 𝑟𝑐 = 𝐶𝐹𝑁/𝐶𝐹𝑃, gives:

𝐶 = 𝐶𝐹𝑃 · 𝑁 · (𝑃(𝐴, ¬𝑉) + 𝑟𝑐 · 𝑃(¬𝐴, 𝑉))     (1)
  = 𝐾 · [𝑃(𝐴, ¬𝑉) + 𝑟𝑐 · 𝑃(¬𝐴, 𝑉)],     (2)

where 𝐾 = 𝐶𝐹𝑃 · 𝑁.

3.2. Cost score in terms of TPR and FPR
Here we model the cost function in terms of TPR (recall) and FPR. Data scientists frequently evaluate a
model in terms of TPR, which corresponds to the fraction of positive cases detected and FPR, which is
the fraction of the negatives that were misclassified as positives. In fact, an ROC curve (a plot between
TPR and FPR) is widely used for thresholding a probabilistic classifier.
Using the product rule, we rewrite the probability of a false positive in terms of FPR and 𝑃(𝑉):

𝑃(𝐴, ¬𝑉) = 𝑃(𝐴|¬𝑉) · 𝑃(¬𝑉)     (3)
         = 𝐹𝑃𝑅 · (1 − 𝑃(𝑉))     (4)

Similarly, we rewrite the joint distribution for a false negative in terms of TPR and 𝑃(𝑉):

𝑃(¬𝐴, 𝑉) = 𝑃(¬𝐴|𝑉) · 𝑃(𝑉)     (5)
         = (1 − 𝑃(𝐴|𝑉)) · 𝑃(𝑉)     (6)
         = (1 − 𝑇𝑃𝑅) · 𝑃(𝑉)     (7)

Substituting Equations 4 and 7 in the cost expression, Equation 2, and rearranging, we get:

𝐶 = 𝐾 · [𝐹𝑃𝑅 + 𝑃(𝑉) · (𝑟𝑐 − 𝑟𝑐 · 𝑇𝑃𝑅 − 𝐹𝑃𝑅)]

To minimize the cost, we can ignore 𝐾, and thus the cost score becomes:

𝐶𝑠𝑐𝑜𝑟𝑒 = 𝐹𝑃𝑅 + 𝑃(𝑉) · (𝑟𝑐 − 𝑟𝑐 · 𝑇𝑃𝑅 − 𝐹𝑃𝑅)     (8)

3.3. Cost score in terms of precision and recall
While FPR is a useful metric as it captures the number of false positives, it can be tricky to interpret, especially when the base rate, 𝑃(𝑉), is low, which is usually the case in cybersecurity problems. For problems such as intrusion or threat detection, FPs add overhead to the workflow of a security analyst. For phishing website detection, a FP may result in a website being blocked in error for an end user. In either case, setting a target FPR requires knowledge of the base rate and would have to change as the base rate changes. In other words, even a seemingly low FPR may not be good enough, given a low base rate. Further, variation in the base rate would affect the overhead of a security analyst in the case of intrusion detection, or the fraction of erroneously blocked websites for a user in the case of phishing detection, even if the FPR stays constant. Precision, on the other hand, directly captures the operator overhead or the fraction of erroneously blocked websites, independent of the base rate.
   A main attraction of 𝐹1 score is its use of precision instead of FPR. When the costs of a FP and a FN are similar, 𝐹1 score is an effective evaluation metric; however, with unequal costs of misclassification, we can usually find a better solution by incorporating this cost differential in the metric. Below, we derive an expression for 𝐶𝑠𝑐𝑜𝑟𝑒 in terms of precision and recall, similar to 𝐹1 score, but that also includes a cost ratio.
We can rewrite the probability of a false positive in terms of precision (𝑃𝑟𝑒𝑐) and the marginal probability of an alarm:

𝑃(𝐴, ¬𝑉) = 𝑃(¬𝑉|𝐴) · 𝑃(𝐴)     (9)
         = (1 − 𝑃𝑟𝑒𝑐) · 𝑃(𝐴)     (10)

𝑃(𝐴) can be expressed in terms of 𝑃(𝑉), 𝑃𝑟𝑒𝑐 and 𝑅 (recall) using Bayes rule:

𝑃(𝑉|𝐴) · 𝑃(𝐴) = 𝑃(𝐴|𝑉) · 𝑃(𝑉)
𝑃(𝐴) = (𝑃(𝐴|𝑉) / 𝑃(𝑉|𝐴)) · 𝑃(𝑉)
     = (𝑅 / 𝑃𝑟𝑒𝑐) · 𝑃(𝑉)

Substituting this value of 𝑃(𝐴) in Equation 10, we get:

𝑃(𝐴, ¬𝑉) = ((1 − 𝑃𝑟𝑒𝑐) / 𝑃𝑟𝑒𝑐) · 𝑅 · 𝑃(𝑉)     (11)
As in the previous section (Equation 7), the probability of a false negative can be written as:

𝑃(¬𝐴, 𝑉) = (1 − 𝑅) · 𝑃(𝑉)     (12)

Therefore, substituting the probabilities of a false positive and a false negative from Equations 11 and 12, respectively, into the cost expression (Equation 1), we get:

𝐶 = 𝑁 · 𝐶𝐹𝑃 · 𝑃(𝑉) · [((1 − 𝑃𝑟𝑒𝑐) / 𝑃𝑟𝑒𝑐) · 𝑅 + 𝑟𝑐 · (1 − 𝑅)]

Since 𝑁, 𝐶𝐹𝑃 and 𝑃(𝑉) are constant for a given dataset, we can drop them and rewrite the cost expression as:

𝐶𝑠𝑐𝑜𝑟𝑒 = (1/𝑃𝑟𝑒𝑐 − 1) · 𝑅 + 𝑟𝑐 · (1 − 𝑅)     (13)
This expression defines the cost in terms of precision, recall and the cost ratio, and can be used instead of 𝐹1 score for any task that requires model comparison, such as model thresholding, hyperparameter tuning, model selection and feature selection.
   𝐶𝑠𝑐𝑜𝑟𝑒 goes to zero for 𝑃𝑟𝑒𝑐 = 1 and 𝑅 = 1, as expected, and 𝐶𝑠𝑐𝑜𝑟𝑒 → ∞ as 𝑃𝑟𝑒𝑐 → 0 for any fixed 𝑅 > 0.
   We have derived two equivalent cost expressions – one involving TPR and FPR (quantities used in an ROC curve) and the second involving precision and recall (quantities used in computing 𝐹1 score). Similarly, it may be possible to derive additional equivalent cost expressions in terms of other commonly used metrics. In the remainder of the paper, we will only consider the cost expression 𝐶𝑠𝑐𝑜𝑟𝑒 defined in terms of precision and recall (similar to 𝐹1 score). This definition of 𝐶𝑠𝑐𝑜𝑟𝑒 is not directly dependent on the base rate (𝑃(𝑉)), unlike the one in the previous section.
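Since this is the form of 𝐶𝑠𝑐𝑜𝑟𝑒 used throughout the rest of the paper, a minimal Python sketch of Equation 13 may help; the function name, the clipping guard and the example values are our own.

```python
import numpy as np

# Minimal sketch (for illustration) of Equation 13, vectorized so it can be
# applied to an entire precision-recall curve at once. Precision is clipped
# to avoid a division by zero when a model predicts no positives.
def cscore(precision, recall, r_c):
    precision = np.clip(np.asarray(precision, dtype=float), 1e-12, None)
    recall = np.asarray(recall, dtype=float)
    return (1.0 / precision - 1.0) * recall + r_c * (1.0 - recall)

print(cscore(1.0, 1.0, r_c=10))   # 0.0  (perfect classifier)
print(cscore(0.5, 0.2, r_c=10))   # 8.2  (low precision and recall are costly)
```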

3.4. 𝐶𝑠𝑐𝑜𝑟𝑒 Isocost Contours
To better understand the cost score metric, we will examine its dependence on precision and recall,
and compare it with 𝐹1 score. Figure 1 shows a precision-recall (PR) plot with 𝐹1 score isocurves or
contours. Each curve corresponds to a constant value of 𝐹1 score as specified next to the curve. If
recall and precision are equal, 𝐹1 score equals that common value. However, if there is a wide gap between them, 𝐹1 tends to be closer to the lower of the two, as can be seen in the top-left and bottom-right
regions of the plot. As expected, the highest (best) value contours are towards the top-right corner of
the plot (that is, towards perfect recall and precision). Further, the slope of the curves is always negative
(as shown in Appendix A), implying there is always a trade-off between recall and precision.
Figure 1: Precision-recall plot of 𝐹1 score isocurves.


We can similarly obtain isocost curves (or contour lines) for cost score assuming a particular cost ratio, 𝑟𝑐. The cost score (Equation 13) can be written as:

𝐶𝑠𝑐𝑜𝑟𝑒 = (1/𝑃𝑟𝑒𝑐 − 1 − 𝑟𝑐) · 𝑅 + 𝑟𝑐     (14)

and plotted for constant values of 𝐶𝑠𝑐𝑜𝑟𝑒 on a PR plot. Figure 2 shows the isocost curves for three cost ratios: 𝑟𝑐 = 1, that is, a FN and a FP cost the same; 𝑟𝑐 = 10, that is, a FN is ten times as expensive as a FP; and 𝑟𝑐 = 0.1, that is, a FN is one-tenth as expensive as a FP.
There are three distinct regions in the plot, based on the slope of the curves. From the above equation, we can compute the slope (see Appendix A for details):

∂𝑃𝑟𝑒𝑐/∂𝑅 = (𝐶𝑠𝑐𝑜𝑟𝑒 − 𝑟𝑐) / (𝐶𝑠𝑐𝑜𝑟𝑒 + 𝑅(𝑟𝑐 + 1) − 𝑟𝑐)²

Depending on the value of 𝐶𝑠𝑐𝑜𝑟𝑒, the slope can be positive, negative or zero, as shown below:

∂𝑃𝑟𝑒𝑐/∂𝑅 > 0 if 𝐶𝑠𝑐𝑜𝑟𝑒 > 𝑟𝑐
∂𝑃𝑟𝑒𝑐/∂𝑅 < 0 if 𝐶𝑠𝑐𝑜𝑟𝑒 < 𝑟𝑐
∂𝑃𝑟𝑒𝑐/∂𝑅 = 0 if 𝐶𝑠𝑐𝑜𝑟𝑒 = 𝑟𝑐

For lower (better) values of 𝐶𝑠𝑐𝑜𝑟𝑒, when 𝐶𝑠𝑐𝑜𝑟𝑒 < 𝑟𝑐, the slope is negative and the isocost curves are similar to the isocurves for 𝐹1 score. The horizontal line corresponds to 𝐶𝑠𝑐𝑜𝑟𝑒 = 𝑟𝑐, and the curves below it have a positive slope with 𝐶𝑠𝑐𝑜𝑟𝑒 > 𝑟𝑐. The isocurves closest to the top-right corner have the lowest costs.
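Isocost curves like those in Figure 2 can be reproduced by solving Equation 14 for precision, which gives 𝑃𝑟𝑒𝑐 = 𝑅 / (𝐶𝑠𝑐𝑜𝑟𝑒 − 𝑟𝑐 + 𝑅(1 + 𝑟𝑐)). The plotting sketch below is our own; the contour levels and cost ratio are arbitrary illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Sketch (for illustration) of isocost curves similar to Figure 2. Solving
# Equation 14 for precision gives Prec = R / (C_score - r_c + R * (1 + r_c));
# each curve below holds C_score fixed at one value.
r_c = 10
recall = np.linspace(0.01, 1.0, 500)
for c in [1, 2, 5, 10, 20]:
    prec = recall / (c - r_c + recall * (1 + r_c))
    feasible = (prec > 0) & (prec <= 1)          # keep only valid precision values
    plt.plot(recall[feasible], prec[feasible], label=f"C_score = {c}")
plt.xlabel("Recall"); plt.ylabel("Precision"); plt.legend(); plt.show()
```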
While the isocost contours are plotted assuming 𝑃𝑟𝑒𝑐 and 𝑅 are independent, that is obviously not the case for a particular model. In fact, 𝑃𝑟𝑒𝑐 = 𝑃(𝑉|𝐴) and 𝑅 = 𝑃(𝐴|𝑉) are related by Bayes rule: 𝑃𝑟𝑒𝑐 = (𝑃(𝑉)/𝑃(𝐴)) · 𝑅. The feasible 𝑃𝑟𝑒𝑐-𝑅 pairs obtained by varying the model threshold are given by a PR curve. A hypothetical PR curve is shown as a dotted black line in Figure 2. The cost corresponding to each point on the PR curve is given by the isocost contour intersecting that point. The minimum cost point on the PR curve is the one that intersects the lowest cost contour. If the PR curve is convex, the minimum cost contour will touch the PR curve at only one point, where their tangents have equal slope.3 However, in practice, empirically constructed PR curves are not always convex and thus the minimum cost point
3 Under the assumption of convexity, this can be proved by contradiction. Assume the minimum cost isocost touches the PR curve at two or more points; since both curves are convex, there must then be another, lower cost isocost that touches the PR curve at at least one point, a contradiction. Thus, the lowest cost isocost must touch the PR curve at exactly one point.
Figure 2: Isocost contours for 𝐶𝑠𝑐𝑜𝑟𝑒 for three different cost ratios: (a) 𝑟𝑐 = 1, (b) 𝑟𝑐 = 10, (c) 𝑟𝑐 = 0.1. The 𝐶𝑠𝑐𝑜𝑟𝑒 corresponding to each contour is listed next to it. The black dotted line is the PR curve for a particular model.


may not be unique. In Figure 2, points A, B and C approximately show the minimum cost points for the three cost ratios.
   What do isocost contours mean in terms of the confusion matrix? 𝐶𝑠𝑐𝑜𝑟𝑒 remains constant along a contour and is proportional to 𝐹𝑃 + 𝑟𝑐 · 𝐹𝑁, which must therefore remain constant as recall and precision change. In Table 4, we have parameterized the confusion matrix entries with 𝑘 such that, as 𝑘 changes for a particular 𝑟𝑐, precision and recall vary while 𝐶𝑠𝑐𝑜𝑟𝑒 remains constant. This can be seen by computing 𝐹𝑃 + 𝑟𝑐 · 𝐹𝑁 for the table entries, which is (𝐹𝑃′ + 𝑟𝑐 · 𝑘) + 𝑟𝑐 · (𝐹𝑁′ − 𝑘) = 𝐹𝑃′ + 𝑟𝑐 · 𝐹𝑁′, independent of 𝑘 and thus constant.

Table 4
Confusion Matrix – parameterized by 𝑘

                              Ground Truth
                          𝑉                  ¬𝑉
              𝐴       𝑇𝑃′ + 𝑘         𝐹𝑃′ + 𝑟𝑐 · 𝑘        𝑝̂ + 𝑟𝑐 · 𝑘 + 𝑘
Prediction
             ¬𝐴       𝐹𝑁′ − 𝑘         𝑇𝑁′ − 𝑟𝑐 · 𝑘        𝑛̂ − 𝑟𝑐 · 𝑘 − 𝑘
                          p                   n
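A quick numeric check, with made-up confusion matrix entries of our own, confirms this: as 𝑘 varies, precision and recall change, but 𝐹𝑃 + 𝑟𝑐 · 𝐹𝑁, and hence 𝐶𝑠𝑐𝑜𝑟𝑒, stays constant.

```python
# Quick numeric check (illustrative entries) of Table 4: as k varies,
# precision and recall change but FP + r_c * FN stays constant, so
# C_score is constant along the corresponding isocost contour.
r_c, tp0, fp0, fn0 = 2.0, 80, 30, 20
for k in [0, 5, 10]:
    tp, fp, fn = tp0 + k, fp0 + r_c * k, fn0 - k
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    print(f"k={k:2d}  precision={prec:.3f}  recall={rec:.3f}  FP + r_c*FN = {fp + r_c * fn}")
```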



3.5. How does 𝐹1 score Compare with 𝐶𝑠𝑐𝑜𝑟𝑒 when 𝑟𝑐 = 1?
While 𝐹1 score varies from 0 to 1, with 1 indicating perfect performance, 𝐶𝑠𝑐𝑜𝑟𝑒 is proportional to the actual cost of handling model errors, with a zero cost indicating perfect performance (that is, no FPs or FNs). 𝐹1 score treats FNs and FPs uniformly, as does 𝐶𝑠𝑐𝑜𝑟𝑒 when 𝑟𝑐 = 1. So a natural question is whether 𝐶𝑠𝑐𝑜𝑟𝑒 differs from 𝐹1 score when 𝑟𝑐 = 1. To compare them, we transform 𝐹1 score into a cost metric:

𝐹1𝑐𝑜𝑠𝑡 = 1/𝐹1 − 1     (15)
When 𝐹1 is 1, 𝐹1𝑐𝑜𝑠𝑡 = 0, and as 𝐹1 → 0, 𝐹1𝑐𝑜𝑠𝑡 → ∞; it thus exhibits the behavior of a cost function and can be directly compared with 𝐶𝑠𝑐𝑜𝑟𝑒.
   To compare 𝐶𝑠𝑐𝑜𝑟𝑒 and 𝐹1𝑐𝑜𝑠𝑡, we express both in terms of the elements of the confusion matrix and find that:

𝐶𝑠𝑐𝑜𝑟𝑒 ∝ 𝐹𝑃 + 𝐹𝑁     (16)
𝐹1𝑐𝑜𝑠𝑡 ∝ (𝐹𝑃 + 𝐹𝑁) / 𝑇𝑃     (17)

Thus, when 𝑟𝑐 = 1, 𝐶𝑠𝑐𝑜𝑟𝑒 and 𝐹1𝑐𝑜𝑠𝑡 are not identical; while 𝐶𝑠𝑐𝑜𝑟𝑒 is proportional to the total number of errors, 𝐹1𝑐𝑜𝑠𝑡 is also inversely proportional to the number of true positives. 𝐶𝑠𝑐𝑜𝑟𝑒 only considers the cost of errors; it assigns zero cost to both TPs and TNs. In that sense, it treats TPs and TNs symmetrically, unlike 𝐹1 score.
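To make this concrete, consider two hypothetical confusion matrices from the same dataset (110 positives each) with the same total number of errors; the numbers below are our own and chosen only for illustration. At 𝑟𝑐 = 1 the 𝐶𝑠𝑐𝑜𝑟𝑒 is identical for both, while 𝐹1𝑐𝑜𝑠𝑡 differs because it also depends on TP.

```python
# Illustration (made-up counts): two confusion matrices with the same number
# of errors (FP + FN = 30) and the same number of positives (TP + FN = 110).
# At r_c = 1 the C_score is identical, but F1cost differs with TP.
def cscore_and_f1cost(tp, fp, fn, r_c=1.0):
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    cscore = (1 / prec - 1) * rec + r_c * (1 - rec)
    f1 = 2 * prec * rec / (prec + rec)
    return cscore, 1 / f1 - 1

print(cscore_and_f1cost(tp=100, fp=20, fn=10))  # C_score ~ 0.273, F1cost = 0.150
print(cscore_and_f1cost(tp=90,  fp=10, fn=20))  # C_score ~ 0.273, F1cost ~ 0.167
```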

3.6. Multiclass and multilabel classifiers
While we derived the cost metric assuming a binary classification problem, its extension to multiclass
and multilabel classification problems is straightforward. A cost ratio per class would need to be defined.
For a multiclass classifier, a user would have to assign cost ratios considering each class as positive
and the rest as negative. Similarly, for a multilabel classifier, a user would have to assign independent cost ratios for each class. This allows a 𝐶𝑠𝑐𝑜𝑟𝑒 to be computed per class. To compute a single cost metric, the per-class 𝐶𝑠𝑐𝑜𝑟𝑒 values would need to be aggregated. The simplest aggregation function is an arithmetic mean, although a class-weighted mean based on class importance, or another type of aggregation, e.g., a harmonic mean, can also be used. The "one class versus the rest" approach is similar to how 𝐹1 score and other metrics are computed in a multiclass setting, as sketched below.
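A minimal sketch of this one-versus-rest computation follows; the helper function, the toy labels and the per-class cost ratios are our own, and scikit-learn's precision_recall_fscore_support is used only to obtain per-class precision and recall.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Sketch (for illustration) of the one-vs-rest extension: a per-class C_score
# computed from per-class precision/recall and per-class cost ratios,
# aggregated here with a simple arithmetic mean.
def multiclass_cscore(y_true, y_pred, cost_ratios, labels):
    prec, rec, _, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=labels, zero_division=0)
    prec = np.clip(prec, 1e-12, None)            # guard against zero precision
    per_class = (1 / prec - 1) * rec + np.asarray(cost_ratios) * (1 - rec)
    return per_class.mean(), per_class

# Toy example with three classes and class-specific cost ratios.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]
print(multiclass_cscore(y_true, y_pred, cost_ratios=[1.0, 10.0, 0.1], labels=[0, 1, 2]))
```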

3.7. Minimizing 𝐶𝑠𝑐𝑜𝑟𝑒 based on model threshold and other hyperparameters
In Section 3.4, we described the use of isocost contours to visually determine the lowest cost point on a PR curve. In practice, to find the minimum cost based on the model threshold, and the corresponding precision-recall values, precision and recall can be considered functions of the threshold value (𝑡), with the optimal value of 𝑡 determined by minimizing the cost function with respect to 𝑡:

𝑡 = argmin_𝑡 [(1/𝑃𝑟𝑒𝑐(𝑡) − 1) · 𝑅(𝑡) + 𝑟𝑐 · (1 − 𝑅(𝑡))]
In addition to the model threshold, 𝐶𝑠𝑐𝑜𝑟𝑒 can also be used for selecting other model hyperparameters, such as the number of neighbors in 𝑘-NN; the number of trees, maximum tree depth, etc. in tree-based models; and the number and type of layers, activation functions, etc. in neural networks; as well as for model comparison and selection. Hyperparameter tuning [17] is typically performed using methods such as grid search, random search, gradient-based optimization, etc., with cross-validation used in conjunction to evaluate the quality of a particular hyperparameter choice on a dataset. In all these methods, the proposed 𝐶𝑠𝑐𝑜𝑟𝑒 can replace a cost-oblivious metric such as 𝐹1 score, as in the thresholding sketch below.


4. Experimental Evaluation
4.1. Datasets
The datasets used for our experiments were chosen based on their relevance to security and the varying
cost of misclassification between target classes. To comprehensively analyze the impact of costs, we
selected five different datasets, four publicly available datasets and one privately collected dataset. The
publicly available datasets include the UNSW-NB15 intrusion detection data, KDD Cup 99 network
intrusion data, credit card transaction data, and phishing URL data.

   1. UNSW-NB15 Intrusion Detection Data: This network dataset, developed by the Intelligent Security Group at UNSW Canberra, comprises events categorized into nine distinct attack types plus normal traffic. To suit the experimental requirements of our study, the dataset was transformed into a binary classification setting, where a subset of the attack classes (Backdoor, Exploits, Reconnaissance) is consolidated into class 1, while normal traffic is represented as
      class 0. There are a total of 93,000 events in class 0 and 60,841 events in class 1. For our research,
      we utilized the CSV version of the dataset, which comes pre-partitioned into training and testing
      sets [18][19].
   2. KDD Cup 99 Network Intrusion Data: This dataset originated from packet traces captured
      during the 1998 DARPA Intrusion Detection System Evaluation. It encompasses 145,585 unique
      records categorized into 23 distinct classes, which include various types of attacks alongside
      normal network traffic. Each record is characterized by 41 features that are derived from the
      packet traces. For this research, the dataset has been adapted to focus on a binary classification
      task: class 0 represents normal instances, while class 1 aggregates all other attack types. Also, to
explore the impact of different thresholds on the model's performance, training was conducted
      using only 1% of the dataset. The dataset is accessed through the datasets available in the Python
      sklearn package [20].
   3. Credit Card Transactions Data: This dataset contains credit card transaction logs with 29
features that are tagged as legitimate or fraudulent. There are 492 fraudulent transactions (class 1) and 284,315 legitimate ones (class 0) [21]. Of the five datasets, this one has the highest skew.
   4. Phishing data: This dataset is a collection of 60,252 webpages along with their URL and HTML
      sources. Out of these, 27,280 are phishing sites (class 1) whereas 32,972 are benign (class 0) [22].
      We only use the URLs for building the model.
   5. Internal data: This is a private dataset used within an organization; it represents the results of an extensive audit conducted on vulnerabilities in source code. Each vulnerability is classified
      into two classes, class 0 or class 1 (actual class names are masked for anonymity) by human
      auditors during the auditing process. The model is trained on this manually audited data and
      predicts if a given vulnerability belongs to class 0 or class 1. There are a total of 144,978 instances
      out of which 18,738 belong to class 1. Each vulnerability has 58 features which encompass a wide
      array of metrics that were generated during the analysis of the codebase.

   The information about each dataset is summarized in Table 5. It is important to note that not all
datasets used are balanced. For instance, the credit card fraud data has less than 1% of its instances in class 1. Similarly, the internal data has only about 13% of its instances in class 1.

4.2. Experiment Setup
We train a classification model using a RandomForest algorithm for each dataset. The goal is not to
train the best possible model for the dataset but to obtain a reasonably good model with a probabilistic
output.
   The steps are as follows:

   1. Model Training: A RandomForest classifier is trained on each dataset. Although the training
      sets have different skews, we effectively used a balanced dataset for training so the classifier gets
      an equal opportunity to learn both classes.
   2. Threshold adjustment using 𝐹1 score: The validation dataset is used to identify the best
      threshold based on the 𝐹1 score. The validation dataset was selected by sampling a proportion of
      the data, ensuring that the class distribution mirrored that of the training data. This approach was
Table 5
Summary of datasets

                            Number of instances
    Dataset               Class 0      Class 1      Number of features
    UNSW-NB15              93,000       60,841              42
    Credit card fraud     284,315          492              29
    KDD cup 99             87,832       57,753              41
    Phishing data          32,972       27,280             188
    Internal data         126,240       18,738              58


      taken because the actual skew of classes in production deployment is unknown. However, it is
      important to note that for actual production systems, the validation set should be representative
      of the true data distribution. Specifically, for the UNSW-NB15 dataset, the validation set was
      sampled from events in the test data CSV file.
   3. Threshold adjustment using 𝐶𝑠𝑐𝑜𝑟𝑒: The predictions from the trained model are analyzed across different cost ratios. Using the validation sets, we apply 𝐶𝑠𝑐𝑜𝑟𝑒 to determine the optimal threshold for each cost ratio.
   4. Comparison: The model's cost with the threshold chosen based on 𝐹1 score is compared against the costs with the thresholds chosen using 𝐶𝑠𝑐𝑜𝑟𝑒.

This setup allows us to evaluate the effectiveness of 𝐶𝑠𝑐𝑜𝑟𝑒 in optimizing model performance under varying cost conditions.
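A condensed sketch of these four steps is shown below. It uses synthetic data in place of the five datasets and approximates the balanced-training step with scikit-learn's class_weight="balanced" option rather than resampling, so it illustrates the procedure rather than the exact setup described above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Condensed sketch (for illustration) of the four steps above, on synthetic
# data; class_weight="balanced" stands in for balanced training.
X, y = make_classification(n_samples=20000, weights=[0.9], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(class_weight="balanced", random_state=0)
model.fit(X_train, y_train)                                    # step 1: train
proba = model.predict_proba(X_val)[:, 1]

prec, rec, thr = precision_recall_curve(y_val, proba)
prec, rec = prec[:-1], rec[:-1]                                # align with thresholds

f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
i_f1 = np.argmax(f1)                                           # step 2: best-F1 threshold

for r_c in [0.1, 1, 10]:                                       # step 3: best-Cscore thresholds
    cscore = (1 / np.clip(prec, 1e-12, None) - 1) * rec + r_c * (1 - rec)
    i_c = np.argmin(cscore)
    saving = 1 - cscore[i_c] / cscore[i_f1]                    # step 4: compare costs
    print(f"r_c={r_c}: F1 thr={thr[i_f1]:.2f}, Cscore thr={thr[i_c]:.2f}, saving={saving:.1%}")
```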

4.3. Results
The proposed cost metric is tailored to scenarios where the cost of false negatives differs greatly from the cost of false positives. It helps optimize predictions based on cost considerations, thereby addressing a critical limitation of existing evaluation methods.

4.3.1. 𝐹1 Score for thresholding
To illustrate the advantages of our approach, we will first use 𝐹1 score to adjust a model’s threshold.
Figure 3 depicts the changes in 𝐹1 score for different threshold values across each dataset. The histograms
in each plot represent the distribution of data within the corresponding probability intervals. Each color in the histogram represents the distribution of the corresponding ground truth class. Due to the significant skew of the credit card dataset, the density of class 1 is not visible in
the histogram.
   The 𝐹1 score for each dataset starts from a threshold of 0, where all instances are classified as the
positive class and recall is 1. It ends at a score of 0 at a threshold of 1, where all instances are tagged as
negatives and recall is 0. The rate of change of the 𝐹1 score in a threshold interval depends on the proportion of data within that interval and their ground truth labels. This explains why, for some
datasets, the 𝐹1 score is flat or nearly flat in the middle range of thresholds.
It is clear from Figure 3 that most models do a good job of separating the two classes. The model trained on the KDD cup 99 data separates the two classes more distinctly and has most of the data points near probability zero and one. This makes the threshold vs. 𝐹1 score curve mostly flat in the
middle range of the probabilities. The best 𝐹1 score is achieved at a threshold of 0.33. Similarly, the
phishing, credit card fraud, and intrusion detection datasets perform well on the validation datasets
Figure 3: Threshold for best 𝐹1 score for each dataset: (a) UNSW-NB15, (b) Credit card fraud, (c) Phishing data, (d) KDD cup 99, (e) Internal data.


with well-separated bimodal distributions. The threshold with the highest 𝐹1 score is marked with a
vertical line in each plot. For the internal dataset, the model struggles to separate both classes as can
be seen by the overlap in the probabilities of both classes. The model achieves its best 𝐹1 score at a
threshold of 0.688.
As the threshold moves from 0 to 1, there is a trade-off between FPs and FNs; the maximum 𝐹1 score for each model corresponds to the point where the sum of FPs and FNs is minimal while the number of TPs is highest. Being symmetric in FNs and FPs, 𝐹1 score reduces their sum, disregarding any class-specific costs.

4.3.2. 𝐶𝑠𝑐𝑜𝑟𝑒 for thresholding at different cost ratios
The proposed 𝐶𝑠𝑐𝑜𝑟𝑒 metric allows the tuning of model parameters based on a cost ratio (the ratio of the cost of false negatives to the cost of false positives). This cost ratio is variable and dependent on the specific impacts these errors have on end users. For example, in scenarios where missing a true attack could lead to significant financial losses, the cost of a false negative is higher. Conversely, in resource-constrained environments, a high rate of false positives can considerably burden the evaluation process.
   To illustrate the tuning differences, we applied three distinct cost ratios to each dataset: 0.1 (where a
false positive is ten times more costly than a false negative), 1 (equal cost for both false positives and
false negatives), and 10 (where a false negative is ten times more costly than a false positive). These
cost ratios are used solely to demonstrate the model's behavior when tuned with 𝐶𝑠𝑐𝑜𝑟𝑒 and may not
correspond to practical applications of the data.
Figure 4 displays the optimal thresholds derived from 𝐶𝑠𝑐𝑜𝑟𝑒 for each dataset across these cost ratios. The histograms in each plot show the distribution of the ground truth classes within the probability intervals. The 𝐶𝑠𝑐𝑜𝑟𝑒 reflects the classification cost, resulting in a curve shape that is the inverse of that of the 𝐹1 score, with the optimal threshold at the minimum 𝐶𝑠𝑐𝑜𝑟𝑒. Similar to the 𝐹1 score plot, the flat portions of the 𝐶𝑠𝑐𝑜𝑟𝑒 curve correspond to probability intervals with fewer data points. At a threshold of 0, the 𝐶𝑠𝑐𝑜𝑟𝑒 is constant regardless of the cost ratio, as the recall is 1 and 𝐶𝑠𝑐𝑜𝑟𝑒 reduces to 1/𝑃𝑟𝑒𝑐 − 1.
Table 6 summarizes the experimental results, comparing the costs obtained with 𝐶𝑠𝑐𝑜𝑟𝑒-based thresholds to those obtained with the 𝐹1 score-based threshold. The cost score (𝐶𝑠𝑐𝑜𝑟𝑒) is computed at the thresholds obtained by maximizing 𝐹1 score and by minimizing 𝐶𝑠𝑐𝑜𝑟𝑒 for each of the cost ratios. There is only one threshold for each dataset based on the best 𝐹1 score, but the threshold based on 𝐶𝑠𝑐𝑜𝑟𝑒 varies with the cost ratio. Although the actual cost is a multiple of the 𝐶𝑠𝑐𝑜𝑟𝑒, the percentage improvement over the 𝐹1 score reflects the reduction in actual cost.
Figure 4: Variation of thresholds with different cost ratios for each dataset: (a) UNSW-NB15, (b) Phishing data, (c) Credit card fraud, (d) KDD cup 99, (e) Internal data.
   For the UNSW-NB15 dataset, the optimal threshold is 0.89 for a cost ratio of 0.1, minimizing false
positives at the expense of some true positives becoming false negatives (Figure 4a). At a cost ratio of
1, the threshold decreases to 0.65, balancing false positives and false negatives, aligning closely with
the best 𝐹1 score threshold. At a cost ratio of 10, the threshold further decreases to 0.42, significantly
reducing false negatives despite an increase in false positives.
   In the phishing dataset, the probability distribution of both classes is similar (Figure 4b). At a cost
ratio of 1, the threshold is 0.54, identical to the best 𝐹1 score threshold. For a cost ratio of 0.1, the
threshold increases to 0.73 to reduce false positives. Conversely, at a cost ratio of 10, the threshold
decreases to 0.17 to significantly reduce false negatives. The spike in 𝐶𝑠𝑐𝑜𝑟𝑒 for a cost ratio of 10 is
proportional to the number of true class 1 instances within the probability interval.
For the credit card fraud and KDD Cup 99 datasets, the 𝐶𝑠𝑐𝑜𝑟𝑒 curve remains mostly flat. In the case of the credit card fraud data, we applied a logarithmic transformation (Figure 4c) to highlight the differences in 𝐶𝑠𝑐𝑜𝑟𝑒 due to the significant class imbalance. For the KDD Cup 99 dataset, the trained model achieves good class separation (Figure 4d), resulting in a relatively flat 𝐶𝑠𝑐𝑜𝑟𝑒 curve in the middle region, with a spike towards a probability interval of 1 as the cost ratio increases.
   In the internal dataset, as we saw earlier, there is substantial overlap between the probability intervals
of both classes, increasing the significance of false positives and false negatives (Figure 4e). At a cost
ratio of 0.1, the threshold is set at 0.95, nearly eliminating false positives. At a cost ratio of 1, the threshold is 0.92, only slightly different from the threshold for a cost ratio of 0.1, and results in a similar rate of false positives. This small decrease can be attributed to the significant class imbalance, where further threshold reduction could significantly increase false positives due to the higher count of instances in class 0. As the cost ratio increases to 10, the threshold decreases to 0.42, considerably reducing false negatives (as indicated by the reduced proportion of class 1 instances to the left of the threshold). The spike in 𝐶𝑠𝑐𝑜𝑟𝑒 at this cost ratio corresponds to the interval with a significant count of class 1 instances.
Table 6 compares the cost improvements achieved by 𝐶𝑠𝑐𝑜𝑟𝑒-based thresholds at different cost ratios to the costs at the optimal threshold based on the 𝐹1 score. Performance improvements with respect to 𝐶𝑠𝑐𝑜𝑟𝑒 range from about 10% to 86% in most scenarios for cost ratios of 0.1 and 10, with an average cost improvement of 49%. At a cost ratio of 1, the improvement is minimal, except in datasets with significant class imbalances, indicating the similarity between 𝐹1 score and 𝐶𝑠𝑐𝑜𝑟𝑒 at this ratio. For the internal dataset with a cost ratio of 0.1, the cost improvement at the optimal 𝐶𝑠𝑐𝑜𝑟𝑒 threshold compared to the 𝐹1 score threshold is 86%. Additionally, there is over 50% improvement in cost at a cost ratio of 0.1 for the UNSW-NB15, credit card fraud, and KDD Cup 99 datasets, underscoring the substantial benefits of tuning models using the 𝐶𝑠𝑐𝑜𝑟𝑒 metric. These findings demonstrate how 𝐶𝑠𝑐𝑜𝑟𝑒 effectively adjusts the threshold to balance false positives and false negatives based on the specified cost ratio.

Figure 5: Precision-recall curves for different cost ratios: (a) UNSW-NB15, (b) Credit card fraud, (c) Phishing data, (d) KDD cup 99, (e) Internal data.
Table 6
Misclassification costs based on using 𝐹1 score and 𝐶𝑠𝑐𝑜𝑟𝑒 for thresholding, for three different cost ratios and the five datasets. The 𝐹1 score based optimal threshold, precision and recall do not depend on the cost ratio and are therefore listed once per dataset.

                                𝐹1 score based threshold                  𝐶𝑠𝑐𝑜𝑟𝑒 based threshold
Dataset            Cost ratio   Threshold  Precision  Recall  𝐶𝑠𝑐𝑜𝑟𝑒     Threshold  Precision  Recall  𝐶𝑠𝑐𝑜𝑟𝑒     Percentage Improvement in Cost
UNSW-NB15             0.1                                      0.056      0.890      0.992     0.868    0.020      64.1%
                      1           0.65       0.949     0.961   0.091      0.650      0.949     0.961    0.091       0.0%
                      10                                       0.441      0.420      0.885     0.993    0.203      53.2%
Credit card fraud     0.1                                      0.199      0.900      0.976     0.417    0.069      65.3%
                      1           0.27       0.815     0.781   0.396      0.640      0.931     0.698    0.354      10.6%
                      10                                       2.365      0.130      0.757     0.812    2.135       9.7%
KDD cup 99            0.1                                      0.006      0.540      0.999     0.986    0.002      66.7%
                      1           0.33       0.995     0.994   0.011      0.330      0.995     0.994    0.011       0.0%
                      10                                       0.065      0.170      0.982     0.998    0.034      47.7%
Phishing data         0.1                                      0.027      0.730      0.997     0.873    0.015      44.4%
                      1           0.54       0.980     0.915   0.104      0.540      0.980     0.915    0.104       0.0%
                      10                                       0.876      0.170      0.764     0.970    0.595      32.1%
Internal data         0.1                                      0.597      0.948      0.971     0.230    0.084      85.9%
                      1           0.69       0.532     0.637   0.923      0.923      0.942     0.252    0.764      17.2%
                      10                                       4.186      0.424      0.292     0.886    3.289      21.4%




4.3.3. Precision-Recall trade-off using 𝐶𝑠𝑐𝑜𝑟𝑒
Cost score's ability to balance false negatives and false positives based on varying cost ratios is further demonstrated by the changes in precision and recall (Figure 5 and Table 6). The results indicate that as
the cost ratio shifts from 1 to 0.1, precision increases while recall decreases. For example, in the case of the UNSW-NB15 data, precision reaches 0.992 at a cost ratio of 0.1, which highlights a reduction in false positives. Conversely, when the cost ratio increases to 10, there is a significant improvement in recall. For example, in the case of the UNSW-NB15 data, the recall at a cost ratio of 10 is improved by 10% compared to that at the best 𝐹1 score threshold, indicating a reduction in false negatives.
When comparing the results of the new cost score at a cost ratio of 1 with those of the 𝐹1 score, the outcomes are similar in most cases. This similarity is evident from the precision-recall curves (Figure 5), where the precision and recall for the optimal 𝐹1 score overlap with those of the new cost score at a cost ratio of 1. This underscores the fact that the 𝐹1 score consistently assigns equal weights to both false negatives and false positives, irrespective of their real-world cost impacts. A notable difference in precision and recall between the 𝐹1 score and the new cost score is observed for the credit card fraud data (Figure 5b) and the internal data (Figure 5e), which can be attributed to the significant class imbalance in these datasets, and also to the fact that 𝐹1 score also depends on the number of true positives and thereby strives for a better recall than 𝐶𝑠𝑐𝑜𝑟𝑒 at cost ratio 1.
   These results demonstrate that the proposed πΆπ‘ π‘π‘œπ‘Ÿπ‘’ metric offers substantial improvements over the
𝐹1 score across a range of cost ratios. Unlike the 𝐹1 score, which always penalizes false positives and
false negatives equally, πΆπ‘ π‘π‘œπ‘Ÿπ‘’ accommodates differences in the costs associated with these errors,
while matching the 𝐹1 score at a cost ratio of 1. The ability to tune a model's operating point to a given
cost ratio, demonstrated in particular by the results on the internal data, makes πΆπ‘ π‘π‘œπ‘Ÿπ‘’ a valuable
metric for adapting models to the varying cost requirements of end users. These findings suggest that
πΆπ‘ π‘π‘œπ‘Ÿπ‘’ can effectively replace the 𝐹1 score for tasks such as model thresholding and selection,
particularly when the costs of false positives and false negatives differ. Figure 6 shows the πΆπ‘ π‘π‘œπ‘Ÿπ‘’
isocost contours overlaid on the precision-recall curve for the UNSW-NB15 dataset; the minimum-cost
operating point is where the curve touches the isocost contour with the lowest cost.
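   To make this thresholding procedure concrete, the sketch below selects an operating threshold from
a precision-recall curve either by maximizing 𝐹1 score or by minimizing πΆπ‘ π‘π‘œπ‘Ÿπ‘’ (using the definition
from Appendix A.1). It is a minimal illustration rather than the paper's implementation: the helper
names (c_score, pick_threshold), the use of scikit-learn's precision_recall_curve, and the validation
labels/scores y_val and p_val are our own assumptions.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def c_score(precision, recall, cost_ratio):
    """C_score as defined in Appendix A.1: (1/Prec - r_c - 1) * R + r_c."""
    precision = np.clip(precision, 1e-12, None)   # guard against division by zero
    return (1.0 / precision - cost_ratio - 1.0) * recall + cost_ratio

def pick_threshold(y_true, y_score, cost_ratio=None):
    """Return the threshold that maximizes F1 (cost_ratio=None)
    or minimizes C_score for the given cost ratio."""
    prec, rec, thresholds = precision_recall_curve(y_true, y_score)
    prec, rec = prec[:-1], rec[:-1]               # align with `thresholds`
    if cost_ratio is None:                        # F1-based thresholding
        f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
        return thresholds[np.argmax(f1)]
    return thresholds[np.argmin(c_score(prec, rec, cost_ratio))]

# Hypothetical usage with validation labels y_val and predicted scores p_val:
# t_f1 = pick_threshold(y_val, p_val)                  # maximizes F1
# t_c  = pick_threshold(y_val, p_val, cost_ratio=10)   # minimizes C_score
```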


Figure 6: πΆπ‘ π‘π‘œπ‘Ÿπ‘’ isocost contours with precision-recall curve for the UNSW-NB15 dataset for three
different cost ratios: (a) πΆπ‘œπ‘ π‘‘π‘…π‘Žπ‘‘π‘–π‘œ = 0.1, (b) πΆπ‘œπ‘ π‘‘π‘…π‘Žπ‘‘π‘–π‘œ = 1, (c) πΆπ‘œπ‘ π‘‘π‘…π‘Žπ‘‘π‘–π‘œ = 10.


5. Conclusions
How organizations handle errors from machine learning models is highly dependent on context and
application. In the cybersecurity domain, the cost of a security analyst's time and effort spent
reviewing and investigating a false positive differs considerably from the cost of a model's failure to
detect a real security incident (a false negative). However, widely used metrics like 𝐹1 score assign them
equal costs. In this paper, we derived a new cost-aware metric, πΆπ‘ π‘π‘œπ‘Ÿπ‘’, defined in terms of precision,
recall, and a cost ratio, which can be used for model evaluation and can serve as a replacement for the
𝐹1 score. In particular, it can be used for thresholding probabilistic classifiers to achieve minimum cost. To
demonstrate the effectiveness of πΆπ‘ π‘π‘œπ‘Ÿπ‘’ in cybersecurity applications, we applied it to threshold models
built on five different datasets assuming multiple cost ratios. The results showed substantial savings in
cost through the use of πΆπ‘ π‘π‘œπ‘Ÿπ‘’ over 𝐹1 score. At a cost ratio of 1, the results are similar; however, as
the cost ratio is increased or decreased, the gap in cost between πΆπ‘ π‘π‘œπ‘Ÿπ‘’ and 𝐹1 score widens. All
datasets showed consistent improvements in cost. Through this work, we hope to raise awareness among
machine learning practitioners building cybersecurity applications regarding the use of cost-aware
metrics such as πΆπ‘ π‘π‘œπ‘Ÿπ‘’ instead of cost-oblivious ones like 𝐹1 score.


Acknowledgments
We thank the anonymous reviewers and the CAMLIS 2024 attendees for their feedback.


References
 [1] C. J. Van Rijsbergen, Information Retrieval, 2nd ed., Newton, MA, 1979.
 [2] S. Puthiya Parambath, N. Usunier, Y. Grandvalet, Optimizing f-measures by cost-sensitive classifi-
     cation, Advances in neural information processing systems 27 (2014).
 [3] P. D. Turney, Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree
     induction algorithm, Journal of artificial intelligence research 2 (1994) 369–409.
 [4] M. Kukar, I. Kononenko, et al., Cost-sensitive learning with neural networks., in: ECAI, volume 15,
     Citeseer, 1998, pp. 88–94.
 [5] P. Domingos, Metacost: A general method for making classifiers cost-sensitive, in: Proceedings of
     the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, 1999,
     pp. 155–164.
 [6] C. Elkan, The foundations of cost-sensitive learning, in: International joint conference on artificial
     intelligence, volume 17, Lawrence Erlbaum Associates Ltd, 2001, pp. 973–978.
 [7] V. S. Sheng, C. X. Ling, Thresholding for making classifiers cost-sensitive, in: Aaai, volume 6,
     2006, pp. 476–481.
 [8] B. Krishnapuram, S. Yu, R. B. Rao, Cost-sensitive machine learning, CRC Press, 2011.
 [9] D. Hand, P. Christen, A note on using the f-measure for evaluating record linkage algorithms,
     Statistics and Computing 28 (2018) 539–547.
[10] D. M. Powers, Evaluation: from precision, recall and f-measure to roc, informedness, markedness
     and correlation, arXiv preprint arXiv:2010.16061 (2020).
[11] M. Sitarz, Extending f1 metric, probabilistic approach, arXiv preprint arXiv:2210.11997 (2022).
[12] B. W. Matthews, Comparison of the predicted and observed secondary structure of t4 phage
     lysozyme, Biochimica et Biophysica Acta (BBA)-Protein Structure 405 (1975) 442–451.
[13] D. Chicco, G. Jurman, The advantages of the matthews correlation coefficient (mcc) over f1 score
     and accuracy in binary classification evaluation, BMC genomics 21 (2020) 1–13.
[14] W. Lee, W. Fan, M. Miller, S. J. Stolfo, E. Zadok, Toward cost-sensitive modeling for intrusion
     detection and response, Journal of computer security 10 (2002) 5–22.
[15] M. Liu, L. Miao, D. Zhang, Two-stage cost-sensitive learning for software defect prediction, IEEE
     Transactions on Reliability 63 (2014) 676–686.
[16] I. Bruha, S. Kočková, A support for decision-making: Cost-sensitive learning system, Artificial
     Intelligence in Medicine 6 (1994) 67–82.
[17] Wikipedia, Hyperparameter tuning, Accessed: May 2024. URL: https://en.wikipedia.org/wiki/
     Hyperparameter_optimization.
[18] N. Moustafa, J. Slay, UNSW-NB15: a comprehensive data set for network intrusion detection
     systems (unsw-nb15 network data set), in: 2015 military communications and information systems
     conference (MilCIS), IEEE, 2015, pp. 1–6.
[19] N. Moustafa, J. Slay, The evaluation of network anomaly detection systems: Statistical analysis of
     the unsw-nb15 data set and the comparison with the kdd99 data set, Information Security Journal:
     A Global Perspective 25 (2016) 18–31.
[20] scikit-learn, Real world datasets, Accessed: May 2024. URL: https://scikit-learn.org/stable/datasets/
     real_world.html.
[21] G. Pang, C. Shen, A. Van Den Hengel, Deep anomaly detection with deviation networks, in:
     Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data
     mining, 2019, pp. 353–362.
[22] A. Aljofey, Q. Jiang, A. Rasool, H. Chen, W. Liu, Q. Qu, Y. Wang, An effective detection approach
     for phishing websites using url and html features, Scientific Reports 12 (2022) 8842.
A. Appendix
A.1. Slopes of the 𝐹1 score and πΆπ‘ π‘π‘œπ‘Ÿπ‘’ isocurves
𝐹1 score is defined as:
\[
F_1 = \frac{2 \cdot Prec \cdot R}{Prec + R}
\]
Rearranging:
\[
Prec = \frac{F_1 \cdot R}{2R - F_1}
\]
The slope of the 𝐹1 isocurves can be calculated as:
\[
\frac{\partial Prec}{\partial R} = \frac{F_1}{2R - F_1} - \frac{2 R F_1}{(2R - F_1)^2} = \frac{-F_1^2}{(2R - F_1)^2}
\]
Thus, the slope of the 𝐹1 isocurves is always negative.
   πΆπ‘ π‘π‘œπ‘Ÿπ‘’ is defined as:
                                               1
                                  πΆπ‘ π‘π‘œπ‘Ÿπ‘’ = (       βˆ’ π‘Ÿπ‘ βˆ’ 1) Β· 𝑅 + π‘Ÿπ‘
                                             𝑃 π‘Ÿπ‘’π‘
It can be rearranged as:
                                                       𝑅
                                   𝑃 π‘Ÿπ‘’π‘ =
                                           πΆπ‘ π‘π‘œπ‘Ÿπ‘’ + 𝑅(π‘Ÿπ‘ + 1) βˆ’ π‘Ÿπ‘
Slope of the πΆπ‘ π‘π‘œπ‘Ÿπ‘’ isocurves can be computed as:

                  πœ•π‘ƒ π‘Ÿπ‘’π‘              1                    (π‘Ÿπ‘ + 1)𝑅
                         =                         βˆ’
                   πœ•π‘…      πΆπ‘ π‘π‘œπ‘Ÿπ‘’ + 𝑅(π‘Ÿπ‘ + 1) βˆ’ π‘Ÿπ‘ (πΆπ‘ π‘π‘œπ‘Ÿπ‘’ + 𝑅(π‘Ÿπ‘ + 1) βˆ’ π‘Ÿπ‘ )2
                                   πΆπ‘ π‘π‘œπ‘Ÿπ‘’ βˆ’ π‘Ÿπ‘
                         =
                           (πΆπ‘ π‘π‘œπ‘Ÿπ‘’ + 𝑅(π‘Ÿπ‘ + 1) βˆ’ π‘Ÿπ‘ )2

As described in Section 3.4, these curves can have negative, positive or zero slopes depending on the
value of πΆπ‘ π‘π‘œπ‘Ÿπ‘’ .
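   The two isocurve slopes above are easy to verify symbolically. The following sketch is not part of
the paper's code; it simply assumes SymPy is available and checks both derivatives:

```python
import sympy as sp

R, F1, Cs, rc = sp.symbols('R F1 Cs r_c', positive=True)

# F1 isocurve: Prec = F1*R / (2R - F1); its slope should be -F1^2 / (2R - F1)^2.
prec_f1 = F1 * R / (2 * R - F1)
assert sp.simplify(sp.diff(prec_f1, R) + F1**2 / (2 * R - F1)**2) == 0

# C_score isocurve: Prec = R / (Cs + R*(rc + 1) - rc);
# its slope should be (Cs - rc) / (Cs + R*(rc + 1) - rc)^2.
prec_cs = R / (Cs + R * (rc + 1) - rc)
assert sp.simplify(sp.diff(prec_cs, R) - (Cs - rc) / (Cs + R * (rc + 1) - rc)**2) == 0

print("Both isocurve slopes match the closed-form expressions.")
```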

A.2. Improvements in cost for different cost ratios
Figure 7 depicts the improvement in cost for different values of the cost ratio. The x-axis of each plot
is log10(cost_ratio) and the y-axis is the percentage improvement in cost compared to the threshold
selected using 𝐹1 score. The general trend across all datasets is that the improvement is smallest near
a cost ratio of one, where πΆπ‘ π‘π‘œπ‘Ÿπ‘’ behaves similarly to 𝐹1 score, and increases as the cost ratio moves
away from one in either direction. The size of the improvement at higher cost ratios depends on the
proportion of the positive class (class 1) in the data. Since this proportion is very small for the credit
card fraud dataset, there is only a slight increase in cost savings for cost ratios greater than one
(Figure 7b).
Figure 7: Percentage improvement in cost for the threshold obtained by minimizing the new cost score
compared to the threshold obtained from maximizing 𝐹1 score: (a) UNSW-NB15, (b) Credit card fraud,
(c) Phishing data, (d) KDD cup 99, (e) Internal data.
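   A curve like those in Figure 7 could be generated along the following lines. This is a sketch under
our own assumptions rather than the paper's pipeline: it reuses the hypothetical pick_threshold helper
from the sketch in Section 4.3.3, and it evaluates cost as (FP + r_c*FN)/P, which is what the Appendix
A.1 definition of πΆπ‘ π‘π‘œπ‘Ÿπ‘’ reduces to after substituting Prec = TP/(TP+FP) and R = TP/(TP+FN); the paper
may instead select thresholds on a validation split and report costs on a separate test split.

```python
import numpy as np

def cost_at(y_true, y_score, threshold, cost_ratio):
    """Cost of the hard predictions at `threshold`, computed as (FP + r_c*FN)/P,
    i.e., the Appendix A.1 C_score expressed in terms of error counts."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    p = int(np.sum(y_true == 1))
    return (fp + cost_ratio * fn) / p

def cost_improvement_curve(y_true, y_score, cost_ratios):
    """Percentage cost improvement of C_score-based over F1-based thresholding
    for each cost ratio (cf. Figure 7). Reuses pick_threshold from the earlier sketch."""
    t_f1 = pick_threshold(y_true, y_score)   # F1-optimal threshold (cost-ratio independent)
    improvements = []
    for r_c in cost_ratios:
        t_c = pick_threshold(y_true, y_score, cost_ratio=r_c)
        cost_f1 = cost_at(y_true, y_score, t_f1, r_c)
        cost_c = cost_at(y_true, y_score, t_c, r_c)
        improvements.append(100.0 * (cost_f1 - cost_c) / cost_f1)
    return improvements

# Hypothetical usage: cost ratios from 0.1 to 10 on a log scale.
# ratios = np.logspace(-1, 1, num=21)
# curve = cost_improvement_curve(y_val, p_val, ratios)
```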