=Paper=
{{Paper
|id=Vol-3920/paper11
|storemode=property
|title=Is F1 Score Suboptimal for Cybersecurity Models? Introducing Cscore, a Cost-Aware Alternative for Model Assessment
|pdfUrl=https://ceur-ws.org/Vol-3920/paper11.pdf
|volume=Vol-3920
|authors=Manish Marwah,Asad Narayanan,Stephan Jou,Martin Arlitt,Maria Pospelova
|dblpUrl=https://dblp.org/rec/conf/camlis/MarwahNJAP24
}}
==Is F1 Score Suboptimal for Cybersecurity Models? Introducing Cscore, a Cost-Aware Alternative for Model Assessment==
Is F1 Score Suboptimal for Cybersecurity Models? Introducing C_score, a Cost-Aware Alternative for Model Assessment

Manish Marwah (1,*), Asad Narayanan (2), Stephan Jou (2), Martin Arlitt (2) and Maria Pospelova (2)
1 OpenText, USA
2 OpenText, Canada

CAMLIS'24: Conference on Applied Machine Learning for Information Security, October 24-25, 2024, Arlington, VA
* Corresponding author.
Emails: mmarwah@opentext.com (M. Marwah); anarayanan@opentext.com (A. Narayanan); sjou@opentext.com (S. Jou); marlitt@opentext.com (M. Arlitt); mpospelova@opentext.com (M. Pospelova)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

The costs of the errors made by machine learning classifiers, namely false positives and false negatives, are not equal and are application dependent. For example, in cybersecurity applications, the cost of not detecting an attack is very different from that of marking a benign activity as an attack. Various design choices during machine learning model building, such as hyperparameter tuning and model selection, allow a data scientist to trade off between these two errors. However, most commonly used metrics for evaluating model quality, such as F1 score, which is defined in terms of model precision and recall, treat both errors equally, making it difficult for users to optimize for their actual cost. In this paper, we propose a new cost-aware metric, C_score, based on precision and recall that can replace F1 score for model evaluation and selection. It includes a cost ratio that takes into account the differing costs of handling false positives and false negatives. We derive and characterize the new cost metric and compare it to F1 score. Further, we use this metric to threshold models on five cybersecurity related datasets for multiple cost ratios. The results show an average cost savings of 49%.

Keywords: machine learning, cybersecurity, F1 score, C_score, misclassification, cost-sensitive machine learning, false positive, false negative

1. Introduction

Applications of machine learning in cybersecurity are widespread and rapidly growing, with models being deployed to prevent, detect, and respond to threats such as malware, intrusion, fraud, and phishing. The main metric for assessing the performance of classification models is F1 score [1], also known as F1 measure, which is the harmonic mean of precision and recall. While F1 score is used for assessing models, it is not directly used as a loss function since it is neither differentiable nor convex. A simpler and commonly used approach is a two-stage optimization process: first, a model is trained using a conventional loss function such as cross-entropy, and then an optimal threshold is selected based on F1 score [2].

F1 score works particularly well in the highly imbalanced settings prevalent in cybersecurity. However, it treats both kinds of errors a machine learning classifier can make, false positives (FPs) and false negatives (FNs), equally. Usually, the cost of these errors is unequal and depends on the application context. For example, in cybersecurity applications, while both errors can have severe negative consequences, one may be preferred over the other. Specifically, false positives lead to alarm fatigue, a phenomenon where a high frequency of false alarms causes operators to ignore or dismiss all alarms. This problem is often exacerbated by the base rate fallacy, where people underestimate the potential volume of false positives because the true positive rate is high, while ignoring a low base rate. (Even when the true positive rate, P(A|V), is high, the probability that an alarm corresponds to a real threat or vulnerability, P(V|A), is usually very low. This follows directly from Bayes rule, P(V|A) ∝ P(A|V) · P(V), and the fact that the base rate, P(V), is usually very low.) False negatives, on the other hand, imply that a vulnerability or attack has gone undetected.
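To make the base rate fallacy concrete, the short sketch below plugs hypothetical numbers into Bayes rule; the true positive rate, false positive rate, and base rate used here are illustrative assumptions, not values from the paper.

```python
# Illustrative base-rate calculation (hypothetical numbers, not from the paper).
# Even a detector with a 99% true positive rate and a 1% false positive rate
# yields mostly false alarms when the base rate P(V) is very low.
tpr = 0.99      # P(A|V), assumed
fpr = 0.01      # P(A|not V), assumed
p_v = 0.001     # base rate P(V), assumed

p_alarm = tpr * p_v + fpr * (1 - p_v)     # P(A), total probability of an alarm
p_v_given_alarm = tpr * p_v / p_alarm     # P(V|A), by Bayes rule

print(f"P(V|A) = {p_v_given_alarm:.3f}")  # ~0.090: only about 9% of alarms are real
```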
While ideally one would want to minimize both of these errors, in practice there is a trade-off between the number of FPs and FNs. An organization, based on its goals and requirements, may assign differing costs to these errors. For example, for a ransomware detection model the damage caused by a FN may be several orders of magnitude greater than the cost of a security analyst handling a FP, while for other models, e.g., port scanning detection, the costs may be similar or even higher for handling a FP. When F1 score is used, such cost considerations are usually ignored. (A weighted version of F1 score exists; however, it is rarely used, since it is not obvious how precision and recall should be weighted to incorporate the differing costs of FNs and FPs.) So a natural question to ask is: given the cost difference (or ratio) between the consequences of a FP and a FN for a particular use case, how can an organization incorporate that information while building machine learning models for that application?

There is considerable prior work on cost-sensitive learning [3, 4, 5, 6, 7, 8]. These approaches aim to modify the model learning process, e.g., by altering the loss function to incorporate cost, adding weights to the training samples, or readjusting class frequencies in the training set, so that the trained model intrinsically produces cost-sensitive results. In this paper, we do not change the underlying learning process and instead propose a new cost-aware metric as a replacement for F1 score that can be used for model thresholding, comparison, and selection. It is defined in terms of recall, precision, and a cost ratio, and can be used, for example, to determine the minimum cost point on a precision-recall curve. We applied the new metric, called cost score, C_score, to several cybersecurity related datasets and found significant cost differences between using F1 score and C_score. While cost score applies to any classification problem, it is especially relevant in cybersecurity, where the mismatch in the costs of misclassification can be large. The main purpose of cost score is to make it easier for practitioners to incorporate cost during model thresholding and selection. It is an easy replacement for F1 score since C_score is also defined in terms of precision and recall (plus an additional cost ratio).

The key contributions of the paper are:
• Introduction of a new cost-based metric, C_score, defined as (1/Precision − 1 − r_c) · Recall + r_c, where r_c is the cost ratio, which incorporates the differing costs of misclassification and can be used as a cost-aware alternative to F1 score.
• Characterization and derivation of the new metric, and its comparison with F1 score.
• Application of C_score to five cybersecurity related datasets, four of which are publicly available and one private, for multiple values of the cost ratio. The results show a cost saving of up to 86%, with an average saving of 49%, over using F1 score in situations where costs are unequal.
2. Related Work

2.1. Drawbacks of F1 score and alternatives

While F1 score is preferable to accuracy, precision, or recall alone, especially for an imbalanced dataset, its primary drawback in our context is that all misclassifications are treated as equal [9]. Its other drawbacks [10, 11] include 1) lack of symmetry with respect to class labels, e.g., changing which class is treated as positive in a binary classifier produces a different result; and 2) no dependence on true negatives. A more robust, though less popular, alternative that addresses some of these problems while still working well for imbalanced datasets is the Matthews Correlation Coefficient (MCC) [12], which in many cases is preferred to F1 score [13]. It is symmetric and produces a high score only if all four confusion matrix entries (see Table 2) show good results [13]. However, it treats FPs and FNs equally. Unlike F1 score and MCC, our proposed metric is not symmetric with respect to FNs and FPs, taking their distinct impacts into account through a cost ratio. Further, like MCC but unlike F1 score, our metric is symmetric in its treatment of true positives and true negatives. Our metric is not normalized like MCC and F1 score; it varies between 0 (best) and ∞ (worst). This does not affect model thresholding or comparison; however, the value of the cost metric is not very meaningful in itself, but can be converted to the corresponding recall and precision values. Since neither MCC nor F1 score considers differing costs of errors, and the latter is more widely used, we compare our proposed metric with F1 score in the rest of the paper.
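The claim that F1 score and MCC treat FPs and FNs interchangeably can be checked directly from their confusion-matrix definitions. The sketch below is illustrative only: the two confusion matrices and the per-error costs are assumed numbers, and the metric formulas are written out explicitly rather than taken from any library.

```python
from math import sqrt

def f1(tp, fp, fn, tn):
    # F1 = 2*TP / (2*TP + FP + FN): depends on FP and FN only through their sum.
    return 2 * tp / (2 * tp + fp + fn)

def mcc(tp, fp, fn, tn):
    # Matthews Correlation Coefficient, also unchanged when FP and FN are swapped.
    num = tp * tn - fp * fn
    den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den

# Two hypothetical classifiers that swap the number of FPs and FNs.
a = dict(tp=90, fp=10, fn=100, tn=9800)
b = dict(tp=90, fp=100, fn=10, tn=9800)

print(f1(**a), f1(**b))     # identical F1 scores
print(mcc(**a), mcc(**b))   # identical MCC scores

# With unequal per-error costs the two classifiers are far from equivalent.
c_fp, c_fn = 1.0, 10.0      # assumed costs of a FP and a FN
print(c_fp * a["fp"] + c_fn * a["fn"])   # 1010.0
print(c_fp * b["fp"] + c_fn * b["fn"])   # 200.0
```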
2.2. Cost sensitive learning and assessment

Since in real-world applications cost differences between types of errors can be large, cost-sensitive machine learning has been an active area of research for the past few decades [3, 4, 5, 7], especially in areas such as security [14, 15] and medicine [16]. For example, Lee et al. [14] proposed cost models for intrusion detection, and Liu et al. [15] incorporate cost considerations in both feature selection and classification for software defect prediction. Some of this and similar work could be used to estimate cost ratios for our proposed cost metric. At a high level, cost-sensitive machine learning [8] can be categorized into two approaches: 1) the machine learning methods themselves are modified to incorporate the unequal costs of errors; or 2) existing machine learning models, trained with cost-oblivious methods, are converted into cost-sensitive ones using a wrapper [5, 7]. In this paper, we focus on the second approach, also referred to as cost-sensitive meta learning. While there are various ways to implement this approach, we focus on thresholding, or threshold adjusting [7], where the decision threshold of a probabilistic model is selected based on a cost function. Sheng et al. [7] showed that thresholding outperforms several other cost-sensitive meta learning methods such as MetaCost [5]. In the most general case, the cost function for thresholding can be constructed from the entries of a confusion matrix with a weight attached to each of them, that is, FPs, FNs, TPs and TNs [6]. Our proposed cost metric uses a similar formulation; however, it is expressed in terms of precision and recall, metrics that data scientists already know and understand well. We are not aware of any existing cost metric defined in terms of precision, recall, and a cost ratio. Unlike F1 score or MCC, the proposed metric is directly proportional to the total cost of misclassification. We believe it can serve as a cost-aware replacement for F1 score or MCC.

3. Proposed Metric: Cost Score

While the proposed metric is applicable to any machine learning classification model, including multiclass and multilabel settings, for simplicity we assume a binary classification task in the following discussion. The notation used is summarized in Table 1. Starting with the cost of misclassifications, we derive expressions for cost score that can replace F1 score. In particular, we derive two equivalent expressions: one in terms of TPR (recall) and FPR, and the other in terms of precision and recall. Both include an error cost ratio (r_c), the ratio of the cost of a FN to that of a FP. The first expression depends on the base rate (P(V)), while the second, like F1 score, does not directly depend on it.

The basic evaluation metrics for a binary classifier can be defined from a confusion matrix, shown in Table 2. One can also view a confusion matrix from a probabilistic perspective, where the four possible outcomes define a probability space, with each outcome a joint probability, as shown in Table 3. The total probability along a row or a column gives the corresponding marginal probability.

Table 1: Notation
| Symbol | Description |
| ¬ | logical not |
| V | vulnerability or threat, or in general the positive class |
| A | positive classification by a detector, which may result in an alarm |
| TP | true positive |
| TN | true negative |
| FP | false positive |
| FN | false negative |
| N | total number of data points |
| N_FP | number of false positives |
| N_FN | number of false negatives |
| p | total number of positives |
| p̂ | total number of predicted positives |
| n | total number of negatives |
| n̂ | total number of predicted negatives |
| C_FP | cost of a false positive |
| C_FN | cost of a false negative |
| C | total cost of misclassification |
| r_c | error cost ratio, defined as C_FN / C_FP |
| R | recall |
| Prec | precision |

Table 2: Confusion Matrix
|  | Ground truth V (or T) | Ground truth ¬V (or F) | Total |
| Prediction A (or T) | TP | FP | p̂ |
| Prediction ¬A (or F) | FN | TN | n̂ |
| Total | p | n |  |

Table 3: Confusion Matrix, probabilistic view
|  | Ground truth V | Ground truth ¬V | Total |
| Prediction A | P(A, V) | P(A, ¬V) | P(A) |
| Prediction ¬A | P(¬A, V) | P(¬A, ¬V) | P(¬A) |
| Total | P(V) | P(¬V) |  |

Conditional probabilistic definitions of classifier metrics:
• False positive rate (FPR): P(A|¬V)
• False negative rate (FNR): P(¬A|V)
• True positive rate, i.e., recall (TPR): P(A|V)
• True negative rate (TNR): P(¬A|¬V)
• Precision: P(V|A)
• False discovery rate (1 − precision): P(¬V|A)
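As a quick reference for how the quantities in Tables 1-3 relate, the snippet below derives the conditional metrics from raw confusion-matrix counts; it is a minimal sketch with hypothetical counts, using only the definitions above.

```python
# Derive the conditional metrics of Tables 1-3 from raw counts (hypothetical values).
tp, fp, fn, tn = 80, 40, 20, 9860
n_total = tp + fp + fn + tn

tpr  = tp / (tp + fn)    # recall, P(A|V)
fpr  = fp / (fp + tn)    # P(A|not V)
fnr  = fn / (tp + fn)    # P(not A|V)
tnr  = tn / (fp + tn)    # P(not A|not V)
prec = tp / (tp + fp)    # precision, P(V|A)
fdr  = fp / (tp + fp)    # false discovery rate, P(not V|A)
base_rate = (tp + fn) / n_total   # P(V)

print(tpr, fpr, prec, base_rate)
```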
3.1. Cost function

The cost incurred as a result of misclassification is composed of the cost of false positives and the cost of false negatives. P(A, ¬V) and P(¬A, V) represent the probabilities of a false positive and a false negative, respectively. Thus, their numbers can be expressed as:

N_FP = N · P(A, ¬V)
N_FN = N · P(¬A, V)

Multiplying by the corresponding costs gives the total cost of errors:

C = C_FP · N_FP + C_FN · N_FN
  = C_FP · N · P(A, ¬V) + C_FN · N · P(¬A, V)

Factoring out the common terms and introducing a cost ratio, r_c = C_FN / C_FP, gives:

C = C_FP · N · [P(A, ¬V) + r_c · P(¬A, V)]    (1)
  = K · [P(A, ¬V) + r_c · P(¬A, V)]    (2)

3.2. Cost score in terms of TPR and FPR

Here we model the cost function in terms of TPR (recall) and FPR. Data scientists frequently evaluate a model in terms of TPR, the fraction of positive cases detected, and FPR, the fraction of negatives misclassified as positives. In fact, an ROC curve (a plot of TPR against FPR) is widely used for thresholding a probabilistic classifier. Using the product rule, we rewrite the probability of a false positive in terms of FPR and P(V):

P(A, ¬V) = P(A|¬V) · P(¬V)    (3)
         = FPR · (1 − P(V))    (4)

Similarly, we rewrite the joint probability of a false negative in terms of TPR and P(V):

P(¬A, V) = P(¬A|V) · P(V)    (5)
         = (1 − P(A|V)) · P(V)    (6)
         = (1 − TPR) · P(V)    (7)

Substituting (4) and (7) into the cost expression (2) and rearranging, we get:

C = K · [FPR + P(V) · (r_c − r_c · TPR − FPR)]

To minimize the cost, we can ignore K, and thus the cost score becomes:

C_score = FPR + P(V) · (r_c − r_c · TPR − FPR)    (8)
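Equation (8) translates directly into code, which is useful when a model is summarized in ROC terms (TPR, FPR) together with an estimate of the base rate; the sketch below is minimal, and the function name and example values are assumptions for illustration.

```python
def cscore_tpr_fpr(tpr: float, fpr: float, base_rate: float, rc: float) -> float:
    """Cost score of Equation (8): FPR + P(V) * (rc - rc*TPR - FPR)."""
    return fpr + base_rate * (rc - rc * tpr - fpr)

# Example with assumed values: a detector with TPR=0.95, FPR=0.02,
# base rate P(V)=0.01, and false negatives 10x as costly as false positives.
print(cscore_tpr_fpr(tpr=0.95, fpr=0.02, base_rate=0.01, rc=10.0))  # 0.0248
```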
3.3. Cost score in terms of precision and recall

While FPR is a useful metric since it captures the number of false positives, it can be tricky to interpret, especially when the base rate, P(V), is low, which is usually the case in cybersecurity problems. For problems such as intrusion or threat detection, FPs add overhead to the workflow of a security analyst. For phishing website detection, a FP may result in a website being blocked in error for an end user. In either case, setting a target FPR requires knowledge of the base rate and would change as the base rate changes. In other words, even a seemingly low FPR may not be good enough, given a low base rate. Further, variance in the base rate would affect the overhead of a security analyst in the case of intrusion detection, or the fraction of erroneously blocked websites in the case of phishing detection, even if the FPR stays constant. Precision, on the other hand, directly captures the operator overhead or the fraction of erroneously blocked websites independent of the base rate. A main attraction of F1 score is its use of precision instead of FPR. When the costs of a FP and a FN are similar, F1 score is an effective evaluation metric; however, with unequal costs of misclassification, we can usually find a better solution by incorporating the cost differential in the metric. Below, we derive an expression for C_score in terms of precision and recall, similar to F1 score, but one that also includes a cost ratio.

We can rewrite the probability of a false positive in terms of precision (Prec) and the marginal probability of an alarm:

P(A, ¬V) = P(¬V|A) · P(A)    (9)
         = (1 − Prec) · P(A)    (10)

P(A) can be expressed in terms of P(V), Prec and R (recall) using Bayes rule:

P(V|A) · P(A) = P(A|V) · P(V)
P(A) = (P(A|V) / P(V|A)) · P(V) = (R / Prec) · P(V)

Substituting this value of P(A) into Equation (10), we get:

P(A, ¬V) = ((1 − Prec) / Prec) · R · P(V)    (11)

As in the previous section (Equation (7)), the probability of a false negative can be written as:

P(¬A, V) = (1 − R) · P(V)    (12)

Therefore, substituting the probabilities of a false positive and a false negative from Equations (11) and (12), respectively, into the cost expression (Equation (1)), we get:

C_score = N · C_FP · P(V) · [((1 − Prec) / Prec) · R + r_c · (1 − R)]

Since N, C_FP and P(V) are constant for a given dataset, we can rewrite the cost expression as:

C_score = (1/Prec − 1) · R + r_c · (1 − R)    (13)

This expression defines the cost in terms of precision, recall and the cost ratio, and can be used instead of F1 score for any task that requires model comparison, such as model thresholding, hyperparameter tuning, model selection and feature selection. C_score goes to zero for Prec = 1 and R = 1, as expected. As Prec → 0 (for a fixed R > 0), C_score → ∞.

We have derived two equivalent cost expressions: one involving TPR and FPR (the quantities used in an ROC curve) and the other involving precision and recall (the quantities used in computing F1 score). Similarly, it may be possible to derive additional equivalent cost expressions in terms of other commonly used metrics. In the remainder of the paper, we only consider the cost expression C_score defined in terms of precision and recall (similar to F1 score). This definition of C_score does not directly depend on the base rate (P(V)), unlike the one in the previous section.
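Equation (13) translates into a few lines of code. The sketch below is a minimal illustration: the function names and the example precision/recall operating points are assumptions, chosen only to show that two points with the same F1 score can have very different costs when the cost ratio is not 1.

```python
def cscore(precision: float, recall: float, rc: float) -> float:
    """Cost score of Equation (13): (1/Prec - 1) * R + rc * (1 - R).

    rc is the error cost ratio C_FN / C_FP. Lower is better; 0 means no errors.
    """
    return (1.0 / precision - 1.0) * recall + rc * (1.0 - recall)

def f1(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall, for comparison.
    return 2 * precision * recall / (precision + recall)

# Two hypothetical operating points with the same F1 score but different costs
# when false negatives are 10x as expensive as false positives (rc = 10).
print(f1(0.9, 0.6), cscore(0.9, 0.6, rc=10))   # ~0.72, ~4.07
print(f1(0.6, 0.9), cscore(0.6, 0.9, rc=10))   # ~0.72, ~1.60
```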
3.4. C_score Isocost Contours

To better understand the cost score metric, we examine its dependence on precision and recall and compare it with F1 score. Figure 1 shows a precision-recall (PR) plot with F1 score isocurves, or contours. Each curve corresponds to a constant value of F1 score, specified next to the curve. If recall and precision are identical, F1 score computes to that same value. However, if there is a wide gap between them, F1 tends to be closer to the lower value, as can be seen in the top-left and bottom-right regions of the plot. As expected, the highest (best) value contours are towards the top-right corner of the plot (that is, towards perfect recall and precision). Further, the slope of the curves is always negative (as shown in Appendix A), implying there is always a trade-off between recall and precision.

Figure 1: Precision-Recall plot of F1 score isocurves.

We can similarly obtain isocost curves (contour lines) for cost score assuming a particular cost ratio, r_c. The cost score (Equation (13)) can be written as:

C_score = (1/Prec − 1 − r_c) · R + r_c    (14)

and plotted for constant values of C_score on a PR plot. Figure 2 shows the isocost curves for three cost ratios: r_c = 1, that is, a FN and a FP cost the same; r_c = 10, that is, FNs are ten times as expensive as FPs; and r_c = 0.1, that is, FNs are one-tenth as expensive as FPs. There are three distinct regions in the plot, based on the slope of the curves. From the above equation, we can compute the slope (see Appendix A for details):

dPrec/dR = (C_score − r_c) / (C_score + R · (r_c + 1) − r_c)²

Depending on the value of C_score, the slope can be positive, negative or zero:

dPrec/dR > 0 if C_score > r_c
dPrec/dR < 0 if C_score < r_c
dPrec/dR = 0 if C_score = r_c

For lower (better) values of C_score, when C_score < r_c, the slope is negative and the isocost curves are similar to the isocurves for F1 score. The horizontal line corresponds to C_score = r_c, and the curves below it have a positive slope with C_score > r_c. The isocurves closest to the top-right corner have the lowest costs.

While the isocost contours are plotted assuming Prec and R are independent, that is obviously not the case for a particular model. In fact, Prec, i.e., P(V|A), and R, i.e., P(A|V), are related by Bayes rule: Prec = (P(V) / P(A)) · R. The feasible Prec-R pairs obtained by varying model thresholds are given by a PR curve. A hypothetical PR curve is shown as a dotted black line in Figure 2. The cost corresponding to each point on the PR curve is given by the isocost contour intersecting that point. The minimum cost point on the PR curve is the one that intersects the lowest cost contour. If the PR curve is convex, the minimum cost contour will touch the PR curve at only one point, where their tangents have equal slope. (Under the assumption of convexity, this can be proved by contradiction: assume the minimum isocost touches the PR curve at two or more points; since both functions are convex, there must then be another, lower cost isocost touching the PR curve at at least one point. Thus, the lowest cost isocost must touch the PR curve at exactly one point.) However, in practice empirically constructed PR curves are not always convex, and thus the minimum cost point may not be unique. In Figure 2, points A, B and C approximately show the minimum cost point for the three cost ratios.

Figure 2: Isocost contours for C_score for three different cost ratios: (a) r_c = 1, (b) r_c = 10, (c) r_c = 0.1. The C_score corresponding to each contour is listed next to it. The black dotted line is the PR curve for a particular model.

What do isocost contours mean in terms of the confusion matrix? C_score remains constant along a contour and is proportional to FP + r_c · FN, which must remain constant as recall and precision change. In Table 4, we have parameterized the confusion matrix entries with k such that as k changes for a particular r_c, precision and recall vary while C_score remains constant. This can be seen by computing FP + r_c · FN for the table entries, which is (FP′ + r_c · k) + r_c · (FN′ − k) = FP′ + r_c · FN′, independent of k and thus constant.

Table 4: Confusion Matrix, parameterized by k
|  | Ground truth V | Ground truth ¬V | Total |
| Prediction A | TP′ + k | FP′ + r_c · k | p̂ + r_c · k + k |
| Prediction ¬A | FN′ − k | TN′ − r_c · k | n̂ − r_c · k − k |
| Total | p | n |  |
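To see the three slope regimes concretely, Equation (14) can be rearranged to express precision as a function of recall along a contour of constant C_score, Prec = R / (C_score + R · (r_c + 1) − r_c). The sketch below evaluates this at a few recall values for assumed contour levels; it is illustrative only.

```python
def prec_on_contour(recall: float, cscore: float, rc: float) -> float:
    # Equation (14) rearranged: precision along an isocost contour of level `cscore`.
    return recall / (cscore + recall * (rc + 1.0) - rc)

rc = 1.0
for level in (0.5, 1.0, 1.5):   # contour levels below, at, and above rc (assumed)
    pts = [round(prec_on_contour(r, level, rc), 3) for r in (0.5, 0.75, 1.0)]
    print(f"C_score = {level}: precision at R = 0.5, 0.75, 1.0 -> {pts}")

# Illustrative behavior of the output:
#   C_score = 0.5 (< rc): precision decreases with recall (negative slope)
#   C_score = 1.0 (= rc): precision is constant at 0.5   (horizontal contour)
#   C_score = 1.5 (> rc): precision increases with recall (positive slope)
```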
3.5. How does F1 score compare with C_score when r_c = 1?

While F1 score varies from 0 to 1, with 1 indicating perfect performance, C_score is proportional to the actual cost of handling model errors, with zero cost indicating perfect performance (that is, no FPs or FNs). F1 score treats FNs and FPs uniformly, as does C_score when r_c = 1. So a natural question is whether C_score differs from F1 score when r_c = 1. To compare, we transform F1 score into a cost metric:

F1_cost = 1/F1 − 1    (15)

When F1 is 1, F1_cost = 0, and when F1 is 0, F1_cost → ∞; it thus behaves like a cost function and can be directly compared with C_score. To compare C_score and F1_cost, we reduce both to the elements of the confusion matrix and find that:

C_score ∝ FP + FN    (16)
F1_cost ∝ (FP + FN) / TP    (17)

Thus, when r_c = 1, C_score and F1_cost are not identical; while C_score is proportional to the total number of errors, F1_cost is also inversely proportional to the number of true positives. C_score only considers the cost of errors; it assigns zero cost to both TPs and TNs. In that sense, it treats TPs and TNs symmetrically, unlike F1 score.

3.6. Multiclass and multilabel classifiers

While we derived the cost metric assuming a binary classification problem, its extension to multiclass and multilabel classification problems is straightforward. A cost ratio per class would need to be defined. For a multiclass classifier, a user would assign cost ratios considering each class as positive and the rest as negative. Similarly, for a multilabel classifier, a user would assign an independent ratio for each class. This allows a C_score to be computed per class. To compute a single cost metric, the per-class C_score values would need to be aggregated. The simplest aggregation function is an arithmetic mean, although a class-weighted mean based on class importance, or another type of aggregation, e.g., a harmonic mean, can also be used. The "one class versus the rest" approach is similar to how F1 score and other metrics are computed in a multiclass setting.

3.7. Minimizing C_score based on model threshold and other hyperparameters

In Section 3.4, we described the use of isocost contours to visually determine the lowest cost point on a PR curve. In practice, to find the minimum cost based on the model threshold, and the corresponding precision-recall values, precision and recall can be considered functions of the threshold value (t), with the optimal value of t determined by minimizing the cost function with respect to t:

t* = argmin_t [(1/Prec(t) − 1) · R(t) + r_c · (1 − R(t))]

In addition to the model threshold, C_score can also be used for selecting other model hyperparameters, such as the number of neighbors in k-NN; the number of trees, maximum tree depth, etc. in tree-based models; the number and type of layers, activation functions, etc. in neural networks; and for model comparison and selection. Hyperparameter tuning [17] is typically performed using methods such as grid search, random search, or gradient-based optimization, with cross-validation used in conjunction to evaluate the quality of a particular choice on a dataset. In all these methods, the proposed C_score can replace a cost-oblivious metric such as F1 score.
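A minimal sketch of the threshold search above, sweeping the candidate thresholds returned by scikit-learn's precision_recall_curve and picking the one that minimizes C_score; the helper name and the use of precision_recall_curve are implementation choices assumed here, not prescribed by the paper.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def cscore(precision, recall, rc):
    # Equation (13); works elementwise on numpy arrays.
    return (1.0 / precision - 1.0) * recall + rc * (1.0 - recall)

def best_threshold(y_true, y_prob, rc):
    """Return (threshold, C_score) minimizing Equation (13) over candidate thresholds."""
    prec, rec, thr = precision_recall_curve(y_true, y_prob)
    # precision_recall_curve returns len(thr) + 1 precision/recall values;
    # drop the final (recall = 0) point so the arrays align with the thresholds.
    prec, rec = prec[:-1], rec[:-1]
    valid = prec > 0                      # avoid division by zero
    costs = cscore(prec[valid], rec[valid], rc)
    i = int(np.argmin(costs))
    return thr[valid][i], float(costs[i])

# Usage sketch (y_val and p_val assumed to be validation labels and predicted probabilities):
# t_star, c_star = best_threshold(y_val, p_val, rc=10.0)
```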
4. Experimental Evaluation

4.1. Datasets

The datasets used in our experiments were chosen based on their relevance to security and the varying cost of misclassification between target classes. To analyze the impact of costs comprehensively, we selected five datasets: four publicly available and one privately collected. The publicly available datasets are the UNSW-NB15 intrusion detection data, the KDD Cup 99 network intrusion data, credit card transaction data, and phishing URL data.

1. UNSW-NB15 Intrusion Detection Data: This network dataset, developed by the Intelligent Security Group at UNSW Canberra, comprises events categorized into nine distinct types of attacks as well as normal traffic. To suit the experimental requirements of our study, the dataset was transformed into a binary classification setting, where a subset of attack classes (Backdoor, Exploits, Reconnaissance) is consolidated into class 1, while normal traffic is represented as class 0. There are a total of 93,000 events in class 0 and 60,841 events in class 1. For our research, we utilized the CSV version of the dataset, which comes pre-partitioned into training and testing sets [18][19].

2. KDD Cup 99 Network Intrusion Data: This dataset originated from packet traces captured during the 1998 DARPA Intrusion Detection System Evaluation. It encompasses 145,585 unique records categorized into 23 distinct classes, which include various types of attacks alongside normal network traffic. Each record is characterized by 41 features derived from the packet traces. For this research, the dataset was adapted to a binary classification task: class 0 represents normal instances, while class 1 aggregates all attack types. Also, to explore the impact of different thresholds on the model's performance, training was conducted using only 1% of the dataset. The dataset is accessed through the datasets available in the Python sklearn package [20].

3. Credit Card Transactions Data: This dataset contains credit card transaction logs with 29 features, labeled as legitimate or fraudulent transactions. There are a total of 284,315 transactions, out of which 492 are fraudulent (class 1) and 56,866 are legitimate (class 0) [21]. Of the five datasets, this one has the highest skew.

4. Phishing Data: This dataset is a collection of 60,252 webpages along with their URLs and HTML sources. Of these, 27,280 are phishing sites (class 1) and 32,972 are benign (class 0) [22]. We only use the URLs for building the model.

5. Internal Data: This is a private dataset, used within an organization, that represents the results of an extensive audit of vulnerabilities in source code. Each vulnerability is classified into one of two classes, class 0 or class 1 (actual class names are masked for anonymity), by human auditors during the auditing process. The model is trained on this manually audited data and predicts whether a given vulnerability belongs to class 0 or class 1. There are a total of 144,978 instances, of which 18,738 belong to class 1. Each vulnerability has 58 features, which encompass a wide array of metrics generated during the analysis of the codebase.

The information about each dataset is summarized in Table 5. It is important to note that not all the datasets are balanced. For instance, the credit card fraud data has less than 1% of its instances in class 1. Similarly, the internal data has only 15% of its instances in class 1.

Table 5: Summary of datasets
| Dataset | Class 0 instances | Class 1 instances | Number of features |
| UNSW-NB15 | 93,000 | 60,841 | 42 |
| Credit card fraud | 284,315 | 492 | 29 |
| KDD cup 99 | 87,832 | 57,753 | 41 |
| Phishing data | 32,972 | 27,280 | 188 |
| Internal data | 126,240 | 18,738 | 58 |

4.2. Experiment Setup

We train a classification model using a RandomForest algorithm for each dataset. The goal is not to train the best possible model for the dataset, but to obtain a reasonably good model with a probabilistic output. The steps are as follows (a code sketch of this workflow appears after the list):

1. Model Training: A RandomForest classifier is trained on each dataset. Although the training sets have different skews, we effectively used a balanced dataset for training so the classifier gets an equal opportunity to learn both classes.

2. Threshold adjustment using F1 score: The validation dataset is used to identify the best threshold based on the F1 score. The validation dataset was selected by sampling a proportion of the data, ensuring that the class distribution mirrored that of the training data. This approach was taken because the actual skew of classes in production deployment is unknown. However, it is important to note that for actual production systems, the validation set should be representative of the true data distribution. Specifically, for the UNSW-NB15 dataset, the validation set was sampled from events in the test data CSV file.

3. Threshold adjustment using C_score: The predictions of the trained model are analyzed across different cost ratios. Using the validation sets, we apply C_score to determine the optimal threshold for each cost ratio.

4. Comparison: The model's cost with thresholds chosen based on the F1 score is compared against the costs with thresholds chosen using C_score.

This setup allows us to evaluate the effectiveness of C_score in optimizing model performance under varying cost conditions.
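A condensed sketch of steps 1-4 under stated assumptions: the data split (X_train, y_train, X_val, y_val), the balanced-training shortcut via class_weight, and the threshold sweep over the PR curve are simplifications chosen for illustration, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve

def cscore(prec, rec, rc):
    return (1.0 / prec - 1.0) * rec + rc * (1.0 - rec)

def thresholds_and_scores(y_true, y_prob):
    prec, rec, thr = precision_recall_curve(y_true, y_prob)
    prec, rec = prec[:-1], rec[:-1]          # align with thresholds
    keep = prec > 0
    return thr[keep], prec[keep], rec[keep]

# Step 1: train a probabilistic model (X_train, y_train, X_val, y_val assumed to exist).
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)
p_val = clf.predict_proba(X_val)[:, 1]

thr, prec, rec = thresholds_and_scores(y_val, p_val)

# Step 2: best threshold by F1 score.
f1 = 2 * prec * rec / (prec + rec)
t_f1 = thr[np.argmax(f1)]

# Steps 3-4: best threshold by C_score for each cost ratio, and cost comparison.
for rc in (0.1, 1.0, 10.0):
    costs = cscore(prec, rec, rc)
    t_c = thr[np.argmin(costs)]
    cost_at_f1 = costs[np.argmax(f1)]
    improvement = 100.0 * (cost_at_f1 - costs.min()) / cost_at_f1
    print(f"rc={rc}: F1 threshold={t_f1:.2f}, C_score threshold={t_c:.2f}, "
          f"cost reduction={improvement:.1f}%")
```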
4.3. Results

The proposed cost metric is tailored to enhancing the performance of machine learning models in scenarios where the cost of false negatives differs greatly from the cost of false positives. This in turn helps in optimizing predictions based on cost considerations, thereby addressing a critical limitation of existing evaluation methods.

4.3.1. F1 score for thresholding

To illustrate the advantages of our approach, we first use F1 score to adjust a model's threshold. Figure 3 depicts the changes in F1 score for different threshold values across each dataset. The histograms in each plot represent the distribution of data within the corresponding probability intervals; each color in the histogram represents the distribution of the corresponding ground truth class. Due to the significant skew of the credit card dataset, the density of class 1 is not visible in its histogram. The F1 score for each dataset starts from a threshold of 0, where all instances are classified as the positive class and recall is 1. It ends at a score of 0 at a threshold of 1, where all instances are tagged as negatives and recall is 0. The rate of change of the F1 score in a threshold interval is proportional to the proportion of data within the interval and their ground truth values. This explains why, for some datasets, the F1 score is flat or nearly flat in the middle range of thresholds.

It is clear from Figure 3 that most models do a good job of separating the two classes. The model trained on the KDD Cup 99 data separates the two classes most distinctly and has most of the data points near probabilities zero and one. This makes the threshold vs. F1 score curve mostly flat in the middle range of probabilities. The best F1 score is achieved at a threshold of 0.33. Similarly, the phishing, credit card fraud, and intrusion detection datasets perform well on the validation datasets, with well-separated bimodal distributions. The threshold with the highest F1 score is marked with a vertical line in each plot.

Figure 3: Threshold for the best F1 score for each dataset: (a) UNSW-NB15, (b) Credit card fraud, (c) Phishing data, (d) KDD cup 99, (e) Internal data.
For the internal dataset, the model struggles to separate the two classes, as can be seen from the overlap in their probability distributions. The model achieves its best F1 score at a threshold of 0.688. As the threshold moves from 0 to 1 there is a trade-off between FPs and FNs; the maximum F1 score for each model corresponds to the point where the sum of FPs and FNs is minimal while the number of TPs is the highest. Being symmetric in FNs and FPs, F1 score reduces their sum, disregarding any class-specific costs.

4.3.2. C_score for thresholding at different cost ratios

The proposed C_score metric allows the tuning of model parameters based on a cost ratio (the ratio of the cost of a false negative to the cost of a false positive). This cost ratio is variable and depends on the specific impacts these errors have on end users. For example, in scenarios where missing a true attack could lead to significant financial losses, the cost of a false negative is higher. Conversely, in resource-constrained environments, a high rate of false positives can considerably burden the evaluation process. To illustrate the tuning differences, we applied three distinct cost ratios to each dataset: 0.1 (a false positive is ten times more costly than a false negative), 1 (equal cost for false positives and false negatives), and 10 (a false negative is ten times more costly than a false positive). These cost ratios are used solely to demonstrate the model's behavior when tuned with C_score and may not correspond to practical applications of the data.

Figure 4 displays the optimal thresholds derived from C_score for each dataset across these cost ratios. The histograms in each plot show the distribution of the ground truth classes within the probability intervals. C_score reflects the classification cost, resulting in a curve shape that is the inverse of the F1 score curve, with the optimal threshold at the minimum C_score. Similar to the F1 score plots, the flat portions of the C_score curve correspond to probability intervals with few data points. At a threshold of 0, C_score is constant regardless of the cost ratio, since the recall is 1 and C_score equals 1/Prec − 1.

Table 6 summarizes the experimental results, comparing C_score at thresholds chosen by maximizing F1 score with C_score at thresholds chosen by minimizing C_score for each of the cost ratios. There is only one threshold per dataset based on the best F1 score, but the threshold based on C_score varies with the cost ratio. Although the actual cost is a multiple of C_score, the percentage improvement over the F1 score reflects the reduction in actual cost.

Figure 4: Variation of thresholds with different cost ratios for each dataset: (a) UNSW-NB15, (b) Phishing data, (c) Credit card fraud, (d) KDD cup 99, (e) Internal data.

For the UNSW-NB15 dataset, the optimal threshold is 0.89 for a cost ratio of 0.1, minimizing false positives at the expense of some true positives becoming false negatives (Figure 4a). At a cost ratio of 1, the threshold decreases to 0.65, balancing false positives and false negatives and aligning closely with the best F1 score threshold. At a cost ratio of 10, the threshold further decreases to 0.42, significantly reducing false negatives despite an increase in false positives. In the phishing dataset, the probability distributions of the two classes are similar (Figure 4b).
At a cost ratio of 1, the threshold is 0.54, identical to the best F1 score threshold. For a cost ratio of 0.1, the threshold increases to 0.73 to reduce false positives. Conversely, at a cost ratio of 10, the threshold decreases to 0.17 to significantly reduce false negatives. The spike in C_score for a cost ratio of 10 is proportional to the number of true class 1 instances within the probability interval.

For the credit card fraud and KDD Cup 99 datasets, the C_score curve remains mostly flat. In the case of the credit card fraud data, we applied a logarithmic transformation (Figure 4c) to highlight the differences in C_score caused by the significant class imbalance. For the KDD Cup 99 dataset, the trained model achieves good class separation (Figure 4d), resulting in a relatively flat C_score curve in the middle region, with a spike towards a probability interval of 1 as the cost ratio increases.

In the internal dataset, as we saw earlier, there is substantial overlap between the probability intervals of the two classes, increasing the significance of false positives and false negatives (Figure 4e). At a cost ratio of 0.1, the threshold is set at 0.95, nearly eliminating false positives. At a cost ratio of 1, the threshold is 0.92, only slightly different from the threshold for a cost ratio of 0.1, and results in a nearly similar rate of false positives. This slight decrease can be attributed to the significant class imbalance, where further threshold reduction could significantly increase false positives due to the higher count of instances in class 0. As the cost ratio increases to 10, the threshold decreases to 0.42, considerably reducing false negatives (as indicated by the reduced proportion of class 1 instances to the left of the threshold). The spike in C_score at this cost ratio corresponds to the interval with a significant count of class 1 instances.

Table 6 compares the cost improvements achieved by C_score at different cost ratios to the costs at optimal thresholds based on the F1 score. Improvements with respect to C_score range from 10% to 85% in most scenarios for cost ratios of 0.1 and 10, with an average cost improvement of 49%. At a cost ratio of 1, the improvement is minimal, except in datasets with significant class imbalance, indicating the similarity between F1 score and C_score at this ratio. For the internal dataset with a cost ratio of 0.1, the cost improvement at the optimal C_score threshold compared to the F1 score threshold is 86%. Additionally, there is over 50% improvement in cost at a cost ratio of 0.1 for the UNSW-NB15, credit card fraud, and KDD Cup 99 datasets, underscoring the substantial benefit of tuning models using the C_score metric. These findings demonstrate how C_score effectively adjusts the threshold to balance false positives and false negatives based on the specified cost ratio.
Figure 5: Precision-Recall curves for different cost ratios: (a) UNSW-NB15, (b) Credit card fraud, (c) Phishing data, (d) KDD cup 99, (e) Internal data.

Table 6: Misclassification costs based on using F1 score and C_score for thresholding, for three different cost ratios and the five datasets. The F1-based threshold, precision and recall are the same for all cost ratios of a given dataset.
| Dataset | Cost ratio | F1-based threshold | F1-based precision | F1-based recall | F1-based C_score | C_score-based threshold | C_score-based precision | C_score-based recall | C_score-based C_score | Improvement in cost |
| UNSW-NB15 | 0.1 | 0.65 | 0.949 | 0.961 | 0.056 | 0.890 | 0.992 | 0.868 | 0.020 | 64.1% |
| UNSW-NB15 | 1 | 0.65 | 0.949 | 0.961 | 0.091 | 0.650 | 0.949 | 0.961 | 0.091 | 0.0% |
| UNSW-NB15 | 10 | 0.65 | 0.949 | 0.961 | 0.441 | 0.420 | 0.885 | 0.993 | 0.203 | 53.2% |
| Credit card fraud | 0.1 | 0.27 | 0.815 | 0.781 | 0.199 | 0.900 | 0.976 | 0.417 | 0.069 | 65.3% |
| Credit card fraud | 1 | 0.27 | 0.815 | 0.781 | 0.396 | 0.640 | 0.931 | 0.698 | 0.354 | 10.6% |
| Credit card fraud | 10 | 0.27 | 0.815 | 0.781 | 2.365 | 0.130 | 0.757 | 0.812 | 2.135 | 9.7% |
| KDD cup 99 | 0.1 | 0.33 | 0.995 | 0.994 | 0.006 | 0.540 | 0.999 | 0.986 | 0.002 | 66.7% |
| KDD cup 99 | 1 | 0.33 | 0.995 | 0.994 | 0.011 | 0.330 | 0.995 | 0.994 | 0.011 | 0.0% |
| KDD cup 99 | 10 | 0.33 | 0.995 | 0.994 | 0.065 | 0.170 | 0.982 | 0.998 | 0.034 | 47.7% |
| Phishing data | 0.1 | 0.54 | 0.980 | 0.915 | 0.027 | 0.730 | 0.997 | 0.873 | 0.015 | 44.4% |
| Phishing data | 1 | 0.54 | 0.980 | 0.915 | 0.104 | 0.540 | 0.980 | 0.915 | 0.104 | 0.0% |
| Phishing data | 10 | 0.54 | 0.980 | 0.915 | 0.876 | 0.170 | 0.764 | 0.970 | 0.595 | 32.1% |
| Internal data | 0.1 | 0.69 | 0.532 | 0.637 | 0.597 | 0.948 | 0.971 | 0.230 | 0.084 | 85.9% |
| Internal data | 1 | 0.69 | 0.532 | 0.637 | 0.923 | 0.923 | 0.942 | 0.252 | 0.764 | 17.2% |
| Internal data | 10 | 0.69 | 0.532 | 0.637 | 4.186 | 0.424 | 0.292 | 0.886 | 3.289 | 21.4% |

4.3.3. Precision-Recall trade-off using C_score

Cost score's ability to balance false negatives and false positives based on varying cost ratios is further demonstrated by the changes in precision and recall (Figure 5 and Table 6). The results indicate that as the cost ratio shifts from 1 to 0.1, precision increases while recall decreases. For example, for the UNSW-NB15 data, precision reaches 0.992 at a cost ratio of 0.1, reflecting a reduction in false positives. Conversely, when the cost ratio increases to 10, there is a significant improvement in recall. For example, for the UNSW-NB15 data, the recall is improved 10% compared to that at the best F1 score at a cost ratio of 10, indicating a reduction in false negatives.

When comparing the results of the new cost score at a cost ratio of 1 with the F1 score, the outcomes are similar in most cases. This similarity is evident from the precision-recall curves (Figure 5), where the precision and recall for the optimal F1 score overlap with those of the new cost score at a cost ratio of 1. This underscores the fact that the F1 score consistently assigns equal weights to false negatives and false positives, irrespective of their real-world cost impacts. The notable differences in precision and recall between the F1 score and the new cost score are observed for the credit card fraud data (Figure 5b) and the internal data (Figure 5e), which can be attributed to the significant class imbalance in these datasets and also to the fact that F1 score is also proportional to true positives and thereby strives for a better recall than C_score at cost ratio 1.

These results demonstrate that the proposed C_score metric offers substantial performance enhancements over the F1 score across a range of cost ratios. Unlike the F1 score, which assigns equal penalties to false positives and false negatives, the C_score metric accommodates variations in the costs associated with these errors. At a cost ratio of 1, C_score achieves performance comparable to the F1 score, illustrating its versatility.
The flexibility of C_score to tune models based on cost ratios, particularly demonstrated by the results on the internal data, makes it a valuable metric for fine-tuning our models to meet the varying cost demands of end users. These findings suggest that C_score can effectively replace the F1 score for tasks such as model thresholding and selection, particularly in scenarios where the costs of false positives and false negatives differ. Figure 6 shows the C_score isocost contours overlaid with the PR curve for the UNSW-NB15 dataset. The minimum cost point on the PR curve corresponds to the point where it intersects the contour with the least cost.

Figure 6: C_score isocost contours with the precision-recall curve for the UNSW-NB15 dataset for three different cost ratios: (a) cost ratio = 0.1, (b) cost ratio = 1, (c) cost ratio = 10.

5. Conclusions

How organizations handle errors from machine learning models is highly dependent on context and application. In the cybersecurity domain, the cost of a security analyst's time and effort spent reviewing and investigating a false positive differs considerably from the cost of a model's failure to detect a real security incident (a false negative). However, widely used metrics like F1 score assign them equal costs. In this paper, we derived a new cost-aware metric, C_score, defined in terms of precision, recall, and a cost ratio, which can be used for model evaluation and serve as a replacement for F1 score. In particular, it can be used for thresholding probabilistic classifiers to achieve minimum cost. To demonstrate the effectiveness of C_score in cybersecurity applications, we applied it to threshold models built on five different datasets, assuming multiple cost ratios. The results showed substantial cost savings from using C_score instead of F1 score. At a cost ratio of 1 the results are similar; however, as the cost ratio is increased or decreased, the gap in costs between using C_score and F1 score increases. All datasets show consistent improvements in cost. Through this work, we hope to raise awareness among machine learning practitioners building cybersecurity applications regarding the use of cost-aware metrics such as C_score instead of cost-oblivious ones like F1 score.

Acknowledgments

We thank the anonymous reviewers and the CAMLIS 2024 attendees for their feedback.

References

[1] C. J. Van Rijsbergen, Information Retrieval, 2nd ed., Newton, MA, 1979.
[2] S. Puthiya Parambath, N. Usunier, Y. Grandvalet, Optimizing F-measures by cost-sensitive classification, Advances in Neural Information Processing Systems 27 (2014).
[3] P. D. Turney, Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm, Journal of Artificial Intelligence Research 2 (1994) 369-409.
[4] M. Kukar, I. Kononenko, et al., Cost-sensitive learning with neural networks, in: ECAI, volume 15, 1998, pp. 88-94.
[5] P. Domingos, MetaCost: A general method for making classifiers cost-sensitive, in: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999, pp. 155-164.
[6] C. Elkan, The foundations of cost-sensitive learning, in: International Joint Conference on Artificial Intelligence, volume 17, 2001, pp. 973-978.
[7] V. S. Sheng, C. X. Ling, Thresholding for making classifiers cost-sensitive, in: AAAI, volume 6, 2006, pp. 476-481.
[8] B. Krishnapuram, S. Yu, R. B. Rao, Cost-sensitive Machine Learning, CRC Press, 2011.
[9] D. Hand, P. Christen, A note on using the F-measure for evaluating record linkage algorithms, Statistics and Computing 28 (2018) 539-547.
[10] D. M. Powers, Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation, arXiv preprint arXiv:2010.16061 (2020).
[11] M. Sitarz, Extending F1 metric, probabilistic approach, arXiv preprint arXiv:2210.11997 (2022).
[12] B. W. Matthews, Comparison of the predicted and observed secondary structure of T4 phage lysozyme, Biochimica et Biophysica Acta (BBA) - Protein Structure 405 (1975) 442-451.
[13] D. Chicco, G. Jurman, The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation, BMC Genomics 21 (2020) 1-13.
[14] W. Lee, W. Fan, M. Miller, S. J. Stolfo, E. Zadok, Toward cost-sensitive modeling for intrusion detection and response, Journal of Computer Security 10 (2002) 5-22.
[15] M. Liu, L. Miao, D. Zhang, Two-stage cost-sensitive learning for software defect prediction, IEEE Transactions on Reliability 63 (2014) 676-686.
[16] I. Bruha, S. Kočková, A support for decision-making: Cost-sensitive learning system, Artificial Intelligence in Medicine 6 (1994) 67-82.
[17] Wikipedia, Hyperparameter optimization, https://en.wikipedia.org/wiki/Hyperparameter_optimization, accessed May 2024.
[18] N. Moustafa, J. Slay, UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set), in: 2015 Military Communications and Information Systems Conference (MilCIS), IEEE, 2015, pp. 1-6.
[19] N. Moustafa, J. Slay, The evaluation of network anomaly detection systems: Statistical analysis of the UNSW-NB15 data set and the comparison with the KDD99 data set, Information Security Journal: A Global Perspective 25 (2016) 18-31.
[20] scikit-learn, Real world datasets, https://scikit-learn.org/stable/datasets/real_world.html, accessed May 2024.
[21] G. Pang, C. Shen, A. Van Den Hengel, Deep anomaly detection with deviation networks, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 353-362.
[22] A. Aljofey, Q. Jiang, A. Rasool, H. Chen, W. Liu, Q. Qu, Y. Wang, An effective detection approach for phishing websites using URL and HTML features, Scientific Reports 12 (2022) 8842.

A. Appendix

A.1. Slopes of the F1 score and C_score isocurves

F1 score is defined as:

F1 = 2 · Prec · R / (Prec + R)

Rearranging:

Prec = F1 · R / (2R − F1)

The slope of the F1 isocurves can be calculated as:

dPrec/dR = F1 / (2R − F1) − 2R · F1 / (2R − F1)²
         = −F1² / (2R − F1)²

Thus, the slope of the F1 isocurves is always negative.

C_score is defined as:

C_score = (1/Prec − r_c − 1) · R + r_c

It can be rearranged as:

Prec = R / (C_score + R · (r_c + 1) − r_c)

The slope of the C_score isocurves can be computed as:

dPrec/dR = 1 / (C_score + R · (r_c + 1) − r_c) − (r_c + 1) · R / (C_score + R · (r_c + 1) − r_c)²
         = (C_score − r_c) / (C_score + R · (r_c + 1) − r_c)²

As described in Section 3.4, these curves can have negative, positive or zero slope depending on the value of C_score.
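The sign behavior derived above can be spot-checked numerically; the short sketch below compares the closed-form slope with a finite-difference estimate at an arbitrary point (all values are assumptions chosen for illustration).

```python
def prec_on_contour(r, cscore, rc):
    # Precision along an isocost contour (Appendix A.1 rearrangement).
    return r / (cscore + r * (rc + 1.0) - rc)

def slope_closed_form(r, cscore, rc):
    # dPrec/dR = (C_score - rc) / (C_score + R*(rc+1) - rc)^2
    return (cscore - rc) / (cscore + r * (rc + 1.0) - rc) ** 2

# Finite-difference check at an assumed point (R = 0.7, rc = 1) for three contour levels.
h = 1e-6
for level in (0.5, 1.0, 1.5):
    fd = (prec_on_contour(0.7 + h, level, 1.0) - prec_on_contour(0.7 - h, level, 1.0)) / (2 * h)
    print(level, round(fd, 6), round(slope_closed_form(0.7, level, 1.0), 6))
# Slopes are negative below rc, zero at rc, and positive above rc, matching Section 3.4.
```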
A.2. Improvements in cost for different cost ratios

Figure 7 depicts the improvement in cost for different values of the cost ratio. The x-axis of each plot is log10(cost_ratio) and the y-axis is the percentage improvement in cost compared to the threshold selected using F1 score. The general trend across all datasets is that the improvement is smallest near a cost ratio of one, where C_score behaves similarly to F1 score. The cost improvement increases as we move away from one in either direction. The percentage increase for higher cost ratios depends on the proportion of the positive class (class 1) in the data. Since P(V) is very small for the credit card fraud dataset, there is only a slight increase in cost savings for cost ratios greater than one (Figure 7b).

Figure 7: Percentage improvement in cost for the threshold obtained by minimizing the new cost score compared to the threshold obtained by maximizing F1 score: (a) UNSW-NB15, (b) Credit card fraud, (c) Phishing data, (d) KDD cup 99, (e) Internal data.