Is F1 Score Suboptimal for Cybersecurity Models? Introducing C_score, a Cost-Aware Alternative for Model Assessment

Manish Marwah(1,*), Asad Narayanan(2), Stephan Jou(2), Martin Arlitt(2) and Maria Pospelova(2)
(1) OpenText, USA
(2) OpenText, Canada

Abstract
The costs of the errors made by machine learning classifiers, namely false positives and false negatives, are not equal and are application dependent. For example, in cybersecurity applications, the cost of failing to detect an attack is very different from that of marking a benign activity as an attack. Various design choices during machine learning model building, such as hyperparameter tuning and model selection, allow a data scientist to trade off between these two errors. However, most of the commonly used metrics for evaluating model quality, such as F1 score, which is defined in terms of model precision and recall, treat both errors equally, making it difficult for users to optimize for the actual cost of these errors. In this paper, we propose a new cost-aware metric, C_score, based on precision and recall that can replace F1 score for model evaluation and selection. It includes a cost ratio that accounts for the differing costs of handling false positives and false negatives. We derive and characterize the new cost metric and compare it to F1 score. Further, we use this metric to threshold models for five cybersecurity-related datasets at multiple cost ratios. The results show an average cost savings of 49%.

Keywords
machine learning, cybersecurity, F1 score, C_score, misclassification, cost-sensitive machine learning, false positive, false negative

CAMLIS'24: Conference on Applied Machine Learning for Information Security, October 24-25, 2024, Arlington, VA
* Corresponding author.
Email: mmarwah@opentext.com (M. Marwah); anarayanan@opentext.com (A. Narayanan); sjou@opentext.com (S. Jou); marlitt@opentext.com (M. Arlitt); mpospelova@opentext.com (M. Pospelova)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Applications of machine learning in cybersecurity are widespread and rapidly growing, with models being deployed to prevent, detect, and respond to threats such as malware, intrusion, fraud, and phishing. The main metric for assessing the performance of classification models is F1 score [1], also known as F1 measure, which is the harmonic mean of precision and recall. While F1 score is used for assessing models, it is not directly used as a loss function since it is neither differentiable nor convex. A simpler and commonly used approach is a two-stage optimization process: first, a model is trained using a conventional loss function such as cross-entropy, and then an optimal threshold is selected based on F1 score [2].

F1 score works particularly well in the highly imbalanced settings prevalent in cybersecurity. However, it treats both kinds of errors a machine learning classifier can make – false positives (FPs) and false negatives (FNs) – equally. Usually, the costs of these errors are unequal and depend on the application context. For example, in cybersecurity applications, while both errors can have severe negative consequences, one may be preferred over the other. Specifically, false positives lead to alarm fatigue, a phenomenon where a high frequency of false alarms causes operators to ignore or dismiss all alarms. This problem is often exacerbated by the base rate fallacy, where people underestimate the potential volume of false positives because they focus on a high true positive rate while ignoring a low base rate.¹ False negatives, on the other hand, imply that a vulnerability or attack has gone undetected.

Footnote 1: Even when the true positive rate (TPR), that is, P(A|V), is high, the probability that an alarm corresponds to a real threat or vulnerability, that is, P(V|A), is usually very low. This follows directly from Bayes' rule, P(V|A) ∝ P(A|V) · P(V), and the fact that the base rate, P(V), is usually very low.
While ideally one would want to minimize both of these errors, in practice there is a trade-off between the number of FPs and FNs. An organization, based on its goals and requirements, may assign differing costs to these errors. For example, for a ransomware detection model, the damage caused by a FN may be several orders of magnitude greater than the cost of a security analyst handling a FP, while for other models, e.g., port scanning detection, the costs may be similar or even higher for handling a FP. By using F1 score, such cost considerations are usually ignored.² So a natural question to ask is: given the cost difference (or ratio) between the consequences of a FP and a FN for a particular use case, how can an organization incorporate that information while building machine learning models for that application?

Footnote 2: A weighted version of F1 score exists; however, it is rarely used, since it is not obvious how precision and recall should be weighted to reflect the differing costs of FNs and FPs.

There is considerable prior work on cost-sensitive learning [3, 4, 5, 6, 7, 8]. These approaches aim to modify the model learning process, e.g., by altering the loss function to incorporate cost, adding weights to the training samples, or readjusting class frequencies in the training set, so that the trained model intrinsically produces cost-sensitive results. In this paper, we do not change the underlying learning process and instead propose a new cost-aware metric, a replacement for F1 score, that can be used for model thresholding, comparison, and selection. It is defined in terms of recall, precision, and a cost ratio, and can be used, for example, to determine the minimum-cost point on a precision-recall curve. We applied the new metric, called cost score (C_score), to several cybersecurity-related datasets and observed significant cost differences between using F1 score and C_score. While cost score applies to any classification problem, it is especially relevant in cybersecurity, where the mismatch in the costs of misclassification can be significant. The main purpose of cost score is to make it easier for practitioners to incorporate cost during model thresholding and selection. It is an easy replacement for F1 score since C_score is also defined in terms of precision and recall (plus an additional cost ratio).

The key contributions of the paper are:
• Introduction of a new cost-based metric, C_score, defined as (1/Precision − 1 − r_c) · Recall + r_c, where r_c is the cost ratio. It incorporates the differing costs of misclassification and can be used as a cost-aware alternative to F1 score.
• Characterization and derivation of the new metric, and its comparison with F1 score.
• Application of C_score to five cybersecurity-related datasets, four of which are publicly available and one private, for multiple values of the cost ratio. The results show a cost saving of up to 86%, with an average saving of 49%, over using F1 score in situations where costs are unequal.
2. Related Work

2.1. Drawbacks of F1 score and alternatives

While F1 score is preferable to accuracy, precision, or recall alone, especially for an imbalanced dataset, its primary drawback in our context is that all misclassifications are considered equal [9]. Its other drawbacks [10, 11] include 1) lack of symmetry with respect to class labels, e.g., changing the positive class in a binary classifier produces a different result; and 2) no dependence on true negatives. A more robust, though less popular, alternative that addresses some of these problems while still working well for imbalanced datasets is the Matthews Correlation Coefficient (MCC) [12], which in many cases is preferred over F1 score [13]. It is symmetric and produces a high score only if all four confusion matrix entries (see Table 2) show good results [13]. However, it treats FPs and FNs equally.

Unlike F1 score and MCC, our proposed metric is not symmetric with respect to FNs and FPs; it takes their distinct impacts into consideration through a cost ratio. Further, like MCC but unlike F1 score, our metric is symmetric in its treatment of true positives and true negatives. Our metric is not normalized like MCC and F1 score, and varies between 0 (best) and ∞ (worst). This does not impact model thresholding or comparison; however, the value of the cost metric is not very meaningful by itself, but it can be converted to the corresponding recall and precision values. Since neither MCC nor F1 score considers differing costs of errors, and the latter is more widely used, we compare our proposed metric with F1 score in the rest of the paper.

2.2. Cost-sensitive learning and assessment

Since in real-world applications the cost differences between types of errors can be large, cost-sensitive machine learning has been an active area of research over the past few decades [3, 4, 5, 7], especially in areas such as security [14, 15] and medicine [16]. For example, Lee et al. [14] proposed cost models for intrusion detection, and Liu et al. [15] incorporate cost considerations in both feature selection and classification for software defect prediction. Some of this and similar work could be used to estimate cost ratios for our proposed cost metric.

At a high level, cost-sensitive machine learning [8] can be categorized into two approaches: 1) modifying the machine learning methods themselves to incorporate the unequal costs of errors; and 2) converting existing machine learning models – trained with cost-oblivious methods – into cost-sensitive ones using a wrapper [5, 7]. In this paper, we focus on the second approach, also referred to as cost-sensitive meta-learning. While there are various methods to implement this approach, we focus on thresholding, or threshold adjusting [7], where the decision threshold of a probabilistic model is selected based on a cost function. Sheng et al. [7] showed that thresholding outperforms several other cost-sensitive meta-learning methods such as MetaCost [5]. In the most general case, the cost function for thresholding can be constructed from the entries of a confusion matrix, with a weight attached to each of them, that is, FPs, FNs, TPs, and TNs [6].
Our proposed cost metric uses a similar formulation; however, it is expressed in terms of precision and recall, metrics that data scientists already know well and understand. We are not aware of any existing cost metric defined in terms of precision, recall, and a cost ratio. Unlike F1 score or MCC, the proposed metric is directly proportional to the total cost of misclassification. We believe it can serve as a cost-aware replacement for F1 score or MCC.

3. Proposed Metric: Cost Score

While the proposed metric is applicable to any machine learning classification model, including multiclass and multilabel settings, for simplicity we assume a binary classification task in the following discussion. The notation used is summarized in Table 1. Starting with the cost of misclassifications, we derive expressions for cost score that can replace F1 score. In particular, we derive two equivalent expressions — one in terms of TPR (recall) and FPR, and the other in terms of precision and recall. Both include an error cost ratio (r_c), the ratio of the cost of a FN to that of a FP. The first expression depends on the base rate (P(V)), while the second, like F1 score, does not directly depend on it.

Table 1: Notation

  Symbol   Description
  ¬        logical not
  V        vulnerability or threat, or in general the positive class
  A        positive classification by a detector, which may result in an alarm
  TP       true positive
  TN       true negative
  FP       false positive
  FN       false negative
  N        total number of data points
  N_FP     number of false positives
  N_FN     number of false negatives
  p        total number of positives
  p̂        total number of predicted positives
  n        total number of negatives
  n̂        total number of predicted negatives
  C_FP     cost of a false positive
  C_FN     cost of a false negative
  C        total cost of misclassification
  r_c      error cost ratio, defined as C_FN / C_FP
  R        recall
  Prec     precision

The basic evaluation metrics for a binary classifier can be defined from a confusion matrix, shown in Table 2. One can also view a confusion matrix from a probabilistic perspective, where the four possible outcomes define a probability space, with each outcome a joint probability, as shown in Table 3. The total probability along a row or a column is the corresponding marginal probability.

Table 2: Confusion matrix

                           Ground Truth
                           V (or T)    ¬V (or F)
  Prediction  A  (or T)    TP          FP           p̂
              ¬A (or F)    FN          TN           n̂
                           p           n

Table 3: Confusion matrix – probabilistic view

                     Ground Truth
                     V            ¬V
  Prediction  A      P(A, V)      P(A, ¬V)      P(A)
              ¬A     P(¬A, V)     P(¬A, ¬V)     P(¬A)
                     P(V)         P(¬V)

Conditional probabilistic definitions of classifier metrics:
  False positive rate (FP/n): P(A|¬V)
  False negative rate (FN/p): P(¬A|V)
  True positive rate, i.e., recall (TP/p): P(A|V)
  True negative rate (TN/n): P(¬A|¬V)
  Precision (TP/p̂): P(V|A)
  False discovery rate, i.e., 1 − precision (FP/p̂): P(¬V|A)

3.1. Cost function

The cost incurred as a result of misclassification is composed of the cost of false positives and that of false negatives. P(A, ¬V) and P(¬A, V) are the probabilities of a false positive and a false negative, respectively. Thus, their counts can be expressed as:

  N_FP = N · P(A, ¬V)
  N_FN = N · P(¬A, V)

Multiplying by the corresponding costs gives the total cost of errors:

  C = C_FP · N_FP + C_FN · N_FN
    = C_FP · N · P(A, ¬V) + C_FN · N · P(¬A, V)

Factoring out the common terms and introducing the cost ratio, r_c = C_FN / C_FP, gives:

  C = C_FP · N · [P(A, ¬V) + r_c · P(¬A, V)]    (1)
    = K · [P(A, ¬V) + r_c · P(¬A, V)]           (2)

where K = C_FP · N is a constant for a given dataset.
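To make Equations 1 and 2 concrete, the following minimal sketch (our own illustration, not code from the paper) computes the total misclassification cost from raw confusion-matrix counts; the cost of a false positive and the cost ratio are assumed inputs.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def misclassification_cost(y_true, y_pred, c_fp=1.0, cost_ratio=10.0):
    """Total cost C = C_FP * N_FP + C_FN * N_FN, with C_FN = cost_ratio * C_FP (Equation 1)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return c_fp * (fp + cost_ratio * fn)

# Toy example: 2 false positives and 1 false negative.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([1, 1, 0, 0, 1, 1])
print(misclassification_cost(y_true, y_pred, c_fp=1.0, cost_ratio=10.0))  # 2 + 10*1 = 12
```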
3.2. Cost score in terms of TPR and FPR

Here we express the cost function in terms of TPR (recall) and FPR. Data scientists frequently evaluate a model in terms of TPR, the fraction of positive cases detected, and FPR, the fraction of negatives misclassified as positives. In fact, an ROC curve (a plot of TPR against FPR) is widely used for thresholding a probabilistic classifier. Using the product rule, we rewrite the joint probability of a false positive in terms of FPR and P(V):

  P(A, ¬V) = P(A|¬V) · P(¬V)         (3)
           = FPR · (1 − P(V))         (4)

Similarly, we rewrite the joint probability of a false negative in terms of TPR and P(V):

  P(¬A, V) = P(¬A|V) · P(V)           (5)
           = (1 − P(A|V)) · P(V)      (6)
           = (1 − TPR) · P(V)         (7)

Substituting Equations 4 and 7 into the cost expression (Equation 2) and rearranging, we get:

  C = K · [FPR + P(V) · (r_c − r_c · TPR − FPR)]

To minimize the cost, we can ignore K, and thus the cost score becomes:

  C_score = FPR + P(V) · (r_c − r_c · TPR − FPR)    (8)

3.3. Cost score in terms of precision and recall

While FPR is a useful metric, as it captures the number of false positives, it can be tricky to interpret, especially when the base rate, P(V), is low, which is usually the case in cybersecurity problems. For problems such as intrusion or threat detection, FPs add overhead to the workflow of a security analyst. For phishing website detection, a FP may result in a website being blocked in error for an end user. In either case, setting a target FPR requires knowledge of the base rate and would have to change as the base rate changes. In other words, even a seemingly low FPR may not be good enough, given a low base rate. Further, variation in the base rate would affect the overhead of a security analyst in the case of intrusion detection, or the fraction of erroneously blocked websites in the case of phishing detection, even if the FPR stays constant. Precision, on the other hand, directly captures the operator overhead or the fraction of erroneously blocked websites, independent of the base rate. A main attraction of F1 score is its use of precision instead of FPR.

When the costs of a FP and a FN are similar, F1 score is an effective evaluation metric; however, with unequal costs of misclassification, we can usually find a better solution by incorporating this cost differential in the metric. Below, we derive an expression for C_score in terms of precision and recall, similar to F1 score, but one that also includes a cost ratio.

We can rewrite the joint probability of a false positive in terms of precision (Prec) and the marginal probability of an alarm, P(A):
  P(A, ¬V) = P(¬V|A) · P(A)        (9)
           = (1 − Prec) · P(A)     (10)

P(A) can be expressed in terms of P(V), Prec, and R (recall) using Bayes' rule:

  P(V|A) · P(A) = P(A|V) · P(V)
  P(A) = [P(A|V) / P(V|A)] · P(V)
       = (R / Prec) · P(V)

Substituting this value of P(A) into Equation 10, we get:

  P(A, ¬V) = [(1 − Prec) / Prec] · R · P(V)    (11)

As in the previous section (Equation 7), the probability of a false negative can be written as:

  P(¬A, V) = (1 − R) · P(V)    (12)

Therefore, substituting the probabilities of a false positive and a false negative from Equations 11 and 12, respectively, into the cost expression (Equation 1), we get:

  C = N · C_FP · P(V) · [((1 − Prec) / Prec) · R + r_c · (1 − R)]

Since N, C_FP, and P(V) are constant for a given dataset, we can drop them and rewrite the cost expression as:

  C_score = (1/Prec − 1) · R + r_c · (1 − R)    (13)

This expression defines the cost in terms of precision, recall, and the cost ratio, and can be used instead of F1 score for any task that requires model comparison, such as model thresholding, hyperparameter tuning, model selection, and feature selection. C_score goes to zero for Prec = 1 and R = 1, as expected. As Prec → 0 and R → 0, C_score → ∞.

We have derived two equivalent cost expressions – one involving TPR and FPR (the quantities used in an ROC curve) and the second involving precision and recall (the quantities used to compute F1 score). Similarly, it may be possible to derive additional equivalent cost expressions in terms of other commonly used metrics. In the remainder of the paper, we only consider the cost expression C_score defined in terms of precision and recall (similar to F1 score). Unlike the expression in the previous section, this definition of C_score does not directly depend on the base rate (P(V)).

3.4. C_score Isocost Contours

To better understand the cost score metric, we examine its dependence on precision and recall and compare it with F1 score. Figure 1 shows a precision-recall (PR) plot with F1 score isocurves, or contours. Each curve corresponds to a constant value of F1 score, as specified next to the curve. If recall and precision are identical, F1 score computes to that same value. However, if there is a wide gap between them, F1 tends to be closer to the lower value, as can be seen in the top-left and bottom-right regions of the plot. As expected, the highest (best) value contours are towards the top-right corner of the plot (that is, towards perfect recall and precision). Further, the slope of the curves is always negative (as shown in Appendix A), implying there is always a trade-off between recall and precision.

Figure 1: Precision-recall plot of F1 score isocurves.

We can similarly obtain isocost curves (or contour lines) for cost score assuming a particular cost ratio, r_c. The cost score (Equation 13) can be written as:

  C_score = (1/Prec − 1 − r_c) · R + r_c    (14)

and plotted for constant values of C_score on a PR plot. Figure 2 shows the isocost curves for three cost ratios: r_c = 1, that is, a FN and a FP cost the same; r_c = 10, that is, a FN is ten times as expensive as a FP; and r_c = 0.1, that is, a FN is one-tenth as expensive as a FP.
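As an illustration of Equation 14, the short sketch below (ours, not part of the paper's artifacts) computes C_score on a precision-recall grid and draws isocost contours similar in spirit to Figure 2; the grid resolution, contour levels, and cost ratio are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

def c_score(precision, recall, cost_ratio):
    """C_score = (1/Prec - 1 - r_c) * R + r_c  (Equation 14)."""
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    return (1.0 / precision - 1.0 - cost_ratio) * recall + cost_ratio

# Isocost contours for r_c = 10 on a precision-recall grid.
r = np.linspace(0.01, 1.0, 200)   # recall
p = np.linspace(0.01, 1.0, 200)   # precision
R, P = np.meshgrid(r, p)
Z = c_score(P, R, cost_ratio=10.0)

cs = plt.contour(R, P, Z, levels=[0.1, 0.5, 1, 2, 5, 10])
plt.clabel(cs, inline=True, fontsize=8)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("C_score isocost contours (r_c = 10)")
plt.show()
```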
There are three distinct regions in the plot, based on the slope of the curves. From the above equation, we can compute the slope (see Appendix A for details):

  ∂Prec/∂R = (C_score − r_c) / (C_score + R·(r_c + 1) − r_c)²

Depending on the value of C_score, the slope can be positive, negative, or zero:

  ∂Prec/∂R  > 0  if C_score > r_c
            < 0  if C_score < r_c
            = 0  if C_score = r_c

For lower (better) values of C_score, when C_score < r_c, the slope is negative and the isocost curves are similar to the isocurves for F1 score. The horizontal line corresponds to C_score = r_c, and the curves below it have a positive slope, with C_score > r_c. The isocurves closest to the top-right corner have the lowest costs.

While the isocost contours are plotted assuming Prec and R are independent, that is obviously not the case for a particular model. In fact, Prec = P(V|A) and R = P(A|V) are related by Bayes' rule: Prec = [P(V)/P(A)] · R. The feasible (Prec, R) pairs obtained by varying the model threshold are given by a PR curve. A hypothetical PR curve is shown as a dotted black line in Figure 2. The cost corresponding to each point on the PR curve is given by the isocost contour intersecting that point. The minimum-cost point on the PR curve is the one that touches the lowest-cost contour. If the PR curve is convex, the minimum-cost contour will touch the PR curve at only one point, where their tangents have equal slope.³ However, in practice, empirically constructed PR curves are not always convex, and thus the minimum-cost point may not be unique. In Figure 2, points A, B, and C approximately show the minimum-cost points for the three cost ratios.

Footnote 3: Under the assumption of convexity, this can be proved by contradiction. Assume the minimum-cost isocost contour touches the PR curve at two or more points; since both curves are convex, there must then be another, lower-cost isocost contour touching the PR curve at at least one point. Thus, the lowest-cost isocost contour must touch the PR curve at exactly one point.

Figure 2: Isocost contours for C_score for three different cost ratios: (a) r_c = 1, (b) r_c = 10, and (c) r_c = 0.1. The C_score corresponding to each contour is listed next to it. The black dotted line is the PR curve for a particular model.

What do isocost contours mean in terms of the confusion matrix? C_score remains constant along a contour and is proportional to FP + r_c · FN, which must therefore remain constant as recall and precision change. In Table 4, we have parameterized the confusion matrix entries with k such that, as k changes for a particular r_c, precision and recall vary but C_score remains constant. This can be seen by computing FP + r_c · FN for the table entries, which is (FP′ + r_c · k) + r_c · (FN′ − k), independent of k and thus constant.

Table 4: Confusion matrix – parameterized by k

                     Ground Truth
                     V             ¬V
  Prediction  A      TP′ + k       FP′ + r_c · k    p̂ + r_c · k + k
              ¬A     FN′ − k       TN′ − r_c · k    n̂ − r_c · k − k
                     p             n
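A small numerical check of the parameterization in Table 4 (our own sketch; the base counts TP′, FP′, FN′, TN′ and the cost ratio below are arbitrary): as k varies, precision and recall change while FP + r_c · FN, and hence C_score, stay constant.

```python
r_c = 10
tp0, fp0, fn0, tn0 = 80, 30, 20, 870   # base confusion-matrix counts (arbitrary)

for k in range(0, 4):
    tp, fp = tp0 + k, fp0 + r_c * k
    fn, tn = fn0 - k, tn0 - r_c * k
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    c_score = (1 / prec - 1 - r_c) * rec + r_c
    print(f"k={k}: precision={prec:.3f} recall={rec:.3f} "
          f"FP+r_c*FN={fp + r_c * fn} C_score={c_score:.3f}")
```

For these counts, FP + r_c · FN stays at 230 and C_score stays at 2.300 for every k, even though precision drops and recall rises as k grows.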
3.5. How does F1 score compare with C_score when r_c = 1?

While F1 score varies from 0 to 1, with 1 indicating perfect performance, C_score is proportional to the actual cost of handling model errors, with a cost of zero indicating perfect performance (that is, no FPs or FNs). F1 score treats FNs and FPs uniformly, as does C_score when r_c = 1. So a natural question is whether C_score differs from F1 score when r_c = 1. To compare, we transform F1 score into a cost metric:

  F1_cost = 1/F1 − 1    (15)

When F1 is 1, F1_cost = 0, and as F1 → 0, F1_cost → ∞; it thus behaves like a cost function and can be directly compared with C_score. To compare C_score and F1_cost, we express both in terms of the elements of the confusion matrix and find that:

  C_score ∝ FP + FN            (16)
  F1_cost ∝ (FP + FN) / TP     (17)

Thus, even when r_c = 1, C_score and F1_cost are not identical: while C_score is proportional to the total number of errors, F1_cost is also inversely proportional to the number of true positives. C_score only considers the cost of errors; it assigns zero cost to both TPs and TNs. In that sense, it treats TPs and TNs symmetrically, unlike F1 score.

3.6. Multiclass and multilabel classifiers

While we derived the cost metric assuming a binary classification problem, its extension to multiclass and multilabel classification problems is straightforward. A cost ratio per class needs to be defined. For a multiclass classifier, a user would assign cost ratios considering each class as positive and the rest as negative. Similarly, for a multilabel classifier, a user would assign an independent cost ratio for each class. This allows a C_score to be computed per class. To compute a single cost metric, the per-class C_score values need to be aggregated. The simplest aggregation function is an arithmetic mean, although a class-weighted mean based on class importance, or another type of aggregation, e.g., a harmonic mean, can also be used. This "one class versus the rest" approach is similar to how F1 score and other metrics are computed in a multiclass setting.

3.7. Minimizing C_score based on model threshold and other hyperparameters

In Section 3.4, we described the use of isocost contours to visually determine the lowest-cost point on a PR curve. In practice, to find the minimum cost over model thresholds, and the corresponding precision and recall values, precision and recall can be treated as functions of the threshold value (t), with the optimal threshold determined by minimizing the cost function with respect to t:

  t* = argmin_t [(1/Prec(t) − 1) · R(t) + r_c · (1 − R(t))]

In addition to the model threshold, C_score can also be used for selecting other model hyperparameters, such as the number of neighbors in k-NN; the number of trees, maximum tree depth, etc. in tree-based models; and the number and type of layers, activation functions, etc. in neural networks; as well as for model comparison and selection. Hyperparameter tuning [17] is typically performed using methods such as grid search, random search, or gradient-based optimization, usually with cross-validation to evaluate the quality of a particular hyperparameter choice on a dataset. In all these methods, the proposed C_score can replace a cost-oblivious metric such as F1 score.
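For the thresholding case, the minimization above can be carried out directly on an empirical precision-recall curve. The sketch below (our own code, not the authors' implementation; it assumes classifier scores and validation labels are available) selects the threshold that minimizes C_score.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_threshold_by_c_score(y_val, scores, cost_ratio):
    """Pick the decision threshold that minimizes C_score (Equation 13) on a validation set."""
    precision, recall, thresholds = precision_recall_curve(y_val, scores)
    # precision_recall_curve returns one more (precision, recall) pair than thresholds;
    # drop the final point (recall = 0) so the arrays line up with the thresholds.
    precision, recall = precision[:-1], recall[:-1]
    eps = 1e-12  # guard against zero precision
    c = (1.0 / np.maximum(precision, eps) - 1.0) * recall + cost_ratio * (1.0 - recall)
    i = int(np.argmin(c))
    return thresholds[i], c[i]

# Usage (hypothetical fitted model and validation data):
# scores = model.predict_proba(X_val)[:, 1]
# t_star, c_star = best_threshold_by_c_score(y_val, scores, cost_ratio=10.0)
# y_pred = (scores >= t_star).astype(int)
```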
4. Experimental Evaluation

4.1. Datasets

The datasets used in our experiments were chosen based on their relevance to security and the varying cost of misclassification between target classes. To comprehensively analyze the impact of costs, we selected five different datasets: four publicly available and one privately collected. The publicly available datasets are the UNSW-NB15 intrusion detection data, the KDD Cup 99 network intrusion data, credit card transaction data, and phishing URL data.

1. UNSW-NB15 Intrusion Detection Data: This network dataset, developed by the Intelligent Security Group at UNSW Canberra, comprises events categorized into nine distinct types of attacks along with normal traffic. To suit the experimental requirements of our study, the dataset was transformed into a binary classification setting, where a subset of attack classes (Backdoor, Exploits, Reconnaissance) is consolidated into class 1, while normal traffic is represented as class 0. There are a total of 93,000 events in class 0 and 60,841 events in class 1. We used the CSV version of the dataset, which comes pre-partitioned into training and testing sets [18][19].

2. KDD Cup 99 Network Intrusion Data: This dataset originated from packet traces captured during the 1998 DARPA Intrusion Detection System Evaluation. It encompasses 145,585 unique records categorized into 23 distinct classes, which include various types of attacks alongside normal network traffic. Each record is characterized by 41 features derived from the packet traces. For this work, the dataset has been adapted to a binary classification task: class 0 represents normal instances, while class 1 aggregates all attack types. Also, to explore the impact of different thresholds on the model's performance, training was conducted using only 1% of the dataset. The dataset is accessed through the datasets available in the Python sklearn package [20].

3. Credit Card Transactions Data: This dataset contains credit card transaction logs with 29 features, labeled as legitimate or fraudulent transactions. It contains 284,315 legitimate transactions (class 0) and 492 fraudulent ones (class 1) [21]. Of the five datasets, this one has the highest skew.

4. Phishing Data: This dataset is a collection of 60,252 webpages along with their URLs and HTML sources. Of these, 27,280 are phishing sites (class 1) and 32,972 are benign (class 0) [22]. We only use the URLs for building the model.

5. Internal Data: This is a private dataset, used within an organization, that represents the results of an extensive audit of vulnerabilities in source code. Each vulnerability is classified into one of two classes, class 0 or class 1 (actual class names are masked for anonymity), by human auditors during the auditing process. The model is trained on this manually audited data and predicts whether a given vulnerability belongs to class 0 or class 1. There are a total of 144,978 instances, of which 18,738 belong to class 1. Each vulnerability has 58 features, which encompass a wide array of metrics generated during analysis of the codebase.

The information about each dataset is summarized in Table 5. It is important to note that not all of the datasets are balanced.
For instance, the credit card fraud data has less than 1% of its instances in class 1. Similarly, the internal data has only about 13% of its instances in class 1.

Table 5: Summary of datasets

  Dataset             Class 0 instances   Class 1 instances   Number of features
  UNSW-NB15           93,000              60,841              42
  Credit card fraud   284,315             492                 29
  KDD cup 99          87,832              57,753              41
  Phishing data       32,972              27,280              188
  Internal data       126,240             18,738              58

4.2. Experiment Setup

We train a classification model using a random forest algorithm for each dataset. The goal is not to train the best possible model for each dataset but to obtain a reasonably good model with a probabilistic output. The steps are as follows:

1. Model training: A RandomForest classifier is trained on each dataset. Although the training sets have different skews, we effectively used a balanced dataset for training so that the classifier gets an equal opportunity to learn both classes.

2. Threshold adjustment using F1 score: The validation dataset is used to identify the best threshold based on F1 score. The validation dataset was selected by sampling a proportion of the data, ensuring that the class distribution mirrored that of the training data. This approach was taken because the actual class skew in a production deployment is unknown. However, it is important to note that for actual production systems, the validation set should be representative of the true data distribution. Specifically, for the UNSW-NB15 dataset, the validation set was sampled from the events in the test data CSV file.

3. Threshold adjustment using C_score: The predictions from the trained model are analyzed across different cost ratios. Using the validation sets, we apply C_score to determine the optimal threshold for each cost ratio.

4. Comparison: The model's cost with the threshold chosen based on F1 score is compared against the costs with thresholds chosen using C_score.

This setup allows us to evaluate the effectiveness of C_score in optimizing model performance under varying cost conditions (a simplified code sketch of this pipeline follows).
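The sketch below mirrors steps 1-4 (our own illustration, not the authors' code; a synthetic imbalanced dataset stands in for the real ones, and class_weight="balanced" stands in for the balanced training set used in the paper).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one of the paper's datasets (imbalanced binary classification).
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.95, 0.05], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# Step 1: train a random forest with a probabilistic output.
model = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
model.fit(X_train, y_train)
scores = model.predict_proba(X_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, scores)
precision, recall = precision[:-1], recall[:-1]  # align with thresholds

# Step 2: threshold that maximizes F1 score.
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
t_f1 = thresholds[int(np.argmax(f1))]

# Step 3: threshold that minimizes C_score for a given cost ratio.
r_c = 10.0
c = (1.0 / np.maximum(precision, 1e-12) - 1.0) * recall + r_c * (1.0 - recall)
t_c = thresholds[int(np.argmin(c))]

# Step 4: compare misclassification costs of the two thresholds (C_FP = 1, C_FN = r_c).
def cost(threshold):
    y_hat = (scores >= threshold).astype(int)
    fp = np.sum((y_hat == 1) & (y_val == 0))
    fn = np.sum((y_hat == 0) & (y_val == 1))
    return fp + r_c * fn

print(f"F1-based threshold {t_f1:.2f}: cost {cost(t_f1):.0f}")
print(f"C_score-based threshold {t_c:.2f}: cost {cost(t_c):.0f}")
```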
4.3. Results

The proposed cost metric is tailored to improving the performance of machine learning models in scenarios where the cost of false negatives differs greatly from the cost of false positives. This in turn helps in optimizing predictions based on cost considerations, thereby addressing a critical limitation of existing evaluation methods.

4.3.1. F1 score for thresholding

To illustrate the advantages of our approach, we first use F1 score to adjust each model's threshold. Figure 3 depicts the change in F1 score over threshold values for each dataset. The histograms in each plot show the distribution of data within the corresponding predicted-probability intervals, with each color representing the distribution of the corresponding ground truth class. Due to the significant skew of the credit card dataset, the density of class 1 is not visible in its histogram. The F1 score for each dataset starts from a threshold of 0, where all instances are classified as the positive class and recall is 1. It ends at a score of 0 at a threshold of 1, where all instances are tagged as negatives and recall is 0. The rate of change of the F1 score within a threshold interval is proportional to the fraction of the data within that interval and their ground truth labels. This explains why, for some datasets, the F1 score is flat or nearly flat in the middle range of thresholds.

It is clear from Figure 3 that most models do a good job of separating the two classes. The model trained on the KDD Cup 99 data separates the two classes most distinctly and has most of the data points near probabilities of zero and one. This makes the threshold vs. F1 score curve mostly flat in the middle range of probabilities. The best F1 score is achieved at a threshold of 0.33. Similarly, the phishing, credit card fraud, and intrusion detection datasets perform well on the validation sets, with well-separated bimodal distributions. The threshold with the highest F1 score is marked with a vertical line in each plot.

Figure 3: Threshold for the best F1 score for each dataset: (a) UNSW-NB15, (b) credit card fraud, (c) phishing data, (d) KDD cup 99, (e) internal data.

For the internal dataset, the model struggles to separate the two classes, as can be seen from the overlap in the predicted probabilities of the classes. The model achieves its best F1 score at a threshold of 0.688. As the threshold moves from 0 to 1, there is a trade-off between FPs and FNs; the maximum F1 score for each model corresponds to the point where the sum of FPs and FNs is minimal while the number of TPs is highest. Being symmetric in FNs and FPs, F1 score reduces their sum, disregarding any class-specific costs.

4.3.2. C_score for thresholding at different cost ratios

The proposed C_score metric allows model parameters to be tuned based on a cost ratio (the ratio of the cost of a false negative to the cost of a false positive). This cost ratio is variable and depends on the specific impacts these errors have on end users. For example, in scenarios where missing a true attack could lead to significant financial losses, the cost of a false negative is higher. Conversely, in resource-constrained environments, a high rate of false positives can considerably burden the evaluation process. To illustrate the tuning differences, we applied three distinct cost ratios to each dataset: 0.1 (a false positive is ten times more costly than a false negative), 1 (equal cost for both), and 10 (a false negative is ten times more costly than a false positive). These cost ratios are used solely to demonstrate the model's behavior when tuned with C_score and may not correspond to practical applications of the data.

Figure 4 displays the optimal thresholds derived from C_score for each dataset across these cost ratios. The histograms in each plot show the distribution of the ground truth classes within the probability intervals. C_score reflects the classification cost, resulting in a curve shape that is the inverse of the F1 score curve, with the optimal threshold at the minimum C_score. As with the F1 score plots, the flat portions of the C_score curve correspond to probability intervals with few data points. At a threshold of 0, C_score is constant regardless of the cost ratio, since recall is 1 and C_score reduces to 1/Prec − 1. Table 6 summarizes the experimental results, comparing the costs obtained with C_score-based thresholds to those obtained with the F1 score-based threshold.
C_score is computed at the thresholds obtained by maximizing F1 score and by minimizing C_score for each cost ratio. There is only one threshold per dataset based on the best F1 score, but the threshold based on C_score varies with the cost ratio. Although the actual cost is a multiple of C_score, the percentage improvement over the F1 score threshold reflects the reduction in actual cost.

Figure 4: Variation of thresholds with different cost ratios for each dataset: (a) UNSW-NB15, (b) phishing data, (c) credit card fraud, (d) KDD cup 99, (e) internal data.

For the UNSW-NB15 dataset, the optimal threshold is 0.89 for a cost ratio of 0.1, minimizing false positives at the expense of some true positives becoming false negatives (Figure 4a). At a cost ratio of 1, the threshold decreases to 0.65, balancing false positives and false negatives and aligning closely with the best F1 score threshold. At a cost ratio of 10, the threshold further decreases to 0.42, significantly reducing false negatives despite an increase in false positives.

In the phishing dataset, the probability distributions of the two classes are similar in shape (Figure 4b). At a cost ratio of 1, the threshold is 0.54, identical to the best F1 score threshold. For a cost ratio of 0.1, the threshold increases to 0.73 to reduce false positives. Conversely, at a cost ratio of 10, the threshold decreases to 0.17 to significantly reduce false negatives. The spike in C_score for a cost ratio of 10 is proportional to the number of true class 1 instances within the probability interval.

For the credit card fraud and KDD Cup 99 datasets, the C_score curve remains mostly flat. In the case of the credit card fraud data, we applied a logarithmic transformation (Figure 4c) to highlight the differences in C_score given the significant class imbalance. For the KDD Cup 99 dataset, the trained model achieves good class separation (Figure 4d), resulting in a relatively flat C_score curve in the middle region, with a spike towards a probability of 1 as the cost ratio increases.

In the internal dataset, as we saw earlier, there is substantial overlap between the probability distributions of the two classes, increasing the significance of false positives and false negatives (Figure 4e). At a cost ratio of 0.1, the threshold is set at 0.95, nearly eliminating false positives. At a cost ratio of 1, the threshold is 0.92, only slightly different from the threshold for a cost ratio of 0.1 and resulting in a similar rate of false positives. This small difference can be attributed to the significant class imbalance: lowering the threshold further could significantly increase false positives due to the much larger count of instances in class 0. As the cost ratio increases to 10, the threshold decreases to 0.42, considerably reducing false negatives (as indicated by the reduced proportion of class 1 instances to the left of the threshold). The spike in C_score at this cost ratio corresponds to the interval with a significant count of class 1 instances.

Table 6 compares the cost improvements achieved by C_score at different cost ratios against the costs at the optimal thresholds based on F1 score. For cost ratios of 0.1 and 10, the improvements in cost range from about 10% to 86% in most scenarios, with an average cost improvement of 49%.
At a cost ratio of 1, the improvement is minimal, except in datasets with significant class imbalance, indicating the similarity between F1 score and C_score at this ratio. For the internal dataset at a cost ratio of 0.1, the cost improvement at the optimal C_score threshold compared to the F1 score threshold is 86%. Additionally, there is over 50% improvement in cost at a cost ratio of 0.1 for the UNSW-NB15, credit card fraud, and KDD Cup 99 datasets, underscoring the substantial benefits of tuning models using the C_score metric. These findings demonstrate how C_score effectively adjusts the threshold to balance false positives and false negatives based on the specified cost ratio.

Figure 5: Precision-recall curves for different cost ratios: (a) UNSW-NB15, (b) credit card fraud, (c) phishing data, (d) KDD cup 99, (e) internal data.

Table 6: Misclassification costs based on using F1 score and C_score for thresholding, for three cost ratios and the five datasets. The first group of columns reports the F1-based threshold with its precision, recall, and C_score; the second group reports the same quantities for the C_score-based threshold.

  Dataset            Cost ratio | F1 thr.  Prec.  Recall  C_score | C_score thr.  Prec.  Recall  C_score | Improvement in cost
  UNSW-NB15          0.1        | 0.65     0.949  0.961   0.056   | 0.890         0.992  0.868   0.020   | 64.1%
                     1          | 0.65     0.949  0.961   0.091   | 0.650         0.949  0.961   0.091   | 0.0%
                     10         | 0.65     0.949  0.961   0.441   | 0.420         0.885  0.993   0.203   | 53.2%
  Credit card fraud  0.1        | 0.27     0.815  0.781   0.199   | 0.900         0.976  0.417   0.069   | 65.3%
                     1          | 0.27     0.815  0.781   0.396   | 0.640         0.931  0.698   0.354   | 10.6%
                     10         | 0.27     0.815  0.781   2.365   | 0.130         0.757  0.812   2.135   | 9.7%
  KDD cup 99         0.1        | 0.33     0.995  0.994   0.006   | 0.540         0.999  0.986   0.002   | 66.7%
                     1          | 0.33     0.995  0.994   0.011   | 0.330         0.995  0.994   0.011   | 0.0%
                     10         | 0.33     0.995  0.994   0.065   | 0.170         0.982  0.998   0.034   | 47.7%
  Phishing data      0.1        | 0.54     0.980  0.915   0.027   | 0.730         0.997  0.873   0.015   | 44.4%
                     1          | 0.54     0.980  0.915   0.104   | 0.540         0.980  0.915   0.104   | 0.0%
                     10         | 0.54     0.980  0.915   0.876   | 0.170         0.764  0.970   0.595   | 32.1%
  Internal data      0.1        | 0.69     0.532  0.637   0.597   | 0.948         0.971  0.230   0.084   | 85.9%
                     1          | 0.69     0.532  0.637   0.923   | 0.923         0.942  0.252   0.764   | 17.2%
                     10         | 0.69     0.532  0.637   4.186   | 0.424         0.292  0.886   3.289   | 21.4%

4.3.3. Precision-recall trade-off using C_score

C_score's ability to balance false negatives and false positives at varying cost ratios is further demonstrated by the changes in precision and recall (Figure 5 and Table 6). The results show that as the cost ratio shifts from 1 to 0.1, precision increases while recall decreases. For example, on the UNSW-NB15 data, precision reaches 0.992 at a cost ratio of 0.1, reflecting a reduction in false positives. Conversely, when the cost ratio increases to 10, there is a significant improvement in recall. For example, on the UNSW-NB15 data, recall improves from 0.961 at the best F1 score threshold to 0.993 at a cost ratio of 10, indicating a reduction in false negatives.

When comparing the results of the new cost score at a cost ratio of 1 with those of F1 score, the outcomes are similar in most cases. This similarity is evident from the precision-recall curves (Figure 5), where the precision and recall at the optimal F1 score overlap with those of the new cost score at a cost ratio of 1. This underscores the fact that F1 score consistently assigns equal weights to false negatives and false positives, irrespective of their real-world cost impacts.
The notable differences in precision and recall between the F1 score and the new cost score appear in the credit card fraud data (Figure 5b) and the internal data (Figure 5e). These can be attributed to the significant class imbalance in these datasets, and also to the fact that F1 score is also proportional to true positives and thereby strives for a better recall than C_score does at a cost ratio of 1.

These results demonstrate that the proposed C_score metric offers substantial performance enhancements over F1 score across a range of cost ratios. Unlike F1 score, which assigns equal penalties to false positives and false negatives, the C_score metric accommodates variations in the costs associated with these errors. At a cost ratio of 1, C_score achieves performance comparable to F1 score, illustrating its versatility. The flexibility of C_score to tune models based on cost ratios, particularly demonstrated by the results on the internal data, has proven it to be a valuable metric for fine-tuning our models to meet the varying cost demands of end users. These findings suggest that C_score can effectively replace F1 score for tasks such as model thresholding and selection, particularly in scenarios where the costs of false positives and false negatives differ.

Figure 6 shows the C_score isocost contours overlaid with the PR curve for the UNSW-NB15 dataset. The minimum-cost point on the PR curve corresponds to the point where it intersects the contour with the lowest cost.

Figure 6: C_score isocost contours with the precision-recall curve for the UNSW-NB15 dataset for three different cost ratios: (a) cost ratio = 0.1, (b) cost ratio = 1, (c) cost ratio = 10.

5. Conclusions

How organizations handle errors from machine learning models is highly dependent on context and application. In the cybersecurity domain, the cost of a security analyst's time and effort spent reviewing and investigating a false positive differs considerably from the cost of a model failing to detect a real security incident (a false negative). However, widely used metrics like F1 score assign them equal costs. In this paper, we derived a new cost-aware metric, C_score, defined in terms of precision, recall, and a cost ratio, which can be used for model evaluation and serve as a replacement for F1 score. In particular, it can be used for thresholding probabilistic classifiers to achieve minimum cost. To demonstrate the effectiveness of C_score in cybersecurity applications, we applied it to threshold models built on five different datasets under multiple cost ratios. The results showed substantial cost savings from using C_score over F1 score. At a cost ratio of 1, the results are similar; however, as the cost ratio increases or decreases, the gap in cost between using C_score and F1 score widens. All datasets show consistent improvements in cost. Through this work, we hope to raise awareness among machine learning practitioners building cybersecurity applications regarding the use of cost-aware metrics such as C_score instead of cost-oblivious ones like F1 score.

Acknowledgments

We thank the anonymous reviewers and the CAMLIS 2024 attendees for their feedback.

References
[1] C. J. van Rijsbergen, Information Retrieval, 2nd ed., Newton, MA, 1979.
[2] S. Puthiya Parambath, N. Usunier, Y. Grandvalet, Optimizing F-measures by cost-sensitive classification, Advances in Neural Information Processing Systems 27 (2014).
[3] P. D. Turney, Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm, Journal of Artificial Intelligence Research 2 (1994) 369–409.
[4] M. Kukar, I. Kononenko, Cost-sensitive learning with neural networks, in: ECAI, volume 15, 1998, pp. 88–94.
[5] P. Domingos, MetaCost: A general method for making classifiers cost-sensitive, in: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999, pp. 155–164.
[6] C. Elkan, The foundations of cost-sensitive learning, in: International Joint Conference on Artificial Intelligence, volume 17, 2001, pp. 973–978.
[7] V. S. Sheng, C. X. Ling, Thresholding for making classifiers cost-sensitive, in: AAAI, volume 6, 2006, pp. 476–481.
[8] B. Krishnapuram, S. Yu, R. B. Rao, Cost-Sensitive Machine Learning, CRC Press, 2011.
[9] D. Hand, P. Christen, A note on using the F-measure for evaluating record linkage algorithms, Statistics and Computing 28 (2018) 539–547.
[10] D. M. Powers, Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation, arXiv preprint arXiv:2010.16061 (2020).
[11] M. Sitarz, Extending F1 metric, probabilistic approach, arXiv preprint arXiv:2210.11997 (2022).
[12] B. W. Matthews, Comparison of the predicted and observed secondary structure of T4 phage lysozyme, Biochimica et Biophysica Acta (BBA) - Protein Structure 405 (1975) 442–451.
[13] D. Chicco, G. Jurman, The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation, BMC Genomics 21 (2020) 1–13.
[14] W. Lee, W. Fan, M. Miller, S. J. Stolfo, E. Zadok, Toward cost-sensitive modeling for intrusion detection and response, Journal of Computer Security 10 (2002) 5–22.
[15] M. Liu, L. Miao, D. Zhang, Two-stage cost-sensitive learning for software defect prediction, IEEE Transactions on Reliability 63 (2014) 676–686.
[16] I. Bruha, S. Kočková, A support for decision-making: Cost-sensitive learning system, Artificial Intelligence in Medicine 6 (1994) 67–82.
[17] Wikipedia, Hyperparameter optimization, accessed May 2024. URL: https://en.wikipedia.org/wiki/Hyperparameter_optimization.
[18] N. Moustafa, J. Slay, UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set), in: 2015 Military Communications and Information Systems Conference (MilCIS), IEEE, 2015, pp. 1–6.
[19] N. Moustafa, J. Slay, The evaluation of network anomaly detection systems: Statistical analysis of the UNSW-NB15 data set and the comparison with the KDD99 data set, Information Security Journal: A Global Perspective 25 (2016) 18–31.
[20] scikit-learn, Real world datasets, accessed May 2024. URL: https://scikit-learn.org/stable/datasets/real_world.html.
[21] G. Pang, C. Shen, A. Van Den Hengel, Deep anomaly detection with deviation networks, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 353–362.
[22] A. Aljofey, Q. Jiang, A. Rasool, H. Chen, W. Liu, Q. Qu, Y. Wang, An effective detection approach for phishing websites using URL and HTML features, Scientific Reports 12 (2022) 8842.

A. Appendix
A.1. Slopes of the F1 score and C_score isocurves

F1 score is defined as:

  F1 = (2 · Prec · R) / (Prec + R)

Rearranging:

  Prec = (F1 · R) / (2R − F1)

The slope of the F1 isocurves can be calculated as:

  ∂Prec/∂R = F1/(2R − F1) − 2R·F1/(2R − F1)²
           = −F1² / (2R − F1)²

Thus, the slope of the F1 isocurves is always negative.

C_score is defined as:

  C_score = (1/Prec − r_c − 1) · R + r_c

It can be rearranged as:

  Prec = R / (C_score + R·(r_c + 1) − r_c)

The slope of the C_score isocurves can be computed as:

  ∂Prec/∂R = 1/(C_score + R·(r_c + 1) − r_c) − (r_c + 1)·R/(C_score + R·(r_c + 1) − r_c)²
           = (C_score − r_c) / (C_score + R·(r_c + 1) − r_c)²

As described in Section 3.4, these curves can have negative, positive, or zero slope depending on the value of C_score.

A.2. Improvements in cost for different cost ratios

Figure 7 depicts the improvement in cost for different values of the cost ratio. The x-axis of each plot is log10(cost_ratio) and the y-axis is the percentage improvement in cost compared to the threshold selected using F1 score. The general trend across all datasets is that the improvement is smallest near a cost ratio of one, where C_score behaves similarly to F1 score, and increases as the cost ratio moves away from one in either direction. The percentage increase at higher cost ratios depends on the proportion of the positive class (class 1) in the data. Since P(V) is very small for the credit card fraud dataset, there is only a slight increase in cost savings for cost ratios greater than one (Figure 7b).

Figure 7: Percentage improvement in cost for the threshold obtained by minimizing the new cost score compared to the threshold obtained by maximizing F1 score: (a) UNSW-NB15, (b) credit card fraud, (c) phishing data, (d) KDD cup 99, (e) internal data.
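As a quick symbolic cross-check of the slope expression for the C_score isocurves in A.1 (our own sketch using sympy, not part of the original derivation):

```python
import sympy as sp

R, rc, Cs = sp.symbols("R r_c C_score", positive=True)

# Precision along a C_score isocost curve: Prec = R / (C_score + R*(r_c + 1) - r_c)
prec_iso = R / (Cs + R * (rc + 1) - rc)

slope = sp.simplify(sp.diff(prec_iso, R))
print(slope)  # mathematically equal to (C_score - r_c) / (C_score + R*(r_c + 1) - r_c)**2

# Sign matches Section 3.4: positive when C_score > r_c, negative when C_score < r_c.
print(slope.subs({Cs: 2, rc: 1, R: sp.Rational(1, 2)}) > 0)                  # True
print(slope.subs({Cs: sp.Rational(1, 2), rc: 1, R: sp.Rational(1, 2)}) < 0)  # True
```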