Firenze: Model Evaluation Using Weak Signals
Bhavna Soman¹, Ali Torkamani¹, Michael J. Morais¹, Jeffrey Bickford¹ and Baris Coskun¹
¹Amazon Web Services
CAMLIS'22: Conference on Applied Machine Learning in Information Security (CAMLIS), October 20–21, 2022, Arlington, VA
Abstract
Data labels in the security field are frequently noisy, limited, or biased towards a subset of the population. As a result, commonplace evaluation methods such as accuracy, precision and recall metrics, or analysis of performance curves computed from labeled datasets do not provide sufficient confidence in the real-world performance of a machine learning (ML) model. This has slowed the adoption of machine learning in the field. In the industry today, we rely on domain expertise and lengthy manual evaluation to build this confidence before shipping a new model for security applications. In this paper, we introduce Firenze, a novel framework for comparative evaluation of ML models' performance using domain expertise, encoded into scalable functions called markers. We show that markers computed and combined over select subsets of samples called regions of interest can provide a strong estimate of their real-world performances. Critically, we use statistical hypothesis testing to ensure that observed differences—and therefore conclusions emerging from our framework—are larger than those observable from noise alone. Using simulations and two real-world datasets for malware and domain-name-service reputation, we illustrate the effectiveness, limitations, and insights achievable with our approach. Taken together, we propose Firenze as a resource for fast, interpretable, and collaborative model development and evaluation by mixed teams of researchers, domain experts, and business owners.
1. Introduction
In research areas like information security, in which data abounds but domain-expert labels or annotations for those data are uniquely expensive [16], reliable evaluation of a machine learning model's performance is challenging [32, 36, 17, 22]. When developing models for use in real-world production environments from such restrictive datasets, how can we determine whether a newly-developed model will actually perform better than existing methods when deployed? Problematically, "better" is a problem-specific consideration, e.g. a classifier is better if it provides a higher true positive rate than existing methods without exceeding a maximum-tolerable false positive rate. This remains an active research problem and a barrier to the productionization of machine learning models to solve real-world problems [32].
Machine learning for information security, e.g. for malware or network intrusion detection, is of growing interest in both academia and industry [6, 3, 19, 25, 29, 14, 33]. But, to illustrate the challenge, consider a case study of training a malicious domain name classifier from a list of malicious and benign domain names obtained from third-party threat intelligence providers [15].
Such labeled datasets are often restrictive, non-representative subsamples of much larger populations, and cannot faithfully capture the underlying data distribution. If we computed canonical evaluative metrics like precision and recall on these data, we could arbitrarily mis-estimate our model's performance on the billions of unlabeled domain names it would observe each day in production. Indeed, empirically, it is not uncommon for a model developed in this way to yield unreasonably high numbers of false positives in production, thereby rendering itself useless despite achieving near-perfect precision and recall scores on labeled data. Worse still, any effortfully-collected labeled data carries a high risk of becoming obsolete, since adversaries frequently change tactics and move targets. Taken together, these peculiarities can induce severe concept drift, label drift, and covariate shift [31, 24, 5]. As a result, reliable evaluation of ML models in information security has required extensive manual investigations by qualified experts with highly specialized training.
In this paper, we present Firenze, a novel model evaluation framework that automates this investigative process by scalably operationalizing experts' domain knowledge as weak signals, and uses these signals to compare models' performances without ground-truth labels. Our goal is to accelerate the iterative development process of machine learning models, and empower a collaborative workflow between research scientists, domain experts, and business owners solving emergent problems in information security.
Firenze encodes domain expertise into a set of rules, called markers. These markers collectively and scalably represent benchmarks, heuristics, and/or knowledge that would be used by domain experts during manual investigations of the outcomes of an ML-based model. They are weak signals, insofar as they may not be correct for every individual datum, but provide reliable insights over populations of data and/or consensuses of multiple markers. We apply these markers to an [unlabeled] test dataset to make principled judgments of the performance of that model with respect to some existing methods. Specifically, we measure how much higher it is able to rank datapoints associated with malicious markers, and how much lower those associated with benign markers, compared to those other methods, using statistical hypothesis tests on specific regions of the test dataset that indicate when a proposed model is better than, worse than, or no different from existing methods. By construction, the results from each individual marker readily provide a semantic understanding of why, how, and on which data model improvements or deteriorations are occurring. As a result, Firenze provides a nuanced picture of the comparative model performance along with the overall judgment of which model is more performant.
In Section 2, we review related work from semi-supervised, unsupervised, and weakly supervised learning, and information security. In Section 3, we describe Firenze in detail, and in Section 4, we investigate the efficacy of our approach using simulated data. In Sections 5 and 6, we present two case-study applications of our approach on malware detection with an open-source dataset and domain-name reputation with a real-world dataset. In Section 7, we discuss future directions and intersectional opportunities for our work.
2. Related work
Evaluation of machine learning models in the security literature broadly uses canonical metrics like accuracy, precision, and recall on labeled data [32], but such approaches can be inaccurate or limited for data with partial labels, noisy labels, or no labels at all [12, 27]. For example, a meta-analysis of Android malware classifiers by Pendlebury et al. [24] revealed that their accuracies were heavily biased over time and across groups of samples. In turn, there is a growing emphasis on using real-world and/or high-quality datasets [30, 2] and seeking explainable, semantic understanding of model outcomes [4].
Naively, one could improve the quality of evaluative metrics like these by obtaining reliable ground truth labels. For example, image and text corpora for simple labeling tasks like object recognition or text sentiment analysis can be labeled through large-scale crowdsourcing campaigns [21] using Mechanical Turk (or similar services) [8]. However, such efforts do not scale to specialized labeling tasks like those in information security, which require investigations by a select few domain experts with rigorous training and experience [16]. For some datasets, "labels" can be retrieved from aggregators like VirusTotal or various threat intelligence feeds, but implicit assumptions of their reliability have been questioned [36, 18]. Even when multiple such labeling sources are available, how to combine them into a single label, i.e. by a hard threshold/criterion [36] or model-based pooling [17], can be highly consequential for the resulting labels and for the model outcomes trained upon them [36].
The nascent field of weak supervision has emerged in response to this problem of intractable, costly, and/or imprecise data labeling [28]. The Snorkel project [26] introduced so-called labeling functions to generate training datasets based on weak domain expert signals, and has been adopted for real-world problems, including security, to remove the human labeling problem [1, 34]. All of the methods discussed thus far augment the training process in some way; we propose Firenze as a black-box method that can directly evaluate an already-trained model without retraining.
Directly targeting model evaluation, AutoEval [10] and density estimation [23] can estimate the accuracy of a classifier on an unlabeled dataset by using feature statistics from the training set and synthetic datasets generated by applying transformations to the training set. Most recently, Joyce et al. [16] define Approximate Ground Truth Refinements (AGTRs) using cluster memberships, which are used to estimate bounded precision and recall in clustering and multi-class algorithms. Though this approach can be used to evaluate models, the authors acknowledge its limitations in comparing models of different mechanical natures, since they will naturally correlate to different degrees with the biases of the AGTR construction itself.
To the best of our knowledge, Firenze is the first system of its kind to utilize weak signals (markers) to perform comparative evaluation of the effectiveness of machine learning models. Our approach is generalizable to various types of models, including supervised, semi-supervised, and unsupervised. We describe its particulars in the next section.
3. Firenze: Model evaluation using weak signals
Firenze is a framework for pairwise, comparative model evaluation utilizing weak signals for both supervised (e.g. classification of malicious vs. benign domain names) and unsupervised score-based models (e.g.
anomaly detection). Firenze also attempts to bridge the semantic gap by using domain expert weak signals to describe semantically how a model is performing outside of ground-truth labels, addressing the recurring concerns of ML models in information security [32].
Figure 1: An overview of the Firenze system. (1) A domain expert defines the marker functions. (2) Create ranked lists of the samples by each model. (3) Assign samples to regions of interest. (4) Calculate the average marker score per set. (5) For the two sets in each region of interest, determine the better model by comparing the average marker scores using a two-sample unequal-variance t-test. Inset: Examples of marker functions for applications of ML in security.
At a high level, our goal is to compare an existing model (i.e. one in production, the Reference Model) with a newly built model (the Test Model). Firenze features the components summarized in Figure 1. These constituent parts evaluate and compare two models, which we denote Model 𝑅 (or Reference Model) and Model 𝑇 (or Test Model). These models need only share a common goal/task, e.g. classifying malware or domain names; they may differ in feature representations of their input data, model architectures, etc. Critically, Firenze performs its evaluations strictly on the output scores of these models. Such a black-box treatment of these models permits fast, easy incorporation into [existing] research pipelines and well-posed comparisons of diverse models.
3.1. Marker design and combination
A marker is a weak signal that is associated with the maliciousness or benignity of a sample, instance, or event. The weak signal can come from diverse sources, patterns, heuristics, and external knowledge bases that operationalize a security expert's intuition of whether a sample is malicious or not. These intuitions may not be correct for every individual case, but broadly hold true for the population. For example, malware analysts understand that not all packed files are malicious, but the fact that a file is packed is at least suspicious and increases the odds of it being malicious. Similarly, when analyzing a suspicious, recently-queried domain name, analysts may check how popular this domain has been over some time window.
We define 𝑀 marker functions 𝑚1, ..., 𝑚𝑀, where 𝑚𝑗(𝑠) indicates the verdict of the 𝑗-th marker (if any) observed for a sample 𝑠. Markers' verdicts span 𝑚𝑗(𝑠) ∈ {−1, 0, 1}, where −1 indicates that the marker voted the sample 𝑠 to be benign, 1 indicates malicious, and 0 indicates that the marker abstains. Allowing markers to abstain is important, as the opposite of a security expert's intuition does not always indicate a vote for the opposite class. Revisiting our packed malware example above, not being packed is a poor weak signal of benignness—numerous malware aren't packed.
By design, a single marker may not give a conclusive verdict for a sample's maliciousness or benignness; however, a combination of many such markers can provide a stronger overall verdict, and emulate how human experts build confidence and make inferences. To aggregate individual markers, we define the combined marker score as their majority vote, which itself can "abstain" with 0 for ties. While this is a naive method, past research has shown that in use cases with low signal density (like ours) there is limited room for even an optimal weighting of the signals to diverge much from the majority vote [26]. More sophisticated aggregation based on Dawid-Skene estimators [9] or generative models is planned for future work.
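To make the aggregation concrete, the sketch below (ours, not the authors' released code) shows one way the combined marker score and the per-region average score 𝑍(·) used in the next subsection could be computed. Marker implementations themselves are use-case specific (Sections 5 and 6) and are only stand-ins here.

```python
import numpy as np

def combined_marker_score(sample, markers) -> int:
    """Majority vote over marker verdicts in {-1, 0, +1}; 0 ("abstain") on ties.

    `markers` is a list of callables, each mapping a sample to -1, 0, or +1.
    """
    verdicts = [m(sample) for m in markers]
    n_malicious = sum(v == 1 for v in verdicts)
    n_benign = sum(v == -1 for v in verdicts)
    return int(np.sign(n_malicious - n_benign))

def average_marker_score(samples, markers) -> float:
    """Z(.) for a region: the mean combined marker score over its samples."""
    return float(np.mean([combined_marker_score(s, markers) for s in samples]))
```

The sign-of-difference form is just a compact way of writing the majority vote: it returns +1 when malicious votes outnumber benign votes, -1 in the opposite case, and 0 on a tie or when all markers abstain.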
We still expect the combined marker score of individual samples to be noisy, which is why we compare populations in our evaluation. Over the subsets/regions of samples considered below, we calculate the average marker score of the samples in a given set, denoted 𝑍(𝑅) and 𝑍(𝑇) for models 𝑅 and 𝑇 resp. Intuitively, if a set contains more samples that are likely malicious, its average marker score will be greater, and it will be lower if the set contains more likely benign samples.
To guard against biased outcomes due to correlation between markers and model features, we limit the method to only use markers that don't overlap with model features. This is a strict mode of operation, and exploration of the occurrence/impact of such overlap is planned for future work.
3.2. Region-based hypothesis testing
ML models in the security domain generally seek a robust separation of malicious and benign samples, but in practice only a limited range of their outputs may be acted upon. For example, a domain name reputation model may score millions of unique domain names per day, but only a small [fixed-size] subset of those will be sufficiently [confidently] benign to allowlist. Consequently, which samples a model places in such regions of interest becomes instrumental to its real-world performance. Samples for which the assigned "region" changes from one model to the next grant evaluative information about the comparative performance of the two models. Therefore, we propose to perform comparative evaluation on three regions of interest of size 𝐾: one each to explore the "most malicious" samples, "most benign" samples, and most differently-scored samples; other such regions may exist for use-cases not considered here.
For each model, these regions are defined by the output scores 𝑝 = Model(𝑠) assigned to input samples from a test dataset, which reflect some confidence or probability that each sample belongs to the malicious or benign class. These scores need not be comparable directly across models, e.g. SVM margin scores or class probabilities. Instead, we sort samples by their scores in each model, such that samples' ranks are comparable across models. Across a large set of [unlabeled] test data, we can associate each sample with (i) its rank score and (ii) its marker score. These two scores define Firenze's tests:
• Top-𝐾 Test: We hypothesize that, within the top-𝐾-ranked samples by model score for some 𝐾, the test model is better than the reference model if it assigns more likely malicious samples and fewer likely benign samples to this region. As defined by the marker scores above, this is tantamount to testing whether 𝑍(𝑇) > 𝑍(𝑅), i.e. whether Model 𝑇 has a higher average marker score in this region.
• Bottom-𝐾 Test: Conversely, we hypothesize that, within the bottom-𝐾-ranked samples by model score for some 𝐾, the test model is better than the reference model if it assigns more likely benign samples and fewer likely malicious samples to this region. Likewise, we test whether 𝑍(𝑇) < 𝑍(𝑅), i.e. whether Model 𝑇 has a lower average marker score in this region.
• Movers Test: We hypothesize that the test model is better than the reference model if it assigns more malicious (as defined by marker score) samples to higher ranks (as defined by model score), and more benign samples to lower ranks.
Specifically, for some 𝐾, we use model scores to select the 𝐾 samples with the largest increase in rank from Model 𝑅 to Model 𝑇—"up-movers"—and the 𝐾 samples with the largest decrease—"down-movers". Then, we test 𝑍(𝑈) > 𝑍(𝐷), i.e. whether the average marker score of up-movers is higher than that of down-movers.
For each of the Top-K, Bottom-K, and Movers Tests, we compare the average marker scores of the samples placed in each region by Models 𝑅 and 𝑇 using a two-sample t-test with unequal variance at level 0.05, called Welch's t-test [35]. This permits us to observe and interpret differences in 𝑍(𝑅) and 𝑍(𝑇) sensitive to variability in these estimates, only if we can exclude the uninformative statistical possibility that the observed differences arose by random chance between equally performant models (probability 𝑝 ≤ 0.05). In other words, if the difference in average marker scores is larger than that which could be observed by an overwhelming fraction of chance outcomes, then we conclude the two models are performing statistically differently over that region. In practice, we run these statistical tests as two-sided tests of whether 𝑍(𝑇) ≠ 𝑍(𝑅) and 𝑍(𝑈) ≠ 𝑍(𝐷), rather than one-sided tests of whether one is greater than or less than the other; in doing so, we can also identify when the test model is worse than the reference model, by the same hypotheses above.
4. Assessing Firenze using simulated data
To demonstrate Firenze on data and models with known ground-truth labels, we developed an extensive simulated environment that parametrizes and partitions relevant sources of noise endemic to a model training-and-testing pipeline. Our goal is to define qualitative conditions under which Firenze can identify the better model with the proposed region-based hypothesis tests. Key features of this simulation are (i) generation of ground-truth labels as well as noisy generation of training labels and weak signals (markers) of arbitrary accuracy and coverage (with respect to ground-truth), and (ii) model score generation with arbitrary performances with respect to either of the labelsets. With these features, we explore the requirements of a single marker, knowing that these results provide a lower bound on any other use-case. We sketch the generative process here, and give full details in Appendix A:
𝑦true | 𝜋 ∼ ground-truth labels with class prob. 𝜋
𝑚 | 𝑦true, 𝛼, 𝛽 ∼ marker with coverage 𝛽 and accuracy 𝛼 w.r.t. 𝑦true
𝑦train | 𝑦true, ᾱ, β̄ ∼ noisy training labels with coverage β̄ and accuracy ᾱ w.r.t. 𝑦true
𝑓^𝑅 | 𝑦train, 𝑦true ∼ Model 𝑅 scores, s.t. some fixed decision rule 𝛿(𝑓^𝑅) has accuracy 𝑃train^𝑅 w.r.t. 𝑦train and 𝑃true^𝑅 w.r.t. 𝑦true
𝑓^𝑇 | 𝑦train, 𝑦true ∼ Model 𝑇 scores, s.t. 𝛿(𝑓^𝑇) has accuracy 𝑃train^𝑇 w.r.t. 𝑦train and 𝑃true^𝑇 w.r.t. 𝑦true
The parameters for our simulated environment—and their default values—are the positive class prevalence 𝜋 = 0.5, reference and test model performances on ground-truth and training labels 𝑃true^𝑅 = 0.90, 𝑃train^𝑅 = 0.98, 𝑃true^𝑇 = 0.95, and 𝑃train^𝑇 = 0.97, the number of samples 𝑁 = 1000000, the region size 𝐾 = 10000, and the accuracies and coverages of the marker (𝛼 and 𝛽 resp.) and of the training labels (ᾱ = 0.95 and β̄ = 0.10 resp.). These choices focus our experiments on the most nefarious case of model evaluation: the training performances are fixed such that 𝑃train^𝑇 < 𝑃train^𝑅, the opposite of the true difference on ground-truth labels where 𝑃true^𝑇 > 𝑃true^𝑅. In specific contrast to real-world datasets, only in simulations like these can we disambiguate between discrepancies in objective, unobserved ground-truth labels and subjective, observed training labels, and how these propagate to model performance and our evaluation of it.
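Before turning to the experiments, here is a minimal, self-contained sketch of the three region-based tests from Section 3.2. The model scores and combined marker scores below are random stand-ins of the kind this simulation produces; in practice they come from Models 𝑅 and 𝑇 and the markers. Welch's t-test is taken from SciPy (ttest_ind with equal_var=False).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N, K = 100_000, 5_000
scores_R = rng.random(N)                  # Model R scores (higher = more malicious)
scores_T = rng.random(N)                  # Model T scores
marker = rng.choice([-1, 0, 1], size=N)   # combined marker score per sample

def welch(a, b):
    """Two-sided Welch's t-test; returns (mean(a), mean(b), p-value)."""
    _, p = stats.ttest_ind(a, b, equal_var=False)
    return a.mean(), b.mean(), p

# Top-K / Bottom-K: compare marker scores of the K highest / lowest ranked samples.
top_R, top_T = np.argsort(-scores_R)[:K], np.argsort(-scores_T)[:K]
bot_R, bot_T = np.argsort(scores_R)[:K], np.argsort(scores_T)[:K]
print("Top-K   :", welch(marker[top_R], marker[top_T]))
print("Bottom-K:", welch(marker[bot_R], marker[bot_T]))

# Movers: compare the K samples whose rank rose most from Model R to Model T
# ("up-movers") against the K samples whose rank fell most ("down-movers").
rank_R = np.argsort(np.argsort(-scores_R))   # rank 0 = most malicious under R
rank_T = np.argsort(np.argsort(-scores_T))
delta = rank_R - rank_T                      # positive = moved toward "malicious" under T
up, down = np.argsort(-delta)[:K], np.argsort(delta)[:K]
print("Movers  :", welch(marker[up], marker[down]))
```

With uniformly random inputs like these, the tests should come back inconclusive; replacing the stand-ins with the generated scores and markers reproduces the comparisons studied below.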
4.1. Experiments
Holding all other parameters to their default values, we explore the role of the ground-truth model performances 𝑃true (Fig. 2, left), training label accuracy ᾱ (Fig. 2, right), positive class prevalence 𝜋 (Fig. 3, left), and region size 𝐾 (Fig. 3, right). In each experiment, for a given parameter configuration, we simulate this process for each of 𝑁 samples, generating true labels, then training labels and model scores for reference (𝑅) and test (𝑇) models, and finally markers. Using the model scores and markers, we apply the Firenze framework, and observe the outcomes of the three significance tests to identify the model with higher ground-truth performance. Given our goal—to study the minimal requirements of markers—we repeat this simulation on a fine tiling of marker accuracies 𝛼 ∈ (0, 1) and coverages 𝛽 ∈ (0, 1), and plot the result of each at the coordinate (𝛼, 𝛽) in each figure panel to follow. The resulting visualization shows the success, failure, and inconclusive regimes of the Firenze tests, as a function of the marker's parameters. In each figure, each column of panels reflects a certain region/test (Top-K, Bottom-K, Movers), and each row of panels reflects a certain parameter configuration (annotated accordingly).
Ground-truth model performance. For fixed model performances on training data, we varied the model performances on ground-truth data 𝑃true^𝑇 and 𝑃true^𝑅 (Fig. 2, left). Relative to the default model, a larger difference (𝑃true^𝑇 = 0.95 vs. 𝑃true^𝑅 = 0.80) with low generalization error (𝑃true^𝑇 = 0.95 vs. 𝑃train^𝑅 = 0.98) increases the sensitivity of all tests. A small difference (𝑃true^𝑇 = 0.95 vs. 𝑃true^𝑅 = 0.94) decreases the sensitivity of all tests, and to a lesser degree that of the Movers test. The default difference with high generalization error (𝑃true^𝑇 = 0.75 vs. 𝑃true^𝑅 = 0.70) strongly decreases the sensitivity of all tests.
Figure 2: Varying difference in model performances 𝑃true^𝑇 and 𝑃true^𝑅 (left) and feed accuracies ᾱ (right). Using the default simulation parameters as a guide (top, in boxes), in all panels we observe test Success (green) as marker accuracy increases (𝛼 > 0.5) and Failure (red) as marker accuracy decreases (𝛼 < 0.5) (x-axis). The interior region is Inconclusive (yellow), and that region widens—the test becomes less sensitive—as marker coverage decreases (y-axis). Left, all tests become more (less) sensitive as the true difference in performance becomes larger (smaller). Right, test sensitivity does not depend on the accuracy of the training labels.
Training label accuracy. We then varied the reliability of training labels, which in turn varies the generalization errors of our two models (Fig. 2, right). Because markers are independent of the noise level in the training labels, this does not impact test sensitivity for any test nor any accuracy level. We emphasize that this lack of dependence on training label accuracy underpins the power of these tests.
Positive class prevalence and region size. Finally, we varied class prevalences 𝜋 and region sizes 𝐾 to explore dependence on the sample data balance and size (Fig. 3).
As the positive (malicious) class becomes more rare in the dataset, the Top-K test remains sensitive, as the top-𝐾 samples can still contain adequate sample counts for both positive and negative classes; the Bottom-K and Movers Tests both lose sensitivity for the converse reason, as their samples will be overwhelmingly negative. As the region size 𝐾 decreases (reducing training and evaluation set sizes equally), all tests lose sensitivity, though least so for the Movers Test.
Figure 3: Varying class prevalence 𝜋 (left) and ROI set size 𝐾 (right; cf. Fig. 2 for how to interpret the panels). Both parameters have asymmetric effects on the regions. Left, as positive class prevalence decreases, the Bottom-K and Movers Tests lose sensitivity, while the Top-K gains sensitivity. Right, as region size decreases, all tests uniformly lose sensitivity.
4.2. Qualitative conditions for successful tests
Varying the parameters of this simulated environment modulates the sensitivity of the tests in the Firenze framework. Importantly, none of these regimes bias the tests; therefore, as long as the markers have accuracy 𝛼 > 0.5 and are independently generated from the training data, Firenze can yield at best a successful identification of the better model, and at worst an inconclusive result. Qualitatively, we observe that, when evaluating highly-performant, incrementally-different models (all 𝑃 > 0.9), a single marker with accuracy 𝛼 > 0.7 and coverage 𝛽 > 0.5 can successfully identify the better model with reasonable probability.
The other parameters we varied suggest a loose "operating regime" for evaluation with Firenze. Within user control, large(r) region-of-interest sizes 𝐾 yield more sensitive tests. Outside user control, low positive-class prevalence 𝜋, significant generalization errors 𝑃true ≪ 𝑃train, and/or small differences in ground-truth performance 𝑃true^𝑇 ≈ 𝑃true^𝑅 yield less sensitive tests, especially for the Top- and Bottom-K Tests. Taken together with the higher sensitivity of the Movers Test throughout, these observations suggest that regions-of-interest yield successful tests when they have a heterogeneity of labels, i.e. a propensity for non-zero differences in marker score to emerge. We are optimistic that future work can affirm these relationships and insights analytically and provide a broader theory of evaluative weak signals.
5. Evaluating malware detection models using Firenze
To illustrate how Firenze can be used in practice, we share a first case study, a replicable proof-of-concept comparing two models for ML-based malware detection which use the EMBER open-source malware dataset [2]. The EMBER dataset is a curated set of malicious and benign Windows PE files for static analysis. The default feature representation of these data spans features from file headers, section information, file imports and exports, directory information, and byte entropy statistics. To construct an ecologically valid case study, we sort the EMBER data by the date/time at which each sample was first observed, to train our reference and test models on "past" data (pre-December 2017), perform preliminary tests on "present" data (December 2017), and evaluate with Firenze on "future" data (2018) [24]. For this purpose, we specifically use the unlabeled samples from the 2018 period. The reference model is a neural network classifier with the same architecture as Erdemir et al. used in their experiments with the EMBER dataset [11].
The test model is a gradient-boosted decision tree with the same hyperparameters as Anderson et al. in the original EMBER paper [2]. On "present" data, performances of the reference and test models appear highly comparable (AUC_R = 0.9981 versus AUC_T = 0.9984), but on future data, model performances are known to degrade—as will our estimates of them—as covariate shift, concept drift, and/or label drift mount [31, 24, 5]. Using Firenze on the unlabeled dataset, we investigate to what extent the change in model architecture improves performance by (i) increasing true malicious file identifications (true positives) by the model, without increasing false positives, and (ii) improving identification of benign files without increasing false negatives.
5.1. Markers for malware detection
We designed five markers to evaluate these models; as above, we outline them here and discuss details in Appendix B.1. Recall that a verdict of 1 is malicious, −1 is benign, and 0 is null/abstain:
• Suspicious Section Properties: If the sample contains more than one executable section or any writable-and-executable section, then 1, else 0
• Unusual Number of Imported Functions: If the sample contains fewer than 25 imported functions—a count typical of packed samples—then 1, else 0
• Nonsensical Section Names: If the sample contains a nonsensical section name, as determined by nostril [13], then 1, else 0
• Import of suspicious functions: If the sample imports functions and libraries associated with common malicious functionality (see Appendix B.2 for details), then 1, else 0
• Signed: If the sample is signed by a trusted source, then −1, else 0
Consider the second marker, unusual number of imports, and how it reflects our definition of markers as weak signals. Though very few imports—common for packed/obfuscated samples—is a good signal of suspiciousness, numerous imports is not a signal of legitimacy by negation. Likewise, many malicious samples aren't packed, and could contain any number of imports. To ensure our markers do not overlap with model features and lead to biased results, data fields pertaining to these properties are withheld from the training process.
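As an illustration, the sketch below shows how the first two markers might be computed from EMBER's raw per-file JSON records. The field names ("section", "props", "imports", and the MEM_* property strings) reflect our reading of the EMBER raw-feature layout and may need adjusting for a specific EMBER release; treat this as a sketch, not the implementation used in the paper.

```python
def suspicious_section_properties(sample: dict) -> int:
    """Return 1 (likely malicious) if the PE has more than one executable
    section or any writable-and-executable section, else 0 (abstain)."""
    sections = sample.get("section", {}).get("sections", [])
    n_executable = 0
    for sec in sections:
        props = set(sec.get("props", []))
        if "MEM_EXECUTE" in props:
            n_executable += 1
            if "MEM_WRITE" in props:   # W+X section: self-modifying-code smell
                return 1
    return 1 if n_executable > 1 else 0

def unusual_import_count(sample: dict, threshold: int = 25) -> int:
    """Return 1 if the PE imports fewer than `threshold` functions
    (a count typical of packed samples), else 0 (abstain)."""
    imports = sample.get("imports", {})   # dict: library name -> list of functions
    n_funcs = sum(len(funcs) for funcs in imports.values())
    return 1 if n_funcs < threshold else 0
```

Note that both functions abstain rather than vote benign when their condition is not met, matching the one-sided marker definitions above.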
5.2. Region-based testing with 𝐾 = 50,000 samples
Region-based hypothesis testing with Firenze is mechanistically amenable to malware detection. Suppose the models' predictions are triaged by a security operations team with a limited bandwidth of manual investigations. If a detection count of 𝐾 = 50k samples approximately filled that bandwidth, it would be sensible to evaluate model performance only over that region within which impactful security decisions are made. We apply Firenze to evaluate our two malware detectors on regions of 50k samples, e.g. for a blocklist (malicious verdicts, Top-𝐾), allowlist (benign verdicts, Bottom-𝐾), or investigative list (Movers).
The reference and test models are pretrained, and we report the outcomes of Firenze's region-based hypothesis tests in Table 1 below. Each table reports their combined marker scores 𝑍(·) (abbrev. CMS) on each region and the p-value of the t-test that tests each hypothesis by which the test model would be better than the reference model (or the reference better than the test; see Section 3.2). We summarize the outcome of each test with an S (Success) to show "test model out-performs reference model" (p ≤ 0.05), an F (Failure) to show "reference model out-performs test model" (p ≤ 0.05 for the opposite outcome), and a U (Undetermined) to show an inconclusive outcome (p > 0.05).
We see that all of the Top-K, Bottom-K, and Movers tests succeed, i.e. the test model is uniformly better at scoring malicious and benign samples, as well as at moving malicious/benign samples to higher/lower ranks. These results are more granular, interpretable, and therefore trustworthy than minuscule differences in AUC on labeled data, such that a security expert would recommend the test model over the reference. The EMBER dataset presents two explicit opportunities to verify conclusions drawn from Firenze's tests. First, because the dataset also contains 800k labeled samples from the 2018 period used for evaluation, we can verify with classical metrics that the test model is, indeed, out-performing the reference model (AUC_R = 0.9166 versus AUC_T = 0.9371), though both show degradation of performance over time. Second, because we could manually retrieve VirusTotal reports and labels (0 verdicts ⇒ benign, ≥40 verdicts ⇒ malicious [2]) on these once-unlabeled samples now—four years later—we can verify our conclusions once more (Accuracy_R = 0.90 versus Accuracy_T = 0.94).
6. Evaluating domain name reputation models using Firenze
We follow up with a second case study from a mature real-world use-case comparing two models for domain name reputation, which use fully anonymized passive DNS data obtained from a large cloud service provider. The exact details of these models are not the focus of this paper, but they can be assumed similar to previous related work in this space [6, 3, 19, 25]. These domain name reputation models are used to identify malicious domains for threat detection as well as benign domains for false positive mitigation. The reference model is an already-in-use production version of the model. The test model is a proposed update to the model which adds additional features. Both models would score as many as one billion domains per day, but are only trained on a few million domains with known labels. This large discrepancy makes model improvements difficult to evaluate, since precision and recall across model versions compared to labels stay relatively stable (here, the test model scores slightly better; area under the ROC curve AUC_R = 0.98387 versus AUC_T = 0.98527). Using Firenze on all domains, we investigate to what extent the new feature addition achieved our goals to (i) increase true malicious domain identifications (true positives) by the model without increasing false positives and (ii) improve identification of benign domains without increasing false negatives.
We designed seven markers to evaluate these models; we outline them here, and discuss the relevant background information and domain expertise that motivated them in Appendix C.1.
• Abused Domain: If the domain is associated with a curated list of known-abused domains, then 1, else 0
• Sinkholed Domain: If the domain is associated with a curated list of known-sinkhole IP addresses, then 1, else 0
• Honeypot Domain: If the domain appears in in-house honeypot logs, then 1, else 0
• Domain Popularity: If the domain is considered popular based on query counts, then −1, else 0
• Number of IPs: If the domain maps to more than 50 unique IP addresses, then −1, else 0
• Number of TTLs: If the domain appears with more than 500 TTLs (Time to Live), then −1, else 0
• Known Future Label: If the domain is labeled malicious in the future labels, then 1; if it is labeled benign, then −1; else 0
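For illustration, here is a sketch of how three of the benign markers could be written against a per-domain aggregate of passive DNS observations. The record fields (query_count, unique_ips, unique_ttls) and the popularity cutoff are hypothetical placeholders, not the production feature names or thresholds.

```python
POPULARITY_THRESHOLD = 1_000_000   # hypothetical daily query-count cutoff

def domain_popularity_marker(record: dict) -> int:
    """-1 (likely benign) if the domain is highly popular by query count, else 0."""
    return -1 if record["query_count"] > POPULARITY_THRESHOLD else 0

def number_of_ips_marker(record: dict) -> int:
    """-1 if the domain resolved to more than 50 unique IPs (CDN-like), else 0."""
    return -1 if record["unique_ips"] > 50 else 0

def number_of_ttls_marker(record: dict) -> int:
    """-1 if the domain was observed with more than 500 distinct TTLs, else 0."""
    return -1 if record["unique_ttls"] > 500 else 0
```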
Region-based hypothesis testing with Firenze is mechanistically amenable to the domain-name reputation problem as well. Analogously, suppose we applied Firenze to evaluate our two domain-name reputation models on regions with 𝐾 = 10k or 100k samples. The reference and test models are pretrained, and we report the outcomes of Firenze's region-based hypothesis tests in Table 2 below.
In Table 2, we first see that both Top-K tests fail, i.e. the test model is worse than the reference model at scoring malicious domains, but both Bottom-K tests succeed, i.e. it is better at scoring benign domains. The Movers test fails for 10k and is inconclusive for 100k, i.e. the test model does not move malicious samples to higher ranks and benign samples to lower ranks. We explicitly note that "Success" and "Failure", as noted in the tables, qualify whether the test model successfully outperforms the reference model. Overall, we conclude that we failed to develop a better model, but succeeded in identifying it so with Firenze.
Using traditional metrics over labeled data, like the AUC above, we observed that the test model was doing marginally better. But, with Firenze, we reveal a more nuanced picture of better benign detection and worse malicious detection, which reflects the plausible situation in which one model is not uniformly better than another; instead, they each have regimes in which they perform better or worse. The granularity of these insights is what a security expert would need to recommend that a business owner not ship the new model, citing likely false negatives for a customer. Defining markers and regions helps identify these aspects of performance, automate their evaluation with the robustness of statistical tests, and make confident business decisions based on the outcomes.
7. Discussion and conclusion
With this paper, we introduced Firenze as a modular, extensible framework for post-hoc comparative model evaluation, which constitutes a novel approach to the problem of learning from data with noisy, unreliable, or absent labels (see Section 1). We showed how Firenze is driven by markers, weak signals encoding domain expertise, and how we can leverage their aggregate information over subsets of samples with statistical significance tests to compare the performance of two models without requiring ground-truth labels (Section 3). We demonstrated its efficacy on simulated data (Section 4), as well as on two real-world case studies from malware detection and domain-name reputation, which illustrate how a user should construct and interpret markers and regions (Sections 5 and 6).
The framework also allows for flexibility in defining both markers and regions of interest to specialize the performance [improvements] users want to measure. Once these are implemented for a given use-case, they can be seamlessly reused across arbitrary model refinements and changes—small hyperparameter adjustments or even complete architectural overhauls. In all cases, the outcomes from Firenze are explainable. Since comparative performance can be viewed for each marker and test, this can be used to provide feedback for targeted model refinements.
This said, Firenze does not remove the need to acquire labels; high-quality labeled datasets remain the premier means to develop effective models. Firenze gives comparative insights into model performance, and cannot infer the absolute performance differences that are achievable with fully labeled data. Moreover, these insights hinge on the quality of markers designed by domain experts.
A reasonable ground truth dataset will help the expert ensure that the markers meet or exceed the quality conditions we lay down in sections 3 and 4. We suggest that effective applications of ML in the security domain require both: datasets with a high-quality [sub]set of labels for model training, and improved evaluation methods (like Firenze) to estimate improvement in performance on real-world data, much of which is unlabeled. Future work includes exploring statistical techniques to move from comparative analysis to single model analysis including threshold selection to achieve desired false positive rate; estimating uncertainty of the outcome based on test parameters; and expansion of statistical tests to explain how one model may be doing better. We are optimistic that Firenze can enable more holistic machine learning model development for research problems in information security by creating opportunities for direct participation by security researchers and business owners, as well as the usual ML scientists. A security researcher may design markers, regions, and/or tests like those proposed in section 3.2 to curate the aspects over which one model may be performing better than another, adaptively with their [evolving] domain expertise. The business owner can survey customers for desired improvements to product or model performance, which can motivate additional markers, and so on. References [1] Snorkel case studies. https://snorkel.ai/case-studies/. [2] H. S. Anderson and P. Roth. Ember: an open dataset for training static pe malware machine learning models. arXiv preprint arXiv:1804.04637, 2018. [3] M. Antonakakis, R. Perdisci, D. Dagon, W. Lee, and N. Feamster. Building a dynamic reputation system for dns. In USENIX security symposium, pages 273–290, 2010. [4] D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, K. Rieck, and C. Siemens. Drebin: Effective and explainable detection of android malware in your pocket. In Ndss, volume 14, pages 23–26, 2014. [5] F. Barbero, F. Pendlebury, F. Pierazzi, and L. Cavallaro. Transcending transcend: Revisiting malware classification in the presence of concept drift. In IEEE Symposium on Security and Privacy, 2022. [6] L. Bilge, E. Kirda, C. Kruegel, and M. Balduzzi. Exposure: Finding malicious domains using passive dns analysis. In Ndss, pages 1–17, 2011. [7] M. Corporation. Programming reference for the win32 api, 2022. URL https://docs. microsoft.com/en-us/windows/win32/api/. [8] K. Crowston. Amazon mechanical turk: A research tool for organizations and information systems scholars. In A. Bhattacherjee and B. Fitzgerald, editors, Shaping the Future of ICT Research. Methods and Approaches, pages 210–221, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. ISBN 978-3-642-35142-6. [9] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the em algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):20–28, 1979. ISSN 00359254, 14679876. URL http://www.jstor.org/stable/2346806. [10] W. Deng and L. Zheng. Are labels always necessary for classifier accuracy evaluation? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15069–15078, June 2021. [11] E. Erdemir, J. Bickford, L. Melis, and S. Aydore. Adversarial robustness with non-uniform perturbations. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 19147–19159. 
Curran Associates, Inc., 2021. [12] M. Fedorchuk and B. Lamiroy. Binary classifier evaluation without ground truth. In 2017 Ninth International Conference on Advances in Pattern Recognition (ICAPR), 2017. [13] M. Hucka. Nostril: A nonsense string evaluator written in python. Journal of Open Source Software, 3(25):596, 2018. doi: 10.21105/joss.00596. URL https://doi.org/10.21105/joss.00596. [14] I. n. Íncer Romeo, M. Theodorides, S. Afroz, and D. Wagner. Adversarially robust malware detection using monotonic classification. In Proceedings of the Fourth ACM International Workshop on Security and Privacy Analytics, IWSPA ’18, page 54–63, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450356343. doi: 10.1145/3180445. 3180449. URL https://doi.org/10.1145/3180445.3180449. [15] Intra2Net. Blacklist monitor, 2022. URL https://www.intra2net.com/en/support/antispam/. [16] R. J. Joyce, E. Raff, and C. Nicholas. A framework for cluster and classifier evaluation in the absence of reference labels. Proceedings of the 14th ACM Workshop on Artificial Intelligence and Security, Nov 2021. doi: 10.1145/3474369.3486867. URL http://dx.doi.org/ 10.1145/3474369.3486867. [17] A. Kantchelian, M. C. Tschantz, S. Afroz, B. Miller, V. Shankar, R. Bachwani, A. D. Joseph, and J. D. Tygar. Better malware ground truth: Techniques for weighting anti-virus vendor labels. In Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security, AISec ’15, page 45–56, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 9781450338264. doi: 10.1145/2808769.2808780. URL https://doi.org/10.1145/2808769. 2808780. [18] V. G. Li, M. Dunn, P. Pearce, D. McCoy, G. M. Voelker, and S. Savage. Reading the tea leaves: A comparative analysis of threat intelligence. In 28th USENIX Security Symposium (USENIX Security 19), pages 851–867, 2019. [19] P. Lison and V. Mavroeidis. Neural reputation models learned from passive dns data. In 2017 IEEE International Conference on Big Data (Big Data), pages 3662–3671. IEEE, 2017. [20] A. H. M Sikorski. Practical Malware Analysis. William Pollock, 2012. [21] A. Marcus and A. Parameswaran. Crowdsourced Data Management: Industry and Academic Perspectives (Book). Foundations and Trends® in Databases, December 2015. [22] A. T. Nguyen, E. Raff, C. Nicholas, and J. Holt. Leveraging uncertainty for improved static malware detection under extreme false positive constraints, 2021. URL https://arxiv.org/ abs/2108.04081. [23] M. Novák, J. Mírovský, K. Rysová, and M. Rysová. Exploiting Large Unlabeled Data in Automatic Evaluation of Coherence in Czech, pages 197–210. 08 2019. ISBN 978-3-030- 27946-2. doi: 10.1007/978-3-030-27947-9_17. [24] F. Pendlebury, F. Pierazzi, R. Jordaney, J. Kinder, and L. Cavallaro. Tesseract: Eliminating experimental bias in malware classification across space and time. In 28th USENIX Security Symposium (USENIX Security 19), pages 729–746, 2019. [25] Z. Ramzan, V. Seshadri, and C. Nachenberg. Reputation-based security. https://docs.broadcom.com/doc/reputation-based-security-en. [26] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré. Snorkel. Proceedings of the VLDB Endowment, 11(3):269–282, Nov 2017. ISSN 2150-8097. doi: 10.14778/3157794.3157797. URL http://dx.doi.org/10.14778/3157794.3157797. [27] A. Ratner, C. D. Sa, S. Wu, D. Selsam, and C. Ré. Data programming: Creating large training sets, quickly, 2017. [28] A. Ratner, S. Bach, P. Varma, and C. Ré. Weak supervision: the new programming paradigm for machine learning. 
Hazy Research. Available via https://dawn.cs.stanford.edu//2017/07/16/weak-supervision/. Accessed 05–09, 2019.
[29] J. Saxe and K. Berlin. Deep neural network based malware detection using two dimensional binary program features. In 2015 10th International Conference on Malicious and Unwanted Software (MALWARE), pages 11–20, 2015. doi: 10.1109/MALWARE.2015.7413680.
[30] A. Shiravi, H. Shiravi, M. Tavallaee, and A. A. Ghorbani. Toward developing a systematic approach to generate benchmark datasets for intrusion detection. Computers & Security, 31(3):357–374, 2012.
[31] A. Singh, A. Walenstein, and A. Lakhotia. Tracking concept drift in malware families. In Proceedings of the 5th ACM Workshop on Security and Artificial Intelligence, AISec '12, pages 81–92, New York, NY, USA, 2012. Association for Computing Machinery. ISBN 9781450316644. doi: 10.1145/2381896.2381910. URL https://doi.org/10.1145/2381896.2381910.
[32] R. Sommer and V. Paxson. Outside the closed world: On using machine learning for network intrusion detection. In 2010 IEEE Symposium on Security and Privacy, 2010.
[33] A. Tang, S. Sethumadhavan, and S. J. Stolfo. Unsupervised anomaly-based malware detection using hardware features. In A. Stavrou, H. Bos, and G. Portokalidis, editors, Research in Attacks, Intrusions and Defenses, pages 109–129, Cham, 2014. Springer International Publishing. ISBN 978-3-319-11379-1.
[34] P. Tully, M. Haigh, J. Gibble, and M. Sikorski. Learning to rank relevant malware strings using weak supervision. CAMLIS, 2019.
[35] B. L. Welch. The generalization of "Student's" problem when several different population variances are involved. Biometrika, 34(1-2):28–35, 1947. ISSN 0006-3444. doi: 10.1093/biomet/34.1-2.28. URL https://doi.org/10.1093/biomet/34.1-2.28.
[36] S. Zhu, J. Shi, L. Yang, B. Qin, Z. Zhang, L. Song, and G. Wang. Measuring and modeling the label dynamics of online anti-malware engines. In 29th USENIX Security Symposium (USENIX Security 20), pages 2361–2378, 2020.
A. Supplementary methods for simulated data experiments
Here, we outline the generative process used in Section 4 to define qualitative conditions under which Firenze can identify the better model with its region-based hypothesis tests. As a reminder, the parameters for our simulated environment are the positive class prevalence 𝜋, model performances on ground-truth and training labels 𝑃true^𝑅, 𝑃train^𝑅, 𝑃true^𝑇, and 𝑃train^𝑇, the number of samples 𝑁, the region size 𝐾, and the accuracies and coverages of the marker (𝛼 and 𝛽 resp.) and the training labels (ᾱ and β̄ resp.). Their default values, unless specified otherwise, are 𝑃train^𝑇 = 0.97, 𝑃train^𝑅 = 0.98, 𝑃true^𝑇 = 0.95, 𝑃true^𝑅 = 0.90, 𝜋 = 0.5, ᾱ = 0.95, β̄ = 0.10, 𝐾 = 10000, and 𝑁 = 1000000.
A.1. Label and weak signal generation
Let 𝑦true ∈ {−1, 1} be the unobserved ground-truth label for a sample 𝑠, generated as a Bernoulli random variable with probability/bias 𝜋. Let 𝑚 ∈ {−1, 0, 1} be the weak label assigned to this sample by a marker. The marker provides a noisy observation of this ground-truth label, generated as two more Bernoulli random variables. The first determines whether the marker yields a label; with probability/bias 𝛽, the marker provides a label, otherwise a null-value. The second determines whether the marker yields the correct label; with probability/bias 𝛼, the marker takes the actual label 𝑦true, otherwise it flips it.
Parametrized this way, 𝜋 defines the prevalence of the positive class, 𝛽 defines the coverage of the marker, and 𝛼 defines its accuracy. The resulting data-generating process is given by
𝑚 | 𝑦true, 𝑎, 𝑏 = 𝑎 · 𝑦true if 𝑏 = 1, and 0 otherwise   (1)
𝑏 ∼ Bernoulli(𝛽)   (2)
𝑎 ∼ Bernoulli(𝛼)   (3)
𝑦true ∼ Bernoulli(𝜋)   (4)
We simulate this process for each of the 𝑖 = 1, ..., 𝑁 samples 𝑠𝑖 and the single marker 𝑚(𝑠𝑖). For this simulated environment, we simulate a single marker, which emulates the most conservative regime of the Firenze framework.
Let 𝑦 be the observed label used for model training. We can simulate the same process for this label, subject to its own coverage β̄ and accuracy ᾱ. By design, the accuracy of these labels is much higher than that of any markers, ᾱ ≫ 𝛼𝑗, but still subject to noise and discrepancies from ground-truth labels:
𝑦 | 𝑦true = ā · 𝑦true if b̄ = 1, and 0 otherwise   (5)
b̄ ∼ Bernoulli(β̄)   (6)
ā ∼ Bernoulli(ᾱ)   (7)
Neither training labels nor model training are part of Firenze itself; they are part of our simulation as a means to generate model scores with fully specified performances in the next subsection. These scores become the "input" to Firenze.
A.2. Model score generation
Let 𝑝 ∈ (0, 1) be the model score of a sample 𝑠, and let 𝑦̂ = sign(𝑝 − 0.5) be a decision function that yields a class estimate from that score. This estimate has an observed performance with respect to the training [feed] labels, and an unobserved performance with respect to the ground-truth labels. We emulate the training process by generating scores as uniform random variables, with model performance enforced by a Bernoulli random variable with bias/probability 𝑃. The uniform variable generates noise without affecting the class estimate, drawn between (0, 0.49) if 𝑦 = −1 and (0.51, 1) if 𝑦 = 1; for samples without training labels (𝑦 = 0), we use 𝑦true in its place. The Bernoulli variable determines whether the class estimate is correct; with probability/bias 𝑃, the sample remains consistent with the ground-truth or training label, otherwise it flips to the other half-interval. Parametrized this way, 𝑃 defines the performance of the "trained" models; for models 𝑅 and 𝑇, across samples with a training label, we have performances 𝑃train^𝑅 and 𝑃train^𝑇, and for samples without a training label, we have 𝑃true^𝑅 and 𝑃true^𝑇. The resulting data-generating process—identically for models 𝑅 and 𝑇—is given by
𝑝 | 𝑓, 𝑐 = 𝑓 if 𝑐 = 1, and 1 − 𝑓 otherwise   (8)
𝑓 | 𝑦, 𝑦true ∼ Uniform(0.5, 1) if 𝑦 = 1 or (𝑦true = 1 ∧ 𝑦 = 0), and Uniform(0, 0.5) otherwise   (9)
𝑐 | b̄ ∼ Bernoulli(𝑃train) if b̄ = 1, and Bernoulli(𝑃true) otherwise   (10)
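For concreteness, here is a sketch of this generative process under our reading of Equations (1)–(10). The marker accuracy and coverage values below are example settings of the quantities swept in Section 4; this is not the authors' simulation code.

```python
import numpy as np

rng = np.random.default_rng(0)
N, pi = 1_000_000, 0.5
alpha, beta = 0.8, 0.5            # example marker accuracy/coverage (swept in Section 4)
alpha_bar, beta_bar = 0.95, 0.10  # training-label accuracy/coverage defaults

y_true = np.where(rng.random(N) < pi, 1, -1)                     # Eq. (4)

def noisy_labels(y, acc, cov):
    """Eqs. (1)-(3) / (5)-(7): keep y with prob. acc, flip otherwise; 0 with prob. 1 - cov."""
    observed = rng.random(len(y)) < cov
    correct = rng.random(len(y)) < acc
    return np.where(observed, np.where(correct, y, -y), 0)

marker = noisy_labels(y_true, alpha, beta)
y_train = noisy_labels(y_true, alpha_bar, beta_bar)

def model_scores(y_train, y_true, p_train, p_true):
    """Eqs. (8)-(10): uniform scores on the half-interval implied by the label,
    flipped to the other half-interval with prob. 1 - P_train (labeled samples)
    or 1 - P_true (unlabeled samples)."""
    has_label = y_train != 0
    y_ref = np.where(has_label, y_train, y_true)
    f = np.where(y_ref == 1,
                 rng.uniform(0.5, 1.0, len(y_true)),
                 rng.uniform(0.0, 0.5, len(y_true)))
    correct = rng.random(len(y_true)) < np.where(has_label, p_train, p_true)
    return np.where(correct, f, 1.0 - f)

scores_R = model_scores(y_train, y_true, p_train=0.98, p_true=0.90)
scores_T = model_scores(y_train, y_true, p_train=0.97, p_true=0.95)
```

These arrays (marker, scores_R, scores_T) are exactly the inputs consumed by the region-based tests sketched in Section 4.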
B. Supplementary materials for malware detection case study
B.1. Marker design rationale for malware detection
EMBER includes the following groups of raw data describing PE files: general properties, header information, import functions, export functions, section information, byte histogram, byte entropy, and string information. In keeping with our requirement to not use signals which are used for training as markers, here we [artificially] split the available data as follows. We used the general information, section header, and imports information to design the markers, while the remaining features were used to train the models. This split was necessary; we restrict ourselves to the data available in EMBER for all parts of the experiment to ensure that the study is replicable.
The following markers were designed with these fields:
• Suspicious Section Properties: In a binary, if we have more than one executable section or any sections that are writable and executable, then the file is likely bad. In benign files, it is expected that only the .text section will hold code and be executable. Deviation from this rule of thumb warrants suspicion. And if a file has sections that are writable and executable, that can indicate the presence of self-modifying code, which is malware-like behavior. Thus if a file has more than one section that is readable/executable, or any sections that are writable/executable, then it is likely malicious.
• Unusual Number of Imported Functions: Most binaries import multiple libraries and functions. A very low number of imports can indicate packing or some other type of obfuscation. There are exceptions, of course; an important one being managed code (written in .NET), where mscoree.dll is often the only import. Looking at a random sampling of benign files, we observed that most have on average more than 100 imported functions, whereas, looking at samples of binaries packed with UPX (a common packing utility used frequently by malware to thwart static analysis and signature matching), we see 5–25 imports. Thus if a file has fewer than 25 imported functions, then we deem it likely malware.
• Nonsensical Section Names: Windows binaries usually contain multiple sections. Most commonly, one or more of the following are present: .text, .rdata, .data, .rsrc, .reloc. Less frequently, but still prevalent, are others like .idata, .edata, .pdata or CODE. On the other hand, there are section names that warrant suspicion. For example, .UPX (and variants thereof) are added by UPX. While not all files packed with UPX are malware, from past experience we know that it is heavily used by malware authors. Additionally, files with nonsensical section names are also likely to be malware and not legitimate software. To detect whether a section name is "nonsensical" we use nostril [13]. If we see known suspicious section names, or nonsensical section names in a file, we deem it likely malware.
• Import of suspicious functions: There are certain functions and libraries that are used by binaries to implement functionality that is likely to be associated with malware. Thus the presence of these functions in the imports of a binary makes it suspicious. We use the presence of such functions as a test of suspiciousness in this marker. For example, process injection, which is commonly used by malware to elevate privileges or access resources belonging to another process, exhibits some peculiar function call patterns. The malware might call a series of Process32First/Next and Thread32First/Next to identify the process or thread it wants to inject into, and then call VirtualAllocEx to allocate memory in the remote process. Thus the presence of these functions in the imports section of a binary makes it suspicious. Of course, there are behaviors that a malware might exhibit that are also common amongst benign files; CreateFile is such a function that is used broadly by malware and benignware. Expertise and experience are required to design this marker. In Appendix B.2 we share a table of the functions we used in this marker and how they are used by malware.
• Signed: A signed PE file indicates that the file is from a trusted source and is likely benign.
While there are samples of signed malware (for example, expired certificates stolen during NVIDIA's compromise by the Lapsus$ group were used to sign malware), a valid signature from a trusted source remains a reliable weak signal of benignity over the population.
B.2. Table of Functions Used in the Suspicious Imports Marker
This table describes the functions used to define the Suspicious Imports Marker. The descriptions were obtained from MSDN [7]. Existing resources like [20] can be used to obtain such a list.
C. Supplementary materials for domain name reputation case study
C.1. Marker design rationale for domain reputation
Recall the seven markers introduced in the main text:
• Abused Domain: If the domain is associated with a curated list of known-abused domains, then 1, else 0
• Sinkholed Domain: If the domain is associated with a curated list of known-sinkhole IP addresses, then 1, else 0
• Honeypot Domain: If the domain appears in in-house honeypot logs, then 1, else 0
• Domain Popularity: If the domain is considered popular based on query counts, then −1, else 0
• Number of IPs: If the domain maps to more than 50 unique IP addresses, then −1, else 0
• Number of TTLs: If the domain appears with more than 500 TTLs, then −1, else 0
• Known Future Label: If the domain is labeled malicious in the future labels, then 1; if it is labeled benign, then −1; else 0
Based on past manual analysis, one interesting signal of maliciousness we found is association with abused top-level domains (TLDs) and effective second-level domains (e2LDs, the smallest unit of a domain name that can be registered by Internet users). Owners of these TLDs and e2LDs allow actors to register domain names for free or minimal cost. Though being related to an abused TLD is a good signal of suspiciousness, legitimate domains also exist within these name spaces, and not all domains associated with abused TLDs and e2LDs should be considered malicious. Thus, while this principle would be bad for labeling domains, it is a great marker. Another example we use is the association of a domain with a manually curated list of sinkhole IP addresses.
Along with malicious markers, we also utilize benign markers. For example, we expect that highly popular domains based on query counts will more likely be benign compared to malicious domains. Popularity itself is not a guarantee of benignity, but a decent signal, and therefore is another good marker. Further examples of benign markers include those domains that resolve to a very high number of IP addresses with multiple different TTLs (Time To Live)—based on our observations, these tend to be associated with Content Delivery Networks (CDNs) and heavily skew benign. While designing these markers, we also looked at domains that may resolve to a high number of IP addresses due to fast-flux behavior (and therefore are likely malicious), but we observed that the number of unique IPs observed for those domains over a day was far lower than we see for domains associated with CDNs. The thresholds for these markers effectively separate these types of activity. Thus we can see that designing markers is a combination of domain expertise verified by data—part art and part science.
Marker functions return 1 when the marker considers a domain to be likely malicious and −1 when the marker considers a domain to be likely benign; 0 indicates the marker did not vote. It is important to note that in these experiments, the markers are not used as label sources or features in training.
For example, although we maintain manually curated lists of known sinkhole IP addresses and abused TLDs, these lists are not used for training: maintaining a fully accurate list over time is challenging, and we do not want the model to overfit to those types of domain names. The Known Future Label marker is based on what the labels say about a domain one week after training time. In the security domain we usually lack a perfect signal about new entities, but within a few days labels get updated, whether through manual investigations, correlations, or gathering of external intelligence. Because these are "future labels", they cannot be used for training, but they are excellent for evaluation.

C.2. Fine-grained investigations of results based on individual markers

We walk through two tables of detailed test results here to illustrate how individual markers can be used to explain and dig deeper into the results seen in the summary view provided by the combined marker scores. For completeness, we provide all six such tables (Tables 5-10), covering the 2 × 3 combinations of K = 10k and 100k with the three region-based hypothesis tests.

Looking at the detailed view of the Top-K test over the 10k region in Table 5, we see that the average marker score for the malicious markers (Abused Domain and Sinkholed Domain) is higher for the reference model than for the test model, and the differences pass the statistical significance test. This shows that the test model finds fewer likely malicious domains of these types in its K-most-malicious-domains region than the reference model does. We observe a similar outcome for the Known Future Label marker, where the reference mean is higher than the test mean and the difference is statistically significant, implying that the test model detects fewer domains that will likely be labeled malicious in the future. Thus the test model does not accomplish our stated goal of increasing detection value. On the benign-marker side (Domain Popularity, Number of IPs, and Number of TTLs), the Top-K regions of the two models are not significantly different, implying that neither model deems likely benign domains highly malicious. These markers indicate that the real-world false-positive rates of the detections from the two models are likely to be similar, so the test model preserves the low-FP quality of the reference model. With these data points, we can show with a high degree of explainability that the test model does not perform better than the reference model at scoring malicious domains.

Looking at the malicious markers (Abused Domain and Sinkholed Domain) in the detailed view of the Bottom-K test over the 10k region in Table 6, we see that while the means for the test model are consistently lower than for the reference model, the averages for both models are often close to zero and the differences are not statistically significant. This implies that the false-negative rates of the benign lists generated by the two models will be similar. For the benign markers in the same test, the test mean is significantly lower than the reference mean for all markers, indicating that the test model finds more likely benign domains than the reference model. Thus we can once again illustrate our summary result that the test model is better at scoring benign domains. The sketch below shows how such a per-marker, region-based comparison can be computed.
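The following sketch illustrates the per-marker, region-based comparison behind Tables 5 through 10. The score and marker dictionaries, the choice of Welch's t-test, and the significance level are stand-ins for the regions and hypothesis test defined in Section 3, not a reproduction of them.

# Minimal sketch of a per-marker Top-K comparison between two models.
from typing import Dict, List, Tuple
import numpy as np
from scipy import stats

def top_k_region(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k domains that a model scores as most malicious (its Top-K region)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def compare_marker_on_top_k(marker_votes: Dict[str, int],
                            scores_ref: Dict[str, float],
                            scores_test: Dict[str, float],
                            k: int = 10_000,
                            alpha: float = 0.05) -> Tuple[float, float, float, str]:
    """Compare one marker's average vote over each model's Top-K region."""
    ref_votes = np.array([marker_votes.get(d, 0) for d in top_k_region(scores_ref, k)],
                         dtype=float)
    test_votes = np.array([marker_votes.get(d, 0) for d in top_k_region(scores_test, k)],
                          dtype=float)
    ref_mean, test_mean = ref_votes.mean(), test_votes.mean()
    # Welch's t-test as a stand-in for the paper's significance test.
    _, p_value = stats.ttest_ind(ref_votes, test_votes, equal_var=False)
    if np.isnan(p_value) or p_value >= alpha:
        verdict = "U"  # undetermined: difference indistinguishable from noise
    else:
        # For a malicious (+1) marker in the Top-K region, a higher test-model
        # mean is a success; the direction flips for benign markers and Bottom-K.
        verdict = "S" if test_mean > ref_mean else "F"
    return ref_mean, test_mean, p_value, verdict

For benign markers, or for the Bottom-K region, the direction of the S/F decision flips, and rows where neither region contains any marker votes correspond to the NaN p-values and U results seen in the tables.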
Table 1: Outcomes of Firenze's evaluative comparison of reference and test malware detection models with the Top-K, Bottom-K, and Movers tests for K = 50k

Test              | Avg CMS (Reference Model) | Avg CMS (Test Model) | p-value  | Result
TopK Test, 50k    | 0.11456                   | 0.68445              | < 10^-16 | S
BottomK Test, 50k | 0.09788                   | -0.16862             | < 10^-16 | S

Test              | Avg CMS (Up-Movers) | Avg CMS (Down-Movers) | p-value  | Result
Movers Test, 50k  | 0.42884             | 0.00868               | < 10^-16 | S

Table 2: Outcomes of Firenze's evaluative comparison of reference and test domain name reputation models with the Top-K, Bottom-K, and Movers tests for K = 10k and 100k

Test               | Avg CMS (Reference Model) | Avg CMS (Test Model) | p-value  | Result
TopK Test, 10k     | 0.617138                  | 0.516348             | < 10^-16 | F
TopK Test, 100k    | 0.570214                  | 0.405806             | < 10^-16 | F
BottomK Test, 10k  | -0.5795                   | -0.9835              | < 10^-16 | S
BottomK Test, 100k | -0.54655                  | -0.67804             | < 10^-16 | S

Test               | Avg CMS (Up-Movers) | Avg CMS (Down-Movers) | p-value | Result
Movers Test, 10k   | 0.0026              | 0.0074                | 0.011   | F
Movers Test, 100k  | 0.00036             | 0.00016               | 0.296   | U

Table 3: Functions Used in the Suspicious Imports Marker

Function            | Description                                                                                                                                 | Tactic or Type of Malware Associated
createprocessasuser | Creates a new process and its primary thread. The new process runs in the security context of the user represented by the specified token. | Injection
createservice       | After openscmanager, this is used to create the service which will run the malware functionality at startup                                | Persistence
cryptbinarytostring | Converts an array of bytes into a formatted string                                                                                         | Ransomware
cryptcreatehash     | Initiates the hashing of a stream of data                                                                                                  | Ransomware
cryptdestroyhash    | Destroys the hash object                                                                                                                   | Ransomware
cryptgethashparam   | Get the hashed value after applying an algorithm                                                                                           | Ransomware
crypthashdata       | The CryptHashData function adds data to a specified hash object                                                                            | Ransomware
encryptfile         | Encrypt a file or directory                                                                                                                | Ransomware
getadaptersinfo     | Used to obtain information about network adapters. Can be recon, or check for anti-VM functionality                                        | Anti-VM Functionality
getforegroundwindow | Returns handle to the window that is in the foreground. Used by keyloggers to determine which window the user is entering key strokes into | Keylogger
internetopen        | Initializes internet access functions from WinINet                                                                                         | C2 functionality
mapvirtualkey       | Translates virtual keycode into a character value                                                                                          | Keylogger
process32first      | Used to enumerate processes by malware prior to injection                                                                                  | Process Injection
process32next       | Used to enumerate processes by malware prior to injection                                                                                  | Process Injection
regopenkey          | Opens a handle to read or edit a registry key, which is a common persistence mechanism                                                     | Persistence
regsavekey          | Saves the specified key and all of its subkeys and values to a new file, in the standard format                                            | Persistence
setprop             | Used by malware to register a property and wait for its invocation to execute malicious commands                                           | Process Injection
thread32first       | Used to enumerate threads prior to injection                                                                                               | Process Injection
thread32next        | Used to enumerate threads prior to injection                                                                                               | Process Injection
urldownloadtofile   | Download a file from a webserver                                                                                                           | Downloader
virtualallocex      | Allocates memory in a remote process                                                                                                       | Process Injection
virtualprotectex    | Changes the protection on a memory region to make it executable                                                                            | Process Injection
winexec             | Execute a new program                                                                                                                      | Downloader

Table 5: Top-K test (K = 10,000)

Marker              | Avg Marker Score (Reference Model) | Avg Marker Score (Test Model) | p-value  | Result
AbusedDomain        | 0.310569                           | 0.294771                      | 2.07E-02 | F
SinkholedDomain     | 0.062194                           | 0.020898                      | 1.26E-47 | F
HoneypotDomain      | 0                                  | 0                             | NaN      | U
DomainPopularity    | 0                                  | 0                             | NaN      | U
NumberIPs           | 0                                  | 0                             | NaN      | U
NumberTTLs          | 0                                  | 0                             | NaN      | U
KnownFutureLabel    | 0.272273                           | 0.217278                      | 6.88E-19 | F
CombinedMarkerScore | 0.617138                           | 0.516348                      | 4.79E-46 | F

Table 6: Bottom-K test (K = 10,000)

Marker              | Avg Marker Score (Reference Model) | Avg Marker Score (Test Model) | p-value  | Result
AbusedDomain        | 0                                  | 0                             | NaN      | U
SinkholedDomain     | 0                                  | 0                             | NaN      | U
HoneypotDomain      | 0.0001                             | 0                             | 0.241959 | U
DomainPopularity    | -0.1875                            | -0.7806                       | 0.00E+00 | S
NumberIPs           | -0.2614                            | -0.669                        | 0.00E+00 | S
NumberTTLs          | -0.0631                            | -0.3632                       | 0.00E+00 | S
KnownFutureLabel    | -0.4275                            | -0.7642                       | 0.00E+00 | S
CombinedMarkerScore | -0.5795                            | -0.9835                       | 0.00E+00 | S

Table 7: Up-Movers and Down-Movers test (K = 10,000)

Marker              | Avg Marker Score (Up-Movers) | Avg Marker Score (Down-Movers) | p-value  | Result
AbusedDomain        | 0                            | 0                              | NaN      | U
SinkholedDomain     | 0                            | 0                              | NaN      | U
HoneypotDomain      | 0                            | 0                              | NaN      | U
DomainPopularity    | -0.0006                      | -0.0038                        | 3.44E-06 | S
NumberIPs           | 0                            | -0.0003                        | 8.90E-02 | U
NumberTTLs          | -0.0006                      | -0.004                         | 1.35E-06 | S
KnownFutureLabel    | 0.0033                       | 0.0133                         | 4.32E-09 | F
CombinedMarkerScore | 0.0026                       | 0.0074                         | 1.06E-02 | F

Table 8: Top-K test (K = 100,000)

Marker              | Avg Marker Score (Reference Model) | Avg Marker Score (Test Model) | p-value  | Result
AbusedDomain        | 0.088239                           | 0.063599                      | 4.39E-95 | F
SinkholedDomain     | 0.237948                           | 0.125499                      | 0.00E+00 | F
HoneypotDomain      | 0                                  | 0                             | NaN      | U
DomainPopularity    | -0.00017                           | -0.00014                      | 3.45E-01 | U
NumberIPs           | 0                                  | -0.00001                      | 2.42E-01 | U
NumberTTLs          | -0.00018                           | -0.00014                      | 3.11E-01 | U
KnownFutureLabel    | 0.268607                           | 0.235748                      | 2.86E-63 | F
CombinedMarkerScore | 0.570214                           | 0.405806                      | 0.00E+00 | F

Table 9: Bottom-K test (K = 100,000)

Marker              | Avg Marker Score (Reference Model) | Avg Marker Score (Test Model) | p-value  | Result
AbusedDomain        | 0                                  | 0                             | NaN      | U
SinkholedDomain     | 0                                  | 0.00001                       | 2.42E-01 | U
HoneypotDomain      | 0.00003                            | 0.00002                       | 3.61E-01 | U
DomainPopularity    | -0.23787                           | -0.49372                      | 0.00E+00 | S
NumberIPs           | -0.16872                           | -0.19044                      | 6.84E-36 | S
NumberTTLs          | -0.08835                           | -0.22822                      | 0.00E+00 | S
KnownFutureLabel    | -0.43925                           | -0.46809                      | 1.46E-37 | S
CombinedMarkerScore | -0.54655                           | -0.67804                      | 0.00E+00 | S

Table 10: Up-Movers and Down-Movers test (K = 100,000)

Marker              | Avg Marker Score (Up-Movers) | Avg Marker Score (Down-Movers) | p-value  | Result
AbusedDomain        | 0                            | 0                              | NaN      | U
SinkholedDomain     | 0                            | 0.00002                        | 1.47E-01 | U
HoneypotDomain      | 0                            | 0                              | NaN      | U
DomainPopularity    | -0.00008                     | -0.00059                       | 1.47E-09 | S
NumberIPs           | -0.00003                     | -0.00005                       | 3.11E-01 | U
NumberTTLs          | -0.00006                     | -0.00091                       | 2.62E-17 | S
KnownFutureLabel    | 0.00048                      | 0.00138                        | 2.88E-04 | F
CombinedMarkerScore | 0.00036                      | 0.00016                        | 2.96E-01 | U