Efficient Malware Analysis Using Metric Embeddings

Ethan M. Rudd1, David Krisiloff1, Daniel Olszewski2, Edward Raff3,4 and James Holt4
1 Mandiant Inc. 2 University of Florida 3 Booz Allen Hamilton 4 Laboratory for Physical Sciences, University of Maryland

Abstract

Machine learning-based malware classification has become a key component of modern defense-in-depth strategies, with focus placed on the binary classification task of malware detection. These detection models are typically combined with other toolchains, which provide additional context necessary for triage and remediation, including detection names, capability, and type information. The resulting systems are often complex and interconnected, incurring significant technical debt, infrastructure costs, and inevitable errors. In this paper, we examine the feasibility of using machine learning to streamline malware analysis pipelines in a manner that minimizes potential risks and costs while preserving flexibility and functionality. To this end, we explore the use of metric learning to embed malicious and benign samples in a low-dimensional vector space enriched with capability information for downstream use in a variety of applications, including detection, family classification, and malware attribute classification. Specifically, we enrich labeling on malicious and benign PE files from the EMBER dataset using Mandiant's CAPA tool, an open-source toolchain that uses disassembly and subject matter expert (SME)-derived rules and heuristics to determine malicious capabilities. Using these CAPA labels, we derive several different types of metric embeddings utilizing an embedding neural network trained via contrastive loss, Spearman rank correlation on malware similarity, and combinations thereof. We then examine performance on a variety of transfer tasks performed on the EMBER and SOREL datasets. We show that for a variety of transfer tasks, we are able to utilize relatively low-dimensional metric embeddings with little decay in performance, offering the potential to retrain quickly for new tasks. The low-dimensional representations also offer the potential to significantly reduce training and storage overhead when performing retrains or transferring to additional downstream tasks.

Keywords

Metric Embeddings, Machine Learning, Malware Analysis, Information Security

1. Introduction

Malware analysis is a complex process involving highly skilled experts and many person-hours. Given the number of new files seen each day (more than 500,000 on VirusTotal alone [1]), automation of malware analysis is a necessity. Development of new analysis tools provides an avenue for more efficient malware analysis teams. Fortunately, malware analysis tasks are often amenable to machine learning (ML) solutions. The tasks (e.g., malware detection or malware family classification) are complex enough that traditional rules-based approaches remain brittle and require frequent updating. At the same time, it is possible to acquire and label large datasets to train ML models using threat feeds and crowdsourcing services, like VirusTotal or ReversingLabs. One notable downside, however, is that ML models come with significant technical debt: they need to be retrained as malware evolves, and interdependencies between various model components can often be hard to understand or predict [2].
Even with a consistent feature vector representation, when training on industry-scale datasets, feature stores may require tens of terabytes and model re-training can take multiple weeks.

This motivates the question: can we use ML to derive low-dimensional representations that capture the semantic behavior of malware/goodware and that can be used to speed up and reduce resource requirements for downstream tasks? This could significantly enhance our capability to train classifiers for novel applications, perform rapid iteration/experimentation, and efficiently update deployed models, all at reduced processing and storage requirements. Since other applications of applied ML, including biometrics and information retrieval (IR) systems, have utilized metric learning to derive low-dimensional embeddings for similar downstream tasks, in this paper we explore whether we can apply metric learning in a similar vein to various malware analysis-oriented ML tasks. Using metric embeddings, we aim to simplify some of the engineering costs associated with running and maintaining a suite of different downstream ML tooling.

Contrary to other applications of ML classification, where data can trivially be assigned labels corresponding to one or more classes/attributes, labeling for malware analysis tasks can be more difficult [3]. Moreover, data for malware analysis often includes telemetry and metadata beyond hard labels, which could ideally be used to enrich our metric embeddings. In this paper, we explore techniques to enrich our embeddings with complex semantic information provided by computationally expensive tools (e.g., disassembly). This allows us to explore whether it is possible to approximate more expensive analysis with lower-overhead static representations.

When generating our embeddings, we utilize Mandiant's CAPA tool, an open-source tool that uses rules and heuristics in conjunction with disassembly to yield capability labels (e.g., file read/write, registry key generation, process creation, data send/receive over networks, socket connection, base64/XOR encoding) for a given PE, ELF, or .NET file, as well as shellcode snippets. We enrich samples from the EMBER dataset with capability labels, and using these labels generate different types of embeddings, including a Siamese embedding, which utilizes a contrastive loss over clusters of CAPA attributes, as well as a novel ranking embedding, which uses the Spearman rank correlation coefficient as a loss and aims to embed ranked degree of similarity between different CAPA attribute clusters. We then compare these different metric embedding loss functions across two datasets, EMBER and SOREL-20M, on three downstream transfer tasks: malware detection, malware family classification, and malware attribute classification, making comparisons to original dataset benchmarks where applicable.
Finally, we perform an analysis of adversarial robustness on the three respective embedding types.

2. Background and Related Work

Metric learning is a machine learning task that focuses on learning distances (i.e., metrics and/or measures) between objects that capture some semantically meaningful notion of similarity. These learned metric functions play an important role in fields including information retrieval, ranking, and recommendation systems [4]. The key property of a similarity metric/measure is that it maps similar objects close together and dissimilar objects far apart within the learned metric space. In practice, the objects are represented by a set of features, the metric function is the transformation of the features into a common metric space, and the learning process finds a transformation such that the similarity/dissimilarity behavior is correct with respect to the labels provided during training. Various learning architectures have been proposed, including Siamese networks that learn from the distances between pairs of objects, and triplet networks that use three samples to capture both similarity and dissimilarity to an anchor [5, 6]. There are also different loss functions for each architecture, along with other subtle modifications of the learning process that can be applied to improve results (e.g., specialized algorithms for stochastic gradient descent [4]).

The Spearman embedding technique that we introduce is not the first to incorporate ranking information into a metric embedding. The triplet-based approach to learning was originally proposed as a method to learn a function for object ranking [7] and used an anchor $a$ with a positive and a negative sample, $a_p$ and $a_n$ respectively. This work has been extended and become popular in deep learning, with most works focusing on the embedding constraints (e.g., L2-normalized or unconstrained) and the approach for finding triplet pairs [8, 9, 10, 11, 12, 13, 14]; however, recent work has shown that most of these gains may be attributed to learning algorithm improvements and model architecture (e.g., the invention of batch-norm), rather than improvements in the triplet learning procedure [15]. In contrast, our work comes full circle, developing a ranking-based loss built on the Spearman correlation coefficient to capture more fine-grained information than the standard triplet loss.

To perform metric learning, we require a dataset of objects along with their similarities, which act as labels in a supervised ML setting. Since our objects are binary portable executables (PEs), we need to define a meaningful notion of similarity among binaries. Binary similarity gauges the likeness between two binary files and can be defined in several different ways. Perhaps the cleanest definition is that two binaries are similar if they were compiled from the same source code or contain a large fraction of the same source code [16, 17]. This definition is particularly useful from a malware reverse engineering standpoint: knowing that a file contains source code from a known piece of malware can significantly speed up analysis. However, this definition introduces problems when labeling a dataset of binary files built from unknown source code. More broadly, binary similarity can be defined structurally: two files are similar if their structures are similar. This is the definition often used by those developing fuzzy or locality sensitive hashing algorithms [18, 19, 20, 21].
While less precise, these measures are computationally cheap, and some are amenable to fast ($\mathcal{O}(n \log n)$ or faster) database lookups. These hashes are routinely used in the malware analysis space, lacking any better option. Critically, recent work with ground truth (as assessed by manual reverse engineering) has shown that hashes of this nature can produce true-positive rates near 99.8%, at the cost of many false negatives [22, 23]. Given a sample $x$, this allows selecting a similar pair $x_p$ with high confidence. Selecting a negative pair $x_n$ can still be done with high confidence by random selection, since only a minority of samples will be similar to any given $x$.

To the best of our knowledge, ours is the first work to pursue a discriminative-style triplet loss as a method of learning a general-purpose feature representation. Prior methods generally fall into two categories: byte-based static approaches and heavyweight approaches that extract higher-level code representations. In both cases, these methods do not yield feature vectors that can adjust to population change over time (i.e., quarterly retraining) in our deployment scenario (i.e., low overhead). Prior byte-based and compression-based approaches [24, 25] to constructing a feature vector have been proposed, which allow for similarity search in a computationally efficient manner. Similarly, digital forensic hashes, which produce a hash-code of fixed or variable length that can be used to calculate similarities [26, 27, 28, 29, 30, 31, 32], are another approach that works on raw byte content. These approaches are often fast and low-overhead. However, no learning step occurs in either of these types of approaches, which prevents them from adapting to changes in the population of malware. The other primary approach is code similarity measures. These tend to work at either the disassembly level [33, 34, 35] or the disassembly call-graph level [36, 37, 38], and use a neural network to train an auto-regressive model that can produce a fixed-length representation. This allows adapting the model over time, but the reliance on at least disassembling the given executable limits our ability to deploy such representations. Disassembly itself is computationally demanding and often error-prone, as many files will not yield accurate disassembly without unpacking or deobfuscation, processes which may need to be done manually. Combined, these issues make the approach undesirable for our goals.

3. Approach

3.1. Overview

In this paper, we focus on building a model that produces embeddings of Windows PE files. The goal is to learn a representation that can be used for multiple downstream malware analysis tasks. The methodology can be separated into two phases. Figure 1 represents the first phase, where an embedding model is trained. The starting point is the raw data, which in our case is the EMBER 2018 dataset [39], a set of 1 million PE files seen in or before 2018. The raw data goes through two steps: (1) a featurization step, and (2) a step to compute file information that aids in determining the pairwise similarity between any two files. For featurization, we focus on subject matter expert (SME)-derived, static features that are efficient to compute and which capture a broad range of malicious signals. These features include the APIs imported, parsing errors, entropy, and byte distributions.
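As a reference point for this featurization step, the following is a minimal sketch assuming the open-source `ember` package released alongside the dataset [39]; the file path is illustrative, and the exact constructor arguments may differ across package versions:

```python
# Sketch: extract the 2381-dimensional EMBER static feature vector for one PE.
import ember
import numpy as np

# feature_version=2 corresponds to the 2381-dimensional features used here.
extractor = ember.PEFeatureExtractor(feature_version=2)

with open("sample.exe", "rb") as f:  # illustrative path
    bytez = f.read()

# SME-derived static features: imported APIs, header fields, parsing errors,
# entropy statistics, byte histograms, string features, etc.
features = np.array(extractor.feature_vector(bytez), dtype=np.float32)
assert features.shape == (2381,)
```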
To determine pairwise similarity between two PE files, one can use a tool like CAPA [40], which detects capabilities of executable files (i.e., two files with overlapping capabilities are similar). Notably, we hypothesize that using more complex similarity information than what is naturally available in the features (e.g., disassembly-based similarity vs. static features) will imbue the learned metric space with additional information without the added overhead of the more complex analysis techniques. Once the data and labels are defined, we specify an architecture for the embedding, which we instantiate with a multi-layered neural network. An algorithm then takes those components as input and trains an embedding model. The training algorithm can use a metric embedding network, e.g., a Siamese network or a triplet loss network, to learn the model parameters that effectively cause similar PE files to be near one another in the embedding space.

Figure 1: A system diagram depicting the upstream training of an embedding model. Training is conducted using a selected embedding model architecture, feature vectors from executable binaries, and additional data to support pairwise sample comparisons (e.g., CAPA labels).

During the second phase, as shown in Figure 2, we measure the transferability of the embedding space to various malware analysis tasks. Concretely, we embed our training data and use that representation to train new models for each downstream task. By keeping the models used for the transfer process constant and varying only our embedding process, we can make precise measurements of the utility of the various similarity information, loss functions, binary representations, and other parameters of our embedding network. The following sections detail our modeling setup.

Figure 2: A system diagram depicting the downstream use of a trained embedding model for multiple tasks. Features are extracted from executable binaries and fed as inputs to a trained embedding model, which generates a low-dimensional embedding representation of each of the input binaries. These embeddings, along with auxiliary label information, can be used for different downstream malware analysis tasks.

3.2. Network architecture

Our embedding neural network architecture is shown in Figure 3. The network takes as input the 2381-dimensional static features from [39]. The features are first normalized via standard scaling with respect to the mean and variance of the EMBER 2018 training set, then fed to an embedding network comprised of four dense layers of dimension 4,000, 1,024, 512, and 512. Each layer uses a sigmoid activation. Between the layers we include both a BatchNorm and a Dropout layer with a dropout probability of 10%. Following those layers is the final embedding layer, which uses a linear activation with a specified output dimension; in our experiments, we utilize an output dimension of 32. During training, the network outputs are optionally normalized, and losses that compare CAPA attribute information are evaluated and minimized via backpropagation.

Figure 3: Architectural schematic of our embedding network.
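For concreteness, the following is a minimal sketch of this architecture, assuming PyTorch. Layer sizes, activations, dropout probability, and output dimension follow the text above; other details (exact layer ordering, initialization, which is applied separately per Section 3.4) are assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """Sketch: 2381-d EMBER static features -> 32-d metric embedding."""

    def __init__(self, in_dim=2381, out_dim=32,
                 hidden=(4000, 1024, 512, 512), p_drop=0.1):
        super().__init__()
        layers, prev = [], in_dim
        for h in hidden:
            layers += [
                nn.Linear(prev, h),
                nn.Sigmoid(),       # sigmoid activation on each dense layer
                nn.BatchNorm1d(h),  # BatchNorm and Dropout between layers
                nn.Dropout(p_drop),
            ]
            prev = h
        layers.append(nn.Linear(prev, out_dim))  # final linear embedding layer
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # Inputs assumed pre-standardized (mean/variance of EMBER 2018 train).
        return self.net(x)

embedder = EmbeddingNet()
z = embedder(torch.randn(8, 2381))  # -> (8, 32) batch of embeddings
```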
3.3. Enriching Metric Embeddings with CAPA Labels

The CAPA system detects various capabilities of a binary file using both static analysis and disassembly of the file, and yields a set of capabilities for each file. These capabilities are categorical and non-mutually exclusive, and are labeled with short text snippets, e.g., "read file". We incorporate these generated sets of capabilities to enrich our embeddings via two different loss functions, which we apply both solo and in tandem (via summation) in our experiments.

Figure 4: A sample CAPA report for the PE file Lab01-01.dll_ from [40].

Figure 5: A visual representation of the difference between the Contrastive (a) and Spearman (b) losses. The Contrastive loss considers all entities either in-cluster (blue) or out-of-cluster (orange). The Spearman loss captures the relative similarity of objects, shown in shades of blue, regardless of cluster status.

3.3.1. Contrastive Loss

Contrastive loss is defined as:

$L_{\text{contrastive}} = Y_{\text{true}} D + (1 - Y_{\text{true}}) \max(\text{margin} - D, 0)$  (1)

where $D$ is the distance between a pair of points, $Y_{\text{true}}$ is 1 if the pair contains similar objects or 0 if the objects are dissimilar, and margin is the desired separation between dissimilar objects, a tunable hyperparameter. Equation 1 requires that samples either belong to the same group or not, meaning that we cannot include more fine-grained similarity (e.g., these two samples are 75% similar) in the loss function. Consequently, in this case we convert the CAPA detection sets into hard clusters with a locality sensitive hash. Employing a MinHash with one band and 64 permutations, we compute a single hash (cluster) for each binary file, where two files are similar if they lie in the same cluster ($Y_{\text{true}} = 1$) and dissimilar otherwise ($Y_{\text{true}} = 0$). During our experiments we employ a contrastive loss with a margin of 10 based on the Euclidean distance of the embeddings for each pair of samples in the batch. We tried a few tuning experiments for the margin hyperparameter but did not see significant improvements over this default.

3.3.2. Spearman Loss

While our contrastive approach assesses similarity based on CAPA clusters, it coarsely embeds binaries as "similar" or "not similar", when in reality some sets of CAPA labels are more similar than others. To account for finer-grained similarities, we employ a novel approach based on the Spearman rank correlation coefficient. Specifically, advances in approximate differentiable sorting and ranking [41] allow us to optimize Spearman's rank correlation coefficient with stochastic gradient descent. This allows us to compute the loss between a ground truth ranking and a predicted ranking from our model, which is desirable because it inserts more nuanced information into the loss function based on finer-grained degrees of similarity, rather than a simple binary similar/dissimilar decision. In our experiments the ranking is based on similarity, from most similar to least. Given integer ranks, we can define Spearman's rank correlation coefficient as

$r = 1 - \dfrac{6 \sum_i (R(X_i) - R(Y_i))^2}{n(n^2 - 1)}$  (2)

where $X_i$ and $Y_i$ are the ground truth similarity and predicted similarity of data point $i$, and $R(X_i)$ and $R(Y_i)$ are the corresponding ranks. We assess ground truth similarity between two CAPA capability sets as their Jaccard similarity, and use the soft rank implementation from [41] to compare predicted and ground truth ranks. For a given batch, we use these ranks to compute the Spearman rank correlation coefficient, which serves as the loss for that batch.

3.4. Training Process

All the layers of our networks are initialized by the Xavier algorithm [42] and trained with stochastic gradient descent (SGD) for a maximum of 30 epochs with a learning rate of 0.001. The batching algorithm used for SGD training was modified to better support our metric learning loss functions.
Ordinarily, each batch contains $C$ randomly sampled (with replacement) clusters and $M$ randomly sampled PE files from each cluster (again, sampled with replacement). For our binary similarity problem, we are confronted with two complications. First, each cluster can potentially contain both goodware and malware, unlike typical metric learning problems where clusters are homogeneous. Second, we have an extremely large number of clusters ($\mathcal{O}(10^5)$). To address the goodware and malware heterogeneity concern, we split each cluster $C$ into two clusters, $C$-goodware and $C$-malware. When we sample $C$ clusters, we do so from the combination of all $C$-goodware and $C$-malware clusters. To address the second concern, we sample the $C$ clusters without replacement and define the end of an epoch as the point when the model has processed examples from every cluster. This algorithm ensures we cover the full space of goodware, malware, and clusters in each epoch while maintaining a balance between positive and negative pairs in each batch. For these experiments, we set $C = 20$ and $M = 4$.

3.5. Transfer Process

After training the embedding, we measure the embedding's usability on various malware classification tasks. For our experiments, we train an embedding network using the EMBER 2018 training partition and extracted CAPA labels. Once we have a trained embedding network, we can use it to extract embeddings from any dataset with EMBER features. Using the extracted embeddings for a given dataset, we can then fit a lightweight classifier over the embeddings and corresponding labels to make predictions for a variety of different tasks.

The choice of the best final classifier for each task is not obvious. Typically, generalization-based learning using ensemble methods (e.g., random forests or gradient boosted trees) provides state-of-the-art performance on malware tasks. However, our feature space is unique in that distances between two training points have meaning, and decision tree methods that rely on splitting individual features may have difficulty capturing that geometry. Notably, SVMs are a generalization-based method that could take our metric space into account, but we do not consider them here due to the computational cost of training an SVM on very large datasets. An alternative would be an instance-based learning algorithm (e.g., $k$-nearest neighbors), which explicitly considers distances between training data points. As we will show in the following evaluation, we consider both instance-based and generalization-based classifiers, and the best classifier can vary based on the transfer task.

4. Experiments

4.1. Embedding Networks

We trained various embedding networks using EMBER feature vectors and CAPA labeling extracted from PEs in the EMBER 2018 train partition. These consist of:

• Contrastive loss on CAPA clusters.
• Spearman loss on Jaccard similarities between CAPA attribute sets.
• Mixed objective Spearman and Contrastive loss, where the net loss term is the sum of the two losses. In practice, we employ a weighting of 10x on the Spearman loss term to bring the contributions from each constituent loss term to roughly the same order of magnitude. (A sketch of all three objectives follows below.)

We trained each embedding network according to the procedure discussed in Section 3.4. Since deep learning models are not amenable to convex optimization (i.e., there is no global minimum guarantee), we trained five different instantiations of each model in order to assess variance in performance.
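To make the three objectives concrete, here is a hedged sketch of the loss computations from Sections 3.3 and 4.1, assuming PyTorch and the `torchsort` package (one open-source implementation of the fast differentiable soft ranking of [41]). The Jaccard ground truth follows Section 3.3.2; tensor shapes and helper names are illustrative assumptions, not the authors' code:

```python
import torch
import torchsort  # differentiable soft ranking (Blondel et al. [41])

def jaccard_similarity(caps_a, caps_b):
    # Ground-truth similarity between two CAPA capability sets (Section 3.3.2).
    union = caps_a | caps_b
    return len(caps_a & caps_b) / len(union) if union else 1.0

def contrastive_loss(dist, y_true, margin=10.0):
    # Eq. (1): dist = Euclidean distances between embedded pairs;
    # y_true = 1 for same-MinHash-cluster pairs, 0 otherwise.
    return (y_true * dist + (1 - y_true) * torch.clamp(margin - dist, min=0.0)).mean()

def spearman_loss(pred_sim, true_sim):
    # Eq. (2) over soft ranks; negated so minimizing maximizes rank correlation.
    pred_rank = torchsort.soft_rank(pred_sim.unsqueeze(0)).squeeze(0)
    true_rank = torchsort.soft_rank(true_sim.unsqueeze(0)).squeeze(0)
    n = pred_sim.numel()
    r = 1 - 6 * ((pred_rank - true_rank) ** 2).sum() / (n * (n ** 2 - 1))
    return -r

def mixed_loss(dist, y_true, pred_sim, true_sim, spearman_weight=10.0):
    # "Mixed-10": sum of both losses with the Spearman term up-weighted
    # to a similar order of magnitude (Section 4.1).
    return contrastive_loss(dist, y_true) + spearman_weight * spearman_loss(pred_sim, true_sim)
```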
When performing transfer task experiments, we then aggregated mean and standard deviation statistics across embeddings from all five networks of a given type.

4.2. Transfer Experiments on EMBER

As an initial evaluation of our embeddings, we performed two transfer tasks on the EMBER dataset: malware detection and malware family classification.

Figure 6: Transfer experiments on EMBER. In (a), results of the goodware/malware transfer experiment are reported in terms of the area under the ROC curve (AU-ROC). In (b), results of the malware family transfer experiment are reported in terms of accuracy. For both experiments, classifiers trained on Spearman embeddings underperformed those trained on Contrastive embeddings, while classifiers trained on embeddings derived from our weighted mixed-objective loss were the top performers.

The malware detection transfer task aims to detect malware by fitting a transfer classifier on embeddings of the EMBER dataset along with malicious/benign labels. For this task, we extracted embeddings across both the train and test partitions of EMBER 2018. We then fit a lightGBM ensemble with 1000 trees and otherwise default parameters over the embeddings extracted from the training set, and evaluated it using embeddings extracted from the test set. The results of this experiment are shown in Figure 6a in terms of the area under the ROC curve (AU-ROC) on the test set. In this experimental setting, we tried different weightings of the mixed objective loss, with the Spearman component both unweighted and up-weighted by a factor of 10 to be on the same scale as the contrastive component. We notice that transfer performance with the contrastive loss embedding significantly exceeds transfer performance with the Spearman loss embedding. However, both embeddings that use a combination of the two losses offer better classification performance than either embedding trained on a solo loss (Spearman or Contrastive), with the weighted mixed objective loss outperforming the unweighted one. Note that none of the transfer malicious/benign classifiers on EMBER 2018 perform comparably to the baseline model from [39].

The second transfer task is a malware family recognition task, which utilizes a nearest neighbor classifier in the embedding space in conjunction with the EMBER 2018 malware family labels (derived via AVClass [43]) to attribute malware family, for the malicious samples only. These results are shown in Figure 6b. While the evaluation here is in terms of accuracy rather than AU-ROC, we notice the same performance trend across embedding types as for the detection transfer task.

4.3. Transfer Experiments on SOREL-20M

We additionally evaluated the performance of our embeddings on different tasks using the SOREL-20M dataset. SOREL is a large industry-scale dataset with publicly available labeling telemetry beyond just malicious/benign detection: it also contains public labeling telemetry for 11 distinct malware attributes, namely Adware, Crypto Miner, Downloader, Dropper, File Infector, Flooder, Installer, Packed, Ransomware, Spyware, and Worm. Note that these attributes are non-mutually exclusive across samples, meaning that a given malware sample can have multiple malware attributes.

Figure 7: Results of the detection transfer experiment on SOREL-20M. Classifiers trained on the Contrastive and Mixed embeddings have the highest AU-ROCs of 0.973 ± 0.002 and 0.976 ± 0.001, respectively; classifiers trained on the Spearman embedding have an AU-ROC of 0.948 ± 0.002.
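For reference, the transfer step used throughout these experiments reduces to fitting a lightweight classifier on precomputed embeddings. A minimal sketch, assuming lightgbm, a trained `embedder` as sketched in Section 3.2, and standardized EMBER feature arrays `X_train`/`X_test` with labels `y_train` (names are illustrative):

```python
import lightgbm as lgb
import torch

@torch.no_grad()
def embed(model, X):
    # X: (n_samples, 2381) standardized EMBER features -> (n_samples, 32)
    model.eval()
    return model(torch.as_tensor(X, dtype=torch.float32)).numpy()

Z_train, Z_test = embed(embedder, X_train), embed(embedder, X_test)

# lightGBM with 1000 trees and otherwise default parameters (Section 4.2).
clf = lgb.LGBMClassifier(n_estimators=1000)
clf.fit(Z_train, y_train)
scores = clf.predict_proba(Z_test)[:, 1]  # scores used for AU-ROC evaluation
```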
SOREL also contains different data with a different data distribution than EMBER (on which the embeddings were trained). This suggests that any strong performance over the EMBER-to-SOREL transition is an indication of the robustness of our approach to producing general-purpose features for downstream tasks.

We extracted 32-dimensional embeddings for all of the SOREL samples a priori, and trained second-stage task-specific lightGBM classifiers on the extracted embeddings. We assessed embedding performance/quality on two distinct tasks: malware detection and malware attribute labeling. Results from the malware detection task are shown in Figure 7. Embedding extraction and lightGBM fitting were performed five different times to obtain error bars. Consistent with our transfer experiments on EMBER, the mixed objective weighted Spearman+Contrastive embedding yields the highest AU-ROC, slightly outperforming the Contrastive embedding and significantly outperforming the Spearman embedding. For reference, the lightGBM baseline trained on the full 2381-dimensional features has an AU-ROC of 0.981 ± 0.002 [44], a relative average improvement over the top-performing Spearman+Contrastive model of 0.05%. However, storing the full 2381-dimensional feature vectors requires 74.4 times the storage of the 32-dimensional embeddings, indicating that in practice, the top-performing embeddings significantly reduce the storage burden at a slight reduction in net performance.

Results from the malware attribute labeling task are shown in Table 1. For this task, we fit lightGBM classifiers across each of the malware tags, using one hit for each tag as the criterion for presence of the attribute, consistent with [44]. Notably, we see a similar pattern: the Mixed-10 loss on average outperforms the Contrastive loss, and both on average outperform the Spearman loss. Note also that the Mixed-10 loss under-performs the Contrastive loss on attributes where Spearman performs poorly. On average, the results for tagging under-perform the baseline provided with the SOREL-20M benchmark, though this is a somewhat imperfect comparison, as the attribute baselines from [44] utilized a large multi-target network, factoring in number of vendor hits, malicious/benign classification, and simultaneous attribute predictions; thus some of the performance discrepancy is likely due to limitations of single-target classifiers.

| Attribute | Contrastive | Spearman | Mixed-10 |
|---|---|---|---|
| Adware | **0.917 ± 0.005** | 0.883 ± 0.005 | **0.917 ± 0.002** |
| Crypto Miner | **0.976 ± 0.004** | 0.962 ± 0.001 | **0.976 ± 0.003** |
| Downloader | 0.832 ± 0.007 | 0.798 ± 0.005 | **0.835 ± 0.004** |
| Dropper | 0.819 ± 0.009 | 0.773 ± 0.005 | **0.824 ± 0.011** |
| File Infector | 0.878 ± 0.003 | 0.834 ± 0.005 | **0.885 ± 0.007** |
| Flooder | **0.982 ± 0.006** | 0.981 ± 0.003 | 0.979 ± 0.003 |
| Installer | 0.957 ± 0.003 | 0.929 ± 0.002 | **0.962 ± 0.002** |
| Packed | **0.783 ± 0.003** | 0.742 ± 0.004 | 0.779 ± 0.013 |
| Ransomware | 0.977 ± 0.003 | 0.959 ± 0.002 | **0.978 ± 0.003** |
| Spyware | **0.848 ± 0.010** | 0.776 ± 0.003 | 0.846 ± 0.014 |
| Worm | **0.877 ± 0.014** | 0.804 ± 0.014 | **0.877 ± 0.014** |

Table 1: Results from the malware attribute transfer experiment on SOREL-20M. Mean AU-ROC and AU-ROC standard deviation are reported, with results aggregated over five runs. Best results are shown in bold.

5. Discussion

We have introduced two different approaches to enrich metric embeddings with capability information derived from static analysis and disassembly, and performed evaluations thereof on multiple downstream tasks.
These approaches consist of a fine-grained Spearman embedding approach and a coarse-grained Contrastive embedding approach. In the vast majority of our experiments, the coarse-grained Contrastive approach exhibited superior performance to the finer-grained Spearman approach. In some respects, this is not surprising, as the Contrastive loss inherently forces separability in a way that the Spearman loss does not. An in-depth examination of similarity distributions and the adoption of similarity measures other than Jaccard similarity could be helpful in improving the Spearman embedding. Consistent with other literature in the ML security/applied ML space, we found that combining the Spearman and Contrastive embedding losses generally improved performance, as did balancing loss contributions to a similar order of magnitude [45, 46, 47].

While our embeddings performed comparably to classifiers trained on raw features for certain tasks, they did not work as well for others. Among a variety of factors, this may be due to the semantic information inherent to the CAPA embeddings, over/under-fitting, and hyperparameter selection, all of which are logical areas for follow-up research. Generally, we surmise that improving the performance of metric embeddings for additional tasks is feasible by utilizing additional training objectives. These may include malicious/benign labels, malware attribute tags, MITRE ATT&CK tactics (notably, CAPA outputs these as well as attributes), and additional metadata. Moreover, examining how embedding performance scales when training on larger, more heterogeneous groups of samples and evaluating on substantially concept-drifted data could offer further insight into embedding performance and design (e.g., [48]).

Notably, a low-dimensional embedding that performs well for a variety of classification and/or information retrieval tasks could yield significant computational and storage savings over utilizing raw features or binaries. Such embeddings could be utilized both in academic contexts, where compute resources are often limited, and in commercial contexts for rapid prototyping. As a reference, the SOREL-20M dataset is an order of magnitude smaller than the industry datasets typically used to train commercial PE malware detectors, yet it still comes with a warning about potentially incurring bandwidth fees or exhausting disk space. Even in featurized format as 32-bit floating point, the SOREL-20M dataset requires 172 GB of storage. Using the embeddings introduced in this paper, this can be compressed to roughly 2.3 GB, which is small enough to fit in memory, even on most laptops.

References

[1] VirusTotal, VirusTotal - stats, 2022. URL: https://www.virustotal.com/gui/stats, accessed: 2022-08-04.
[2] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, Machine learning: The high interest credit card of technical debt, SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop) (2014).
[3] S. Zhu, J. Shi, L. Yang, B. Qin, Z. Zhang, L. Song, G. Wang, Measuring and modeling the label dynamics of online Anti-Malware engines, in: 29th USENIX Security Symposium (USENIX Security 20), USENIX Association, 2020, pp. 2361–2378. URL: https://www.usenix.org/conference/usenixsecurity20/presentation/zhu.
[4] M. Kaya, H. Ş. Bilge, Deep metric learning: A survey, Symmetry 11 (2019) 1066.
[5] G. Koch, R. Zemel, R. Salakhutdinov, et al., Siamese neural networks for one-shot image recognition, in: ICML Deep Learning Workshop, volume 2, Lille, 2015, p. 0.
[6] E. Hoffer, N. Ailon, Deep metric learning using triplet network, in: International Workshop on Similarity-Based Pattern Recognition, Springer, 2015, pp. 84–92.
[7] G. Chechik, V. Sharma, U. Shalit, S. Bengio, Large Scale Online Learning of Image Similarity Through Ranking, J. Mach. Learn. Res. 11 (2010) 1109–1135. URL: http://dl.acm.org/citation.cfm?id=1756006.1756042.
[8] Y. Zhao, Z. Jin, G.-J. Qi, H. Lu, X.-S. Hua, An Adversarial Approach to Hard Triplet Generation, in: V. Ferrari, M. Hebert, C. Sminchisescu, Y. Weiss (Eds.), Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part IX, volume 11213 of Lecture Notes in Computer Science, Springer, 2018, pp. 508–524. URL: https://doi.org/10.1007/978-3-030-01240-3_31. doi:10.1007/978-3-030-01240-3_31.
[9] E. Hoffer, N. Ailon, Deep Metric Learning Using Triplet Network, in: A. Feragen, M. Pelillo, M. Loog (Eds.), SIMBAD 2015: Similarity-Based Pattern Recognition, Springer International Publishing, Cham, 2015, pp. 84–92. URL: https://doi.org/10.1007/978-3-319-24261-3_7. doi:10.1007/978-3-319-24261-3_7.
[10] F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: A unified embedding for face recognition and clustering, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2015, pp. 815–823. URL: http://ieeexplore.ieee.org/document/7298682/. doi:10.1109/CVPR.2015.7298682.
[11] Y. Zhai, X. Guo, Y. Lu, H. Li, In Defense of the Triplet Loss for Person Re-Identification, ArXiv e-prints (2018). URL: http://arxiv.org/abs/1809.05864. arXiv:1809.05864.
[12] S. Kim, M. Seo, I. Laptev, M. Cho, S. Kwak, Deep Metric Learning Beyond Binary Supervision, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2288–2297. URL: http://arxiv.org/abs/1904.09626. arXiv:1904.09626.
[13] W. Zheng, Z. Chen, J. Lu, J. Zhou, Hardness-Aware Deep Metric Learning, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 72–81.
[14] F. Xu, W. Zhang, Y. Cheng, W. Chu, Metric Learning with Equidistant and Equidistributed Triplet-Based Loss for Product Image Search, in: Proceedings of The Web Conference 2020, WWW '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 57–65. URL: https://doi.org/10.1145/3366423.3380094. doi:10.1145/3366423.3380094.
[15] K. Musgrave, S. Belongie, S.-N. Lim, A Metric Learning Reality Check, in: ECCV, 2020. URL: http://arxiv.org/abs/2003.08505. arXiv:2003.08505.
[16] E. C. R. Shin, D. Song, R. Moazzezi, Recognizing functions in binaries with neural networks, in: 24th USENIX Security Symposium (USENIX Security 15), 2015, pp. 611–626.
[17] X. Xu, C. Liu, Q. Feng, H. Yin, L. Song, D. Song, Neural network-based graph embedding for cross-platform binary code similarity detection, in: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017, pp. 363–376.
[18] J. Oliver, C. Cheng, Y. Chen, TLSH – a locality sensitive hash, in: 2013 Fourth Cybercrime and Trustworthy Computing Workshop, IEEE, 2013, pp. 7–13.
[19] E. Raff, C. Nicholas, An alternative to NCD for large sequences, Lempel-Ziv Jaccard distance, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 1007–1015.
[20] J. Jang, D. Brumley, S. Venkataraman, BitShred: feature hashing malware for scalable triage and semantic analysis, in: Proceedings of the 18th ACM Conference on Computer and Communications Security, 2011, pp. 309–320.
[21] A. Lee, T. Atkison, A comparison of fuzzy hashes: evaluation, guidelines, and future suggestions, in: Proceedings of the SouthEast Conference, 2017, pp. 18–25.
[22] R. J. Joyce, D. Amlani, C. Nicholas, E. Raff, MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels, in: The AAAI-22 Workshop on Artificial Intelligence for Cyber Security (AICS), 2022. URL: https://github.com/boozallen/MOTIF. doi:10.48550/arXiv.2111.15031. arXiv:2111.15031.
[23] R. J. Joyce, E. Raff, C. Nicholas, A Framework for Cluster and Classifier Evaluation in the Absence of Reference Labels, in: Proceedings of the 14th ACM Workshop on Artificial Intelligence and Security (AISec '21), Association for Computing Machinery, 2021. doi:10.1145/3474369.3486867. arXiv:2109.11126.
[24] E. Raff, C. Nicholas, M. McLean, A New Burrows Wheeler Transform Markov Distance, in: The Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020, pp. 5444–5453. URL: http://arxiv.org/abs/1912.13046. doi:10.1609/aaai.v34i04.5994. arXiv:1912.13046.
[25] E. Raff, C. Nicholas, Malware Classification and Class Imbalance via Stochastic Hashed LZJD, in: Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec '17, ACM, New York, NY, USA, 2017, pp. 111–120. URL: http://doi.acm.org/10.1145/3128572.3140446. doi:10.1145/3128572.3140446.
[26] C. Winter, M. Schneider, Y. Yannikos, F2S2: Fast forensic similarity search through indexing piecewise hash signatures, Digital Investigation 10 (2013) 361–371. URL: http://linkinghub.elsevier.com/retrieve/pii/S1742287613000789. doi:10.1016/j.diin.2013.08.003.
[27] F. Breitinger, K. P. Astebol, H. Baier, C. Busch, mvHash-B - A New Approach for Similarity Preserving Hashing, in: Proceedings of the 2013 Seventh International Conference on IT Security Incident Management and IT Forensics, IMF '13, IEEE Computer Society, Washington, DC, USA, 2013, pp. 33–44. URL: http://dx.doi.org/10.1109/IMF.2013.18. doi:10.1109/IMF.2013.18.
[28] F. Breitinger, H. Baier, D. White, On the database lookup problem of approximate matching, Digital Investigation 11 (2014) S1–S9. URL: http://linkinghub.elsevier.com/retrieve/pii/S1742287614000061. doi:10.1016/j.diin.2014.03.001.
[29] F. Breitinger, C. Rathgeb, H. Baier, An Efficient Similarity Digests Database Lookup - A Logarithmic Divide & Conquer Approach, The Journal of Digital Forensics, Security and Law (JDFSL) 9 (2014) 155–166. URL: http://ojs.jdfsl.org/index.php/jdfsl/article/view/276.
[30] F. Breitinger, H. Baier, Similarity Preserving Hashing: Eligible Properties and a New Algorithm MRSH-v2, in: Digital Forensics and Cyber Crime, Springer, 2013, pp. 167–182. URL: http://link.springer.com/10.1007/978-3-642-39891-9_11. doi:10.1007/978-3-642-39891-9_11.
[31] D. Lillis, F. Breitinger, M. Scanlon, Expediting MRSH-v2 Approximate Matching with Hierarchical Bloom Filter Trees, in: 9th EAI International Conference on Digital Forensics and Cyber Crime (ICDF2C 2017), Springer, Prague, Czechia, 2017.
[32] J. Oliver, C. Cheng, Y. Chen, TLSH – A Locality Sensitive Hash, in: 2013 Fourth Cybercrime and Trustworthy Computing Workshop, IEEE, 2013, pp. 7–13. URL: http://ieeexplore.ieee.org/document/6754635/. doi:10.1109/CTC.2013.9.
[33] S. H. H. Ding, B. C. M. Fung, P. Charland, Kam1N0: MapReduce-based Assembly Clone Search for Reverse Engineering, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, ACM, New York, NY, USA, 2016, pp. 461–470. URL: http://doi.acm.org/10.1145/2939672.2939719. doi:10.1145/2939672.2939719.
[34] S. H. H. Ding, B. C. M. Fung, P. Charland, Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization, in: 2019 IEEE Symposium on Security and Privacy (SP), 2019. doi:10.1109/SP.2019.00003.
[35] L. Massarelli, G. A. Di Luna, F. Petroni, L. Querzoni, R. Baldoni, SAFE: Self-Attentive Function Embeddings for Binary Similarity, in: Detection of Intrusions and Malware, and Vulnerability Assessment, 2019, pp. 309–329. URL: http://arxiv.org/abs/1811.05296. arXiv:1811.05296.
[36] X. Li, Y. Qu, H. Yin, PalmTree: Learning an Assembly Language Model for Instruction Embedding, in: CCS, 2021. arXiv:2103.03809.
[37] S. Yang, L. Cheng, Y. Zeng, Z. Lang, H. Zhu, Z. Shi, Asteria: Deep Learning-based AST-Encoding for Cross-platform Binary Code Similarity Detection, in: 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), IEEE, 2021, pp. 224–236. URL: https://ieeexplore.ieee.org/document/9505086/. doi:10.1109/DSN48987.2021.00036.
[38] M. Chandramohan, Y. Xue, Z. Xu, Y. Liu, C. Y. Cho, H. B. K. Tan, BinGo: Cross-architecture cross-OS Binary Search, in: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, ACM, New York, NY, USA, 2016, pp. 678–689. URL: http://doi.acm.org/10.1145/2950290.2950350. doi:10.1145/2950290.2950350.
[39] H. S. Anderson, P. Roth, EMBER: an open dataset for training static PE malware machine learning models, arXiv preprint arXiv:1804.04637 (2018).
[40] W. Ballenthin, M. Raabe, capa: Automatically identify malware capabilities, 2020. URL: https://www.mandiant.com/resources/capa-automatically-identify-malware-capabilities, accessed: 2022-08-05.
[41] M. Blondel, O. Teboul, Q. Berthet, J. Djolonga, Fast differentiable sorting and ranking, in: International Conference on Machine Learning, PMLR, 2020, pp. 950–959.
[42] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, 2010, pp. 249–256.
[43] M. Sebastián, R. Rivera, P. Kotzias, J. Caballero, AVclass: A tool for massive malware labeling, in: International Symposium on Research in Attacks, Intrusions, and Defenses, Springer, 2016, pp. 230–253.
[44] R. Harang, E. M. Rudd, SOREL-20M: A large scale benchmark dataset for malicious PE detection, arXiv preprint arXiv:2012.07634 (2020).
[45] E. M. Rudd, F. N. Ducau, C. Wild, K. Berlin, R. Harang, ALOHA: Auxiliary loss optimization for hypothesis augmentation, in: 28th USENIX Security Symposium (USENIX Security 19), 2019, pp. 303–320.
[46] E. M. Rudd, M. S. Rahman, P. Tully, Transformers for end-to-end infosec tasks: A feasibility study, in: Proceedings of the 1st Workshop on Robust Malware Analysis, 2022, pp. 21–31.
[47] E. M. Rudd, M. Günther, T. E. Boult, MOON: A mixed objective optimization network for the recognition of facial attributes, in: European Conference on Computer Vision, Springer, 2016, pp. 19–35.
[48] L. Yang, A. Ciptadi, I. Laziuk, A. Ahmadzadeh, G. Wang, BODMAS: An open dataset for learning based temporal analysis of PE malware, in: 4th Deep Learning and Security Workshop, 2021.