Minimizing Compute Costs: When Should We Run More Expensive Malware Analysis?

Andre T. Nguyen1,2,3,4, Richard Zak2,3,4, Luke E. Richards2,3,4, Maya Fuchs2,3, Fred Lu2,3,4, Robert Brandon2,3,5, Gary Lopez Munoz2,3,5, Edward Raff2,3,4, Charles Nicholas4 and James Holt2

1 JURA Bio, Inc. (work done while at the Laboratory for Physical Sciences and Booz Allen Hamilton)
2 Laboratory for Physical Sciences
3 Booz Allen Hamilton
4 University of Maryland, Baltimore County
5 Microsoft (work done while at the Laboratory for Physical Sciences and Booz Allen Hamilton)

CAMLIS'22: Conference on Applied Machine Learning in Information Security (CAMLIS), October 20–21, 2022, Arlington, VA
an@jurabio.com (A. T. Nguyen); edraff@lps.umd.edu (E. Raff); holt@lps.umd.edu (J. Holt)

Abstract
As organizations in government and industry increasingly rely on digitized data and networked computer systems, they face a growing risk of exposure to cyber attacks. Automated methods such as machine learning based malware detection algorithms have helped analysts to sift through large amounts of data. However, it is still too expensive to always run the best algorithms when massive amounts of new data are generated every day. In this work, we demonstrate the benefits of leveraging uncertainty estimation when multiple algorithms with different strengths and costs are used as a part of a larger machine learning malware detection system. In particular, we introduce a novel method in which cheaper machine learning algorithms can choose to defer to costlier models when their own predictions are uncertain and the more expensive model is expected to do well. We first use this method to detect specific capabilities in executable files, then extend it to general malware detection. In both cases, we are able to maintain high accuracy while minimizing the use of the more costly algorithms. With capability detection, we achieve an average 99.9% of correctly labeled capabilities for half the computational cost of using the expensive model throughout. For general malware detection, using this method to strategically balance the use of static and dynamic analysis saves a year's worth of compute time.

Keywords
machine learning, malware detection, cyber security, uncertainty

1. Introduction

As evidenced by current events, malware is a serious threat to companies, governments, schools, and even health care systems. It has cost billions in damages [1, 2] and even taken lives [3]. As computer networks grow in size, so do the challenges cybersecurity professionals face in securing them. With more connected devices, more users, and more complex systems, the volume of malware attack opportunities increases exponentially. It's clear, then, that automated malware detection is absolutely crucial, but the sheer volume of incoming data can make it very computationally challenging. Many anti-virus (AV) vendors see over 2 million new malicious files each month [4], and benign files on a network tend to outnumber malicious files at a ratio of 80:1 [5]. The collection and release of malware datasets has allowed for a vast array of machine learning (ML) based malware detection (MLMD) algorithms to be developed with the goal of classifying new files as benign (not malware) or malicious (malware) [6]. Traditionally, anti-virus systems used static and signature-driven systems (i.e., systems that looked for specific software known to be malicious) to detect malware [7]. More recently, dynamic software analysis has become increasingly popular. This involves running evaluated software in a secure environment to directly observe whether or not it behaves maliciously. Egele et al. [8] provide a survey on challenges, features, techniques, and tools for dynamic malware analysis.
Using machine learning to automate static and dynamic analysis enables faster investigation of more files and allows human analysts to focus on "hard" samples. While powerful, anti-virus and dynamic analysis tools have limitations. In particular, these methods can be time consuming when the data volume and velocity are high [9], with certain methods such as dynamic analysis having particularly high computational cost. Unsurprisingly, these algorithms also vary widely in their strengths, weaknesses, and costs, both in terms of model execution and especially feature extraction.

Recent works have demonstrated the benefits of incorporating uncertainty quantification in MLMD systems. Leveraging uncertainty allows for better identification of model errors, uncovering of new malware families, predictive performance under extreme false positive constraints, and out of distribution data detection [10, 11, 12]. These previous works have focused exclusively on static analysis based malware detection using a single ML algorithm.

In the following sections, we introduce a novel method in which cheaper ML algorithms can choose to defer to costlier models under two conditions: their own predictions are uncertain, and the more expensive model is likely to do well. We note that the main contribution of this paper is this deferral methodology; the determination of computational expense can vary from use case to use case. Despite this variation, we find that by using uncertainty quantification combined with auxiliary prediction targets, we can consistently minimize compute costs while maximizing overall system accuracy. Specifically, we estimate uncertainty using Bayesian and ensembling approaches, and we train the cheaper algorithms to predict the success of the more expensive algorithms. The result is an easy to implement approach that is applicable to a wide variety of AV systems. We first use this approach to speed up capability detection in executable files, using CAPA as the expensive model and MalConv as the cheaper one. We then show that this approach can improve general malware detection, where the cheaper model uses static analysis and the expensive model uses dynamic analysis. By using the expensive models only as-needed, we are able to improve accuracy while minimizing the use of more costly algorithms.

2. Related Work

The majority of the existing research on malware detection with machine learning has focused on the automation of static malware analysis. Raff et al. [13] introduce MalConv, a convolutional neural network for malware detection that operates on the raw byte sequences of files. Raff et al. [14] develop a new approach to temporal max pooling that improves MalConv's memory and computational costs. Understanding when a machine learning model is uncertain about its prediction is critical in high-risk applications such as cyber defense. However, the modeling of uncertainty is generally missing from the machine learning for malware literature.
Three recent works have demonstrated the benefits of incorporating uncertainty quantification in MLMD systems. In particular, leveraging uncertainty allows for the identification of model errors, the uncovering of new malware families, improved predictive performance under extreme false positive constraints, and better out-of-distribution data detection [10, 11, 12]. We note that these previous works have focused exclusively on static analysis based malware detection using a single ML algorithm. In the current work, we explore the benefits of uncertainty estimation in settings where multiple, diverse algorithms with different strengths and costs are used as part of a larger MLMD system.

2.1. Uncertainty Estimation

The Bayesian framework allows for the principled modeling of uncertainty in machine learning and decision making [15]. Within this framework, probabilities represent degrees of belief, as opposed to the frequentist interpretation of probabilities as long-run frequencies [16]. Bayesian inference uses Bayes' Theorem to update beliefs (represented in the form of probability distributions) when new data is observed. In the context of machine learning, a Bayesian update takes the following form, where θ represents model parameters, D represents the data, and M represents the model class:

P(θ | D, M) = P(D | θ, M) P(θ | M) / P(D | M)

P(θ | D, M) is the posterior belief about the model parameters given the data, P(D | θ, M) is the likelihood of the data given the model parameters, P(θ | M) is the prior belief about model parameters, and P(D | M) is the marginal likelihood or evidence. These are related by Bayes' rule. For prediction, the posterior predictive can be computed as follows, where D is the training data and D* is the test data:

P(D* | D) = ∫ P(D* | D, θ) P(θ | D) dθ

When data points are conditionally independent given model parameters, then P(D* | D, θ) = P(D* | θ), and we can write:

P(D* | D) = ∫ P(D* | θ) P(θ | D) dθ

The posterior predictive is an example of Bayesian model averaging [17], a posterior weighted average of P(D* | θ).

As exact Bayesian inference cannot be done for Bayesian deep learning models, approximate inference methods need to be used. A straightforward approach is to use a Laplace approximation to model the posterior over neural network weights as a Gaussian [18, 19]. Ritter et al. [20] construct a Kronecker-factored Laplace approximation to the posterior. Kristiadi et al. [21] add uncertainty units to a pre-trained network. The units are trained post-hoc using an uncertainty-aware loss. This addition improves uncertainty under a Laplace approximation. Proper Markov Chain Monte Carlo (MCMC) methods sample from the exact posterior given enough time. The MCMC method of choice is Hamiltonian Monte Carlo (HMC) as described in Neal [22] under the name "hybrid Monte Carlo." Betancourt [23] provides an in-depth introduction to HMC. While HMC is notoriously difficult to tune, methods such as the No-U-Turn Sampler of Hoffman and Gelman [24] attempt to automate the tuning. While considered the gold standard, HMC isn't scalable because the method requires gradient computations that use the entire dataset. As MCMC is hard to scale and tune in practice, variational inference is often used instead [25, 26]. This converts the integration problem into an optimization problem, where the posterior is approximated using a simpler variational distribution.
In particular, the exact posterior p(·|D) is approximated by a variational approximation q(·) by minimizing the Kullback-Leibler divergence:

q* = argmin_{q∈Q} KL(q(·) || p(·|D))
   = argmin_{q∈Q} ∫ q(θ) log [q(θ) / p(θ|D)] dθ
   = argmin_{q∈Q} ∫ q(θ) log [q(θ) p(D) / p(θ, D)] dθ
   = argmin_{q∈Q} ( log p(D) − ∫ q(θ) log [p(θ, D) / q(θ)] dθ )

The log p(D) term does not depend on q, so it can be dropped from the optimization, and the subtracted term that remains is called the evidence lower bound (ELBO). (The ELBO is a lower bound since the KL divergence is always non-negative, and as a result log p(D) ≥ ELBO.) So the optimization becomes

q* = argmax_{q∈Q} ELBO(q)

The scaling of Bayesian methods to sizeable datasets has largely been made possible by stochastic variational inference [27]. Black box automatic stochastic variational inference methods [28, 29, 30, 31] that only require the specification of the model log-likelihood have improved the usability and adoption of Bayesian methods, and have sped up model iteration. Variational inference for neural networks was first introduced in the early 1990s [32]. Graves [33] revived interest in the topic by introducing a stochastic variational method for inference in neural networks, which was later improved by Blundell et al. [34]. Gal and Ghahramani [35], Gal [36] introduce an easy to implement approach to variational inference in Bayesian neural networks. In particular, they show that a neural network with dropout [37, 38] is equivalent to an approximation of a deep Gaussian process [39], and training with dropout effectively performs variational inference for the deep Gaussian process model. For reference, dropout is a technique commonly used to reduce overfitting in neural networks by randomly dropping units during training, applied before every weight layer. Leaving dropout on at test time allows for sampling from the posterior distribution. This is the approach used for the Bayesian MalConv model [12] that we will use as the cheaper, static model in our experiments.

Two different kinds of uncertainty can be measured [36]. The first, aleatoric uncertainty, is a result of inherent stochasticity in the input data. More training data will not reduce the aleatoric uncertainty associated with a prediction, as this uncertainty is caused by noise in the prediction time input data. The second, epistemic uncertainty, is caused by a lack of training data resembling the prediction time input data. In regions of the input space that lack training data, model parameters are less constrained, and different model parameter settings that produce diverse and potentially conflicting predictions can be comparably likely under the posterior. For classification tasks where aleatoric and epistemic uncertainty don't need to be differentiated, uncertainty can be measured using the predictive distribution entropy:

H[P(y|x, D)] = − Σ_{y∈C} P(y|x, D) log P(y|x, D)

Aleatoric uncertainty can be measured using expected entropy:

u_alea = E_{P(θ|D)} H[P(y|x, θ)]

Mutual information can be used to measure epistemic uncertainty:

u_epis = I(θ, y | D, x) = H[P(y|x, D)] − E_{P(θ|D)} H[P(y|x, θ)]

Monte Carlo estimates that are obtained by sampling from the posterior can be used to approximate the terms of these equations for our Bayesian models [40]. In particular, P(y|x, D) ≈ (1/T) Σ_{i=1}^{T} P(y|x, θ_i) and E_{P(θ|D)} H[P(y|x, θ)] ≈ (1/T) Σ_{i=1}^{T} H[P(y|x, θ_i)], where the θ_i are samples from the posterior over models, and T is the number of samples.
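To make these estimators concrete, the following is a minimal NumPy sketch (ours, not the authors' code) of the Monte Carlo approximations above, computed from T posterior samples such as T stochastic forward passes of a dropout network; the function name and array shapes are illustrative assumptions.

```python
import numpy as np

def uncertainty_from_samples(probs, eps=1e-12):
    """Monte Carlo uncertainty estimates from T posterior samples.

    probs: array of shape (T, N, C) holding P(y | x, theta_i) for T posterior
    samples, N inputs, and C classes. Returns predictive entropy, aleatoric
    (expected entropy), and epistemic (mutual information) uncertainty, each
    of shape (N,).
    """
    mean_probs = probs.mean(axis=0)                                  # posterior predictive, (N, C)
    predictive_entropy = -(mean_probs * np.log(mean_probs + eps)).sum(axis=1)

    per_sample_entropy = -(probs * np.log(probs + eps)).sum(axis=2)  # H[P(y|x, theta_i)], (T, N)
    aleatoric = per_sample_entropy.mean(axis=0)                      # expected entropy

    epistemic = predictive_entropy - aleatoric                       # mutual information
    return predictive_entropy, aleatoric, epistemic

# Example: 16 posterior samples, 5 files, binary benign/malicious output.
rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 5, 2))
probs = np.exp(logits) / np.exp(logits).sum(axis=2, keepdims=True)
H, u_alea, u_epis = uncertainty_from_samples(probs)
```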
For ensemble models which are not explicitly Bayesian (because each ensemble member receives the same weight), uncertainties can be computed in a similar way. In this case, the θ_i are no longer samples from a posterior, but multiple independent trainings of a model with T different random seeds. In our experiments, we measure overall uncertainty using predictive entropy.

2.2. Learning to Reject/Defer

Rejection Learning is a framework where a machine learning model can reject a data point at prediction time and choose to not make a prediction if it is not confident or is uncertain [41, 42, 43, 44, 45]. Importantly, Rejection Learning does not take into account any downstream models or decision makers that will process rejected data points. Madras et al. [46] extend and generalize this idea with Adaptive Rejection Learning, also known as Learning to Defer. Unlike Rejection Learning, Learning to Defer allows models to learn to adapt to the strengths, weaknesses, and biases of downstream decision makers and models. Mozannar and Sontag [47] use the method of Madras et al. [46] as a baseline and also tackle the problem of Learning to Defer. They provide a particularly good toy example of why the rejector and classifier need to be trained jointly. Similarly, Wilder et al. [48] introduce methods to optimize teams consisting of humans and ML algorithms.

Our approach differs from that of Madras et al. [46] in two ways. First, their approach can be seen as a mixture of Bernoullis: a binary deferral decision gating variable is used both to express uncertainty and to try to predict which model will have a lower loss. Our approach, on the other hand, explicitly decouples uncertainty from the prediction of a downstream model's performance. This separation allows for an easier interpretation of the reasons for a deferral decision, and in many cases likely results in an easier learning problem. Second, the Bayesian uncertainty approach described in Appendix F of Madras et al. [46] uses only uncertainty to reject, and, unlike our work, does not provide a way to incorporate the prediction of a downstream model's performance.

2.3. Active Feature Acquisition

An area related to Learning to Defer is active feature acquisition, the sequential decision process of whether or not to query more features under budget constraints. Gao and Koller [49] introduce an active classification process that combines multiple classifiers and features at test time using an instance-specific decision path. Similarly, Saar-Tsechansky et al. [50] use Value of Information to rank feature acquisition options. There are many ways to measure the value of information, including information gain, classification loss, and decision robustness [51]. Active feature acquisition is useful not only for minimizing costs, but also for applications such as fairness. For instance, Noriega-Campero et al. [52] discuss active feature acquisition in the context of fair classification. Xu et al. [53, 54] develop algorithms that select features individually for each test data point and reduce trees of classifiers into more efficient cascades of classifiers. Des Jardins et al. [55] construct a cascaded ensemble of classifiers to selectively acquire missing features at both train and test time. Wang et al. [45] develop Prediction Cascades, which use additional augmenting classifiers that evaluate the distributional output of the earlier classifier and estimate its uncertainty.
Many cascade methods have an assumed ordering of cascade members, so they are unable to adaptively reorder members based on the specific datum at hand. Gao and Koller [49] differ by balancing expected classification gain with computational cost. Observations are selected dynamically based on previous observations to achieve instance-specific decision paths. Cascades are not the only approach, however; Xu et al. [56] don't use a cascade of classifiers, instead choosing to extend stage-wise regression. They incorporate feature extraction cost into the objective during training, which minimizes test-time computation. Active feature acquisition can also be formulated as a Markov Decision Process [57, 58, 59], which allows for techniques from reinforcement learning to be used. We note that active feature acquisition is likely excessive for our malware detection use case, as we have two classifiers with a clear ordering of static analysis then dynamic analysis.

More broadly, this problem is fundamentally related to the task of estimating the value of information [60]; that alone is a highly multidisciplinary problem that is rooted in Information Theory [61]. Behrens et al. [62] show evidence that humans modulate their learning rates based on environment volatility and uncertainty, and in a way that can be predicted by a Bayesian learner. This supports the hypothesis of Bayesian reasoning in humans. Feltham [60] provides a formal framework for measuring the value of information in the context of accounting. In particular, they develop a framework to calculate the expected payoff for an information system from the perspective of the decision maker. Future research could include building off of this work in information theory by developing bespoke measures and cost functions that are applicable to cyber security and malware detection.

3. Data and Models

For malware and benign samples, we use the EMBER2018 dataset, which consists of portable executable files (PE files) scanned by VirusTotal in or before 2018 [63]. The dataset contains 600,000 labeled training samples and 200,000 labeled testing samples, with an equal number of malicious and benign samples in both sets. The malicious samples are also labeled by malware family using AVClass [64]. The train/test split in the data is temporal. For each sample, EMBER2018 includes vectorized features which encode general file information, header information, imported functions, exported functions, section information, byte histograms, byte-entropy histograms, and string information [63]. There are also 1.1TB of raw PE files that are not available as part of EMBER2018; these can be downloaded via VirusTotal.

Following the methodology of Nguyen et al. [11], we use a dropout Bayesian MalConv model [12] when working with raw binaries, and an LGBM model [65] for the EMBER extracted features. We also develop a model for dynamic analysis logs, and extract capabilities in the executable files. The latter is done using CAPA, with the SMDA recursive disassembler as a backend.2 For a comparison of computational costs, a 16 sample Bayesian MalConv model can process an average of about 51.2 EMBER2018 files per second on an NVIDIA Tesla P100 GPU. This is about 0.02 seconds per file for feature extraction, prediction, and sampling. The EMBER features take on average about 0.09 seconds to extract per file. CAPA feature extraction takes about 45.75 seconds per file on average.
Finally, running dynamic analysis on a file takes an average of 526 seconds, or 8 minutes and 46 seconds. Note that these timings exclude the downstream prediction model, so the actual cost is slightly higher. In other words, running MalConv on a file is over 26,300 times faster than running a dynamic analysis model. We note that we do not use larger datasets such as SoReL-20M of Harang and Rudd [66] due to the considerable computational cost. Running 20 million files through dynamic analysis would take over 333 years!

3.1. Dynamic Analysis Features and Model

We use a proprietary dynamic execution sandbox provided by a United States based software security company. The dynamic analysis tool uses libvirt to work with QEMU3 to manage Virtual Machines (VMs) running Windows 7 32-bit. The dynamic analysis tool puts a sample into the VM, starts a custom monitoring application, runs the sample, saves the monitor's log file, and pulls the log from the VM to the host. Then, the tool kills the VM image and creates a new one from a template. This way, the VM is clean for every sample. These dynamic analysis logs were featurized with a domain knowledge-driven approach similar to that of Anderson and Roth [63]. The dynamic analysis engine captures broad information about DNS, registry, file, process, and security events from the dynamic execution of the sample. These events contain raw information which we used to extract features for each file. The features include event subtype counts, event subtype information counts, number of unique processes by user, PID counts by user, event key frequencies, count of registry bytes written, registry key path information, timing information, authentication signature information, hashed event message, hashed command line information, and counts of unnamed events. We then train an LGBM model on top of these features.

We note that it is possible that some executables did not exhibit all of their behavior or functionality during dynamic analysis. This could happen for a number of reasons. It could be because they were not in the proper environment. For instance, if the executable targets a specific Windows build, it may fail to run anywhere else. It might also not have access to all the resources needed for execution. The executable might not run if it can't connect to the internet, or to its command-and-control server. It's also possible that the executable has some built-in anti-reverse engineering techniques. In that case, it might wait a few days before executing, for example.

2 https://github.com/mandiant/capa
3 https://www.qemu.org/

The point is, some malicious files may not act malicious during dynamic analysis. This means that even though dynamic analysis is more expensive than static analysis, it does not always result in better malware detection when used in isolation. We will show that judiciously running dynamic analysis on a subset of files will both improve malware detection accuracy and keep analysis costs low.

4. Predicting Capabilities in Executable Files

While most Machine Learning approaches to malware have focused on the detection of maliciousness, there has been less work on the identification of specific behaviors, which is a critical component of malware analysis [67]. We investigated the prediction of CAPA outputs using an 8 sample Bayesian MalConv model, as well as an 8 sample LGBM ensemble trained on the EMBER features.
For prediction targets, we used the 350 CAPA rules from the Mandiant standard collection4, plus a target for failures, for a total of 351 non-exclusive labels for each file. Figure 1 shows detection accuracy for each CAPA rule for both MalConv and the LGBM model. We note that accuracy is high across rules, suggesting that CAPA may not need to be run for all samples when outcomes can be predicted. The LGBM model is consistently more accurate than MalConv across rules, but requires more expensive feature extraction. Figure 2 shows the area under the receiver operating characteristic curve (AUC) for each CAPA rule, for both MalConv and the LGBM model. While LGBM generally does better than MalConv in terms of AUC, MalConv is better in the worst case.

Running CAPA is no small feat. CAPA is more than 2200 times more expensive than MalConv in terms of time, mainly due to the costs associated with disassembly. We can drastically reduce these computational costs by using CAPA exclusively when a prediction model, such as MalConv or LGBM, is highly uncertain. Specifically, CAPA should only deal with files for which the uncertainty is above some threshold. Figure 3 shows the performance of such an approach for various uncertainty thresholds. The results show that we can achieve an average 99.9% of correctly labeled rules while running CAPA on less than half of the data. The thresholds are measured by predictive entropy, and correspond to the proportion of the test data that is sent to CAPA. To simulate real world deployment of such a system, we plot two types of uncertainty thresholds: ideal and actualized. Ideal thresholds are those needed for the 99.9% average across all files; actualized are those chosen using a validation set. The validation and test sets were temporally split into samples first seen in November 2018 and December 2018, respectively. We note that the ideal and actualized uncertainty thresholds are very close, and that compute time spent on CAPA extraction can be cut by more than half. This leads to substantial gains in a deployed production setting, where it is not feasible to run CAPA on all of the data. In especially computationally-constrained settings, it might not be possible to run CAPA at all. In these cases, MalConv alone can achieve a 98% average of correctly labeled CAPA rules. This result suggests that MalConv is capable of learning features that are predictive of capabilities in executables.

5. Deferring to More Expensive Models

Dynamic malware analysis is more expensive than static malware analysis. Human malware analysis is more expensive than automated malware analysis. Ideally, we would like to minimize the analysis cost needed to determine if a file is malicious or benign. In other words, we want to run cheap, automated static analysis first, and only run more expensive analysis (such as dynamic analysis) if the static analysis model is uncertain. Additionally, we want to train the static model to be adaptive to the dynamic model's strengths and weaknesses. Even if uncertain, the static model should guess if the dynamic model will perform worse given the type of data being predicted on.

4 https://github.com/mandiant/capa-rules

[Figure 1: scatter plot of per-rule detection accuracy, MalConv Accuracy versus LGBM Accuracy ("CAPA Prediction Accuracies By Rule").]
Figure 1: Detection accuracy for each CAPA rule for both MalConv and the LGBM model. The LGBM model is consistently more accurate than MalConv, but also more expensive.
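As a rough illustration of this uncertainty-gated use of CAPA, the sketch below (ours, not the authors' released code) assumes a cheap model that exposes a sample_rule_probs method returning per-rule firing probabilities for several posterior samples, and a run_capa callable wrapping the expensive disassembly-based analysis; the reduction of per-rule entropies to a single file-level score (here, their mean) is likewise an assumption.

```python
import numpy as np

def capa_or_cheap_prediction(path, cheap_model, run_capa, threshold):
    """Run the expensive CAPA analysis only when the cheap model is uncertain.

    cheap_model.sample_rule_probs(path) is assumed to return an array of shape
    (T, num_rules) with the probability that each rule fires, for T posterior
    samples (e.g. dropout forward passes). run_capa(path) is assumed to return
    the rule labels produced by CAPA itself.
    """
    probs = cheap_model.sample_rule_probs(path)          # (T, num_rules)
    p_fire = probs.mean(axis=0)                          # posterior predictive per rule
    eps = 1e-12
    # Binary predictive entropy per rule, averaged into one file-level score.
    entropy = -(p_fire * np.log(p_fire + eps)
                + (1.0 - p_fire) * np.log(1.0 - p_fire + eps))
    if entropy.mean() > threshold:
        return run_capa(path)                            # expensive path
    return (p_fire > 0.5).astype(int)                    # cheap predicted labels
```

In deployment, the threshold would be chosen on a validation set so that the desired fraction of files is sent to CAPA, mirroring the actualized thresholds described above.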
Figure 4 shows the performance using Bayesian MalConv as the static model against various predictive entropy thresholds, which represent different total runtimes. Following the naming conventions from Madras et al. [46], "reject" means letting the dynamic model predict when Bayesian MalConv's uncertainty is above a certain threshold; "defer" means letting the dynamic model predict only when Bayesian MalConv's uncertainty is above a certain threshold, and when it guesses that the dynamic model will make the correct prediction. To enable the Bayesian MalConv model to defer, we train it with two equally-weighted prediction tasks: the malware classification task, and the prediction of whether or not the dynamic model will be accurate on the sample. Algorithm 1 describes the training procedure for the Defer approach, Algorithm 2 describes the deployment time prediction procedure for the Reject approach, and Algorithm 3 describes the deployment time prediction procedure for the Defer approach. PyTorch is used to develop all of our code [68], the Adam optimizer with default recommended settings is used for training in all of our experiments [69], and a single NVIDIA Tesla P100 GPU was used. Thresholds are chosen using a validation set. In Figure 4, we plot the actualized uncertainty thresholds that maximize malware detection accuracy. The validation and test sets are split into samples from November 2018 and December 2018 respectively.

We also plot the accuracy achieved by an ensemble model, which averages the predictions of the MalConv and dynamic models together. Unsurprisingly, the ensemble model outperforms both MalConv and the dynamic model used in isolation. However, both MalConv and dynamic analysis need to be run on every single file in order to compute the ensemble predictions, so the cost is significantly higher. We highlight that the rejection model with an actualized threshold achieves a test accuracy roughly equal to that of the ensemble model, while requiring dynamic analysis to be run on only 13.2 percent of the test data. This saves a year's worth of compute time compared to the ensemble model. The defer approach achieves a higher accuracy than that of the ensemble, and only requires dynamic analysis to be run on 17.3 percent of the test data.

[Figure 2: scatter plot of per-rule AUC, MalConv AUC versus LGBM AUC ("CAPA Prediction AUCs By Rule").]
Figure 2: The AUC for each CAPA rule for both MalConv and the LGBM model. While LGBM generally does better than MalConv, MalConv is better in the worst case.

Algorithm 1 Training procedure for the Defer approach.
1: f.train(X_train, Y_train) ◁ Train the dynamic model f (including feature extraction) on the dataset of executables X_train and maliciousness labels Y_train.
2: P_train,f ← f.predict(X_train) ◁ Compute the dynamic model's predictions, P_train,f.
3: I_train ← (P_train,f == Y_train) ◁ Compute a new set of labels, I_train, corresponding to the dynamic model's success on individual files.
4: g.train(X_train, [Y_train, I_train]) ◁ Train the static model g (including feature extraction) on the dataset of executables, X_train, to predict the maliciousness labels, Y_train, and the dynamic model's success, I_train. The loss function used for training is loss = ce(g.predictY(X_train), Y_train) + ce(g.predictI(X_train), I_train), where ce is the Cross Entropy Loss, g.predictY predicts maliciousness, and g.predictI predicts the dynamic model's success.
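As a minimal PyTorch-style sketch of the joint objective on line 4 of Algorithm 1 (our illustration, not the authors' released implementation), the snippet below assumes a MalConv-like feature extractor backbone and attaches two heads, one for maliciousness and one for the dynamic model's success; the class and function names are hypothetical.

```python
import torch
import torch.nn as nn

class DeferStaticModel(nn.Module):
    """Static model g with two heads: maliciousness and dynamic-model success.

    `backbone` is any feature extractor over raw bytes (e.g. a MalConv-style
    network); it is a placeholder here, not the authors' implementation.
    """
    def __init__(self, backbone, feat_dim):
        super().__init__()
        self.backbone = backbone
        self.head_y = nn.Linear(feat_dim, 2)   # benign vs. malicious
        self.head_i = nn.Linear(feat_dim, 2)   # dynamic model wrong vs. right

    def forward(self, x):
        z = self.backbone(x)
        return self.head_y(z), self.head_i(z)

def defer_training_step(model, optimizer, x, y, i_dyn):
    """One step of the loss in Algorithm 1, line 4: equally weighted
    cross-entropy terms for the two prediction tasks.

    y: maliciousness labels; i_dyn: 1 where the (already trained) dynamic
    model classified the file correctly, 0 otherwise.
    """
    ce = nn.CrossEntropyLoss()
    logits_y, logits_i = model(x)
    loss = ce(logits_y, y) + ce(logits_i, i_dyn)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```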
[Figure 3: CAPA rules correctly classified (out of 351) versus the percent of data where CAPA extraction was run, ordered by predictive entropy; curves show MalConv and LGBM means (with standard deviations) and the 99.9% validation and test thresholds for each model.]
Figure 3: Performance with uncertainty-based rejection at various uncertainty thresholds. The x-axis maps the percentage of the data that requires expensive analysis. The far right represents CAPA running on all the data, and the far left only uses cheaper predictive models (MalConv or LGBM).

Algorithm 2 Making predictions in deployment using the Reject approach. In this setting, the static model g is trained to predict only maliciousness.
1: Y_pred ← g.predictY(X_test) ◁ Predict maliciousness using the static model on a deployment time test input X_test.
2: U_pred ← g.uncertaintyY(X_test) ◁ Compute the static model's uncertainty (as measured by predictive entropy) for its maliciousness prediction.
3: if U_pred > threshold then ◁ If the static model's uncertainty is above the desired threshold, make a prediction using the dynamic model.
4:     Y_pred_dynamic ← f.predict(X_test)
5:     return Y_pred_dynamic
6: else ◁ If the static model's uncertainty is below the desired threshold, return the static model's prediction.
7:     return Y_pred
8: end if

Algorithm 3 Making predictions in deployment using the Defer approach.
1: Y_pred ← g.predictY(X_test) ◁ Predict maliciousness using the static model on a deployment time test input X_test.
2: U_pred ← g.uncertaintyY(X_test) ◁ Compute the static model's uncertainty (as measured by predictive entropy) for its maliciousness prediction.
3: I_pred ← g.predictI(X_test) ◁ Predict the dynamic model's success using the static model.
4: if U_pred > threshold and I_pred == 1 then ◁ If the static model's uncertainty is above the desired threshold and the static model predicts that the dynamic model will be successful, make a prediction using the dynamic model.
5:     Y_pred_dynamic ← f.predict(X_test)
6:     return Y_pred_dynamic
7: else ◁ Else, return the static model's prediction.
8:     return Y_pred
9: end if

Even if the uncertainty threshold is set too high, accuracy with the deferral approach never dips below that of the MalConv model. This confirms that MalConv is able to accurately predict when the dynamic model will make a mistake. This, along with MalConv's ability to accurately predict CAPA outputs, suggests that MalConv is capable of uncovering useful features in binary data. Finally, in Figure 5, we show a similar analysis, but instead of the dynamic model as the expensive one, we use an LGBM model trained on EMBER features. In this case, running the more expensive model on all of the data (the ensemble method) offers the best accuracy, but is also the most expensive method by far. Both the reject and defer methods show the desired behavior of improving accuracy while limiting compute costs.

[Figure 4: Malware Detection Accuracy versus Total Runtime (seconds, ×10^7) for the Reject, Defer, MalConv, Dynamic Model, and Ensemble approaches, with the Reject and Defer validation thresholds marked.]
Figure 4: Performance with Bayesian MalConv as the cheap model that runs on all files, followed by a more expensive dynamic model. Both "Reject" and "Defer" require MalConv's uncertainty to be above a certain threshold before turning to the dynamic model, but "Defer" also needs MalConv to predict that the dynamic model is likely to succeed.
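For concreteness, the deployment-time logic of Algorithms 2 and 3 can be rendered in a few lines of Python; the method names below (predict_y, uncertainty_y, predict_i, predict) are assumed stand-ins for g.predictY, g.uncertaintyY, g.predictI, and f.predict.

```python
def predict_reject(x, static_model, dynamic_model, threshold):
    """Algorithm 2: fall back to the dynamic model whenever the static model's
    predictive entropy exceeds the chosen threshold."""
    y_pred = static_model.predict_y(x)        # cheap maliciousness prediction
    u_pred = static_model.uncertainty_y(x)    # predictive entropy of that prediction
    if u_pred > threshold:
        return dynamic_model.predict(x)       # expensive dynamic-analysis path
    return y_pred


def predict_defer(x, static_model, dynamic_model, threshold):
    """Algorithm 3: additionally require the static model to predict that the
    dynamic model will classify this file correctly before paying for it."""
    y_pred = static_model.predict_y(x)
    u_pred = static_model.uncertainty_y(x)
    i_pred = static_model.predict_i(x)        # predicted success of the dynamic model
    if u_pred > threshold and i_pred == 1:
        return dynamic_model.predict(x)
    return y_pred
```

The only difference between the two is the extra i_pred check, which is what lets the static model decline to pay for dynamic analysis when it expects the dynamic model to be wrong.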
[Figure 5: Malware Detection Accuracy versus Total Runtime (seconds, ×10^4) for the Reject, Defer, MalConv, LGBM, and Ensemble approaches, with the Reject and Defer validation thresholds marked.]
Figure 5: The performance with Bayesian MalConv as the cheap model, followed by LGBM on EMBER features as the expensive model, for various predictive entropy thresholds.

6. Conclusions

We have shown that cheaper static analysis features can be used to predict the outcomes of more expensive disassembly based analysis. We have also developed a dynamic analysis model that is complementary to static analysis models. In both cases, we showed how uncertainty can be used to minimize compute costs. Future work should explore chaining together a larger number of algorithms and approaches, examine more refined measures of cost, and assess applicability in other areas of cyber security.

References

[1] R. Anderson, C. Barton, R. Boehme, R. Clayton, M. J. G. Van Eeten, M. Levi, T. Moore, S. Savage, Measuring the Changing Cost of Cybercrime, Workshop on the Economics of Information Security (WEIS) (2019). URL: https://weis2019.econinfosec.org/wp-content/uploads/sites/6/2019/05/WEIS_2019_paper_25.pdf.
[2] P. Hyman, Cybercrime: it's serious, but exactly how serious?, Communications of the ACM 56 (2013) 18–20. doi:10.1145/2428556.2428563.
[3] B. M. Eddy, N. Perlroth, Cyber Attack Suspected in German Woman's Death, The New York Times (2021).
[4] E. C. Spafford, Is Anti-virus Really Dead?, Computers & Security 44 (2014) iv. doi:10.1016/S0167-4048(14)00082-0.
[5] B. Li, K. Roundy, C. Gates, Y. Vorobeychik, Large-Scale Identification of Malicious Singleton Files, in: 7th ACM Conference on Data and Application Security and Privacy, 2017.
[6] E. Raff, C. Nicholas, A Survey of Machine Learning Methods and Challenges for Windows Malware Classification, arXiv preprint arXiv:2006.09271 (2020) 1–48. URL: http://arxiv.org/abs/2006.09271.
[7] C. Wressnegger, K. Freeman, F. Yamaguchi, K. Rieck, Automatically inferring malware signatures for anti-virus assisted attacks, ASIA CCS 2017 - Proceedings of the 2017 ACM Asia Conference on Computer and Communications Security (2017) 587–598. doi:10.1145/3052973.3053002.
[8] M. Egele, T. Scholte, E. Kirda, C. Kruegel, A survey on automated dynamic malware-analysis techniques and tools, ACM Computing Surveys 44 (2012). doi:10.1145/2089125.2089126.
[9] C. Rossow, C. J. Dietrich, C. Grier, C. Kreibich, V. Paxson, N. Pohlmann, H. Bos, M. Van Steen, Prudent practices for designing malware experiments: Status quo and outlook, Proceedings - IEEE Symposium on Security and Privacy (2012) 65–79. doi:10.1109/SP.2012.14.
[10] M. Backes, M. Nauman, LUNA: Quantifying and Leveraging Uncertainty in Android Malware Analysis through Bayesian Machine Learning, Proceedings - 2nd IEEE European Symposium on Security and Privacy, EuroS and P 2017 (2017) 204–217. doi:10.1109/EuroSP.2017.24.
[11] A. T. Nguyen, E. Raff, C. Nicholas, J. Holt, Leveraging Uncertainty for Improved Static Malware Detection Under Extreme False Positive Constraints, IJCAI-21 1st International Workshop on Adaptive Cyber Defense (2021). URL: http://arxiv.org/abs/2108.04081.
[12] A. T. Nguyen, F. Lu, G. Lopez Munoz, E. Raff, C. Nicholas, J. Holt, Out of Distribution Data Detection Using Dropout Bayesian Neural Networks, in: AAAI, 2022. URL: www.aaai.org.
[13] E. Raff, J. Barker, J. Sylvester, R. Brandon, B. Catanzaro, C. Nicholas, Malware Detection by Eating a Whole EXE, Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence (2017). URL: http://arxiv.org/abs/1710.09435.
[14] E. Raff, W. Fleshman, R. Zak, H. S. Anderson, B. Filar, M. Mclean, Classifying Sequences of Extreme Length with Constant Memory Applied to Malware Detection, in: AAAI, 2021.
[15] J. M. Bernardo, A. F. Smith, Bayesian Theory, 2008. doi:10.1002/9780470316870.
[16] H. Liu, L. Wasserman, Bayesian Inference, in: Statistical Machine Learning, In Preparation, 2014, pp. 299–351.
[17] J. A. Hoeting, D. Madigan, A. E. Raftery, C. T. Volinsky, Bayesian Model Averaging: A Tutorial, Statistical Science 14 (1999) 382–417.
[18] D. J. C. MacKay, A Practical Bayesian Framework for Backpropagation Networks, Neural Computation 4 (1992) 448–472.
[19] E. Daxberger, A. Kristiadi, A. Immer, R. Eschenhagen, M. Bauer, P. Hennig, Laplace Redux – Effortless Bayesian Deep Learning (2021). URL: http://arxiv.org/abs/2106.14806.
[20] H. Ritter, A. Botev, D. Barber, A Scalable Laplace Approximation for Neural Networks, in: ICLR, 2018.
[21] A. Kristiadi, M. Hein, P. Hennig, Learnable Uncertainty under Laplace Approximations (2020). URL: http://arxiv.org/abs/2010.02720.
[22] R. M. Neal, Bayesian Learning for Neural Networks, Ph.D. thesis, University of Toronto, 1995.
[23] M. Betancourt, A Conceptual Introduction to Hamiltonian Monte Carlo (2017). URL: http://arxiv.org/abs/1701.02434.
[24] M. D. Hoffman, A. Gelman, The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo, Journal of Machine Learning Research 15 (2014) 1593–1623.
[25] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, L. K. Saul, Introduction to variational methods for graphical models, Machine Learning 37 (1999) 183–233. doi:10.1023/A:1007665907178.
[26] D. M. Blei, A. Kucukelbir, J. D. McAuliffe, Variational Inference: A Review for Statisticians, Journal of the American Statistical Association 112 (2017) 859–877. doi:10.1080/01621459.2017.1285773.
[27] M. D. Hoffman, D. M. Blei, C. Wang, J. Paisley, Stochastic variational inference, Journal of Machine Learning Research 14 (2013) 1303–1347.
[28] A. Kucukelbir, A. Gelman, Automatic Differentiation Variational Inference 18 (2017) 1–45.
[29] R. Ranganath, S. Gerrish, D. M. Blei, Black box variational inference, Journal of Machine Learning Research 33 (2014) 814–822.
[30] M. K. Titsias, M. Lázaro-Gredilla, Doubly stochastic variational bayes for non-conjugate inference, 31st International Conference on Machine Learning, ICML 2014 5 (2014) 4056–4069.
[31] D. Duvenaud, R. P. Adams, Black-Box Stochastic Variational Inference in Five Lines of Python (2014) 1–4.
[32] G. E. Hinton, D. van Camp, Keeping neural networks simple by minimizing the description length of the weights, COLT (1993) 5–13. doi:10.1145/168304.168306.
[33] A. Graves, Practical variational inference for neural networks, Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011, NIPS 2011 (2011) 1–9.
[34] C. Blundell, J. Cornebise, K. Kavukcuoglu, D. Wierstra, Weight uncertainty in neural networks, 32nd International Conference on Machine Learning, ICML 2015 2 (2015) 1613–1622.
[35] Y. Gal, Z. Ghahramani, Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, 33rd International Conference on Machine Learning, ICML 2016 3 (2016) 1651–1660.
[36] Y. Gal, Uncertainty in Deep Learning, Ph.D. thesis, University of Cambridge, 2016.
[37] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R. R. Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors, arXiv preprint arXiv:1207.0580 (2012) 1–18. URL: http://arxiv.org/abs/1207.0580.
[38] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Journal of Machine Learning Research 15 (2014) 1929–1958.
[39] A. C. Damianou, N. D. Lawrence, Deep Gaussian Processes, Artificial Intelligence and Statistics, PMLR 31 (2013) 207–215.
[40] L. Smith, Y. Gal, Understanding measures of uncertainty for adversarial example detection, 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018 2 (2018) 560–569.
[41] C. Cortes, G. DeSalvo, M. Mohri, Learning with rejection, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9925 LNAI (2016) 67–82. doi:10.1007/978-3-319-46379-7_5.
[42] C. Chow, An Optimum Character Recognition System Using Decision Functions, IRE Transactions on Electronic Computers EC-7 (1958) 180. doi:10.1109/TEC.1958.5222530.
[43] C. K. Chow, On optimum recognition error and reject tradeoff, 1969.
[44] J. Attenberg, P. Ipeirotis, F. Provost, Beat the Machine: Challenging Workers to Find the Unknown Unknowns, AAAI (2011).
[45] X. Wang, Y. Luo, D. Crankshaw, A. Tumanov, F. Yu, J. E. Gonzalez, IDK Cascades: Fast deep learning by learning not to Overthink, 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018 2 (2018) 580–590.
[46] D. Madras, T. Pitassi, R. Zemel, Predict responsibly: Improving fairness and accuracy by learning to defer, Advances in Neural Information Processing Systems (2018) 6147–6157.
[47] H. Mozannar, D. Sontag, Consistent Estimators for Learning to Defer to an Expert (2020). URL: http://arxiv.org/abs/2006.01862.
[48] B. Wilder, E. Horvitz, E. Kamar, Learning to Complement Humans (2020) 1526–1533. doi:10.24963/ijcai.2020/212.
[49] T. Gao, D. Koller, Active classification based on value of classifier, Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011, NIPS 2011 (2011) 1–9.
[50] M. Saar-Tsechansky, P. Melville, F. Provost, Active Feature-Value Acquisition (????) 1–32.
[51] S. Chen, A. Choi, A. Darwiche, Value of information based on decision robustness, Proceedings of the National Conference on Artificial Intelligence 5 (2015) 3503–3510.
[52] A. Noriega-Campero, B. Garcia-Bulle, M. A. Bakker, A. S. Pentland, Active fairness in algorithmic decision making, AIES 2019 - Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (2019) 77–83. doi:10.1145/3306618.3314277.
[53] Z. Xu, M. J. Kusner, K. Q. Weinberger, M. Chen, Cost-sensitive tree of classifiers, 30th International Conference on Machine Learning, ICML 2013 28 (2013) 133–141.
[54] Z. E. Xu, M. J. Kusner, K. Q. Weinberger, M. Chen, O. Chapelle, Classifier cascades and trees for minimizing feature evaluation cost, Journal of Machine Learning Research 15 (2014) 2113–2144.
[55] M. Des Jardins, J. MacGlashan, K. L. Wagstaff, Confidence-based feature acquisition to minimize training and test costs, Proceedings of the 10th SIAM International Conference on Data Mining, SDM 2010 (2010) 514–524. doi:10.1137/1.9781611972801.45.
[56] Z. Xu, K. Q. Weinberger, O. Chapelle, The Greedy Miser: Learning under test-time budgets, Proceedings of the 29th International Conference on Machine Learning, ICML 2012 2 (2012) 1175–1182.
[57] H. Shim, S. J. Hwang, E. Yang, Joint active feature acquisition and classification with variable-size set encoding, Advances in Neural Information Processing Systems (2018) 1368–1378.
[58] T. Rückstieß, C. Osendorfer, P. Van Der Smagt, Sequential feature selection for classification, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7106 LNAI (2011) 132–141. doi:10.1007/978-3-642-25832-9_14.
[59] G. Dulac-Arnold, L. Denoyer, P. Preux, P. Gallinari, Datum-Wise classification: A sequential approach to sparsity, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6911 LNAI (2011) 375–390. doi:10.1007/978-3-642-23780-5_34.
[60] G. A. Feltham, The value of information, The Accounting Review (1968).
[61] J. McCarthy, Measures of the Value of Information, Proceedings of the National Academy of Sciences 42 (1956) 654–655. doi:10.1073/pnas.42.9.654.
[62] T. E. Behrens, M. W. Woolrich, M. E. Walton, M. F. Rushworth, Learning the value of information in an uncertain world, Nature Neuroscience 10 (2007) 1214–1221. doi:10.1038/nn1954.
[63] H. S. Anderson, P. Roth, EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models, arXiv preprint arXiv:1804.04637 (2018). URL: http://arxiv.org/abs/1804.04637.
[64] M. Sebastián, R. Rivera, P. Kotzias, J. Caballero, AVCLASS: A Tool for Massive Malware Labeling, in: International Symposium on Research in Attacks, Intrusions, and Defenses, 2016. URL: https://github.com/malicialab/avclass.
[65] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, LightGBM: A Highly Efficient Gradient Boosting Decision Tree, Technical Report. URL: https://github.com/Microsoft/LightGBM.
[66] R. Harang, E. M. Rudd, SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection (2020). URL: http://arxiv.org/abs/2012.07634.
[67] M. R. Smith, A. J. Carbajal, C. C. Lamb, N. T. Johnson, R. Ramyaa, S. J. Verzi, J. B. Ingram, E. Domschot, W. Philip Kegelmeyer, Mind the Gap: On Bridging the Semantic Gap between Machine Learning and Malware Analysis, arXiv (2020) 49–60.
[68] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, Advances in Neural Information Processing Systems 32 (2019).
[69] D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, International Conference for Learning Representations (2015) 1–15. URL: http://arxiv.org/abs/1412.6980.