Static Malware Detection & Subterfuge: Quantifying the Robustness of Machine Learning and Current Anti-Virus

William Fleshman1, Edward Raff1,2,3, Richard Zak1,2, Mark McLean1, Charles Nicholas3
1 Laboratory for Physical Sciences, 2 Booz Allen Hamilton, 3 University of Maryland, Baltimore County
{william.fleshman,edraff,rzak,mrmclea}@lps.umd.edu, nicholas@umbc.edu

Copyright © by the paper's authors. Copying permitted for private and academic purposes. In: Joseph Collins, Prithviraj Dasgupta, Ranjeev Mittu (eds.): Proceedings of the AAAI Fall 2018 Symposium on Adversary-Aware Learning Techniques and Trends in Cybersecurity, Arlington, VA, USA, 18-19 October, 2018, published at http://ceur-ws.org

Abstract

As machine-learning (ML) based systems for malware detection become more prevalent, it becomes necessary to quantify the benefits they offer compared to the more traditional anti-virus (AV) systems widely used today. It is not practical to build an agreed-upon test set for benchmarking malware detection systems on pure classification performance. Instead, we tackle the problem with a new testing methodology, in which we evaluate the change in performance on a set of known benign and malicious files as adversarial modifications are applied. The change in performance, combined with the evasion technique used, quantifies a system's robustness against that approach. Through these experiments we show, in a quantifiable way, how purely ML based systems can be more robust than AV products at detecting malware that attempts evasion through modification, but may be slower to adapt in the face of significantly novel attacks.

1 Introduction

The threat and impact of malicious software (malware) has continued to grow every year. The annual financial impact is already measured in the hundreds of billions of dollars (Hyman 2013; Anderson et al. 2013). Simultaneously, there are worries that the classical anti-virus approach may not continue to scale and will fail to recognize new threats (Spafford 2014).

Anti-virus systems were historically built around signature-based approaches. Wressnegger et al. (2016) discussed a number of issues with signatures, but the primary shortcoming is their intrinsically static nature and inability to generalize. Most anti-virus companies have likely incorporated machine learning into their software, but to what extent remains unclear due to the proprietary nature of these products. Regardless, adversaries can still successfully evade current detection systems with minimal effort. The success of methods like obfuscation and polymorphism is evident from their prevalence, and recent work has suggested that the majority of unique malicious files are due to the use of polymorphic malware (Li et al. 2017).

Given these issues and their urgency, machine learning would appear to be a potential solution to the problem of detecting new malware. Malware detection can be directly phrased as a classification problem: given a binary, we define some feature set and learn a binary classifier that outputs either benign or malicious. The output of this model can be calibrated to reach any desired false-positive ratio.

However, one should never switch to a new technology for its own sake. It is necessary to have empirical and quantifiable reasons to adopt a new approach for malware detection. Ideally, this would be based on the accuracy of current anti-virus systems compared to their machine-learning counterparts. In reality, this is non-trivial to estimate, and many arguments currently exist as to how it should be done (Jordaney et al. 2016; Deo et al. 2016; Sommer and Paxson 2010). Designing a corpus and test protocol to determine accuracy at large is hampered by issues like concept drift (Jordaney et al. 2017), label uncertainty, cost (Miller et al. 2016), and correlations over time (Kantchelian et al. 2013; Singh, Walenstein, and Lakhotia 2012).

Toward remedying this issue, we propose a new testing methodology that can highlight a detector's strengths and weaknesses with regard to specific attack types. We first utilize the framework of Biggio, Fumera, and Roli (2014) to specify a model of our adversary. After defining the adversary's goal, knowledge, and capabilities, we look at how difficult it is for an adversary with known malware to evade the malware detection systems under comparison. As examples, we compare two machine learning approaches, a simpler n-gram model and a neural network, to a quartet of production AV systems. While we do not exhaustively test all possible evasion techniques, we show the potential of our protocol using a handful of relevant tests: non-destructive automated binary modification, destructive adversarial attacks, and malicious injection. We specifically consider only black-box attacks that can be applied to any potential malware detector, assuming the adversary is looking for maximal dispersion rather than a targeted attack. This new evaluation protocol is our primary contribution, and a black-box adversarial attack for byte-based malware detectors is our second contribution. The evaluation protocol allows us to draw new conclusions about the potential benefits and weaknesses of two byte-based machine learning malware detectors.

We review previous work in section 2. The experiments and methodology used to compare our ML based systems to anti-virus are given in section 3 and section 4, followed by their results in section 5; we conclude in section 6.
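The calibration point made in the introduction, tuning a detector's score threshold to hit a desired false-positive ratio, can be made concrete with a small sketch. This is our illustration, not code from the paper; the scores and the target rate are hypothetical.

```python
# Sketch (ours, not from the paper) of threshold calibration: pick the
# smallest threshold such that scoring strictly above it flags at most
# a target fraction of a held-out set of benign scores.

def threshold_for_fpr(benign_scores, target_fpr):
    """Return a threshold t so that the fraction of benign scores > t
    is at most target_fpr (fewer if there are ties)."""
    ordered = sorted(benign_scores)
    n = len(ordered)
    allowed = int(target_fpr * n)      # benign files we may tolerate flagging
    if allowed >= n:
        return ordered[0] - 1.0        # everything may be flagged
    return ordered[n - allowed - 1]    # only scores above this are flagged
```

With 100 benign scores and a 5% target, the returned threshold leaves exactly the top five scores flagged (fewer if scores tie at the threshold).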
2 Related Work

Malware detection and analysis has been an area of active research for several years. One class of techniques of particular note is dynamic analysis, where the binary itself is run so that its behavior can be observed. Intuitively, malicious behavior is the best indicator of a malicious binary, making dynamic analysis a popular approach. However, dynamic analysis has many complications. It requires significant effort and infrastructure to accurately reflect user environments, requirements which are often not met in practice (Rossow et al. 2012). Furthermore, the malware author can use a number of techniques to detect dynamic execution, hide behavior, or otherwise obfuscate such analysis. This means effective dynamic analysis systems are often a cat-and-mouse game of technical issues, and can require running binaries across many layers of emulation (Egele et al. 2017).

It is for these reasons that we focus on the static analysis case. This removes the unlimited degrees of freedom provided by actually running code. However, it does not mean the malicious author has no recourse. Below we review the related work in this area.

Questions regarding the security of machine learning based systems have been an active area of research for over a decade (Barreno et al. 2010), but have recently received increased attention. This is in particular due to the success in generating adversarial inputs to neural networks: inputs that induce an incorrect classification despite only minor changes to the input (Szegedy et al. 2014). These approaches generally work by taking the gradient of the network with respect to the input, and adjusting the input until the network's output is altered. This does not directly translate to malware detection on raw bytes, as bytes are discontinuous in nature. That is to say, any change in a single byte's value is an equally "large" change, whereas in images, adjusting a pixel value can result in visually imperceptible changes. Arbitrary byte changes may also produce a binary that does not execute, making it more difficult to apply these attacks in this space (Grosse et al. 2016; Russu et al. 2016). While such attacks on individual models are possible (Kolosnjaji et al. 2018; Kreuk et al. 2018), we are concerned with attacks that can be applied to any detector; those methods are therefore out of the scope of this work.

To overcome the issue of arbitrary byte changes breaking executables, Anderson, Filar, and Roth (2017) developed a set of benign actions that can modify a binary without changing its functionality. Selecting these actions at random allowed malware to evade a ML model 13% of the time; introducing a reinforcement learning agent to select the actions increased this to 16%. Their ML model used features from the PE header as well as statistics on strings and bytes from the whole binary. We replicate this type of attack as one of the comparisons between our ML systems and anti-virus.

Another approach is to modify the bytes of a binary directly, and run the malware detection system after each modification. While this risks breaking the binary, it can still be effective, and it is important to quarantine malware even if it does not execute as intended. Wressnegger et al. (2016) modified each binary one byte at a time, and found that single-byte changes could evade classical AV detectors. They used this approach to find which bytes were important to the classification, and reverse-engineered the signatures used by multiple AV products. We extend this technique in subsection 4.2 to create a new adversarial approach against our models, which finds a contiguous byte region that can be altered to evade detection.

For the machine learning based detection methods, we look at two approaches: byte n-grams and neural networks. The byte n-gram approach was one of the first machine learning methods proposed for malware detection (Schultz et al. 2001), and has received significant attention (Kolter and Maloof 2006). We use the same approach for training such models as presented in Raff et al. (2016). For the neural network approach we use the recently proposed MalConv, which processes an entire binary by using an embedding layer to map bytes to feature vectors, followed by a convolutional network (Raff et al. 2017). Because the embedding layer used by MalConv is non-differentiable with respect to the input bytes, the aforementioned adversarial approaches from image classification are not easily applicable. Similarly, anti-virus products do not provide a derivative for such attacks. This is why we develop a new black-box adversarial approach in subsection 4.2.

Using these modeling methods is beneficial for our analysis because they are trained with minimal assumptions about the underlying data. Other approaches to malware detection that process the PE header (Shafiq et al. 2009) or use disassembly (Moskovitch et al. 2008) exist. We avoid these under the belief that their use of explicit feature sets aids in evading them (e.g., modifying the PE header is easy and allows for direct evasion of models using those features).

We leverage the existing framework of (Biggio, Fumera, and Roli 2014) to model our adversary for each comparison made. This framework prescribes defining the adversary's goals, knowledge, and capabilities with regard to the classifiers.

3 Adversarial Model

For each experiment we specify the model of our adversary by outlining the goals, knowledge, and capabilities available (Biggio, Fumera, and Roli 2014). These make clear the assumptions and use cases presented by each scenario.

The goal of our adversary is the same across all experiments: using a single attack to maximize the misclassification rate of malicious files across many classifiers. The one caveat is that in subsection 4.2 we only go as far as identifying a small portion of the file which can be altered to evade detection; a real attacker would then have to alter those bytes while retaining the malicious functionality of the file. It is unlikely that authors of benign software would seek a malicious classification, so we limit ourselves to experiments on malware only.

We severely constrain the knowledge of our adversary. In all cases we assume the adversary has no knowledge of the classifiers' algorithms, parameters, training data, features, or decision boundaries. This is the best-case scenario for security applications, as knowledge of any of these attributes would increase the sophistication of possible attacks.

Similarly, the experiments outlined in subsection 4.1 and subsection 4.3 require no prior interaction with the classifiers before deploying a malicious file to the wild. In subsection 4.2, the adversary has the capability of querying our n-gram model only. The inner workings of the model remain hidden, but the output of the model is available for deductive reasoning.

These assumptions are restrictive to the adversary's actions and knowledge, but they are important because current adversaries are able to succeed under such restrictions today. This means malware authors can obtain success with minimal effort or expense. Through our evaluation procedure we are able to better understand which techniques and products are most robust to this restricted scenario, and thus will increase the effort that must be expended by malware authors.
4 Experiments and Methodology

We now perform experiments under our proposed testing methodology to compare machine learning models to commercial anti-virus systems. To do so, we compare two machine learning classifiers and four commercial anti-virus products: AV1, AV2, AV3, and AV4. We use these anonymous names to be compliant with their End User License Agreements (EULAs), which bar us from naming the products in any comparative publication. We are prohibited from using internet-based mass scanners, like VirusTotal, for similar reasons. Ideally we could test many products, but with our large corpus of files we limit ourselves to these four, which we believe to be representative of the spectrum of commercially used AV platforms. We purposely did not choose any security products which advertise themselves as being primarily powered by machine learning or artificial intelligence. Additionally, any cloud-based features which could upload our files were disabled, to protect the proprietary nature of our dataset. Our machine learning based approaches are built upon n-grams and neural networks which process a file's raw bytes. We compare these systems on a number of tasks to evaluate their robustness to evasion.

We now describe our new testing protocol. Given a corpus {x1, ..., xn} of n files with known labels yi ∈ {Benign, Malicious} and a detector C(·), we first measure the accuracy of C(xi), ∀i. This gives us our baseline performance. We then use a suite of evasion techniques φ1(·), ..., φK(·) and look at the difference in accuracy between C(xi) = yi and C(φ(xi)) = yi. The difference in these scores tells us the ability of the detector C(·) to generalize past its known data and catch evasive modifications. Any approach for which C(xi) ≠ C(φj(xi)) for many different evasion methods φ which preserve the label yi is an intrinsically brittle detector, even if it obtains perfect accuracy before φ is applied. The raw detection accuracy is not the point of this test.

Each system will have been built using different and distinct corpora of benign and malicious files. For the AV products, these corpora are used to ensure that certain target levels of false positives and false negatives are met. In addition, the training corpus we use for our models is orders of magnitude smaller than what the AV companies will have access to. For this reason we would not necessarily expect our approaches to have better accuracy. The goal is instead to quantify the robustness of any classifier to a specific threat model and attack.

One novel and three pre-existing techniques (the "φ"s) will be used to perform this evaluation. These are not exhaustive, but show how our approach can be used to quantify relative performance differences. Below we review their methodology and the information we can glean from their results. In section 5 we review the results of running these experiments.

Dataset

We take only a brief moment to expound upon the machine learning models and data used for training, as the information disparity with the AV products makes direct comparison difficult. We trained the machine learning models on a large industry dataset discussed in (Raff and Nicholas 2017), which contains approximately 2 million samples balanced almost evenly between benign and malicious. Our test set consists of approximately 80,000 files held out for our post-training experimentation, and is also almost perfectly balanced. This set was not seen by our machine learning models during training, but could contain files previously seen by the anti-virus corporations. The n-gram model follows the procedure used in (Raff et al. 2016), but uses one million features instead of 100,000. The MalConv approach is specified in (Raff et al. 2017). Both of these approaches require no domain knowledge to apply.

4.1 Randomly Choosing Benign Modifications

In this experiment, we use a subset1 of the modifications used by Anderson, Filar, and Roth (2017) to alter malicious files before scanning them with the classifiers. The objective is to test the brittleness of the classifiers' decision mechanisms by making small changes that do not alter the functionality of the malware. This experiment is ideal because it produces a new file that is still functional, and so should have no impact on an ideal malware detector.

There are nine different modification actions; grouped by type they are:

• rename or create new sections
• append bytes to the end of a section or the file
• add an unused function to the import table
• create a new entry point (which jumps to the old entry)
• modify the header checksum, the signature, or debug info

1 A UPX (un)packing option did not operate correctly.
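The modify-and-rescan loop built on these actions can be sketched as follows. This is our reconstruction, not the authors' tooling: only the append-bytes action is implemented, since appending data past a PE's final section generally does not change its behavior; the remaining actions would need a PE-rewriting library, and `detector` stands in for any real classifier.

```python
# Hypothetical sketch of the random-modification experiment of
# subsection 4.1. `detector` returns True when it labels the file
# malicious; only one of the nine actions is implemented here.
import random

def append_overlay(data: bytes, rng: random.Random) -> bytes:
    # "append bytes to the end of the file" from the action list above
    return data + bytes(rng.randrange(256) for _ in range(rng.randrange(1, 513)))

MODIFICATIONS = [append_overlay]  # the full set would hold all nine actions

def try_to_evade(data: bytes, detector, max_rounds: int = 10, seed: int = 0):
    """Apply random functionality-preserving edits until the detector
    flips to benign, or give up after max_rounds."""
    rng = random.Random(seed)
    for rounds in range(1, max_rounds + 1):
        data = rng.choice(MODIFICATIONS)(data, rng)
        if not detector(data):
            return data, rounds   # evaded after this many modifications
    return data, None             # still detected after max_rounds
```

A robust detector should leave the second return value as None for essentially every malicious input.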
Each malicious file was scanned before the experiment to see if it was already being misclassified as benign (a false negative). If a file was correctly classified as malicious, a modification was randomly selected from the set and applied. This process was repeated up to ten times, or until the file evaded the classifier. Approaches that rely too heavily on exact signatures or patterns should suffer in this test, and we would expect any approach that uses purely dynamic features to be unaffected by the modifications.

This experiment is time intensive, and so we use a subset of 1,000 files randomly chosen from our test set. Anderson, Filar, and Roth limited themselves to a small 200-file sample, so our larger test set may provide additional confidence in the results. The original paper proposed using a reinforcement learning algorithm (RLA) to select the modifications in an adversarial manner. We found that including a RLA did not change the results, but did limit our ability to test all products in a timely manner.
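Stepping back, every experiment in this section reports the same quantity: the gap between the accuracy of C(xi) and that of C(φ(xi)). A minimal sketch of that measurement (ours; `detector` and `phi` are placeholders for any real detector and evasion technique):

```python
# Minimal sketch of the paper's evaluation protocol: score a detector on
# a labeled corpus before and after an evasion technique phi is applied
# to each file, and report the accuracy drop as the robustness measure.

def robustness_drop(detector, corpus, phi):
    """corpus: iterable of (file_bytes, label) pairs;
    detector: maps bytes to a label; phi: maps bytes to modified bytes."""
    files = list(corpus)
    before = sum(detector(x) == y for x, y in files) / len(files)
    after = sum(detector(phi(x)) == y for x, y in files) / len(files)
    return before, after, before - after
```

A large drop for a label-preserving φ marks the detector as brittle, regardless of how high its baseline accuracy is.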
4.2 Targeted Occlusion of Important Bytes

The first approach described requires extensive domain-knowledge-based tooling, and would be difficult to adapt to new file types. We introduce a novel approach to find important byte regions of a malicious file given a working detector. Identifying the most important region of a file gives feedback to adversaries, and allows us to occlude that region as an evasion technique.

Our approach works by occluding certain bytes of a file, and then running the malware detector on the occluded file. By finding a contiguous region of bytes that reduces the detector's confidence that a file is malicious, we can infer that the bytes in that region are important to the detector. If occluding a small region causes the detector to change its vote from malicious to benign, then we can infer that the detector is too fragile. In this case it is plausible for a malware author to determine what is stored in the small region detected, and then modify the binary to evade detection.

Manually editing groups of bytes one at a time would be computationally intractable: given a file F of |F| bytes and a desired contiguous region size of β, it would take O(|F|) calls to the detector to find the most important region. We instead develop a binary-search style procedure, outlined in Algorithm 1. Here C(·) returns the malware detector's confidence that a given file is malicious, and D is a source of bytes to use for the occlusion of the original bytes. This approach finds an approximate region of importance in only O(log |F|) calls to the detector C(·). In the event that C(Fl) = C(Fr), ties can be broken arbitrarily, which implicitly covers the boolean decisions returned by an anti-virus product.

Algorithm 1 Occlusion Binary Search
Require: A file F of length |F|, a classifier C(·), target occlusion size β, byte replacement distribution D
 1: split ← |F|/2, size ← |F|/2
 2: start ← 0, end ← |F|
 3: while size > β do
 4:   Fl ← F, Fr ← F
 5:   Fl[split−size:split] ← contiguous sample ∼ D
 6:   Fr[split:split+size] ← contiguous sample ∼ D
 7:   if C(Fl) < C(Fr) then
 8:     split ← split − size/2
 9:     start ← split − size, end ← split
10:   else
11:     split ← split + size/2
12:     start ← split, end ← split + size
13:   size ← size/2
14: return start, end

This method starts with a window size equal to half of the file size. Both halves are occluded in turn, and each occluded file is analyzed by the classifier. The half whose occlusion results in the larger drop in classification confidence is chosen as the region to split at the next step. This binary search through the file continues until the window size is at least as small as the target window size.

The last component of our approach is to specify what method D should be used to replace the bytes of the original file. One approach, which we call Random Occlusion, is to simply select each replacement byte at random. This is similar to prior work on finding the important regions of an image according to an object detector, where randomly blocking out pixels was found to be effective even when the replacement values were nonsensical (Zeiler and Fergus 2014). We also look at Adversarial Occlusion, where we use a contiguous region selected randomly from one of our own benign training files to occlude the bytes in the file under test. Since we use this approach only with malicious files, Adversarial Occlusion is an especially challenging test for our machine learning based methods: they have seen these byte regions before with an explicit label of benign. This test also helps to validate that the approaches are not simply detecting high-entropy regions that "look" packed and then defaulting to a decision of malicious.

To show that our search procedure provides meaningful improvements for the Random and Adversarial Occlusions, we also compare against an Undirected Occlusion approach that eschews the search procedure. In this case we randomly select a region of the binary to occlude with high-entropy random bytes.

Recall that this method identifies the critical region; for a successful attack the adversary would then have to manually alter the portion of the file at that location while maintaining their intended outcome. For this reason, we use a single target window size of 2048 bytes. The resulting occluded area will fall in the range of (1024, 2048] bytes. On average, this restricts the occluded section to under 0.5% of the file size for our testing set. We also limit ourselves to searching for critical regions with our n-gram model. It would be infeasible for an adversary to conduct this search across all possible classifiers, and we would expect there to be some overlap among which regions are critical. This has been shown to be true for gradient-based classifiers, where an adversarial example generated by one model can fool many others (Papernot, McDaniel, and Goodfellow 2016).
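For concreteness, a runnable rendering of the search is given below. This is our sketch, not the authors' code: it phrases the same O(log |F|) idea as repeatedly halving a candidate region and keeping whichever half's occlusion most reduces the malicious score, with D taken to be uniform random bytes.

```python
# Sketch (ours) of the occlusion binary search: occlude each half of the
# current candidate region and recurse into the half whose occlusion
# lowers `confidence` (an estimate of P(malicious)) the most.
import random

def occlusion_search(data: bytes, confidence, beta: int, seed: int = 0):
    """Return (start, end) of an approximately most-important region,
    with end - start in (beta/2, beta], using O(log len(data)) calls."""
    rng = random.Random(seed)
    start, end = 0, len(data)

    def occluded(lo, hi):
        # replace data[lo:hi] with a sample from D (uniform bytes here)
        filler = bytes(rng.randrange(256) for _ in range(hi - lo))
        return data[:lo] + filler + data[hi:]

    while end - start > beta:
        mid = (start + end) // 2
        if confidence(occluded(start, mid)) < confidence(occluded(mid, end)):
            end = mid          # occluding the left half hurt the score more
        else:
            start = mid
    return start, end
```

Ties fall to the right half here, which covers the boolean (0/1 confidence) outputs of an AV product, as noted above.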
4.3 ROP Injection

The last method we consider is of a different variety. Instead of modifying known malicious files in an attempt to evade detection, we inject malicious functionality into otherwise benign applications.

Many techniques for this exist; we used the Return Oriented Programming (ROP) Injector (Poulios, Ntantogian, and Xenakis 2015) for our work. The ROP Injector converts malicious shellcode into a series of special control-flow instructions that are already found inside the file. These instructions are inherently benign, as they already exist in the benign file. The technique patches the binary so as to modify the memory stack such that the targeted instructions, which are non-contiguous in the file, are executed in an order equivalent to the original shellcode. The result is that the functionality of the file is maintained, with the added malicious activity executing right before the process exits. We use the same reverse Meterpreter shellcode as Poulios, Ntantogian, and Xenakis (2015) for our experiment.
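As a toy illustration of the underlying idea, and emphatically not the ROPInjector's actual algorithm, the sketch below scans a host binary for short byte sequences ending in the x86 RET opcode (0xC3) and expresses a hypothetical payload as a chain of offsets of sequences the host already contains.

```python
# Toy gadget finder (ours, greatly simplified): every short byte
# sequence ending in RET (0xC3) that already exists in the host blob
# is a reusable "gadget"; a payload becomes a list of gadget offsets.

def find_gadgets(blob: bytes, max_len: int = 4):
    """Map each byte sequence of length <= max_len ending in RET to the
    offset of its first occurrence in blob."""
    gadgets = {}
    for i, b in enumerate(blob):
        if b == 0xC3:  # RET
            for k in range(1, max_len + 1):
                if i - k + 1 >= 0:
                    gadgets.setdefault(blob[i - k + 1 : i + 1], i - k + 1)
    return gadgets

def chain_for(payload_ops, gadgets):
    """Offsets realising each needed op, or None where the host lacks one."""
    return [gadgets.get(op) for op in payload_ops]
```

The real tool must additionally patch the stack so the chain executes in payload order, and may add instructions non-contiguously when the host lacks a needed gadget, which is why not every binary is injectable.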
We note that not all benign executables in our testing set were injectable. The files must either already contain enough instructions to represent the shellcode, or have room for further instructions to be added in a non-contiguous manner so as to avoid raising suspicion. Additionally, if the malware requires external functionality such as networking capabilities, and these are not inherent to the original executable, then portions of the PE header such as the import table must be adjusted. For these reasons, only 13,560 files were successfully infected.

Poulios, Ntantogian, and Xenakis claim they were able to evade anti-virus over 99% of the time using this technique. The approach should be nearly undetectable with static analysis alone, as all instructions in the file remain benign in nature. This makes it an extremely difficult test for both the machine learning and anti-virus detectors.

5 Results

Before delving into the results, we make explicit that we consider Malware the positive class and Benign the negative class. We remind the reader that not all tested systems are on equal footing. Our n-gram and MalConv models are trained on the same corpus, but we have no knowledge of the corpora used by the AV companies for their products. In all likelihood, they will be using datasets orders of magnitude larger than our own to ensure the quality of their products and to reach a desired false-positive rate. From this perspective alone, we would not necessarily expect the machine learning based methods to have better accuracies, as there is a disparity of information at training time. Our methods are also not optimized for the same low false-positive goal.

The accuracies and metrics of each method are presented in Table 1. Because of the mentioned information disparity between the systems, we do not present this information for direct comparison. Our goal is to use these metrics as a baseline measure of performance for each method, and to look at how these baselines are affected by various evasion techniques.

Table 1. Baseline accuracy and true/false positive/negative results for each model on the test set.

Classifier   TN%    TP%    FN%    FP%    Accuracy%
N-Gram       92.1   98.7    1.3    7.9   95.5
MalConv      90.7   97.2    2.8    9.3   94.1
AV1          94.3   99.5    0.5    5.7   97.0
AV2          99.4   64.9   35.1    0.6   81.6
AV3          98.5   80.5   19.5    1.5   89.2
AV4          93.8   91.9    8.1    6.6   92.6

5.1 Benign Modifications

The results of the experiment described in subsection 4.1 are shown in Figure 1.

Figure 1. Robustness of classifiers to benign modifications (y-axis: files, out of 1,000). The blue bar (left) represents the number of false negatives before any modifications were made. The red bar (center) represents the number of files that were able to evade. The tan bar (right) represents the number of files that were still classified correctly after 10 modifications. Higher is better for the tan bar, and lower is better for all others.

All four anti-virus products were significantly degraded by seemingly insignificant changes to the malware files. The machine learning models were immune to this evasion technique: both models' confidence that these files were malware changed by less than 1% on average, and the few files that did evade were very close to the models' decision boundaries before the modifications were performed. The anti-virus products did not perform as well in this test. AV3 and AV4 were the most robust to these modifications, with 130 and 215 files evading them respectively. While this is better than products like AV1, where 783 out of 1,000 malicious files were able to evade, it is still an order of magnitude worse than the 3 and 20 files that could evade the n-gram and MalConv models, respectively.

Given these results, one might wonder: could the ML based models be evaded by this approach if more modifications were performed? In particular, a number of the modifications have multiple possible results for a single action, so re-applying them is not unreasonable. To answer this question, we look at the evasion rate as a function of the number of modifications performed in Figure 2.

Figure 2. The number of files that evade increases as the number of modifications increases (x-axis: number of modifications performed; y-axis: percentage of files evading). Lower is better.

In this plot we can see that all four AV products have an upward-trending slope as the number of modifications is increased. In contrast, the n-gram model is completely unchanged from 3 or more modifications, and MalConv from 4 or more. This would appear to indicate that the ML approaches are truly immune to this evasion technique, whereas the AVs are not.

These results also speak to the use of dynamic evaluation, or lack thereof, in the AV products at scan time. Any dynamic analysis based approach should be intrinsically immune to the modifications performed in this test. Because we see all AV products fail to detect 13-78% of modified malware, we can postulate that if any dynamic analysis is being done, its application is ineffectual.

We note that in the original work of Anderson, Filar, and Roth (2017), this attack was applied to a simple machine learning model that used only features from the PE header. The attack was able to evade that PE-feature based model, but not the byte based models tested under our framework.

5.2 Targeted Occlusion

The experiment detailed in subsection 4.2 attempts to modify the most important region for detecting the maliciousness of a binary, as deduced from querying only the n-gram model, and then attempts to transfer that knowledge to all other classifiers.

The results for the byte occlusion tests are given in Figure 3, where the blue bar shows the accuracy on 40,000 malicious files when no bytes are occluded.

Figure 3. Robustness to byte level manipulations (y-axis: files, ×10^4). Black dashed line indicates the maximum. The blue bar (far left) represents the number of files classified correctly before occlusion. The red bar (left center) represents the number of files classified correctly after random occlusion. The tan bar (center right) represents the number of files classified correctly after targeted occlusion using random bytes. The dark gray bar (far right) represents the number of files classified correctly after targeted occlusion using bytes from known good files.

For both the n-gram and MalConv models, we see that the Random, Undirected, and Adversarial occlusion attacks have almost no impact on classification accuracy. In the case of the n-gram model, only 0.13% of the files were able to evade its detection after occlusion. Again, evasions were highly correlated with files close to the decision boundary.

In particular, we remind the reader that the Adversarial occlusion replaces up to 2KB of the malicious file with bytes taken from benign training data. This is designed to maximally impact the ML approaches, as the models were trained with those bytes carrying an explicit label of benign. Yet the Adversarial choice has almost no impact on results. This is a strong indicator of the potential robustness of the ML approach, and shows that simply adding bytes from benign programs is not sufficient to evade detection.

Considering the AV products, we again see a different story. AV1 had the highest detection rate when no occlusion occurred, but also had the largest drop in detection rate for all three types of occlusion. AV4 had the best performance among the AV vendors, but was still impacted by the occlusion of bytes and did not perform as strongly as the ML models. From the AV results it is also clear that the Targeted Random occlusion is an improvement over the Undirected occlusion of random bytes, showing that Algorithm 1 is effective at identifying important regions. This was not discernible from the n-gram and MalConv models, which are relatively immune to this scenario.

Although on average the occlusion was limited to 0.5% of a file's bytes, we acknowledge that one could argue this transformation may have altered the true label of the file: the search could be removing the malicious payload of the application, rendering it muted (if runnable at all). We accept this possibility under the assumption that the adversary now has enough information to alter this small portion of the file in order to break possible signatures.

The results from the machine learning models would suggest that maliciousness is not contained to a small contiguous section of most files.
In particular, we remind the reader that the Adversarial occlusion replaces up to 2KB of the malicious file with bytes taken from benign training data. This is designed to maximally impact the ML approaches, as the model was trained with the given bytes as having an explicit label of benign. Yet, the Adversarial choice has almost no impact on results. This is a strong indicator of the potential robustness of the ML approach, and that simply adding bytes from benign programs is not sufficient to evade their detection.

Considering the AV products, we again see a different story. AV1 had the highest detection rate when no occlusion occurred, but also had the largest drop in detection rate for all three types of occlusion. AV4 had the best performance among the AV vendors, but was still impacted by the occlusion of bytes and did not perform as strongly as the ML models.

The results from the machine learning models would suggest that maliciousness is not contained to a small contiguous section of most files. This makes intuitive sense, as malicious functionality being employed in executable sections of a file might also require functions to be referenced in the import table. The non-contiguous nature of the PE file format allows for many similar circumstances, and so we believe it unlikely that all of a file's maliciousness would be contained within a small contiguous region of a binary. In addition, this results in binaries that are analogous to deployed malware with bugs. Just because malware may not function properly, and thus not negatively impact users, doesn't remove its malicious intent or the benefit in detecting it.

5.3 ROP Injection Results

In Table 2 we show the results of running the ROPInjector on the benign test files. The results are shown only for the 13,560 files that the ROPInjector was able to infect. Before infection, the Pre-ROP Accuracy column indicates the percentage of files correctly labeled benign. After infection, the Post-ROP Accuracy column indicates what percentage were correctly labeled as malicious. The Post-ROP Lift then shows how many percentage points of Post-ROP Accuracy came from the classifier correctly determining that a formerly benign file is now malicious, rather than simply being a false-positive. That is to say, if a model incorrectly called a benign file malicious, it is not impressive when it calls an infected version of the file malicious as well.

Table 2. Accuracy on originally benign binaries before and after applying the ROPInjector.

Classifier  Pre-ROP Accuracy  Post-ROP Accuracy  Post-ROP Lift
N-Gram                  85.1               15.3            0.4
MalConv                 82.4               18.8            1.2
AV1                     99.3                1.3            0.6
AV2                     98.7                1.2           −0.1
AV3                     97.9                0.7           −1.4
AV4                     89.2               32.9           22.1

Most malware detectors had significant difficulty with this task.
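The Lift column is consistent with a simple identity: Lift = Post-ROP Accuracy − (100 − Pre-ROP Accuracy), i.e., detections on infected files minus the files the detector had already (incorrectly) flagged before infection. This is our reading of the column's description, not a formula the paper states explicitly; a quick check reproduces every row of Table 2:

```python
def post_rop_lift(pre_acc: float, post_acc: float) -> float:
    """Percentage points of Post-ROP Accuracy not attributable to
    pre-existing false positives on the uninfected file."""
    return round(post_acc - (100.0 - pre_acc), 1)

# (Pre-ROP Accuracy, Post-ROP Accuracy) per classifier, from Table 2.
table = {"N-Gram": (85.1, 15.3), "MalConv": (82.4, 18.8), "AV1": (99.3, 1.3),
         "AV2": (98.7, 1.2), "AV3": (97.9, 0.7), "AV4": (89.2, 32.9)}
lifts = {name: post_rop_lift(*accs) for name, accs in table.items()}
# e.g. lifts["AV4"] == 22.1 and lifts["AV3"] == -1.4
```

Under this reading, a negative Lift (AV2, AV3) means some files the detector flagged while benign were no longer flagged once infected.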
The machine learning based models, n-gram and MalConv, showed only modest improvements of 0.4 and 1.2 percentage points. AV1 had a similarly small 0.6 improvement, but surprisingly AV2 and AV3 had negative improvements. That means for 1.4% of the benign files, AV3 called the original benign file malicious, but when that same file was infected by the ROPInjector, AV3 changed its decision to benign. This is a surprising failure case, and may be caused by the AV system relying too heavily on a signature based approach (i.e., a signature had a false-positive on the benign file, but the infection process broke the signature, turning a would-be true positive into a false-negative).

Overall these models showed no major impact from the ROPInjector. Only AV4 was able to significantly adjust its decision, correctly labeling 22.1% of the infected files as malicious. Though the evasion rate for AV4 remains high at 77%, it performed best in this scenario. Given the muted and negative performance of the other AV systems in this test, we suspect AV4 has developed techniques specifically for the ROPInjector approach.

We also notice that the benign files that were injectable include 67% of our original false positives for the n-gram model. We suspect that this is due to those files already containing the networking functionality required by the shellcode. The majority of malware requires communication over the Internet, therefore our machine learning detectors may view networking related features as somewhat malicious in nature.

Overall, the ROPInjection of malicious code into otherwise benign applications presents an apparent weakness for our machine learning approaches. An important question is whether ROPInjection is functionally invisible to the machine learning based models (i.e., they could never detect features of the injection), or is simply not sufficiently represented in our training data for the models to learn. This question was tested by applying the ROPInjector to all of our training data, both benign and malicious files, which resulted in 235,865 total ROPInjected binaries. We then trained an n-gram model to distinguish between ROPInjected and non-ROPInjected binaries. The n-gram model was able to obtain 97.6% accuracy at this task, with an AUC of 99.6%. This indicates that the models could learn to detect ROPInjection, but that it is not sufficiently prevalent in our training corpus for them to have done so.

Overall, this case highlights a potential advantage of the classical AV systems with extensive domain knowledge. When sufficiently novel attacks are developed for which current models have almost no predictive power, it may be easier to develop and deploy new signatures to catch the attacks, while the machine learning approaches may require waiting for data, or creating synthetic data (which has its own risks), to adjust the model.

6 Conclusion

We have demonstrated a new testing methodology for comparing the robustness of machine learning classifiers to current anti-virus software. Furthermore, we have provided evidence that machine learning approaches may be more successful at catching malware that has been manipulated in an attempt to evade detection. Anti-virus products do a good job of catching known and static malicious files, but their rigid decision boundaries prevent them from generalizing to new threats or catching evolutionary malware. We demonstrated that top tier anti-virus products can be fooled by simple modifications, including changing a few bytes or importing random functions. The machine learning models appear to better generalize maliciousness, leading to an increased robustness to evasion techniques compared to their anti-virus counterparts.

References

[2013] Anderson, R.; Barton, C.; Böhme, R.; Clayton, R.; van Eeten, M. J. G.; Levi, M.; Moore, T.; and Savage, S. 2013. Measuring the Cost of Cybercrime. Berlin, Heidelberg: Springer Berlin Heidelberg. 265–300.

[2017] Anderson, H. S.; Filar, B.; and Roth, P. 2017. Evading Machine Learning Malware Detection. Black Hat USA.

[2010] Barreno, M.; Nelson, B.; Joseph, A. D.; and Tygar, J. D. 2010. The security of machine learning. Machine Learning 81(2):121–148.

[2014] Biggio, B.; Fumera, G.; and Roli, F. 2014. Security evaluation of pattern classifiers under attack. IEEE Transactions on Knowledge and Data Engineering 26(4):984–996.

[2016] Deo, A.; Dash, S. K.; Suarez-Tangil, G.; Vovk, V.; and Cavallaro, L. 2016. Prescience: Probabilistic Guidance on the Retraining Conundrum for Malware Detection. In Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security, AISec '16, 71–82. New York, NY, USA: ACM.

[2017] Egele, M.; Scholte, T.; Kirda, E.; and Barbara, S. 2017. A Survey On Automated Dynamic Malware Analysis Evasion and Counter-Evasion. In Proceedings of the Reversing and Offensive-oriented Trends Symposium.

[2016] Grosse, K.; Papernot, N.; Manoharan, P.; Backes, M.; and McDaniel, P. D. 2016. Adversarial perturbations against deep neural networks for malware classification. CoRR abs/1606.04435.

[2015] Poulios, G.; Ntantogian, C.; and Xenakis, C. 2015. ROPInjector: Using Return-Oriented Programming for Polymorphism and AV Evasion. In Black Hat USA.
[2013] Hyman, P. 2013. Cybercrime. Communications of the ACM 56(3):18.

[2016] Jordaney, R.; Wang, Z.; Papini, D.; Nouretdinov, I.; and Cavallaro, L. 2016. Misleading Metrics: On Evaluating Machine Learning for Malware with Confidence. Technical report, University of London.

[2017] Jordaney, R.; Sharad, K.; Dash, S. K.; Wang, Z.; Papini, D.; Nouretdinov, I.; and Cavallaro, L. 2017. Transcend: Detecting Concept Drift in Malware Classification Models. In 26th USENIX Security Symposium (USENIX Security 17), 625–642. Vancouver, BC: USENIX Association.

[2013] Kantchelian, A.; Afroz, S.; Huang, L.; Islam, A. C.; Miller, B.; Tschantz, M. C.; Greenstadt, R.; Joseph, A. D.; and Tygar, J. D. 2013. Approaches to Adversarial Drift. In Proceedings of the 2013 ACM Workshop on Artificial Intelligence and Security, AISec '13, 99–110. New York, NY, USA: ACM.

[2018] Kolosnjaji, B.; Demontis, A.; Biggio, B.; Maiorca, D.; Giacinto, G.; Eckert, C.; and Roli, F. 2018. Adversarial Malware Binaries: Evading Deep Learning for Malware Detection in Executables.

[2006] Kolter, J. Z., and Maloof, M. A. 2006. Learning to Detect and Classify Malicious Executables in the Wild. Journal of Machine Learning Research 7:2721–2744.

[2018] Kreuk, F.; Barak, A.; Aviv-Reuven, S.; Baruch, M.; Pinkas, B.; and Keshet, J. 2018. Adversarial Examples on Discrete Sequences for Beating Whole-Binary Malware Detection.

[2017] Li, B.; Roundy, K.; Gates, C.; and Vorobeychik, Y. 2017. Large-Scale Identification of Malicious Singleton Files. In Proceedings of the Seventh ACM Conference on Data and Application Security and Privacy, CODASPY '17, 227–238.

[2016] Miller, B.; Kantchelian, A.; Tschantz, M. C.; Afroz, S.; Bachwani, R.; Faizullabhoy, R.; Huang, L.; Shankar, V.; Wu, T.; Yiu, G.; Joseph, A. D.; and Tygar, J. D. 2016. Reviewer Integration and Performance Measurement for Malware Detection. In Proceedings of the 13th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment - Volume 9721, DIMVA 2016, 122–141. New York, NY, USA: Springer-Verlag New York, Inc.

[2008] Moskovitch, R.; Feher, C.; Tzachar, N.; Berger, E.; Gitelman, M.; Dolev, S.; and Elovici, Y. 2008. Unknown Malcode Detection Using OPCODE Representation. In Proceedings of the 1st European Conference on Intelligence and Security Informatics, EuroISI '08, 204–215. Berlin, Heidelberg: Springer-Verlag.

[2016] Papernot, N.; McDaniel, P. D.; and Goodfellow, I. J. 2016. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. CoRR abs/1605.07277.

[2017] Raff, E., and Nicholas, C. 2017. Malware Classification and Class Imbalance via Stochastic Hashed LZJD. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec '17, 111–120. New York, NY, USA: ACM.

[2016] Raff, E.; Zak, R.; Cox, R.; Sylvester, J.; Yacci, P.; Ward, R.; Tracy, A.; McLean, M.; and Nicholas, C. 2016. An investigation of byte n-gram features for malware classification. Journal of Computer Virology and Hacking Techniques.

[2017] Raff, E.; Barker, J.; Sylvester, J.; Brandon, R.; Catanzaro, B.; and Nicholas, C. 2017. Malware Detection by Eating a Whole EXE. arXiv preprint arXiv:1710.09435.

[2012] Rossow, C.; Dietrich, C. J.; Grier, C.; Kreibich, C.; Paxson, V.; Pohlmann, N.; Bos, H.; and van Steen, M. 2012. Prudent Practices for Designing Malware Experiments: Status Quo and Outlook. In 2012 IEEE Symposium on Security and Privacy, 65–79. IEEE.

[2016] Russu, P.; Demontis, A.; Biggio, B.; Fumera, G.; and Roli, F. 2016. Secure kernel machines against evasion attacks. In Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security, AISec '16, 59–69. New York, NY, USA: ACM.

[2001] Schultz, M.; Eskin, E.; Zadok, F.; and Stolfo, S. 2001. Data Mining Methods for Detection of New Malicious Executables. In Proceedings 2001 IEEE Symposium on Security and Privacy, S&P 2001, 38–49. IEEE Computer Society.

[2009] Shafiq, M. Z.; Tabish, S. M.; Mirza, F.; and Farooq, M. 2009. PE-Miner: Mining Structural Information to Detect Malicious Executables in Realtime. In Recent Advances in Intrusion Detection, 121–141.

[2012] Singh, A.; Walenstein, A.; and Lakhotia, A. 2012. Tracking Concept Drift in Malware Families. In Proceedings of the 5th ACM Workshop on Security and Artificial Intelligence, AISec '12, 81–92. New York, NY, USA: ACM.

[2010] Sommer, R., and Paxson, V. 2010. Outside the Closed World: On Using Machine Learning for Network Intrusion Detection. In Proceedings of the 2010 IEEE Symposium on Security and Privacy, SP '10, 305–316. Washington, DC, USA: IEEE Computer Society.

[2014] Spafford, E. C. 2014. Is Anti-virus Really Dead? Computers & Security 44:iv.

[2014] Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2014. Intriguing properties of neural networks. In ICLR.

[2016] Wressnegger, C.; Freeman, K.; Yamaguchi, F.; and Rieck, K. 2016. From Malware Signatures to Anti-Virus Assisted Attacks.

[2014] Zeiler, M. D., and Fergus, R. 2014. Visualizing and understanding convolutional networks. Lecture Notes in Computer Science 8689 LNCS(PART 1):818–833.