=Paper= {{Paper |id=Vol-2269/FSS-18_paper_11 |storemode=property |title=Static Malware Detection & Subterfuge: Quantifying the Robustness of Machine Learning and Current Anti-Virus |pdfUrl=https://ceur-ws.org/Vol-2269/FSS-18_paper_11.pdf |volume=Vol-2269 |authors=William Fleshman,Edward Raff,Richard Zak,Mark McLean,Charles Nicholas |dblpUrl=https://dblp.org/rec/conf/aaaifs/FleshmanRZMN18 }} ==Static Malware Detection & Subterfuge: Quantifying the Robustness of Machine Learning and Current Anti-Virus== https://ceur-ws.org/Vol-2269/FSS-18_paper_11.pdf

Static Malware Detection & Subterfuge: Quantifying the Robustness of Machine
Learning and Current Anti-Virus

William Fleshman1 Edward Raff1,2,3 Richard Zak1,2 Mark McLean1 Charles Nicholas3
1
Laboratory for Physical Sciences, 2 Booz Allen Hamilton, 3 University of Maryland, Baltimore County
{william.fleshman,edraff,rzak,mrmclea}@lps.umd.edu, nicholas@umbc.edu

Abstract Given these issues and urgency, Machine Learning would
As machine-learning (ML) based systems for malware detec- appear to be a potential solution to the problem of detect-
tion become more prevalent, it becomes necessary to quan- ing new malware. Malware detection can be directly phrased
tify the benefits compared to the more traditional anti-virus as a classification problem. Given a binary, we define some
(AV) systems widely used today. It is not practical to build an feature set, and learn a binary classifier that outputs either
agreed upon test set to benchmark malware detection systems benign or malicious. The output of this model could be cali-
on pure classification performance. Instead we tackle the prob- brated to reach any desired false-positive ratio.
lem by creating a new testing methodology, where we evaluate However, one should never switch to a new technology
the change in performance on a set of known benign & ma- for its own sake. It is necessary to have empirical and quan-
licious files as adversarial modifications are performed. The
change in performance combined with the evasion techniques
tifiable reasons to adopt a new approach for malware detec-
then quantifies a system’s robustness against that approach. tion. Ideally, this would be based off the accuracy of cur-
Through these experiments we are able to show in a quantifi- rent anti-virus systems compared to their machine-learning
able way how purely ML based systems can be more robust counterparts. In reality, this is non-trivial to estimate, and
than AV products at detecting malware that attempts evasion many arguments currently exist as to how this should be done
through modification, but may be slower to adapt in the face (Jordaney et al. 2016; Deo et al. 2016; Sommer and Paxson
of significantly novel attacks. 2010). Designing a corpus and test protocol to determine
accuracy at large is hampered by issues like concept drift
1 Introduction (Jordaney et al. 2017), label uncertainty, cost (Miller et al.
The threat and impact of malicious software (malware) has 2016), and correlations over time (Kantchelian et al. 2013;
continued to grow every year. The annual financial impact Singh, Walenstein, and Lakhotia 2012).
is already measured in the hundreds of billions of dollars Toward remedying this issue, we propose a new testing
(Hyman 2013; Anderson et al. 2013). Simultaneously, there methodology that can highlight a detector’s strength or weak-
are worries that the classical anti-virus approach may not ness with regard to specific attack types. We first utilize the
continue to scale and fail to recognize new threats (Spafford framework from Biggio et al., to specify a model for our ad-
2014). versary (Biggio, Fumera, and Roli 2014). After defining the
Anti-virus systems were historically built around signature adversary’s goal, knowledge, and capabilities, we look at how
based approaches. Wressnegger et al. (2016) discussed a difficult it is for an adversary with known malware to evade
number of issues with signatures, but the primary shortcom- malware detection systems under comparison. As examples,
ing is an intrinsically static nature and inability to generalize. we compare two machine learning approaches, one a simpler
Most anti-virus companies have likely incorporated machine n-gram model and one neural network based, to a quartet of
learning into their software, but to what extent remains un- production AV systems. While we do not exhaustively test
clear due to the nature of proprietary information. Regard- all possible evasion techniques, we show the potential for our
less, adversaries can still successfully evade current detection protocol using a handful of relevant tests: non-destructive au-
systems with minimal effort. The success of methods like tomated binary modification, destructive adversarial attacks,
obfuscation and polymorphism is evident by its prevalence, and malicious injection. We are specifically looking only
and recent work has suggested that the majority of unique at black box attacks which can be applied to any potential
malicious files are due to the use of polymorphic malware malware detector, assuming that the adversary is looking for
(Li et al. 2017). maximal dispersion and not a targeted attack. This new eval-
uation protocol is our primary contribution, and a black-box
Copyright © by the paper’s authors. Copying permitted for private adversarial attack for byte-based malware detectors our sec-
and academic purposes. In: Joseph Collins, Prithviraj Dasgupta,
Ranjeev Mittu (eds.): Proceedings of the AAAI Fall 2018 Sympo-
ond contribution. This evaluation protocol allows us to make
sium on Adversary-Aware Learning Techniques and Trends in Cy- new conclusions on the potential benefits and weakness of
bersecurity, Arlington, VA, USA, 18-19 October, 2018, published two byte based machine learning malware detectors.
at http://ceur-ws.org We will review previous work in section 2. The experi-
ments and methodology we use to compare our ML based Another approach is to modify the bytes of a binary, and
systems to anti-virus will be given in section 3 and section 4, run the malware detection system after modification. While
followed by their results in section 5 and conclude in sec- this risks breaking the binary, it can still be effective, and it
tion 6. is important to still quarantine malware even if it does not ex-
ecute as intended. Wressnegger et al. (2016) modified each
2 Related Work binary one byte at a time, and found that single byte changes
could evade classical AV detectors. They used this approach
Malware detection and analysis has been an area of active to find which bytes were important to the classification, and
research for several years. One class of techniques that is of reverse-engineered signatures used from multiple AV prod-
particular note is dynamic analysis, where the binary itself ucts. We extend this technique in subsection 4.2 to create a
is run to observe its behavior. Intuitively, malicious behavior new adversarial approach against our models, which finds a
is the best indicator of a malicious binary — making dy- contiguous byte region which can be altered to evade detec-
namic analysis a popular approach. However, dynamic analy- tion.
sis has many complications. It requires significant effort and For the machine learning based malware detection meth-
infrastructure to make it accurately reflect user environments, ods, we look at two approaches: byte n-grams and neural
which are often not met in practice (Rossow et al. 2012). Fur- networks. The byte n-gram approach was one of the first
thermore, the malware author can use a number of techniques machine learning methods proposed for malware detection
to detect dynamic execution, hide behavior, or otherwise ob- (Schultz et al. 2001), and has received significant attention
fuscate such analysis. This means effective dynamic analysis (Kolter and Maloof 2006). We use the same approach for
systems are often a cat-and-mouse game of technical issues, training such models as presented in Raff et al. (2016). For
and can require running binaries across many layers of emu- the neural network approach we use the recently proposed
lation (Egele et al. 2017). MalConv, which processes an entire binary by using an em-
It is for these reasons that we focus on the static analysis bedding layer to map bytes to feature vectors followed by
case. This removes the unlimited degrees of freedom pro- a convolutional network (Raff et al. 2017). Because the em-
vided by actually running code. However, this does not mean bedding layer used by MalConv is non-differentiable with
the malicious author has no recourse. Below we will review respect to the input bytes, the aforementioned adversarial
the related work in this area. approaches applied to image classification tasks are not as
Questions regarding the security of machine learning based easily applicable. Similarly, anti-virus products do not pro-
systems have been an active area of research for over a decade vide a derivative for such attacks. This is why we develop a
(Barreno et al. 2010), but have recently received increased new black-box adversarial approach in subsection 4.2.
attention. This is in particular due to the success in generating Using these modeling methods is beneficial for our analy-
adversarial inputs to neural networks, which are inputs that sis because they are trained with minimal assumptions about
induce an incorrect classification despite only minor changes the underlying data. Other approaches to malware detection
to the input (Szegedy et al. 2014). These approaches generally that process the PE-header (Shafiq et al. 2009) or use disas-
work by taking the gradient of the network with respect to sembly (Moskovitch et al. 2008) exist. We avoid these under
the input, and adjusting the input until the network’s output is the belief that their use of explicit feature sets aids in evading
altered. This does not directly translate to malware detection them (e.g, modifying the PE-Header is easy and allows for
when using the raw bytes, as bytes are discontinuous in nature. direct evasion of models using those features).
That is to say, any change in a single byte’s value is an equally We leverage the existing framework from (Biggio, Fumera,
“large” change, whereas in images, adjusting a pixel value and Roli 2014) to model our adversary for each comparison
can result in visually imperceptible changes. Arbitrary byte made. This framework prescribes defining the adversary’s
changes may also result in a binary that does not execute, goals, knowledge, and capabilities in regards to the classifiers.
making it more difficult to apply these attacks in this space
(Grosse et al. 2016; Russu et al. 2016). While such attacks
on individual models are possible (Kolosnjaji et al. 2018;
3 Adversarial Model
Kreuk et al. 2018), we are concerned with attacks that can be For each experiment we specify the model of our adversary
applied to any detector, therefore these methods are out of by outlining the goals, knowledge, and capabilities available
the scope of this work. (Biggio, Fumera, and Roli 2014). These make clear the as-
To overcome the issue with arbitrary byte changes breaking sumptions and use cases presented by each scenario.
executables, Anderson, Filar, and Roth (2017) developed The goal of our adversary is the same across all experi-
a set of benign actions that can modify a binary without ments – using a single attack to maximize the misclassifica-
changing its functionality. Selecting these actions at random tion rate of malicious files across many classifiers. The one
allowed malware to evade a ML model 13% of the time. caveat is that in subsection 4.2 we only goes as far as identi-
Introducing a Reinforcement Learning agent to select the fying a small portion of the file which can be altered to evade
actions increased this to 16% of the time. Their ML model detection. A real attacker would then have to alter those bytes
used features from the PE header as well as statistics on while retaining the malicious functionality of the file. It is un-
strings and bytes from the whole binary. We will replicate likely that authors of benign software would seek a malicious
this type of attack as one of the comparisons between our ML classification – therefore we limit ourselves to experiments
systems and anti-virus. on malware only.
We severely constrain the knowledge of our adversary. In corpora of benign and malicious files. For the AV products,
all cases we assume the adversary has no knowledge of the these are used to ensure that a certain target level of false
classifiers’ algorithms, parameters, training data, features, or positives and false negatives are met. In addition, the training
decision boundaries. This is the best case scenario for security corpus we use for our models is orders of magnitude smaller
applications, as knowledge of any of the previous attributes than what the AV companies will have access to. For this
would increase the sophistication of possible attacks. reason we would not necessarily expect our approaches to
Similarly, the experiments outlined in subsection 4.1 and have better accuracy. The goal is to quantify the robustness
subsection 4.3 require no prior interaction with the classi- of any classifier to a specific threat model and attack.
fiers before deploying a malicious file to the wild. In sub- One novel and three pre-existing techniques (“φ”s) will
section 4.2, the adversary has the capability of querying our be used to perform this evaluation. These are not exhaustive,
n-gram model only. The inner workings of the model remain but show how our approach can be used to quantify relative
hidden, but the output of the model is available for deductive performance differences. Below we will review their method-
reasoning. ology, and the information that we can glean from their re-
These assumptions are restrictive to the adversary’s actions sults. In section 5 we will review the results of running these
and knowledge, but are important because current adversaries experiments.
are able to succeed under such restrictions today. This means
malware authors can obtain success with minimal effort or Dataset
expense. Through our evaluation procedure we are able to We take only a brief moment to expound upon the machine
better understand which techniques / products are most robust learning models and data used for training, as the information
to this restricted scenario, and thus will increase the effort disparity with other AV products makes direct comparison
that must be expended by malware authors. difficult. We trained the machine learning models on a large
dataset from industry discussed in (Raff and Nicholas 2017)
4 Experiments and Methodology which contains approximately 2 million samples balanced
We now perform experiments under our proposed testing almost evenly between benign and malicious. Our test set
methodology to compare machine learning models to com- consists of approximately 80,000 files held out for our post-
mercial anti-virus systems. To do so, we compare two ma- training experimentation. The test set is also almost perfectly
chine learning classifiers and four commercial anti-virus prod- balanced. This set was not seen by our machine learning mod-
ucts: AV1, AV2, AV3, and AV4. We use these anonymous els during training, but could have files previously seen by
names to be compliant with their End User License Agree- the anti-virus corporations. The n-gram model follows the
ments (EULA), which bars us from naming the products in procedure used in (Raff et al. 2016), but uses one million fea-
any comparative publication. We are prohibited from using tures instead of 100,000. The MalConv approach is specified
the internet based mass scanners, like VirusTotal, for similar in (Raff et al. 2017). Both of these approaches require no
reasons. Ideally, we could test many products, but with our domain knowledge to apply.
large corpus of files we limit ourselves to these four – which
we believe to be representative of the spectrum of commer- 4.1 Randomly Choosing Benign Modifications
cially used AV platforms. We purposely did not chose any In this experiment, we use a subset1 of the modifications used
security products which advertise themselves as being pri- by Anderson, Filar, and Roth (2017) to alter malicious files
marily powered by machine learning or artificial intelligence. before scanning them with the classifiers. The objective is to
Additionally, any cloud based features which could upload test the brittleness of the classifiers’ decision mechanisms by
our files were disabled to protect the proprietary nature of our making small changes that do not alter the functionality of
dataset. Our machine learning based approaches will be built the malware. This experiment is ideal because it produces a
upon n-grams and neural networks which process a file’s raw new file that is still functional, and should have no impact on
bytes. We will compare these systems on a number of tasks an ideal malware detector.
to evaluate their robustness to evasion. There are nine different modification actions, and grouped
We now describe our new testing protocol. Given a by type are:
corpus {x1 , . . . , xn } of n files with known labels yi ∈
{Benign, Malicious}, and detector C(·) we first measure • rename or create new sections
• append bytes to the end of a section or the file
the accuracy of C(xi ), ∀i. This gives us our baseline • add an unused function to the import table
performance. We then use a suite of evasion techniques • create a new entry point (which jumps to the old entry)
φ1 (·), . . . , φK (·) and look at the difference in accuracy be- • modify the header checksum, the signature, or debug info
tween C(xi ) = yi and C(φ(xi )) = yi . The difference in Each malicious file was scanned before the experiment
these scores tells us the ability of the detector C(·) to general- to see if it was already being misclassified as benign (false
ize past its known data and catch evasive modifications. Any negative). If it was correctly classified as malicious then a
approach for which C(xi ) 6= C(φj (xi )) for many different modification was randomly selected from the set and applied.
evasion methods φ which preserve the label yi , is then an in- This process was repeated up to ten times or until the file
trinsically brittle detector, even if it obtains perfect accuracy evaded the classifier. Approaches that rely too heavily on
before φ is used. exact signatures or patterns should suffer in this test, and we
The raw detection accuracy is not the point of this test.
1
Each system will have been built using different and distinct A UPX (un)packing option did not operate correctly.
would expect any approach that uses purely dynamic features Algorithm 1 Occlusion Binary Search
to be unaffected by the modifications. Require: A file F of length |F |, a classifier C(·), target
This experiment is time intensive, and so we use a subset of occlusion size β, byte replacement distribution D
1,000 files randomly chosen from our test set. Anderson, Filar, 1: split ← |F |/2, size ← |F |/2
and Roth limited themselves to a small 200 file sample, so our 2: start ← 0, end ← |F |
larger test set may provide additional confidence in the results. 3: while size > β do
The original paper proposed using a reinforcement learning 4: Fl ← F , Fr ← F
algorithm (RLA) to select the modifications in an adversarial 5: Fl [split−size:split] ← contiguous sample ∼ D
manner. We found that including a RLA did not change 6: Fr [split:split+size] ← contiguous sample ∼ D
results, but did limit our ability to test with all products in a 7: if C(Fl ) < C(Fr ) then
timely manner. 8: split ← split − size/2
9: start ← split − size, end ← split
4.2 Targeted Occlusion of Important Bytes
10: else
The first approach described requires extensive domain 11: split ← split + size/2
knowledge based tooling, and would be difficult to adapt to 12: start ← split, end ← split + size
new file types. We introduce a novel approach to find impor- 13: size ← size/2
tant byte regions of a malicious file given a working detector.
Identifying the most important region of a file gives feedback 14: return start, end
to adversaries, and allows us to occlude that region as an
evasion technique.
Our approach works by occluding certain bytes of a file, benign training files to occlude the bytes in the file under
and then running the malware detector on the occluded file. test. Since we use this approach only with malicious files,
By finding a contiguous region of bytes that reduced the Adversarial Occlusion is an especially challenging test for
detector’s confidence that a file is malicious, we can infer our machine learning based methods: they have seen these
that the bytes in that region are important to the detector. If byte regions before with an explicit label of benign. This
occluding a small region causes the detector to change its vote test also helps to validate that all approaches aren’t simply
from malicious to benign, then we can infer that the detector detecting high-entropy regions that “look” packed, and then
is too fragile. In this case it is plausible for a malware author defaulting to a decision of malicious.
to determine what is stored in the small region detected, and To show that our search procedure provides meaningful
then modify the binary to evade detection. improvements for the Random and Adversarial Occlusions,
Manually editing groups of bytes one at a time would we will also compare against an Undirected Occlusion ap-
be computationally intractable. Given a file F of |F | bytes, proach that eschews our search procedure. In this case we
and a desired contiguous region size of β, it would take randomly select a region of the binary to occlude with high-
O(|F |) calls to the detector to find the most important region. entropy random bytes.
We instead develop a binary-search style procedure, which Recall, that this method identifies the critical region – for a
is outlined in Algorithm 1. Here C(·) returns the malware successful attack the adversary would then have to manually
detector’s confidence that a given file is malicious, and D is a alter the portion of the file at that location while maintaining
source of bytes to use for the occlusion of the original bytes. their intended outcome. For this reason, we use a single
This approach allows us to find an approximate region of target window size of 2048 bytes. The resulting occluded area
importance in only O(log |F |) calls to the detector C(·). In will fall in the range of (1024-2048] bytes. On average, this
the event that C(Fl ) = C(Fr ), ties can be broken arbitrarily restricts the occluded section to under 0.5% of the file size for
– which implicitly covers the boolean decisions from an anti- our testing set. We also limit ourselves to searching for critical
virus product. regions with our n-gram model. It would be infeasible for an
This method starts with a window size equal to half of the adversary to conduct this search across all possible classifiers,
file size. Both halves are occluded and the file is analyzed and we would expect there to be some overlap among which
by the classifier. The half that results in the largest drop in regions are critical. This has been shown to be true in gradient
classification confidence is chosen to be split for the next time based classifiers, where an adversarial example generated by
step. This binary search through the file is continued until the one model can fool many others (Papernot, McDaniel, and
window size is at least as small as the target window size. Goodfellow 2016).
The last component of our approach is to specify what
method D should be used to replace the bytes of the original 4.3 ROP Injection
file. One approach, which we call Random Occlusion, is The last method we will consider is of a different variety.
to simply select each replacement byte at random. This is Instead of modifying known malicious files in an attempt to
similar to prior work in finding the important regions of an evade detection, we will inject malicious functionality into
image according to an object detector, where it was found otherwise benign applications.
that randomly blocking out the pixels was effective — even if Many such techniques for this exist. We used the Return
the replacement values were nonsensical (Zeiler and Fergus Oriented Programming (ROP) Injector (Poulios, Ntantogian,
2014). We also look at Adversarial Occlusion, where we use and Xenakis 2015) for our work. The ROP Injector converts
a contiguous region selected randomly from one of our own malicious shellcode into a series of special control flow in-
structions that are already found inside the file. These instruc- Table 1. Baseline accuracy and true/false positive/negative results
tions are inherently benign, as they already exist in the benign for each model on the test set.
file. The technique patches the binary in order to modify the
memory stack so that the targeted instructions, which are non- Classifier TN% TP% FN% FP% Accuracy%
contiguous in the file, are executed in an order equivalent to N-Gram 92.1 98.7 1.3 7.9 95.5
the original shellcode. The result is that the functionality of MalConv 90.7 97.2 2.8 9.3 94.1
the file is maintained, with the added malicious activity exe- AV1 94.3 99.5 0.5 5.7 97.0
cuting right before the process exits. We use the same reverse AV2 99.4 64.9 35.1 0.6 81.6
Meterpreter shellcode as Poulios, Ntantogian, and Xenakis AV3 98.5 80.5 19.5 1.5 89.2
(2015) for our experiment. AV4 93.8 91.9 8.1 6.6 92.6
We note that not all benign executables in our testing set
were injectable. The files must either have enough instruc-
tions to represent the shellcode already, or have room for
further instructions to be added in a non-contiguous man- 1,000 False Negatives
ner so as to prevent raising suspicion. Additionally, if the Evaded
malware requires external functionality such as networking 800 Correct
capabilities, and these are not inherent to the original exe-
cutable, then portions of the PE header such as the import 600
table must be adjusted. For these reasons, only 13,560 files

Files
were successfully infected.
Poulios, Ntantogian, and Xenakis claim they were able to 400
evade anti-virus over 99% of the time using this technique.
This approach should be nearly undetectable with static anal- 200
ysis alone, as all instructions in the file remain benign in
nature. This makes it an extremely difficult test for both the
0
machine learning and anti-virus detectors.

4
5 Results

AV
ra

Co
-G

al
N

Before delving into results, we make explicit that we con-
sidered Malware the positive class and Benign the negative M
class . We remind the reader that not all tested systems are on Figure 1. Robustness of classifiers to benign modifications. The blue
equal footing. Our n-gram and MalConv models are trained bar (left) represents the number of false negatives before any modi-
on the same corpus, but we have no knowledge of the cor- fications were made. The red bar (center) represents the number of
pora used by the AV companies and their products. In all files that were able to evade. The tan bar (right) represents the num-
likelihood, they will be using datasets orders of magnitude ber of files that were still classified correctly after 10 modifications.
Higher is better for the tan bar, and lower is better for all others.
larger than our own to ensure the quality of their products and
to reach a desired false-positive rate. From this perspective
alone, we would not necessarily expect the machine learning
Percentage of Files Evading

N-Gram MalConv AV1
based methods to have better accuracies, as there is a dispar- 100 AV2 AV3 AV4
ity of information at training time. Our methods are also not
optimized for the same low false-positive goal. 80
The accuracies and metrics of each method are presented
60
in Table 1. Because of the mentioned information disparity
between each system, we do not present this information 40
for direct comparison. Our goal is to use these metrics as
a base-line measure of performance for each method, and 20
look at how these baselines are affected by various evasion
techniques. 0
2 4 6 8 10
5.1 Benign Modifications
Number of Modifications Performed
The results of the experiment described in subsection 4.1 are
shown in Figure 1. All four anti-virus products were signifi- Figure 2. The number of files that evade increase as the number of
cantly degraded by making seemingly insignificant changes modifications increases. Lower is better.
to the malware files. The machine learning models were im-
mune to this evasion technique. Both model’s confidence that
these files were malware changed by less than 1% on average. The anti-virus products did not perform as well in this test.
The few files that did evade were very close to the models’ de- AV3 and AV4 were the most robust to these modifications,
cision boundaries before the modifications were performed. with 130 and 215 files evading each respectively. While this
is better than products like AV1, where 783 out of 1000 ·104
malicious files where able to evade, it is still an order of
magnitude worse than the 3 and 20 files that could evade the Baseline Undirected
n-gram and MalConv models, respectively. Random Adversarial
Given these results, one might wonder — could the ML 4
based models be evaded by this approach if more modifica-
tions were performed? In particular, a number of the modifica-

Files
tions have multiple results for a single action. So re-applying
3
them is not unreasonable. To answer this question, we look at
the evasion rate as a function of the number of modifications
performed in Figure 2. In this plot we can see that all four
AV products have an upward trending slope as the number
2
of modifications is increased. In contrast, the n-gram model
is completely unchanged from 3 or more modifications, and
MalConv at 4 or more. This would appear to indicate that the

4
AV

AV
ra

Co
ML approaches are truly immune to this evasion technique,

-G

al
N
whereas the AVs are not.

M
These results also speak to the use of dynamic evaluation,
or lack thereof, in the AV products at scan time. Any dynamic Figure 3. Robustness to byte level manipulations. Black dashed line
analysis based approach should be intrinsically immune to indicates maximum. The blue bar (far left) represents number of
the modifications performed in this test. Because we see all files classified correctly before occlusion. The red bar (left center)
AV products fail to detect 13-78% of modified malware, we represents the number of files classified correctly after random
can postulate that if any dynamic analysis is being done its occlusion. The tan bar (center right) represents the number of files
classified correctly after targeted occlusion using random bytes. The
application is ineffectual. dark gray bar (far right) represents the number of files classified
We make note that in the original work of Anderson, Fi- correctly after targeted occlusion using bytes from known good files.
lar, and Roth (2017), they applied this attack to a simple
machine learning based model that used only features from
the PE header. This attack was able to evade the PE-feature
based model, but not the byte based models tested under our among the AV vendors, but was still impacted by the oc-
framework. clusion of bytes and did not perform as strongly as the ML
models.
5.2 Targeted Occlusion From the AV results it is also clear that the Targeted Ran-
The experiment detailed in subsection 4.2 attempts to modify dom occlusion is an improvement over the Undirected occlu-
the most important region for detecting the maliciousness of sion of random bytes, showing that Algorithm 1 is effective
a binary as deduced from querying only the n-gram model at identifying important regions. This was not discernible
and attempting to transfer that knowledge to all others. from the n-gram and MalConv models, which are relatively
immune to this scenario.
The results for the byte occlusion tests are given in Fig-
ure 3, where the blue bar shows the accuracy on 40,000 mali- Although on average the occlusion was limited to 0.5%
cious files when no bytes are occluded. For both the n-gram of a file’s bytes, we acknowledge that one could argue this
and MalConv models, we see that the Random, Undirected, transformation may have altered the true label of the file. The
and Adversarial occlusion attacks have almost no impact on search could be potentially removing the malicious payload
classification accuracy. In the case of the n-gram model, only of the application, rendering it muted (if runnable). We accept
0.13% of the files where able to evade its detection after this possibility under the assumption that the adversary now
occlusion. Again evasions were highly correlated with files has enough information to alter this small portion of the file
close to the decision boundary. in order to break possible signatures.
In particular, we remind the reader that the Adversarial The results from the machine learning models would sug-
occlusion is replacing up to 2KB of the malicious file with gest that maliciousness is not contained to a small contiguous
bytes taken from benign training data. This is designed to section of most files. This makes intuitive sense, as malicious
maximally impact the ML approaches, as the model was functionality being employed in executable sections of a file
trained with the given bytes as having an explicit label of might also require functions to be referenced in the import
benign. Yet, the Adversarial choice has almost no impact on table. The non-contiguous nature of the PE file format allows
results. This is a strong indicator of the potential robustness of for many similar circumstances, and so we believe it unlikely
the ML approach, and that simply adding bytes from benign that all of a file’s maliciousness would be contained within a
programs is not sufficient to evade their detection. small contiguous region of a binary. In addition, this results
Considering the AV products we again see a different story. in binaries that are analogous to deployed malware with bugs.
AV1 had the highest detection rate when no occlusion oc- Just because malware may not function properly, and thus not
curred, but also had the largest drop in detection rate for negatively impact users, doesn’t remove its malicious intent
all three types of occlusion. AV4 had the best performance and the benefit in detecting it.
5.3 ROP Injection Results features of the injection), or is simply not sufficiently repre-
sented in our training data for the model to learn.
In Table 2 we show the results of running the ROPInjector on
the benign test files. The results are shown only for the 13,560 This question was tested by applying the ROPInjector to all
files that the ROPInjector was able to infect. Before infection, of our training data, both benign and malicious files, which re-
the Pre-ROP Accuracy column indicates the percentage of sulted in 235,865 total ROPInjected binaries. We then trained
files correctly labeled benign. After infection, the Post-ROP an n-gram model that tried to distinguish between ROPIn-
Accuracy column indicates what percentage were correctly jected vs Not-ROPInjected binaries. The n-gram model was
labeled as malicious. The Post-ROP Lift then shows how able to obtain 97.6% accuracy at this task with an AUC of
many percentage points of Post-ROP Accuracy came from 99.6%. This indicates that the models could learn to detect
the classifier correctly determining that a formerly benign file ROPInjection, but that it is not sufficiently prevalent in our
is now malicious, rather than simply being a false-positive. training corpus for the models to have learned.
That is to say, if a model incorrectly called a benign file Overall this case highlights a potential advantage of the
malicious, it’s not impressive when it calls an infected version classical AV systems with extensive domain knowledge.
of the file malicious as well. When sufficiently novel attacks are developed for which cur-
rent models have almost no predictive power, it may be easier
Table 2. Accuracy on originally benign binaries before and after to develop and deploy new signatures to catch the attacks —
applying the ROPInjector. while the machine learning approaches may require waiting
for data, or creating synthetic data (which has its own risks)
Classifier Pre-ROP Accuracy Post-ROP Accuracy Post-ROP Lift to adjust the model.
N-Gram 85.1 15.3 0.4
MalConv 82.4 18.8 1.2
AV1 99.3 1.3 0.6 6 Conclusion
AV2 98.7 1.2 −0.1
AV3 97.9 0.7 −1.4 We have demonstrated a new testing methodology for compar-
AV4 89.2 32.9 22.1
ing the robustness of machine learning classifiers to current
anti-virus software. Furthermore, we have provided evidence
Most malware detectors had significant difficulty with this that machine learning approaches may be more successful at
task. The machine learning based models, n-gram and Mal- catching malware that has been manipulated in an attempt to
Conv, showed only modest improvements of 0.4 and 1.2 per- evade detection. Anti-virus products do a good job of catch-
centage points. AV1 had a similarly small 0.6 improvement, ing known and static malicious files, but their rigid decision
but surprisingly AV2 and AV3 had negative improvements. boundaries prevent them from generalizing to new threats or
That means for 1.4% of the benign files, AV3 called the catching evolutionary malware. We demonstrated that top tier
original benign file malicious. But when that same file was anti-virus products can be fooled by simple modifications in-
infected by the ROPInjector, AV3 changed its decision to cluding changing a few bytes or importing random functions.
benign. This is a surprising failure case, and may be caused The machine learning models appear to better generalize ma-
by the AV system relying too heavily on a signature based liciousness — leading to an increased robustness to evasion
approach (i.e., a signature had a false-positive on the benign techniques compared to their anti-virus counterparts.
file, but the infection process broke the signature — turning
a would-be true positive into a false-negative). References
Overall these models showed no major impact from the
ROPInjector. Only AV4 was able to significantly adjust its [2013] Anderson, R.; Barton, C.; Böhme, R.; Clayton, R.;
decision for 22.1% of the infected files to correctly label them van Eeten, M. J. G.; Levi, M.; Moore, T.; and Savage, S.
as malicious. Though the evasion rate for AV4 remains high 2013. Measuring the Cost of Cybercrime. Berlin, Heidelberg:
at 77%, it performed best in this scenario. Given the muted Springer Berlin Heidelberg. 265–300.
and negative performance of the other AV systems in this test, [2017] Anderson, H. S.; Filar, B.; and Roth, P. 2017. Evading
we suspect AV4 has developed techniques specifically for the Machine Learning Malware Detection. Black Hat USA.
ROPInjector approach. [2010] Barreno, M.; Nelson, B.; Joseph, A. D.; and Tygar,
We also notice that the benign files that were injectable J. D. 2010. The security of machine learning. Machine
include 67% of our original false positives for the n-gram Learning 81(2):121–148.
model. We suspect that this is due to those files already con- [2014] Biggio, B.; Fumera, G.; and Roli, F. 2014. Security
taining the networking functionality required by the shell- evaluation of pattern classifiers under attack. IEEE Transac-
code. The majority of malware requires communication over tions on Knowledge and Data Engineering 26(4):984–996.
the Internet, therefore our machine learning detectors may
view networking related features as somewhat malicious in [2016] Deo, A.; Dash, S. K.; Suarez-Tangil, G.; Vovk, V.; and
nature. Cavallaro, L. 2016. Prescience: Probabilistic Guidance on the
Overall the ROPInjection of malicious code into other- Retraining Conundrum for Malware Detection. In Proceed-
wise benign applications presents an apparent weakness for ings of the 2016 ACM Workshop on Artificial Intelligence
our machine learning approaches. An important question and Security, AISec ’16, 71–82. New York, NY, USA: ACM.
is whether the ROPInjection is functionally invisible to the [2017] Egele, M.; Scholte, T.; Kirda, E.; and Barbara, S. 2017.
machine learning based models (i.e., it could never detect A Survey On Automated Dynamic Malware Analysis Eva-
sion and Counter-Evasion. In Proceedings of Reversing and [2015] Poulios, G.; Ntantogian, C.; and Xenakis, C. 2015.
Offensive-oriented Trends Symposium. ROPInjector: Using Return- Oriented Programming for Poly-
[2016] Grosse, K.; Papernot, N.; Manoharan, P.; Backes, M.; morphism and AV Evasion. In Black Hat USA.
and McDaniel, P. D. 2016. Adversarial perturbations against [2017] Raff, E., and Nicholas, C. 2017. Malware Classifica-
deep neural networks for malware classification. CoRR tion and Class Imbalance via Stochastic Hashed LZJD. In
abs/1606.04435. Proceedings of the 10th ACM Workshop on Artificial Intel-
[2013] Hyman, P. 2013. Cybercrime. Communications of the ligence and Security, AISec ’17, 111–120. New York, NY,
ACM 56(3):18. USA: ACM.
[2016] Jordaney, R.; Wang, Z.; Papini, D.; Nouretdinov, I.; [2016] Raff, E.; Zak, R.; Cox, R.; Sylvester, J.; Yacci, P.;
and Cavallaro, L. 2016. Misleading Metrics : On Evaluating Ward, R.; Tracy, A.; McLean, M.; and Nicholas, C. 2016. An
Machine Learning for Malware with Confidence. Technical investigation of byte n-gram features for malware classifica-
report, University of London. tion. Journal of Computer Virology and Hacking Techniques.
[2017] Jordaney, R.; Sharad, K.; Dash, S. K.; Wang, Z.; Pap- [2017] Raff, E.; Barker, J.; Sylvester, J.; Brandon, R.; Catan-
ini, D.; Nouretdinov, I.; and Cavallaro, L. 2017. Transcend: zaro, B.; and Nicholas, C. 2017. Malware Detection by
Detecting Concept Drift in Malware Classification Models. Eating a Whole EXE. arXiv preprint arXiv:1710.09435.
In 26th USENIX Security Symposium (USENIX Security 17), [2012] Rossow, C.; Dietrich, C. J.; Grier, C.; Kreibich, C.;
625–642. Vancouver, BC: {USENIX} Association. Paxson, V.; Pohlmann, N.; Bos, H.; and van Steen, M. 2012.
[2013] Kantchelian, A.; Afroz, S.; Huang, L.; Islam, A. C.; Prudent Practices for Designing Malware Experiments: Sta-
Miller, B.; Tschantz, M. C.; Greenstadt, R.; Joseph, A. D.; tus Quo and Outlook. In 2012 IEEE Symposium on Security
and Tygar, J. D. 2013. Approaches to Adversarial Drift. and Privacy, 65–79. IEEE.
In Proceedings of the 2013 ACM Workshop on Artificial [2016] Russu, P.; Demontis, A.; Biggio, B.; Fumera, G.; and
Intelligence and Security, AISec ’13, 99–110. New York, NY, Roli, F. 2016. Secure kernel machines against evasion attacks.
USA: ACM. In Proceedings of the 2016 ACM Workshop on Artificial
[2018] Kolosnjaji, B.; Demontis, A.; Biggio, B.; Maiorca, D.; Intelligence and Security, AISec ’16, 59–69. New York, NY,
Giacinto, G.; Eckert, C.; and Roli, F. 2018. Adversarial USA: ACM.
Malware Binaries: Evading Deep Learning for Malware De- [2001] Schultz, M.; Eskin, E.; Zadok, F.; and Stolfo, S. 2001.
tection in Executables. Data Mining Methods for Detection of New Malicious Exe-
[2006] Kolter, J. Z., and Maloof, M. A. 2006. Learning cutables. In Proceedings 2001 IEEE Symposium on Security
to Detect and Classify Malicious Executables in the Wild. and Privacy. S&P 2001, 38–49. IEEE Comput. Soc.
Journal of Machine Learning Research 7:2721–2744. [2009] Shafiq, M. Z.; Tabish, S. M.; Mirza, F.; and Farooq, M.
[2018] Kreuk, F.; Barak, A.; Aviv-Reuven, S.; Baruch, M.; 2009. PE-Miner: Mining Structural Information to Detect
Pinkas, B.; and Keshet, J. 2018. Adversarial Examples Malicious Executables in Realtime. In Recent Advances in
on Discrete Sequences for Beating Whole-Binary Malware Intrusion Detection. 121–141.
Detection. [2012] Singh, A.; Walenstein, A.; and Lakhotia, A. 2012.
[2017] Li, B.; Roundy, K.; Gates, C.; and Vorobeychik, Y. Tracking Concept Drift in Malware Families. In Proceedings
2017. Large-Scale Identification of Malicious Singleton Files. of the 5th ACM Workshop on Security and Artificial Intelli-
Proceedings of the Seventh ACM on Conference on Data and gence, AISec ’12, 81–92. New York, NY, USA: ACM.
Application Security and Privacy - CODASPY ’17 227–238. [2010] Sommer, R., and Paxson, V. 2010. Outside the Closed
World: On Using Machine Learning for Network Intrusion
[2016] Miller, B.; Kantchelian, A.; Tschantz, M. C.; Afroz, S.;
Detection. In Proceedings of the 2010 IEEE Symposium on
Bachwani, R.; Faizullabhoy, R.; Huang, L.; Shankar, V.; Wu,
Security and Privacy, SP ’10, 305–316. Washington, DC,
T.; Yiu, G.; Joseph, A. D.; and Tygar, J. D. 2016. Reviewer
USA: IEEE Computer Society.
Integration and Performance Measurement for Malware De-
tection. In Proceedings of the 13th International Conference [2014] Spafford, E. C. 2014. Is Anti-virus Really Dead?
on Detection of Intrusions and Malware, and Vulnerability Computers & Security 44:iv.
Assessment - Volume 9721, DIMVA 2016, 122–141. New [2014] Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.;
York, NY, USA: Springer-Verlag New York, Inc. Erhan, D.; Goodfellow, I.; and Fergus, R. 2014. Intriguing
[2008] Moskovitch, R.; Feher, C.; Tzachar, N.; Berger, E.; properties of neural networks. In ICLR.
Gitelman, M.; Dolev, S.; and Elovici, Y. 2008. Unknown [2016] Wressnegger, C.; Freeman, K.; Yamaguchi, F.; and
Malcode Detection Using OPCODE Representation. In Pro- Rieck, K. 2016. From Malware Signatures to Anti-Virus
ceedings of the 1st European Conference on Intelligence and Assisted Attacks.
Security Informatics, EuroISI ’08, 204–215. Berlin, Heidel- [2014] Zeiler, M. D., and Fergus, R. 2014. Visualizing and
berg: Springer-Verlag. understanding convolutional networks. Lecture Notes in Com-
[2016] Papernot, N.; McDaniel, P. D.; and Goodfellow, I. J. puter Science (including subseries Lecture Notes in Artifi-
2016. Transferability in machine learning: from phenom- cial Intelligence and Lecture Notes in Bioinformatics) 8689
ena to black-box attacks using adversarial samples. CoRR LNCS(PART 1):818–833.
abs/1605.07277.