Static Malware Detection & Subterfuge: Quantifying the Robustness of Machine Learning and Current Anti-Virus

William Fleshman1, Edward Raff1,2,3, Richard Zak1,2, Mark McLean1, Charles Nicholas3
1 Laboratory for Physical Sciences, 2 Booz Allen Hamilton, 3 University of Maryland, Baltimore County
{william.fleshman,edraff,rzak,mrmclea}@lps.umd.edu, nicholas@umbc.edu

Copyright © by the paper's authors. Copying permitted for private and academic purposes. In: Joseph Collins, Prithviraj Dasgupta, Ranjeev Mittu (eds.): Proceedings of the AAAI Fall 2018 Symposium on Adversary-Aware Learning Techniques and Trends in Cybersecurity, Arlington, VA, USA, 18-19 October, 2018, published at http://ceur-ws.org

Abstract

As machine-learning (ML) based systems for malware detection become more prevalent, it becomes necessary to quantify the benefits they offer compared to the more traditional anti-virus (AV) systems widely used today. It is not practical to build an agreed-upon test set for benchmarking malware detection systems on pure classification performance. Instead, we tackle the problem with a new testing methodology, in which we evaluate the change in performance on a set of known benign and malicious files as adversarial modifications are applied. The change in performance, combined with the evasion technique used, quantifies a system's robustness against that approach. Through these experiments we show, in a quantifiable way, how purely ML based systems can be more robust than AV products at detecting malware that attempts evasion through modification, but may be slower to adapt in the face of significantly novel attacks.

1 Introduction

The threat and impact of malicious software (malware) has continued to grow every year. The annual financial impact is already measured in the hundreds of billions of dollars (Hyman 2013; Anderson et al. 2013). Simultaneously, there are worries that the classical anti-virus approach may not continue to scale and will fail to recognize new threats (Spafford 2014).

Anti-virus systems were historically built around signature-based approaches. Wressnegger et al. (2016) discussed a number of issues with signatures, but the primary shortcoming is their intrinsically static nature and inability to generalize. Most anti-virus companies have likely incorporated machine learning into their software, but to what extent remains unclear due to the proprietary nature of these products. Regardless, adversaries can still successfully evade current detection systems with minimal effort. The success of methods like obfuscation and polymorphism is evident from their prevalence, and recent work has suggested that the majority of unique malicious files are due to the use of polymorphic malware (Li et al. 2017).

Given these issues and their urgency, machine learning would appear to be a potential solution to the problem of detecting new malware. Malware detection can be directly phrased as a classification problem: given a binary, we define some feature set and learn a binary classifier that outputs either benign or malicious. The output of this model can be calibrated to reach any desired false-positive ratio.

However, one should never switch to a new technology for its own sake. It is necessary to have empirical and quantifiable reasons to adopt a new approach for malware detection. Ideally, this would be based on the accuracy of current anti-virus systems compared to their machine-learning counterparts. In reality, this is non-trivial to estimate, and many arguments currently exist as to how it should be done (Jordaney et al. 2016; Deo et al. 2016; Sommer and Paxson 2010). Designing a corpus and test protocol to determine accuracy at large is hampered by issues like concept drift (Jordaney et al. 2017), label uncertainty, cost (Miller et al. 2016), and correlations over time (Kantchelian et al. 2013; Singh, Walenstein, and Lakhotia 2012).

Toward remedying this issue, we propose a new testing methodology that can highlight a detector's strengths and weaknesses with regard to specific attack types. We first utilize the framework of Biggio, Fumera, and Roli (2014) to specify a model of our adversary. After defining the adversary's goal, knowledge, and capabilities, we look at how difficult it is for an adversary with known malware to evade the malware detection systems under comparison. As examples, we compare two machine learning approaches, a simpler n-gram model and a neural network, to a quartet of production AV systems. While we do not exhaustively test all possible evasion techniques, we show the potential of our protocol using a handful of relevant tests: non-destructive automated binary modification, destructive adversarial attacks, and malicious injection. We specifically consider only black-box attacks that can be applied to any potential malware detector, assuming the adversary is looking for maximal dispersion rather than a targeted attack. This new evaluation protocol is our primary contribution, and a black-box adversarial attack for byte-based malware detectors is our second contribution. The evaluation protocol allows us to draw new conclusions about the potential benefits and weaknesses of two byte-based machine learning malware detectors.

We review previous work in section 2. The experiments and methodology used to compare our ML based systems to anti-virus are given in section 3 and section 4, followed by their results in section 5; we conclude in section 6.
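The calibration point made in the introduction, tuning a detector's score threshold to hit a desired false-positive ratio, can be made concrete with a small sketch. This is our illustration, not code from the paper; the scores and the target rate are hypothetical.

```python
# Sketch (ours, not from the paper) of threshold calibration: pick the
# smallest threshold such that scoring strictly above it flags at most
# a target fraction of a held-out set of benign scores.

def threshold_for_fpr(benign_scores, target_fpr):
    """Return a threshold t so that the fraction of benign scores > t
    is at most target_fpr (fewer if there are ties)."""
    ordered = sorted(benign_scores)
    n = len(ordered)
    allowed = int(target_fpr * n)      # benign files we may tolerate flagging
    if allowed >= n:
        return ordered[0] - 1.0        # everything may be flagged
    return ordered[n - allowed - 1]    # only scores above this are flagged
```

With 100 benign scores and a 5% target, the returned threshold leaves exactly the top five scores flagged (fewer if scores tie at the threshold).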
2 Related Work

Malware detection and analysis has been an area of active research for several years. One class of techniques of particular note is dynamic analysis, where the binary itself is run so that its behavior can be observed. Intuitively, malicious behavior is the best indicator of a malicious binary, making dynamic analysis a popular approach. However, dynamic analysis has many complications. It requires significant effort and infrastructure to accurately reflect user environments, requirements which are often not met in practice (Rossow et al. 2012). Furthermore, the malware author can use a number of techniques to detect dynamic execution, hide behavior, or otherwise obfuscate such analysis. This means effective dynamic analysis systems are often a cat-and-mouse game of technical issues, and can require running binaries across many layers of emulation (Egele et al. 2017).

It is for these reasons that we focus on the static analysis case. This removes the unlimited degrees of freedom provided by actually running code. However, it does not mean the malicious author has no recourse. Below we review the related work in this area.

Questions regarding the security of machine learning based systems have been an active area of research for over a decade (Barreno et al. 2010), but have recently received increased attention. This is in particular due to the success in generating adversarial inputs to neural networks: inputs that induce an incorrect classification despite only minor changes to the input (Szegedy et al. 2014). These approaches generally work by taking the gradient of the network with respect to the input, and adjusting the input until the network's output is altered. This does not directly translate to malware detection on raw bytes, as bytes are discontinuous in nature. That is to say, any change in a single byte's value is an equally "large" change, whereas in images, adjusting a pixel value can result in visually imperceptible changes. Arbitrary byte changes may also produce a binary that does not execute, making it more difficult to apply these attacks in this space (Grosse et al. 2016; Russu et al. 2016). While such attacks on individual models are possible (Kolosnjaji et al. 2018; Kreuk et al. 2018), we are concerned with attacks that can be applied to any detector; those methods are therefore out of the scope of this work.

To overcome the issue of arbitrary byte changes breaking executables, Anderson, Filar, and Roth (2017) developed a set of benign actions that can modify a binary without changing its functionality. Selecting these actions at random allowed malware to evade a ML model 13% of the time; introducing a reinforcement learning agent to select the actions increased this to 16%. Their ML model used features from the PE header as well as statistics on strings and bytes from the whole binary. We replicate this type of attack as one of the comparisons between our ML systems and anti-virus.

Another approach is to modify the bytes of a binary directly, and run the malware detection system after each modification. While this risks breaking the binary, it can still be effective, and it is important to quarantine malware even if it does not execute as intended. Wressnegger et al. (2016) modified each binary one byte at a time, and found that single-byte changes could evade classical AV detectors. They used this approach to find which bytes were important to the classification, and reverse-engineered the signatures used by multiple AV products. We extend this technique in subsection 4.2 to create a new adversarial approach against our models, which finds a contiguous byte region that can be altered to evade detection.

For the machine learning based detection methods, we look at two approaches: byte n-grams and neural networks. The byte n-gram approach was one of the first machine learning methods proposed for malware detection (Schultz et al. 2001), and has received significant attention (Kolter and Maloof 2006). We use the same approach for training such models as presented in Raff et al. (2016). For the neural network approach we use the recently proposed MalConv, which processes an entire binary by using an embedding layer to map bytes to feature vectors, followed by a convolutional network (Raff et al. 2017). Because the embedding layer used by MalConv is non-differentiable with respect to the input bytes, the aforementioned adversarial approaches from image classification are not easily applicable. Similarly, anti-virus products do not provide a derivative for such attacks. This is why we develop a new black-box adversarial approach in subsection 4.2.

Using these modeling methods is beneficial for our analysis because they are trained with minimal assumptions about the underlying data. Other approaches to malware detection that process the PE header (Shafiq et al. 2009) or use disassembly (Moskovitch et al. 2008) exist. We avoid these under the belief that their use of explicit feature sets aids in evading them (e.g., modifying the PE header is easy and allows for direct evasion of models using those features).

We leverage the existing framework of (Biggio, Fumera, and Roli 2014) to model our adversary for each comparison made. This framework prescribes defining the adversary's goals, knowledge, and capabilities with regard to the classifiers.

3 Adversarial Model

For each experiment we specify the model of our adversary by outlining the goals, knowledge, and capabilities available (Biggio, Fumera, and Roli 2014). These make clear the assumptions and use cases presented by each scenario.

The goal of our adversary is the same across all experiments: using a single attack to maximize the misclassification rate of malicious files across many classifiers. The one caveat is that in subsection 4.2 we only go as far as identifying a small portion of the file which can be altered to evade detection; a real attacker would then have to alter those bytes while retaining the malicious functionality of the file. It is unlikely that authors of benign software would seek a malicious classification, so we limit ourselves to experiments on malware only.

We severely constrain the knowledge of our adversary. In all cases we assume the adversary has no knowledge of the classifiers' algorithms, parameters, training data, features, or decision boundaries. This is the best-case scenario for security applications, as knowledge of any of these attributes would increase the sophistication of possible attacks.

Similarly, the experiments outlined in subsection 4.1 and subsection 4.3 require no prior interaction with the classifiers before deploying a malicious file to the wild. In subsection 4.2, the adversary has the capability of querying our n-gram model only. The inner workings of the model remain hidden, but the output of the model is available for deductive reasoning.

These assumptions are restrictive to the adversary's actions and knowledge, but they are important because current adversaries are able to succeed under such restrictions today. This means malware authors can obtain success with minimal effort or expense. Through our evaluation procedure we are able to better understand which techniques and products are most robust to this restricted scenario, and thus will increase the effort that must be expended by malware authors.
4 Experiments and Methodology

We now perform experiments under our proposed testing methodology to compare machine learning models to commercial anti-virus systems. To do so, we compare two machine learning classifiers and four commercial anti-virus products: AV1, AV2, AV3, and AV4. We use these anonymous names to be compliant with their End User License Agreements (EULAs), which bar us from naming the products in any comparative publication. We are prohibited from using internet-based mass scanners, like VirusTotal, for similar reasons. Ideally we could test many products, but with our large corpus of files we limit ourselves to these four, which we believe to be representative of the spectrum of commercially used AV platforms. We purposely did not choose any security products which advertise themselves as being primarily powered by machine learning or artificial intelligence. Additionally, any cloud-based features which could upload our files were disabled, to protect the proprietary nature of our dataset. Our machine learning based approaches are built upon n-grams and neural networks which process a file's raw bytes. We compare these systems on a number of tasks to evaluate their robustness to evasion.

We now describe our new testing protocol. Given a corpus {x1, ..., xn} of n files with known labels yi ∈ {Benign, Malicious} and a detector C(·), we first measure the accuracy of C(xi), ∀i. This gives us our baseline performance. We then use a suite of evasion techniques φ1(·), ..., φK(·) and look at the difference in accuracy between C(xi) = yi and C(φ(xi)) = yi. The difference in these scores tells us the ability of the detector C(·) to generalize past its known data and catch evasive modifications. Any approach for which C(xi) ≠ C(φj(xi)) for many different evasion methods φ which preserve the label yi is an intrinsically brittle detector, even if it obtains perfect accuracy before φ is applied. The raw detection accuracy is not the point of this test.

Each system will have been built using different and distinct corpora of benign and malicious files. For the AV products, these corpora are used to ensure that certain target levels of false positives and false negatives are met. In addition, the training corpus we use for our models is orders of magnitude smaller than what the AV companies will have access to. For this reason we would not necessarily expect our approaches to have better accuracy. The goal is instead to quantify the robustness of any classifier to a specific threat model and attack.

One novel and three pre-existing techniques (the "φ"s) will be used to perform this evaluation. These are not exhaustive, but show how our approach can be used to quantify relative performance differences. Below we review their methodology and the information we can glean from their results. In section 5 we review the results of running these experiments.

Dataset

We take only a brief moment to expound upon the machine learning models and data used for training, as the information disparity with the AV products makes direct comparison difficult. We trained the machine learning models on a large industry dataset discussed in (Raff and Nicholas 2017), which contains approximately 2 million samples balanced almost evenly between benign and malicious. Our test set consists of approximately 80,000 files held out for our post-training experimentation, and is also almost perfectly balanced. This set was not seen by our machine learning models during training, but could contain files previously seen by the anti-virus corporations. The n-gram model follows the procedure used in (Raff et al. 2016), but uses one million features instead of 100,000. The MalConv approach is specified in (Raff et al. 2017). Both of these approaches require no domain knowledge to apply.

4.1 Randomly Choosing Benign Modifications

In this experiment, we use a subset1 of the modifications used by Anderson, Filar, and Roth (2017) to alter malicious files before scanning them with the classifiers. The objective is to test the brittleness of the classifiers' decision mechanisms by making small changes that do not alter the functionality of the malware. This experiment is ideal because it produces a new file that is still functional, and so should have no impact on an ideal malware detector.

There are nine different modification actions; grouped by type they are:

• rename or create new sections
• append bytes to the end of a section or the file
• add an unused function to the import table
• create a new entry point (which jumps to the old entry)
• modify the header checksum, the signature, or debug info

1 A UPX (un)packing option did not operate correctly.
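The modify-and-rescan loop built on these actions can be sketched as follows. This is our reconstruction, not the authors' tooling: only the append-bytes action is implemented, since appending data past a PE's final section generally does not change its behavior; the remaining actions would need a PE-rewriting library, and `detector` stands in for any real classifier.

```python
# Hypothetical sketch of the random-modification experiment of
# subsection 4.1. `detector` returns True when it labels the file
# malicious; only one of the nine actions is implemented here.
import random

def append_overlay(data: bytes, rng: random.Random) -> bytes:
    # "append bytes to the end of the file" from the action list above
    return data + bytes(rng.randrange(256) for _ in range(rng.randrange(1, 513)))

MODIFICATIONS = [append_overlay]  # the full set would hold all nine actions

def try_to_evade(data: bytes, detector, max_rounds: int = 10, seed: int = 0):
    """Apply random functionality-preserving edits until the detector
    flips to benign, or give up after max_rounds."""
    rng = random.Random(seed)
    for rounds in range(1, max_rounds + 1):
        data = rng.choice(MODIFICATIONS)(data, rng)
        if not detector(data):
            return data, rounds   # evaded after this many modifications
    return data, None             # still detected after max_rounds
```

A robust detector should leave the second return value as None for essentially every malicious input.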
Each malicious file was scanned before the experiment to see if it was already being misclassified as benign (a false negative). If a file was correctly classified as malicious, a modification was randomly selected from the set and applied. This process was repeated up to ten times, or until the file evaded the classifier. Approaches that rely too heavily on exact signatures or patterns should suffer in this test, and we would expect any approach that uses purely dynamic features to be unaffected by the modifications.

This experiment is time intensive, and so we use a subset of 1,000 files randomly chosen from our test set. Anderson, Filar, and Roth limited themselves to a small 200-file sample, so our larger test set may provide additional confidence in the results. The original paper proposed using a reinforcement learning algorithm (RLA) to select the modifications in an adversarial manner. We found that including a RLA did not change the results, but did limit our ability to test all products in a timely manner.
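Stepping back, every experiment in this section reports the same quantity: the gap between the accuracy of C(xi) and that of C(φ(xi)). A minimal sketch of that measurement (ours; `detector` and `phi` are placeholders for any real detector and evasion technique):

```python
# Minimal sketch of the paper's evaluation protocol: score a detector on
# a labeled corpus before and after an evasion technique phi is applied
# to each file, and report the accuracy drop as the robustness measure.

def robustness_drop(detector, corpus, phi):
    """corpus: iterable of (file_bytes, label) pairs;
    detector: maps bytes to a label; phi: maps bytes to modified bytes."""
    files = list(corpus)
    before = sum(detector(x) == y for x, y in files) / len(files)
    after = sum(detector(phi(x)) == y for x, y in files) / len(files)
    return before, after, before - after
```

A large drop for a label-preserving φ marks the detector as brittle, regardless of how high its baseline accuracy is.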
4.2 Targeted Occlusion of Important Bytes

The first approach described requires extensive domain-knowledge-based tooling, and would be difficult to adapt to new file types. We introduce a novel approach to find important byte regions of a malicious file given a working detector. Identifying the most important region of a file gives feedback to adversaries, and allows us to occlude that region as an evasion technique.

Our approach works by occluding certain bytes of a file, and then running the malware detector on the occluded file. By finding a contiguous region of bytes that reduces the detector's confidence that a file is malicious, we can infer that the bytes in that region are important to the detector. If occluding a small region causes the detector to change its vote from malicious to benign, then we can infer that the detector is too fragile. In this case it is plausible for a malware author to determine what is stored in the small region detected, and then modify the binary to evade detection.

Manually editing groups of bytes one at a time would be computationally intractable: given a file F of |F| bytes and a desired contiguous region size of β, it would take O(|F|) calls to the detector to find the most important region. We instead develop a binary-search style procedure, outlined in Algorithm 1. Here C(·) returns the malware detector's confidence that a given file is malicious, and D is a source of bytes to use for the occlusion of the original bytes. This approach finds an approximate region of importance in only O(log |F|) calls to the detector C(·). In the event that C(Fl) = C(Fr), ties can be broken arbitrarily, which implicitly covers the boolean decisions returned by an anti-virus product.

Algorithm 1 Occlusion Binary Search
Require: A file F of length |F|, a classifier C(·), target occlusion size β, byte replacement distribution D
 1: split ← |F|/2, size ← |F|/2
 2: start ← 0, end ← |F|
 3: while size > β do
 4:   Fl ← F, Fr ← F
 5:   Fl[split−size:split] ← contiguous sample ∼ D
 6:   Fr[split:split+size] ← contiguous sample ∼ D
 7:   if C(Fl) < C(Fr) then
 8:     split ← split − size/2
 9:     start ← split − size, end ← split
10:   else
11:     split ← split + size/2
12:     start ← split, end ← split + size
13:   size ← size/2
14: return start, end

This method starts with a window size equal to half of the file size. Both halves are occluded in turn, and each occluded file is analyzed by the classifier. The half whose occlusion results in the larger drop in classification confidence is chosen as the region to split at the next step. This binary search through the file continues until the window size is at least as small as the target window size.

The last component of our approach is to specify what method D should be used to replace the bytes of the original file. One approach, which we call Random Occlusion, is to simply select each replacement byte at random. This is similar to prior work on finding the important regions of an image according to an object detector, where randomly blocking out pixels was found to be effective even when the replacement values were nonsensical (Zeiler and Fergus 2014). We also look at Adversarial Occlusion, where we use a contiguous region selected randomly from one of our own benign training files to occlude the bytes in the file under test. Since we use this approach only with malicious files, Adversarial Occlusion is an especially challenging test for our machine learning based methods: they have seen these byte regions before with an explicit label of benign. This test also helps to validate that the approaches are not simply detecting high-entropy regions that "look" packed and then defaulting to a decision of malicious.

To show that our search procedure provides meaningful improvements for the Random and Adversarial Occlusions, we also compare against an Undirected Occlusion approach that eschews the search procedure. In this case we randomly select a region of the binary to occlude with high-entropy random bytes.

Recall that this method identifies the critical region; for a successful attack the adversary would then have to manually alter the portion of the file at that location while maintaining their intended outcome. For this reason, we use a single target window size of 2048 bytes. The resulting occluded area will fall in the range of (1024, 2048] bytes. On average, this restricts the occluded section to under 0.5% of the file size for our testing set. We also limit ourselves to searching for critical regions with our n-gram model. It would be infeasible for an adversary to conduct this search across all possible classifiers, and we would expect there to be some overlap among which regions are critical. This has been shown to be true for gradient-based classifiers, where an adversarial example generated by one model can fool many others (Papernot, McDaniel, and Goodfellow 2016).
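For concreteness, a runnable rendering of the search is given below. This is our sketch, not the authors' code: it phrases the same O(log |F|) idea as repeatedly halving a candidate region and keeping whichever half's occlusion most reduces the malicious score, with D taken to be uniform random bytes.

```python
# Sketch (ours) of the occlusion binary search: occlude each half of the
# current candidate region and recurse into the half whose occlusion
# lowers `confidence` (an estimate of P(malicious)) the most.
import random

def occlusion_search(data: bytes, confidence, beta: int, seed: int = 0):
    """Return (start, end) of an approximately most-important region,
    with end - start in (beta/2, beta], using O(log len(data)) calls."""
    rng = random.Random(seed)
    start, end = 0, len(data)

    def occluded(lo, hi):
        # replace data[lo:hi] with a sample from D (uniform bytes here)
        filler = bytes(rng.randrange(256) for _ in range(hi - lo))
        return data[:lo] + filler + data[hi:]

    while end - start > beta:
        mid = (start + end) // 2
        if confidence(occluded(start, mid)) < confidence(occluded(mid, end)):
            end = mid          # occluding the left half hurt the score more
        else:
            start = mid
    return start, end
```

Ties fall to the right half here, which covers the boolean (0/1 confidence) outputs of an AV product, as noted above.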
4.3 ROP Injection

The last method we consider is of a different variety. Instead of modifying known malicious files in an attempt to evade detection, we inject malicious functionality into otherwise benign applications.

Many techniques for this exist; we used the Return Oriented Programming (ROP) Injector (Poulios, Ntantogian, and Xenakis 2015) for our work. The ROP Injector converts malicious shellcode into a series of special control-flow instructions that are already found inside the file. These instructions are inherently benign, as they already exist in the benign file. The technique patches the binary so as to modify the memory stack such that the targeted instructions, which are non-contiguous in the file, are executed in an order equivalent to the original shellcode. The result is that the functionality of the file is maintained, with the added malicious activity executing right before the process exits. We use the same reverse Meterpreter shellcode as Poulios, Ntantogian, and Xenakis (2015) for our experiment.
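As a toy illustration of the underlying idea, and emphatically not the ROPInjector's actual algorithm, the sketch below scans a host binary for short byte sequences ending in the x86 RET opcode (0xC3) and expresses a hypothetical payload as a chain of offsets of sequences the host already contains.

```python
# Toy gadget finder (ours, greatly simplified): every short byte
# sequence ending in RET (0xC3) that already exists in the host blob
# is a reusable "gadget"; a payload becomes a list of gadget offsets.

def find_gadgets(blob: bytes, max_len: int = 4):
    """Map each byte sequence of length <= max_len ending in RET to the
    offset of its first occurrence in blob."""
    gadgets = {}
    for i, b in enumerate(blob):
        if b == 0xC3:  # RET
            for k in range(1, max_len + 1):
                if i - k + 1 >= 0:
                    gadgets.setdefault(blob[i - k + 1 : i + 1], i - k + 1)
    return gadgets

def chain_for(payload_ops, gadgets):
    """Offsets realising each needed op, or None where the host lacks one."""
    return [gadgets.get(op) for op in payload_ops]
```

The real tool must additionally patch the stack so the chain executes in payload order, and may add instructions non-contiguously when the host lacks a needed gadget, which is why not every binary is injectable.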
We note that not all benign executables in our testing set were injectable. The files must either already contain enough instructions to represent the shellcode, or have room for further instructions to be added in a non-contiguous manner so as to avoid raising suspicion. Additionally, if the malware requires external functionality such as networking capabilities, and these are not inherent to the original executable, then portions of the PE header such as the import table must be adjusted. For these reasons, only 13,560 files were successfully infected.

Poulios, Ntantogian, and Xenakis claim they were able to evade anti-virus over 99% of the time using this technique. The approach should be nearly undetectable with static analysis alone, as all instructions in the file remain benign in nature. This makes it an extremely difficult test for both the machine learning and anti-virus detectors.

5 Results

Before delving into the results, we make explicit that we consider Malware the positive class and Benign the negative class. We remind the reader that not all tested systems are on equal footing. Our n-gram and MalConv models are trained on the same corpus, but we have no knowledge of the corpora used by the AV companies for their products. In all likelihood, they will be using datasets orders of magnitude larger than our own to ensure the quality of their products and to reach a desired false-positive rate. From this perspective alone, we would not necessarily expect the machine learning based methods to have better accuracies, as there is a disparity of information at training time. Our methods are also not optimized for the same low false-positive goal.

The accuracies and metrics of each method are presented in Table 1. Because of the mentioned information disparity between the systems, we do not present this information for direct comparison. Our goal is to use these metrics as a baseline measure of performance for each method, and to look at how these baselines are affected by various evasion techniques.

Table 1. Baseline accuracy and true/false positive/negative results for each model on the test set.

Classifier   TN%    TP%    FN%    FP%    Accuracy%
N-Gram       92.1   98.7    1.3    7.9   95.5
MalConv      90.7   97.2    2.8    9.3   94.1
AV1          94.3   99.5    0.5    5.7   97.0
AV2          99.4   64.9   35.1    0.6   81.6
AV3          98.5   80.5   19.5    1.5   89.2
AV4          93.8   91.9    8.1    6.6   92.6

5.1 Benign Modifications

The results of the experiment described in subsection 4.1 are shown in Figure 1.

Figure 1. Robustness of classifiers to benign modifications (y-axis: files, out of 1,000). The blue bar (left) represents the number of false negatives before any modifications were made. The red bar (center) represents the number of files that were able to evade. The tan bar (right) represents the number of files that were still classified correctly after 10 modifications. Higher is better for the tan bar, and lower is better for all others.

All four anti-virus products were significantly degraded by seemingly insignificant changes to the malware files. The machine learning models were immune to this evasion technique: both models' confidence that these files were malware changed by less than 1% on average, and the few files that did evade were very close to the models' decision boundaries before the modifications were performed. The anti-virus products did not perform as well in this test. AV3 and AV4 were the most robust to these modifications, with 130 and 215 files evading them respectively. While this is better than products like AV1, where 783 out of 1,000 malicious files were able to evade, it is still an order of magnitude worse than the 3 and 20 files that could evade the n-gram and MalConv models, respectively.

Given these results, one might wonder: could the ML based models be evaded by this approach if more modifications were performed? In particular, a number of the modifications have multiple possible results for a single action, so re-applying them is not unreasonable. To answer this question, we look at the evasion rate as a function of the number of modifications performed in Figure 2.

Figure 2. The number of files that evade increases as the number of modifications increases (x-axis: number of modifications performed; y-axis: percentage of files evading). Lower is better.

In this plot we can see that all four AV products have an upward-trending slope as the number of modifications is increased. In contrast, the n-gram model is completely unchanged from 3 or more modifications, and MalConv from 4 or more. This would appear to indicate that the ML approaches are truly immune to this evasion technique, whereas the AVs are not.

These results also speak to the use of dynamic evaluation, or lack thereof, in the AV products at scan time. Any dynamic analysis based approach should be intrinsically immune to the modifications performed in this test. Because we see all AV products fail to detect 13-78% of modified malware, we can postulate that if any dynamic analysis is being done, its application is ineffectual.

We note that in the original work of Anderson, Filar, and Roth (2017), this attack was applied to a simple machine learning model that used only features from the PE header. The attack was able to evade that PE-feature based model, but not the byte based models tested under our framework.

5.2 Targeted Occlusion

The experiment detailed in subsection 4.2 attempts to modify the most important region for detecting the maliciousness of a binary, as deduced from querying only the n-gram model, and then attempts to transfer that knowledge to all other classifiers.

The results for the byte occlusion tests are given in Figure 3, where the blue bar shows the accuracy on 40,000 malicious files when no bytes are occluded.

Figure 3. Robustness to byte level manipulations (y-axis: files, ×10^4). Black dashed line indicates the maximum. The blue bar (far left) represents the number of files classified correctly before occlusion. The red bar (left center) represents the number of files classified correctly after random occlusion. The tan bar (center right) represents the number of files classified correctly after targeted occlusion using random bytes. The dark gray bar (far right) represents the number of files classified correctly after targeted occlusion using bytes from known good files.

For both the n-gram and MalConv models, we see that the Random, Undirected, and Adversarial occlusion attacks have almost no impact on classification accuracy. In the case of the n-gram model, only 0.13% of the files were able to evade its detection after occlusion. Again, evasions were highly correlated with files close to the decision boundary.

In particular, we remind the reader that the Adversarial occlusion replaces up to 2KB of the malicious file with bytes taken from benign training data. This is designed to maximally impact the ML approaches, as the models were trained with those bytes carrying an explicit label of benign. Yet the Adversarial choice has almost no impact on results. This is a strong indicator of the potential robustness of the ML approach, and shows that simply adding bytes from benign programs is not sufficient to evade detection.

Considering the AV products, we again see a different story. AV1 had the highest detection rate when no occlusion occurred, but also had the largest drop in detection rate for all three types of occlusion. AV4 had the best performance among the AV vendors, but was still impacted by the occlusion of bytes and did not perform as strongly as the ML models. From the AV results it is also clear that the Targeted Random occlusion is an improvement over the Undirected occlusion of random bytes, showing that Algorithm 1 is effective at identifying important regions. This was not discernible from the n-gram and MalConv models, which are relatively immune to this scenario.

Although on average the occlusion was limited to 0.5% of a file's bytes, we acknowledge that one could argue this transformation may have altered the true label of the file: the search could be removing the malicious payload of the application, rendering it muted (if runnable at all). We accept this possibility under the assumption that the adversary now has enough information to alter this small portion of the file in order to break possible signatures.

The results from the machine learning models would suggest that maliciousness is not contained to a small contiguous section of most files.
In particular, we remind the reader that the Adversarial occlusion replaces up to 2KB of the malicious file with bytes taken from benign training data. This is designed to maximally impact the ML approaches, as the model was trained with the given bytes as having an explicit label of benign. Yet, the Adversarial choice has almost no impact on results. This is a strong indicator of the potential robustness of the ML approach, and that simply adding bytes from benign programs is not sufficient to evade their detection.

Considering the AV products, we again see a different story. AV1 had the highest detection rate when no occlusion occurred, but also had the largest drop in detection rate for all three types of occlusion. AV4 had the best performance among the AV vendors, but was still impacted by the occlusion of bytes and did not perform as strongly as the ML models.

The results from the machine learning models would suggest that maliciousness is not contained to a small contiguous section of most files. This makes intuitive sense, as malicious functionality being employed in executable sections of a file might also require functions to be referenced in the import table. The non-contiguous nature of the PE file format allows for many similar circumstances, and so we believe it unlikely that all of a file's maliciousness would be contained within a small contiguous region of a binary. In addition, this results in binaries that are analogous to deployed malware with bugs. Just because malware may not function properly, and thus not negatively impact users, doesn't remove its malicious intent or the benefit in detecting it.

5.3 ROP Injection Results

In Table 2 we show the results of running the ROPInjector on the benign test files. The results are shown only for the 13,560 files that the ROPInjector was able to infect. Before infection, the Pre-ROP Accuracy column indicates the percentage of files correctly labeled benign. After infection, the Post-ROP Accuracy column indicates what percentage were correctly labeled as malicious. The Post-ROP Lift then shows how many percentage points of Post-ROP Accuracy came from the classifier correctly determining that a formerly benign file is now malicious, rather than simply being a false-positive. That is to say, if a model incorrectly called a benign file malicious, it is not impressive when it calls an infected version of the file malicious as well.

Table 2. Accuracy on originally benign binaries before and after applying the ROPInjector.

Classifier  Pre-ROP Accuracy  Post-ROP Accuracy  Post-ROP Lift
N-Gram                  85.1               15.3            0.4
MalConv                 82.4               18.8            1.2
AV1                     99.3                1.3            0.6
AV2                     98.7                1.2           −0.1
AV3                     97.9                0.7           −1.4
AV4                     89.2               32.9           22.1

Most malware detectors had significant difficulty with this task.
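The Lift column is consistent with a simple identity: Lift = Post-ROP Accuracy − (100 − Pre-ROP Accuracy), i.e., detections on infected files minus the files the detector had already (incorrectly) flagged before infection. This is our reading of the column's description, not a formula the paper states explicitly; a quick check reproduces every row of Table 2:

```python
def post_rop_lift(pre_acc: float, post_acc: float) -> float:
    """Percentage points of Post-ROP Accuracy not attributable to
    pre-existing false positives on the uninfected file."""
    return round(post_acc - (100.0 - pre_acc), 1)

# (Pre-ROP Accuracy, Post-ROP Accuracy) per classifier, from Table 2.
table = {"N-Gram": (85.1, 15.3), "MalConv": (82.4, 18.8), "AV1": (99.3, 1.3),
         "AV2": (98.7, 1.2), "AV3": (97.9, 0.7), "AV4": (89.2, 32.9)}
lifts = {name: post_rop_lift(*accs) for name, accs in table.items()}
# e.g. lifts["AV4"] == 22.1 and lifts["AV3"] == -1.4
```

Under this reading, a negative Lift (AV2, AV3) means some files the detector flagged while benign were no longer flagged once infected.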
The machine learning based models, n-gram and MalConv, showed only modest improvements of 0.4 and 1.2 percentage points. AV1 had a similarly small 0.6 improvement, but surprisingly AV2 and AV3 had negative improvements. That means for 1.4% of the benign files, AV3 called the original benign file malicious, but when that same file was infected by the ROPInjector, AV3 changed its decision to benign. This is a surprising failure case, and may be caused by the AV system relying too heavily on a signature based approach (i.e., a signature had a false-positive on the benign file, but the infection process broke the signature, turning a would-be true positive into a false-negative).

Overall these models showed no major impact from the ROPInjector. Only AV4 was able to significantly adjust its decision, correctly labeling 22.1% of the infected files as malicious. Though the evasion rate for AV4 remains high at 77%, it performed best in this scenario. Given the muted and negative performance of the other AV systems in this test, we suspect AV4 has developed techniques specifically for the ROPInjector approach.

We also notice that the benign files that were injectable include 67% of our original false positives for the n-gram model. We suspect that this is due to those files already containing the networking functionality required by the shellcode. The majority of malware requires communication over the Internet, therefore our machine learning detectors may view networking related features as somewhat malicious in nature.

Overall, the ROPInjection of malicious code into otherwise benign applications presents an apparent weakness for our machine learning approaches. An important question is whether ROPInjection is functionally invisible to the machine learning based models (i.e., they could never detect features of the injection), or is simply not sufficiently represented in our training data for the models to learn. This question was tested by applying the ROPInjector to all of our training data, both benign and malicious files, which resulted in 235,865 total ROPInjected binaries. We then trained an n-gram model to distinguish between ROPInjected and non-ROPInjected binaries. The n-gram model was able to obtain 97.6% accuracy at this task, with an AUC of 99.6%. This indicates that the models could learn to detect ROPInjection, but that it is not sufficiently prevalent in our training corpus for them to have done so.

Overall, this case highlights a potential advantage of the classical AV systems with extensive domain knowledge. When sufficiently novel attacks are developed for which current models have almost no predictive power, it may be easier to develop and deploy new signatures to catch the attacks, while the machine learning approaches may require waiting for data, or creating synthetic data (which has its own risks), to adjust the model.

6 Conclusion

We have demonstrated a new testing methodology for comparing the robustness of machine learning classifiers to current anti-virus software. Furthermore, we have provided evidence that machine learning approaches may be more successful at catching malware that has been manipulated in an attempt to evade detection. Anti-virus products do a good job of catching known and static malicious files, but their rigid decision boundaries prevent them from generalizing to new threats or catching evolutionary malware. We demonstrated that top tier anti-virus products can be fooled by simple modifications, including changing a few bytes or importing random functions. The machine learning models appear to better generalize maliciousness, leading to an increased robustness to evasion techniques compared to their anti-virus counterparts.

References

[2013] Anderson, R.; Barton, C.; Böhme, R.; Clayton, R.; van Eeten, M. J. G.; Levi, M.; Moore, T.; and Savage, S. 2013. Measuring the Cost of Cybercrime. Berlin, Heidelberg: Springer Berlin Heidelberg. 265–300.

[2017] Anderson, H. S.; Filar, B.; and Roth, P. 2017. Evading Machine Learning Malware Detection. Black Hat USA.

[2010] Barreno, M.; Nelson, B.; Joseph, A. D.; and Tygar, J. D. 2010. The security of machine learning. Machine Learning 81(2):121–148.

[2014] Biggio, B.; Fumera, G.; and Roli, F. 2014. Security evaluation of pattern classifiers under attack. IEEE Transactions on Knowledge and Data Engineering 26(4):984–996.

[2016] Deo, A.; Dash, S. K.; Suarez-Tangil, G.; Vovk, V.; and Cavallaro, L. 2016. Prescience: Probabilistic Guidance on the Retraining Conundrum for Malware Detection. In Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security, AISec '16, 71–82. New York, NY, USA: ACM.

[2017] Egele, M.; Scholte, T.; Kirda, E.; and Barbara, S. 2017. A Survey On Automated Dynamic Malware Analysis Evasion and Counter-Evasion. In Proceedings of the Reversing and Offensive-oriented Trends Symposium.

[2016] Grosse, K.; Papernot, N.; Manoharan, P.; Backes, M.; and McDaniel, P. D. 2016. Adversarial perturbations against deep neural networks for malware classification. CoRR abs/1606.04435.

[2015] Poulios, G.; Ntantogian, C.; and Xenakis, C. 2015. ROPInjector: Using Return-Oriented Programming for Polymorphism and AV Evasion. In Black Hat USA.
[2013] Hyman, P. 2013. Cybercrime. Communications of the ACM 56(3):18.

[2016] Jordaney, R.; Wang, Z.; Papini, D.; Nouretdinov, I.; and Cavallaro, L. 2016. Misleading Metrics: On Evaluating Machine Learning for Malware with Confidence. Technical report, University of London.

[2017] Jordaney, R.; Sharad, K.; Dash, S. K.; Wang, Z.; Papini, D.; Nouretdinov, I.; and Cavallaro, L. 2017. Transcend: Detecting Concept Drift in Malware Classification Models. In 26th USENIX Security Symposium (USENIX Security 17), 625–642. Vancouver, BC: USENIX Association.

[2013] Kantchelian, A.; Afroz, S.; Huang, L.; Islam, A. C.; Miller, B.; Tschantz, M. C.; Greenstadt, R.; Joseph, A. D.; and Tygar, J. D. 2013. Approaches to Adversarial Drift. In Proceedings of the 2013 ACM Workshop on Artificial Intelligence and Security, AISec '13, 99–110. New York, NY, USA: ACM.

[2018] Kolosnjaji, B.; Demontis, A.; Biggio, B.; Maiorca, D.; Giacinto, G.; Eckert, C.; and Roli, F. 2018. Adversarial Malware Binaries: Evading Deep Learning for Malware Detection in Executables.

[2006] Kolter, J. Z., and Maloof, M. A. 2006. Learning to Detect and Classify Malicious Executables in the Wild. Journal of Machine Learning Research 7:2721–2744.

[2018] Kreuk, F.; Barak, A.; Aviv-Reuven, S.; Baruch, M.; Pinkas, B.; and Keshet, J. 2018. Adversarial Examples on Discrete Sequences for Beating Whole-Binary Malware Detection.

[2017] Li, B.; Roundy, K.; Gates, C.; and Vorobeychik, Y. 2017. Large-Scale Identification of Malicious Singleton Files. In Proceedings of the Seventh ACM Conference on Data and Application Security and Privacy, CODASPY '17, 227–238.

[2016] Miller, B.; Kantchelian, A.; Tschantz, M. C.; Afroz, S.; Bachwani, R.; Faizullabhoy, R.; Huang, L.; Shankar, V.; Wu, T.; Yiu, G.; Joseph, A. D.; and Tygar, J. D. 2016. Reviewer Integration and Performance Measurement for Malware Detection. In Proceedings of the 13th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment - Volume 9721, DIMVA 2016, 122–141. New York, NY, USA: Springer-Verlag New York, Inc.

[2008] Moskovitch, R.; Feher, C.; Tzachar, N.; Berger, E.; Gitelman, M.; Dolev, S.; and Elovici, Y. 2008. Unknown Malcode Detection Using OPCODE Representation. In Proceedings of the 1st European Conference on Intelligence and Security Informatics, EuroISI '08, 204–215. Berlin, Heidelberg: Springer-Verlag.

[2016] Papernot, N.; McDaniel, P. D.; and Goodfellow, I. J. 2016. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. CoRR abs/1605.07277.

[2017] Raff, E., and Nicholas, C. 2017. Malware Classification and Class Imbalance via Stochastic Hashed LZJD. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec '17, 111–120. New York, NY, USA: ACM.

[2016] Raff, E.; Zak, R.; Cox, R.; Sylvester, J.; Yacci, P.; Ward, R.; Tracy, A.; McLean, M.; and Nicholas, C. 2016. An investigation of byte n-gram features for malware classification. Journal of Computer Virology and Hacking Techniques.

[2017] Raff, E.; Barker, J.; Sylvester, J.; Brandon, R.; Catanzaro, B.; and Nicholas, C. 2017. Malware Detection by Eating a Whole EXE. arXiv preprint arXiv:1710.09435.

[2012] Rossow, C.; Dietrich, C. J.; Grier, C.; Kreibich, C.; Paxson, V.; Pohlmann, N.; Bos, H.; and van Steen, M. 2012. Prudent Practices for Designing Malware Experiments: Status Quo and Outlook. In 2012 IEEE Symposium on Security and Privacy, 65–79. IEEE.

[2016] Russu, P.; Demontis, A.; Biggio, B.; Fumera, G.; and Roli, F. 2016. Secure kernel machines against evasion attacks. In Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security, AISec '16, 59–69. New York, NY, USA: ACM.

[2001] Schultz, M.; Eskin, E.; Zadok, F.; and Stolfo, S. 2001. Data Mining Methods for Detection of New Malicious Executables. In Proceedings 2001 IEEE Symposium on Security and Privacy, S&P 2001, 38–49. IEEE Computer Society.

[2009] Shafiq, M. Z.; Tabish, S. M.; Mirza, F.; and Farooq, M. 2009. PE-Miner: Mining Structural Information to Detect Malicious Executables in Realtime. In Recent Advances in Intrusion Detection, 121–141.

[2012] Singh, A.; Walenstein, A.; and Lakhotia, A. 2012. Tracking Concept Drift in Malware Families. In Proceedings of the 5th ACM Workshop on Security and Artificial Intelligence, AISec '12, 81–92. New York, NY, USA: ACM.

[2010] Sommer, R., and Paxson, V. 2010. Outside the Closed World: On Using Machine Learning for Network Intrusion Detection. In Proceedings of the 2010 IEEE Symposium on Security and Privacy, SP '10, 305–316. Washington, DC, USA: IEEE Computer Society.

[2014] Spafford, E. C. 2014. Is Anti-virus Really Dead? Computers & Security 44:iv.

[2014] Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2014. Intriguing properties of neural networks. In ICLR.

[2016] Wressnegger, C.; Freeman, K.; Yamaguchi, F.; and Rieck, K. 2016. From Malware Signatures to Anti-Virus Assisted Attacks.

[2014] Zeiler, M. D., and Fergus, R. 2014. Visualizing and understanding convolutional networks. Lecture Notes in Computer Science 8689 LNCS(PART 1):818–833.