<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Feature Selection for Malware Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mitchell Mays</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Noah Drabinsky</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dr. Stefan Brandle</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Taylor University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>165</fpage>
      <lpage>170</lpage>
      <abstract>
        <p>In applying machine learning to malware identification, different types of features have proven to be successful. These features have also been tested with different kinds of classification methodologies and have had varying degrees of success. Every time a new machine learning methodology is introduced for classifying malware, there is the potential for increasing the overall quality of malware classification in the field. Even new classifiers with the same accuracy as those used previously can be combined using one of a few different ensemble techniques sharpen the classification and raise the accuracy to new heights. For our purposes, we have attempted to create a coalition of classifiers which each use different features. These classifiers when trained, provide multiple angles to the same problem and can be used to test ensemble techniques. Eventually, such an ensemble of individual malware classifiers could create a highly precise means of picking out malware from other software. Specifically, we have created a convolutional neural network which processes byte data as an image, and a deep feed forward neural network which utilizes opcode N-gram features. Both of these classifiers, while not perfect, provide a significant level of classification. They achieve this independently of one another, and when combined, they each contribute enough to improve the final accuracy. The majority of the effort in this research was placed on gathering the N-gram features, a time and resource intensive process. Tinkering with the parameters or structure of classifiers could provide further improvements to the system.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The increased use of technology by citizens in
recent decades has made society dependent on it.
Technology is not going away anytime soon, and
since software has become so entangled in daily
life (e.g. monetary transactions, social media, and
storage of personal information), it has become
an increasingly prevalent target for malicious
individuals. With highly motivated and intelligent
attackers creating malware, many of the simple
defenses currently in place are now insufficient to
protect against them. The common anti-virus
softwares are unable to guarantee protection
against new malware, even if that malware has
only been slightly modified from one that is
similar.</p>
      <p>With this in mind research has turned to
machine learning techniques as a possible
solution. Anti-virus softwares rely on specific
malware signatures (longer byte sequences that
are only found in a particular malware executable
or highly similar executables). These signatures
are a sure sign that a file is a malware.
Unfortunately, the trade off of such high
confidence is a failure to generalize well to new or
unseen malware. Minor changes to a malware can
disrupt the signature and allow it to slip past
detection. Another method used in current
antivirus software is a collection of heuristics created
by experts. These painstakingly created rules
attempt to capture the essence of a type of
malware. They can do better at generalizing, but
the effort needed to keep an anti-virus software
up-to-date with new rules for each new type of
malware is unsustainable. Machine learning
attempts to identify particular indicative features
of a malware that can be generalized across all
malicious software of its kind.</p>
      <p>Previous machine learning attempts have
used shorter byte or assembly opcode sequences
as a feature. These sequences, called n-grams, are
representative of specific actions taken within the
execution of a malware. The same actions are
replicated in all malware that are in the same
family.</p>
      <p>For the sake of identifying malware, it is
helpful to organize them by family. Each family
of malware has a similar course of action or
purpose. For example, a Keylogger is a type of
malware that records all of the text typed into a
keyboard, reporting this information back to the
attacker who can extract usernames, passwords,
and other important data from it. Although two
Keyloggers may accomplish their tasks in unique
ways from one another, they both rely on the
same essential components to do their work.
Therefore, identifying malware by family allows
for a system to find multiple types of malware
without treating them all as one identical
classification. Furthermore, if a particular family
of malware can be accurately classified, then the
features used to classify it are clear indicators of
that specific malware family. Some digging may
then uncover how that feature correlates to the
malware family in question, providing more
insight into how attackers are doing their work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>In recent years, the topography of malware
research has been churning. Due to a recent surge
of success, as well as the nature of the problem,
machine learning techniques have served as a
backbone for a significant portion of this
research. The quantity of effort put forth recently
has stemmed from the multiple large scale and
damaging malware attacks. Under the public eye,
funding and attention have turned to a more
innovative solution to the persistent problem.
These recent years have brought about many
fruitful developments, and have allowed for the
production of better protection against malware to
be available to businesses and tech-owners.</p>
      <p>Schultz, Eskin, Zadok, and Stolfo [1] were
the first to publish a work attempting to apply
machine learning to malware. They looked at
executable files and found features for
classification within. Their system used a Naïve Bayes
classifier. By training on general indicators from
the executable file, they ended up with a system
that was significantly more accurate than an
antivirus, although it had potential to falsely identify
benign software as malicious.</p>
      <p>
        As with many developments in machine
learning, a lot of effort initially was spent on
feature discovery. Many components of the
executable file were tested as features, including
gathering n-grams from the bytes. N-grams were
first used in text classification, but Abou-Assaleh,
Cercone, Keˇselj, and Sweidan [
        <xref ref-type="bibr" rid="ref1">2</xref>
        ] applied this
technique for malware classification. They
extracted these specific sequences of bytes from
the executable files and used their frequency
within the file as a feature. The combined power
of these n-gram features served as a good
classification method for malware.
      </p>
      <p>
        In more recent years, with a successful set
of features for classifying, research has been
directed toward different classifier alternatives.
This set of possible classifiers including support
vector machines, k-nearest neighbor, and artificial
neural networks have performed at varying
degrees, but many have formed a high accuracy
classification system. In the pursuit of the most
accurate classifier possible, new research has
offered up an ensemble classifier [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. An
ensemble classifier is a classification system
which utilizes multiple classifiers to come to a
final class decision. The benefit of using multiple
classifiers is that where one classifier fails, others
may succeed, further cutting down error rates.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Preprocessing</title>
      <sec id="sec-3-1">
        <title>For this research we used a dataset made available by Microsoft through the machine</title>
        <p>
          learning website Kaggle. It includes nine different
malware families and over 10,000 malwares.
Based off of this dataset, we chose to create a
classifier to distinguish between the different
malware families. The winner of the Kaggle
competition provided a paper with the techniques
that they used [
          <xref ref-type="bibr" rid="ref6">5</xref>
          ]. This served as a launching
point for our research. From it, we settled on
using a convolutional neural network to treat byte
data as an input image, and opcode n-grams for
use in a more standard neural network.
        </p>
        <p>Each of the 10,000 malwares provided,
had both a byte file and an assembly file, which
has been disassembled from the byte file. The
byte file was used to create the byte “images” in
the following manner. Each byte was converted
into a pixel value to be used in the black and
white image. Since each line of the byte file is
sixteen bytes long, the image created represents
this same shape: sixteen pixel wide and sixty-four
pixels tall.
Figure one shows how a byte image is used as an
input to the convolutional neural network. The
network performs convolutions on the pixels
creating a matrix of smaller convolved images.
As it passes through each layer of the network,
the information from the image is condensed and
routed into a final output layer, giving a value to
each possible class node. The class node with the
maximum value is the projected class given by
the network. Our network had nine class nodes
corresponding with the nine possible malware
families.</p>
        <p>
          For the second classifier in our coalition,
the features were opcode n-grams. One possible
n-gram may have been: “mov eax, [ebp+arg_0]”.
This is a 3-gram, a sequence of three consecutive
data points in the assembly code. For each
malware, the number of times that this 3-gram
occurred in its assembly would be an input to the
classifier. Due to results from Abou-Assaleh, T.,
N. Cercone, V. Keselj, and R. Sweidan [
          <xref ref-type="bibr" rid="ref1">2</xref>
          ] as well
as findings from Xiaozhou Wang, Jiwei Liu, and
Xueer Chen [
          <xref ref-type="bibr" rid="ref6">5</xref>
          ], we settled on n-grams ranging
from two to four data points.
        </p>
        <p>The feature gathering for this classifier
was more complicated than for the convolutional
neural network. A method was needed to identify
certain n-grams as possible features and then to
decide which of those chosen n-grams would
provide the most classification power. There were
four stages to this process.</p>
      </sec>
      <sec id="sec-3-2">
        <title>1. 1-gram Selection</title>
        <p>2. 1-gram Paring
3. n-gram Selection
4. n-gram Paring</p>
      </sec>
      <sec id="sec-3-3">
        <title>1. 1-gram Selection</title>
        <p>
          Unless intending to evaluate every possible two
to four-gram of operations in an assembly file (a
number easily in the billions) we needed a
criterion to select particular n-grams. One way of
doing this is to build larger n-grams off of
particular 1-grams. Based off of the work by
Xiaozhou Wang, Jiwei Liu, and Xueer Chen [
          <xref ref-type="bibr" rid="ref6">5</xref>
          ],
we mimicked the execution of the assembly,
following every unconditional jump and counting
the frequency of each 1-gram. If it occured at
least two hundred times in a file it was considered
a valid 1-gram.
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>2. 1-gram Paring</title>
        <p>Even after using a specific criterion to limit the
number of 1-grams selected, there were too many
to build n-grams off of. We used a few further
criteria to further pare down the list of 1-grams.
First, we removed 1-grams with only one
character. We then removed 1-grams that started
with a register. While we could assume that the
malware would be manipulating the registers, we
intuited that having the registers themselves be
the start of an n-gram would be unhelpful since it
gives no vision into how they are being
manipulated. Further paring involved removing
items that began with “loc_” or “sub_” as they
are somewhat file specific, relying on file
locations, and are, therefore, not merely specific
to a type of malware.</p>
        <p>These heuristic decisions were made due
to the limitations of my system being unable to
handle the enormous number of n-grams that
would be generated from a larger set of 1-grams.
It is possible that the 1-grams that were removed
could have produced potent n-grams for
classification, but based on our reasoning, they
were less likely to. After all of paring was
complete we were left with 855 different
1grams.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3. n-gram Selection</title>
        <p>With the 1-grams selected, we ran through all of
the files a second time, grabbing every possible 2
through 4-gram that began with a 1-gram in the
list. Upon completion there were 14,566,780
ngrams. Each with a particular frequency per
malware file.</p>
      </sec>
      <sec id="sec-3-6">
        <title>4. n-gram Paring</title>
        <p>The most important step to generating an accurate
classifier is to select the n-grams from the 14
million that will actually aid in separating some
families of malware from others. The best way to
evaluate the benefit of an individual feature as an
element of a classifier is to test its ability to
classify on its own. If a simple classifier trained
on the individual feature is able to classify the
malwares in a way that exceeds randomness then
that feature would be helpful to include in a larger
classifier.</p>
        <p>Information gain is an algorithm that
encapsulates this idea. First, it calculates the
entropy in a dataset. Then, it classifies that dataset
based on some feature. For each class, the
algorithm calculates the entropy of the items
within. Subtracting the average entropy of the
each designated class from the original entropy
gives the information gain, or the improvement in
classification ability based off of that feature.</p>
        <p>In figure two it is clear that the feature
helps to distinguish between the two classes, and
the entropy in each new class evidences that. By
examining the change in entropy, the helpfulness
of a feature is revealed. Unfortunately, to train a
classifier on 14 million possible features is a
lengthy process. In order to compensate for this I
took two different measures.</p>
        <p>First, rather than running the information
gain on each individual feature, we split the list of
features into batches of five. Training the
miniclassifier on five features instead of one allowed
for a quick check to see if any of the five features
were helpful. Even with four unhelpful features,
we assumed that after training, the one helpful
feature would still significantly effect
classification, leading to a noticeable information gain.</p>
        <p>Second, we distributed the work across
several different machines in Taylor University’s
Computer Science Department. By re-purposing a
bit of code written for distributing graphics
rendering tasks, we were able to create a task list
of n-grams to run through the algorithm and
return the result to a central server. Even with up
to sixty-two machines churning through the
features, it took three full days to complete the
information gain on the 14 million n-grams.</p>
        <p>We iterated through the information gain
values for each set of five features, finding each
set with a value above zero. There were
approximately 72,000 sets of five which provided a
positive information gain. We split these sets of
five into individual tasks and re-ran the
information gain test. We were left with 10,985
individual n-grams, to use as features.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Testing</title>
      <p>
        In order to test the the n-gram features, we used a
deep feed-forward neural network. This network
took 10,985 inputs, one for each n-gram. It had
two hidden layers, each with 7,300 nodes. The
output had nine nodes, one for each family of
malware. The structure of the network was based
off of a tutorial by Nathan Lintz [
        <xref ref-type="bibr" rid="ref8">6</xref>
        ], and was built
using the TensorFlow library.
      </p>
      <p>The convolutional neural network was
also built in the TensorFlow library, taking an
input of a 16x64 pixel image. It held four
convolutional layers and finished with nine
output nodes. In order to prevent over-fitting,
both networks employed dropout at every layer.
Dropout ignores a percentage of the weights in
the neural network during training. This prevents
the network from hyper fitting to the training data
and, therefore, failing on the testing data.</p>
      <sec id="sec-4-1">
        <title>The testing methodology that was used for</title>
        <p>both of these networks was 10-fold cross
validation. The entire set of 10,000 malwares was
shuffled and split into ten separate sets with 90%
of files used for training and 10% for testing.
Each iteration of cross validation took place in
the following manner:</p>
        <p>First, both networks were trained to
completion on the training set. Next, the networks
attempted to classify the test sets. The features
(byte image or n-gram frequency) were used as
inputs to the classifier and the output was a set of
float values assigned to the nine class nodes. The
node with the highest value was the network’s
classification. Finally, we utilized the output float
values from each classifier to create an ensemble
classifier. To get a final classification from the
ensemble classifier we multiplied the accuracy of
each classifier on the test set by its output values
for each malware. We summed these class values
for each classifier. This gave a weighted
consensus between the two classifiers.</p>
        <p>The feed-forward neural network trained
on the malware n-grams twenty times, and the
convolutional neural network trained on the
16x64 pixel images fifty times. We chose the
number of iterations after overseeing training.
Each network took approximately that number of
iterations to settle.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>After running the cross validations on both of the
networks, we were left with a final average
accuracy for each. The deep feed-forward neural
network using n-gram features trained to be
96.7% accurate. The convolutional neural
network, on the other hand, trained to 88.5%
accuracy on the test data. By multiplying the
nodes by these accuracies, we created weighted
nodes. Combining these nodes as an ensemble
classifier returned an accuracy of 97.7%.</p>
      <p>The training time for the feed-forward
neural network was much longer than the training
for the convolutional neural network. This was
expected since the feed-forward neural network
was training weights for every input node rather
than the training the smaller convolutions.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>
        The final outcome of my research was an
ensemble classifier with 97.7% accuracy on the
10,000 malwares. This value is one percent
higher than the best individual classifier in the
coalition (96.7% percent from the feed-forward
neural network). Yi-Bin, Shu-Chang Din,
ChaoFu Zheng, and Bai-Jian Gao [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] showed that
simpler classifiers such as support vector
machines and decision trees could be used as
components to build an improved ensemble
classifier. Our work shows that neural networks
can be used as constituents to an ensemble
classifier and still create an improved
classification.
      </p>
      <p>While limited in our time to tinker with
the networks and improve their performance, the
principle that ensembles improve the overall
classification will most likely still hold as the
individual classifiers become more accurate.
Furthermore, adding more classifiers to the
ensemble should add to the accuracy of the
system as a whole.</p>
    </sec>
    <sec id="sec-7">
      <title>Future Work</title>
      <p>Ultimately, the use of ensembles of complex
classifiers has barely scratched the surface of
what is possible. Feature gathering took up a
large portion of research time, and there is plenty
more to be done on the classifiers. Manipulating
the structure and number of layers in both the
convolutional neural network and feed-forward
neural network could drastically effect the
performance of the classifiers. Adding new
classifiers also has the potential to add
significant improvements to the malware classification.
Now that using an ensemble method has been
applied to neural networks with success, it is
worth investing more time to perfect this method
and compare it to ensembles which depend on
simpler classifiers.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>[1]. Schultz, M.g., E. Eskin, F. Zadok, and S.j. Stolfo.
"Data Mining Methods for Detection of New
Malicious Executables." Proceedings 2001 IEEE
Symposium on Security and Privacy. S&amp;P 2001 (n.d.):
n. pag. Web.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [2]. Abou-Assaleh,
          <string-name>
            <surname>T.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cercone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Keselj</surname>
          </string-name>
          , and
          <string-name>
            <surname>R.</surname>
          </string-name>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Sweidan</surname>
          </string-name>
          .
          <article-title>"N-gram-based Detection of New Malicious Code."</article-title>
          <source>Proceedings of the 28th Annual International Computer Software and Applications Conference</source>
          ,
          <year>2004</year>
          .
          <source>COMPSAC</source>
          <year>2004</year>
          .
          <article-title>(n.d</article-title>
          .): n. pag. Web.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]. Lu, Yi-Bin,
          <string-name>
            <surname>Shu-Chang</surname>
            <given-names>Din</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chao-Fu Zheng</surname>
          </string-name>
          , and
          <string-name>
            <surname>Bai-Jian Gao</surname>
          </string-name>
          .
          <article-title>"Using Multi-Feature and Classifier Ensembles to Improve Malware Detection."</article-title>
          <source>Journal of Chung Cheng Institute of Technology 39.2</source>
          (
          <year>2010</year>
          )
          <article-title>: n. pag</article-title>
          . Web.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]. Kolter, Jeremy
          <string-name>
            <given-names>Z.</given-names>
            , and
            <surname>Marcus</surname>
          </string-name>
          <string-name>
            <given-names>A.</given-names>
            <surname>Maloof</surname>
          </string-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>"Learning to Detect Malicious Executables in the Wild."</article-title>
          <source>Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '04</source>
          (
          <year>2004</year>
          )
          <article-title>: n. pag</article-title>
          . Web.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[5]. Wang, Xiaozhou, Jiwei Liu, and Xueer Chen.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>"Microsoft Malware Classification Challenge (BIG</article-title>
          <year>2015</year>
          )
          <article-title>: First Place Team: Say No to Overfitting." (</article-title>
          <year>2015</year>
          )
          <article-title>: n. pag</article-title>
          . Web.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[6]. https://github.com/nlintz/TensorFlowTutorials/blob/master/03_net.py</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>