


        Feature Selection for Malware Classification
                          Mitchell Mays, Noah Drabinsky, Dr. Stefan Brandle
                                          Taylor University




Abstract

In applying machine learning to malware identification, different types of features have proven to be successful. These features have also been tested with different kinds of classification methodologies and have had varying degrees of success. Every time a new machine learning methodology is introduced for classifying malware, there is the potential to increase the overall quality of malware classification in the field. Even new classifiers with the same accuracy as existing ones can be combined, using one of several ensemble techniques, to sharpen the classification and raise the accuracy to new heights.

For our purposes, we have attempted to create a coalition of classifiers, each of which uses different features. Once trained, these classifiers provide multiple angles on the same problem and can be used to test ensemble techniques. Eventually, such an ensemble of individual malware classifiers could provide a highly precise means of picking out malware from other software.

Specifically, we have created a convolutional neural network which processes byte data as an image, and a deep feed-forward neural network which utilizes opcode n-gram features. Both of these classifiers, while not perfect, provide a significant level of classification accuracy. They achieve this independently of one another, and when combined, each contributes enough to improve the final accuracy.

The majority of the effort in this research was placed on gathering the n-gram features, a time- and resource-intensive process. Tinkering with the parameters or structure of the classifiers could provide further improvements to the system.

Introduction

The increased use of technology in recent decades has made society dependent on it. Technology is not going away anytime soon, and since software has become so entangled in daily life (e.g. monetary transactions, social media, and storage of personal information), it has become an increasingly prevalent target for malicious individuals. With highly motivated and intelligent attackers creating malware, many of the simple defenses currently in place are insufficient. Common anti-virus software cannot guarantee protection against new malware, even when that malware is only a slight modification of a known sample.

With this in mind, research has turned to machine learning techniques as a possible solution. Anti-virus software relies on specific malware signatures: longer byte sequences that are found only in a particular malware executable or in highly similar executables. These signatures are a sure sign that a file is malware. Unfortunately, the trade-off for such high confidence is a failure to generalize to new or unseen malware. Minor changes to a malware can disrupt the signature and allow it to slip past detection. Another method used in current anti-virus software is a collection of heuristics created by experts. These painstakingly created rules attempt to capture the essence of a type of malware. They generalize better, but the effort needed to keep an anti-virus product up to date with new rules for each new type of malware is unsustainable.
Machine learning, by contrast, attempts to identify indicative features of a malware that can be generalized across all malicious software of its kind.

Previous machine learning attempts have used shorter byte or assembly opcode sequences as features. These sequences, called n-grams, represent specific actions taken during the execution of a malware. The same actions are replicated in all malware belonging to the same family.

For the sake of identifying malware, it is helpful to organize samples by family. Each family of malware has a similar course of action or purpose. For example, a keylogger is a type of malware that records all of the text typed at a keyboard, reporting this information back to the attacker, who can extract usernames, passwords, and other important data from it. Although two keyloggers may accomplish their task in different ways, both rely on the same essential components to do their work. Identifying malware by family therefore allows a system to find multiple types of malware without treating them all as one identical classification. Furthermore, if a particular family of malware can be accurately classified, then the features used to classify it are clear indicators of that specific family. Some digging may then uncover how a feature correlates to the malware family in question, providing more insight into how attackers do their work.

Related Work

In recent years, the topography of malware research has been churning. Due to a recent surge of success, as well as the nature of the problem, machine learning techniques have served as a backbone for a significant portion of this research. The quantity of recent effort has stemmed from multiple large-scale and damaging malware attacks. Under the public eye, funding and attention have turned to more innovative solutions to this persistent problem. These years have brought many fruitful developments and have allowed better protection against malware to become available to businesses and technology owners.

Schultz, Eskin, Zadok, and Stolfo [1] were the first to publish a work attempting to apply machine learning to malware. They looked at executable files and found features for classification within them. Their system used a Naïve Bayes classifier. By training on general indicators from the executable file, they ended up with a system that was significantly more accurate than an anti-virus, although it had the potential to falsely identify benign software as malicious.

As with many developments in machine learning, much of the initial effort was spent on feature discovery. Many components of the executable file were tested as features, including n-grams gathered from the bytes. N-grams were first used in text classification, but Abou-Assaleh, Cercone, Kešelj, and Sweidan [2] applied the technique to malware classification. They extracted these specific sequences of bytes from the executable files and used their frequency within each file as a feature. The combined power of these n-gram features served as a good classification method for malware.

In more recent years, with a successful set of classification features established, research has been directed toward classifier alternatives. This set of possible classifiers, including support vector machines, k-nearest neighbor, and artificial neural networks, has performed with varying degrees of success, and many have formed high-accuracy classification systems. In the pursuit of the most accurate classifier possible, newer research has offered up the ensemble classifier [3, 4]. An ensemble classifier is a classification system which utilizes multiple classifiers to come to a final class decision. The benefit of using multiple classifiers is that where one classifier fails, others may succeed, further cutting down error rates.

Preprocessing



For this research we used a dataset made available by Microsoft through the machine learning website Kaggle. It includes nine different malware families and over 10,000 malware samples. Based on this dataset, we chose to create a classifier to distinguish between the different malware families. The winner of the Kaggle competition provided a paper describing the techniques that they used [5]. This served as a launching point for our research. From it, we settled on using a convolutional neural network to treat byte data as an input image, and opcode n-grams for use in a more standard neural network.

Each of the 10,000 malware samples provided had both a byte file and an assembly file that had been disassembled from the byte file. The byte file was used to create the byte "images" in the following manner. Each byte was converted into a pixel value in a black-and-white image. Since each line of the byte file is sixteen bytes long, the image created mirrors that shape: sixteen pixels wide and sixty-four pixels tall.

[Figure 1. Example byte image as input to the convolutional neural network]

Figure 1 shows how a byte image is used as an input to the convolutional neural network. The network performs convolutions on the pixels, creating a matrix of smaller convolved images. As the image passes through each layer of the network, its information is condensed and routed into a final output layer, which assigns a value to each possible class node. The class node with the maximum value is the class predicted by the network. Our network had nine class nodes, corresponding to the nine possible malware families.
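
As an illustration of this preprocessing step, below is a minimal sketch of the byte-to-image conversion. It assumes the Kaggle .bytes format of a hex address followed by sixteen hex byte values per line; mapping the dataset's "??" placeholders to zero and zero-padding short files are assumptions made for the sketch, not details taken from the paper.

import numpy as np

def bytes_to_image(path, width=16, height=64):
    """Convert a Kaggle .bytes file into a 16x64 grayscale image array."""
    rows = []
    with open(path) as f:
        for line in f:
            tokens = line.split()[1:]  # drop the leading address column
            # "??" marks bytes the disassembler could not read; use 0 here
            row = [0 if t == "??" else int(t, 16) for t in tokens]
            if len(row) == width:
                rows.append(row)
            if len(rows) == height:    # keep only the first 64 rows
                break
    while len(rows) < height:          # zero-pad files shorter than 64 rows
        rows.append([0] * width)
    return np.array(rows, dtype=np.uint8)  # pixel values in 0-255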

For the second classifier in our coalition, the features were opcode n-grams. One possible n-gram might be "mov eax, [ebp+arg_0]". This is a 3-gram: a sequence of three consecutive data points in the assembly code. For each malware sample, the number of times this 3-gram occurred in its assembly would be an input to the classifier. Based on the results of Abou-Assaleh, Cercone, Kešelj, and Sweidan [2], as well as the findings of Wang, Liu, and Chen [5], we settled on n-grams ranging from two to four data points.
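
A sketch of the n-gram counting itself is shown below. It assumes the assembly has already been tokenized into a sequence of data points; the function name count_ngrams is illustrative.

from collections import Counter

def count_ngrams(tokens, n_min=2, n_max=4):
    """Count every n-gram of length 2 through 4 over a token sequence."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# The frequency of the 3-gram from the text becomes one classifier input:
# count_ngrams(tokens)[("mov", "eax", "[ebp+arg_0]")]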
The feature gathering for this classifier was more complicated than for the convolutional neural network. A method was needed to identify certain n-grams as possible features and then to decide which of those n-grams would provide the most classification power. There were four stages to this process:

    1. 1-gram Selection
    2. 1-gram Paring
    3. n-gram Selection
    4. n-gram Paring

1. 1-gram Selection

Short of evaluating every possible two- to four-gram of operations in an assembly file (a number easily in the billions), we needed a criterion for selecting particular n-grams. One way of doing this is to build larger n-grams from particular 1-grams. Based on the work of Wang, Liu, and Chen [5], we mimicked the execution of the assembly, following every unconditional jump and counting the frequency of each 1-gram. If a 1-gram occurred at least two hundred times in a file, it was considered valid.
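
The frequency threshold reduces to a short filter, sketched below. The sketch assumes the token stream has already been reordered to mimic execution (with unconditional jumps followed); that reordering step is omitted here.

from collections import Counter

def select_one_grams(token_stream, threshold=200):
    """Keep the 1-grams appearing at least `threshold` times in one file."""
    freq = Counter(token_stream)
    return {token for token, count in freq.items() if count >= threshold}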

2. 1-gram Paring

Even after using a specific criterion to limit the number of 1-grams selected, there were too many to build n-grams from. We used a few further criteria to pare down the list. First, we removed 1-grams with only one character. We then removed 1-grams that started with a register. While we could assume that the malware would be manipulating the registers, we reasoned that having the registers themselves begin an n-gram would be unhelpful, since it gives no insight into how they are being manipulated. Further paring involved removing items that began with "loc_" or "sub_": these rely on file locations, making them specific to individual files rather than to a type of malware.

These heuristic decisions were made because our system was unable to handle the enormous number of n-grams that would be generated from a larger set of 1-grams. It is possible that the removed 1-grams could have produced potent n-grams for classification, but based on our reasoning, they were less likely to. After all of the paring was complete, we were left with 855 different 1-grams.
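
These paring rules translate directly into a filter like the sketch below. The register list shown is illustrative rather than the exact set used in this research.

def pare_one_grams(one_grams):
    """Apply the paring heuristics described above to a set of 1-grams."""
    # An illustrative x86 register list; the full set used is not specified.
    registers = ("eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp")
    kept = set()
    for gram in one_grams:
        if len(gram) <= 1:                     # drop one-character 1-grams
            continue
        if gram.startswith(registers):         # drop register-led 1-grams
            continue
        if gram.startswith(("loc_", "sub_")):  # drop file-specific labels
            continue
        kept.add(gram)
    return kept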
3. n-gram Selection

With the 1-grams selected, we ran through all of the files a second time, grabbing every possible 2- through 4-gram that began with a 1-gram in the list. Upon completion there were 14,566,780 n-grams, each with a particular frequency per malware file.

4. n-gram Paring

The most important step in generating an accurate classifier is to select, from the 14 million n-grams, those that will actually aid in separating some families of malware from others. The best way to evaluate the benefit of an individual feature as an element of a classifier is to test its ability to classify on its own. If a simple classifier trained on the individual feature can classify the malware samples in a way that exceeds randomness, then that feature would be helpful to include in a larger classifier.

Information gain is an algorithm that encapsulates this idea. First, it calculates the entropy of a dataset. Then, it splits that dataset based on some feature. For each resulting class, the algorithm calculates the entropy of the items within. Subtracting the weighted average entropy of the designated classes from the original entropy gives the information gain: the improvement in classification ability attributable to that feature.
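
The computation can be sketched as follows, using the textbook formulation of information gain: the dataset's entropy minus the weighted entropy of the groups produced by splitting on a feature (here, a mini-classifier's predicted class would serve as the splitting value).

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(labels, split_values):
    """Original entropy minus the weighted entropy after the split."""
    groups = {}
    for label, value in zip(labels, split_values):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g)
                    for g in groups.values())
    return entropy(labels) - remainder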


[Figure 2. Change in entropy by classifying on a single feature]

In Figure 2 it is clear that the feature helps to distinguish between the two classes, and the entropy within each new class evidences that. By examining the change in entropy, the helpfulness of a feature is revealed. Unfortunately, training a classifier on 14 million possible features is a lengthy process. To compensate, we took two different measures.

First, rather than running the information gain on each individual feature, we split the list of features into batches of five. Training the mini-classifier on five features instead of one allowed a quick check of whether any of the five features were helpful. Even with four unhelpful features, we assumed that after training, the one helpful feature would still significantly affect classification, leading to a noticeable information gain.

Second, we distributed the work across several different machines in Taylor University's Computer Science Department. By re-purposing a bit of code written for distributing graphics rendering tasks, we were able to create a task list of n-grams to run through the algorithm and return the results to a central server. Even with up to sixty-two machines churning through the features, it took three full days to complete the information gain computation on the 14 million n-grams.

We then iterated through the information gain values for each set of five features, keeping each set with a value above zero. Approximately 72,000 sets of five provided a positive information gain. We split these sets into individual tasks and re-ran the information gain test. We were left with 10,985 individual n-grams to use as features.

Testing

In order to test the n-gram features, we used a deep feed-forward neural network. This network took 10,985 inputs, one for each n-gram. It had two hidden layers, each with 7,300 nodes. The output had nine nodes, one for each family of malware. The structure of the network was based on a tutorial by Nathan Lintz [6] and was built using the TensorFlow library.

The convolutional neural network was also built with the TensorFlow library, taking as input a 16x64 pixel image. It held four convolutional layers and finished with nine output nodes. To prevent over-fitting, both networks employed dropout at every layer. Dropout ignores a percentage of the weights in the neural network during training, which prevents the network from fitting too tightly to the training data and, therefore, failing on the testing data.
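
For concreteness, a sketch of the two architectures is given below in modern Keras rather than the original 2016-era TensorFlow code. The layer counts and sizes come from the text; the activations, kernel sizes, filter counts, and dropout rates are assumptions.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_ngram_net(n_features=10985, n_classes=9):
    """Deep feed-forward net: two 7,300-node hidden layers with dropout."""
    return models.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dropout(0.2),
        layers.Dense(7300, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(7300, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])

def build_byte_cnn(n_classes=9):
    """Convolutional net over 16x64 byte images with four conv layers."""
    return models.Sequential([
        layers.Input(shape=(64, 16, 1)),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(0.25),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])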




The testing methodology used for both of these networks was 10-fold cross validation. The entire set of 10,000 malware samples was shuffled and split into ten separate sets, with 90% of the files used for training and 10% for testing in each fold. Each iteration of cross validation took place in the following manner. First, both networks were trained to completion on the training set. Next, the networks attempted to classify the test set. The features (byte image or n-gram frequencies) were used as inputs to the classifier, and the output was a set of float values assigned to the nine class nodes. The node with the highest value was the network's classification. Finally, we utilized the output float values from each classifier to create an ensemble classifier. To get a final classification from the ensemble, we multiplied the accuracy of each classifier on the test set by its output values for each malware sample and summed these weighted class values across the two classifiers. This gave a weighted consensus between them.
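
The weighted consensus is simple to state in code. The sketch below assumes each network's per-class output values have been collected into an array with one row per test sample.

import numpy as np

def ensemble_predict(ffnn_outputs, cnn_outputs, ffnn_acc, cnn_acc):
    """Scale each network's nine output values by that network's
    test-set accuracy, sum them, and take the highest-valued class."""
    combined = ffnn_acc * ffnn_outputs + cnn_acc * cnn_outputs
    return np.argmax(combined, axis=1)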


The feed-forward neural network trained on the malware n-grams for twenty iterations, and the convolutional neural network trained on the 16x64 pixel images for fifty. We chose these iteration counts after overseeing training; each network took approximately that many iterations to settle.

[Figure 3. Testing framework for individual and ensemble classifiers]
Results

After running the cross validations on both of the networks, we were left with a final average accuracy for each. The deep feed-forward neural network using n-gram features trained to 96.7% accuracy. The convolutional neural network, on the other hand, trained to 88.5% accuracy on the test data. By multiplying the output nodes by these accuracies, we created weighted nodes; combining these nodes as an ensemble classifier returned an accuracy of 97.7%.

The training time for the feed-forward neural network was much longer than that for the convolutional neural network. This was expected, since the feed-forward neural network was training weights for every input node rather than the smaller convolutions.

Conclusion

The final outcome of our research was an ensemble classifier with 97.7% accuracy on the 10,000 malware samples. This value is one percent higher than the best individual classifier in the coalition (96.7% from the feed-forward neural network). Lu, Din, Zheng, and Gao [3] showed that simpler classifiers such as support vector machines and decision trees could be used as components to build an improved ensemble classifier. Our work shows that neural networks can also serve as constituents of an ensemble classifier and still produce an improved classification.

While we were limited in our time to tinker with the networks and improve their performance, the principle that ensembles improve the overall classification will most likely still hold as the individual classifiers become more accurate. Furthermore, adding more classifiers to the ensemble should add to the accuracy of the system as a whole.

Future Work

Ultimately, the use of ensembles of complex classifiers has barely scratched the surface of what is possible. Feature gathering took up a large portion of our research time, and there is plenty more to be done on the classifiers. Manipulating the structure and number of layers in both the convolutional neural network and the feed-forward neural network could drastically affect the performance of the classifiers. Adding new classifiers also has the potential to bring significant improvements to the malware classification. Now that an ensemble method has been applied to neural networks with success, it is worth investing more time to perfect this method and to compare it to ensembles that depend on simpler classifiers.

References

[1] Schultz, M. G., E. Eskin, F. Zadok, and S. J. Stolfo. "Data Mining Methods for Detection of New Malicious Executables." Proceedings of the 2001 IEEE Symposium on Security and Privacy (S&P 2001).

[2] Abou-Assaleh, T., N. Cercone, V. Kešelj, and R. Sweidan. "N-gram-based Detection of New Malicious Code." Proceedings of the 28th Annual International Computer Software and Applications Conference (COMPSAC 2004).

[3] Lu, Yi-Bin, Shu-Chang Din, Chao-Fu Zheng, and Bai-Jian Gao. "Using Multi-Feature and Classifier Ensembles to Improve Malware Detection." Journal of Chung Cheng Institute of Technology 39.2 (2010).

[4] Kolter, Jeremy Z., and Marcus A. Maloof. "Learning to Detect Malicious Executables in the Wild." Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '04).

[5] Wang, Xiaozhou, Jiwei Liu, and Xueer Chen. "Microsoft Malware Classification Challenge (BIG 2015): First Place Team: Say No to Overfitting." 2015.

[6] Lintz, Nathan. TensorFlow Tutorials. https://github.com/nlintz/TensorFlow-Tutorials/blob/master/03_net.py

