=Paper=
{{Paper
|id=Vol-1964/S1
|storemode=property
|title=Feature Selection for Malware Classification
|pdfUrl=https://ceur-ws.org/Vol-1964/S1.pdf
|volume=Vol-1964
|authors=Mitchell Mays,Noah Drabinsky,Stephan Brandle
|dblpUrl=https://dblp.org/rec/conf/maics/MaysDB17
}}
==Feature Selection for Malware Classification==
Mitchell Mays et al. MAICS 2017 pp. 165–170
Feature Selection for Malware Classification
Mitchell Mays, Noah Drabinsky, Dr. Stefan Brandle
Taylor University
Abstract

In applying machine learning to malware identification, different types of features have proven to be successful. These features have also been tested with different kinds of classification methodologies and have had varying degrees of success. Every time a new machine learning methodology is introduced for classifying malware, there is the potential for increasing the overall quality of malware classification in the field. Even new classifiers with the same accuracy as those used previously can be combined using one of a few different ensemble techniques to sharpen the classification and raise the accuracy to new heights.

For our purposes, we have attempted to create a coalition of classifiers which each use different features. These classifiers, when trained, provide multiple angles on the same problem and can be used to test ensemble techniques. Eventually, such an ensemble of individual malware classifiers could create a highly precise means of picking out malware from other software.

Specifically, we have created a convolutional neural network which processes byte data as an image, and a deep feed-forward neural network which utilizes opcode n-gram features. Both of these classifiers, while not perfect, provide a significant level of classification. They achieve this independently of one another, and when combined, they each contribute enough to improve the final accuracy.

The majority of the effort in this research was placed on gathering the n-gram features, a time- and resource-intensive process. Tinkering with the parameters or structure of the classifiers could provide further improvements to the system.

Introduction

The increased use of technology in recent decades has made society dependent on it. Technology is not going away anytime soon, and since software has become so entangled in daily life (e.g. monetary transactions, social media, and storage of personal information), it has become an increasingly prevalent target for malicious individuals. With highly motivated and intelligent attackers creating malware, many of the simple defenses currently in place are now insufficient to protect against them. Common anti-virus software is unable to guarantee protection against new malware, even if that malware has only been slightly modified from a known sample.

With this in mind, research has turned to machine learning techniques as a possible solution. Anti-virus software relies on specific malware signatures (longer byte sequences that are found only in a particular malware executable or in highly similar executables). These signatures are a sure sign that a file is malware. Unfortunately, the trade-off of such high confidence is a failure to generalize well to new or unseen malware. Minor changes to a malware can disrupt the signature and allow it to slip past detection. Another method used in current anti-virus software is a collection of heuristics created by experts. These painstakingly created rules attempt to capture the essence of a type of malware. They do better at generalizing, but the effort needed to keep an anti-virus product up to date with new rules for each new type of malware is unsustainable. Machine learning attempts to identify particular indicative features of a malware that can be generalized across all malicious software of its kind.
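The brittleness of signature matching described above can be shown with a minimal sketch; the signature bytes and helper name here are invented for illustration, not taken from any real anti-virus product:

```python
def matches_signature(data: bytes, signature: bytes) -> bool:
    """A signature match is essentially a substring test: the byte
    sequence must appear verbatim in the file."""
    return signature in data

# An invented six-byte "signature" for illustration only.
sig = bytes.fromhex("deadbeef4142")
hit = matches_signature(b"\x00\xde\xad\xbe\xef\x41\x42\x99", sig)   # found
miss = matches_signature(b"\x00\xde\xad\xbe\xef\x41\x43\x99", sig)  # one byte changed
```

Flipping a single byte of the malware is enough to defeat the match, which is exactly the generalization failure that motivates the machine learning approach.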
Previous machine learning attempts have used shorter byte or assembly opcode sequences as features. These sequences, called n-grams, are representative of specific actions taken within the execution of a malware. The same actions are replicated in all malware of the same family.

For the sake of identifying malware, it is helpful to organize samples by family. Each family of malware has a similar course of action or purpose. For example, a keylogger is a type of malware that records all of the text typed on a keyboard, reporting this information back to the attacker, who can extract usernames, passwords, and other important data from it. Although two keyloggers may accomplish their tasks in ways unique from one another, they both rely on the same essential components to do their work. Therefore, identifying malware by family allows a system to find multiple types of malware without treating them all as one identical classification. Furthermore, if a particular family of malware can be accurately classified, then the features used to classify it are clear indicators of that specific malware family. Some digging may then uncover how each feature correlates to the malware family in question, providing more insight into how attackers are doing their work.

Related Work

In recent years, the topography of malware research has been churning. Due to a recent surge of success, as well as the nature of the problem, machine learning techniques have served as a backbone for a significant portion of this research. The quantity of effort put forth recently has stemmed from the multiple large-scale and damaging malware attacks. Under the public eye, funding and attention have turned to more innovative solutions to the persistent problem. These recent years have brought about many fruitful developments, and have allowed better protection against malware to become available to businesses and technology owners.

Schultz, Eskin, Zadok, and Stolfo [1] were the first to publish a work attempting to apply machine learning to malware. They looked at executable files and found features for classification within them. Their system used a Naïve Bayes classifier. By training on general indicators from the executable file, they ended up with a system that was significantly more accurate than an anti-virus, although it had the potential to falsely identify benign software as malicious.

As with many developments in machine learning, a lot of the initial effort was spent on feature discovery. Many components of the executable file were tested as features, including n-grams gathered from the bytes. N-grams were first used in text classification, but Abou-Assaleh, Cercone, Kešelj, and Sweidan [2] applied the technique to malware classification. They extracted these specific sequences of bytes from the executable files and used their frequency within the file as a feature. The combined power of these n-gram features served as a good classification method for malware.

In more recent years, with a successful set of features for classifying, research has been directed toward different classifier alternatives. This set of possible classifiers, including support vector machines, k-nearest neighbors, and artificial neural networks, has performed with varying degrees of success, but many have formed high-accuracy classification systems. In the pursuit of the most accurate classifier possible, new research has offered up the ensemble classifier [3, 4]. An ensemble classifier is a classification system which utilizes multiple classifiers to come to a final class decision. The benefit of using multiple classifiers is that where one classifier fails, others may succeed, further cutting down error rates.
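The byte n-gram frequency features of [2] can be sketched in a few lines; a minimal illustration in which the hex string is an invented stand-in for real executable bytes:

```python
from collections import Counter

def byte_ngram_frequencies(data: bytes, n: int) -> Counter:
    """Count every overlapping n-byte sequence in a file's raw bytes;
    each selected n-gram's frequency then becomes one input feature."""
    counts = Counter()
    for i in range(len(data) - n + 1):
        counts[data[i:i + n]] += 1
    return counts

# Invented stand-in for real executable content.
sample = bytes.fromhex("8b450883c0018945fc8b450883c001")
freqs = byte_ngram_frequencies(sample, 3)  # the 3-gram 8b 45 08 appears twice
```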
Preprocessing

For this research we used a dataset made available by Microsoft through the machine learning website Kaggle. It includes nine different malware families and over 10,000 malware samples. Based on this dataset, we chose to create a classifier to distinguish between the different malware families. The winners of the Kaggle competition provided a paper describing the techniques that they used [5]. This served as a launching point for our research. From it, we settled on using a convolutional neural network to treat byte data as an input image, and opcode n-grams for use in a more standard neural network.

Each of the 10,000 malware samples provided had both a byte file and an assembly file, the latter disassembled from the byte file. The byte file was used to create the byte "images" in the following manner. Each byte was converted into a pixel value to be used in a black-and-white image. Since each line of the byte file is sixteen bytes long, the image created reflects this same shape: sixteen pixels wide and sixty-four pixels tall.

Figure 1. Example byte image as input to the convolutional neural network

Figure one shows how a byte image is used as input to the convolutional neural network. The network performs convolutions on the pixels, creating a matrix of smaller convolved images. As the input passes through each layer of the network, the information from the image is condensed and routed into a final output layer, giving a value to each possible class node. The class node with the maximum value is the class projected by the network. Our network had nine class nodes corresponding to the nine possible malware families.

For the second classifier in our coalition, the features were opcode n-grams. One possible n-gram might be: "mov eax, [ebp+arg_0]". This is a 3-gram, a sequence of three consecutive data points in the assembly code. For each malware, the number of times that this 3-gram occurred in its assembly would be an input to the classifier. Based on results from Abou-Assaleh, Cercone, Kešelj, and Sweidan [2], as well as findings from Wang, Liu, and Chen [5], we settled on n-grams ranging from two to four data points.

The feature gathering for this classifier was more complicated than for the convolutional neural network. A method was needed to identify certain n-grams as possible features and then to decide which of those chosen n-grams would provide the most classification power. There were four stages to this process:

1. 1-gram selection
2. 1-gram paring
3. n-gram selection
4. n-gram paring

1. 1-gram Selection
Unless we intended to evaluate every possible two- to four-gram of operations in an assembly file (a number easily in the billions), we needed a criterion for selecting particular n-grams. One way of doing this is to build larger n-grams off of particular 1-grams. Based on the work by Wang, Liu, and Chen [5], we mimicked the execution of the assembly, following every unconditional jump and counting the frequency of each 1-gram. If a 1-gram occurred at least two hundred times in a file, it was considered valid.

2. 1-gram Paring
Even after using a specific criterion to limit the number of 1-grams selected, there were too many to build n-grams from. We used a few further criteria to pare down the list of 1-grams. First, we removed 1-grams with only one character. We then removed 1-grams that started with a register.
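The selection and paring steps can be sketched as follows. This is a simplified illustration: it scans the assembly linearly rather than mimicking execution through unconditional jumps, it treats only the opcode mnemonic as the 1-gram, and the register list and function names are invented:

```python
from collections import Counter

# Invented register list for the paring filter.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}

def select_one_grams(asm_lines, min_count=200):
    """Count each opcode token (1-gram) and keep those occurring at
    least min_count times. The paper additionally mimics execution by
    following unconditional jumps; this sketch just scans linearly."""
    counts = Counter(line.split()[0] for line in asm_lines if line.split())
    return {tok for tok, c in counts.items() if c >= min_count}

def pare_one_grams(one_grams):
    """Paring criteria: drop single-character tokens, register tokens,
    and file-specific labels beginning with "loc_" or "sub_"."""
    return {
        tok for tok in one_grams
        if len(tok) > 1
        and tok not in REGISTERS
        and not tok.startswith(("loc_", "sub_"))
    }
```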
While we could assume that the malware would be manipulating the registers, we intuited that having the registers themselves be the start of an n-gram would be unhelpful, since it gives no vision into how they are being manipulated. Further paring involved removing items that began with "loc_" or "sub_", as these rely on file locations and are therefore somewhat file specific rather than specific to a type of malware.

These heuristic decisions were made due to the limitations of our system, which was unable to handle the enormous number of n-grams that would be generated from a larger set of 1-grams. It is possible that the 1-grams that were removed could have produced potent n-grams for classification, but based on our reasoning, they were less likely to. After all of the paring was complete, we were left with 855 different 1-grams.

3. n-gram Selection
With the 1-grams selected, we ran through all of the files a second time, grabbing every possible 2- through 4-gram that began with a 1-gram in the list. Upon completion there were 14,566,780 n-grams, each with a particular frequency per malware file.

4. n-gram Paring
The most important step in generating an accurate classifier is to select the n-grams, from the 14 million, that will actually aid in separating some families of malware from others. The best way to evaluate the benefit of an individual feature as an element of a classifier is to test its ability to classify on its own. If a simple classifier trained on the individual feature is able to classify the malware samples in a way that exceeds randomness, then that feature would be helpful to include in a larger classifier.

Information gain is an algorithm that encapsulates this idea. First, it calculates the entropy in a dataset. Then, it classifies that dataset based on some feature. For each class, the algorithm calculates the entropy of the items within. Subtracting the average entropy of each designated class from the original entropy gives the information gain, or the improvement in classification ability based on that feature.

Figure 2. Change in entropy by classifying on a single feature

In figure two it is clear that the feature helps to distinguish between the two classes, and the entropy in each new class evidences that. By examining the change in entropy, the helpfulness of a feature is revealed. Unfortunately, training a classifier on 14 million possible features is a lengthy process. To compensate for this, we took two different measures.

First, rather than running the information gain on each individual feature, we split the list of features into batches of five. Training the mini-classifier on five features instead of one allowed for a quick check of whether any of the five features were helpful. Even with four unhelpful features, we assumed that after training, the one helpful feature would still significantly affect classification, leading to a noticeable information gain.

Second, we distributed the work across several different machines in Taylor University's Computer Science Department. By re-purposing a bit of code written for distributing graphics rendering tasks, we were able to create a task list of n-grams to run through the algorithm and return the results to a central server. Even with up to sixty-two machines churning through the features, it took three full days to complete the information gain on the 14 million n-grams.
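The entropy and information-gain calculation described above can be sketched as follows for a single feature (the paper evaluates batches of five features through a mini-classifier; this minimal version splits the dataset directly on one feature, such as an n-gram's presence or absence):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a collection of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Entropy of the whole dataset minus the weighted average entropy
    of the groups induced by the feature (e.g. an n-gram's presence)."""
    total = len(labels)
    groups = {}
    for lab, val in zip(labels, feature_values):
        groups.setdefault(val, []).append(lab)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

A feature that splits the families cleanly drives the remaining entropy toward zero, giving a large gain; a feature uncorrelated with the families gives a gain near zero.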
We iterated through the information gain values for each set of five features, finding each set with a value above zero. There were approximately 72,000 sets of five which provided a positive information gain. We split these sets of five into individual tasks and re-ran the information gain test. We were left with 10,985 individual n-grams to use as features.

Testing

In order to test the n-gram features, we used a deep feed-forward neural network. This network took 10,985 inputs, one for each n-gram. It had two hidden layers, each with 7,300 nodes. The output had nine nodes, one for each family of malware. The structure of the network was based on a tutorial by Nathan Lintz [6], and it was built using the TensorFlow library.

The convolutional neural network was also built with the TensorFlow library, taking as input a 16x64-pixel image. It held four convolutional layers and finished with nine output nodes. In order to prevent over-fitting, both networks employed dropout at every layer. Dropout ignores a percentage of the weights in the neural network during training. This prevents the network from over-fitting to the training data and, therefore, failing on the testing data.

The testing methodology used for both of these networks was 10-fold cross validation. The entire set of 10,000 malware samples was shuffled and split into ten separate sets, with 90% of files used for training and 10% for testing. Each iteration of cross validation took place in the following manner:

First, both networks were trained to completion on the training set. Next, the networks attempted to classify the test sets. The features (byte image or n-gram frequency) were used as inputs to the classifier, and the output was a set of float values assigned to the nine class nodes. The node with the highest value was the network's classification. Finally, we utilized the output float values from each classifier to create an ensemble classifier. To get a final classification from the ensemble classifier, we multiplied the accuracy of each classifier on the test set by its output values for each malware, then summed these class values across the classifiers. This gave a weighted consensus between the two classifiers.

The feed-forward neural network trained on the malware n-grams twenty times, and the convolutional neural network trained on the 16x64-pixel images fifty times.

Figure 3. Testing framework for individual and ensemble classifiers
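The weighted consensus described above can be sketched as follows; the function name and the example scores are invented, and the example is shortened to three class nodes rather than the paper's nine:

```python
def ensemble_predict(outputs_a, outputs_b, acc_a, acc_b):
    """Weighted consensus: scale each classifier's per-class output
    values by that classifier's accuracy on the test set, sum the
    scaled scores, and return the index of the winning class."""
    combined = [acc_a * a + acc_b * b for a, b in zip(outputs_a, outputs_b)]
    return combined.index(max(combined))

# Invented three-class example: a CNN that favors class 1 and a
# feed-forward network that favors class 0, weighted by their accuracies.
cnn_scores = [0.10, 0.70, 0.20]
ffn_scores = [0.60, 0.30, 0.10]
winner = ensemble_predict(cnn_scores, ffn_scores, 0.885, 0.967)
```

Weighting by test-set accuracy lets the stronger classifier dominate ties while still allowing a confident output from the weaker classifier to flip a borderline decision.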
We chose the number of training iterations after observing training; each network took approximately that number of iterations to settle.

Results

After running the cross validations on both of the networks, we were left with a final average accuracy for each. The deep feed-forward neural network using n-gram features trained to 96.7% accuracy. The convolutional neural network, on the other hand, trained to 88.5% accuracy on the test data. By multiplying the output nodes by these accuracies, we created weighted nodes. Combining these nodes as an ensemble classifier returned an accuracy of 97.7%.

The training time for the feed-forward neural network was much longer than that of the convolutional neural network. This was expected, since the feed-forward neural network was training weights for every input node rather than training the smaller convolutions.

Conclusion

The final outcome of our research was an ensemble classifier with 97.7% accuracy on the 10,000 malware samples. This value is one percent higher than the best individual classifier in the coalition (96.7% from the feed-forward neural network). Lu, Din, Zheng, and Gao [3] showed that simpler classifiers such as support vector machines and decision trees could be used as components to build an improved ensemble classifier. Our work shows that neural networks can also serve as constituents of an ensemble classifier and still create an improved classification.

While we were limited in our time to tinker with the networks and improve their performance, the principle that ensembles improve the overall classification will most likely still hold as the individual classifiers become more accurate. Furthermore, adding more classifiers to the ensemble should add to the accuracy of the system as a whole.

Future Work

Ultimately, the use of ensembles of complex classifiers has barely scratched the surface of what is possible. Feature gathering took up a large portion of research time, and there is plenty more to be done on the classifiers. Manipulating the structure and number of layers in both the convolutional neural network and the feed-forward neural network could drastically affect the performance of the classifiers. Adding new classifiers also has the potential to bring significant improvements to the malware classification. Now that an ensemble method has been applied to neural networks with success, it is worth investing more time to perfect this method and compare it to ensembles which depend on simpler classifiers.

References

[1] Schultz, M. G., E. Eskin, F. Zadok, and S. J. Stolfo. "Data Mining Methods for Detection of New Malicious Executables." Proceedings of the 2001 IEEE Symposium on Security and Privacy (S&P 2001).
[2] Abou-Assaleh, T., N. Cercone, V. Kešelj, and R. Sweidan. "N-gram-based Detection of New Malicious Code." Proceedings of the 28th Annual International Computer Software and Applications Conference (COMPSAC 2004).
[3] Lu, Yi-Bin, Shu-Chang Din, Chao-Fu Zheng, and Bai-Jian Gao. "Using Multi-Feature and Classifier Ensembles to Improve Malware Detection." Journal of Chung Cheng Institute of Technology 39.2 (2010).
[4] Kolter, Jeremy Z., and Marcus A. Maloof. "Learning to Detect Malicious Executables in the Wild." Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '04).
[5] Wang, Xiaozhou, Jiwei Liu, and Xueer Chen. "Microsoft Malware Classification Challenge (BIG 2015): First Place Team: Say No to Overfitting." (2015).
[6] https://github.com/nlintz/TensorFlow-Tutorials/blob/master/03_net.py