Heuristic Malware Detection Mechanism Based on
                                          Executable Files Static Analysis
                                                                          1             2                     1
                                                  A.V. Kozachoka,∗, M.V. Bochkovb , E.V. Kochetkova
                                a1Academy of the Federal Guard Service, 35, Priborostroitelnaya Street, Oryol, 302034, Russia
                               2
                               b Business risk educational center, 27, Professor Popov Street, Saint-Petersburg, 197022, Russia


Abstract
To ensure the protection of information processed by computer systems is currently the most important task in the
construction and operation of the automated systems. The paper presents the application justification of a new set of
features distinguished at the stage of the static analysis of the executable files to address the problem of malicious code
detection. In the course of study, following problems were solved: development of the executable files classifier in the
absence of a priori data concerning their functionality; designing class models of uninfected files and malware during
the learning process; development of malicious code detection procedure using the neural networks mathematical
apparatus and decision tree composition relating to the set of features specified on the basis of the executable files
static analysis. The paper also describes the functional model of malware detection system using the executable files
static analysis. The conclusion contains the results of experimental evaluation of the developed detection mechanism
efficiency on the basis of neural networks and decision tree composition. The obtained data confirmed the hypothesis
about the possibility of constructing the heuristic malware analyzer on the basis of features distinguished during
the static analysis of the executable files. However, the approach based on the decision tree composition enables to
obtain a significantly lower false negative rate probability with the specified initial data and classifier parameter values
relating to neural networks.
Keywords: Anti-virus protection, Malware, Neural networks, Decision trees, Heuristic analysis, Machine learning.


1. Introduction

    Security of information processed by computer systems poses the most important task for building and operating
of automated systems today. Along with that, one of the most dangerous threat is computer malware that can modify
(delete) user data, steal confidential information, slowdown or disable operating system. This research substantiates
the possibility of using feature space based on the static analysis of executable files for solving the task of heuristic
malware detection using the neural networks and decision trees composition mathematical apparatus.
    Today there exist different malware detection techniques. The most widely known techniques are the following
ones [1]:
     – signature-based search (malicious code detection based on byte sequence which definitely characterizes it);
     – heuristic search (code detection based on indirect attributes which characterize it as being malicious);

     – behavioral mechanisms that affect executing forbidden operations by different processes (e.g., access to critical
       memory areas or executable code injection into other processes).

   ∗ Corresponding author. Tel.: +7-4862-54-99-33

    Email address: a.kozachok@academ.msk.rsnet.ru (A.V. Kozachok)


Preprint submitted to Elsevier                                                                                                    May 22, 2017
 3rd International conference “Information Technology and Nanotechnology 2017”
                                                                                                                                          132
              Image Processing, Geoinformation Technology and Information Security / A.V. Kozachok, M.V. Bochkov, E.V. Kochetkov


    All above-listed techniques have essential weak points, i.e. they possess limited capabilities for detection of
modified and new viruses, or require the user to be involved in the decision-making process with respect to file
belonging to a certain class.
    Today the antivirus software cannot guarantee 100% malware protection. The results of tests performed by AV-
Comparatives in March 2017 show that heuristic detection rate of new malware strains amounts to approximately
95-98% for most of the modern antivirus software [2].
    To detect new malware, heuristic methods or more generally statistical approaches are the most promising research
trends nowadays. Some of them based on structural analysis and executable file features [3, 4, 5, 6]. One of the
solutions for increasing the effectiveness of heuristic malware detection process is the development of new tools and
techniques for malware detection. The purpose of this study is to substantiate the possibility to build a heuristic
technique for malware detection based on static analysis of executable files. The distinctive feature of the approach
suggested consists in the use of new feature space for building a heuristic detector based on the well-known machine
learning techniques, i.e. neural networks and decision trees composition. In this context, a decision on the malware
presence will be taken according to a certain law based on availability or absence of totality of features from criteria
array defined at the stage of executable file static analysis.


2. Forming feature space based on static executable files analysis

    In order to substantiate the possibility of using suggested feature space for solving the task of heuristic malware
detection, the neural networks and decision trees composition technique has been applied in this study. Let us consider
the totality of the features being studied.
    The whole feature space may be divided into eight conditional feature groups. Group 1 comprises the features
based on the results of characteristics evaluation for the following executable files parts: file header size, optional
header, MS-DOS header, digital certificate. Since the structure of the headers has been defined, in case of their
size change relevant attribute will be detected. Group 2 comprises the features associated with the use of packing,
archiving and encrypting utilities for executable files such as UPX, MPRESS, PeCompact etc. Group 3 comprises
information about dynamic libraries, as well as functions exported and imported by the executable file. The rate of
certain API-functions and dynamic libraries usage by malware has been precomputed. As a result, two classes have
been identified. The first class comprises API-functions by means of which malicious actions can be performed. The
second class comprises the rest of the functions. In this context, belonging to a certain class of API-functions is to be
regarded as a feature. Group 4 comprises data on digital certificates, namely, whether they are available in the file,
whether data are out-of-date or have been recalled. Group 5 comprises the features based on the information about
PE-file structure, namely, availability of anonymous section, whether overlay technique is applied, whether the first
section is available for writing, whether the control function is transferred by the file to other files, entry point address
is out of the file section boundaries, in the first or other sections, whether the last and other sections are of executable
type. Feature group 6 is formed based on the manifest, its availability, correspondence of the manifest structure to
standard format, whether the administration privilege is requested by the manifest etc. Group 7 comprises information
about executable file interface with the operating system, namely, whether the use of Structured Exception Handling
(SEH), Data Execution Prevention (DEP) and Address Space Layout Randomization (ASLR) is ignored, whether
the application is executed in Visual Basic virtual machine, whether Thread Local Storage (TLS) is used. Group 8
comprises information not included in previous seven groups; for example, whether the file contains rigidly fixed
IP-addresses, whether direct cookie links are present, whether databases are in use etc.
    Each analyzed executable file is described in the form of a Boolean vector, where one means the feature is avail-
able, zero means no feature is available.


3. Static heuristic malware detection mechanisms

3.1. Machine learning detection mechanism based on static executable files analysis
    To increase the effectiveness of heuristic analysis of executable files we suggest using the malware detection
technique based on neural networks. Utilization of the neural network mathematical apparatus together with the
created feature space will enable us to solve the following tasks:


3rd International conference “Information Technology and Nanotechnology 2017”
                                                                                                                                   133
         Image Processing, Geoinformation Technology and Information Security / A.V. Kozachok, M.V. Bochkov, E.V. Kochetkov


   1. generating class models in the course of learning (uninfected files and malware);
   2. developing malware detection procedure through the use of feature vector based on static analysis of executable
      files;
   3. classifying executable files without priori data on their infection with malicious code.
   Solution of the task for developing the neural network malware detection technique based on static analysis of
executable files comprises two main stages as follows:
   1. learning the neural network which defines malware and uninfected file classes (learning subsystem);
   2. calculating output values of neural network based on the sequence of features singled out of the files analyzed,
      and decision-making on files belonging to a certain class (classification subsystem).
   Let us consider the learning process in detail. It consists of two main stages:
   1. supervised learning the neural networks;
   2. adjusting hyper parameter values for decision-making on the file infection status.
     At the first stage, the neural network is learnt in a supervised mode. The network is provided with values of both
input and priori known output signals, whereas weight coefficients are subject to corrective adjustment in order to
increase the accuracy of the neural network learning.
     The result of the first learning stage is a weighting coefficients matrix (model) of the learnt neural network.
     At the second learning stage for the given learning set it is necessary to evaluate mistake probability values of the
first and second grade depending on the hyper parameter values, and adjust them in accordance with the requirements
to the first and second grade mistake criticality for further effective functioning of the neural network as a malware
detection tool.
     As a result of the second learning stage hyper parameter values to be used for executable file classification muat
be chosen. The learning procedure result is a weighting coefficients matrix and hyper parameter values.
     The detection procedure consists of two main stages [7]. During the first stage the detection system input receives
the feature sequence singled out of the file analyzed. Then the neural network output values are calculated using
weighting coefficients matrices entered into the database.
     Depending on output value classifier makes a decision on the analyzed file belonging to a certain software class.

3.2. Heuristic malware detection mechanism based on decision trees composition
    As an alternative approach, in order to substantiate the possibility to build heuristic malware detection tool based
on static analysis of executable files this study provides the results of classifier effectiveness evaluation based on the
decision trees composition.
    As a rule, composition of algorithms is regarded as a combination of N algorithms d1 (x), . . . , dN (x) in a single
one. The idea consists in learning the algorithms and averaging of the obtained responses:

                                                                 1 X
                                                                     N
                                                        a(x) =         dn (x).                                                (1)
                                                                 N n=1
   This formula directly answers the regression problem. For the case of binary classification d1 (x), . . . , dN (x) it is
necessary to take a sign from the resulting formula:

                                                                    1 X
                                                                         N
                                                      a(x) = sign         dn (x).                                             (2)
                                                                    N n=1
    To build a decision trees composition [8] first it is necessary to learn the basic N algorithms on different subsets
singled out from the learning set. To single out random sets, an approach based on the random sets generation from
the learning set through removal followed by return procedure (bootstrap) has been applied in this study. At the same
time, the size of each subset amounts to 0, 632L, where L is the learning set size.
    Additionally, random subspace technique [9] has been applied. The technique consists in choosing the random
subset of features for learning each basic algorithm. The number of the features chosen is a hyperparameter of the
given technique.


3rd International conference “Information Technology and Nanotechnology 2017”                                                 134
      Image Processing, Geoinformation Technology and Information Security / A.V. Kozachok, M.V. Bochkov, E.V. Kochetkov


4. Results and Discussion

    At the learning stage for both approaches described above it is necessary to create two representative learning
sets: uninfected files and malware. A test set is to be created using the files that are not included in the learning file
set. During the pre-processing stage, a totality of features is singled out in the form of a Boolean vector from the
executable files sent to the system input.
    For learning block it is required to single out a totality of feature sequences from the whole totality of files of the
representative learning set. For detection block it is required to single out a feature sequence from one file which was
received at the classification system input.
    Initial data used in the study:

   – 1862 uninfected files (system and program files collected from different Windows operating systems);
   – 1910 malware files (authors private collection);
   – 353 features singled out based on static analysis of executable files.

     As a result of performed static analysis of provided file sets, a feature vector has been generated for each executable
file of both classes. Groups of performed experiments have confirmed the hypothesis that building a malicious code
heuristic analyzer based on the static analysis of executable files is possible.
     To evaluate the effectiveness of the neural network technique the following initial data have been used:
   – learning rate factor (constant) 0.001;

   – accuracy 0.0001;
   – number of iterations for reaching the required learning accuracy has been limited by 200.
   Hyper parameters to adjust were [10]:
   – activation function selection:

          – relu, the rectified linear unit function, returns f (x) = max(0, x);
          – tanh, the hyperbolic tan function, returns f (x) = tanh(x);
          – logistic, the logistic sigmoid function, returns f (x) = 1/(1 + exp(−x));
          – identity, no-op activation, useful to implement linear bottleneck, returns f (x) = x;

   – solver function selection:
          – adam refers to a stochastic gradient-based optimizer proposed by Kingma, Diederik, and Jimmy Ba [11];
          – sgd refers to stochastic gradient descent;
          – lbfgs is an optimizer in the family of quasi-Newton methods;

   – number of hidden layers and neurons.
    The first group of experiments allowed us to select appropriate activation function and solver values. Figure 1
shows the results of experimental evaluation with crossvalidation based on our dataset divided in two sets train (0.7)
and test (0.3) and combination of various parameter values.We used one hidden layer with one hundred neurons. As
a score value we selected F-measure.
    A box plot (box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons
between variables or across levels of a categorical variable. The box shows the quartiles of the F-measure values while
the whiskers extend to show the rest of the distribution, except for points that are determined to be ”outliers” using a
method that is a function of the inter-quartile range. The mean value is shown by line inside box. Figure 1 shows us
approximately equal three possible combination variants: logistic lbfgs, tanh lbfgs, relu adam.


3rd International conference “Information Technology and Nanotechnology 2017”                                              135
          Image Processing, Geoinformation Technology and Information Security / A.V. Kozachok, M.V. Bochkov, E.V. Kochetkov


                               1.00


                               0.99


                               0.98
               F1 score


                               0.97


                               0.96


                               0.95
                                       relu_sgd


                                                                                                                    tanh_sgd
                                                             identity_sgd


                                                                                           logistic_sgd


                                                                                                                                    relu_lbfgs


                                                                                                                                                                                                              relu_adam
                                                                                                                                                        identity_lbfgs


                                                                                                                                                                            logistic_lbfgs


                                                                                                                                                                                              tanh_lbfgs


                                                                                                                                                                                                                                 identity_adam


                                                                                                                                                                                                                                                           logistic_adam


                                                                                                                                                                                                                                                                                       tanh_adam
                                                                                                                                                     Combination variant

                                           Figure 1: F-measure depending on activation function and solver combination variant.

                               1.000

                               0.995

                               0.990

                               0.985
               F1 mean score


                               0.980

                               0.975

                               0.970

                               0.965

                               0.960
                                                  relu_sgd

                                                                            identity_sgd

                                                                                                     logistic_sgd

                                                                                                                         tanh_sgd

                                                                                                                                        relu_lbfgs


                                                                                                                                                                                                           relu_adam
                                                                                                                                                           identity_lbfgs

                                                                                                                                                                            logistic_lbfgs

                                                                                                                                                                                             tanh_lbfgs


                                                                                                                                                                                                                          identity_adam

                                                                                                                                                                                                                                                 logistic_adam

                                                                                                                                                                                                                                                                           tanh_adam


                                                                                                                                                     Combination variant

                Figure 2: F-measure and crossvalidation time depending on activation function and solver combination variant.


    Another group of experiments allowed us to select exactly one combination: tanh lbfgs, because it is the fastest
one. Figure 2 shows mean F-measure value for all variants (black dots) and blue rounds show relative crossvalidation
time.
    The third group of experiments allowed us to select a number of hidden layers and neurons. The initial state was
with one hidden layer with 10 neurons. We used maximum three hidden layers with 100 neurons. The step was equal
to 10 neurons.
    Table 1 shows 15 best F-measure values got during experimental evaluation sorted from max to min.


3rd International conference “Information Technology and Nanotechnology 2017”
                                                                                                                                                                                                                                                                                                   136
               Image Processing, Geoinformation Technology and Information Security / A.V. Kozachok, M.V. Bochkov, E.V. Kochetkov


                      Table 1: F-measure mean crossvalidation score depending on number of hidden layers and neurons
                                         Hidden layer neurons
                                                                      F-measure mean score
                                         1st 2nd        3rd
                                         20   70        80                      0.99110
                                         20   20         0                      0.99109
                                         50   10       100                      0.99092
                                         20   10        70                      0.99083
                                         20   40        40                      0.99082
                                         10   20        60                      0.99074
                                         20   30        70                      0.99074
                                         70   20        30                      0.99074
                                         20   40         0                      0.99074
                                         20   70        60                      0.99073
                                         20   50       100                      0.99047
                                         10 100         90                      0.99047
                                         20   50        50                      0.99047
                                         20   60         0                      0.99047
                                         20   10        50                      0.99047


    First two rows have closed values, but the first neural network configuration consists of three hidden layers, in
spite of the second one with two layers. The first one consumes for about 60% much more time to classify a test set
than the second.
    The performed groups of experiments have enabled us to substantiate this hyper parameter values:
    – activation function – tanh:
    – solver function – lbfgs:
    – 2 hidden layers with 20 neurons.
   As a result of the developed malware detection technique based on neural network application the following values
have been obtained:
    – Accuracy = 0.99125;
    – F-measure = 0.99110;
    – Sensitivity = 0.99425;
    – Specificity = 0.99409.
    Then the classifier has been learnt based on the decision trees composition, and the obtained result has been cross
validated. Figure 3 shows the relation between detection accuracy and number of decision trees. Along with that, the
sets of both classes have been divided with the proportion of 0.7 (learning), 0.3 (test).
    The performed experimental evaluation of the developed solution allows us to substantiate the following hyper
parameters of the classifier:
    – number of decision trees is 85;
    – number of valuable features is 55 (figure 4).
   The following evaluation values have been obtained when choosing the given parameters of the decision trees
composition classifier:
    – Accuracy = 0.99240;


3rd International conference “Information Technology and Nanotechnology 2017”                                                       137
         Image Processing, Geoinformation Technology and Information Security / A.V. Kozachok, M.V. Bochkov, E.V. Kochetkov


                      1.005
                                                                                                         Training score
                                                                                                         Crossvalidation score


                      1.000
           Accuracy


                      0.995


                      0.990


                      0.985
                                0                20                40                 60                   80                    100
                                                                 Number of decision trees
                                               Figure 3: Accurarcy value depending on number of trees.


                      1.0
                      0.9
                      0.8
                      0.7
                      0.6
           Value


                      0.5
                      0.4
                      0.3
                      0.2           Accuracy
                                    FPR(%)
                      0.1           FNR(%)
                                    Feature importances
                      0.0
                            0       10         20       30        40        50       60        70         80        90       100
                                                                  Number of features
                                           Figure 4: Accurarcy value depending on number of features.


– F-measure = 0.99226;

– Sensitivity = 0.99616;
– Specificity = 0.99605.


3rd International conference “Information Technology and Nanotechnology 2017”                                                          138
                                           Image Processing, Geoinformation Technology and Information Security / A.V. Kozachok, M.V. Bochkov, E.V. Kochetkov


                                   5. Conclusion

                                       It should be noted that both approaches have confirmed the hypothesis that building a malicious code heuristic
                                   analyzer based on features singled out during static analysis of executable files is possible. However, the approach
                                   based on the decision trees composition allows obtaining better accuracy value relative to neural network tool with
                                   the above-mentioned initial data and classifier hyper parameter values.
                                       The obtained values of mistake probabilities for developed prototypes comply with the requirements of the Federal
                                   Service for Technical and Export Control [12] imposed to antivirus software.
                                       In conclusion it should be noted that the suggested approach consisting in the application of the space of features
                                   singled out from the executable files at the stage of static analysis and well known machine learning techniques enables
                                   us to implement a new mechanism for heuristic detection of malicious code which provides the possibility to reveal
                                   new and modified malware samples.


                                   References
                                    [1] Kozachok, A. V. Mathematical model of destructive software recognition tools based on hidden markov models [Text] / A. V. Kozachok //
                                        ”Vestnik SibGUTI”. — 2012. — Vol. 3. — P. 29–39. — (in Russian).
                                    [2] Anti-Virus Comparative — malware protection test [Electronic resource]. — 2017. — URL: https://www.av-comparatives.org/wp-
                                        content/uploads/2017/04/avc mpt 201703 en.pdf.
                                    [3] Siddiqui, M. A survey of data mining techniques for malware detection using file features [Text] / Muazzam Siddiqui, Mor-
                                        gan C. Wang, Joohan Lee // Proceedings of the 46th Annual Southeast Regional Conference on XX. — ACM-SE 46. — New York, NY,
                                        USA : ACM, 2008. — P. 509–510. — URL: http://doi.acm.org/10.1145/1593105.1593239.
                                    [4] Detection of malicious code by applying machine learning classifiers on static features: A state-of-the-art survey [Text] / Asaf Shabtai,
                                        Robert Moskovitch, Yuval Elovici, Chanan Glezer // Information Security Technical Report. — 2009. — Vol. 14, no. 1. — P. 16–29. —
                                        Malware. URL: http://www.sciencedirect.com/science/article/pii/S1363412709000041.
                                    [5] Opem: A static-dynamic approach for machine-learning-based malware detection [Text] / Igor Santos, Jaime Devesa, Félix Brezo [et al.] //
                                        International Joint Conference CISIS12-ICEUTE´ 12-SOCO´ 12 Special Sessions / Springer. — Ostrava, Czech Republic : Springer-Verlag
                                        Berlin Heidelberg, 2013. — P. 271–280.
                                    [6] David, B. Structural analysis of binary executable headers for malware detection optimization [Text] / Baptiste David, Eric Filiol,
                                        Kévin Gallienne // Journal of Computer Virology and Hacking Techniques. — 2017. — Vol. 13, no. 2. — P. 87–93. — URL:
                                        http://dx.doi.org/10.1007/s11416-016-0274-2.
                                    [7] Dropout: A simple way to prevent neural networks from overfitting [Text] / Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky [et al.] // J.
                                        Mach. Learn. Res. — 2014. —jan. — Vol. 15, no. 1. — P. 1929–1958. — URL: http://dl.acm.org/citation.cfm?id=2627435.2670313.
                                    [8] Schmid, H. Probabilistic Part-of-Speech Tagging Using Decision Trees [Text] / Helmut Schmid. — Manchester, UK : UMIST, 1994.
                                    [9] Shi, T. Unsupervised learning with random forest predictors [Text] / Tao Shi, Steve Horvath // Journal of Computational and Graphical
                                        Statistics. — 2006. — Vol. 15, no. 1. — P. 118–138.
                                   [10] API design for machine learning software: experiences from the scikit-learn project [Text] / Lars Buitinck, Gilles Louppe, Mathieu Blondel
                                        [et al.] // ECML PKDD Workshop: Languages for Data Mining and Machine Learning. — [S. l. : s. n.], 2013. — P. 108–122.
                                   [11] Kingma, D. Adam: A method for stochastic optimization [Text] / Diederik Kingma, Jimmy Ba // arXiv preprint arXiv:1412.6980. — 2014.
                                   [12] Federal Service for Technology and Export Control. Informational report on antivirus software requirements approval [Text]. — 2012. — (in
                                        Russian).


                                    3rd International conference “Information Technology and Nanotechnology 2017”                                                               139


Powered by TCPDF (www.tcpdf.org)