<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Comparative Analysis of Machine Learning Models with Different Types of Data Representations for Detecting Malicious Files</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alan Nafiiev</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hlib Kholodulkin</string-name>
          <email>kholodulkin.hlib@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrii Rodionov</string-name>
          <email>andrey.rodionov@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dmytro Lande</string-name>
          <email>dwlande@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Information Recording of National Academy of Sciences of Ukraine</institution>
          ,
          <addr-line>Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Institute of Physics and Technology</institution>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Taras Shevchenko National University of Kyiv, Faculty of Computer Science and Cybernetics</institution>
          ,
          <addr-line>Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>367</fpage>
      <lpage>375</lpage>
      <abstract>
        <p>Malware authors are creating increasingly advanced and sophisticated malicious software, and machine learning methods are increasingly used to detect it. However, these solutions can consume substantial computing resources. The problem therefore arises of building a machine learning model that balances training time against malware detection accuracy. In addition, a single method is usually not enough for high-quality file detection, so it is more effective to use several types of data representation. The purpose of this work is to study several machine learning models based on the support vector machine, compare them with each other, and identify the best model for each of two different types of data representation. The data set was collected from various Internet sources and consists of 12824 executable files in .exe format, of which 11844 are malicious and 980 are benign. This article presents recommended methods for feature selection and input data generation for a machine learning model. These methods make it possible to find the best way of preparing the features that describe a malicious file, which are then used to train the models and to determine the model with the best parameters.</p>
        <p>Keywords: PE format, malware detection, feature selection, machine learning, intrusion detection.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, a growing number of malware detection methods that do not rely on the signature
approach have been proposed. These methods mainly use heuristic analysis, behavioral analysis, or a
combination of both to detect malicious software [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Such methods are being actively researched due
to their ability to detect zero-day malware without any prior knowledge of it [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. These approaches
often involve different machine learning models. However, their use always entails several
important challenges that require careful study. Firstly, a suitable set of features that represents
the input data set well must be determined and extracted. Secondly, these features
must be prepared for the machine learning model in such a way that training and prediction use as little time and
computing resources as possible. This work is aimed at solving these two tasks. In recent years, many
studies have been published offering various solutions to these issues [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ]. In these works, fairly
common methods of feature extraction were proposed. In our previous article, we analyzed a set of files
using various machine learning algorithms, with the binary code of each file serving as the features
of the machine learning model [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. To obtain the features, we parsed each executable file according to the PE
format. We also selected the best parameters for each of the classification algorithms.
The support vector machine showed good accuracy and proved promising for further
research. More specifically, the research described in this article focuses on the following points:
      </p>
      <p>1. Determine the approach for generating input data for a machine learning model.</p>
      <p>2. Test the resulting models and use several metrics to identify the optimal model for detecting
malicious files.</p>
      <p>EMAIL: Alan.nafiev@gmail.com (A. Nafiiev)</p>
      <p>2022 Copyright for this paper by its authors.</p>
      <p>The article is structured as follows. Section 2 introduces general information
about the PE format. Section 3 describes the research dataset, how it was generated, and what types of files it contains.
Different types of data representation and feature extraction methods are described in Section 4. Section
5 describes the support vector machine. The conducted experiments and a comparison of the
accuracy of the machine learning models are presented in Section 6. The results of the
analysis of the experimental data are evaluated in Section 7. Section 8 concludes the article and summarizes
the findings of the study.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Portable Executable Format</title>
      <p>
        Most malware targets the Windows operating system. The Portable Executable (PE) file format is a
standard format used by executable files and dynamic link libraries in the Windows operating system. In
addition to the binary code itself, the PE file contains structural information, describes how the OS should
map the program in memory, and which external functions are called through the import address table [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
Features can be extracted from this metadata in the form of n-grams of bytes or n-grams of instructions
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Examining these features gives an indication of the behavior of the program at run time and helps to
distinguish between malicious and harmless Windows binaries. Figure 1 shows a diagram of a PE file.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset Construction</title>
      <p>The most important step in building a machine learning model is to create an input dataset. The
set should also be as well balanced as possible in order to obtain the most
representative results. The initial data set contained 30750 .exe files that
were collected from various Internet sources. The data set was then filtered by removing
identical files. For each file, a sequence of its bytes was collected; Subsection 4.1 details
this process. Two files are considered identical if their sequences match, where the sequence is built from all the byte-level information that could
be collected from the file: DOS_HEADER, FILE_HEADER, OPTIONAL_HEADER,
SECTION_HEADER, Export, Import and the .rsrc section. Malware collections typically contain
duplicates under different names as well as very similar files.</p>
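      <p>The filtering step can be sketched as follows. This is only an illustration: the function name deduplicate, the use of SHA-256 digests to compare sequences, and the sample file names are assumptions and are not taken from the study.</p>
      <preformat>
```python
import hashlib


def deduplicate(sequences: dict) -> dict:
    """Keep the first file name seen for every distinct byte sequence."""
    seen = set()
    unique = {}
    for name, data in sequences.items():
        # Assumption: two files are identical when their digests match.
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[name] = data
    return unique


# Hypothetical byte sequences standing in for the collected PE fields.
files = {
    "a.exe": b"MZ\x90\x00",
    "copy_of_a.exe": b"MZ\x90\x00",
    "b.exe": b"MZ\x00\x01",
}
unique_files = deduplicate(files)
```
      </preformat>
      <p>In this sketch only exact duplicates are removed; very similar but not identical files remain in the set.</p>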
      <p>After files with identical sequences were removed, the data set used in this study was
obtained; it contains 11844 malicious and 980 benign .exe files. The malware files are divided into eight types:
Locker (300), Mediyes (463), Winwebsec (657), Zbot (1973), Zeroaccess (407), CryptoRansom (5856),
Zeus (1413), InstallCore (731). The distribution is shown in Figure 2. Note that
the data set contains far fewer benign files than malicious ones. In a real system this can be a
problem, but in our case such a distribution is acceptable, since our main task is to compare the methods of
generating features with each other. All malware files were taken from the websites virusshare.com and
malicia-project.com. The benign files were taken from the installation folders of legitimate software
from different categories, which can be downloaded from download.cnet.com/windows.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Feature Selection</title>
      <p>As discussed earlier, PE executable files contain a lot of different kinds of information that can be
used to detect malware. We can view each file as a series of bytes, a trace of instructions or function
calls. Different file representation types require different approaches to feature extraction, selection,
and preprocessing.</p>
    </sec>
    <sec id="sec-5">
      <title>4.1. Binary</title>
      <p>
        To generate input data for a machine learning model, useful byte-level information must be extracted from
the executable file. To do this, each file is processed in turn using the Python pefile library. The
functions of this module return the bytes of specific fields from various sections of the PE
file as a string. For each file in the dataset, a list is formed that contains the sequence of bytes
of the file. A visualization of this process can be seen in Figure 4. In this study, two models of binary
representation of data are considered. One model only uses the byte sequence from the headers:
DOS_HEADER, FILE_HEADER, OPTIONAL_HEADER, SECTION_HEADER, Export and Import.
The other model additionally includes the resource section (.rsrc).
      </p>
      <p>The IMAGE_DOS_HEADER structure, as defined in [<xref ref-type="bibr" rid="ref10">10</xref>]:</p>
      <preformat>typedef struct _IMAGE_DOS_HEADER {
    WORD  e_magic;      /* 00: MZ Header signature */
    WORD  e_cblp;       /* 02: Bytes on last page of file */
    WORD  e_cp;         /* 04: Pages in file */
    WORD  e_crlc;       /* 06: Relocations */
    WORD  e_cparhdr;    /* 08: Size of header in paragraphs */
    WORD  e_minalloc;   /* 0a: Minimum extra paragraphs needed */
    WORD  e_maxalloc;   /* 0c: Maximum extra paragraphs needed */
    WORD  e_ss;         /* 0e: Initial (relative) SS value */
    WORD  e_sp;         /* 10: Initial SP value */
    WORD  e_csum;       /* 12: Checksum */
    WORD  e_ip;         /* 14: Initial IP value */
    WORD  e_cs;         /* 16: Initial (relative) CS value */
    WORD  e_lfarlc;     /* 18: File address of relocation table */
    WORD  e_ovno;       /* 1a: Overlay number */
    WORD  e_res[4];     /* 1c: Reserved words */
    WORD  e_oemid;      /* 24: OEM identifier (for e_oeminfo) */
    WORD  e_oeminfo;    /* 26: OEM information; e_oemid specific */
    WORD  e_res2[10];   /* 28: Reserved words */
    DWORD e_lfanew;     /* 3c: Offset to extended header */
} IMAGE_DOS_HEADER, *PIMAGE_DOS_HEADER;</preformat>
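      <p>To illustrate how fields of this structure map to raw bytes, the following sketch reads two of them directly from a byte string. It does not use pefile, which the study relies on. The function name dos_header_summary and the synthetic header are assumptions made for the example. It relies on the structure being 64 bytes long, with the e_magic signature at offset 0x00 and e_lfanew at offset 0x3c, both stored little-endian.</p>
      <preformat>
```python
def dos_header_summary(data: bytes) -> dict:
    """Read e_magic and e_lfanew from the first 64 bytes of a file."""
    if 64 > len(data):
        raise ValueError("input is shorter than IMAGE_DOS_HEADER (64 bytes)")
    # e_magic: little-endian WORD at offset 0x00; "MZ" reads as 0x5A4D.
    e_magic = int.from_bytes(data[0x00:0x02], "little")
    # e_lfanew: little-endian DWORD at offset 0x3C, offset of the PE header.
    e_lfanew = int.from_bytes(data[0x3C:0x40], "little")
    return {"e_magic": e_magic, "is_mz": e_magic == 0x5A4D, "e_lfanew": e_lfanew}


# Synthetic 64-byte header used only for illustration (not a real file).
header = bytearray(64)
header[0:2] = b"MZ"
header[0x3C:0x40] = (0x80).to_bytes(4, "little")
summary = dos_header_summary(bytes(header))
```
      </preformat>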
      <p>
        N  gram representation was one of the first feature generation methods used for malware detection
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. An important difference that this study contains is the use of transition probabilities to build a
Markov chain instead of 2-grams as the underlying representation of the data. Our Markov chain can
be represented as a graph in which the vertices are the states of the process (all possible 256 bytes), and
the edges are transitions between states, and pij is written on the edge from i to j  the probability
of transition from one byte to another.
following conditions are imposed:
      </p>
      <p>pij  0, .
i pij  1 . (2)</p>
      <p>j</p>
      <p>To construct the matrix of transition probabilities for a file, an adjacency matrix of the form (00,
01, ..., ff) × (00, 01, ..., ff) is first built from the byte sequence collected for that file: for each pair of bytes,
the matrix counts how many times the first byte
was immediately followed by the second byte. The transition matrix is then obtained from the adjacency matrix
by normalizing each row so that its entries sum to one. This is done for all files.</p>
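      <p>The construction can be sketched as follows: count adjacent byte pairs, normalize each row, and flatten the result. The function names transition_matrix and flatten and the sample bytes are assumptions made for the illustration. Leaving rows of bytes that never occur as all zeros is also an assumption, since the text does not say how they are handled.</p>
      <preformat>
```python
from collections import Counter


def transition_matrix(data: bytes, n_states: int = 256) -> list[list[float]]:
    """Byte-to-byte transition probabilities estimated from one sequence."""
    # Adjacency counts: how often byte i is immediately followed by byte j.
    counts = Counter(zip(data, data[1:]))
    row_sums = [0] * n_states
    for (i, _j), c in counts.items():
        row_sums[i] += c
    matrix = [[0.0] * n_states for _ in range(n_states)]
    # Normalize each visited row so that it sums to one.
    for (i, j), c in counts.items():
        matrix[i][j] = c / row_sums[i]
    return matrix


def flatten(matrix: list[list[float]]) -> list[float]:
    """Turn the 256 x 256 matrix into one feature vector of 65536 values."""
    return [p for row in matrix for p in row]


sample = bytes.fromhex("4d5a4d5a4d00")  # bytes: 4d 5a 4d 5a 4d 00
P = transition_matrix(sample)
features = flatten(P)
```
      </preformat>
      <p>In this example the byte 4d is followed twice by 5a and once by 00, so the corresponding row holds the probabilities 2/3 and 1/3.</p>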
      <p>After that, the final matrix of dimension 12824 × 256<sup>2</sup> is formed, where 12824 is the total number of files.
Each row corresponds to one file and contains its transition matrix flattened into a vector of
65536 values. A diagram of this process can be seen in Figure 3. Before the
final matrix is passed to the machine learning algorithm, it is divided into a training set
and a test set. Since our dataset does not include many different types of malicious files, it was decided
to use 40 % of the final matrix for training and 60 % for testing.</p>
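      <p>A 40/60 division of the rows can be sketched as below. The text does not say whether the split was stratified, so splitting each class separately is an assumption here, as are the function name stratified_split and the fixed seed.</p>
      <preformat>
```python
import random


def stratified_split(labels, train_fraction=0.4, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Class sizes from the data set: 11844 malicious (1) and 980 benign (0) files.
labels = [1] * 11844 + [0] * 980
train_idx, test_idx = stratified_split(labels)
```
      </preformat>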
      <p>4.2. Trace</p>
      <p>Instruction traces of executable files are also used to detect malware. The disassembled code is
generated by IDA Pro, which is launched from the command line by a Python script. The
program saves the disassembled instructions of each file into an .asm file. An example ASM file is
shown in Figure 5.</p>
      <p>... of the adjacency matrix. As Figure 6 shows, increasing this hyperparameter can significantly reduce the
dimension of the final matrix and, consequently, the training time of the machine learning model.</p>
      <p>After that, an adjacency matrix is built for each file in the same way as in the binary representation.
The number of unique instructions defines the dimension of a square adjacency matrix, in which
each entry counts how many times the second instruction of a pair
immediately followed the first one. Finally, transition matrices are constructed from the adjacency
matrices.</p>
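      <p>The same idea applied to instructions can be sketched as follows. The function name instruction_transitions, the sample trace and the fixed vocabulary are assumptions. Dropping instructions outside the vocabulary before pairing is one possible reading of instruction deletion and is not confirmed by the text.</p>
      <preformat>
```python
def instruction_transitions(mnemonics, vocabulary):
    """Transition probabilities between instructions of one trace."""
    index = {name: k for k, name in enumerate(vocabulary)}
    n = len(vocabulary)
    counts = [[0] * n for _ in range(n)]
    # Assumption: instructions outside the vocabulary are dropped first.
    kept = [index[m] for m in mnemonics if m in index]
    # Count how often the second instruction immediately follows the first.
    for first, second in zip(kept, kept[1:]):
        counts[first][second] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical trace; a real one would be parsed from an .asm file.
trace = ["push", "mov", "call", "mov", "push", "mov", "ret"]
vocab = ["call", "mov", "push", "ret"]
T = instruction_transitions(trace, vocab)
```
      </preformat>
      <p>Using one shared vocabulary for all files keeps the flattened vectors of different files aligned column by column.</p>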
      <p>The transition matrices are flattened into vectors that form the final matrix, as in Figure 3. The number of
features depends on the hyperparameter that sets the number of deleted instructions. As in the
binary representation of the data, the final matrix is divided into a training set and a test set, with
60 % of the rows assigned to the test set.</p>
      <p>The algorithm receives the remaining 40 % of the final matrix as input, where each row contains the transition
matrix of one file. Hyperparameters are selected using cross-validation.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Support vector machine</title>
      <p>
        The support vector machine was chosen because it is one of the most popular algorithms. According to
numerous studies [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ], it shows good results specifically for binary classification. In the classical
kernel-based SVM, a weight vector α is trained, which determines the contribution of each sample
to the separating hyperplane between classes, by solving the following optimization problem:
        <disp-formula><label>(3)</label><tex-math>\min_{\alpha} \left( \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{n} \alpha_i \right),</tex-math></disp-formula>
with restrictions:
        <disp-formula><label>(4)</label><tex-math>\sum_{i=1}^{n} \alpha_i y_i = 0,</tex-math></disp-formula>
        <disp-formula><label>(5)</label><tex-math>0 \leq \alpha_i \leq C.</tex-math></disp-formula>
      </p>
      <p>The model is trained using the SVM algorithm with a squared exponential kernel:
        <disp-formula><label>(6)</label><tex-math>K(x, y) = \exp\left(-\gamma \lVert x - y \rVert^{2}\right).</tex-math></disp-formula>
      </p>
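      <p>A squared exponential kernel of this kind can be computed as in the sketch below. The function names rbf_kernel and gram_matrix, the width parameter gamma and the sample vectors are assumptions made for the illustration. The sketch evaluates only the kernel and does not solve the optimization problem.</p>
      <preformat>
```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Squared exponential kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def gram_matrix(samples, gamma=1.0):
    """Kernel value for every pair of samples."""
    return [[rbf_kernel(a, b, gamma) for b in samples] for a in samples]


# Hypothetical two-dimensional feature vectors.
samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K = gram_matrix(samples, gamma=0.5)
```
      </preformat>
      <p>The kernel equals 1 for identical vectors and decays towards 0 as the distance between them grows. A library implementation, for example the SVC class of scikit-learn with kernel="rbf", uses a kernel of the same form.</p>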
    </sec>
    <sec id="sec-7">
      <title>6. Results</title>
      <p>Tests were carried out for the two types of data representation, with 40 % of the data used for training
and 60 % for testing. The values of the main metrics are given in Table 2.</p>
      <p>The results show that the number of features used has little effect on the result of the classifier.
However, it is still possible to single out the best model by relying on the F-score metric, which is the
harmonic mean of recall and precision. The highest result for the trace type of feature
collection was shown by the variant with 74 instructions (Trace_74). Figure 8 shows the
ROC and Precision-Recall curves. Among the models with the binary type of data representation, the model that includes
the resource section (Binary_r) showed a slight advantage in all metrics. However, its training time also
increased because more data was used. In addition, it is worth noting that the binary representation of the data
showed accuracy comparable with that of the best model trained on instruction traces.</p>
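      <p>For reference, the F-score as the harmonic mean of precision and recall can be computed as in this sketch. The function name precision_recall_f1 and the label vectors are hypothetical and are not results from the study.</p>
      <preformat>
```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and their harmonic mean for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical labels: 1 marks a malicious file, 0 a benign one.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```
      </preformat>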
    </sec>
    <sec id="sec-8">
      <title>7. Discussion</title>
      <p>In this study, we created six machine learning models: four use the trace type of feature collection
and two use the binary type. Based on the results, the optimal model for the trace type of data representation
is the model with 74 instructions. In addition, the training time of this model is much shorter than that
of the variants with 126, 187 and 272 features, which is a significant advantage. It can therefore be argued
that using all available data to train a model does not always give the best accuracy.
Sometimes it is useful to select a small subset of the features on which the trained
model shows the highest accuracy.</p>
      <p>In the binary form of the data representation, the model that includes the resource section of the file showed
slightly higher results for all metrics, while its training time increased only slightly. Therefore, for this
type of data representation, the model with the resource section is the better choice for real malware
detection systems.</p>
    </sec>
    <sec id="sec-9">
      <title>8. Conclusions</title>
      <p>In this work, we created machine learning models derived from two different approaches to
feature extraction. The first approach used the byte-level information of a file with a PE structure. Using the
pefile library, bytes were parsed from various sections of the file and then used to
construct the transition matrices on which the model was trained. The second approach used executable file
instruction traces. The disassembled code was generated by IDA Pro. The instructions were then parsed
and, as in the binary representation of data, adjacency matrices were built for each file using 2-grams.
All models demonstrate very good results in detecting malicious files.</p>
      <p>The best accuracy was demonstrated by the models trained on the binary representation of
data. The model containing the resource section scored the highest across all metrics, in particular
an F0-score of 0.9090, an F1-score of 0.9928 and a roc_auc of 0.9961. The accuracy of our
models may seem too high, but some limitations should be considered. Firstly, the dataset used is not
well balanced. There are an order of magnitude fewer benign files than malicious ones, so the machine
learning model will be better at recognizing malicious files. Within the framework of our study, this is
acceptable, since our goal is to compare all models with each other and identify the best one. However,
a real file recognition system should use a balanced and carefully filtered data set. Secondly,
there are not many different types of malicious files in the collected dataset. Malware samples of the same type
are very similar to each other, which makes it easy for the machine learning algorithm
to recognize such files. Future studies should use as many varieties of malicious files as possible to
obtain a more representative estimate of detection accuracy.</p>
      <p>We also identified the optimal number of trace instructions for training
the model with the support vector machine. The best model was the one trained on the smallest number of
instructions, 74 (roc_auc of 0.9946). In addition, this model requires the least training time,
which is a significant advantage for using the model in real file classification systems.</p>
      <p>It is worth noting that in this paper we chose the support vector machine for building
the machine learning models. One of the main reasons for this choice is future research. In the dissertation,
we use six types of transformations of exe files, both static and dynamic. To combine all the
branches of file transformations into one generalized machine learning model, the different data types
must be matched. To solve this task, we decided to use SVM, since this algorithm uses a kernel, which
can be used to combine different types of data.</p>
    </sec>
    <sec id="sec-10">
      <title>9. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Abri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Siami-Namini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Khanghah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Soltani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Namin</surname>
          </string-name>
          ,
          <article-title>The performance of machine and deep learning classifiers in detecting zero-day vulnerabilities</article-title>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. K. S.</given-names>
            <surname>Anand</surname>
          </string-name>
          <string-name>
            <surname>Hand</surname>
          </string-name>
          , Ashu Sharma,
          <article-title>Machine learning in cybersecurity: A review, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery (July</article-title>
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <article-title>Zero-day malware detection based on supervised learning algorithms of api call signatures</article-title>
          ,
          <source>Australasian Data Mining Conference (AusDM11)</source>
          ,
          <year>December 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <article-title>Pe file-based malware detection using machine learning</article-title>
          (
          <year>January 2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kutlay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Karađuzović-Hadžiabdić</surname>
          </string-name>
          ,
          <article-title>Static based classification of malicious software using machine learning methods (</article-title>
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>O. N. U.</given-names>
            <surname>Hasan H. Al-Khshali</surname>
          </string-name>
          ,
          <article-title>Muhammad Ilyas, Effect of pe file header features on accuracy</article-title>
          ,
          <source>2020 IEEE Symposium Series on Computational Intelligence (SSCI)</source>
          ,
          <year>December 2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nafiiev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kholodulkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodionov</surname>
          </string-name>
          ,
          <article-title>Comparative analysis of machine learning methods for detecting malicious files</article-title>
          ,
          <source>Theoretical and Applied Cybersecurity</source>
          , Algorithms and methods of cyber attacks prevention and counteraction (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>M. M. F. M. Zubair Shafiq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Momina</given-names>
            <surname>Tabish</surname>
          </string-name>
          ,
          <article-title>Pe-miner: Mining structural information to detect malicious executables in realtime</article-title>
          (
          <year>September 2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S. L. Imad</given-names>
            <surname>Abdessadki</surname>
          </string-name>
          ,
          <article-title>A new classification based model for malicious pe files detection</article-title>
          ,
          <source>International Journal of Computer Network and Information Security</source>
          (
          <year>June 2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dösinger</surname>
          </string-name>
          ,
          <year>2016</year>
          . https://github.com/wine-mirror/wine/blob/master/include/winnt.h#
          <fpage>L2253</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Edward Raff</surname>
          </string-name>
          , Richard Zak,
          <article-title>other, An investigation of byte n-gram features for malware classification</article-title>
          ,
          <source>Journal of Computer Virology and Hacking Techniques (February</source>
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lifshits</surname>
          </string-name>
          ,
          <article-title>Algorithms for internet: Support vector machines (</article-title>
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Caruana</surname>
          </string-name>
          ,
          <article-title>An empirical comparison of supervised learning algorithms (</article-title>
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>