<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Malware Classification Using Static Disassembly and Machine Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>nshuo Ch</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eoin Brophy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dublin City University</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Insight Centre for Data Analytics</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Network and system security are critical issues. Due to the rapid proliferation of malware, traditional analysis methods struggle with the enormous number of samples. In this paper, we propose four small-scale and easy-to-extract features, including sizes and permissions of PE sections, content complexity, and import libraries, to classify malware families, and use automatic machine learning to search for the best model and hyper-parameters for each feature and their combinations. Compared with detailed behavior-related features like API sequences, the proposed features provide macroscopic information about malware. The analysis is based on static disassembly scripts and hexadecimal machine code. Unlike dynamic behavior analysis, static analysis is resource-efficient and offers complete code coverage, but is vulnerable to code obfuscation and encryption. The results demonstrate that features which work well in dynamic analysis are not necessarily effective when applied to static analysis. For instance, API 4-grams only achieve 57.96% accuracy and involve a relatively high dimensional feature set (5000 dimensions). In contrast, the novel proposed features together with a classical machine learning algorithm (Random Forest) achieve very good accuracy at 99.40% with a much smaller feature vector (40 dimensions). We demonstrate the effectiveness of this approach through integration in IDA Pro, which also facilitates the collection of new training samples and subsequent model retraining.</p>
      </abstract>
      <kwd-group>
        <kwd>Malware Classification</kwd>
        <kwd>Reverse Engineering</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>System Security</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Network and system security are incredibly critical issues at the moment.
According to [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], 142 million threats were being blocked every day in 2019.
Furthermore, new types of malware are appearing all the time and are increasingly
aggressive. For instance, the use of malicious PowerShell scripts increased by
1000% in the same year. To make matters worse, anti-anti-virus techniques used
by attackers are also steadily improving. The use of polymorphic engines allows
malware developers to mutate existing code while retaining the original
functions unchanged. This is achieved, for example, through the use of obfuscation
and encryption. Code obfuscation uses needlessly roundabout expressions and
data to make source or machine code challenging for humans to understand.
Attackers can also use jump instructions to make the real execution flow differ
from the disassembly script. Code encryption packs and encrypts executable files
on the disk, which then decrypt themselves during execution. Such files are nearly
impossible to analyze by static disassembly alone; analysts must instead execute
them and review system logs.
      </p>
      <p>Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)</p>
      <p>These anti-anti-virus techniques have now led to a rapid proliferation of
malware which traditional analysis methods struggle to cope with as these rely on
signature matching and heuristic rules. They have the same drawback: unseen
samples must be manually analyzed before creating signatures or heuristic rules.</p>
      <p>
        However, analysts cannot review each unknown file in practice. Machine learning
approaches in contrast do not rely on understanding code and malicious
behaviors. After training with a wide range of known samples, such methods can
identify potential malware more easily than human experts. Some automatic
models have been applied in related fields, such as malware homology analysis
by dynamic fingerprints in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and gray-scale image representation of malware
in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which did not require disassembly or code execution.
      </p>
      <p>We adopt a machine learning approach in this work. The primary exploration
and experiments of this paper are as follows:</p>
      <list list-type="bullet">
        <list-item><p>The API n-gram, an efficient dynamic behavior feature usually generated by
system event logging, is applied to static analysis. The result demonstrates
that actual API sequences are hard to extract from disassembly scripts.
Inaccurate n-grams have a substantial negative impact on classification.</p></list-item>
        <list-item><p>A simpler variant of the import library feature is proposed, described in
Section 3.4. It uses One-Hot Encoding to indicate whether a library is imported
by malware. Compared with the number of APIs imported from each library,
this variant is easier to extract and more reliable, because hiding import
libraries is more difficult than hiding API calls. However, this feature does
not provide as much detailed information as API calls.</p></list-item>
        <list-item><p>Three new small-scale features are proposed: section sizes, section
permissions, and content complexity. The feature descriptions are in Section 3.5,
Section 3.6, and Section 3.7 respectively. They can provide macroscopic
information about malware.</p></list-item>
        <list-item><p>Automatic machine learning is used to find the best model and
hyper-parameters for each feature and each feature combination.</p></list-item>
        <list-item><p>A method of using the classifier in practice is proposed. With the help of
IDA Pro<sup>3</sup>, the most popular reverse analysis tool, new training data can be
generated from the latest known malware. IDA Pro also provides a Python
development kit, with which the classifier proposed here is implemented as an
IDA Pro plug-in. This allows an analyst using IDA Pro to process a malware
sample and perform classification immediately within their workflow.</p></list-item>
      </list>
      <sec id="sec-1-1">
        <title>3 https://hex-rays.com/products/ida/support/idadoc/index.shtml</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>RELATED WORK</title>
      <p>In this section, we primarily discuss malware analysis combined with machine
learning and hand-designed static features. Since their processes can be
explained in terms of these underlying, interpretable features, analysts can
undertake deeper exploration according to the model output. We also examine the
small number of deep learning models which have been applied to the problem.
These utilize image and byte representations to achieve their goals.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], Zheng Rongfeng et al. used strings, registry changes and API
sequences to distinguish whether a new sample was a variant of a known sample.
One shortcoming of this method was that accurate registry changes and API
sequences must be recorded at execution, which requires costly virtual machines.
Moreover, not all malicious behaviors can be recorded, because the conditions
that trigger them may not arise during execution. Another serious issue is that
malware can hide signatures through polymorphic engines and packers in order to
be harder for anti-virus software to detect. In [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Maleki Nahid et al. proposed a binary classification
system for packed samples. They unpacked files and extracted PE Headers [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
then used the forward selection method to pick seven features. Unlike the
previous method, their model did not completely rely on detailed assembly instructions and
system behaviors, but introduced macroscopic information, such as the number
of executable sections and the debug information flag. However, it was unsuitable
for files that could not be unpacked.
      </p>
      <p>
        In the face of these problems, which are difficult to solve by fingerprinting,
Nataraj et al. proposed an innovative method in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. They converted a malware
sample's byte content into a gray-scale image. The images belonging to the same
family appear similar in layout and texture. They used the K-Nearest Neighbors
algorithm to determine whether the samples were derived from the same origin.
Neither disassembly nor code execution was required, and simple transformations
by polymorphic engines usually do not affect the general image layout. One
limitation of image representation is that two malware images can be
similar even if they belong to different families, because the same visual resources,
like icons and user interface components, are used among these samples. And
in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Gibert et al. argued that image representation might produce non-existent
spatial correlations between pixels in different rows.
      </p>
      <p>
        In recent years, some deep learning models have also been applied in the
field of malware classification. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Kalash et al. combined Convolutional
Neural Networks with gray-scale image representation. Their model architecture was
based on VGG-16 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and achieved 99.97% accuracy, which is the best result we
have found to date. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Raff et al. built a Convolutional Neural Network and
used all bytes of a sample as raw data. But instead of training the network on
raw bytes, they inserted an embedding layer to map each byte to a fixed-length
feature vector. This reduces incorrect correlations between bytes: treated as raw
numbers, certain byte values are closer to each other than to others, which is
meaningless in the context of assembly instructions. Using Convolutional
Neural Networks with global max-pooling could also increase the robustness when
facing minor alterations in bytes. In contrast, traditional byte n-gram methods
are dependent on exact matches. For deep learning models with byte-based
representation, Gibert et al. thought the main advantage of such an approach is
that it can be applied to samples from different systems and hardware, because
they are not affected by file formats [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, the size of byte sequences is
too large and the meaning of each byte is context-dependent. Byte-based
representation does not contain this information. Another challenge is that adjacent
bytes are not always correlated because of jumps and function calls.
      </p>
      <p>
        Apart from these models, Gibert et al. mentioned several challenges in the
face of malware analysis [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. One of them is Concept Drift. In many other machine
learning applications like digit classification, the mapping learned from
historical data will be valid for new data in the future, and the relationship between
input and output does not change over time. But for malware, due to function
updates, code obfuscation and bug fixes, the similarity between previous and
future versions will degrade slowly over time, decaying the detection accuracy.
Furthermore, the interpretation of models and features should also be considered.
When an incorrect classification happens, analysts need to understand why and
know how to fix it. This is challenging without clear interpretability and
explainability. Even in the absence of misclassifications, analysts prefer to understand
how a classification has been arrived at. This is the main reason why we did not
choose a deep learning model.
      </p>
    </sec>
    <sec id="sec-4">
      <title>FEATURE EXTRACTION</title>
      <p>
        The dataset used in the paper is from the 2015 Microsoft Malware Classification
Challenge [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. It contains 10868 malware samples representing a mix of nine
families. Each sample has two files of different forms: machine code and a
disassembly script generated by IDA Pro. In practice, malware is in the form of
executable files. However, for safety reasons, Microsoft does not provide raw files
but only processed machine code and disassembly scripts. Without executable
files, dynamic analysis cannot be conducted, so all features can only be extracted
from static text. The features described in Section 3.4, Section 3.5, Section 3.6
and Section 3.7 are proposed by us. The API n-gram in Section 3.2 is an effective
feature in dynamic behavior analysis. We tested it to check whether it is also applicable
to static disassembly analysis.
      </p>
      <sec id="sec-4-1">
        <title>File Size</title>
        <p>The file size is the simplest feature, containing the sizes of the disassembly and
machine code files, and their ratios. File sizes vary according to the functional
complexity of different malware families, and size ratios may indicate code
encryption. If a sample is encrypted, disassembly may fail and the ratio of its
machine code size to the disassembly size will be different from the other samples.</p>
      </sec>
      <sec id="sec-4-2">
        <title>API 4-gram</title>
        <p>The API sequence is one of the most commonly used features. It directly uses
malicious or suspicious API sequences to classify malware. Each malware family has
distinct functions. For instance, Lollipop is adware that shows advertisements as
users browse websites. Ramnit can disable the system firewall and steal sensitive
information. These functions rely on different APIs.</p>
        <p>API sequences should ideally be extracted by dynamic execution, since this reduces
the negative impact of code encryption. However, because of the dataset
limitation, we can only use regular expressions to match call and jmp instructions
whose target is an imported API in static disassembly scripts. This has a huge
negative impact on accuracy, which we discuss in detail in Section 5.
In total, 402972 API 4-grams were extracted and only the 5000 most frequent
items were retained.</p>
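        <p>The extraction described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the regular expression assumes that imported APIs appear in the listing as "call ds:Name" or "jmp ds:__imp_Name", and the function names api_sequence, ngrams and top_ngrams are hypothetical.</p>
        <preformat><![CDATA[
```python
import re
from collections import Counter

# Assumed listing style: "call ds:CreateFileA" or "jmp ds:__imp_ExitProcess".
API_CALL = re.compile(r"\b(?:call|jmp)\s+(?:dword ptr\s+)?ds:(?:__imp_)?(\w+)")


def api_sequence(asm_lines):
    """Collect imported API names in the order they appear (linear scan)."""
    calls = []
    for line in asm_lines:
        match = API_CALL.search(line)
        if match:
            calls.append(match.group(1))
    return calls


def ngrams(items, n=4):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def top_ngrams(samples, n=4, keep=5000):
    """Count n-grams over all samples and keep the most frequent ones."""
    counts = Counter()
    for asm_lines in samples:
        counts.update(ngrams(api_sequence(asm_lines), n))
    return [gram for gram, _ in counts.most_common(keep)]
```
]]></preformat>
        <p>Because the scan follows the textual order of the listing rather than the execution order, the resulting sequence only approximates the calls a running sample would make.</p>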
      </sec>
      <sec id="sec-4-3">
        <title>Opcode 4-gram</title>
        <p>The opcode sequence is also commonly used. It focuses on disassembly
instructions. Opcodes are defined by CPU architectures, not by systems as in the case
of APIs, so they are compatible with different systems built on the same
architecture. In total, 1408515 opcode 4-grams were extracted and only the 5000 most frequent
items were retained.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Import Library</title>
        <p>As mentioned in Section 3.2, each malware family has distinct functions. They
must import system or third-party libraries to achieve them. So a typical machine
learning feature is the number of APIs per imported library used by malware.
But API numbers would be inaccurate if malware calls an API with dynamic
methods such as GetProcAddress.</p>
        <p>We propose a simpler variant, using One-Hot Encoding to indicate whether
a library is imported by malware. It is easier to extract and more reliable because
it is not as susceptible to anti-anti-virus techniques
as the number of APIs. There are 570 different import libraries in the dataset.
The 300 with the highest number of occurrences among them were retained.
Fig. 2 demonstrates how this feature distinguishes Obfuscator.ACY from others
and provides rough ideas about functionality. Crypt32 is a cryptographic library.
Obfuscator.ACY may rely on it for encryption. Most samples from other classes are
not encrypted so they do not import this library. We also extracted top libraries
based on Gini Impurity as in Fig. 1.</p>
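        <p>A minimal sketch of this encoding is given below, assuming each sample has already been reduced to a list of imported library names. The function names select_libraries and one_hot are hypothetical, and lowercasing the names is our assumption (Windows library names are case-insensitive), not a step stated above.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def select_libraries(samples, keep=300):
    """Pick the libraries imported by the largest number of samples."""
    counts = Counter()
    for libs in samples:
        # A set counts each library at most once per sample.
        counts.update({lib.lower() for lib in libs})
    return sorted(lib for lib, _ in counts.most_common(keep))


def one_hot(libs, vocabulary):
    """Encode one sample as 0/1 flags, one per library in the vocabulary."""
    present = {lib.lower() for lib in libs}
    return [1 if lib in present else 0 for lib in vocabulary]
```
]]></preformat>
        <p>With keep=300 the resulting vector has the 300 dimensions reported for this feature.</p>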
      </sec>
      <sec id="sec-4-5">
        <title>PE Section Size</title>
        <p>PE files consist of several sections. Each section stores different types of bytes
and has attributes. The number of sections, their uses and attributes are defined
by software development tools and programmers based on functionality.</p>
        <p>This feature focuses on section sizes. Each section has two types of sizes: a
virtual size and a raw size. They are the VirtualSize and SizeOfRawData fields of the
structure IMAGE_SECTION_HEADER in PE Headers, respectively. The raw size is
the size of a section on the disk, and the virtual size is the size of a section
when it has been loaded into memory. For instance, a section may store only
uninitialized data whose values are only available after startup. There is no need
to allocate space for it on the disk, so the raw size is zero, but the virtual size is
not. The ratio of the two types of sizes is also included in the feature. The dataset
contains 282 sections with different names. Each section has three attributes, so
the full feature has 846 dimensions. After feature selection using Random Forest
based on Gini Impurity, only the 25 most important dimensions were retained.
Most of them are standard sections defined by software development tools, as in
Fig. 3.</p>
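        <p>To make the two size fields concrete, the sketch below reads them from a raw 40-byte IMAGE_SECTION_HEADER. It assumes that such a header is available, which is the case for raw executables but not for the challenge dataset, where the sizes have to be recovered from the disassembly text. The function name section_size_features and the direction of the ratio (virtual size divided by raw size) are our assumptions.</p>
        <preformat><![CDATA[
```python
import struct

# IMAGE_SECTION_HEADER is 40 bytes, little-endian: Name[8], VirtualSize,
# VirtualAddress, SizeOfRawData, PointerToRawData, PointerToRelocations,
# PointerToLinenumbers, NumberOfRelocations, NumberOfLinenumbers,
# Characteristics.
SECTION_HEADER = struct.Struct("<8sIIIIIIHHI")


def section_size_features(header_bytes):
    """Return (name, [virtual size, raw size, ratio]) for one section header."""
    fields = SECTION_HEADER.unpack(header_bytes)
    name = fields[0].rstrip(b"\x00").decode("ascii", errors="replace")
    virtual_size, raw_size = fields[1], fields[3]
    # Ratio direction is an assumption; 0.0 avoids dividing by a zero raw size.
    ratio = virtual_size / raw_size if raw_size else 0.0
    return name, [virtual_size, raw_size, ratio]
```
]]></preformat>
        <p>A section holding only uninitialized data, as in the example above, yields a raw size of zero and a non-zero virtual size.</p>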
      </sec>
      <sec id="sec-4-6">
        <title>PE Section Permission</title>
        <p>PE sections have access permissions, which are combinations of readable, writable
and executable. We calculated the total size of readable data, writable data and
executable code separately for each malware sample. As with the previous feature, each
permission has three attributes: a virtual size, a raw size and the ratio of the two
sizes. This feature can be regarded as a summary of PE section sizes and
provides a more general view with only nine fixed dimensions. For example, Fig. 4
shows the distribution of writable virtual sizes. Four Backdoor classes (3, 5, 7,
and 9) have relatively large writable space. This is possibly because they need
to steal sensitive information or download other files from the Internet to run,
which requires enough memory space.</p>
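        <p>The aggregation can be sketched as follows, assuming each section is given as a (virtual size, raw size, characteristics flags) tuple. The three flag values are the standard PE section characteristics; the function name permission_features, the input layout and the ratio direction are illustrative assumptions.</p>
        <preformat><![CDATA[
```python
# Standard PE section characteristics flags.
IMAGE_SCN_MEM_EXECUTE = 0x20000000
IMAGE_SCN_MEM_READ = 0x40000000
IMAGE_SCN_MEM_WRITE = 0x80000000

PERMISSIONS = (
    ("readable", IMAGE_SCN_MEM_READ),
    ("writable", IMAGE_SCN_MEM_WRITE),
    ("executable", IMAGE_SCN_MEM_EXECUTE),
)


def permission_features(sections):
    """Sum sizes per permission; `sections` holds (virtual, raw, flags) tuples."""
    features = []
    for _, flag in PERMISSIONS:
        virtual = sum(v for v, _, flags in sections if flags & flag)
        raw = sum(r for _, r, flags in sections if flags & flag)
        # Ratio direction is an assumption; 0.0 avoids dividing by zero.
        ratio = virtual / raw if raw else 0.0
        features.extend([virtual, raw, ratio])
    return features  # 3 permissions x 3 attributes = 9 dimensions
```
]]></preformat>
        <p>A section that is both readable and writable contributes to both totals, so the three groups are not disjoint.</p>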
        <p>Additionally, we think these two PE section features (sizes and permissions)
are compatible with Linux systems. Linux uses the Executable and Linkable
Format (ELF) for executable files, which has similar section structures to the PE
format.</p>
      </sec>
      <sec id="sec-4-7">
        <title>Content Complexity</title>
        <p>Content complexity is a new feature type for malware classification. What we
propose here has six fixed dimensions: the original sizes, compressed sizes and
compression ratios of disassembly and machine code files. We used Python's
zlib library to compress samples and recorded size changes. This approximates
function complexity, code encryption and obfuscation. Fig. 5 is from the sample
with the largest disassembly compression ratio of 12.8. It might be obfuscated
with repetitive, roundabout instructions. In contrast, Fig. 6 has the smallest
disassembly compression ratio of 2.3. The disassembly failed and IDA Pro could
only output its original machine code. This is because the sample is encrypted
and packed by UPX, a well-known open-source packer for executable files. In addition,
the use of complex, rare instructions can also lead to low compression
ratios.</p>
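        <p>A minimal sketch of the six values, using zlib as described above. The function name complexity_features is hypothetical, and the ratio is taken as original size divided by compressed size, which is our reading of the reported values (repetitive content gives a high ratio) rather than a definition stated in the text.</p>
        <preformat><![CDATA[
```python
import zlib


def complexity_features(asm_bytes, machine_code_bytes):
    """Return six values: original size, compressed size and ratio per file."""
    features = []
    for data in (asm_bytes, machine_code_bytes):
        compressed = len(zlib.compress(data))
        # Ratio taken as original / compressed, so repetitive content scores high.
        ratio = len(data) / compressed
        features.extend([len(data), compressed, ratio])
    return features
```
]]></preformat>
        <p>Highly repetitive input compresses to a small fraction of its size, while input with little redundancy, such as encrypted or packed content, stays close to a ratio of one.</p>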
        <p>In theory, this feature can be used directly in the malware classification of
any other CPU architecture and system. It has better compatibility than others
because it does not rely on any platform-related characteristics and structures.
However, CPUs have different instruction sets. For instance, Intel x86 is based
on Complex Instruction Set Computing, while ARM is based on Reduced
Instruction Set Computing. This may affect classification accuracy.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>EXPERIMENTS</title>
      <p>
        For each feature and each feature combination, we used the automatic machine learning
library auto-sklearn to search for the best parameters, relying on Bayesian
optimization, meta-learning and ensemble construction [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. 80% of the dataset was
used as a training set and auto-sklearn evaluated models on it using 5-fold
cross-validation. The models include K-Nearest Neighbors, Support Vector
Machine and Random Forest. All experiments were conducted on 64-bit Ubuntu,
Intel(R) Core(TM) i7-6700 CPU (3.40GHz) with 12GB RAM. Each model's
parameter search process lasted up to one hour. After auto-sklearn had
determined a model's optimal parameters, we used the remaining 20% as a test set
to calculate classification accuracy. The results are shown in Table 1, sorted in
increasing order of accuracy. Random Forest provided the best performance in
all experiments.
      </p>
      <p>Among individual features, opcode 4-grams provided the highest accuracy of
99.08%, which suggests that static disassembly has little negative impact on
opcode 4-grams. They are effective in both dynamic and static analysis, but their
extraction requires much time and computational resources. The original
dimension of opcode 4-grams before feature selection is the largest (1408515). Content
complexity, PE section sizes and PE section permissions achieved 98.11%, 97.75%
and 97.01% accuracy respectively, which is satisfactory considering they are low
dimensional representations. Import libraries did not perform very well, but the
prediction paths generated by a Decision Tree of import libraries can provide
functionality comparisons between malware families, like Fig. 2. Other features
except API 4-grams cannot do this. At the beginning, we expected that API
sequences would be an effective feature in static disassembly analysis, as they are
in dynamic behavior analysis. Unexpectedly, the API 4-gram is the worst. Its
accuracy is only 57.96% and it involves a 5000-dimensional feature vector
representation. Our result shows that API sequences may only be applicable to dynamic
behavior analysis. The data errors caused by static disassembly extraction have
a very negative effect on feature validity.</p>
      <p>
        Among integrated features, the combination of PE section sizes, PE section
permissions and content complexity is almost the best with 99.40% accuracy
and 40 dimensions. When all features were used, the accuracy improved by only
0.08%, while the number of dimensions increased dramatically to 10343.
Additionally, the highest accuracy of 99.48% we achieved is still lower than 99.97% in
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. If interpretability is not considered, the combination of Convolutional
Neural Networks and gray-scale images they used is obviously an excellent model.
      </p>
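      <p>The dimension counts quoted above can be checked with a short calculation. The per-feature dimensions are those reported in Sections 3.2 to 3.7; the value of three for the file size feature is not stated explicitly and is inferred here from the reported total of 10343.</p>
      <preformat><![CDATA[
```python
# Dimensions per feature as reported in the feature extraction section.
DIMENSIONS = {
    "file_size": 3,  # inferred from the reported total, not stated explicitly
    "api_4gram": 5000,
    "opcode_4gram": 5000,
    "import_library": 300,
    "pe_section_size": 25,
    "pe_section_permission": 9,
    "content_complexity": 6,
}

# The compact combination highlighted in the experiments.
compact = ["pe_section_size", "pe_section_permission", "content_complexity"]
compact_dims = sum(DIMENSIONS[name] for name in compact)
all_dims = sum(DIMENSIONS.values())
```
]]></preformat>
      <p>Under these counts the compact combination uses 40 of the 10343 dimensions, which is under 0.4% of the full feature vector.</p>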
    </sec>
    <sec id="sec-6">
      <title>LIMITATIONS OF STATIC DISASSEMBLY</title>
      <p>The dataset contains only static text. In general, this negatively affects classification
accuracy. We identified three specific problems. They mainly affect the
extraction of API sequences and import libraries, and do not have serious
impacts on the three new features we proposed.</p>
      <sec id="sec-6-1">
        <title>Lazy Loading</title>
        <p>In the process of extracting import libraries, only the libraries in the Import Table,
a structure in PE Headers used to import external APIs, can be extracted.
These libraries will be automatically loaded when malware starts. In order
to make malicious behavior more hidden, developers can use lazy loading to
load a library just before it is about to be used. Lazily loaded libraries cannot
be extracted from static disassembly scripts. As shown in Fig. 1, top libraries
are ubiquitous and have no special significance for malware classification. A
reasonable speculation is that sensitive libraries are lazily loaded and PE Headers
only contain regular libraries.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Name Mangling</title>
        <p>Compared with import libraries, the API sequence is more negatively affected.
We found two reasons. The first is Name Mangling. It allows different
programming entities to be named with the same identifier, as in C++ overloading.
Compilers can select the appropriate function based on parameters, which is convenient
for programmers. Internally, however, compilers need different identifiers to distinguish
them. Name mangling adds noise to the API n-gram extraction. For the same or
similar functions, we may extract more than one name. A theoretical solution is
to convert mangled names back to the same original name. However, in practice,
it is challenging to develop converters for every possible compiler and language.
Moreover, some compilers do not disclose their detailed name mangling
mechanism.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Jump Thunk</title>
        <p>Jump Thunk is the second reason for the poor performance of API sequences.
Many compilers generate a jump thunk, a small code snippet, for each external
API, then convert all calls to the API into calls to its jump thunk. This
mechanism can provide an interface proxy, but it makes API sequences inaccurate
when linear scanning is used to extract external API calls. Theoretically, we can
recognize jump thunks and match them to external APIs. But thunks' names
are random and their contents may be more complex than a single jump instruction.</p>
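        <p>The theoretical matching mentioned above can be illustrated for the simplest case only: a thunk whose whole body is a single jump to an imported API. The sketch assumes the disassembly has already been split into functions (a mapping from function name to its instruction lines) and that imports are written as "ds:Name" or "ds:__imp_Name"; the function names find_thunks and resolve_calls are hypothetical. It was not part of the experiments and does not handle the more complex thunks described above.</p>
        <preformat><![CDATA[
```python
import re

JMP_IMPORT = re.compile(r"^\s*jmp\s+ds:(?:__imp_)?(\w+)\s*$")
CALL_IMPORT = re.compile(r"^\s*call\s+ds:(?:__imp_)?(\w+)\s*$")
CALL_LOCAL = re.compile(r"^\s*call\s+(\w+)\s*$")


def find_thunks(functions):
    """Map a function name to the API it forwards to, if its body is one jmp."""
    thunks = {}
    for name, body in functions.items():
        if len(body) == 1:
            match = JMP_IMPORT.match(body[0])
            if match:
                thunks[name] = match.group(1)
    return thunks


def resolve_calls(instructions, thunks):
    """Return the API call sequence, replacing calls to thunks by their targets."""
    apis = []
    for line in instructions:
        direct = CALL_IMPORT.match(line)
        local = CALL_LOCAL.match(line)
        if direct:
            apis.append(direct.group(1))
        elif local and local.group(1) in thunks:
            apis.append(thunks[local.group(1)])
    return apis
```
]]></preformat>
        <p>Calls to ordinary local functions are ignored, so only direct import calls and calls through recognized thunks contribute to the sequence.</p>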
      </sec>
    </sec>
    <sec id="sec-7">
      <title>PRACTICAL APPLICATION</title>
      <p>
        As discussed in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the similarity between previous and future malware will
degrade over time due to function updates and polymorphic techniques.
Polymorphic techniques can automatically and frequently change identifiable
characteristics like encryption types and code distribution to make malware unrecognizable
to anti-virus detection. To address this, we designed an automatic malware
classification workflow to apply and enhance our classifier in practice with IDA Pro's
Python development kit, as shown in Fig. 7. The source code is available on
GitHub<sup>4</sup> and provides the following practical features:
      </p>
      <sec id="sec-7-1">
        <title>1. Data Generation</title>
        <p>In general, analysts can only collect raw executable malware, not disassembly
scripts like those provided in the dataset. To generate new training data, we
developed an IDA Pro script that can be run from the command line with
IDA Pro's parameters -A and -S, which launch IDA Pro in autonomous
mode and make it run a script. For each executable file, it produces
disassembly instructions and hexadecimal machine code, relying on IDA Pro's
disassembler. These two output files are in the same format as the files used
for training in the dataset.</p>
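        <p>As an illustration of this command-line use, the sketch below only builds the argument list and does not launch anything. The -A and -S switches are the ones named above; the executable name idat64, the script name export.py and the helper name ida_batch_command are assumptions for the example.</p>
        <preformat><![CDATA[
```python
def ida_batch_command(sample, script, ida="idat64"):
    """Build the command line that runs `script` on `sample` without dialogs."""
    # -A: autonomous mode; -S<script>: run the given script on the database.
    # The executable name "idat64" is an assumption and depends on the install.
    return [ida, "-A", f"-S{script}", str(sample)]


# Hypothetical usage: one command per collected executable.
commands = [ida_batch_command(name, "export.py") for name in ("a.exe", "b.exe")]
```
]]></preformat>
        <p>Each list can then be passed to a process launcher such as Python's subprocess.run to process a folder of samples unattended.</p>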
      </sec>
      <sec id="sec-7-2">
        <title>2. Automatic Classification</title>
        <p>
          We used another automatic machine learning library, TPOT, to search for
the best model for the feature combination of PE section sizes, PE
section permissions and content complexity. We think this combination
maintains a good balance between accuracy and the number of dimensions. TPOT
achieved 99.26% accuracy, slightly lower than auto-sklearn (99.40%).
Unlike auto-sklearn, TPOT uses Genetic Programming to optimize models [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
Once the search is complete, it provides Python code for the best pipeline;
auto-sklearn does not have a similar function. With the fitted model, we
developed an IDA Pro classifier plug-in. When an analyst opens a malware
sample with IDA Pro, the plug-in produces the required raw input files,
calculates the features and performs the classification, as in Fig. 8.
        </p>
      </sec>
      <sec id="sec-7-3">
        <title>3. Manual Classification</title>
        <p>Although automatic classification is very useful, the result may be inaccurate
or in doubt. The plug-in therefore provides a simple means for analysts to
perform in-depth analysis manually to determine a sample's exact family.</p>
      </sec>
      <sec id="sec-7-4">
        <title>4. Model Training</title>
        <p>With sufficient output files and labels of the latest samples, the classifier can
be retrained and strengthened either manually or in an automated fashion.</p>
        <p>Our model was trained on these nine malware families only, so if an
input sample does not belong to them, the model will produce an incorrect result or
classify the sample into the family that is most similar to its actual type. But
theoretically, these features are applicable to more families if more datasets are
available.</p>
        <sec id="sec-7-4-1">
          <title>4 https://github.com/czs108/Microsoft-Malware-Classification</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>CONCLUSION AND FUTURE WORK</title>
      <p>This paper demonstrates how novel, highly discriminative features of relatively
low dimensionality, when combined with automatic machine learning approaches,
can provide highly competitive accuracy for malware classification.
Compared with traditional manual analysis, machine learning can provide a fast
and accurate classifier after training on the latest malware samples. It does not
rely on an understanding of code. Unlike API and opcode n-grams, which aim
to match specific malicious operations, our features focus more on macroscopic
information about malware. In theory, these features are more compatible with
multiple operating systems and not susceptible to code encryption. One
shortcoming is that they cannot offer a detailed understanding of malicious behaviors
as API sequences can. Analysts must combine multiple features in order to
perform more in-depth analysis. In addition, the limitations and negative effects of
static text are more severe than we thought, especially for API n-grams. It is
challenging to extract exact API sequences from disassembly scripts with linear
scanning.</p>
      <p>We conclude with a number of open avenues for research that might reduce
the negative effects of static disassembly and improve machine learning models
for malware processing:</p>
      <list list-type="bullet">
        <list-item><p>Remove regular libraries from the import library feature, so that machine learning
models are forced to use only sensitive libraries to classify samples. Note that a
potential problem here is that only a tiny number of sensitive libraries may
be extracted.</p></list-item>
        <list-item><p>Although many C/C++ compilers exist, there are not many commonly used
versions. We can consider developing name demangling for common
compilers and renaming APIs using our own defined convention.</p></list-item>
        <list-item><p>The core of a disassembly script is assembly instructions, so assemblers may
be helpful in performing code analysis to determine the correspondence between
APIs and jump thunks.</p></list-item>
      </list>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Feurer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eggensperger</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Springenberg</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blum</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Efficient and robust automated machine learning</article-title>
          . In:
          <string-name>
            <surname>Cortes</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lawrence</surname>
            ,
            <given-names>N.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>D.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sugiyama</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garnett</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (eds.)
          <source>Advances in Neural Information Processing Systems</source>
          <volume>28</volume>
          , pp.
          <fpage>2962</fpage>
          –
          <lpage>2970</lpage>
          . Curran Associates, Inc. (
          <year>2015</year>
          ), http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Gibert</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mateu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Planes</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>The rise of machine learning for detection and classification of malware: Research developments, trends and challenges</article-title>
          .
          <source>Journal of Network and Computer Applications</source>
          <volume>153</volume>
          ,
          <issue>102526</issue>
          (
          <year>2020</year>
          ). https://doi.org/10.1016/j.jnca.2019.102526, https://www.sciencedirect.com/science/article/pii/S1084804519303868
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Kalash</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rochan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohammed</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bruce</surname>
            ,
            <given-names>N.D.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iqbal</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Malware classification with deep convolutional neural networks</article-title>
          . In:
          <source>2018 9th IFIP International Conference on New Technologies, Mobility and Security (NTMS)</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>5</lpage>
          (
          <year>2018</year>
          ). https://doi.org/10.1109/NTMS.2018.8328749
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>T.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          :
          <article-title>Scaling tree-based automated machine learning to biomedical big data with a feature set selector</article-title>
          .
          <source>Bioinformatics</source>
          <volume>36</volume>
          (
          <issue>1</issue>
          ),
          <fpage>250</fpage>
          –
          <lpage>256</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Microsoft</given-names>
            <surname>Corporation</surname>
          </string-name>
          :
          <article-title>PE format</article-title>
          . Available at https://docs.microsoft.com/en-us/windows/win32/debug/pe-format (2021/05/10) (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Nahid</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehdi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hamid</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>An improved method for packed malware detection using PE header and section table information</article-title>
          .
          <source>International Journal of Computer Network and Information Security</source>
          <volume>11</volume>
          ,
          <fpage>9</fpage>
          –
          <lpage>17</lpage>
          (09
          <year>2019</year>
          ). https://doi.org/10.5815/ijcnis.2019.09.02
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Nataraj</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karthikeyan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jacob</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manjunath</surname>
            ,
            <given-names>B.S.</given-names>
          </string-name>
          :
          <article-title>Malware images: Visualization and automatic classification</article-title>
          . In:
          <source>Proceedings of the 8th International Symposium on Visualization for Cyber Security. VizSec '11</source>
          , Association for Computing Machinery, New York, NY, USA (
          <year>2011</year>
          ). https://doi.org/10.1145/2016904.2016908
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Raff</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barker</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sylvester</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brandon</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Catanzaro</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nicholas</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Malware detection by eating a whole EXE</article-title>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Ronen</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feuerstein</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yom-Tov</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahmadi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Microsoft malware classification challenge</article-title>
          . ArXiv abs/1802.10135 (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Symantec</surname>
            <given-names>Corporation</given-names>
          </string-name>
          :
          <article-title>Internet security threat report</article-title>
          .
          <source>Tech. rep., Symantec Corporation</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Homology analysis of malicious code based on dynamic-behavior fingerprint (in Chinese)</article-title>
          .
          <source>Journal of Sichuan University (Natural Science Edition)</source>
          <volume>53</volume>
          (
          <issue>4</issue>
          ),
          <fpage>793</fpage>
          –
          <lpage>798</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>