Malware detection through low-level features and stacked
                 denoising autoencoders
           Alessandra De Paola¹, Salvatore Favaloro², Salvatore Gaglio¹,
                      Giuseppe Lo Re¹, and Marco Morana¹
                     ¹ Università degli Studi di Palermo, firstname.lastname@unipa.it
                     ² firstname.lastname@community.unipa.it

                                              Abstract
          In recent years, the spread of malicious software through various channels has increased
      the demand for intelligent techniques capable of promptly detecting new malware. In
      this work, we focus on the application of Deep Learning methods to malware detection,
      evaluating their effectiveness when malware is represented by high-level and low-level
      features, respectively. Experimental results show that, when using high-level features,
      deep neural networks do not significantly improve the overall detection accuracy. On the
      other hand, when low-level features, i.e., small pieces of information extracted through
      light processing, are chosen, deep networks increase the capability of correctly
      classifying malware.


1     Introduction
Malware detection is one of the most critical issues faced by computer security. Nowadays, the
accidental execution of malicious software coming from different channels makes IT systems
constantly exposed to risks. In order to effectively detect threats hidden behind heterogeneous
software, which is susceptible to unpredictable variations, intelligent techniques are required.
    The most common approach implemented by popular malware-detection tools is to perform
a static analysis of a short sequence of bytes, i.e., a signature [24]. Nevertheless, the availability
of software tools that automatically generate several variants of a given malware greatly
reduces the effectiveness of this technique [23].
    One of the most promising directions for guaranteeing a high detection rate, even with
the constant increase of threats, is the adoption of cloud-based approaches where malware
detection is performed remotely through machine learning techniques capable of analyzing a
huge number of malware files. The effectiveness of this solution often relies on the adoption
of high-level features, extracted from executable files, designed to be more invariant to code
obfuscation and polymorphism than classic signature-based methods.
    Nevertheless, it is worth noting that designing a “good” set of features to realize efficient
and effective systems is not trivial, due to the underlying and implicit dependencies among
different parts of a file and its behavior. A promising way to address this problem is
deep learning, which is expected to enable the automatic extraction of relevant features
directly from raw information.
    Although the potential of applying deep learning to simple raw data is confirmed by some
recent results, many works adopt deep learning to perform malware detection or classification
on high-level features obtained from executable files.
    The first goal of the proposed work is to evaluate the actual effectiveness of malware detection
through deep learning when processing high-level features. To this aim, we compare the system
proposed in [16] with other classifiers of various complexity.
Malware detection through low-level features and stacked denoising autoencoders    A. De Paola et al.


     Moreover, we want to verify whether deep networks can be effectively adopted to perform
malware detection by exploiting a reduced set of low-level features extracted from the executable
files. The solution we propose here is based on a deep network implemented through stacked
denoising autoencoders, which analyzes raw information contained in the portable executable
(PE) packaging of Windows executable files. Our aim is to compare the performance obtained
by this lighter malware detection system with that of the more complex approach proposed in [16],
in order to establish whether deep learning is able to achieve good performance even with fewer
and less pre-processed data.
     The remainder of the paper is structured as follows. Sect. 2 presents an overview of the related
work. Sect. 3 describes the high-level features and the system proposed in [16], together with the
simpler classifiers we use for the comparative analysis. Sect. 4 provides a description of the
proposed malware detection system based on low-level features. Finally, Sect. 5 reports the
experimental evaluation, and Sect. 6 presents our conclusions and possible future work.


2     Related Work
Several works which adopt deep learning to perform malware detection have been presented in
the recent literature, based on both dynamic and static analysis [10]. Dynamic analysis
exploits a protected environment, i.e., a “sandbox”, in which the malware can be executed and
its behavior observed without threatening a real working system, while static analysis focuses
only on information included in the target file, and does not require that the potential malware is
run. As a consequence, static methods are generally faster and less resource-intensive, even
if they can exploit less information and are vulnerable to malicious code obfuscation.
     The system proposed in [5] performs a dynamic analysis, adopting deep learning to
automatically generate a signature which represents the behavior of the program. The authors use a deep
belief network implemented through stacked denoising autoencoders, which processes the text
file containing the transcription of all the events that occurred during a file run, suitably converted
into binary form. The automatically generated signature is then processed by an SVM to perform
the actual classification. In [8], the sequence of system calls recorded while the executable
file is running is exploited by a Deep Neural Network (DNN) which combines convolutional
layers and recurrent layers, using Long Short-Term Memory (LSTM) cells to increase malware
detection capabilities.
     In order to overcome the limitation of dealing with less information than that obtained
by dynamic analysis, some works based on static analysis perform deep learning on high-level
features extracted according to a well-designed process. The authors of [4] propose to process a
file to build a large vector of features (179 thousand sparse binary features), which is reduced
through a random projection process. The resulting vector, composed of 4,000 elements, is then
analyzed by a deep classifier which is pre-trained through a Restricted Boltzmann Machine
(RBM). The reduction of the feature vectors through random projection is also adopted in [20],
where JavaScript sources are processed through a 5-layer deep neural network implemented
with stacked denoising autoencoders. In [16], four different sets of static features, converted into
1024-length binary vectors and classified by a classic DNN, are used. A more complex set of
features is proposed in [22], where data is obtained by merging information coming from static
and dynamic analysis. This set of features includes a subset automatically obtained by DNNs.
The resulting advanced set of features is classified through Multiple Kernel Learning.
     Even though malware detection algorithms which combine deep learning and high-level
features usually provide better accuracy values, some authors emphasize the convenience of adopting
features that are as simple as possible, so as to design light and efficient malware-detection systems [2].

   The potential of applying deep learning to simple raw data is confirmed by some works
recently presented in the literature, such as [11], which presents an Android malware detection
system that applies a deep convolutional neural network to the raw sequences of opcodes extracted
from disassembled programs.
   Our work aims to confirm this idea, by showing that deep learning can be efficiently adopted
to perform malware detection when executable files are represented by simple features.


3     High-level features for malware detection
Over the years, several types of features have been adopted for malware detection through static
analysis. Some of them are obtained through heavy pre-processing of the executable file to
summarize its global characteristics, others through a simple and light pre-processing phase. The
former can be considered high-level features, since they represent the executable file at
a higher level of abstraction. The latter can be called low-level features, since they are closer
to the raw file representation.
     The authors of [13] showed the effectiveness of several machine learning methods using high-level
features for distinguishing between packed and non-packed executables. That work adopts a
small set of features which summarize high-level aspects of the analyzed software, i.e., the
number of standard and non-standard sections in the portable executable (PE) packaging of a
Windows executable file, the number of executable sections, the number of
readable/writable/executable sections, the number of entries in the Import Address Table, and the
entropy of the PE header, code section, data section, and of the whole executable file. The authors
of [15] analyze the frequencies of opcode sequences, classifying them through Support Vector
Machines. Some works focus on the analysis of the frequency of Windows API calls, such as
[3, 25, 21]. In [17], sequences of strings and bytes extracted from the file are considered together
with the set of API calls. The authors of [1] propose to use n-grams of strings, whereas in [12]
and [4] n-grams based on n-length system call sequences are used. The authors of [2] propose to
merge a large set of complex features by selecting the best subset of features through forward
stepwise selection. The set of features includes n-grams of bytes, the byte-level entropy vector,
some descriptors of the image obtained by interpreting each byte as a grey level, the histogram of
the length of strings, the frequency of some specific operation codes, and the frequency of API
calls, together with many other features.
     One of the goals of our work is to verify whether, still using high-level features, deep learning
can further improve the detection accuracy. As the target system for our analysis, we consider
the work proposed in [16], which adopts a deep neural network to classify vectors of high-level
features extracted from executable files. The feature vector is obtained by merging four
complementary histograms, which depend on different information extracted from the analyzed
file. Two mono-dimensional histograms capture information from the PE, while the other two
bi-dimensional histograms depend on the whole file. The concatenation of these four histograms,
after the row-by-row concatenation of the bi-dimensional components, produces a single
mono-dimensional vector composed of 1024 elements. The first mono-dimensional part of the feature
vector is the PE import histogram, obtained by reading the import address table header of the PE
and generating the tuples of DLLs and functions listed in the header. Each tuple is mapped to
the corresponding hash value and the component is computed as the histogram of such hash
values. The second mono-dimensional part is the PE metadata histogram, which depends on
labels and values of numerical fields contained in the PE packaging. Each label-value pair is
mapped to the corresponding hash value and the histogram of the hash values is computed.
The bi-dimensional components of the feature vector are the byte/entropy histogram and the
string histogram. The former is created by sliding a 1024-byte window over the file with
a step of 256 bytes, and computing the base-2 entropy for each window. For each byte in the
window, a byte-entropy pair is computed; then, the set of pairs is processed to create a 16x16
byte/entropy matrix, i.e., the final 2D histogram. The string histogram is obtained by extracting
the strings of the file that are at least 6 characters long. Each string is processed to obtain a pair
containing the hash value of the string itself and the logarithm of its length. All the resulting pairs
are considered to obtain a 16x16 hash/length histogram.
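    The construction of the byte/entropy histogram can be illustrated with a minimal sketch. The
window size, step, and 16x16 shape come from the description above; the equal-width binning
of byte values and entropy, the function name, and the handling of files shorter than one window
are assumptions made for illustration and are not taken from [16].

```python
import numpy as np

def byte_entropy_histogram(data: bytes, window: int = 1024, step: int = 256,
                           bins: int = 16) -> np.ndarray:
    """Return a bins x bins histogram of (entropy, byte value) pairs."""
    buf = np.frombuffer(data, dtype=np.uint8)
    hist = np.zeros((bins, bins), dtype=np.int64)
    if buf.size == 0:
        return hist
    # Slide the window; a file shorter than one window yields a single window.
    last_start = max(buf.size - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = buf[start:start + window]
        counts = np.bincount(chunk, minlength=256)
        probs = counts[counts > 0] / chunk.size
        entropy = float(-(probs * np.log2(probs)).sum())  # between 0 and 8 bits
        # Assumed binning: the 8-bit entropy range is split into equal intervals.
        e_bin = min(int(entropy / 8.0 * bins), bins - 1)
        # Assumed binning: the 256 byte values are split into equal intervals.
        b_bins = chunk.astype(np.int64) * bins // 256
        # Every byte of the window is paired with the entropy of that window.
        hist[e_bin] += np.bincount(b_bins, minlength=bins)
    return hist
```

Flattening the returned matrix row by row gives the 256 elements that this component
contributes to the 1024-element feature vector.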
    The malware detection process is based on a deep neural network composed of four layers:
an input layer consisting of 1024 nodes, two hidden layers, each consisting of 1024 nodes,
and a final layer with a single node. Nodes of the two hidden layers adopt the Parametric
Rectified Linear Unit (PReLU) [6] as activation function, whereas the activation of the node in
the final layer is a sigmoid. PReLU is a recently proposed activation function which adapts its
shape through a parameter, the slope applied to negative inputs, that is learned during the
training phase. Training is performed by means of the back-propagation
algorithm and the Adam gradient-based optimizer [7], a recent stochastic optimizer that uses
estimates of the first and second moments of the gradient to minimize the objective function,
i.e., the binary cross-entropy. Finally, to prevent overfitting, dropout regularization [18] is
performed on the first three layers. The dropout technique disables some units of the neural
network during training, so that a thinned sub-network is processed rather than the whole network.
On every learning round, different units are randomly sampled to be disabled, thus guaranteeing
that the training is performed on different sub-networks.
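    The two building blocks named above, PReLU and dropout, can be sketched in a few lines.
This is only an illustration: the function names are hypothetical, the slope `alpha` would be
learned by back-propagation in a real network, and the rescaling by 1/(1 - rate) is the common
"inverted dropout" variant, assumed here rather than taken from [16] or [18].

```python
import numpy as np

def prelu(x, alpha):
    """PReLU: identity for positive inputs, slope `alpha` for negative ones."""
    return np.where(x > 0, x, alpha * x)

def dropout(x, rate, rng, training=True):
    """Zero each unit with probability `rate` during training (inverted dropout)."""
    if not training or rate == 0.0:
        return x
    # A fresh random mask is drawn at every call, so each learning round
    # trains a different sub-network.
    mask = rng.random(x.shape) >= rate
    # Assumed variant: rescale the surviving units so the expected activation
    # is unchanged and no correction is needed at test time.
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(0)
hidden = prelu(np.array([-2.0, 0.0, 3.0]), alpha=0.25)
thinned = dropout(hidden, rate=0.5, rng=rng)
```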
    Sect. 5 presents the comparative analysis of this target system with other classifiers of
decreasing complexity, which adopt the same set of high-level features. Surprisingly, the results
we obtained show that, with this well-designed set of high-level features, deep learning does not
have a relevant impact on detection accuracy as compared with other simpler classifiers.


4     Low-level features and DNNs for malware detection
Rather than exploiting high-level features, we propose to adopt deep networks to process low-
level features extracted through static analysis, in order to automatically learn which infor-
mation is relevant to malware detection. The system we propose is based on a two-phase
training deep neural network, where a first unsupervised pre-training with stacked denoising
autoencoders [19] is followed by a supervised fine-tuning based on back-propagation.
    Feature extraction is performed by accessing the DOS Header, File Header, Optional Header
and Section Table of the PE packaging. Each header contains different fields, each analyzed
by extracting the corresponding value and two offsets. Simple values are handled as unsigned
integers, whilst timestamps, arrays and strings are processed via a hashing function. Offset
values preserve spatial information: the local offset specifies the position of a field
within its header, and the global offset represents its position in the file.
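    As an illustration of the value/offset triples described above, the following sketch reads the
seven fields of the File Header from a PE image using only the Python standard library. The
field layout follows the PE/COFF format; the function name, the synthetic sample buffer, and the
use of CRC32 as the hashing function for the timestamp are assumptions, and the normalization
to [0, 1] is omitted because the text does not specify it.

```python
import struct
import zlib

# COFF File Header layout as (name, struct format): 7 fields, 20 bytes in total.
FILE_HEADER_FIELDS = [
    ("Machine", "H"), ("NumberOfSections", "H"), ("TimeDateStamp", "I"),
    ("PointerToSymbolTable", "I"), ("NumberOfSymbols", "I"),
    ("SizeOfOptionalHeader", "H"), ("Characteristics", "H"),
]

def extract_file_header(data: bytes):
    """Return (name, value, local_offset, global_offset) for each File Header field."""
    if data[:2] != b"MZ":
        raise ValueError("missing DOS signature")
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)  # offset of the PE signature
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    base = e_lfanew + 4  # the File Header starts right after the signature
    features, local = [], 0
    for name, fmt in FILE_HEADER_FIELDS:
        (value,) = struct.unpack_from("<" + fmt, data, base + local)
        if name == "TimeDateStamp":
            # Timestamps are hashed; CRC32 is a stand-in, not the authors' hash.
            value = zlib.crc32(struct.pack("<I", value))
        features.append((name, value, local, base + local))
        local += struct.calcsize("<" + fmt)
    return features

# Synthetic buffer standing in for a real executable (illustrative values only).
sample = bytearray(0x80 + 4 + 20)
sample[0:2] = b"MZ"
struct.pack_into("<I", sample, 0x3C, 0x80)
sample[0x80:0x84] = b"PE\0\0"
struct.pack_into("<HHIIIHH", sample, 0x84, 0x14C, 3, 0x5F000000, 0, 0, 0xE0, 0x102)

for name, value, local_off, global_off in extract_file_header(bytes(sample)):
    print(f"{name:<22}{value:>12}{local_off:>4}{global_off:>6}")
```

The same pattern would apply to the DOS Header, the Optional Header and each Section Table
entry, each with its own field list.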
    Since a file can have a variable number of sections and the classifier is designed to process
fixed-size data, we limit the number of Section Table entries to be processed. To this aim, some
experiments were performed on real data, showing that a reasonable threshold on the number of
sections is 13. If a file contains fewer sections than the threshold, the elements of the feature
vector related to the missing sections are set to zero; otherwise, if the number of sections is
higher than the threshold, the extra sections are ignored. Thus, 19 fields of the DOS header,
7 of the file header, 30 of the optional header, and 12 of the section header (for each of the
13 sections) were analysed; with a value and two offsets per field, this yields feature vectors of
636 elements in the range [0, 1]. These vectors are about 38% smaller than those proposed in [16]
and described in Sect. 3. Moreover, since data are not extracted from the entire file but only from
its headers, the feature extraction process is fast and independent of the file size.

[Figure 1: Two-phase training. Diagram labels: unlabeled dataset, unsupervised pre-training,
labeled dataset, fine tuning.]

[Figure 2: Structure of the denoising autoencoder. Diagram labels: input, corrupted input,
encoder, decoder.]

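    The size of the low-level feature vector and the handling of the section threshold can be
checked with a short sketch. The field counts and the threshold of 13 sections come from the
text; the constant and function names are hypothetical, and each field is assumed to contribute
exactly three elements (its value and the two offsets).

```python
MAX_SECTIONS = 13        # threshold on the number of sections
FIELDS_PER_SECTION = 12  # fields read from each section header
VALUES_PER_FIELD = 3     # value, local offset, global offset

def pad_sections(sections):
    """Truncate or zero-pad a list of per-section feature lists to MAX_SECTIONS."""
    width = FIELDS_PER_SECTION * VALUES_PER_FIELD
    kept = [list(s) for s in sections[:MAX_SECTIONS]]           # extra sections ignored
    kept += [[0.0] * width for _ in range(MAX_SECTIONS - len(kept))]  # missing ones zeroed
    return [v for section in kept for v in section]

fixed_fields = 19 + 7 + 30  # DOS, File and Optional header fields
total = VALUES_PER_FIELD * (fixed_fields + FIELDS_PER_SECTION * MAX_SECTIONS)
print(total)                            # 636
print(round(100 * (1 - total / 1024)))  # 38 (percent smaller than 1024 elements)
```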
    The deep network proposed here consists of five layers. The input layer has 636 elements,
the three hidden layers contain 256, 64, and 16 nodes respectively, whilst the output layer
consists of a single node. Each node uses a sigmoid activation function. In this neural network,
the total number of trainable parameters (weights and biases) is 180,577.
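    The parameter count follows from the layer sizes, assuming fully connected layers with one
bias per node: each layer with n_in inputs and n_out nodes contributes n_in * n_out weights plus
n_out biases.

```latex
\begin{align*}
N &= \underbrace{(636 \cdot 256 + 256)}_{163{,}072}
   + \underbrace{(256 \cdot 64 + 64)}_{16{,}448}
   + \underbrace{(64 \cdot 16 + 16)}_{1{,}040}
   + \underbrace{(16 \cdot 1 + 1)}_{17}
   = 180{,}577
\end{align*}
```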
    The training of the deep network is performed in two steps (see Fig. 1). The first phase
performs an unsupervised pre-training, using an unlabeled dataset to obtain a first estimate of
the weights and biases of the hidden layers. This phase is implemented through stacked denoising
autoencoders. Each hidden layer is pre-trained individually by means of a support network
consisting of an input layer, a corruption layer, an encoder layer, and a decoder layer (see Fig. 2),
where the encoder layer corresponds to the hidden layer to be pre-trained. The input, corruption,
and decoder layers, in contrast, all have the same number of nodes, which is
equal to the number of inputs of the hidden layer to be pre-trained. This support network is
trained through back-propagation and processes each input vector by i) corrupting it with
noise, ii) encoding the resulting noisy signal, and then iii) reconstructing the input from the
encoded signal. The aim of the corruption layer, which adds isotropic Gaussian noise to the input
signals, is to force the encoder layer to learn the most useful information from the input vectors,
automatically discarding the noise in the corrupted input. Iterating the pre-training
over all the hidden layers produces a deep neural network in which each layer is able to extract
and represent features at a higher level of abstraction than its predecessors. The pre-training
of the first hidden layer is performed using the original dataset, while the other hidden layers
are pre-trained on an encoded version of the dataset, obtained by using the already
trained denoising autoencoders to build a temporary network which outputs the activations of the
encoder layer that precedes the layer to be trained.
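    The greedy layer-wise procedure described above can be sketched with NumPy. This is not the
authors' implementation: the mean squared reconstruction error, full-batch gradient descent,
the noise standard deviation, the learning rate, the weight initialization and all function names
are assumptions chosen to keep the example short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(X, n_hidden, rng, noise_std=0.1, lr=0.1, epochs=50):
    """Train one denoising autoencoder on X; return the encoder weights and biases."""
    n_in = X.shape[1]
    W_enc = rng.normal(0.0, 0.1, (n_in, n_hidden)); b_enc = np.zeros(n_hidden)
    W_dec = rng.normal(0.0, 0.1, (n_hidden, n_in)); b_dec = np.zeros(n_in)
    for _ in range(epochs):
        X_noisy = X + rng.normal(0.0, noise_std, X.shape)  # i) corrupt
        H = sigmoid(X_noisy @ W_enc + b_enc)               # ii) encode
        R = sigmoid(H @ W_dec + b_dec)                     # iii) reconstruct
        # Back-propagate the squared error measured against the clean input.
        dR = (R - X) * R * (1.0 - R) / len(X)
        dH = (dR @ W_dec.T) * H * (1.0 - H)
        W_dec -= lr * (H.T @ dR); b_dec -= lr * dR.sum(axis=0)
        W_enc -= lr * (X_noisy.T @ dH); b_enc -= lr * dH.sum(axis=0)
    return W_enc, b_enc

def pretrain_stack(X, hidden_sizes, rng, **kwargs):
    """Greedy pre-training: each layer is trained on the encoding of the previous one."""
    params, current = [], X
    for n_hidden in hidden_sizes:
        W, b = pretrain_layer(current, n_hidden, rng, **kwargs)
        params.append((W, b))
        current = sigmoid(current @ W + b)  # clean encoding feeds the next layer
    return params

rng = np.random.default_rng(0)
X = rng.random((32, 636))  # random stand-in for unlabeled feature vectors
params = pretrain_stack(X, [256, 64, 16], rng, epochs=5)
```

The returned weights and biases would then initialize the three hidden layers before the
supervised fine-tuning step.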
    The second training step is the fine-tuning of the network, which is implemented through
a supervised back-propagation algorithm with the Adam stochastic optimizer, using
the binary cross-entropy as objective function. At the beginning of this stage, the weights and
biases of all hidden layers are initialized with the values produced by the pre-training.
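    For reference, the objective function and a single Adam update can be written as follows.
The update rule and the default hyper-parameters are those given in the Adam paper [7]; the
values actually used in our experiments are not restated here, and the function names are
hypothetical.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between labels in {0, 1} and predicted probabilities."""
    p = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameter and the updated moment estimates."""
    m = beta1 * m + (1.0 - beta1) * grad       # first moment (mean of gradients)
    v = beta2 * v + (1.0 - beta2) * grad ** 2  # second moment (uncentered variance)
    m_hat = m / (1.0 - beta1 ** t)             # bias correction, t starts at 1
    v_hat = v / (1.0 - beta2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```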


5       Experimental Evaluation
The dataset used for the experimental evaluation was obtained by merging 12,000 samples of
malware obtained from VirusShare (https://virusshare.com) and 11,874 samples of certified
software obtained from a clean Windows 10 installation. The malware samples are not labeled
according to the specific malware family they belong to, but this characteristic does not limit the
experimental evaluation presented here, since it aims only to assess the detection accuracy.

       Classifier        Input    Hidden layer 1      Hidden layer 2     Output   n. parameters
       Benchmark [16]     1,024        1,024              1,024            1         2,102,273
       Shallow            1,024        2,048                0              1         2,103,297
       Classic-1024       1,024        1,024                0              1         1,050,625
       Classic-512        1,024         512                 0              1          525,313
       Classic-8          1,024          8                  0              1           8,209

Table 1: Number of nodes composing each layer of the analyzed neural networks. The last
column reports the number of trainable parameters.

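    The parameter counts reported in Table 1 can be reproduced with the sketch below. It assumes
fully connected layers with biases and, for the networks that use PReLU (the benchmark and the
shallow network), one learnable slope per hidden unit; this last assumption is inferred from the
fact that it matches the figures in the table, and the function name is hypothetical.

```python
def count_parameters(layer_sizes, prelu=False):
    """Weights and biases of a dense network, plus one PReLU slope per hidden unit."""
    total = sum(n_in * n_out + n_out
                for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    if prelu:
        total += sum(layer_sizes[1:-1])  # assumed: one slope per hidden node
    return total

networks = {
    "Benchmark":    ([1024, 1024, 1024, 1], True),
    "Shallow":      ([1024, 2048, 1], True),
    "Classic-1024": ([1024, 1024, 1], False),
    "Classic-512":  ([1024, 512, 1], False),
    "Classic-8":    ([1024, 8, 1], False),
}
for name, (sizes, prelu) in networks.items():
    print(f"{name:<13}{count_parameters(sizes, prelu):>10,}")
```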
    The evaluation of each classifier is performed through K-fold cross-validation with
stratified sampling, in order to preserve the percentage of samples per class. Due to the limited
size of the dataset used here, we adopted K = 5 for each test so as to guarantee that each
validation set contains an adequate number of samples. Each model was trained until the loss
value dropped below 0.02, or the number of training epochs exceeded 200, as proposed in [16].
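    A stratified split of this kind can be sketched as follows. The function name and the seed
are hypothetical, and the sketch only produces index sets; training each model with the stopping
rule above is left out.

```python
import numpy as np

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs that preserve the class proportions."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        # Deal the shuffled samples of this class evenly across the k folds.
        for i, part in enumerate(np.array_split(idx, k)):
            folds[i].extend(part.tolist())
    for i in range(k):
        val = np.array(sorted(folds[i]))
        train = np.array(sorted(j for f in folds[:i] + folds[i + 1:] for j in f))
        yield train, val

# Same class sizes as the dataset: 12,000 malware (1) and 11,874 benign (0) samples.
labels = [1] * 12000 + [0] * 11874
for train_idx, val_idx in stratified_kfold(labels, k=5):
    print(len(train_idx), len(val_idx))
```

With these class sizes, every validation fold contains 2,400 malware samples and 2,374 or
2,375 benign samples.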
    The performance of each classifier was evaluated by analyzing the trend of the ROC (Receiver
Operating Characteristic) curve with respect to the training epochs. Furthermore, we
computed the final loss, the final accuracy, as well as several other metrics, i.e., TPR (True Positive
Rate), FPR (False Positive Rate), Precision, and AUC (Area Under Curve) of the ROC curve. This
last metric allows the classification performance to be evaluated independently of the threshold
adopted for the last layer.
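    The metrics listed above can be computed from the output of the last layer as in the following
sketch, where malware is the positive class. The function name is hypothetical; the AUC is
computed through its pairwise-ranking interpretation, which needs memory proportional to the
product of the class sizes, and the sketch assumes that both classes are present and that at least
one sample is predicted positive.

```python
import numpy as np

def detection_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, precision, TPR and FPR at a threshold, plus the threshold-free AUC."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    y_pred = y_score >= threshold
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    tn = np.sum(~y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    # AUC as the probability that a random positive outscores a random negative
    # (ties count one half).
    pos, neg = y_score[y_true], y_score[~y_true]
    diff = pos[:, None] - neg[None, :]
    auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp),
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "auc": auc,
    }

metrics = detection_metrics([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

Only the first four metrics change when the threshold is moved; the AUC depends on the ranking
of the scores alone.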


5.1     Deep Learning performance with high-level features
In order to verify the impact of deep learning methods when high-level features are used to
represent files, we performed a comparative evaluation between the benchmark described in
Sect. 3, and different classifiers based on neural networks, which adopt the same set of features.
    The first neural network involved in the comparison, named shallow classifier, is obtained
by merging the two hidden layers of the benchmark network into a single double-sized layer.
The shallow network differs from the benchmark only in its topology, since it has almost
the same number of parameters. The comparison between the benchmark and the shallow
networks makes it possible to evaluate the effect of the hierarchical stratification on the
performance of the classifier. The second network is obtained by removing the second hidden layer
from the benchmark network, by replacing the PReLUs with sigmoids, and by removing the dropout
step, thus obtaining a classic neural network. Then, by varying the number of nodes contained
in the remaining hidden layer, we obtained three neural networks with 1024, 512 and 8 nodes in
the hidden layer. These networks have fewer parameters than the benchmark, and their evaluation
makes it possible to analyze how much deep learning techniques can improve the classification
accuracy when high-level features are adopted. Table 1 summarizes the topology and the number of
parameters for each network.
    Results are summarized in Table 2. Surprisingly, even if the benchmark network
achieves the best results for almost every metric, the results obtained by the other classifiers
are quite similar, even those obtained by the classic network with only 8 nodes in the hidden
layer. Moreover, by analyzing the trend of the average accuracy with respect to the training
epochs (Fig. 3a) and the ROC curves (Fig. 3b), no relevant difference among the considered
neural networks arises. Such results confirm that, if a set of well-designed high-level features
is adopted to represent malware files, deep neural networks do not significantly impact
performance.

             Classifier         Loss     Accuracy     Precision     TPR      FPR           AUC
                                            (%)          (%)        (%)      (%)           (%)
             Benchmark [16]    0.0707      98.55        98.66      98.45    1.20        99.76809
             Shallow           0.0931      98.29        99.07      97.53    0.93        99.75425
             Classic-1024      0.0926      98.44        98.60      98.30    1.41        99.72532
             Classic-512       0.0858      98.40        98.66      98.17    1.36        99.70791
             Classic-8         0.0770      98.41        98.87      97.96    1.13        99.73027

Table 2: Experimental evaluation of different neural networks using high-level features as input.

Figure 3: Average accuracy (a) and ROC curves (b) of different networks using HL features.


5.2     Deep Learning performance with low-level features
Several tests were performed to evaluate both the effectiveness of the proposed approach, based
on the adoption of deep learning with low-level features, and the impact of the design choices
we made.
    Firstly, we wanted to investigate whether there is a connection between the number of
training epochs and performance. For this purpose we compared the performance obtained by
varying the maximum training epoch threshold (i.e., 600, 800, and 1000) for the pre-training
phase. The second evaluation aimed to assess the effectiveness of the pre-training phase, by
comparing our system with a deep network characterized by the same topology, but trained only
through the second training phase, that is, without pre-training. Finally, we intended to verify
the impact of the hierarchical stratification of our deep classifier on the overall performance
of the malware detection system. To this aim, we compared our system with a classic neural
network obtained by removing the last two hidden layers from our classifier, thus obtaining three
layers of 636, 256 and 1 nodes respectively.
    In Table 3 the average values of the considered performance metrics are shown. We can
observe that higher accuracy values are obtained by using the pre-trained classifiers, and this
confirms the effectiveness of the two-phase training. However, increasing the number of pre-
training epochs (e.g., from 600 to 800 or 1000) does not significantly improve the accuracy, nor
reduce the values of loss.


    Classifier                          Loss     Accuracy     Precision      TPR      FPR     AUC
                                                    (%)          (%)         (%)      (%)     (%)
    Deep model (600)                   0.075      97.39        97.48        97.33    2.55    97.39
    Deep model (800)                   0.074      97.48        98.09        96.88    1.91    97.48
    Deep model (1000)                  0.074      97.49        97.92        97.08    2.09    97.49
    Deep model (fine tuning only)      0.087      96.73        96.69        96.82    3.37    96.73
    Classic-636-256                    0.095      96.48        95.98        97.10    4.15    96.47

Table 3: Average values of Loss, Accuracy, Precision, TPR (True Positive Rate), FPR (False
Positive Rate), and AUC (Area Under Curve) using different classifiers. The number of
pre-training epochs is reported in parentheses.


Figure 4: Average accuracy (a) and ROC curves (b) of different models. Pre-training epochs are
reported in parentheses.


    The performance of the model that uses fine tuning only, i.e., trained only with the second
training phase, is worse than that of the pre-trained models on all metrics. Furthermore, the lowest
accuracy and precision are obtained with the classic model, suggesting that the hierarchical
stratification is effective, but only when combined with the pre-training phase.
   A more in-depth analysis of the average accuracy achieved using different models is shown in
Fig. 4a, which uses results from the second training phase. Results confirm that the pre-trained
deep models exhibit a better trend throughout training. Fig. 4b shows the average ROC
curves from which the AUC score was calculated. Even in this case, pre-trained models outperform
the others, with no significant differences when using 800 or 1000 pre-training epochs.
   When adopting low-level features, our deep model increases precision by more than 2
percentage points and accuracy by about 1 percentage point with respect to a traditional neural
network (Table 3). In contrast, with high-level features, the adoption of a deep network in the
benchmark system increases precision and accuracy by only 0.06 and 0.11 percentage points,
respectively, over the Classic-1024 network (Table 2).
    It is worth noting that, whatever the classifier, the adoption of optimal high-level features
leads to better classification performance. Thus, even though using a smaller and simpler set
of features makes the malware detection process lighter, the proposed deep learning solution is
not able to achieve the same level of accuracy.



6     Conclusions and Future Work
In this work we addressed the scenario of malware detection through deep learning approaches.
Firstly, we investigated whether deep learning can make a relevant difference when combined with
well-designed high-level features. Experiments showed that, when using an optimal set of features,
the adoption of a deep neural network does not significantly influence the performance.
    Since deep learning is often adopted to achieve a high classification accuracy even when
processing raw data, we also investigated whether the use of low-level features (computed only on
the first 1 or 2 KB of a file) combined with a deep learning approach may lead to comparable
performance. We proposed a deep neural network whose parameters are learned through a
two-phase training which combines an unsupervised technique, i.e., stacked denoising autoencoders,
with a supervised step to perform fine tuning.
    Results showed the capability of our system to extract relevant knowledge from low-level
features obtained through a static analysis of the considered files. Moreover, the two-phase
training proved to be effective regardless of the number of pre-training epochs.
    However, regardless of the classifier, the use of optimal high-level features leads to the
best classification performance. Nevertheless, it is worth noting that with a total of 180,577
parameters, about 9% of those used by the benchmark system [16], our system exhibits a reduction
of only 0.57 and 1.07 percentage points in precision and accuracy respectively.
    Such results confirm the potential of our approach, and also highlight the need for a
further refinement of the method we propose in order to overcome the limitation introduced by
adopting a smaller and simpler set of features.
    As future work, we want to verify whether other models can be adopted for the first,
unsupervised, training phase. For example, a noise-based autoencoder, like the DA-IC (Denoising
Autoencoder with Interdependent Codes) [9], or a deterministic one, like the contractive
autoencoder [14], could lead to better performance.


References
 [1] T. Abou-Assaleh, N. Cercone, V. Keselj, and R. Sweidan. N-gram-based detection of new malicious
     code. In Proc. of the 28th Annual Computer Software and Applications Conf. (COMPSAC 2004),
     volume 2, pages 41–42. IEEE, 2004.
 [2] M. Ahmadi, D. Ulyanov, S. Semenov, M. Trofimov, and G. Giacinto. Novel feature extraction,
     selection and fusion for effective malware family classification. In Proc. of the Sixth ACM Conf.
     on Data and Application Security and Privacy, pages 183–194. ACM, 2016.
 [3] M. Alazab, S. Venkatraman, P. Watters, and M. Alazab. Zero-day malware detection based on
     supervised learning algorithms of api call signatures. In Proc. of the Ninth Australasian Data
     Mining Conf. - Volume 121, AusDM ’11, pages 171–182, 2011.
 [4] G. E. Dahl, J. W. Stokes, L. Deng, and D. Yu. Large-scale malware classification using random
     projections and neural networks. In Proc. of the 2013 IEEE Int. Conf. on Acoustics, Speech and
     Signal Processing (ICASSP), pages 3422–3426. IEEE, 2013.
 [5] O. E. David and N. S. Netanyahu. Deepsign: Deep learning for automatic malware signature
     generation and classification. In Proc. of the 2015 Int. Joint Conf. on Neural Networks (IJCNN),
     pages 1–8. IEEE, 2015.
 [6] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level
     performance on imagenet classification. In Proc. of the IEEE Int. Conf. on computer vision, pages
     1026–1034, 2015.
 [7] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proc. of the 3rd Int. Conf.
     on Learning Representations (ICLR 2015), pages 1–15, 2014.

 [8] B. Kolosnjaji, A. Zarras, G. D. Webster, and C. Eckert. Deep learning for classification of malware
     system call sequences. In Proc. of the Australasian Conf. on Artificial Intelligence, volume 9992
     of Lecture Notes in Computer Science, pages 137–149. Springer, 2016.
 [9] H. Larochelle, D. Erhan, and P. Vincent. Deep learning using robust interdependent codes. In
     AISTATS, pages 312–319, 2009.
[10] K. Mathur and S. Hiranwal. A survey on techniques in detection and analyzing malware ex-
     ecutables. Int. J. of Advanced Research in Computer Science and Software Engineering, 3(4),
     2013.
[11] N. McLaughlin, J. Martinez del Rincon, B. Kang, S. Yerima, P. Miller, S. Sezer, Y. Safaei,
     E. Trickel, Z. Zhao, A. Doupé, and G. Joon Ahn. Deep android malware detection. In Proc.
     of the Seventh ACM on Conf. on Data and Application Security and Privacy, CODASPY ’17,
     pages 301–308. ACM, 2017.
[12] S. B. Mehdi, A. K. Tanwani, and M. Farooq. Imad: in-execution malware analysis and detection.
     In Proc. of the 11th Annual Conf. on Genetic and evolutionary computation, pages 1553–1560.
     ACM, 2009.
[13] R. Perdisci, A. Lanzi, and W. Lee. Classification of packed executables for accurate
     computer virus detection. Pattern Recognition Letters, 29(14):1941–1946, 2008.
[14] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit
     invariance during feature extraction. In Proc. of the 28th Int. Conf. on machine learning (ICML-
     11), pages 833–840, 2011.
[15] I. Santos, F. Brezo, B. Sanz, C. Laorden, and P. G. Bringas. Using opcode sequences in single-class
     learning to detect unknown malware. IET information security, 5(4):220–227, 2011.
[16] J. Saxe and K. Berlin. Deep neural network based malware detection using two dimensional binary
     program features. In Proc. of the 2015 10th Int. Conf. on Malicious and Unwanted Software
     (MALWARE), pages 11–20. IEEE, 2015.
[17] M. G. Schultz, E. Eskin, E. Zadok, and S. J. Stolfo. Data mining methods for detection of new
     malicious executables. In IEEE Symposium on Security and Privacy, pages 38–49. IEEE Computer
     Society, 2001.
[18] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple
     way to prevent neural networks from overfitting. J. of Machine Learning Research, 15(1):1929–
     1958, 2014.
[19] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoen-
     coders: Learning useful representations in a deep network with a local denoising criterion. J. of
     Machine Learning Research, 11(Dec):3371–3408, 2010.
[20] Y. Wang, W.-D. Cai, and P. Wei. A deep learning approach for detecting malicious javascript
     code. Security and Communication Networks, 9(11):1520–1534, 2016.
[21] M. Weber, M. Schmid, M. Schatz, and D. Geyer. A toolkit for detecting and analyzing malicious
     software. In Proc. of the 18th Annual Computer Security Applications Conf., pages 423–431. IEEE,
     2002.
[22] L. Xu, D. Zhang, N. Jayasena, and J. Cavazos. Hadm: Hybrid analysis for detection of malware.
     In Proc. of the SAI Intelligent Systems Conf. (IntelliSys), pages 1037–1047, 2016.
[23] Y. Ye, T. Li, D. Adjeroh, and S. S. Iyengar. A survey on malware detection using data mining
     techniques. ACM Computing Surveys (CSUR), 50(3):41, 2017.
[24] Y. Ye, T. Li, S. Zhu, W. Zhuang, E. Tas, U. Gupta, and M. Abdulhayoglu. Combining file content
     and file relations for cloud based malware detection. In Proc. of the 17th ACM SIGKDD Int.
     Conf. on Knowledge Discovery and Data Mining, pages 222–230, 2011.
[25] Y. Ye, D. Wang, T. Li, D. Ye, and Q. Jiang. An intelligent pe-malware detection system based on
     association mining. J. in Computer Virology, 4(4):323–334, 2008.

