Deep Networks in Online Malware Detection

                                             Jiří Tumpach1 , Marek Krčál2 , Martin Holeňa3
               1 Faculty of Mathematics and Physics, Charles University, Malostranské nám. 2, Prague, Czech Republic
                                         2 Rossum Czech Republic, Dobratická 523, Prague
            3 Institute of Computer Science, Czech Academy of Sciences, Pod vodárenskou věží 2, Prague, Czech Republic


Abstract: Deep learning is usually applied to static datasets. If it is used for classification based on data streams, it is not easy to take non-stationarity into account. This paper presents work in progress on a new method for online deep classification learning in data streams with slow or moderate drift, highly relevant for the application domain of malware detection. The method uses a combination of a multilayer perceptron and a variational autoencoder to achieve constant memory consumption by encoding past data into a generative model. This can make online learning of neural networks more accessible for independent adaptive systems with limited memory. First results for real-world malware stream data are presented.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


1    Introduction

Deep network architectures have many benefits. The most obvious one is that the data do not need comprehensive preparation. A large enough network probably finds relevant features automatically. So it is easier to pass data to training than to guess the correct match in the triple problem-transformation-classifier.

However, a deep network needs a lot of training data to perform in this way. Fortunately, many areas constantly generate large amounts of data.

Too much data may be a problem because parallel training for deep neural networks can be expensive. Some training examples may be unnecessary and contain only repeating relevant information with some random noise. In this case, they function as a weight for the relevant information.

Consider a situation where there is no expected change of the target function during its use (offline training). In this case, one can save similarity-filtered latent features of the trained network. For example, latent features can be outputs of some middle layer. One application can be transfer learning, where some trade-off between network performance and speed of training is already expected.

Online problems are specific because they are intended for situations in which some drift of information is expected. So training on all available data can be harmful. One easy solution is to train a model only on the most recent subset of training examples. This method reduces the need for parallel training; however, discarding a large proportion of data can cause quick overtraining, especially in case of slow drift.

This paper investigates faster retraining of neural networks on data with slow drift. Such research is highly relevant for the application domain of malware detection because most malware is evolving, entailing a drift in data. The main idea is to have a generator-discriminator pair for each time interval. The current generator is trained with the last subset of training data (moving window) with the addition of samples generated by the previous generator. Its job is to estimate the distribution of past data points and to use that distribution for generating new examples. A discriminator also uses labels generated by the previous discriminator if labels are not provided explicitly. The generative model stores some information about the importance of different training cases (weights) and acts as an implicit decay. For the generative model, we currently use variational autoencoders (VAEs) and intend to include also deep belief networks (DBNs) soon. However, this idea can be generalized to any suitable classifier and generative model.

In Section 2, we present the state of the art in online malware detection. The methods used are described in Section 3. In Section 4, strategies for training and evaluation are proposed. In Section 5, our data and experiments on a real-world malware dataset are presented.


2    Online Malware Detection

Malware is continuously evolving by exploiting new vulnerabilities and examining evasion techniques [8]. Moreover, detection has to deal with significant data drift. It can make use of a signature database of previously detected malware. When a file is scanned, it is first compared with the items in the database. So only modified and new malware needs to be detected, which gives high priority to generalization. Therefore, online detection methods, capable of keeping up with and adapting to such evolution, are desirable.

Malware detection techniques can be divided into static and dynamic methods [12]. The static methods focus on an analysis of program code, while dynamic methods infer from program behaviour. They can log used resources and privileges, system or API calls, or track sensitive data inside an application [8]. In connection with online learning, DroidOL [8] uses the analysis of inter-procedural control-flow graphs to achieve robustness against hiding attempts.
It is trained with a fast online linear algorithm adapted to growing dimensionality. On real Android applications, DroidOL outperforms state-of-the-art malware detectors with 84.29% accuracy.

Another dynamic online method [11] reports using online learned Support Vector Machines with an RBF kernel to detect malware from application behavior.

Users can differ in how willing they are to give their data, like location, contacts, or files, to the author of a specific application. Antimalware programs then need to profile each user so as not to restrict or bother them unduly. XDroid [12] tackles this problem by online hidden Markov model (HMM) learning.


3     Methodological Background

3.1   Multilayer Perceptron (MLP)

A multilayer perceptron is composed of neurons (Figure 1) arranged into layers (Figure 2) [3]. The first layer is called the input layer, and its function is to receive the values of the inputs. The last layer is called the output layer, and it has a similar structure as the remaining, so-called hidden, layers. Their neurons are connected to the output of each neuron in the previous layer. Figure 2 depicts a two-layer MLP. It is a non-linear regression or discrimination model because its neurons use non-linear activation functions (Figure 3).

Figure 1: The neuron $j$: its inputs ($i_i$) are multiplied by the corresponding weights ($w_{(i,j)}$) and then summed together with a specific bias $b_j$. The resulting value is called activation. This value is mapped by the activation function $f(x)$ to the output $o_j$ of the neuron $j$.

Figure 2: Multilayer perceptron with two layers. [Diagram with an input layer, a hidden layer and an output layer.]

Figure 3: Important examples of activation functions: (a) sigmoid, $\mathrm{sigm}(x) = \frac{1}{1+e^{-x}}$; (b) relu, $\mathrm{relu}(x) = \max(0, x)$.

An MLP is learned by minimizing some loss function, usually by some kind of smooth optimization. The simplest, but still used, kind of smooth optimization is gradient descent, in the area of neural networks also known as backpropagation, due to the flow of gradient computation. In high-dimensional spaces, its stochastic variant, stochastic gradient descent, is commonly used. Exact second order methods such as the Gauss-Newton method are usually inefficient [2, 17]. On the other hand, more successful methods attempt to approximate second order behavior. One of the strategies is to have different learning rates (sizes of steps) for each learned variable (Adam, AdaGrad, RMSProp, SGD with Nesterov momentum, ...) [3]. Alternatively, some methods approximate second order derivatives from the history of gradients (Adam) [5].

One of the most popular loss functions used in regression problems is the Mean Square Error (MSE) loss [3]

$$L_{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2,$$

where $\hat{y}_i$ is the output of the MLP given a sample $x_i$ from the feature space with the corresponding correct value $y_i$, and $N$ is the number of samples in one training cycle. In classification, to be able to learn probabilities of labels, one can employ the cross-entropy loss. For classification into $G$ classes, it is defined as

$$-\frac{1}{N} \sum_{i=1}^{N} \sum_{l=1}^{G} y_{il} \log(\hat{y}_{il}),$$

and the predicted probability $\hat{y}_{il}$ of the label $l$ is given by the softmax activation function

$$\hat{y}_{il} = \frac{\exp(\hat{y}_{il})}{\sum_{s=1}^{G} \exp(\hat{y}_{is})},$$

where, on the right-hand side, $\hat{y}_{is}$ denotes the activation of the $s$-th output neuron before the normalization.


3.2   Autoencoders (AEs)

Autoencoders are neural networks capable of learning data representations called codings, usually with a much lower dimension than the dimension of the input data [3]. They learn to copy their input to their output and consist of two parts: an encoder and a decoder, cf. the example in Figure 4. By restricting the flow of information, one can achieve interesting properties, for example denoising, detecting anomalies, generating unseen samples with a distribution similar to that of the training data, and so on.
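The encoder-decoder structure described above, together with the MSE loss of Section 3.1, can be illustrated by a minimal NumPy sketch. It only performs a forward pass through an untrained autoencoder; the layer sizes, the random weights and all helper names (`dense`, `encode`, `decode`, `mse`) are assumptions made for this illustration and are not taken from the experiments in this paper. For vector-valued outputs, the squared error is summed over the features of each sample, which is also an assumption.

```python
import numpy as np

def relu(x):
    """relu(x) = max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def dense(x, w, b, activation=None):
    """One fully connected layer: weighted sum plus bias, then activation."""
    a = x @ w + b
    return a if activation is None else activation(a)

def mse(y, y_hat):
    """Mean over samples of the squared error summed over features."""
    return float(np.mean(np.sum((y - y_hat) ** 2, axis=1)))

rng = np.random.default_rng(0)
n_features, n_codings = 8, 3  # illustrative sizes only

# Randomly initialised (untrained) encoder and decoder parameters.
w_enc = rng.normal(scale=0.1, size=(n_features, n_codings))
b_enc = np.zeros(n_codings)
w_dec = rng.normal(scale=0.1, size=(n_codings, n_features))
b_dec = np.zeros(n_features)

def encode(x):
    """Map inputs to lower-dimensional codings."""
    return dense(x, w_enc, b_enc, relu)

def decode(c):
    """Map codings back to the input space (linear output layer)."""
    return dense(c, w_dec, b_dec)

x = rng.normal(size=(5, n_features))   # a batch of 5 samples
codings = encode(x)                    # shape (5, 3)
reconstruction = decode(codings)       # shape (5, 8)
loss = mse(x, reconstruction)          # reconstruction loss to be minimised
print(codings.shape, reconstruction.shape)
```

Training would adjust the four parameter arrays by gradient descent so that `loss` decreases; the bottleneck of three codings is what restricts the flow of information.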
Figure 4: Autoencoder: the output of the encoder is the input to the decoder.

3.3     Variational Autoencoders (VAEs)

Codings in basic autoencoders can have nonstandard distributions [3]. This property makes it difficult to generate samples similar to the training dataset. VAEs solve this problem by employing the Kullback-Leibler (KL) divergence. The KL divergence between two distributions $p$ and $q$ is defined as:

$$D_{KL}(p \| q) = H(p, q) - H(p) = -\int_{-\infty}^{\infty} p(x) \ln q(x)\,dx + \int_{-\infty}^{\infty} p(x) \ln p(x)\,dx = \int_{-\infty}^{\infty} p(x) \ln\left(\frac{p(x)}{q(x)}\right) dx,$$

where $H(p, q)$ is the cross-entropy and $H(p)$ is the entropy. The KL divergence is a measure of difference between two distributions. If $p(x)$ and $q(x)$ are the same, the divergence equals 0, otherwise it is positive.

Because codings in AEs are deterministic, it is not possible to define the KL divergence for them. The important idea in [6] is to map the codings to normal distributions, using a suitable neural network. The $i$-th coding now corresponds to one pair of output neurons of the network, and their activities represent a normal distribution for the $i$-th coding. So the first neuron defines the mean ($\mu_i$) and the second one the standard deviation ($\sigma_i$) of that normal distribution. The normal distributions for different codings are mutually independent.

Figure 5: Variational Autoencoder. Gray nodes are operations, $\mu$, $\sigma$ nodes have a linear activation function. [Diagram: the $\sigma$ outputs are multiplied by the output of a Gaussian noise generator and added to the $\mu$ outputs.]

VAEs learn to minimize $L_{VAE}$, where $L_{VAE} = D_{KL}(\mathcal{N}(\mu, \sigma) \| \mathcal{N}(0, 1)) + L_{MSE}$. So they learn to copy their inputs to the outputs, while maintaining approximately a normal distribution in the codings. It has been proven in [6] that this loss can be computed as

$$L_{VAE} = L_{MSE} - \frac{1}{2N} \sum_{i=1}^{N} \sum_{l=1}^{G} \left(1 + \log(\sigma_{il}^2) - \mu_{il}^2 - \sigma_{il}^2\right).$$

In [3], it has been proposed to speed up convergence in training by predicting the logarithm of the variance ($\log(\sigma_i^2) = v_i$) instead of the standard deviation. Then $L_{VAE}$ will be:

$$L_{VAE} = L_{MSE} - \frac{1}{2N} \sum_{i=1}^{N} \sum_{l=1}^{G} \left(1 + v_{il} - \mu_{il}^2 - e^{v_{il}}\right).$$

The VAE decoder input is now $\vec{\mu} + \vec{\varepsilon} \cdot \vec{\sigma}$ with $\sigma_i = e^{v_i/2}$, where $\vec{\varepsilon}$ is a vector of samples from the standard normal distribution. Backpropagation in VAEs is unchanged; all operations should be considered without any skipping.

If a VAE is properly learned, sampling becomes easy. We can expect a normal distribution of its codings if we sample from the real learned distribution. The encoding part is then redundant and can be skipped. The result is only a random sampler which gives inputs to the decoder.


3.4   Support Vector Machine (SVM)

A support vector machine will be tested as an alternative to a multilayer perceptron for the starting classification of available data, due to the frequent use of SVMs in malware detection [7, 9, 10, 16].

An SVM is constructed with the objective of best generalization, i.e., maximal probability that the classifier $\varphi$ classifies correctly with respect to the random variables $X$ and $Y$ producing the inputs and outputs, respectively,

$$\max P(\varphi(X) = Y). \tag{1}$$

For our high-dimensional feature space $\mathcal{X} \subset \mathbb{R}^n$, it is sufficient to consider only a linear SVM, which classifies according to some hyperplane $H_w = \{x \in \mathbb{R}^n \mid x^\top w + b = 0\}$
with $w \in \mathbb{R}^n$, $b \in \mathbb{R}$,

$$(\forall x \in \mathcal{X})\ \varphi(x) = \varphi_w(x) = \begin{cases} 1 & \text{if } x^\top w + b < 0, \\ -1 & \text{if } x^\top w + b \ge 0. \end{cases} \tag{2}$$

It can be shown [1, 14] that under quite weak conditions, searching for maximal generalization (1) is equivalent to searching for the maximal margin between the representatives of both classes in the training data,

$$\max \frac{\rho}{\|w\|} \quad \text{with constraints } c_k x_k^\top w \ge \frac{\rho}{2},\ k = 1, \dots, p, \tag{3}$$

where $\rho$ is the scaled margin and $(x_k, c_k) \in \mathbb{R}^n \times \{-1, 1\}$ are the training samples, and that using the standard Lagrangian approach for inequality constraints, (3) can be transformed into the dual task

$$\max_{(\alpha, \rho)} \; -\frac{1}{4} \sum_{j,k=1}^{p} \alpha_j \alpha_k c_j c_k x_j^\top x_k + \frac{\rho}{2} \sum_{k=1}^{p} \alpha_k \quad \text{with constraints KKT},\ \alpha_1, \dots, \alpha_p \ge 0,\ \rho > 0, \tag{4}$$

where $\alpha_1, \dots, \alpha_p$ are Lagrange multipliers. The objective function in (4) is quadratic, thus it has a single global maximum, which can be found in a straightforward way. The abbreviation KKT in (4) stands for the Karush-Kuhn-Tucker conditions

$$\alpha_k \left(\frac{\rho}{2} - c_k x_k^\top w\right) = 0, \quad k = 1, \dots, p. \tag{5}$$

Due to KKT, the classifier (2) in terms of the solution $\alpha_1^*, \dots, \alpha_p^*, \rho^*$ of (4) turns to

$$(\forall x \in \mathcal{X})\ \varphi_w(x) = \begin{cases} 1 & \text{if } \sum_{x_k \in S} \alpha_k^* c_k x^\top x_k + \rho^* \ge 0, \\ -1 & \text{if } \sum_{x_k \in S} \alpha_k^* c_k x^\top x_k + \rho^* < 0, \end{cases} \tag{6}$$

where $S = \{x_k \mid \alpha_k^* > 0\}$. The vectors in $S$ lie in the support hyperplanes of the representatives of both classes in the training data. Therefore, they are called support vectors.

Because the number of input features is 540, and at least 40% of them are binary or almost constant, we decided to use a linear SVM. Moreover, when a polynomial kernel ($p = 2$) was used, the convergence was too slow.


3.5    Linear Regression

To estimate the trend of a time series of model accuracies, we need to perform a linear regression [13] for $C$ points in two dimensions $(x_1, y_1), (x_2, y_2), \dots, (x_C, y_C)$. More precisely, the trend of the time series is described by the slope $a$ of the line $\hat{y}_i = a x_i + b$, where

$$a = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}, \qquad b = \bar{y} - a \bar{x},$$

with $\bar{t} = \frac{1}{C} \sum_{i=1}^{C} t_i$.


4     Proposed Strategy for Online Learning with VAEs

We propose an online learning strategy which focuses on more effective learning and constant memory requirements for features. The strategy uses two deep learning architectures: MLP and VAE. While an MLP is trained to replicate labels, a VAE is used as a feature generator. Hence, a VAE can generate new unseen samples representing the history for an MLP. The pseudocode of the algorithm can be found in Algorithm 1 and a diagram of training data paths is depicted in Figure 6.

In the first week of training, the VAE is trained on the current moving window, which acts as a memory limit. The same applies to the MLP, but it also uses label information. The following weeks are different. The VAEs also use data sampled from the previous week's VAE; this provides something like a moving average. The problem is in choosing the right (1) time to update, (2) size of the generated data, and (3) relative importance of the generated data. All MLPs are also trained on data generated by the VAEs; because generated data lack label information, the previous week's MLP must be employed to add it.

Figure 6: Training data paths for VAEs and MLPs for each week. Red indicates generated data, blue adds label classifications to features. [Diagram: one VAE and one MLP for each of week 1, week 2 and week 3.]


5     Experiments with Malware Detection Data

In this section, we describe several experiments with real-world data from the area of malware detection.

5.1   Data

We use real-world anonymized data, which feature malware and clean software in several categories, but we consider only two categories by merging some of them. The semantics of the individual features has not been made available by the company. The feature space is very complex: there are 540 features with various distributions. This makes it particularly difficult to choose the correct data scaling. In Figure 7, several groups of features are differentiated:
  • Binary feature
  • Normally distributed feature: both absolute skewness and kurtosis are less than 2
  • Highly skewed feature: skewness > 30
  • Almost constant feature: more than 99.9 % of values are identical
  • Other unknown distributions

Figure 7: Distribution in the feature space. [Pie chart; segment labels: Almost constant, Highly skewed, Binary, Gaussian, Other; percentages shown in the chart: 19%, 20%, 21%, 30%, 10%.]

The data are initially divided by week. We decided to keep this natural division even though some of the weeks are mostly empty. We have used 375 weeks in our experiments; the number of files and the proportion of malware files in them are depicted in Figure 8.

5.2     Performed Experiments and Their Results

To be able to decide if a neural network is a good model for this task, we compare it with a linear SVM. The number of recent training examples is chosen as M = 150 000; it corresponds to about 5.5 average weeks and at least 309 MiB of RAM. In order to evaluate the full dataset, one must process 113 GiB of data, and train, sample and evaluate about 370 SVMs and VAEs.

Both the MLP and the SVM models are Bayesian optimized on the first week following the first M excluded data points by the GPyOpt library [15], using the maximum probability of improvement as the acquisition function and mixed sequential and local penalization evaluation. The MLP model uses the Adam algorithm with early stopping after 10 evaluations without improvement on the validation data (25 % of the actual training data). The MLP uses only densely connected layers with the cross-entropy loss, the SVM uses the squared hinge loss. The resulting hyperparameters are in Tables 1 and 2. Table 1 shows a noticeably larger network size together with a lot of regularization.

We have applied the Bayesian optimization also to the VAE, but the results were not conclusive. Layers prefer to be as large as possible because the $L_{MSE}$ part of the loss can be reduced more with larger layers. Unfortunately, this does not reveal whether some increase in history size (M)

Algorithm 1 Proposed online learning algorithm
Require:
    number N                                ▷ N is the number of inputs generated by the previous generator.
    number M                                ▷ M is the size of the considered most recent training data.
    function data_for_iteration(number)     ▷ It gives access to stored data for the iteration with the provided number.
    function labels_for_iteration(number)   ▷ Same as the previous function, but for labels.
Ensure:
    Provides discriminator updates for each client   ▷ The discriminator can predict labels for new data.

procedure CLIENT
    discriminator ← function(x){return default class}
    while workstation runs do
        if exists new version of discriminator then
            discriminator ← update_discriminator()
        end if
        if new undecided file x exists then
            input ← get_features(x)
            label ← discriminator(input)
            send_to_server(x)
            perform the task-specific operation on x according to label
        end if
    end while
end procedure

procedure SERVER
    iteration ← 0
    while not last iteration do
        iteration ← iteration + 1
        data ← most_recent_data(M)
        labels ← most_recent_labels(M)
        if iteration > 1 then
            gen_data ← generator(N)
            gen_labels ← discriminator(gen_data)
            data ← [data; gen_data]
            labels ← [labels; gen_labels]
        end if
        generator ← learn_generator(data)
        discriminator ← learn_discriminator(data, labels)
        publish_discriminator(discriminator)
        while updating the discriminator is not needed do
            wait()
        end while
    end while
end procedure
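The SERVER procedure of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under loudly stated assumptions: the generator is a per-feature Gaussian sampler standing in for the VAE, the discriminator is a nearest-centroid classifier standing in for the MLP, and the data stream is a list of synthetic two-dimensional batches. The names `server` and `make_batch` are hypothetical; only the data flow (moving window of size M, N generated samples labelled by the previous discriminator) follows the algorithm.

```python
import numpy as np

def learn_generator(data, rng):
    """Stand-in for the VAE (assumption): independent Gaussian per feature."""
    mean, std = data.mean(axis=0), data.std(axis=0) + 1e-9
    return lambda n: rng.normal(mean, std, size=(n, data.shape[1]))

def learn_discriminator(data, labels):
    """Stand-in for the MLP (assumption): nearest class centroid."""
    classes = np.unique(labels)
    centroids = np.stack([data[labels == c].mean(axis=0) for c in classes])

    def predict(x):
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return classes[dist.argmin(axis=1)]

    return predict

def server(batches, M, N, seed=0):
    """Retrain once per iteration; return the published discriminators."""
    rng = np.random.default_rng(seed)
    window_x = np.empty((0, batches[0][0].shape[1]))
    window_y = np.empty((0,), dtype=int)
    generator = discriminator = None
    published = []
    for iteration, (x, y) in enumerate(batches, start=1):
        # Moving window: keep only the M most recent labelled samples.
        window_x = np.vstack([window_x, x])[-M:]
        window_y = np.concatenate([window_y, y])[-M:]
        data, labels = window_x, window_y
        if iteration > 1:
            # History is replayed through the previous generator and
            # labelled by the previous discriminator.
            gen_data = generator(N)
            gen_labels = discriminator(gen_data)
            data = np.vstack([data, gen_data])
            labels = np.concatenate([labels, gen_labels])
        generator = learn_generator(data, rng)
        discriminator = learn_discriminator(data, labels)
        published.append(discriminator)
    return published

# Synthetic stream with a slow drift of both classes (illustrative only).
rng = np.random.default_rng(1)

def make_batch(shift):
    x0 = rng.normal(-2 + shift, 0.5, size=(50, 2))
    x1 = rng.normal(2 + shift, 0.5, size=(50, 2))
    return np.vstack([x0, x1]), np.array([0] * 50 + [1] * 50)

batches = [make_batch(0.1 * k) for k in range(4)]
models = server(batches, M=150, N=50)
print(models[-1](np.array([[-2.0, -2.0], [2.5, 2.5]])))
```

Memory use is bounded by M + N samples per iteration regardless of how long the stream runs, which is the property the strategy aims for; the waiting loop and the client side of Algorithm 1 are omitted here.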
Figure 8: Number of analyzed files and the proportion of malware files in each week. [Plot; left axis: number of files in week (0 to 35000); right axis: proportion of malware files (0% to 100%); horizontal axis: week number (0 to 400).]

[Figure 9 plot omitted. Horizontal axis: week number (0 to 350). Vertical axis: median of accuracy (0.2 to 1). Series: MLP1 median accuracy, MLP1 linear regression, SVM median accuracy, SVM linear regression; significant results are marked.]

Figure 9: Comparison between two models trained on the data from the first week. The trend in the time series indicates
that a data drift is present.
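
The linear trend used in Figure 9 as an indicator of drift can be illustrated with a small least-squares fit. The following is only a minimal sketch: the accuracy values, the assumption that the median is taken over repeated runs within a week, and the function names `weekly_median_accuracy` and `linear_trend` are hypothetical and are not taken from the experiments.

```python
from statistics import median


def weekly_median_accuracy(runs_per_week):
    """Return the median accuracy over the repeated runs of each week."""
    return [median(runs) for runs in runs_per_week]


def linear_trend(values):
    """Ordinary least-squares fit values[t] ~ intercept + slope * t."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a trend")
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    cov = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    var = sum((t - t_mean) ** 2 for t in range(n))
    slope = cov / var
    intercept = v_mean - slope * t_mean
    return slope, intercept


# Hypothetical accuracies of a model trained once and tested on later weeks.
runs_per_week = [
    [0.91, 0.90, 0.92],
    [0.88, 0.89, 0.87],
    [0.86, 0.85, 0.86],
    [0.83, 0.84, 0.82],
]
medians = weekly_median_accuracy(runs_per_week)
slope, intercept = linear_trend(medians)
print(f"estimated change of median accuracy per week: {slope:.4f}")
```

A clearly negative slope for a model that is never retrained is consistent with drift in the data, although the fit alone does not establish statistical significance.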

[Figure 10 plot omitted. Horizontal axis: week number (0 to 300). Vertical axis: median of accuracy (0.7 to 0.95). Series: SVM, MLP1, MLP2; significant results are marked.]

Figure 10: Comparison between SVMs and MLPs retrained for each week. There is no clear gradual increase in the
difficulty of the problem. MLP2 seems to be the best of the compared models. A result is highlighted if it is significantly
better than another, worse result in the respective week.
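
The significance marks rely on pairwise tests within a week. The text names the Wilcoxon rank-sum test with Holm correction at the 5% family-wise level; the sketch below shows one way such a comparison could be computed. It assumes the normal approximation of the rank-sum statistic without a tie correction of the variance, and the accuracy values and function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_p_value(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    s = sum(ranks[:n1])
    expected = n1 * (n1 + n2 + 1) / 2
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (s - expected) / std
    return math.erfc(abs(z) / math.sqrt(2))


def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure; returns one reject flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break  # all remaining (larger) p-values are retained
    return reject


# Hypothetical accuracies of repeated runs in a single week.
acc = {
    "MLP1": [0.80, 0.81, 0.79, 0.82, 0.80, 0.81, 0.78, 0.80, 0.79, 0.81],
    "MLP2": [0.90, 0.91, 0.89, 0.92, 0.90, 0.91, 0.88, 0.90, 0.89, 0.91],
    "SVM": [0.89, 0.91, 0.90, 0.93, 0.88, 0.90, 0.91, 0.89, 0.90, 0.91],
}
pairs = [("MLP1", "MLP2"), ("MLP1", "SVM"), ("MLP2", "SVM")]
p_values = [rank_sum_p_value(acc[a], acc[b]) for a, b in pairs]
reject = holm_reject(p_values, alpha=0.05)
for (a, b), p, r in zip(pairs, p_values, reject):
    print(f"{a} vs {b}: p = {p:.4g}, significant = {r}")
```

In this toy data the two comparisons involving MLP1 are significant after correction, whereas MLP2 and SVM are not distinguishable.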

[Figure 11 plot omitted. Horizontal axis: week number (0 to 50). Vertical axis: median of accuracy (0.76 to 0.94). Series: MLP2, MLP1, SVM, VAE-MLP1.]

Figure 11: Results of our algorithm in the first 55 weeks. MLP1, which is significantly the worst model on its own, gains a
significant performance advantage when combined with a VAE, to the point of basically matching MLP2 and SVM. A summary
of comparison results is given in Table 4.
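
A pairwise summary of the kind reported in Tables 3 and 4 can be obtained by counting, for each ordered pair of models, the weeks in which the first model has the higher median accuracy. The sketch below is an assumption about how such a table may be computed, not the evaluation code used for the paper; the accuracy values and the function name `pairwise_win_percentages` are hypothetical.

```python
def pairwise_win_percentages(weekly_accuracy):
    """Percentage of weeks in which model a is strictly better than model b.

    weekly_accuracy maps a model name to its per-week median accuracies;
    all lists are assumed to cover the same weeks in the same order.
    """
    names = list(weekly_accuracy)
    n_weeks = len(weekly_accuracy[names[0]])
    table = {}
    for a in names:
        for b in names:
            if a == b:
                continue
            wins = sum(
                x > y for x, y in zip(weekly_accuracy[a], weekly_accuracy[b])
            )
            table[(a, b)] = 100.0 * wins / n_weeks
    return table


# Hypothetical per-week median accuracies for five weeks.
weekly_accuracy = {
    "MLP2": [0.90, 0.91, 0.89, 0.92, 0.90],
    "MLP1": [0.80, 0.82, 0.81, 0.83, 0.80],
    "VAE-MLP1": [0.89, 0.92, 0.88, 0.91, 0.91],
}
table = pairwise_win_percentages(weekly_accuracy)
for (a, b), share in sorted(table.items()):
    print(f"{a} is better than {b} in {share:.1f}% of weeks")
```

Because ties count for neither model, the two percentages of a pair need not add up to 100% in general.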


Table 1: Results of MLP hyperparameter optimization.

    Name              Selected value       Possibilities
    Learning rate     0.00763              0.0001-0.01
    Batch norm.       yes                  yes/no
    Dropout           0.22                 0-0.7
    Gaussian noise    0.795                0-1.0
    Layers            354-322-316-305-2    up to 400-400-400-400-2
    Activation        relu                 elu, selu, softplus, softsign, relu, tanh,
                                           sigmoid, LeakyReLU, PReLU, ELU
    Minibatch size    730                  10-1000
    L1 regular.       0.01                 0-0.1
    L2 regular.       0.0998               0-0.1
    Data scaling      Standard             Standard, Robust, MinMax

Table 2: Results of SVM hyperparameter optimization.

    Name              Selected value       Possibilities
    Penalty           64.44                0.001-80
    Penalty type      l1                   l1/l2
    Data scaling      Standard             Standard, Robust, MinMax

helps more than the appropriate extension of a network. Layers also tend to have elu as the most suitable activation function together with batch normalization.

   Altogether, we used three baselines: MLP1, representing a small, slightly regularized MLP; MLP2 with optimal hyperparameters (Table 1), representing a large and highly regularized network; and an SVM.

   In Figure 9, we see that a data drift is indeed present and that both models are similarly penalized over time. This observation is confirmed by Figure 10, where learning is done for each week and the difficulty of the problem does not seem to increase. The models are not clearly overtrained because both achieved a rather high accuracy with a rather small training dataset. The results were statistically analyzed by the Wilcoxon rank-sum test with Holm correction on the 5% family-wise significance level [4]. For models trained only once, the results showed that the SVM was better in 88.3% of weeks and significantly better in 45.9% of them. MLP1 was significantly better in only 0.8% of weeks. It is important to say that the MLP1 in this test does not have optimal hyperparameters; we only want to see whether its behaviour evolves with time. The results of this comparison can be seen in Figure 9.

   Subsequently, MLP1, MLP2 and SVM were trained repeatedly each week with a corresponding history of size M and then tested on the next week. The results are depicted in Figure 10, whereas a summary is in Table 3.

   Our VAE-MLP algorithm is rather slow, due to inherently sequential training. For the VAE, we used the 540-200-200-10-200-200-540 fully connected architecture with elu activations and the loss L_MSE. The network is updated with data from each week with M = N = 150,000. Figure 11 depicts an interesting property. The previously clearly inferior MLP1 is improved by VAE to the point of matching
baseline performance. This clearly shows the potential of the algorithm, given that we neither optimize the MLP and the VAE together nor tune the values M and N. Table 4 summarizes and further confirms the findings from Figure 11.

Table 3: Summary of baseline consideration. MLP1 is a small network with little regularization, MLP2 is a large network with a lot of regularization, and a linear SVM is considered because it may have superior generalization properties.

    Row model is better than column model (percentage of weeks):

                MLP1      MLP2      SVM
    MLP1        -         6.1%      2.9%
    MLP2        93.9%     -         61.6%
    SVM         97.1%     38.4%     -

    Row model is significantly better than column model (percentage of weeks):

                MLP1      MLP2      SVM
    MLP1        -         0.0%      0.0%
    MLP2        18.1%     -         6.1%
    SVM         12.8%     0.0%      -

Table 4: Summary of the results of the first 50 weeks between baselines (MLP2 and SVM) and MLP1 without and with VAE (VAE1).

    Row model is better than column model (percentage of weeks):

                MLP2      MLP1      SVM       VAE1
    MLP2        -         100.0%    100.0%    82.0%
    MLP1        0.0%      -         0.0%      0.0%
    SVM         0.0%      100.0%    -         8.0%
    VAE1        18.0%     100.0%    92.0%     -

    Row model is significantly better than column model (percentage of weeks):

                MLP2      MLP1      SVM       VAE1
    MLP2        -         100.0%    84.0%     4.0%
    MLP1        0.0%      -         0.0%      0.0%
    SVM         0.0%      100.0%    -         0.0%
    VAE1        0.0%      100.0%    14.0%     -

   If you are interested, you can try our model or help with development at the following links:

     Bayesian hyperparameter optimization framework
     https://github.com/tumpji/Bayesian-optimizer.git

     Implementation of the proposed method
     https://github.com/tumpji/VAE-NN-Tensorflow.git

     Deep belief networks in Tensorflow
     https://github.com/tumpji/DBN-Tensorflow.git


6     Conclusion

This paper presented work in progress on a new approach to online deep classification learning in data streams with slow or moderate drift. This kind of learning is highly relevant for the application domain of malware detection. In the paper, the employed methods have been recalled and the principles of the proposed approach have been outlined. In ongoing experiments, the approach is currently being validated on a large set of real-world malware-detection data. This dataset contains Windows executable files from 375 weeks, with up to 30,000 binary files from each week. Due to the large size of the dataset, only the baseline detection using an MLP alone has been fully tested up to now, and compared to classification based on linear SVMs, which are frequently used in malware detection. The computational demands of the proposed new approach have so far allowed testing it on only 55 weeks. Results of the ongoing experiment will be available and presented at the workshop.


Acknowledgement

The research reported in this paper has been supported by the Czech Science Foundation (GAČR) grant 18-18080S. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated.


References

 [1] P.J. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers.
     In B. Schölkopf, C.J.C. Burges, and A.J. Smola, editors, Advances in Kernel Methods – Support Vector Learning,
     pages 43–54. MIT Press, Cambridge, 1999.
 [2] W. L. Buntine and A. S. Weigend. Computing second derivatives in feed-forward networks: a review. IEEE
     Transactions on Neural Networks, 5(3):480–488, May 1994.
 [3] Aurélien Géron. Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to
     build intelligent systems. O'Reilly Media, Boston, first edition, 2017.
 [4] S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6:65–70, 1979.
 [5] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], December
     2014. arXiv: 1412.6980.
 [6] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. arXiv:1312.6114 [cs, stat], December
     2013. arXiv: 1312.6114.
 [7] M. Mursleen, A.S. Bist, and J. Kishore. A support vector machine water wave optimization algorithm based
     prediction model for metamorphic malware detection. International Journal of Recent Technology and Engineering,
     7:1–8, 2019.
 [8] A. Narayanan, L. Yang, L. Chen, and L. Jinliang. Adaptive and scalable Android malware detection through online
     learning. In 2016 International Joint Conference on Neural Networks (IJCNN), pages 2484–2491, July 2016.
 [9] N. Nissim, R. Moskowitch, L. Rokach, and I. Elovici. Novel active learning methods for enhanced PC malware
     detection in windows OS. Expert Systems with Applications, 41:5843–5857, 2014.
[10] H.H. Pajouh, A. Dehghantanha, R. Khayami, and K.K.R.
     Choo. Intelligent OS X malware threat detection with code
     inspection. Journal of Computer Virology and Hacking
     Techniques, 14:212–223, 2018.
[11] B. Rashidi, C. Fung, and E. Bertino. Android malicious
     application detection using support vector machine and ac-
     tive learning. In 2017 13th International Conference on
     Network and Service Management (CNSM), pages 1–9,
     November 2017.
[12] Bahman Rashidi, Carol Fung, and Elisa Bertino. Android
     Resource Usage Risk Assessment using Hidden Markov
     Model and Online Learning. Computers & Security, 65,
     November 2016.
[13] Mathieu ROUAUD. Probability, Statistics and Estima-
     tion: Propagation of Uncertainties in Experimental Mea-
     surement. Mathieu ROUAUD, June 2017.
[14] B. Schölkopf and A.J. Smola. Learning with Kernels. MIT
     Press, Cambridge, 2002.
[15] Machine Learning Group-University of Sheffield. GPyOpt.
[16] M. Stamp. Introduction to Machine Learning with Appli-
     cations in Information Security. CRC Press, Boca Raton,
     2018.
[17] William T. Vetterling, Brian P. Flannery, William H. Press,
     and Saul A. Teukolsky. Numerical Recipes: The art of
     scientific computing. Cambridge University Press, Cambridge,
     3rd edition, 2007.