          Online Malware Detection with Variational Autoencoders

                                            Jiří Tumpach1,2 , Martin Holeňa2
         1 Faculty of Mathematics and Physics, Charles University, Malostranské nám. 2, Prague, Czech Republic
    2   Institute of Computer Science, Czech Academy of Sciences, Pod vodárenskou věží 2, Prague, Czech Republic

Abstract: This paper studies the application of variational autoencoders (VAEs) to online learning from malware detection data. To this end, it employs a large real-world dataset of anonymized high-dimensional data collected during 375 consecutive weeks. Several VAEs were trained on selected subsets of this time series and subsequently tested on different subsets. For the assessment of their performance, the accuracy metric is complemented with the Wasserstein distance. In addition, the influence of different kinds of data normalization on the VAE performance has been investigated. Finally, the combinations of a VAE with two multi-layer perceptrons (MLPs) have been studied, which has led to the surprising result that the impact of such a combination on malware detection is positive for a simple and superficially optimized MLP, but negative for a complex and well optimized one.

Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


1   Introduction

Neural networks are popular due to their performance, versatility, scalability and learning of features. Due to feature learning, a neural network, especially a deep one, generally does not need extensive feature engineering. On the other hand, it requires a large amount of training data. There exist application areas where the feature distribution and the optimal classification can change in time. Classification in such situations is often called online classification. One of such areas is malware detection. The reason for the changing distribution in malware detection is data drift, which comes from several sources. Firstly, benign programs use different frameworks or statically linked libraries based on their changing popularity. Secondly, malicious programs attempt to hide by copying clean programs' code and behaviour, or by using evasion techniques like the manipulation of code during runtime, and those techniques also change in time, as a reaction to updates of the databases of malware detection software. For this reason, neural networks hardly ever have enough training data to be promptly prepared for new threats. Moreover, even if they had enough data, their training would probably be too slow. This paper reports a work in progress investigating one of the possibilities to adapt neural networks for malware detection. We prepared a hybrid model that consists of multiple pairs of a generator and a classifier. Our idea is to make the generators learn the history and later use them for data augmentation, because many malware evasion techniques are shared. As the generators, we use variational autoencoders (VAEs), and classification is performed using multi-layer neural networks. Firstly, we present results of our hybrid classification-autoencoder model, which employs a kind of transient feature learning. Secondly, we investigate how an autoencoder behaves when it is trained with high amounts of outliers.

In Section 2, we survey techniques for online malware detection. Section 3 is devoted to the principles of methods employed in our investigation. Section 4 presents the results of the performed experiments.


2   Online Malware Detection

Malware is continuously evolving by exploiting new vulnerabilities and examining evasion techniques [7]. This causes significant changes in malware behaviour and consequently complicates the application of classification techniques for malware detection. These techniques must be precise enough not to bother users with false positives, as well as highly adaptive in order to detect new threats as soon as possible.

Malware detection techniques are static and dynamic [8]. While static methods are focused on an analysis of program code without actually running a particular program, dynamic methods analyse the programs' behaviour. Their main benefit is a potentially finite number of inspected behaviours. Commonly used features for dynamic analysis are resources, privileges, calls (system or APIs) and results of tracking sensitive data (e-mails, etc.) inside of an application [7].

One of the successful examples of the dynamic approach to malware detection is DroidOL [7]. It analyses inter-procedural control-flow sub-graphs with an adaptation of the Weisfeiler-Lehman kernel (WL). WL was successfully used to classify graph types with remarkable speed thanks to its linear time complexity. The authors of DroidOL found out that this representation is robust against hiding attempts of malicious software. After WL preprocessing, DroidOL applies a passive-aggressive classifier, which is a kind of online learning method suited for big data. On real Android applications, DroidOL outperforms state-of-the-art malware detectors.
Online Support Vector Machines with an RBF kernel were investigated in [9], and they use features based on application behaviour, too.

A different approach to malicious software was investigated in [8]. The authors understand that allowing every application access to all information (location, contacts, pictures, ...) is not ideal. Alternatively, constantly bothering users with requests for privilege attribution could make them indifferent. Their system XDroid [8] tackles this problem by an online hidden Markov model (HMM) learning the users' priorities.

This paper relies on the content of [10], where the main idea was first introduced; here we try to identify the problems we had and to come up with a solution that could be useful in similar situations. We are especially interested in the difficulty of VAE training, since it is a model which is usually applied to problems where pictures are involved. Pictures have a really nice property – all features have limited, similarly important values. It could also be the reason why a VAE usually smoothes generated samples [5]: it exploits spatial dependence, but the model does not have enough power to learn proper borders between objects. Malware detection requires a similar approach – some features are independent, but others are highly dependent. Nevertheless, the features are not similarly distributed.


3   Methodology

3.1   Autoencoders

Autoencoders are artificial neural networks capable of nonlinear space reduction. They are trained almost like classical neural networks for regression, but their objective is the reconstruction of their input. The reconstruction function is any loss function that could be used for regression problems. A usual choice is the mean squared error (MSE) or the mean absolute error (MAE). Their main benefit comes from the introduction of a bottleneck. The bottleneck can be a narrow part of the network or some heavy regularisation. Such a bottleneck is required to have a restricted flow of information. This restriction causes an autoencoder to preserve only the most important patterns in the data and drop the kind of information that can be easily deduced or is noisy. An interesting fact is that neural networks with the identity activation function and MSE loss converge to the same mapping as the principal component analysis transformation [6].

There exist several kinds of autoencoders. Most commonly, they are used as a dimensionality reduction method, or for denoising – where the input is formed by a corrupted image while a crisp image is expected as the output [1, 5].
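To make the bottleneck idea concrete, the following is a minimal sketch of such an autoencoder in Keras; the layer widths and the optimizer are illustrative assumptions, not the architecture used in our experiments (that one is described in Section 4.2).

    # Minimal bottleneck autoencoder (illustrative layer widths, not the
    # architecture used in the experiments below).
    import tensorflow as tf
    from tensorflow.keras import layers, models

    input_dim, coding_dim = 540, 10          # assumption: 540 features as in Section 4.1

    encoder = models.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(128, activation="elu"),
        layers.Dense(coding_dim),            # the bottleneck (linear activation)
    ])
    decoder = models.Sequential([
        layers.Input(shape=(coding_dim,)),
        layers.Dense(128, activation="elu"),
        layers.Dense(input_dim),
    ])
    autoencoder = models.Sequential([encoder, decoder])
    autoencoder.compile(optimizer="adam", loss="mse")   # MSE or MAE, as discussed above
    # autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)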
3.2   Variational Autoencoder (VAE)

Sometimes new data needs to be produced based on a small number of real samples. For these types of problems, autoencoders offer a solution. Bottlenecks significantly reduce the chances of overtraining while maintaining appropriate power due to nonlinear space reduction.

One problem remains: if we would like to generate a new sample, we need to know the distribution of the codings. For this reason, a VAE adds two updates to the regular autoencoder that make the codings obey some distribution (usually normal). The first update is to add another term to the model's loss. The Kullback-Leibler (KL) divergence is a measure that relates two distributions µ and ν and is defined by

   D_KL(µ ‖ ν) = H(µ, ν) − H(µ)
               = −∫_{R^n} µ(x) ln ν(x) dx + ∫_{R^n} µ(x) ln µ(x) dx
               = ∫_{R^n} µ(x) ln (µ(x)/ν(x)) dx,

where H(µ, ν) is the cross-entropy and H(µ) is the entropy. If µ and ν are equal, the divergence is 0. Otherwise, it is a positive value.

The model can link small changes in the codings with large changes in the result, giving an opportunity for overtraining. So the second update is to add uncertainty into the codings. In practice, the coding layer of a VAE is split into two branches, one for means and one for variances. The second part of the network – the decoder – then receives samples based on these parameters of the generating distributions. In fact, it is common to use the logarithm of the variance since it has a more suitable range. Then a loss based on the KL divergence can be reduced to

   L_VAE = L_Reconstruction − (1/(2N)) ∑_{i=1}^{N} ∑_{l=1}^{G} (1 + v_il − m_il² − e^{v_il}),

where N is the batch size, G is the coding dimension, v_il is the logarithm of the coding variance and m_il is the coding mean [6]. A scheme of a small VAE is depicted in Figure 1.

[Figure 1 (diagram omitted): a small variational autoencoder with coding nodes µ_1, µ_2, µ_3 and σ_1, σ_2, σ_3 and a Gaussian noise generator feeding the sampling step.]
Figure 1: Variational autoencoder. Gray nodes are operations; the µ_i and σ_i nodes have a linear activation function.
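The split into mean and log-variance branches and the KL term above can be written compactly; the following Keras sketch illustrates that computation (the names and the surrounding encoder are assumptions, not our implementation).

    # Sampling layer: z = m + exp(v/2) * eps, with v the log-variance
    # (the "Gaussian noise generator" in Figure 1).
    import tensorflow as tf
    from tensorflow.keras import layers

    class Sampling(layers.Layer):
        def call(self, inputs):
            m, v = inputs
            eps = tf.random.normal(tf.shape(m))
            return m + tf.exp(0.5 * v) * eps

    def kl_term(m, v):
        # -(1/(2N)) * sum_i sum_l (1 + v_il - m_il^2 - e^{v_il}), cf. the loss above;
        # m and v have shape (batch, coding_dim)
        return -0.5 * tf.reduce_mean(tf.reduce_sum(1.0 + v - tf.square(m) - tf.exp(v), axis=1))

    # Inside an encoder, the last hidden layer h feeds two parallel Dense layers:
    #   m = layers.Dense(coding_dim)(h); v = layers.Dense(coding_dim)(h)
    #   z = Sampling()([m, v])
    # and the training loss is reconstruction_loss + kl_term(m, v).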
3.3   Wasserstein distance

Let ‖·‖ be an arbitrary norm on R^n, n ∈ N, and µ, ν be probability measures on R^n with finite first moments. For each n-dimensional random vector X, denote D(X) the distribution of X. Then the Wasserstein distance of order 1, or simply Wasserstein distance (aka Wasserstein measure, Kantorovich-Rubinstein distance, earth mover's distance), between µ and ν corresponding to the norm ‖·‖ is defined [4, 11]:

   W_1(µ, ν) = W(µ, ν) = inf { E‖X − Y‖ | D(X) = µ ∧ D(Y) = ν }.   (1)

We decided to use the Wasserstein distance as a method for model comparison. Unlike the Kullback-Leibler divergence, it is a metric – it is symmetric and fulfills the triangle inequality. Moreover, it does not require the considered measures to share a probability space. The Wasserstein distance is also scale sensitive, W(cµ, cν) ≤ c^β W(µ, ν), and sum invariant, W(µ + A, ν + A) ≤ W(µ, ν), for any β, c > 0 and A ∈ R^n [2]. Therefore, it provides us some intuition about our results.
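For empirical one-dimensional samples, this distance is readily available in SciPy; the following sketch shows how a per-feature distance between a held-out subsample and generated data can be computed (the arrays are stand-ins; the figures in Section 4 report medians of such per-feature distances).

    # One-dimensional Wasserstein distance between two empirical samples,
    # a stand-in for one feature of the held-out and the generated data.
    import numpy as np
    from scipy.stats import wasserstein_distance

    rng = np.random.default_rng(0)
    held_out = rng.normal(loc=0.0, scale=1.0, size=10_000)    # placeholder for real data
    generated = rng.normal(loc=0.1, scale=1.2, size=10_000)   # placeholder for VAE output

    print(wasserstein_distance(held_out, generated))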

3.4   Considered kinds of normalization

All included attributes of the data were normalized with 14 different transforming functions, denoted T1–T14. The definitions of nearly all of them are quite simple, therefore they are merely listed in Table 1. The only exception is T5, which performs a power transformation proposed by Yeo and Johnson [12]. It is based on the popular Box-Cox transform [3].

Table 1: Functions considered as normalizing transformations of data attributes.

   Function definition                                          Mnemonic name
   T1(a)  = (a − min A) / (max A − min A)                       min-max
   T2(a)  = a / max |A|                                         max-abs
   T3(a)  = (a − (1/p) ∑_{j=1}^{p} a_j) /
            sqrt(∑_{i=1}^{p} (a_i − (1/p) ∑_{j=1}^{p} a_j)²)    standardize
   T4(a)  = (a − A_0.5) / (A_0.75 − A_0.25)                     robust
   T5(a)  = ψ(a, λ̂) with ψ and λ̂ defined in (2)–(3)            power
   T6(a)  = q such that A_q = a                                 perc-uniform
   T7(a)  = φ⁻¹(T6(a))                                          perc-normal
   T8(a)  = T1(log(a + e − min A))                              log-min-max
   T9(a)  = T2(log(a + e − min A))                              log-max-abs
   T10(a) = T3(log(a + e − min A))                              log-standardize
   T11(a) = T4(log(a + e − min A))                              log-robust
   T12(a) = T5(log(a + e − min A))                              log-power
   T13(a) = T6(log(a + e − min A))                              log-perc-uniform
   T14(a) = T7(log(a + e − min A))                              log-perc-normal

We use the following notations: a ∈ R denotes an arbitrary value of a transformed attribute A, whereas A denotes the set of values of A in the training data a_1, ..., a_p, |A| denotes the set of absolute values |a_1|, ..., |a_p|, A_q, q ∈ (0, 1), denotes the quantile q of a sample of A in the training data, φ is the cumulative distribution function of the standard normal distribution N(0, 1), and e is the Euler constant, on which natural logarithms are based. Because T6, T7, T13 and T14 are based on quantile prediction, the final transformations are formed by substituting linear interpolations for those quantiles. Due to performance reasons, the quantiles are predicted based on 100000 samples of A.
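Most of T1–T7 correspond closely to standard scikit-learn preprocessors; the mapping below is a sketch under that assumption (details such as the exact variance normalisation of T3 or the post-standardisation of the power transform may differ slightly from Table 1).

    # Rough scikit-learn counterparts of T1-T7 (a sketch, not our implementation).
    from sklearn.preprocessing import (MinMaxScaler, MaxAbsScaler, StandardScaler,
                                       RobustScaler, PowerTransformer,
                                       QuantileTransformer)

    scalers = {
        "min-max":      MinMaxScaler(),                                  # T1
        "max-abs":      MaxAbsScaler(),                                  # T2
        "standardize":  StandardScaler(),                                # T3
        "robust":       RobustScaler(quantile_range=(25.0, 75.0)),       # T4
        "power":        PowerTransformer(method="yeo-johnson"),          # T5
        "perc-uniform": QuantileTransformer(output_distribution="uniform",
                                            subsample=100_000),          # T6
        "perc-normal":  QuantileTransformer(output_distribution="normal",
                                            subsample=100_000),          # T7
    }
    # The log-* variants T8-T14 apply the same scalers to log(a + e - min A),
    # e.g. via a FunctionTransformer placed before the scaler in a Pipeline.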
The function T5 is defined in four steps:

   I. A parametrized version ψ of the intended transformation is defined, which depends on a parameter λ ∈ R:

      ψ(a, λ) = ((a + 1)^λ − 1) / λ                 if a ∈ R_{≥0}, λ ≠ 0,
                ln(a + 1)                           if a ∈ R_{≥0}, λ = 0,
                −((−a + 1)^{2−λ} − 1) / (2 − λ)     if a ∈ R_{<0}, λ ≠ 2,
                −ln(−a + 1)                         if a ∈ R_{<0}, λ = 2.          (2)

   II. It is assumed that for some range of values of the parameter λ, the value of ψ(a, λ) is a random variable with the distribution N(µ, σ²) for some µ ∈ R, σ ∈ R_{>0}. Consequently, the log-likelihood of the parameters (λ, µ, σ²) with respect to the training data is

      ℓ(λ, µ, σ²; a_1, ..., a_p) = −(p/2) log(2π) − (p/2) log σ² − (1/(2σ²)) ∑_{i=1}^{p} (ψ(a_i, λ) − µ)²
                                   + (λ − 1) ∑_{i=1}^{p} sign(a_i) log(|a_i| + 1).

   III. The parameters µ and σ² are for each λ ∈ R estimated, respectively, with the following estimates µ̂(λ) and σ̂²(λ):

      µ̂(λ) = (1/p) ∑_{i=1}^{p} ψ(a_i, λ),
      σ̂²(λ) = (1/p) ∑_{i=1}^{p} (ψ(a_i, λ) − µ̂(λ))².

   IV. For the parameter λ, the maximum likelihood estimate is used, after replacing the other two parameters by their estimates:

      λ̂ = argmax_{λ ∈ R} ℓ(λ, µ̂(λ), σ̂²(λ); a_1, ..., a_p)
         = argmax_{λ ∈ R} [ (λ − 1) ∑_{i=1}^{p} sign(a_i) log(|a_i| + 1) − (p/2) log σ̂²(λ) ].          (3)
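The maximisation in (3) does not have to be implemented by hand; SciPy's Yeo-Johnson routines optimise essentially the same profile log-likelihood, as the following sketch illustrates (the toy attribute is an assumption).

    # Estimating the Yeo-Johnson parameter of step IV with SciPy.
    import numpy as np
    from scipy import stats

    a = np.random.default_rng(0).exponential(size=1_000) - 0.3   # skewed toy attribute

    transformed, lmbda_hat = stats.yeojohnson(a)   # lambda maximising the log-likelihood
    print(lmbda_hat, stats.yeojohnson_llf(lmbda_hat, a))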
We cannot measure the Wasserstein distance in the transformed space because there is no clear way of comparing the results between transformations. This is not a problem for most of the transformations we use, but perc-uniform and perc-normal are slightly problematic. We have to use a linear interpolation on a sequence of quantiles in order to approximate the initial dataset. For this reason, we expect some increase in the Wasserstein distance on the perc variants, since it is nonzero even if we compare the original dataset to its forward- and then backward-transformed version.


4   Experimental Evaluation

4.1   Available Malware Data

We use real-world anonymized data, which feature malware and clean software in several categories, but we consider only two by merging some of them. Anti-malware software producers must protect their know-how by providing only data which were anonymized. For us, the anonymization process is unknown. It essentially hides the meaning of the features in the provided datasets by stripping it from the documentation and applying unknown functions to those columns. The feature space is very complex, there are 540 features with various distributions. This makes it particularly difficult to choose the correct data scaling. In Figure 2, several kinds of features are differentiated:

   • Binary feature
   • Gaussian feature: both absolute skewness and excess kurtosis are less than 2
   • Highly skewed feature: skewness > 30
   • Almost constant feature: more than 99.9 % of values are identical
   • Other unknown distributions
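A possible reading of these criteria in code is sketched below (the precedence among the categories and the helper name are assumptions; SciPy's kurtosis returns the excess kurtosis used above).

    # Sketch of the feature categorisation of Figure 2 (thresholds from the bullet
    # list above; the order of the checks is an assumption).
    import numpy as np
    from scipy.stats import skew, kurtosis

    def categorise(col: np.ndarray) -> str:
        values, counts = np.unique(col, return_counts=True)
        if counts.max() / len(col) > 0.999:
            return "almost constant"
        if len(values) == 2:
            return "binary"
        if skew(col) > 30:
            return "highly skewed"
        if abs(skew(col)) < 2 and abs(kurtosis(col)) < 2:   # kurtosis() = excess kurtosis
            return "gaussian"
        return "other"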
[Figure 2 (pie chart omitted): proportions of the feature categories binary, Gaussian, highly skewed, almost constant and other in the feature space.]
Figure 2: Distribution in the feature space.


4.2   Experiments with Different Normalizations

Our dataset has many outliers. We were interested in how outliers affect the generator. Therefore, we compared two common loss functions used for regression, both combined with the kinds of normalization considered in Subsection 3.4. The first was the mean square error (MSE) and the second was the mean absolute error (MAE). Both were tested on a rather large network with 7 layers [541, 306, 173, 98, 56, 32, 18, 10], with ELU as the activation function and batch normalisation on every layer except the coding and output ones. The results can be seen in Figure 4. It looks like the best normalisation for our data is the power transform, but the differences between different kinds of normalizations are, in general, small. It transforms the data to be more normal. Both perc-uniform and perc-normal performed excellently. It seems that small distortions created by the linear interpolations of quantiles do not deteriorate the results.
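For illustration, the encoder stack described above could be built as follows (a sketch under the stated description; the split of the coding layer into means and log-variances from Section 3.2 and the mirrored decoder are omitted).

    # Encoder stack with widths [541, 306, 173, 98, 56, 32, 18, 10], ELU activations
    # and batch normalisation on every layer except the coding (last) one.
    from tensorflow.keras import layers, models

    widths = [541, 306, 173, 98, 56, 32, 18, 10]
    encoder = models.Sequential([layers.Input(shape=(widths[0],))])
    for w in widths[1:-1]:
        encoder.add(layers.Dense(w, activation="elu"))
        encoder.add(layers.BatchNormalization())
    encoder.add(layers.Dense(widths[-1]))   # coding layer, linear, no batch norm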
It is better to use raw data than the robust, min-max and max-abs types of normalisation, especially if MSE is employed. We observe that some of the features consist of a small-valued majority with some huge values. The min-max and max-abs transformations could make this range very small, demanding high precision of the regression for the smaller values. On the other hand, the robust transformation could make the range larger because the quantiles could be evaluated only on the small values – artificially increasing the existing variance.

It does not seem that the log variants have any positive effect, but we saw a slight increase for the min-max and max-abs transformations. This is expected because the logarithm reduces the ranges of values. The robust transformation has a substantial incompatibility with the logarithm transformation because the log-robust transformation operates in logarithmic space. After exponentiation, the errors could be much larger. However, this explanation fails if we consider that the log-perc transformations performed almost equally.

MAE is generally better. MSE is equal to MAE in perc-uniform, perc-normal, log-perc-uniform, log-perc-normal, power and log-power.

[Figure 3 (diagram omitted): three columns labelled week 1, week 2, week 3, each with a VAE node and an MLP node.]
Figure 3: Training data paths for VAEs and MLPs for each week. Red indicates generated data, blue adds label classifications to features.

We also tried to investigate how an increase in the coding size affects the resulting performance. It seems there is no difference: 10 as the coding dimension (predicted normal distributions with 2 parameters) is sufficient no matter which normalisation was chosen. The results can be seen in Figure 5.

Figure 6 shows the expected behaviour. The VAE has trouble expressing features with higher variance. It is especially evident on the highest tertile, where differences between distributions are almost non-existent.

Another view on the data reveals something interesting – there is a great difference between the data with different signs of skew of a particular feature. Negative skew was much harder to imitate, while positive skew was easier to reproduce than in the other two middle quartiles. This could be a property of the dataset features or under-training. It seems that the max-abs and min-max normalizations have a slightly different response.

4.3   Experiments with Different MLPs

We were interested in the ability of generative models to capture important information (e.g. vanishing or strange behaviour) and ignore the noise (waiting for user input). As the next experiment, we prepared a balanced dataset extended by a feature reporting the true benign and malware labels. Firstly, we let the generative model learn on the first part of the data without true labels. Then, every other part of the data was augmented by classified features generated by the previous generator and classified by the previous classifier. This process is depicted in Figure 3.
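One possible reading of this scheme in code is sketched below; all function names are placeholders rather than our implementation, and the details of which data the later VAEs see follow our reading of Figure 3.

    # Weekly training scheme of Figure 3 (placeholder functions, not our code).
    def weekly_training(weeks, n_generated=10_000):
        # Week 1: the generator learns without true labels, the classifier with them.
        vae = train_vae(weeks[0].features)
        clf = train_mlp(weeks[0].features, weeks[0].labels)
        for week in weeks[1:]:
            synthetic = vae.generate(n_generated)        # generated data ("red" in Figure 3)
            synthetic_labels = clf.predict(synthetic)    # labelled by the previous classifier
            features = concatenate(week.features, synthetic)
            labels = concatenate(week.labels, synthetic_labels)
            clf = train_mlp(features, labels)            # classifier for the current week
            vae = train_vae(week.features)               # generator for the following week
        return clf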
We prepared two results of Bayesian optimisation of the MLPs' hyperparameters. The first, MLP1, is the result of the mid-optimisation, and MLP2 is the final result. The hyperparameters are in Table 2. Even though MLP1 is significantly worse than MLP2 in 94% of the weeks, when the VAE was added, it became significantly better in 18% of the weeks, while MLP2 in none of them.
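A sketch of how such a search could be set up with the GPyOpt library mentioned in Table 2 follows; the domain reproduces a subset of the ranges from Table 2, while the objective function is a placeholder.

    # Bayesian optimisation of MLP hyperparameters with GPyOpt (partial domain,
    # placeholder objective; not our code).
    import GPyOpt

    domain = [
        {"name": "learning_rate",  "type": "continuous", "domain": (1e-4, 1e-2)},
        {"name": "dropout",        "type": "continuous", "domain": (0.0, 0.7)},
        {"name": "gaussian_noise", "type": "continuous", "domain": (0.0, 1.0)},
        {"name": "l1",             "type": "continuous", "domain": (0.0, 0.1)},
        {"name": "l2",             "type": "continuous", "domain": (0.0, 0.1)},
        {"name": "minibatch_size", "type": "discrete",   "domain": tuple(range(10, 1001))},
    ]

    def objective(x):
        # Train an MLP with the hyperparameters in x and return its validation error.
        return train_and_evaluate_mlp(x)    # placeholder

    opt = GPyOpt.methods.BayesianOptimization(f=objective, domain=domain)
    opt.run_optimization(max_iter=50)
    print(opt.x_opt, opt.fx_opt)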
In Figure 8, an unusual behaviour of the compared methods can be seen. The VAE with MLP1 is significantly better than the VAE with MLP2 for the weeks 1-71. This is interesting because a separate MLP1 trained on a small amount of samples is generally worse than MLP2.

Several possibilities explain such behaviour. Firstly, the hyperparameters of MLP2 can be simply overtrained on the training samples. Instead of learning about patterns, the model may learn small mistakes made by the VAE and try to infer based on them. Secondly, MLP1 may lack some form of regularization.


5   Conclusion

This paper investigated the application of variational autoencoders to online learning from malware detection data. To this end, it employed a large real-world dataset of anonymized high-dimensional data. We prepared a hybrid model that consists of multiple pairs of a generator and a classifier. The basic idea of our approach is to make the generators learn the history and later use it for data augmentation. In addition, the influence of different kinds of data normalization on the VAE performance has been investigated. For the assessment of the VAE performance, accuracy has been complemented with the Wasserstein distance.

The experiments have proven our expectation that for highly heterogeneous data, normalization is relevant. However, the most basic transformations like min-max and max-abs were worse than using the raw values without any transformation. The experiments have also shown that variational autoencoders can be used for data augmentation in MLP-based classification. However, one should be aware of the hyperparameters determining the size of the classifying MLP. If it is too large, the noise present in the data generated by the VAE could be detrimental instead.


Acknowledgement

The research reported in this paper has been supported by the Czech Science Foundation (GAČR) grant 18-18080S. Computational resources were supplied by the project "e-Infrastruktura CZ" (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.


[Figure 4 (plot omitted): "Comparison between loss functions on performance"; y-axis: median of Wasserstein distance between generated distribution and held-out subsample (log scale); x-axis: normalization; legend: MSE loss, MAE loss; percentile marks 1, 5, 25, 50, 75, 95, 99.]
Figure 4: The graph shows reactions of the VAE to the loss function given a dataset normalization.
[Figure 5 (plot omitted): "Comparison of coding size on performance"; y-axis: median of Wasserstein distance between generated distribution and held-out subsample (log scale); x-axis: normalization; legend: 10 codings, 25 codings, 50 codings.]
Figure 5: The graph shows reactions of the VAE to the increase in coding size given a dataset normalization.



[Figure 6 (plot omitted): "Comparison of normalizations on different features based on their variance"; y-axis: median of Wasserstein distance between generated distribution and held-out subsample (log scale); x-axis: normalization; legend (variance tertiles): <0.0, 0.4756), <0.4756, 0.8788), <0.8788, 428.2).]
Figure 6: The graph shows how the VAE model handles features with different variances given a dataset normalization.
[Figure 7 (plot omitted): "Comparison of normalizations on different features based on their skewness"; y-axis: median of Wasserstein distance between generated distribution and held-out subsample (log scale); x-axis: normalization; legend (skewness bins): <-171.5, 3.768), <3.768, 19.51), <19.51, 72.51), <72.51, 767.1).]
Figure 7: The graph shows how the VAE model handles features with different skew given a dataset normalization.




[Figure 8 (plot omitted): y-axis: median of accuracy (0.55-0.95); x-axis: week number (0-60); legend: significant result, VAE-MLP1, VAE-MLP2.]
Figure 8: Comparison between the combinations VAE-MLP1 and VAE-MLP2.


Table 2: Results of MLP1 and MLP2 hyperparameter optimization using the GPyOpt library.

   Name                 Possibilities                              Selected: MLP1     Selected: MLP2
   Learning rate        0.0001-0.01                                0.0002             0.00763
   Batch norm.          yes/no                                     yes                yes
   Dropout              0-0.7                                      0                  0.22
   Gaussian noise       0-1.0                                      0                  0.795
   Layers               1-1-1-2 up to 400-400-400-400-2            69-32-30-11-2      354-322-316-305-2
   Activation           ELU, SELU, softplus, softsign, ReLU,       tanh               ReLU
                        tanh, sigmoid, leaky ReLU, PReLU
   Minibatch size       10-1000                                    410                730
   L1 regularization    0-0.1                                      0.00016            0.01
   L2 regularization    0-0.1                                      0.09229            0.0998
   Data scaling         standard, robust, min-max                  min-max            standard
References
 [1] A. Géron. Hands-on machine learning with Scikit-
     Learn and TensorFlow: concepts, tools, and tech-
     niques to build intelligent systems. O’Reilly, 2 edition,
     2019.
 [2] M. G. Bellemare, I. Danihelka, W. Dabney, S. Mo-
     hamed, B. Lakshminarayanan, S. Hoyer, and
     R. Munos. The cramer distance as a solution to bi-
     ased wasserstein gradients. CoRR, abs/1705.10743,
     2017.
 [3] G. E. P. Box and D. R. Cox. An analysis of trans-
     formations. Journal of the Royal Statistical Society:
     Series B (Methodological), 26(2):211–243, 1964.
 [4] P. M. Esfahani and D. Kuhn. Data-driven distri-
     butionally robust optimization using the Wasserstein
     metric: Performance guarantees and tractable refor-
     mulations. Mathematical Programming, Series A,
     171:115–166, 2018.
 [5] D. Foster. Generative deep learning: teaching ma-
     chines to paint, write, compose, and play. O’Reilly, 1
     edition, 2019.
 [6] M. A. Kramer. Nonlinear principal component anal-
     ysis using autoassociative neural networks. AIChE
     Journal, 37(2):233–243, 1991.
 [7] A. Narayanan, L. Yang, L. Chen, and L. Jinliang.
     Adaptive and scalable Android malware detection
     through online learning. In 2016 International Joint
     Conference on Neural Networks (IJCNN), pages
     2484–2491, July 2016.
 [8] B. Rashidi, C. Fung, and E. Bertino. Android Re-
     source Usage Risk Assessment using Hidden Markov
     Model and Online Learning. Computers & Security,
     65, November 2016.
 [9] B. Rashidi, C. Fung, and E. Bertino. Android mali-
     cious application detection using support vector ma-
     chine and active learning. In 2017 13th Interna-
     tional Conference on Network and Service Manage-
     ment (CNSM), pages 1–9, November 2017.
[10] J. Tumpach, M. Krčál, and M. Holeňa. Deep net-
     works in online malware detection. In ITAT, pages
     90–98, 2019.
[11] C. Villani.     Optimal Transport, Old and New.
     Springer, 2006.
[12] I. K. Yeo and R. A. Johnson. A new family of power
     transformations to improve normality or symmetry.
     Biometrika, 87:954–959, 2000.