Online Malware Detection with Variational Autoencoders

Jiří Tumpach1,2, Martin Holeňa2

1 Faculty of Mathematics and Physics, Charles University, Malostranské nám. 2, Prague, Czech Republic
2 Institute of Computer Science, Czech Academy of Sciences, Pod vodárenskou věží 2, Prague, Czech Republic

Abstract: This paper studies the application of variational autoencoders (VAEs) to online learning from malware detection data. To this end, it employs a large real-world dataset of anonymized high-dimensional data collected during 375 consecutive weeks. Several VAEs were trained on selected subsets of this time series and subsequently tested on different subsets. For the assessment of their performance, the accuracy metric is complemented with the Wasserstein distance. In addition, the influence of different kinds of data normalization on the VAE performance has been investigated. Finally, the combinations of a VAE with two multi-layer perceptrons (MLPs) have been investigated, which has led to the surprising result that the impact of such a combination on malware detection is positive for a simple and superficially optimized MLP, but negative for a complex and well-optimized one.

Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Neural networks are popular due to their performance, versatility, scalability and feature learning. Due to feature learning, a neural network, especially a deep one, generally does not need extensive feature engineering. On the other hand, it requires a large amount of training data. There exist application areas where the feature distribution and the optimal classification can change in time. Classification in such situations is often called online classification. One such area is malware detection. The reason for the changing distribution in malware detection is data drift, which comes from several sources. Firstly, benign programs use different frameworks or statically linked libraries based on their changing popularity. Secondly, malicious programs attempt to hide by copying clean programs' code and behaviour, or by using evasion techniques like the manipulation of code during runtime, and those techniques also change in time, as a reaction to updates of the databases of malware detection software. For this reason, neural networks hardly ever have enough training data to be promptly prepared for new threats. Moreover, even if they had enough data, their training would probably be too slow.

This paper reports a work in progress investigating one of the possibilities to adapt neural networks for malware detection. We prepared a hybrid model that consists of multiple pairs of a generator and a classifier. Our idea is to make the generators learn the history and later use them for data augmentation, because many malware evasion techniques are shared. As the generators, we use variational autoencoders (VAEs), and classification is performed using multi-layer neural networks. Firstly, we present results of our hybrid classification-autoencoder model, which employs a kind of transient feature learning. Secondly, we investigate how an autoencoder behaves when it is trained with high amounts of outliers.

In Section 2, we survey techniques for online malware detection. Section 3 is devoted to the principles of the methods employed in our investigation. Section 4 presents the results of the performed experiments.

2 Online Malware Detection

Malware is continuously evolving by exploiting new vulnerabilities and examining evasion techniques [7]. This causes significant changes in malware behaviour and consequently complicates the application of classification techniques for malware detection. These techniques must be precise enough not to bother users with false positives, as well as highly adaptive in order to detect new threats as soon as possible.

Malware detection techniques are either static or dynamic [8]. While static methods focus on an analysis of the program code without actually running a particular program, dynamic methods analyse the behaviour of programs. Their main benefit is a potentially finite number of inspected behaviours. Commonly used features for dynamic analysis are resources, privileges, calls (system or APIs) and the results of tracking sensitive data (e-mails, etc.) inside an application [7].

One of the successful examples of the dynamic approach to malware detection is DroidOL [7]. It analyses inter-procedural control-flow sub-graphs with an adaptation of the Weisfeiler-Lehman kernel (WL). WL was successfully used to classify graphs with remarkable speed thanks to its linear time complexity. The authors of DroidOL found out that this representation is robust against hiding attempts of malicious software. After WL preprocessing, DroidOL applies a passive-aggressive classifier, which is a kind of online learning method suited for big data. On real Android applications, DroidOL outperforms state-of-the-art malware detectors.

Online support vector machines with an RBF kernel were investigated in [9]; they use features based on application behaviour, too.

A different approach to malicious software was investigated in [8]. The authors understand that allowing every application access to every kind of information (location, contacts, pictures, . . . ) is not ideal. Alternatively, constantly bothering users with privilege requests could make them indifferent. Their system XDroid [8] tackles this problem by an online hidden Markov model (HMM) learning the users' priorities.

This paper relies on the content of [10], where the main idea was first introduced; now we try to identify the problems we had and to come up with a solution that could be useful in similar situations. We are especially interested in the difficulty of VAE training, since a VAE is a model which is usually applied to problems involving pictures. Pictures have a really nice property: all features have limited, similarly important values. It could also be the reason why a VAE usually smoothes generated samples [5]: it exploits spatial dependence, but the model does not have enough power to learn proper borders between objects. Malware detection requires a similar approach, since some features are independent but others are highly dependent. Nevertheless, the features are not similarly distributed.

3 Methodology

3.1 Autoencoders

Autoencoders are artificial neural networks capable of nonlinear space reduction. They are trained almost like classical neural networks for regression, but their objective is the reconstruction of their input. The reconstruction function is any loss function that could be used for regression problems. A usual choice is the mean squared error (MSE) or the mean absolute error (MAE). Their main benefit comes from the introduction of a bottleneck. The bottleneck can be a narrow part of the network or some heavy regularisation. Such a bottleneck is required to have a restricted flow of information. This restriction causes an autoencoder to preserve only the most important patterns in the data and to drop the kind of information that can be easily deduced or is noisy. An interesting fact is that neural networks with the identity activation function and the MSE loss converge to the same mapping as the principal component analysis transformation [6]. There exist several kinds of autoencoders. Most commonly, they are used as a dimensionality reduction method, or for denoising, where the input is formed by a corrupted image while a crisp image is expected as the output [1, 5].

3.2 Variational Autoencoder (VAE)

Sometimes new data needs to be produced based on a small number of real samples. For these types of problems, autoencoders offer a solution. Bottlenecks significantly reduce the chances of overtraining while maintaining appropriate power due to nonlinear space reduction.

One problem remains: if we would like to generate a new sample, we need to know the distribution of the codings. For this reason, a VAE adds two updates to the regular autoencoder that make the codings obey some distribution (usually normal). The first update is to add another term to the model's loss. The Kullback-Leibler (KL) divergence is a measure that relates two distributions µ and ν and is defined by

    D_KL(µ ‖ ν) = H(µ, ν) − H(µ)
                = −∫_{R^n} µ(x) ln ν(x) dx + ∫_{R^n} µ(x) ln µ(x) dx
                = ∫_{R^n} µ(x) ln(µ(x)/ν(x)) dx,

where H(µ, ν) is the cross-entropy and H(µ) the entropy. If µ and ν are equal, the divergence is 0; otherwise, it is positive.

The model can link small changes in the codings with large changes in the result, giving an opportunity for overtraining. So the second update is to add uncertainty into the codings. In practice, the coding layer of a VAE is split into two branches, one for means and one for variances. The second part of the network, the decoder, then receives samples based on these parameters of the generating distributions. In fact, it is common to use the logarithm of the variance, since it has a more suitable range. Then a loss based on the KL divergence can be reduced to

    L_VAE = L_Reconstruction − (1/(2N)) ∑_{i=1}^{N} ∑_{l=1}^{G} (1 + v_il − m_il^2 − e^{v_il}),

where N is the batch size, G is the coding dimension, v_il is the coding log-variance and m_il is the coding mean [6]. A scheme of a small VAE is depicted in Figure 1.
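To make the loss above concrete, here is a minimal NumPy sketch of the KL term and of the reparametrized sampling of codings. This is our own illustration, not the implementation used in the paper, and all function names are ours.

```python
import numpy as np

def kl_term(m, v):
    """KL part of the VAE loss: -(1/(2N)) * sum(1 + v - m^2 - e^v),
    where m and v are (N, G) arrays of coding means and log-variances."""
    N = m.shape[0]
    return -0.5 / N * np.sum(1.0 + v - m**2 - np.exp(v))

def sample_codings(m, v, rng):
    """Reparametrization trick: z = m + sigma * eps with sigma = exp(v/2)
    and eps drawn from a standard normal distribution."""
    return m + np.exp(0.5 * v) * rng.standard_normal(m.shape)

def vae_loss(reconstruction_loss, m, v):
    """Total VAE loss from Section 3.2: reconstruction term plus KL term."""
    return reconstruction_loss + kl_term(m, v)
```

With m = 0 and v = 0 the codings already follow N(0, 1), the KL term vanishes, and the loss reduces to the reconstruction loss.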
3.3 Wasserstein distance

Let ‖·‖ be an arbitrary norm on R^n, n ∈ N, and let µ, ν be probability measures on R^n with finite first moments. For each n-dimensional random vector X, denote by D(X) the distribution of X. Then the Wasserstein distance of order 1, or simply the Wasserstein distance (aka Wasserstein measure, Kantorovich-Rubinstein distance, earth mover's distance) between µ and ν corresponding to the norm ‖·‖ is defined [4, 11]:

    W_1(µ, ν) = W(µ, ν) = inf { E‖X − Y‖ | D(X) = µ ∧ D(Y) = ν }.    (1)

We decided to use the Wasserstein distance as a method for model comparison. Unlike the Kullback-Leibler divergence, it is a metric: it is symmetric and fulfills the triangle inequality. Moreover, it does not require the considered measures to share a probability space. The Wasserstein distance is also scale sensitive, W(cµ, cν) ≤ c^β W(µ, ν), and sum invariant, W(µ + A, ν + A) ≤ W(µ, ν), for any β, c > 0 and A ∈ R^n [2]. Therefore, it provides us some intuition about our results.

3.4 Considered kinds of normalization

All included attributes of the data were normalized with 14 different transforming functions, denoted T1–T14. The definitions of nearly all of them are quite simple; therefore, they are merely listed in Table 1. The only exception is T5, which performs a power transformation proposed by Yeo and Johnson [12]. It is based on the popular Box-Cox transform [3].

We use the following notation: a ∈ R denotes an arbitrary value of a transformed attribute A, whereas A denotes the set of values of A in the training data a_1, ..., a_p; ‖A‖ denotes the set of absolute values ‖a_1‖, ..., ‖a_p‖; A_q, q ∈ (0, 1), denotes the quantile q of a sample of A in the training data; φ is the cumulative distribution function of the standard normal distribution N(0, 1); and e is the Euler constant, on which natural logarithms are based. Because T6, T7, T13, T14 are based on quantile prediction, the final transformations are formed by substituting linear interpolations for those quantiles. For performance reasons, the quantiles are predicted based on 100000 samples of A.

Table 1: Functions considered as normalizing transformations of data attributes.

    Function definition                                               Mnemonic name
    T1(a) = (a − min A) / (max A − min A)                             min-max
    T2(a) = a / max ‖A‖                                               max-abs
    T3(a) = (a − (1/p) ∑_{j=1}^{p} a_j)
            / sqrt((1/p) ∑_{i=1}^{p} (a_i − (1/p) ∑_{j=1}^{p} a_j)^2)  standardize
    T4(a) = (a − A_0.5) / (A_0.75 − A_0.25)                           robust
    T5(a) = ψ(a, λ̂), with ψ and λ̂ defined in (2)–(3)                 power
    T6(a) = q such that A_q = a                                       perc-uniform
    T7(a) = φ^{−1}(T6(a))                                             perc-normal
    T8(a) = T1(log(a + e − min A))                                    log-min-max
    T9(a) = T2(log(a + e − min A))                                    log-max-abs
    T10(a) = T3(log(a + e − min A))                                   log-standardize
    T11(a) = T4(log(a + e − min A))                                   log-robust
    T12(a) = T5(log(a + e − min A))                                   log-power
    T13(a) = T6(log(a + e − min A))                                   log-perc-uniform
    T14(a) = T7(log(a + e − min A))                                   log-perc-normal

The function T5 is defined in four steps:

I. A parametrized version ψ of the intended transformation is defined, which depends on a parameter λ ∈ R:

    ψ(a, λ) = ((a + 1)^λ − 1) / λ                  if a ∈ R_{≥0}, λ ≠ 0,
              ln(a + 1)                             if a ∈ R_{≥0}, λ = 0,
              −((−a + 1)^{2−λ} − 1) / (2 − λ)       if a ∈ R_{<0}, λ ≠ 2,
              −ln(−a + 1)                           if a ∈ R_{<0}, λ = 2.    (2)

II. It is assumed that for some range of values of the parameter λ, the value of ψ(a, λ) is a random variable with the distribution N(µ, σ^2) for some µ ∈ R, σ ∈ R_{>0}. Consequently, the log-likelihood of the parameters (λ, µ, σ^2) with respect to the training data is

    ℓ(λ, µ, σ^2; a_1, ..., a_p) = −(p/2) log(2π) − (p/2) log σ^2
        − (1/(2σ^2)) ∑_{i=1}^{p} (ψ(a_i, λ) − µ)^2
        + (λ − 1) ∑_{i=1}^{p} sign(a_i) log(‖a_i‖ + 1).

III. The parameters µ and σ^2 are estimated for each λ ∈ R, respectively, with the following estimates µ̂(λ) and σ̂^2(λ):

    µ̂(λ) = (1/p) ∑_{i=1}^{p} ψ(a_i, λ),
    σ̂^2(λ) = (1/p) ∑_{i=1}^{p} (ψ(a_i, λ) − µ̂(λ))^2.

IV. For the parameter λ, the maximum likelihood estimate is used, after replacing the other two parameters by their estimates:

    λ̂ = argmax_{λ ∈ R} ℓ(λ, µ̂(λ), σ̂^2(λ); a_1, ..., a_p)
      = argmax_{λ ∈ R} [ (λ − 1) ∑_{i=1}^{p} sign(a_i) log(‖a_i‖ + 1) − (p/2) log σ̂^2(λ) ].    (3)

We cannot measure the Wasserstein distance in the transformed space because there is no clear way of comparing the results between transformations. This is not a problem for most of the transformations we use, but perc-uniform and perc-normal are slightly problematic. We have to use a linear interpolation on a sequence of quantiles in order to approximate the initial dataset. For this reason, we expect some increase in the Wasserstein distance for the perc variants, since it is nonzero even if we compare the original dataset to its forward and then backward transformed version.

Figure 1: Variational autoencoder. Gray nodes are operations (including a Gaussian noise generator); the µ_i and σ_i nodes have a linear activation function.

4 Experimental Evaluation

4.1 Available Malware Data

We use real-world anonymized data, which feature malware and clean software in several categories, but we consider only two by merging some of them. Anti-malware software producers must protect their know-how by providing only data which were anonymized. For us, the anonymization process is unknown. It essentially hides the meaning of the features in the provided datasets by stripping it from the documentation and applying unknown functions to those columns. The feature space is very complex; there are 540 features with various distributions. This makes it particularly difficult to choose the correct data scaling. In Figure 2, several kinds of features are differentiated:

• Binary feature
• Gaussian feature: both absolute skewness and excess kurtosis are less than 2
• Highly skewed feature: skewness > 30
• Almost constant feature: more than 99.9 % of values are identical
• Other unknown distributions

Figure 2: Distribution in the feature space (binary 21 %, Gaussian 30 %, highly skewed 20 %, almost constant 19 %, other 10 %).

4.2 Experiments with Different Normalizations

Our dataset has many outliers. We were interested in how outliers affect the generator. Therefore, we compared two common loss functions used for regression, both combined with the kinds of normalization considered in Subsection 3.4. The first was the mean squared error (MSE) and the second was the mean absolute error (MAE). Both were tested on a rather large network with 7 layers [541, 306, 173, 98, 56, 32, 18, 10], with ELU as the activation function and batch normalisation on every layer except the coding and output ones. The results can be seen in Figure 4. It looks like the best normalisation for our data is the power transform, but the differences between different kinds of normalizations are, in general, small. It transforms the data to be more normal. Both perc-uniform and perc-normal performed excellently. It seems that the small distortions created by linear interpolations of quantiles do not deteriorate the result.

It is better to use raw data than the robust, min-max and max-abs kinds of normalisation, especially if MSE is employed. We observe that some of the features consist of a small-valued majority with some huge values. The min-max and max-abs transformations could make this range very small, demanding high precision of the regression for the smaller values. On the other hand, the robust transformation could make the range larger because the quantiles could be evaluated only on the small values, artificially increasing the existing variance.

It does not seem that the log variants have any positive effect, but we saw a slight increase for the min-max and max-abs transformations. This is expected because the logarithm reduces the ranges of values. The robust transformation has a substantial incompatibility with the logarithm transformation because the log-robust transformation operates in logarithmic space.
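As an illustration of the transformations being compared, the following NumPy sketch (our own, with hypothetical function names) implements T1 (min-max), T4 (robust) and the Yeo-Johnson ψ of equation (2), which underlies the power transform T5. For the full T5, scikit-learn's PowerTransformer with method='yeo-johnson' fits the maximum-likelihood λ̂ of equation (3).

```python
import numpy as np

def t1_min_max(a, A):
    # T1: (a - min A) / (max A - min A)
    return (a - A.min()) / (A.max() - A.min())

def t4_robust(a, A):
    # T4: (a - median) / (interquartile range)
    q25, q50, q75 = np.percentile(A, [25, 50, 75])
    return (a - q50) / (q75 - q25)

def psi(a, lam):
    # Yeo-Johnson psi(a, lambda) from equation (2), applied elementwise
    a = np.asarray(a, dtype=float)
    out = np.empty_like(a)
    pos, neg = a >= 0, a < 0
    if lam != 0:
        out[pos] = ((a[pos] + 1.0) ** lam - 1.0) / lam
    else:
        out[pos] = np.log(a[pos] + 1.0)
    if lam != 2:
        out[neg] = -(((-a[neg] + 1.0) ** (2.0 - lam)) - 1.0) / (2.0 - lam)
    else:
        out[neg] = -np.log(-a[neg] + 1.0)
    return out
```

Note that ψ(a, 1) = a, so λ = 1 leaves the data unchanged; values of λ below or above 1 shrink or stretch the tails, which is what makes the transform useful for skewed features.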
After exponen- ues are identical tiation, the errors could be much larger. However, this explanation fails if we consider that log-perc transfor- • Other unknown distributions mations performed almost equally. MAE is generally better. MSE is equal to MAE thought the MLP1 is significantly worse than MLP2 in perc-uniform, perc-normal, log-perc-uniform, log- in 94% of weeks, when VAE was added, it became perc-normal, power, log-power. significantly better on 18% of weeks while MLP2 on none them. week 1 week 2 week 3 In Figure 8, unusual behaviour of the compared methods can be seen. The VAE with the MLP1 is significantly better than the VAE with the MLP2 for the weeks 1-71. This is interesting because a sepa- VAE VAE VAE rate MLP1 trained on a small amount of samples is generally worse than MLP2 . Several possibilities explain such behaviour. Firstly, the hyperparameters of MLP2 can be sim- ply overtrained on training samples. Instead of learning about patterns, the model may learn small MLP MLP MLP mistakes made by VAE and try to infer based on them. Secondly, MLP1 may lack some form of Figure 3: Training data paths for VAEs and MLPs for regularization. each week. Red indicates generated data, blue adds label classifications to features. We also tried to investigate how an increase in cod- 5 Conclusion ing size affect the resulting performance. It seems there is no difference, 10 as a coding dimension (pre- This paper investigated the application of variational dicted normal distributions of 2 parameters) is suffi- autoencoders to online learning from malware detec- cient no matter which normalisation was chosen. The tion data. To this end, it employed a large real- results can be seen in Figure 5. world dataset of anonymized high-dimensional data. Figure 6 shows expected behaviour. The VAE has We prepared a hybrid model that consists of multiple trouble expressing features with higher variance. It is pairs of a generator and a classifier. 
The basic idea evident, especially on the highest tertile, where differ- of our approach is to make the generators learn the ences between distributions are almost non-existent. history and later use it for data augmentation. In ad- A next view on the data reveals something interest- dition, the influence of different kinds of data normal- ing – there is a great difference between the data with ization on the VAE perfomance has been investigated. different signs of skew of a particular feature. Nega- For the assessment of VAE performance, accuracy has tive skew was much harder to imitate while positive been complemented with the Wasserstein distance. skew was easier to reproduce than in the other two The experiments have proven our expectation that middle quartiles. This could be a property of dataset for highly heterogeneous data, normalization is rele- features or under-training. It seems that max-abs and vant. However, the most basic transformations like min-max normalizations have a slightly different re- min-max and max-abs were worse than using the raw sponse. value without any transformation. The experiments have also shown that variational autoencoders can be 4.3 Experiments with Different MLPs used for data augmentation in MLP-based classifica- tion. However, one should be aware of the hyperpa- We were interested in the ability of generative models rameters determining the size of the classifying MLP. to capture important information (e.g. vanishing or If it is too large, generated noise produced by VAE strange behaviour) and ignore the noise (waiting for could be detrimental instead. user input). As the next experiment, we prepared a balanced dataset extended by a feature reporting true benign and malware labels. Firstly, we let the genera- tive model learn on the first part of data without true Acknowledgement labels. 
Then, every other part of the data was aug- mented by classified features generated by the previ- The research reported in this paper has been sup- ous generator and classified by the previous classifier. ported by the Czech Science Foundation (GAČR) This process is depicted in Figure 3. grant 18-18080S. Computational resources were sup- We prepared two results of Bayesian optimisation plied by the project ”e-Infrastruktura CZ” (e-INFRA of MLPs hyperparameters. The first, MLP1 is the re- LM2018140) provided within the program Projects of sult of mid-optimisation and MLP2 is the final result. Large Research, Development and Innovations Infras- The hyperparameters are in Tables ?? and ??. Even tructures. mse loss 99 95 75 mae loss 50 generated distribution and heldout subsample 25 Comparison between loss functions on performance Median of Wasserstein distance between 5 1 1 1 4 1 16 1 64 1 256 1 1024 lo lo lo lo lo lo lo m m po pe pe ra ro st g- g-m g- g- g- g- g- ax in we rc rc w bu an m po pe pe ro st -a -m r -n -u st d. ax in w r c- rc- b us and bs ax or ni -a -m er n u m fo bs ax or ni t . al rm m fo al rm Figure 4: The graph shows reactions of the VAE on the loss function given a dataset normalization. 10 codings 99 95 75 25 codings 50 50 codings generated distribution and heldout subsample 25 5 Comparison of coding size on performace Median of Wasserstein distance between 1 1 1 4 1 16 1 64 1 256 1 1024 lo lo lo lo lo lo lo m m po pe pe ra ro st g- g- g- g- g- g- g- ax in we rc rc w bus an m m po pe pe ro st -a -m r -no -u t d . ax in w rc r c bu an bs ax rm ni -a -m er -n -u st d. fo bs ax or ni al rm m fo al rm Figure 5: The graph shows reactions of the VAE on the increase in coding size given a dataset normalization. 
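The evaluation metric reported in Figures 4–7, the median over features of the Wasserstein distance between the generated distribution and a held-out subsample, can be sketched in pure Python: for two equally sized one-dimensional samples, the distance of equation (1) reduces to pairing order statistics. The function names and data layout below are our own illustration, not the authors' implementation.

```python
import statistics

def w1_empirical(x, y):
    """Empirical 1-D Wasserstein-1 distance between two equally sized
    samples: on the real line, the optimal transport plan simply pairs
    the sorted values of both samples."""
    xs, ys = sorted(x), sorted(y)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

def median_feature_w1(generated_cols, heldout_cols):
    """Median over features of the per-feature W1 distance; each argument
    is a list of per-feature value lists (hypothetical data layout)."""
    return statistics.median(
        w1_empirical(g, h) for g, h in zip(generated_cols, heldout_cols))
```

Using the median rather than the mean over the 540 features keeps a few badly reproduced features from dominating the comparison.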
Figure 6: The graph shows how the VAE model handles features with different variances (tertiles <0.0, 0.4756), <0.4756, 0.8788), <0.8788, 428.2)) given a dataset normalization; the vertical axis is the median of the Wasserstein distance between the generated distribution and a held-out subsample.

Figure 7: The graph shows how the VAE model handles features with different skew (quartiles <−171.5, 3.768), <3.768, 19.51), <19.51, 72.51), <72.51, 767.1)) given a dataset normalization.

Figure 8: Comparison between the combinations VAE-MLP1 and VAE-MLP2 (median of accuracy over the week number; significant results are marked).

Table 2: Results of MLP1 and MLP2 hyperparameter optimization using the GPyOpt library.

    Name               Possibilities                                    MLP1            MLP2
    Learning rate      0.0001-0.01                                      0.0002          0.00763
    Batch norm.        yes/no                                           yes             yes
    Dropout            0-0.7                                            0               0.22
    Gaussian noise     0-1.0                                            0               0.795
    Layers             1-1-1-2 up to 400-400-400-400-2                  69-32-30-11-2   354-322-316-305-2
    Activation         ELU, SELU, softplus, softsign, ReLU,             ReLU            tanh
                       tanh, sigmoid, leaky ReLU, PReLU
    Minibatch size     10-1000                                          410             730
    L1 regularization  0-0.1                                            0.00016         0.01
    L2 regularization  0-0.1                                            0.09229         0.0998
    Data scaling       standard, robust, min-max                        min-max         standard

References

[1] A. Géron. Hands-on Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly, 2nd edition, 2019.
[2] M. G. Bellemare, I. Danihelka, W. Dabney, S. Mohamed, B. Lakshminarayanan, S. Hoyer, and R. Munos. The Cramér distance as a solution to biased Wasserstein gradients. CoRR, abs/1705.10743, 2017.
[3] G. E. P. Box and D. R. Cox. An analysis of transformations. Journal of the Royal Statistical Society: Series B (Methodological), 26(2):211–243, 1964.
[4] P. M. Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, Series A, 171:115–166, 2018.
[5] D. Foster. Generative Deep Learning: Teaching Machines to Paint, Write, Compose, and Play. O'Reilly, 1st edition, 2019.
[6] M. A. Kramer. Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37(2):233–243, 1991.
[7] A. Narayanan, L. Yang, L. Chen, and L. Jinliang. Adaptive and scalable Android malware detection through online learning. In 2016 International Joint Conference on Neural Networks (IJCNN), pages 2484–2491, July 2016.
[8] B. Rashidi, C. Fung, and E. Bertino. Android resource usage risk assessment using hidden Markov model and online learning. Computers & Security, 65, November 2016.
[9] B. Rashidi, C. Fung, and E. Bertino. Android malicious application detection using support vector machine and active learning. In 2017 13th International Conference on Network and Service Management (CNSM), pages 1–9, November 2017.
[10] J. Tumpach, M. Krčál, and M. Holeňa. Deep networks in online malware detection. In ITAT, pages 90–98, 2019.
[11] C. Villani. Optimal Transport, Old and New. Springer, 2006.
[12] I. K. Yeo and R. A. Johnson. A new family of power transformations to improve normality or symmetry. Biometrika, 87:954–959, 2000.