Deep Networks in Online Malware Detection

Jiří Tumpach¹, Marek Krčál², Martin Holeňa³

¹ Faculty of Mathematics and Physics, Charles University, Malostranské nám. 2, Prague, Czech Republic
² Rossum Czech Republic, Dobratická 523, Prague
³ Institute of Computer Science, Czech Academy of Sciences, Pod vodárenskou věží 2, Prague, Czech Republic

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract: Deep learning is usually applied to static datasets. If it is used for classification based on data streams, it is not easy to take non-stationarity into account. This paper presents work in progress on a new method for online deep classification learning in data streams with slow or moderate drift, highly relevant for the application domain of malware detection. The method uses a combination of a multilayer perceptron and a variational autoencoder to achieve constant memory consumption by encoding past data into a generative model. This can make online learning of neural networks more accessible for independent adaptive systems with limited memory. First results for real-world malware stream data are presented.

1 Introduction

Deep network architectures have many benefits. The most obvious one is that they do not require comprehensive data preparation: a large enough network is likely to find relevant features automatically. It is therefore easier to pass the data to training than to guess the right combination in the triple problem-transformation-classifier.

However, a deep network needs a lot of training data to perform in this way. Fortunately, many areas constantly generate large amounts of data.

Too much data may also be a problem, because parallel training of deep neural networks can be expensive. Some training examples may be unnecessary, merely repeating relevant information with some random noise added. In that case, they function as a weight for the relevant information.

Consider a situation where no change of the target function is expected during its use (offline training). In this case, one can save similarity-filtered latent features of the trained network, for example the outputs of some middle layer. One application is transfer learning, where some trade-off between network performance and speed of training is already expected.

Online problems are specific because some drift of information is expected in them, so training on all available data can be harmful. One easy solution is to train a model only on the most recent subset of training examples. This reduces the need for parallel training; however, discarding a large proportion of data can cause quick overtraining, especially in case of slow drift.

This paper investigates faster retraining of neural networks on data with slow drift. Such research is highly relevant for the application domain of malware detection, because most malware is evolving, which entails a drift in the data. The main idea is to have a generator-discriminator pair for each time interval. The current generator is trained on the last subset of training data (a moving window) together with samples generated by the previous generator. Its job is to estimate the distribution of past data points and to use that distribution for generating new examples. A discriminator also uses labels generated by the previous discriminator if labels are not provided explicitly. The generative model stores some information about the importance of different training cases (weights) and acts as an implicit decay. For the generative model, we currently use variational autoencoders (VAEs) and intend to include deep belief networks (DBNs) soon. However, the idea can be generalized to any suitable classifier and generative model.

In Section 2, we present the state of the art in online malware detection. The methods used are described in Section 3. In Section 4, strategies for training and evaluation are proposed. In Section 5, our data and experiments on a real-world malware dataset are presented.

2 Online Malware Detection

Malware is continuously evolving by exploiting new vulnerabilities and examining evasion techniques [8]. Moreover, detection has to deal with significant data drift. It can make use of a signature database of previously detected malware: when a file is scanned, it is first compared with the items in the database. Hence, only modified and new malware needs to be detected, which gives high priority to generalization. Therefore, online detection methods capable of keeping up with and adapting to such evolution are desirable.

Figure 1: The neuron j. Its inputs ($i_i$) are multiplied by the corresponding weights ($w_{(i,j)}$) and then summed together with a specific bias $b_j$. The resulting value is called activation. This value is mapped by the activation function $f(x)$ to the output $o_j$ of the neuron j.

Figure 2: Multilayer perceptron with two layers (input layer, hidden layer, output layer).

Figure 3, panels: (a) sigmoid, $\mathrm{sigm}(x) = \frac{1}{1+e^{-x}}$; (b) relu, $\mathrm{relu}(x) = \max(0, x)$.

Malware detection techniques can be divided into static and dynamic methods [12]. Static methods focus on an analysis of program code, while dynamic methods infer from program behaviour. They can log used resources and privileges, system or API calls, or track sensitive data inside an application [8]. In connection with online learning, DroidOL [8] uses the analysis of inter-procedural control-flow graphs to achieve robustness against hiding attempts. It is trained with a fast online linear algorithm adapted to growing dimensionality.
On real Android applications, DroidOL outperforms state-of-the-art malware detectors with 84.29% accuracy.

Another dynamic online method [11] reports using online-learned support vector machines with an RBF kernel to detect malware from application behaviour.

Users can differ in their willingness to give data such as location, contacts, or files to the author of a specific application. Antimalware programs then need to profile each user so as not to restrict or bother them unduly. XDroid [12] tackles this problem by online learning of a hidden Markov model (HMM).

3 Methodological Background

3.1 Multilayer Perceptron (MLP)

A multilayer perceptron is composed of neurons (Figure 1) arranged into layers (Figure 2) [3]. The first layer is called the input layer, and its function is to receive the values of the inputs. The last layer is called the output layer and it has a similar structure as the remaining layers, called hidden layers. Their neurons are connected to the output of each neuron in the previous layer. Figure 2 depicts a two-layer MLP. It is a non-linear regression or discrimination model because its neurons use non-linear activation functions (Figure 3).

Figure 3: Important examples of activation functions.

An MLP is learned by minimizing some loss function, usually by some kind of smooth optimization. The simplest kind of smooth optimization that is still in use is gradient descent, in the area of neural networks also known as backpropagation, due to the flow of the gradient computation. In high-dimensional spaces, its stochastic variant, stochastic gradient descent (SGD), is commonly used. Second-order methods such as the Gauss-Newton method are usually inefficient [2, 17]. More successful methods instead attempt to approximate second-order behaviour. One strategy is to use a different learning rate (step size) for each learned variable (AdaGrad, RMSProp, Adam) or to add momentum (SGD with Nesterov momentum) [3]. Alternatively, some methods use the gradient history to estimate second-moment statistics that play a role similar to curvature information (Adam) [5].

One of the most popular loss functions used in regression problems is the mean squared error (MSE) loss [3],

$$L_{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2,$$

where $\hat{y}_i$ is the output of the MLP for a sample $x_i$ from the feature space with the corresponding correct value $y_i$, and $N$ is the number of samples in one training cycle. In classification, to be able to learn probabilities of labels, one can employ the cross-entropy loss. For classification into $G$ classes, it is defined as

$$-\frac{1}{N} \sum_{i=1}^{N} \sum_{l=1}^{G} y_{il} \log(\hat{y}_{il}),$$

where $y_{il}$ is 1 if the sample $x_i$ has the label $l$ and 0 otherwise, and the predicted probability $\hat{y}_{il}$ of the label $l$ is given by the softmax activation function

$$\hat{y}_{il} = \frac{\exp(a_{il})}{\sum_{s=1}^{G} \exp(a_{is})},$$

with $a_{il}$ denoting the activation of the $l$-th output neuron for the sample $x_i$.

3.2 Autoencoders (AEs)

Autoencoders are neural networks capable of learning data representations called codings, usually with a much lower dimension than the dimension of the input data [3]. They learn to copy their input to their output and consist of two parts, an encoder and a decoder, cf. the example in Figure 4. By restricting the flow of information, one can achieve interesting properties, for example denoising, detecting anomalies, or generating unseen samples with a distribution similar to that of the training data.

Figure 4: Autoencoder: the output of the encoder is the input to the decoder.

3.3 Variational Autoencoders (VAEs)

Codings in basic autoencoders can have nonstandard distributions [3]. This property makes it difficult to generate samples similar to the training dataset. VAEs solve this problem by employing the Kullback-Leibler (KL) divergence. The KL divergence between two distributions $p$ and $q$ is defined as

$$D_{KL}(p \| q) = H(p, q) - H(p) = -\int_{-\infty}^{\infty} p(x) \ln q(x)\,dx + \int_{-\infty}^{\infty} p(x) \ln p(x)\,dx = \int_{-\infty}^{\infty} p(x) \ln \frac{p(x)}{q(x)}\,dx,$$

where $H(p, q)$ is the cross-entropy and $H(p)$ is the entropy. The KL divergence is a measure of the difference between two distributions. If $p(x)$ and $q(x)$ are the same, the divergence equals 0, otherwise it is positive.

Because codings in AEs are deterministic, it is not possible to define a KL divergence for them. The important idea in [6] is to map the codings to normal distributions, using a suitable neural network. The $i$-th coding now corresponds to one pair of output neurons of the network, and their activities represent a normal distribution for the $i$-th coding: the first neuron defines the mean ($\mu_i$) and the second one the standard deviation ($\sigma_i$) of that normal distribution. The normal distributions for different codings are mutually independent.

Figure 5: Variational autoencoder (encoder, Gaussian noise generator, decoder). Gray nodes are operations; the $\mu$ and $\sigma$ nodes have a linear activation function.

VAEs learn to minimize $L_{VAE} = D_{KL}(\mathcal{N}(\mu, \sigma) \| \mathcal{N}(0, 1)) + L_{MSE}$. Hence, they learn to copy their inputs to the outputs while maintaining an approximately normal distribution of the codings. It has been proven in [6] that this divergence can be computed as

$$L_{VAE} = L_{MSE} - \frac{1}{2N} \sum_{i=1}^{N} \sum_{l=1}^{G} \left(1 + \log(\sigma_{il}^2) - \mu_{il}^2 - \sigma_{il}^2\right),$$

where the inner sum runs over the codings. It has been proposed in [3] to speed up the convergence of training by predicting the logarithm of the variance ($v_{il} = \log(\sigma_{il}^2)$) instead of the standard deviation. Then $L_{VAE}$ becomes

$$L_{VAE} = L_{MSE} - \frac{1}{2N} \sum_{i=1}^{N} \sum_{l=1}^{G} \left(1 + v_{il} - \mu_{il}^2 - e^{v_{il}}\right).$$

The decoder input is now $\vec{\mu} + \vec{\varepsilon} \odot \exp(\vec{v}/2)$, i.e. $\vec{\mu} + \vec{\varepsilon} \odot \vec{\sigma}$, where $\vec{\varepsilon}$ is a vector of samples from the standard normal distribution. Backpropagation in VAEs is unchanged; the gradient is propagated through all operations, none of them is skipped.

If a VAE is properly learned, sampling becomes easy. We can expect a normal distribution of its codings if we sample from the real learned distribution. The encoding part is then redundant and can be skipped. The result is only a random sampler which gives inputs to the decoder.

3.4 Support Vector Machine (SVM)

A support vector machine is tested as an alternative to a multilayer perceptron for the starting classification of the available data, due to the frequent use of SVMs in malware detection [7, 9, 10, 16].

An SVM is constructed with the objective of best generalization, i.e., maximal probability that the classifier $\phi$ classifies correctly with respect to the random variables $X$ and $Y$ producing the inputs and outputs, respectively,

$$\max P(\phi(X) = Y). \tag{1}$$

For our high-dimensional feature space $\mathcal{X} \subset \mathbb{R}^n$, it is sufficient to consider only a linear SVM, which classifies according to some hyperplane $H_w = \{x \in \mathbb{R}^n \mid x^\top w + b = 0\}$ with $w \in \mathbb{R}^n$, $b \in \mathbb{R}$,

$$(\forall x \in \mathcal{X})\quad \phi(x) = \phi_w(x) = \begin{cases} 1 & \text{if } x^\top w + b \ge 0, \\ -1 & \text{if } x^\top w + b < 0. \end{cases} \tag{2}$$

It can be shown [1, 14] that under quite weak conditions, searching for maximal generalization (1) is equivalent to searching for the maximal margin between the representatives of both classes in the training data,

$$\max \frac{\rho}{\|w\|} \quad \text{with constraints } c_k x_k^\top w \ge \frac{\rho}{2},\ k = 1, \dots, p, \tag{3}$$

where $\rho$ is the scaled margin and $(x_k, c_k) \in \mathbb{R}^n \times \{-1, 1\}$ are the training samples, and that using the standard Lagrangian approach for inequality constraints, (3) can be transformed into the dual task

$$\max_{(\alpha, \rho)} \; -\frac{1}{4} \sum_{j,k=1}^{p} \alpha_j \alpha_k c_j c_k x_j^\top x_k + \frac{\rho}{2} \sum_{k=1}^{p} \alpha_k \quad \text{with constraints KKT, } \alpha_1, \dots, \alpha_p \ge 0,\ \rho > 0, \tag{4}$$

where $\alpha_1, \dots, \alpha_p$ are Lagrange multipliers. The objective function in (4) is quadratic, thus it has a single global maximum, which can be found in a straightforward way. The abbreviation KKT in (4) stands for the Karush-Kuhn-Tucker conditions

$$\alpha_k \left(\frac{\rho}{2} - c_k x_k^\top w\right) = 0, \quad k = 1, \dots, p. \tag{5}$$

Due to KKT, the classifier (2) expressed in terms of the solution $\alpha_1^*, \dots, \alpha_p^*, \rho^*$ of (4) turns to

$$(\forall x \in \mathcal{X})\quad \phi_w(x) = \begin{cases} 1 & \text{if } \sum_{x_k \in S} \alpha_k^* c_k x^\top x_k + \rho^* \ge 0, \\ -1 & \text{if } \sum_{x_k \in S} \alpha_k^* c_k x^\top x_k + \rho^* < 0, \end{cases} \tag{6}$$

where $S = \{x_k \mid \alpha_k^* > 0\}$. The vectors in $S$ lie in the support hyperplanes of the representatives of both classes in the training data. Therefore, they are called support vectors.

Because the number of input features is 540, and at least 40% of them are binary or almost constant, we decided to use a linear SVM. Moreover, when a polynomial kernel (p = 2) was used, the convergence was too slow.

3.5 Linear Regression

To estimate the trend of a time series of model accuracies, we need to perform a linear regression [13] for $C$ points in two dimensions $(x_1, y_1), (x_2, y_2), \dots, (x_C, y_C)$. More precisely, the trend of the time series is described by the slope of the line $\hat{y}_i = a x_i + b$, where

$$a = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}, \qquad b = \bar{y} - a\bar{x},$$

with $\bar{t} = \frac{1}{C} \sum_{i=1}^{C} t_i$.

4 Proposed Strategy for Online Learning with VAEs

We propose an online learning strategy which focuses on more effective learning and a constant memory requirement for features. The strategy uses two deep learning architectures: an MLP and a VAE. While the MLP is trained to replicate labels, the VAE is used as a feature generator. Hence, the VAE can generate new unseen samples representing the history for the MLP. The pseudocode is given in Algorithm 1 and a diagram of the training data paths is depicted in Figure 6.

In the first week of training, the VAE is trained on the current moving window, which acts as a memory limit. The same applies to the MLP, but it also uses label information. The following weeks are different. The VAE also uses data sampled from the previous week's VAE, which provides something like a moving average. The problem lies in choosing the right 1. time to update, 2. size of the generated data, 3. relative importance of the generated data. All MLPs are also trained on data generated by the VAEs; because the generated data lack label information, the previous week's MLP must be employed to add it.

Figure 6: Training data paths for VAEs and MLPs for each week (one VAE and one MLP per week, shown for weeks 1 to 3). Red indicates generated data, blue adds label classifications to features.

Algorithm 1 Proposed online learning algorithm

Require:
    number N    ▷ N is the number of inputs generated by the previous generator.
    number M    ▷ M is the size of the considered most recent training data.
    function data_for_iteration(number)    ▷ Gives access to the stored data for the iteration with the provided number.
    function labels_for_iteration(number)    ▷ Same as the previous function, but for labels.
Ensure: Provides discriminator updates for each client    ▷ The discriminator can predict labels for new data.

procedure CLIENT
    discriminator ← function(x){return default class}
    while workstation runs do
        if exists new version of discriminator then
            discriminator ← update_discriminator()
        end if
        if new undecided file x exists then
            input ← get_features(x)
            label ← discriminator(input)
            send_to_server(x)
            perform the task-specific operation on the file according to label
        end if
    end while
end procedure

procedure SERVER
    iteration ← 0
    while not last iteration do
        iteration ← iteration + 1
        data ← most_recent_data(M)
        labels ← most_recent_labels(M)
        if iteration > 1 then
            gen_data ← generator(N)
            gen_labels ← discriminator(gen_data)
            data ← [data; gen_data]
            labels ← [labels; gen_labels]
        end if
        generator ← learn_generator(data)
        discriminator ← learn_discriminator(data, labels)
        publish_discriminator(discriminator)
        while updating the discriminator is not needed do
            wait()
        end while
    end while
end procedure

5 Experiments with Malware Detection Data

In this section, we describe several experiments with real-world data from the area of malware detection.

5.1 Data

We use real-world anonymized data, which feature malware and clean software in several categories, but we consider only two categories by merging some of them. The semantics of the individual features has not been made available by the company. The feature space is very complex; there are 540 features with various distributions. This makes it particularly difficult to choose the correct data scaling. In Figure 7, several groups of features are differentiated:

• Binary feature
• Normally distributed feature: both absolute skewness and kurtosis are less than 2
• Highly skewed feature: skewness > 30
• Almost constant feature: more than 99.9% of values are identical
• Other unknown distributions

Figure 7: Distribution in the feature space (pie chart with the segments Binary, Gaussian, Highly skewed, Almost constant and Other; the segment shares are 10%, 19%, 20%, 21% and 30%).

The data are initially divided by week. We decided to keep this natural division even though some of the weeks are mostly empty. We have used 375 weeks in our experiments; the number of files and the proportion of malware files in them are depicted in Figure 8.

Figure 8: Number of analyzed files and the proportion of malware files in each week (horizontal axis: week number; vertical axes: number of files in week, proportion of malware files).

5.2 Performed Experiments and Their Results

To be able to decide whether a neural network is a good model for this task, we compare it with a linear SVM. The number of recent training examples is chosen as M = 150,000; it corresponds to about 5.5 average weeks and at least 309 MiB of RAM. In order to evaluate the full dataset, one must process 113 GiB of data, and train, sample and evaluate about 370 SVMs and VAEs.

The hyperparameters of both the MLP and the SVM models are tuned by Bayesian optimization on the first week following the first M (excluded) data points, using the GPyOpt library [15] with the maximum probability of improvement as the acquisition function and mixed sequential and local-penalization evaluation. The MLP model uses the Adam algorithm with early stopping after 10 evaluations on the validation data (25% of the actual training data) without improvement. The MLP uses only densely connected layers with the cross-entropy loss, the SVM uses the squared hinge loss. The resulting hyperparameters are in Tables 1 and 2. Table 1 shows a noticeably large network size together with a lot of regularization.

Table 1: Results of MLP hyperparameter optimization.

| Name | Selected value | Considered values |
| Learning rate | 0.00763 | 0.0001-0.01 |
| Batch norm. | yes | yes/no |
| Dropout | 0.22 | 0-0.7 |
| Gaussian noise | 0.795 | 0-1.0 |
| Layers | 354-322-316-305-2 | up to 400-400-400-400-2 |
| Activation | relu | elu, selu, softplus, softsign, relu, tanh, sigmoid, LeakyReLU, PReLU, ELU |
| Minibatch size | 730 | 10-1000 |
| L1 regular. | 0.01 | 0-0.1 |
| L2 regular. | 0.0998 | 0-0.1 |
| Data scaling | Standard | Standard, Robust, MinMax |

Table 2: Results of SVM hyperparameter optimization.

| Name | Selected value | Considered values |
| Penalty | 64.44 | 0.001-80 |
| Penalty type | l1 | l1/l2 |
| Data scaling | Standard | Standard, Robust, MinMax |

We have also applied Bayesian optimization to the VAE, but the results were not conclusive. Layers prefer to be as large as possible, because the $L_{MSE}$ part of the loss can be reduced more with larger layers. Unfortunately, this does not reveal whether some increase in the history size (M) helps more than an appropriate extension of the network. Layers also tend to have elu as the most suitable activation function, together with batch normalization.

Altogether, we used as baselines MLP1, representing a small, slightly regularized MLP, MLP2 with optimal hyperparameters (Table 1), representing a large and highly regularized network, and an SVM.

In Figure 9, we see that a data drift is indeed present and that both models are similarly penalized in time. This observation is confirmed by Figure 10, where learning is done for each week and the difficulty of the problem does not seem to increase. The models are not clearly overtrained, because both achieved a rather high accuracy with a rather small training dataset. The results were statistically analyzed by the Wilcoxon rank-sum test with the Holm correction at the 5% family-wise significance level [4]. For models trained only once, the results showed that the SVM was better in 88.3% of weeks, while being significantly better in 45.9% of them. MLP1 was significantly better in only 0.8% of weeks. It is important to say that the MLP1 in this test does not have optimal hyperparameters; we only want to see whether its behaviour evolves with time. The results of this comparison can be seen in Figure 9.

Figure 9: Comparison between two models trained on the data from the first week (median of accuracy against week number, with linear regression lines for MLP1 and the SVM; significant results are marked). The trend in the time series indicates that a data drift is present.

Subsequently, MLP1, MLP2 and the SVM were trained repeatedly each week with a corresponding history of size M and then tested on the next week. The results are depicted in Figure 10, and a summary is given in Table 3.

Figure 10: Comparison between SVMs and MLPs retrained for each week (median of accuracy against week number). There is no clear gradual increase in the difficulty of the problem. MLP2 seems to be the best of the compared models. A result is highlighted if it is significantly better than another, worse result in the respective week.

Table 3: Summary of the baseline comparison. MLP1 is a small network with little regularization, MLP2 is a large network with a lot of regularization, and the linear SVM is considered because it may have superior generalization properties.

| Row is better than column | MLP1 | MLP2 | SVM |
| MLP1 | - | 6.1% | 2.9% |
| MLP2 | 93.9% | - | 61.6% |
| SVM | 97.1% | 38.4% | - |

| Row is significantly better than column | MLP1 | MLP2 | SVM |
| MLP1 | - | 0.0% | 0.0% |
| MLP2 | 18.1% | - | 6.1% |
| SVM | 12.8% | 0.0% | - |

Our VAE-MLP algorithm is rather slow, due to its inherently sequential training. For the VAE, we used the 540-200-200-10-200-200-540 fully connected architecture with elu activations and $L_{MSE}$. The network is updated with data from each week with M = N = 150,000. Figure 11 depicts an interesting property: the previously clearly inferior MLP1 is improved by the VAE to the point of matching the performance of the baselines. This clearly shows the potential of the algorithm, considering that we neither optimize the MLP and the VAE together nor tune the values of M and N. Table 4 summarizes and further confirms the findings from Figure 11.

Figure 11: Results of our algorithm in the first 55 weeks (median of accuracy against week number for MLP2, MLP1, SVM and VAE-MLP1). The significantly worst MLP1 gains a significant performance advantage when combined with a VAE, to the point of basically matching MLP2 and the SVM. A summary of the comparison results is given in Table 4.

Table 4: Summary of the results of the first 50 weeks between the baselines (MLP2 and SVM) and MLP1 with (VAE1) and without VAE.

| Row is better than column | MLP2 | MLP1 | SVM | VAE1 |
| MLP2 | - | 100.0% | 100.0% | 82.0% |
| MLP1 | 0.0% | - | 0.0% | 0.0% |
| SVM | 0.0% | 100.0% | - | 8.0% |
| VAE1 | 18.0% | 100.0% | 92.0% | - |

| Row is significantly better than column | MLP2 | MLP1 | SVM | VAE1 |
| MLP2 | - | 100.0% | 84.0% | 4.0% |
| MLP1 | 0.0% | - | 0.0% | 0.0% |
| SVM | 0.0% | 100.0% | - | 0.0% |
| VAE1 | 0.0% | 100.0% | 14.0% | - |

If you are interested, you can try our model or help with its development at the following links:

Bayesian hyperparameter optimization framework
https://github.com/tumpji/Bayesian-optimizer.git

Implementation of the proposed method
https://github.com/tumpji/VAE-NN-Tensorflow.git

Deep belief networks in Tensorflow
https://github.com/tumpji/DBN-Tensorflow.git

6 Conclusion

This paper presented work in progress on a new approach to online deep classification learning in data streams with slow or moderate drift. This kind of learning is highly relevant for the application domain of malware detection. In the paper, the employed methods have been recalled and the principles of the proposed approach have been outlined. In ongoing experiments, the approach is currently being validated on a large set of real-world malware-detection data. This dataset contains Windows executable files from 375 weeks, with up to 30,000 binary files from each week. Due to the large size of the dataset, only the baseline detection using an MLP alone has been tested on all of it up to now, and compared to classification based on linear SVMs, which are frequently used in malware detection. The computational demands of testing the proposed new approach have so far allowed us to accomplish it for only 55 weeks. Results of the ongoing experiment will be available and presented at the workshop.

Acknowledgement

The research reported in this paper has been supported by the Czech Science Foundation (GAČR) grant 18-18080S. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated.

References

[1] P.J. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C.J.C. Burges, and A.J. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pages 43–54. MIT Press, Cambridge, 1999.
[2] W.L. Buntine and A.S. Weigend. Computing second derivatives in feed-forward networks: a review. IEEE Transactions on Neural Networks, 5(3):480–488, May 1994.
[3] A. Géron. Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. O'Reilly Media, Boston, first edition, 2017.
[4] S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6:65–70, 1979.
[5] D.P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], December 2014.
[6] D.P. Kingma and M. Welling. Auto-Encoding Variational Bayes. arXiv:1312.6114 [cs, stat], December 2013.
[7] M. Mursleen, A.S. Bist, and J. Kishore. A support vector machine water wave optimization algorithm based prediction model for metamorphic malware detection. International Journal of Recent Technology and Engineering, 7:1–8, 2019.
[8] A. Narayanan, L. Yang, L. Chen, and L. Jinliang. Adaptive and scalable Android malware detection through online learning. In 2016 International Joint Conference on Neural Networks (IJCNN), pages 2484–2491, July 2016.
[9] N. Nissim, R. Moskowitch, L. Rokach, and I. Elovici. Novel active learning methods for enhanced PC malware detection in windows OS. Expert Systems with Applications, 41:5843–5857, 2014.
[10] H.H. Pajouh, A. Dehghantanha, R. Khayami, and K.K.R. Choo. Intelligent OS X malware threat detection with code inspection. Journal of Computer Virology and Hacking Techniques, 14:212–223, 2018.
[11] B. Rashidi, C. Fung, and E. Bertino. Android malicious application detection using support vector machine and active learning. In 2017 13th International Conference on Network and Service Management (CNSM), pages 1–9, November 2017.
[12] B. Rashidi, C. Fung, and E. Bertino. Android Resource Usage Risk Assessment using Hidden Markov Model and Online Learning. Computers & Security, 65, November 2016.
[13] M. Rouaud. Probability, Statistics and Estimation: Propagation of Uncertainties in Experimental Measurement. Mathieu Rouaud, June 2017.
[14] B. Schölkopf and A.J. Smola. Learning with Kernels. MIT Press, Cambridge, 2002.
[15] Machine Learning Group, University of Sheffield. GPyOpt.
[16] M. Stamp. Introduction to Machine Learning with Applications in Information Security. CRC Press, Boca Raton, 2018.
[17] W.T. Vetterling, B.P. Flannery, W.H. Press, and S.A. Teukolsky. Numerical Recipes: The art of scientific computing. Cambridge University Press, Cambridge, 3rd edition, 2007.
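
Section 3.3 states the regularization term of the VAE loss once in terms of standard deviations and once in terms of log-variances $v = \log(\sigma^2)$. The following is a minimal NumPy sketch, not the implementation used in the experiments, that checks numerically that both forms agree and draws a sample of codings. The function names, the array shapes (4 samples, 10 codings) and the random seed are assumptions made only for illustration; the sampling step assumes $\sigma = \exp(v/2)$, which follows from the definition of $v$.

```python
import numpy as np


def kl_term_sigma(mu, sigma):
    """Regularization term written with standard deviations (hypothetical helper)."""
    n = mu.shape[0]  # N, the number of samples
    return -np.sum(1.0 + np.log(sigma ** 2) - mu ** 2 - sigma ** 2) / (2.0 * n)


def kl_term_logvar(mu, v):
    """The same term written with log-variances v = log(sigma^2)."""
    n = mu.shape[0]
    return -np.sum(1.0 + v - mu ** 2 - np.exp(v)) / (2.0 * n)


def sample_codings(mu, v, rng):
    """Sample codings as mu + eps * sigma with sigma = exp(v / 2), eps ~ N(0, 1)."""
    eps = rng.standard_normal(mu.shape)
    return mu + eps * np.exp(0.5 * v)


rng = np.random.default_rng(0)            # arbitrary seed, illustration only
mu = rng.normal(size=(4, 10))             # assumed: N = 4 samples, 10 codings
v = rng.normal(scale=0.1, size=(4, 10))   # assumed log-variances
sigma = np.exp(0.5 * v)

kl_sigma = kl_term_sigma(mu, sigma)
kl_logvar = kl_term_logvar(mu, v)
codings = sample_codings(mu, v, rng)

# For mu = 0 and v = 0 the codings already follow N(0, 1), so the term vanishes.
kl_standard_normal = kl_term_logvar(np.zeros((4, 10)), np.zeros((4, 10)))
```

In an actual VAE, `mu` and `v` would be the outputs of the encoder and the term would be added to $L_{MSE}$; here they are random arrays so that the two formulas can be compared in isolation.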