Bayesian Hyper-Parameter Optimisation for Malware
Detection
Fahad T ALGorain1 , John A Clark1
1
    University of Sheffield, Sheffield S10 2TN, UK


                                         Abstract
                                         Malware detection is a major security concern and a great deal of academic and commercial research and
                                         development is directed at it. Machine Learning is a natural technology to address malware detection
                                         and many researchers have investigated its use. However, drawing comparisons between different
                                         techniques is a fraught affair. For example, the performance of ML algorithms often depends significantly
                                         on parametric choices, so the question arises as to what parameter choices are optimal. In this paper,
                                         we investigate the use of a variety of ML algorithms for building malware classifiers and also how best
                                         to tune the parameters of those algorithms – generally known as hyper-parameter optimisation. We
                                         examine the effects of some simple (model-free) ways of parameter tuning together with a state-of-the-art
                                         Bayesian model-building approach. Our work is carried out using EMBER, a major published malware
                                         benchmark dataset on Windows Portable Execution (PE) metadata samples.

                                         Keywords
                                         Hyper-parameter optimisation, Automated Machine Learning, Static Malware Detection, Tree Parzen
                                         Estimators, Bayesian optimisation, Random Search, Grid Search




1. Introduction
Malware is one of the most pressing problems in modern cybersecurity and its detection has
been a longstanding focus for both academic and commercial research and development [1].
Detection must be effective (low false positives and negatives) but also efficient, particularly
in areas such as forensics or threat hunting where vast file storage may need to be scanned
for malware. Furthermore, the malware environment constantly changes and so (re-)training
speed of detectors is also important [2]. Machine Learning (ML) is an obvious avenue to pursue,
with various advantages to harnessing it: an ML approach can significantly reduce manual
effort in developing detectors, giving more rapid deployment; it can play a critical role in the
extraction of insight from malware samples; and ML-based detectors can detect some unseen
malware, e.g. unseen malware that has features that are similar to those of known malware
may be detected because of the loose pattern matching that underpins many ML classification
approaches. A large number of ML techniques have been brought to bear on the malware
detection problem, often attaining good results. However, ML must not be seen as a toolkit
that can simply be thrown at a problem. Many ML techniques are parametrised and the choice
of parameters may make a significant difference to performance. In modern, widely used ML

AI-CyberSec 2021: Workshop on Artificial Intelligence and Cyber Security, December 14, 2021, Cambridge, UK
$ ftalgorain1@sheffield.ac.uk (F. T. ALGorain); john.clark@sheffield.ac.uk (J. A. Clark)
 0000-0003-0547-1402 (F. T. ALGorain); 0000-0002-9230-9739 (J. A. Clark)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
toolkits, ML algorithms often have many tens of parameters (and sometimes more). This leads
to the thorny issue of how such parameters may be best set, a problem generally referred to
as hyper-parameter optimisation (HPO). HPO is a significant focus of research in optimisation
and has the potential to improve on the results obtained by a specific detection approach, but
also to enable fair comparison of techniques with techniques that are not the specific focus of
investigation: comparing a new technique with ‘vanilla’ variants of existing techniques is a
recurring motif in many studies. Bergstra et al. (2015) [3] advocate that hyper-parametrisation
should be a ‘formal outer loop’ in the learning process, a view we very much support. Manual
tuning is often simply impossible. (As an aside, we observe that many commercial users of
ML spend a great deal of time tuning for their specific needs.) ML toolkits seek to address this
problem by adopting default values for parameters; these defaults are set to values that have
been shown to work plausibly well over many problems. However, for any specific problem at
hand it is far from clear that the default values will be the best, or even good, choices. We have
significant domain incentives to gain the best possible results for malware detection. Both
false positive (FP) and false negative (FN) classifications have major consequences. The former
lead to significant wasted effort investigating the (non-) malware together with denial of service
to the application concerned, while the latter means malware goes undetected, with potentially
catastrophic consequences. For malware detection based on ML, making high-performing hyper-parameter choices therefore matters. In this paper, we initially explore a variety of ML techniques applied to the classification of a specific form of malware (Windows PE files). We aim to demonstrate that hyper-parameter optimisation may have a significant effect on the detection of this type of malware. More generally, we argue that it should play an important part in ML-based malware detection research, and in security applications more widely.
The contributions of our paper are:

   1. A demonstration of how well various ML-based Windows Portable Executable (PE) file
      classifiers perform when trained with default parameters.
   2. An evaluation of various hyper-parameter optimisation approaches applied to this prob-
      lem, including:
        a) Established model-free techniques, e.g. Grid Search and Random Search; and
        b) A Bayesian optimisation model-based approach.
   3. A demonstration that for our target problem the optimal choices of ML hyper-parameters
      may vary considerably from the toolkit defaults.

Windows PE files are an important malware vector, and their detection has been the focus
of significant research. The work uses the EMBER dataset [4] – a recently published dataset
comprising header and derived information from a million examples of PE files. This dataset
contains examples of malware, benign software, and software of unknown malicious status.
The dataset elements are labelled accordingly and so enable supervised learning. EMBER is now
a major resource for ML and the malware community. Our evaluation includes both functional
performance and efficiency (time to train). The last point above should not be underestimated.
In a great many cases results will simply be sub-optimal if tuning is manual and the complexity
of the parameter space is huge (and for many techniques this will be the case), and, as we have
argued above, the costs of sub-optimality may be considerable.
2. Related Literature
2.1. Windows PE and ML Background
Numerous works have been produced regarding ML-based static Portable Executable (PE)
malware detection such as [5][6][7], but work has often been hampered by the absence of
a standard benchmark dataset. The publication of the EMBER dataset [4] has resolved this
problem. A baseline model for supervised ML has been provided by the authors to aid in using
the dataset as a benchmark. Their dataset is accompanied by Python access routines. In [8] the
authors considered imbalanced data set issues and model training duration (by reducing feature
dimensions). They also applied a static detection method using a Gradient Boosting Decision
Tree Algorithm. Their model achieved better performance than the baseline model with less
training time. (They used feature reduction based on the recommendation of the authors in
[4].) Another approach utilized a subset of the EMBER dataset for their work and compared
different ML models [9]. Their work is mainly concerned with scalability and efficiency. Their
goal was to identify malware families. The proposed Random Forest model achieved a slightly
better performance than the baseline model.

2.2. HPO Related literature
Multiple works have asserted the potential of hyper-parameter optimisation. For example, [10] demonstrated the importance of parameter tuning for increasing accuracy, showing that Random Search works better than Grid Search when tuning neural networks. Also, [11] applied
standard tuning techniques to a decision tree on 102 datasets and calculated the accuracy
differences between tuned and traditional models. [12][13] are concerned with the greedy
forward search, which seeks to identify the most important hyperparameter to change next.
[14] stressed the importance of single hyperparameters after using sequential model-based
optimisation (SMBO) tuning. ANOVA is used to measure hyperparameter importance. The
authors in [15][16] assessed the performance of hyperparameters across different datasets. [16]
also utilized surrogate models that allow setting the arbitrary hyperparameter configurations
based on a limit on the number of evaluations carried out. Bayesian optimisation-based (BO)
hyper-parameter search was used in [15][17] for speeding up the analysis of the work. The
literature reveals that HPO, and in particular Bayesian optimisation-based (BO) approaches,
have much to offer.
   In this paper, we will use Automated Hyper-parameter Bayesian Optimisation with Tree Parzen Estimators (AHBO-TPE), which takes advantage of BO to speed up the search.
3. Hyper-parameter optimisation (HPO)
3.1. Definition and Motivation
Hyper-parameters are parameters of a model that are not updated during the learning process
[17]. The HPO problem is defined similarly by many researchers as a search to find 𝑥*

                                 x^* = \arg\min_{x \in X} f(x),                                  (1)

where 𝑓 (𝑥) is an objective score to be minimised. Commonly it is an error rate of some form
evaluated on the validation set, e.g. the Root Mean Square Error (RMSE). 𝑥* is the hyper-
parameter vector that gives rise to the lowest score value, and 𝑥 can be any vector of parameters
in the specified domain. HPO seeks the hyper-parameter values that return the lowest score. For
malware and similar classification tasks, suitable choices for the objective function are holdout and cross-validation error. Furthermore, if we consider a loss function for the same problem, a possible choice is the misclassification rate [18]. For our proposed model, the loss function is taken to be one minus the cross-validated ROC AUC score (so that minimising the loss maximises the AUC). For in-depth background about validation protocols see [19]. Our work also aims to investigate evaluation time. There are two clear ways to reduce it. The first is to use a subset of folds when testing an ML algorithm [20]. The second is to use a subset of the dataset, especially for a large dataset [21][22], or to use fewer iterations.
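As an illustration, the following minimal sketch (not the exact code from our experiments) shows how such a loss can be computed with Scikit-learn and LightGBM, assuming X and y hold the training features and labels:

    from sklearn.model_selection import cross_val_score
    from lightgbm import LGBMClassifier

    def objective(params, X, y):
        # Train the candidate configuration and score it by 3-fold
        # cross-validated ROC AUC; return a loss to be minimised.
        model = LGBMClassifier(**params)
        auc = cross_val_score(model, X, y, scoring="roc_auc", cv=3).mean()
        return 1.0 - auc  # lower loss corresponds to higher AUC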
   Although HPO has a great deal to offer, it comes at a price, in particular a computational
price. For every hyper-parameter evaluation, we must train the model, make predictions on
the validation set, and then calculate the validation metrics. Developing a robust ML-based
classifier for Windows PE with a credibly sized and diverse dataset such as EMBER is therefore a
significant undertaking. The computational costs involved act as a disincentive to implementing
Bergstra et al.’s formal outer loop. There is a pressing need for traversing the hyper-parameter
space efficiently and we demonstrate how a leading HPO approach allows us to do so in this
paper.
   Here, Windows PE files are a means to an end; the same issues apply to detecting other
malware. Although malware is our major interest, our work also seeks to motivate consideration
of HPO more widely in the application of ML in cybersecurity (since HPO issues apply there
too). For more information about the HPO problem interested readers should refer to [18].

3.2. Model-free Blackbox optimisation methods (Search Methods)
Perhaps the two most common HPO methods are Grid Search and Random Search. These
require only an evaluation function to work, i.e. they are what is commonly referred to as
‘blackbox’. In Grid Search the individual parameters are discretised, i.e. a number of specific
values are selected as ‘covering’ the particular parameter space. For example the elements in
the set {0.0, 0.25, 0.5, 0.75, 1.0} could be taken to cover a continuous parameter in the range
[0.0..1.0]. Grid Search evaluates the function over the cross product of the discretised sets of
hyper-parameters.
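For illustration, the cross product can be enumerated with Scikit-learn's ParameterGrid; the parameter names and values below are purely indicative, not the grid used in our experiments:

    from sklearn.model_selection import ParameterGrid

    # Each hyper-parameter is discretised to a small covering set of values.
    grid = {
        "learning_rate": [0.005, 0.01, 0.05],
        "num_leaves": [512, 1024, 2048],
        "bagging_fraction": [0.5, 0.75, 1.0],
    }

    # Grid Search evaluates the full cross product: 3 x 3 x 3 = 27 candidates.
    for params in ParameterGrid(grid):
        print(params)  # in practice: train and evaluate the model with these settings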
   Random Search selects values randomly from the domain of each hyper-parameter. Usually,
the values selected from each domain by Random Search are independent, i.e. the value of one
parameter choice does not affect the value selected for another parameter choice. Furthermore,
for an individual parameter all values generally have the same probability of being selected. It
is possible to relax such properties, producing what is often referred to as a biased stochastic
search. Such biases often encode for domain insight which is not in the spirit of a blackbox
approach. In our work, we adopted a standard or ‘vanilla’ Random Search. Grid Search suffers from the ‘curse of dimensionality’ [23]. As the number of parameters increases or finer-grained discretisation is adopted, the computational complexity mushrooms. Furthermore, Grid Search does not learn from past evaluations; we generally refer to such approaches as uninformed.
Consequently, it may spend a great deal of time evaluating candidates in regions where previous
evaluation of candidates has given rise to poor objective values. Random Search will search
the specified space until a certain number of evaluations, time, or budget has been reached.
It works better than Grid Search when we know the promising hyper-parameter regions and
so we can constrain the stochastic selection of candidates to lie somewhere in such regions
[10][11]. Combining Random Search with more complex strategies guarantees a minimum convergence rate and adds exploration that can improve model-based search [18][24]. Random Search is also
an uninformed method and so takes a long time to identify the best performing hyper-parameter
settings. It is not surprising that uninformed methods can be outperformed by methods that use
evaluation history to judge where to try next; indeed, such guided search usually outperforms
uninformed methods [15][25][26].
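A corresponding sketch of vanilla Random Search using Scikit-learn's ParameterSampler, with each hyper-parameter drawn independently and uniformly (again, the ranges are indicative only):

    from scipy.stats import randint, uniform
    from sklearn.model_selection import ParameterSampler

    # Independent, unbiased ('vanilla') sampling distributions per hyper-parameter.
    distributions = {
        "learning_rate": uniform(0.005, 0.045),  # uniform on [0.005, 0.05]
        "num_leaves": randint(2, 513),           # uniform integers in [2, 512]
        "bagging_fraction": uniform(0.5, 0.5),   # uniform on [0.5, 1.0]
    }

    # Stop once a fixed evaluation budget (here 100 candidates) is exhausted.
    for params in ParameterSampler(distributions, n_iter=100, random_state=0):
        print(params)  # in practice: train and evaluate the model with these settings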

3.3. Bayesian optimisation (BO)
BO has emerged recently as one of the most promising optimisation methods for expensive
blackbox functions. It has gained a lot of traction in the HPO community with significant results
in several areas such as image classification, speech recognition and neural language modelling. For an in-depth overview of BO, the reader is referred to [17][27]. BO is an informed method that takes past results into consideration to find the best hyper-parameters. It utilizes
those previous results to form a probabilistic model based on the probability of the score given the hyper-parameters, denoted P(score | hyper-parameters). [28]
refers to the probabilistic model as a surrogate for the objective function denoted by 𝑃 (𝑦|𝑥),
the probability of 𝑦 given 𝑥. The model or surrogate is more straightforward to optimize than
the objective function. BO works to find the next hyper-parameters to be evaluated using the
actual objective function by selecting the best performing hyper-parameters on the surrogate
function. A 5-step process to do this is given by [28]. The first step builds a surrogate
probability model of the objective function. The second finds the hyper-parameters with best
results on the surrogate. The third applies those values to the real defined objective function.
The fourth updates the surrogate with this new real objective function result. Steps 2–4 are
repeated until the maximum number of iterations or the budgeted time is reached [29]. BO has two primary components: a probabilistic model and an acquisition function that decides the next place to evaluate.
Furthermore, BO trades off exploration and exploitation; instead of assessing the costly blackbox
function, the acquisition function is cheaply computed and optimized. There are many choices
for the acquisition function but here we use the most common – expected improvement (EI)
[30]. The goal of Bayesian reasoning is to become more accurate as more performance data
is acquired. The 5-step process above is repeated to keep the surrogate model updated
after each evaluation of the objective function [15]. BO spends a little more time generating
sets of hyper-parameter choices that are likely to provide real improvements whilst keeping
calls to the actual objective function as low as possible. Practically, the time spent on choosing
the next hyper-parameters to evaluate is often trivial compared to the time spent on the (real)
objective function evaluation. BO can find better hyper-parameters than Random Search in
fewer iterations [25]. This is one of the issues we seek to address in our work: whether AHBO-
TPE can find better hyper-parameters than Random Search in fewer iterations, especially for our target domain – Windows PE file malware detection.
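The loop structure can be sketched schematically as follows; surrogate and acquisition are placeholders standing in for whatever concrete choices (e.g. a TPE model and EI) a particular library makes:

    def bayesian_optimisation(objective, space, surrogate, acquisition, n_iter):
        # Schematic sketch of the 5-step BO loop described above.
        history = []                                        # (hyper-parameters, score) pairs
        for _ in range(n_iter):
            surrogate.fit(history)                          # steps 1/4: (re)build P(score | hyper-parameters)
            candidate = acquisition.best(surrogate, space)  # step 2: best candidate on the surrogate
            score = objective(candidate)                    # step 3: evaluate the real objective
            history.append((candidate, score))              # record the result for the next update
        return min(history, key=lambda pair: pair[1])       # best configuration found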

3.4. Sequential Model-Based optimisation (SMBO)
There are several options for the SMBO’s evaluation of the surrogate model 𝑃 (𝑦|𝑥) [15]. One
of the choices is to use Expected Improvement (EI) as defined in the equation below:
                                 EI_{y^*}(x) = \int_{-\infty}^{y^*} (y^* - y) \, P(y|x) \, dy                            (2)

Here 𝑦 * is the threshold value of the objective function, 𝑥 is the hyper-parameter, 𝑦 is the
actual value of the objective function using the hyper-parameters 𝑥, and 𝑃 (𝑦|𝑥) is the surrogate
probability model expressing the probability (density) of 𝑦 given 𝑥. The goal is to find the best
hyper-parameters under function 𝑃 (𝑦|𝑥). The threshold value 𝑦 * is the best objective value
obtained so far. We aim to improve on (i.e. get a lower value than) the best value obtained so far.
For such minimisation problems, if a value 𝑦 is greater than the threshold value, then it is not
an improvement. Only values less than the threshold are improvements. For a value y less than the threshold, y* − y is the improvement. Integrating over all such improvements, weighted by the density P(y|x), gives the overall expected improvement for the parameter vector x. When better values of x are found (i.e. giving rise to actual improvements
in the real objective function) the threshold value 𝑦 * is updated. The above description is an
idealised view of Expected Improvement. In practice the choice of threshold value is more
flexible, i.e. 𝑦 * need not be the best objective value witnessed so far; this is actually the case for
the Tree-Parzen Estimator approach outlined immediately below.
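Before turning to TPE, a concrete illustration of Equation (2): if the surrogate P(y|x) at a candidate x were Gaussian with mean mu and standard deviation sigma, the integral reduces to a well-known closed form that a few lines of SciPy can evaluate (this Gaussian case is illustrative only and is not part of TPE):

    from scipy.stats import norm

    def expected_improvement(mu, sigma, y_star):
        # EI for minimisation when P(y|x) is N(mu, sigma^2):
        # EI = (y* - mu) * Phi(z) + sigma * phi(z), where z = (y* - mu) / sigma.
        if sigma <= 0.0:
            return max(y_star - mu, 0.0)
        z = (y_star - mu) / sigma
        return (y_star - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    # Example: the best loss so far is 0.10; a candidate is predicted at 0.09 +/- 0.02.
    print(expected_improvement(mu=0.09, sigma=0.02, y_star=0.10))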

3.5. Tree-Structured Parzen Estimators (TPE)
TPE constructs its model using Bayes' rule. It models P(x|y) with two component densities over the hyper-parameter space: one fitted to configurations whose objective values fall below a threshold, and the other to those whose values are greater than or equal to it:
                             P(x|y) = \begin{cases} l(x) & \text{if } y < y^* \\ g(x) & \text{if } y \ge y^* \end{cases}                      (3)

TPE takes y* to be some quantile γ of the observed y values, i.e. such that P(y < y*) = γ [3]. This ensures that data are available to construct both densities. l(x) is a density fitted to the evaluated values of x that gave objective values less than the threshold; g(x) is the density fitted to the remaining evaluated x values. Here P(x|y) is the density of hyper-parameter x given an objective function score of y. The surrogate P(y|x) then follows from Bayes' rule [15]:
                                P(y|x) = \frac{P(x|y) \, P(y)}{P(x)}                                   (4)
Bergstra et al. [15] also show that to maximise expected improvement we should seek parameters x with high probability under l(x) and low probability under g(x). Thus, they seek to maximise the ratio l(x)/g(x). The best such x is then evaluated in the actual objective function and will
be expected to have a better value. The surrogate model estimates the objective function, and its updates are based on the history of previous trials of the objective function evaluation. As mentioned before, previous trials are stored by the algorithm as (score, hyper-parameter) pairs, from which the below-threshold density l(x) and the above-threshold density g(x) are built. The algorithm uses this history of trials to improve its suggestions with each iteration.
The motivation to use TPE with SMBO to reduce time and find better hyper-parameters came
from other papers [15][25][31][32]. For SMBO we use Hyperopt [3], a Python library that implements BO/SMBO. Hyperopt makes SMBO an interchangeable component that can be applied to any search problem. Hyperopt supports several algorithms, but we are interested only in TPE in
our work. Our contribution lies in the demonstration of the usefulness of SMBO using
TPE for malware classification purposes.
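A minimal sketch of how Hyperopt's TPE algorithm is invoked; the two-parameter search space and placeholder objective below are illustrative, not our full LightGBM configuration:

    import numpy as np
    from hyperopt import Trials, fmin, hp, tpe

    # Illustrative search space; hp.loguniform takes its bounds in log space.
    space = {
        "learning_rate": hp.loguniform("learning_rate", np.log(0.005), np.log(0.05)),
        "num_leaves": hp.quniform("num_leaves", 2, 512, 1),
    }

    def objective(params):
        # Placeholder loss; in practice this would be one minus the cross-validated
        # ROC AUC of a model trained with `params`, as discussed in Section 3.1.
        return (params["learning_rate"] - 0.01) ** 2

    trials = Trials()  # records every (loss, hyper-parameter) trial for the TPE densities
    best = fmin(fn=objective, space=space, algo=tpe.suggest,
                max_evals=100, trials=trials)
    print(best)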


4. Experiments
Here we outline the experiments carried out and provide sample data and execution environment
details. Discussion of results is given in the next section.

4.1. Execution Environment and Dataset
Our machine learning model evaluation used Scikit-learn [33] and Hyperopt [3]. The experiments were carried out on a Windows 10 machine (8 GB RAM, AMD Ryzen 5 3550H with Radeon Vega Mobile Gfx at 2.10 GHz, 64-bit operating system, x64-based processor) and on a MacBook Air (Catalina 10.15, 1.8 GHz dual-core Intel i5, 8 GB 1600 MHz DDR3, Intel HD Graphics 6000 1536 MB). The 2018 version of the EMBER dataset [34] was used. It contains 1M samples in total. We used 300k benign and 300k malicious samples for training and 100k benign and 100k malicious samples for testing. The 200k unlabelled examples in the dataset were not used in our experiments. Furthermore, the results were obtained using Jupyter Notebook version 6.1.0 and Python version 3.6.0.
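For reference, loading the vectorised features with the Python routines that accompany EMBER looks roughly as follows; the function name and label encoding are taken from the EMBER README and should be treated as assumptions about the particular package version:

    import ember  # the EMBER access routines that accompany the dataset [4][34]

    # Assumed API: load the 2018 vectorised features and keep only labelled samples.
    X_train, y_train = ember.read_vectorized_features("/data/ember2018/", subset="train")
    labelled = y_train != -1  # -1 marks samples of unknown malicious status
    X_train, y_train = X_train[labelled], y_train[labelled]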

4.2. Experiments with Default Settings
Table 1 shows the results when various ML techniques are applied with default parameter
settings. The techniques include well-established approaches (Stochastic Gradient Descent
classifier (SGD), Logistic Regression classifier (LR), Gaussian Naïve Bayes (GNB), K-nearest
Neighbour (KNN), and Random Forest (RF) [33], [35]) and a state-of-the-art approach – Light-
GBM [36]. This technique has over a hundred parameters and so introduces major challenges
for hyper-parametrisation. Some of its categorical parameters (e.g. boosting type) give rise
to conditional parameters. For initial experiments we adopted the default parameter settings
adopted by the Scikit-Learn toolkit for all techniques other than LightGBM (which has its own
defaults). The evaluation metric is Area Under the Receiver Operating Characteristic
Curve (ROC AUC) [37]. ROC AUC plays an important role in many security classification
tasks, e.g. it also occurs frequently as an evaluation metric in intrusion detection research. Grid
Search results were obtained using the MacBook Air, while the rest (AHBO-TPE and Random
Search) were obtained under the Windows 10 operating system for faster performance.
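The procedure behind Table 1 is the same for every classifier: fit with toolkit defaults, score the test split, and report ROC AUC. A minimal sketch, assuming X_train, y_train, X_test and y_test hold the EMBER splits described in Section 4.1:

    from lightgbm import LGBMClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression, SGDClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier

    models = {
        "GNB": GaussianNB(),
        "SGD": SGDClassifier(),
        "LightGBM": LGBMClassifier(),
        "RF": RandomForestClassifier(),
        "LR": LogisticRegression(),
        "KNN": KNeighborsClassifier(),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)  # toolkit defaults throughout
        if hasattr(model, "predict_proba"):
            scores = model.predict_proba(X_test)[:, 1]
        else:  # e.g. hinge-loss SGD exposes only a decision function
            scores = model.decision_function(X_test)
        print(name, roc_auc_score(y_test, scores))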

4.3. Hyper-parameter Optimisation
The most promising of the evaluated ML algorithms, taking into account functional performance and speed of training, was LightGBM. We chose to explore hyper-parameter optimisation further on this technique. Since LightGBM has over 100 parameters, some of which are continuous, we simply cannot do an exhaustive search. Accordingly, we had to select a subset of parameters as the focus of this work, concentrating on those we believe are the most important. For Grid Search in particular we had to be especially selective in what we optimised, and for Random Search we specified a budget of 100 iterations. We examine Grid Search, Random Search, and AHBO with Tree Parzen Estimators as HPO approaches. We therefore provide a comparison between model-free (blackbox) approaches (Grid Search and Random Search) and AHBO-TPE, an approach that uses evaluation experience to continually update its model and suggest the next values of the hyper-parameters. We applied AHBO-TPE in two phases: in the first we allowed only 3 iterations, while in the second we allowed 100 more iterations for fair comparison with Random Search.
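For illustration, the ranges reported in Tables 6 and 7 map onto Hyperopt space definitions roughly as follows (an abbreviated sketch; num_leaves starts at 2 here since LightGBM requires at least two leaves):

    from hyperopt import hp

    # Abbreviated LightGBM search space, mirroring the ranges in Tables 6 and 7.
    # hp.quniform returns floats, so integer-valued parameters (num_leaves,
    # min_child_samples, ...) are cast to int before being passed to LightGBM.
    lgbm_space = {
        "boosting_type": hp.choice("boosting_type", ["gbdt", "goss"]),
        "learning_rate": hp.uniform("learning_rate", 0.005, 0.05),
        "num_leaves": hp.quniform("num_leaves", 2, 512, 1),
        "feature_fraction": hp.uniform("feature_fraction", 0.5, 1.0),
        "bagging_fraction": hp.uniform("bagging_fraction", 0.5, 1.0),
        "min_child_samples": hp.quniform("min_child_samples", 20, 500, 1),
        "reg_alpha": hp.uniform("reg_alpha", 0.0, 1.0),
        "reg_lambda": hp.uniform("reg_lambda", 0.0, 1.0),
        "is_unbalance": hp.choice("is_unbalance", [True, False]),
    }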

4.4. Additional Materials
Implementation details of our experiments can be found in our GitHub repository [38].


5. Results and discussion
We have carried out our experiments on a benchmark dataset that has been assembled, in part,
to include examples that present challenges to ML classification approaches [34]. The results
given in Table 1 show that the major ML approaches vary hugely in their suitability for
the static malware classification tasks. Our benchmark dataset is comprised of summaries
of Windows PE files. It would seem prudent for malware detection researchers to evaluate
multiple ML algorithms for non-Windows malware detection too. The results also show the
clear promise of using a particular state-of-the-art algorithm – LightGBM – for the malware
detection task. Taking both time and performance into account, Table 1 shows that LightGBM is
clearly the best performing approach. The subsequent tables summarise our attempts to apply
HPO approaches to the most promising of the original ML techniques. We can see in Table 2 that HPO can offer significant improvements. Random Search performs very well. So does AHBO-TPE, but its initial optimisation is far more efficient. Figure 1 further illustrates
how Random Search and AHBO-TPE evolve as iteration number increases. The tool default
parameter choices cannot be relied upon to produce the best or even good results. Tables 5, 6
and 7 illustrate the difficulty of manually tuning parameters for this task. In some cases the
defaults and the best found values are at the opposite ends of the parameter ranges,
e.g. the bagging fraction in Table 5. Many are significantly different to the default value, e.g.
num_leaves in Tables 6 and 7 and n_estimators in Table 7. Some binary choices are reversed, e.g. objective and is_unbalance in Table 6. Further results are given in the appendix, where we use AHBO-TPE with the selected ML techniques. The results are shown with 10 iterations (a constraint imposed for reasons of computational practicality) and 3-fold cross-validation. Moreover, the promising hyper-parameters found for each ML model are given in Tables 7,
8, 9, 10 and 11. Also, in Figure 2 a comparison is given between our search methods and the
benchmark model.

Table 1
Comparison Between Default ML Model Parameter Scores
                                      ML Model               Time to train        Score (ROC AUC)
                                      GNB                    11 min 56 s               0.406
                                      SGD                    19 min 11 s               0.563
                                      LightGBM Benchmark     26 min                    0.922
                                      RF                     57 min 52 s               0.90
                                      LR                     1 hr 44 min               0.598
                                      KNN                    3 hrs 14 min 59 s         0.745




Table 2
Search Method Score Comparison
                  Search Method                                Best ROC AUC Score    Number of Iterations    Time to Complete Search
                  Benchmark LightGBM model (not optimised)           0.922                   100                      26 mins
                  Grid Search optimisation                           0.944                   965                 almost 3 months
                  Random Search optimisation                         0.955                    60            15 days, 13 hrs and 12 mins
                  AHBO-TPE initial optimisation                      0.955                     3                       4 hrs
                  AHBO-TPE with 100 extra iterations                 0.957                    26            27 days, 23 hrs and 11 mins




Table 3
Selected ML Models with AHBO-TPE: Score Comparison
                               ML Model              Score (ROC AUC)    Score (ROC AUC) After Optimisation
                               GNB                        0.406                    unchanged
                               SGD                        0.563                      0.597
                               LightGBM Benchmark         0.922                      0.957
                               RF                         0.901                      0.936
                               LR                         0.598                      0.618
                               KNN                        0.745                      0.774
Figure 1: Highest validation score by iteration for AHBO-TPE (yellow) and Random Search (blue); the star marks the best score achieved.


Table 4
Selected ML Models with AHBO-TPE: Performance Results
                             ML Model              Time to train        Training Time Reduction
                             GNB                   11 min 56 s                    none
                             SGD                   4 min 35 s                  14 min 35 s
                             LightGBM Benchmark    18 min 30 s                    8 mins
                             RF                    31 min 14 s                   26 min
                             LR                    1 hr 5 min 37 s               38 min
                             KNN                   4 hrs 37 min 30 s    increased by 1 hr 23 min 29 s



6. Limitations and Conclusions
6.1. Limitations
A significant limitation is that the work uses a specific dataset (EMBER) concerned with Windows
PE files. This limits rigorously founded generalisation of results. Evaluation using further
malware datasets would build further insight into applicability beyond the direct case study
malware type reported here. Also we have used a single (albeit highly effective) ‘informed’
hyper-parametrisation approach. The use of other informed hyper-parametrisation approaches
would provide further insight and possible improvements. For practical purposes we informally
identified plausible parameters that should be subject to variation and allowed the remaining
ones to be set at the defaults. It is possible that improvements in results could be obtained by
allowing variation in those parameters fixed at their default values.
6.2. Conclusions
The results overall would suggest that researchers in malware and ML are missing a significant
opportunity to improve results attained by specific techniques of interest. The importance of
sound optimisation in this domain is considerable: every improvement matters to the security
of the system and there are major cost implications for gaining improvements. As we argued earlier, sub-optimality may have major costs. Researchers who apply particular techniques to
malware detection are interested in gaining maximal performance and showing their technique
(accurately) at its best. We need a fair comparison of techniques at their best. Fair comparison
is also crucial to the development of the field. HPO has the ability to place ML-based malware
detection on a sound empirical footing. We recommend hyper-parameter optimisation for
malware detection and the ML community. We also recommend Bayesian optimisation as a
particularly promising approach with informative approaches generally being an important
avenue for future research. For future work, investigating the applicability of our approach in dynamic malware detection settings seems worthwhile. The major available dataset (EMBER) contains samples of unknown malicious status. Accordingly, applying HPO in a semi-supervised regime seems a plausible avenue to pursue.
   We have shown how Bayesian Optimisation can be harnessed to advantage. We have applied
a leading state-of-the-art approach to show how malware detection can be enhanced. We should
be alert to major developments in the HPO field to ensure we deliver the very best outcomes for malware detection stakeholders. In summary: the parametrisation of ML approaches exerts
significant effect on the results obtained; Hyper-parameter Optimisation is an important device
for obtaining best results for any specific technique; it is arguably an essential component for
the rigorous comparison of techniques, which is a fundamental criterion for the development
of the topic. It would be very useful for the ML-based malware detection community to agree
comparison protocols. The above are our conclusions for ML applied to malware detection
(here malicious Windows PE files). We propose that HPO be an essential element of the ML
process for malware detection applications.


References
 [1] A. K. Pandey, A. K. Tripathi, G. Kapil, V. Singh, M. W. Khan, A. Agrawal, R. Kumar, R. A.
     Khan, Trends in malware attacks: Identification and mitigation strategies, in: Critical
     Concepts, Standards, and Techniques in Cyber Forensics, IGI Global, 2020, pp. 47–60.
 [2] A. Al-Sabaawi, K. Al-Dulaimi, E. Foo, M. Alazab, Addressing malware attacks on connected
     and autonomous vehicles: Recent techniques and challenges, in: Malware Analysis Using
     Artificial Intelligence and Deep Learning, Springer, 2021, pp. 97–119.
 [3] J. Bergstra, B. Komer, C. Eliasmith, D. Yamins, D. D. Cox, Hyperopt: a python library for
     model selection and hyperparameter optimization, Computational Science & Discovery 8
     (2015) 014008.
 [4] H. S. Anderson, P. Roth, Ember: an open dataset for training static pe malware machine
     learning models, arXiv preprint arXiv:1804.04637 (2018).
 [5] M. G. Schultz, E. Eskin, F. Zadok, S. J. Stolfo, Data mining methods for detection of new
     malicious executables, in: Proceedings 2001 IEEE Symposium on Security and Privacy.
     S&P 2001, IEEE, 2000, pp. 38–49.
 [6] J. Z. Kolter, M. A. Maloof, Learning to detect and classify malicious executables in the
     wild., Journal of Machine Learning Research 7 (2006).
 [7] E. Raff, J. Barker, J. Sylvester, R. Brandon, B. Catanzaro, C. K. Nicholas, Malware detection
     by eating a whole exe, in: Workshops at the Thirty-Second AAAI Conference on Artificial
     Intelligence, 2018.
 [8] H.-D. Pham, T. D. Le, T. N. Vu, Static pe malware detection using gradient boosting decision
     trees algorithm, in: International Conference on Future Data and Security Engineering,
     Springer, 2018, pp. 228–236.
 [9] C. Fawcett, H. H. Hoos, Analysing differences between algorithm configurations through
     ablation, Journal of Heuristics 22 (2016) 431–458.
[10] J. Bergstra, Y. Bengio, Random search for hyper-parameter optimization., Journal of
     machine learning research 13 (2012).
[11] F. Hutter, H. Hoos, K. Leyton-Brown, An efficient approach for assessing hyperparameter
     importance, in: International conference on machine learning, PMLR, 2014, pp. 754–762.
[12] J. N. Van Rijn, F. Hutter, Hyperparameter importance across datasets, in: Proceedings of
     the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining,
     2018, pp. 2367–2376.
[13] A. Biedenkapp, M. Lindauer, K. Eggensperger, F. Hutter, C. Fawcett, H. Hoos, Efficient
     parameter importance analysis via ablation with surrogates, in: Proceedings of the AAAI
     Conference on Artificial Intelligence, volume 31, 2017.
[14] K. Eggensperger, M. Lindauer, H. H. Hoos, F. Hutter, K. Leyton-Brown, Efficient bench-
     marking of algorithm configurators via model-based surrogates, Machine Learning 107
     (2018) 15–41.
[15] J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-parameter optimiza-
     tion, in: 25th annual conference on neural information processing systems (NIPS 2011),
     volume 24, Neural Information Processing Systems Foundation, 2011.
[16] P. Probst, A.-L. Boulesteix, B. Bischl, Tunability: Importance of hyperparameters of
     machine learning algorithms., J. Mach. Learn. Res. 20 (2019) 1–32.
[17] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, N. De Freitas, Taking the human out of
     the loop: A review of bayesian optimization, Proceedings of the IEEE 104 (2015) 148–175.
[18] M. Feurer, F. Hutter, Hyperparameter optimization, in: Automated Machine Learning,
     Springer, Cham, 2019, pp. 3–33.
[19] B. Bischl, O. Mersmann, H. Trautmann, C. Weihs, Resampling methods for meta-model val-
     idation with recommendations for evolutionary computation, Evolutionary computation
     20 (2012) 249–275.
[20] C. Thornton, F. Hutter, H. H. Hoos, K. Leyton-Brown, Auto-weka: Combined selection
     and hyperparameter optimization of classification algorithms, in: Proceedings of the 19th
     ACM SIGKDD international conference on Knowledge discovery and data mining, 2013,
     pp. 847–855.
[21] A. Klein, S. Falkner, S. Bartels, P. Hennig, F. Hutter, et al., Fast bayesian hyperparameter
     optimization on large datasets, Electronic Journal of Statistics 11 (2017) 4945–4968.
[22] O. Maron, A. W. Moore, The racing algorithm: Model selection for lazy learners, Artificial
     Intelligence Review 11 (1997) 193–225.
[23] R. Bellman, Dynamic Programming, Princeton University Press, Princeton, New Jersey, 1957.
[24] F. Hutter, H. Hoos, K. Leyton-Brown, An evaluation of sequential model-based optimization
     for expensive blackbox functions, in: Proceedings of the 15th annual conference companion
     on Genetic and evolutionary computation, 2013, pp. 1209–1216.
[25] J. Bergstra, D. Yamins, D. Cox, Making a science of model search: Hyperparameter opti-
     mization in hundreds of dimensions for vision architectures, in: International conference
     on machine learning, PMLR, 2013, pp. 115–123.
[26] S. Falkner, A. Klein, F. Hutter, Bohb: Robust and efficient hyperparameter optimization at
     scale, in: International Conference on Machine Learning, PMLR, 2018, pp. 1437–1446.
[27] E. Brochu, V. M. Cora, N. De Freitas, A tutorial on bayesian optimization of expensive
     cost functions, with application to active user modeling and hierarchical reinforcement
     learning, arXiv preprint arXiv:1012.2599 (2010).
[28] I. Dewancker, M. McCourt, S. Clark, Bayesian optimization primer, 2015. URL:
     https://static.sigopt.com/b/20a144d208ef255d3b981ce419667ec25d8412e2/static/pdf/
     SigOpt_Bayesian_Optimization_Primer.pdf.
[29] M. Feurer, A. Klein, K. Eggensperger, J. T. Springenberg, M. Blum, F. Hutter, Auto-sklearn:
     efficient and robust automated machine learning, in: Automated Machine Learning,
     Springer, Cham, 2019, pp. 113–134.
[30] D. R. Jones, M. Schonlau, W. J. Welch, Efficient global optimization of expensive black-box
     functions, J. Global Optim. 13 (1998) 455–492.
[31] K. Eggensperger, M. Feurer, F. Hutter, J. Bergstra, J. Snoek, H. Hoos, K. Leyton-Brown,
     Towards an empirical foundation for assessing bayesian optimization of hyperparameters,
     in: NIPS workshop on Bayesian Optimization in Theory and Practice, volume 10, 2013,
     p. 3.
[32] L. Yang, A. Shami, On hyperparameter optimization of machine learning algorithms:
     Theory and practice, Neurocomputing 415 (2020) 295–316.
[33] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
     P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in python, the
     Journal of machine Learning research 12 (2011) 2825–2830.
[34] H. S. Anderson, P. Roth, elastic/ember, 2021. URL: https://github.com/elastic/ember/blob/
     master/README.md.
[35] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Pretten-
     hofer, A. Gramfort, J. Grobler, et al., Api design for machine learning software: experiences
     from the scikit-learn project, arXiv preprint arXiv:1309.0238 (2013).
[36] LightGBM documentation, URL: https://lightgbm.readthedocs.io/en/latest.
[37] sklearn.metrics.roc_auc_score, URL: https://scikit-learn.org/stable/modules/
     generated/sklearn.metrics.roc_auc_score.html.
[38] F. ALGorain, J. Clark, Bayesian hyper parameter optimization for malware detection, 2021.
     URL: https://github.com/fahadgorain/Bayesian-Hyper-Parameter-Optimization-for-Malware-Detection.
[39] M. Amin, T. A. Tanveer, M. Tehseen, M. Khan, F. A. Khan, S. Anwar, Static malware
     detection and attribution in android byte-code through an end-to-end deep system, Future
     Generation Computer Systems 102 (2020) 112–126.
[40] F. Cohen, Computer viruses: theory and experiments, Computers & security 6 (1987)
     22–35.
[41] Willkoehrsen, Automated model tuning, 2018. URL: https://www.kaggle.com/willkoehrsen/
     automated-model-tuning.
[42] S.-H. Zhang, C.-C. Kuo, C.-S. Yang, Static pe malware type classification using machine
     learning techniques, in: 2019 International Conference on Intelligent Computing and its
     Emerging Applications (ICEA), IEEE, 2019, pp. 81–86.
[43] R. G. Mantovani, T. Horváth, R. Cerri, J. Vanschoren, A. C. de Carvalho, Hyper-parameter
     tuning of a decision tree induction algorithm, in: 2016 5th Brazilian Conference on
     Intelligent Systems (BRACIS), IEEE, 2016, pp. 37–42.
[44] F. Hutter, H. H. Hoos, K. Leyton-Brown, Identifying key algorithm parameters and instance
     features using forward selection, in: International Conference on Learning and Intelligent
     Optimization, Springer, 2013, pp. 364–381.
[45] V. Nguyen, S. Gupta, S. Rana, C. Li, S. Venkatesh, Filtering bayesian optimization approach
     in weakly specified search space, Knowledge and Information Systems 60 (2019) 385–413.
[46] B. Shahriari, A. Bouchard-Côté, N. Freitas, Unbounded bayesian optimization via regular-
     ization, in: Artificial intelligence and statistics, PMLR, 2016, pp. 1168–1176.
[47] H. J. Escalante, M. Montes, L. E. Sucar, Particle swarm model selection., Journal of Machine
     Learning Research 10 (2009).
[48] B. Komer, J. Bergstra, C. Eliasmith, Hyperopt-sklearn: automatic hyperparameter con-
     figuration for scikit-learn, in: ICML workshop on AutoML, volume 9, Citeseer, 2014,
     p. 50.
[49] F. Hutter, H. H. Hoos, K. Leyton-Brown, Sequential model-based optimization for gen-
     eral algorithm configuration, in: International conference on learning and intelligent
     optimization, Springer, 2011, pp. 507–523.
[50] J. Snoek, H. Larochelle, R. P. Adams, Practical bayesian optimization of machine learning
     algorithms, arXiv preprint arXiv:1206.2944 (2012).
[51] G. E. Dahl, T. N. Sainath, G. E. Hinton, Improving deep neural networks for lvcsr using
     rectified linear units and dropout, in: 2013 IEEE international conference on acoustics,
     speech and signal processing, IEEE, 2013, pp. 8609–8613.
[52] G. Melis, C. Dyer, P. Blunsom, On the state of the art of evaluation in neural language
     models, arXiv preprint arXiv:1707.05589 (2017).
[53] J. Singh, J. Singh, A survey on machine learning-based malware detection in executable
     files, Journal of Systems Architecture (2020) 101861.


7. Appendix
Table 5
LightGBM Grid Search Hyperparameter results
          Hyper-parameter                      Grid Search Best Hyper-parameter Settings                Range                Default Value
          boosting_type                                          GBDT                              GBDT,DART,GOSS               GBDT
          num_iteration                                           1000                                500:1000                   100
          learning_rate                                           0.005                               0.005:0.05                  0.1
          num_leaves                                               512                                512:2048                    31
          feature_fraction(subsample)                              1.0                                  0.5:1.0                   1.0
          bagging_fraction(subsample_freq)                         0.5                                  0.5:1.0                   1.0
          objective                                              binary                                 binary                  None




Table 6
LightGBM Random Search Hyperparameter results
           Hyperparameter                          Random Search Best Hyperparameters                       Range             Default Value
          boosting_type                                          GBDT                                GBDT or GOSS              GBDT
          num_iteration                                            60                                     1:100                  100
          learning_rate                                        0.0122281                                0.005:0.05               0.1
          num_leaves                                               150                                    1:512                   31
          feature_fraction(subsample)                              0.8                                    0.5:1.0                1.0
          bagging_fraction(subsample_freq)                         0.8                                    0.5:1.0                1.0
          objective                                              binary                                binary only             None
          min_child_samples                                        165                                    20:500                  20
          reg_alpha                                             0.102041                                  0.0:1.0                0.0
          reg_lambda                                            0.632653                                  0.0:1.0                0.0
          colsample_bytree                                         1.0                                    0.0:1.0                1.0
          subsample                                             0.69697                                   0.5:1.0                1.0
          is_unbalance                                            True                                True or False             False




Table 7
LightGBM AHBO-TPE Search Hyper-parameter results
          Hyperparameter                         AHBO-TPE Search Hyperparameter Results                    Range             Default Value
          boosting_type                                         GBDT                                  GBDT or GOSS              GBDT
          num_iteration                                            26                                       1:100                 100
          learning_rate                                         0.02469                                  0.005:0.05               0.1
          num_leaves                                              229                                       1:512                 31
          feature_fraction(subsample)                           0.78007                                    0.5:1.0                1.0
          bagging_fraction(subsample_freq)                      0.93541                                    0.5:1.0                1.0
          objective                                             binary                                  binary only              None
          min_child_samples                                       145                                      20:500                 20
          reg_alpha                                             0.98803                                    0.0:1.0                0.0
          reg_lambda                                            0.45169                                    0.0:1.0                0.0
          colsample_bytree                                      0.89595                                    0.0:1.0                1.0
          subsample                                             0.63005                                    0.0:1.0                1.0
          is_unbalance                                           True                                  True or False             False
          n_estimators                                            1227                                     1:2000                 100
          Subsample_for_bin                                     160000                                 2000:200000              200000




Table 8
SGD Model AHBO-TPE Search Hyperparameter Results
                Hyperparameter    AHBO-TPE Search Hyperparameter Results                      Range                     Default Value
                   Penalty                         L2                                   L1,L2, elasticnet                    L1
                     Loss                        Hinge                     hinge, log, modified-huber, squared-hinge,      Hinge
                 Max-iterations                    10                                         10:200                        1000
Table 9
RF Model AHBO-TPE Search Hyperparameter Results
               Hyperparameter      AHBO-TPE Search Hyperparameter Results               Range                Default Value
                n_estimators                        100                                 10:100                    10
                 max_depth                           30                                   2:60                  None
                max_features                       auto                             auto,log2,sqrt               auto
              min_samples_split                      10                                   2:10                    2
              min_samples_leaf                       30                                   1:10                    1
                  criterion                        gini                              gini,entropy                gini




Table 10
LR Model AHBO-TPE Search Hyperparameter Results
              Hyperparameter   AHBO-TPE Search Hyperparameter Results                 Range                   Default Value
                 max_iter                       200                                   10:200                       100
                    C                           8.0                                  0.0:20.0                     auto
                  solver                        sag                     ’liblinear’,’lbfgs’, ’sag’, ’saga’        lbfgs




Table 11
KNN Model AHBO-TPE Search Hyperparameter Results
              Hyperparameter       AHBO-TPE Search Hyperparameter Results                  Range         Default Value
               n_neighbors                          15                                      1:31              5




Figure 2: ROC AUC comparison for AHBO-TPE (cyan), Random Search (yellow) and the default benchmark model (red).