=Paper=
{{Paper
|id=Vol-3125/paper9
|storemode=property
|title=Bayesian Hyper-Parameter Optimization for Malware Detection
|pdfUrl=https://ceur-ws.org/Vol-3125/paper6.pdf
|volume=Vol-3125
|authors=Fahad T. AlGorain,John A. Clark
|dblpUrl=https://dblp.org/rec/conf/sgai/AlGorainC21
}}
==Bayesian Hyper-Parameter Optimization for Malware Detection==
Fahad T. ALGorain, John A. Clark
University of Sheffield, Sheffield S10 2TN, UK
AI-CyberSec 2021: Workshop on Artificial Intelligence and Cyber Security, December 14, 2021, Cambridge, UK
Abstract
Malware detection is a major security concern and a great deal of academic and commercial research and
development is directed at it. Machine Learning is a natural technology to address malware detection
and many researchers have investigated its use. However, drawing comparisons between different
techniques is a fraught affair. For example, the performance of ML algorithms often depends significantly
on parametric choices, so the question arises as to what parameter choices are optimal. In this paper,
we investigate the use of a variety of ML algorithms for building malware classifiers and also how best
to tune the parameters of those algorithms – generally known as hyper-parameter optimisation. We
examine the effects of some simple (model-free) ways of parameter tuning together with a state-of-the-art
Bayesian model-building approach. Our work is carried out using EMBER, a major published malware
benchmark dataset of Windows Portable Executable (PE) metadata samples.
Keywords
Hyper-parameter optimisation, Automated Machine Learning, Static Malware Detection, Tree Parzen
Estimators, Bayesian optimisation, Random Search, Grid Search
1. Introduction
Malware is one of the most pressing problems in modern cybersecurity and its detection has
been a longstanding focus for both academic and commercial research and development [1].
Detection must be effective (low false positives and negatives) but also efficient, particularly
in areas such as forensics or threat hunting where vast file storage may need to be scanned
for malware. Furthermore, the malware environment constantly changes and so (re-)training
speed of detectors is also important [2]. Machine Learning (ML) is an obvious avenue to pursue,
with various advantages to harnessing it: an ML approach can significantly reduce manual
effort in developing detectors, giving more rapid deployment; it can play a critical role in the
extraction of insight from malware samples; and ML-based detectors can detect some unseen
malware, e.g. unseen malware that has features that are similar to those of known malware
may be detected because of the loose pattern matching that underpins many ML classification
approaches. A large number of ML techniques have been brought to bear on the malware
detection problem, often attaining good results. However, ML must not be seen as a toolkit
that can simply be thrown at a problem. Many ML techniques are parametrised and the choice
of parameters may make a significant difference to performance. In modern, widely used ML
toolkits, ML algorithms often have many tens of parameters (and sometimes more). This leads
to the thorny issue of how such parameters may be best set, a problem generally referred to
as hyper-parameter optimisation (HPO). HPO is a significant focus of research in optimisation
and has the potential to improve on the results obtained by a specific detection approach, but
also to enable fair comparison of techniques with techniques that are not the specific focus of
investigation: comparing a new technique with ‘vanilla’ variants of existing techniques is a
recurring motif in many studies. Bergstra et al. (2015) [3] advocate that hyper-parametrisation
should be a ‘formal outer loop’ in the learning process, a view we very much support. Manual
tuning is often simply impossible. (As an aside, we observe that many commercial users of
ML spend a great deal of time tuning for their specific needs.) ML toolkits seek to address this
problem by adopting default values for parameters; these defaults have been shown
been shown to work plausibly well over many problems. However, for any specific problem at
hand it is far from clear that the default values will be the best, or even good, choices. We have
significant domain incentives to gain the best possible results for malware detection. Both
false positive (FP) and false negative (FN) classifications have major consequences. The former
lead to significant wasted effort investigating the (non-) malware together with denial of service
to the application concerned, while the latter means malware goes undetected, with potentially
catastrophic consequences. For ML-based malware detection, making high-performing hyper-parameter choices therefore matters. In this paper, we initially explore a variety
of ML techniques applied to classification of a specific form of malware (Windows PE files).
We aim to demonstrate that hyper-parameter optimisation may have a significant effect on
detection of this type of malware. More generally, we argue that it should play an important
part in ML-based malware detection research, and in security applications more widely.
The contributions of our paper are:
1. A demonstration of how well various ML-based Windows Portable Executable (PE) file
classifiers perform when trained with default parameters.
2. An evaluation of various hyper-parameter optimisation approaches applied to this prob-
lem, including:
a) Established model-free techniques, e.g. Grid Search and Random Search; and
b) A Bayesian optimisation model-based approach.
3. A demonstration that for our target problem the optimal choices of ML hyper-parameters
may vary considerably from the toolkit defaults.
Windows PE files are an important malware vector, and their detection has been the focus
of significant research. The work uses the EMBER dataset [4] – a recently published dataset
comprising header and derived information from a million examples of PE files. This dataset
contains examples of malware, benign software, and software of unknown malicious status.
The dataset elements are labelled accordingly and so enable supervised learning. EMBER is now
a major resource for ML and the malware community. Our evaluation includes both functional
performance and efficiency (time to train). The last point above should not be underestimated.
In a great many cases results will simply be sub-optimal if tuning is manual and the complexity
of the parameter space is huge (and for many techniques this will be the case), and, as we have
argued above, the costs of sub-optimality may be considerable.
2. Related Literature
2.1. Windows PE and ML Background
Numerous works have been produced regarding ML-based static Portable Executable (PE)
malware detection such as [5][6][7], but work has often been hampered by the absence of
a standard benchmark dataset. The publication of the EMBER dataset [4] has resolved this
problem. A baseline model for supervised ML has been provided by the authors to aid in using
the dataset as a benchmark. Their dataset is accompanied by Python access routines. In [8] the
authors considered imbalanced data set issues and model training duration (by reducing feature
dimensions). They also applied a static detection method using a Gradient Boosting Decision
Tree Algorithm. Their model achieved better performance than the baseline model with less
training time. (They used feature reduction based on the recommendation of the authors in
[4].) Another approach utilized a subset of the EMBER dataset for their work and compared
different ML models [9]. Their work is mainly concerned with scalability and efficiency. Their
goal was to identify malware families. The proposed Random Forest model achieved a slightly
better performance than the baseline model.
2.2. HPO Related literature
Multiple works have asserted the potential of hyper-parameter optimisation. For example
[10] indicated the importance of parameter tuning for increasing accuracy, indicating that
Random Search works better than Grid Search when tuning neural networks. Also, [11] applied
standard tuning techniques to a decision tree on 102 datasets and calculated the accuracy
differences between tuned and traditional models. [12][13] are concerned with the greedy
forward search, which seeks to identify the most important hyper-parameter to change next.
[14] stressed the importance of single hyperparameters after using sequential model-based
optimisation (SMBO) tuning. ANOVA is used to measure hyperparameter importance. The
authors in [15][16] assessed the performance of hyperparameters across different datasets. [16]
also utilized surrogate models that allow arbitrary hyperparameter configurations to be evaluated, subject to a limit on the number of evaluations carried out. Bayesian optimisation-based (BO)
hyper-parameter search was used in [15][17] to speed up the analysis. The
literature reveals that HPO, and in particular Bayesian optimisation-based (BO) approaches,
have much to offer.
In this paper, we will use Automated Hyper-parameter Bayesian Optimisation with Tree Parzen Estimators (AHBO-TPE), which takes advantage of BO to speed up the search.
3. Hyper-parameter optimisation (HPO)
3.1. Definition and Motivation
Hyper-parameters are parameters of a model that are not updated during the learning process
[17]. The HPO problem is defined similarly by many researchers as a search to find 𝑥*:

𝑥* = arg min_{𝑥 ∈ 𝑋} 𝑓(𝑥)    (1)
where 𝑓 (𝑥) is an objective score to be minimised. Commonly it is an error rate of some form
evaluated on the validation set, e.g. the Root Mean Square Error (RMSE). 𝑥* is the hyper-
parameter vector that gives rise to the lowest score value, and 𝑥 can be any vector of parameters
in the specified domain. HPO seeks the hyper-parameter values that return the lowest score. For
malware and similar classification tasks suitable choices for the objective function are holdout
and cross-validation error. Furthermore, if we consider a loss function for the same problem, a possible choice is the misclassification rate [18]. For our proposed model, the loss function is taken to be 1 − (cross-validated ROC AUC score). For in-depth background about validation
protocols see [19]. Our work also aims to investigate evaluation time. There are two clear ways to reduce it. The first is to use a subset of folds when testing an ML algorithm [20]. The second is to use a subset of the dataset, especially if we have a big dataset [21][22], or to use fewer iterations.
Although HPO has a great deal to offer, it comes at a price, in particular a computational
price. For every hyper-parameter evaluation, we must train the model, make predictions on
the validation set, and then calculate the validation metrics. Developing a robust ML-based
classifier for Windows PE with a credibly sized and diverse dataset such as EMBER is therefore a
significant undertaking. The computational costs involved act as a disincentive to implementing
Bergstra et al.’s formal outer loop. There is a pressing need for traversing the hyper-parameter
space efficiently and we demonstrate how a leading HPO approach allows us to do so in this
paper.
Here, Windows PE files are a means to an end; the same issues apply to detecting other
malware. Although malware is our major interest, our work also seeks to motivate consideration
of HPO more widely in the application of ML in cybersecurity (since HPO issues apply there
too). For more information about the HPO problem interested readers should refer to [18].
3.2. Model-free Blackbox optimisation methods (Search Methods)
Perhaps the two most common HPO methods are Grid Search and Random Search. These
require only an evaluation function to work, i.e. they are what is commonly referred to as
‘blackbox’. In Grid Search the individual parameters are discretised, i.e. a number of specific
values are selected as ‘covering’ the particular parameter space. For example the elements in
the set {0.0, 0.25, 0.5, 0.75, 1.0} could be taken to cover a continuous parameter in the range
[0.0..1.0]. Grid Search evaluates the function over the cross product of the discretised sets of
hyper-parameters.
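As an illustration, the sketch below enumerates the Cartesian product of discretised value sets for a handful of LightGBM-style hyper-parameters. The parameter names and value sets are illustrative (drawn loosely from the ranges in Table 5), the feature matrix X_train and labels y_train are assumed to come from the dataset loading described in Section 4, and hpo_objective is the sketch from Section 3.1.

```python
from itertools import product

# Discretised value sets chosen to 'cover' each hyper-parameter (illustrative)
grid = {
    "learning_rate": [0.005, 0.01, 0.05],
    "num_leaves": [512, 1024, 2048],
    "subsample": [0.5, 0.75, 1.0],
}

best_loss, best_params = float("inf"), None
for values in product(*grid.values()):              # cross product of all value sets
    params = dict(zip(grid.keys(), values))
    loss = hpo_objective(params, X_train, y_train)  # sketch from Section 3.1
    if loss < best_loss:
        best_loss, best_params = loss, params
print(best_params, 1.0 - best_loss)                 # best setting and its ROC AUC
```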
Random Search selects values randomly from the domain of each hyper-parameter. Usually,
the values selected from each domain by Random Search are independent, i.e. the value of one
parameter choice does not affect the value selected for another parameter choice. Furthermore,
for an individual parameter all values generally have the same probability of being selected. It
is possible to relax such properties, producing what is often referred to as a biased stochastic
search. Such biases often encode for domain insight which is not in the spirit of a blackbox
approach. In our work, we adopted a standard or ‘vanilla’ Random Search. Grid Search suffers from the ‘curse of dimensionality’ [23]. As the number of parameters increases or finer-grained discretisation is adopted, the computational complexity mushrooms. Furthermore, Grid Search does not learn from past evaluations; we generally refer to such approaches as uninformed.
Consequently, it may spend a great deal of time evaluating candidates in regions where previous
evaluation of candidates has given rise to poor objective values. Random Search will search
the specified space until a certain number of evaluations, time, or budget has been reached.
It works better than Grid Search when we know the promising hyper-parameter regions and
so we can constrain the stochastic selection of candidates to lie somewhere in such regions
[10][11]. Combining Random Search with more complex strategies provides a guaranteed minimum convergence rate and adds exploration that can improve model-based search [18][24]. Random Search is also
an uninformed method and so takes a long time to identify the best performing hyper-parameter
settings. It is not surprising that uninformed methods can be outperformed by methods that use
evaluation history to judge where to try next; indeed, such guided search usually outperforms
uninformed methods [15][25][26].
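For comparison, a vanilla Random Search over the same kind of space can be sketched as follows, again assuming the hpo_objective sketch from Section 3.1, data from Section 4, and an evaluation budget of 100 iterations; values are drawn independently and uniformly from each parameter's domain.

```python
import numpy as np

rng = np.random.default_rng(0)
budget = 100                                        # fixed evaluation budget

best_loss, best_params = float("inf"), None
for _ in range(budget):
    params = {                                      # independent uniform draws
        "learning_rate": float(rng.uniform(0.005, 0.05)),
        "num_leaves": int(rng.integers(2, 513)),
        "subsample": float(rng.uniform(0.5, 1.0)),
    }
    loss = hpo_objective(params, X_train, y_train)  # sketch from Section 3.1
    if loss < best_loss:
        best_loss, best_params = loss, params
```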
3.3. Bayesian optimisation (BO)
BO has emerged recently as one of the most promising optimisation methods for expensive
blackbox functions. It has gained a lot of traction in the HPO community with significant results
in several areas such as image classification, speech recognition, and neural language modelling. For an in-depth overview of BO, the reader is referred to [17][27]. BO is an informed method that takes past results into consideration to find the best hyper-parameters. It utilizes
those previous results to form a probabilistic model that is based on a probability of the score
given a hyper-parameter which is denoted by the formula: 𝑃 (𝑠𝑐𝑜𝑟𝑒|ℎ𝑦𝑝𝑒𝑟𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟). [28]
refers to the probabilistic model as a surrogate for the objective function denoted by 𝑃 (𝑦|𝑥),
the probability of 𝑦 given 𝑥. The model or surrogate is more straightforward to optimize than
the objective function. BO works to find the next hyper-parameters to be evaluated using the
actual objective function by selecting the best performing hyper-parameters on the surrogate
function. A five-step process to do this is given by [28]. The first step builds a surrogate
probability model of the objective function. The second finds the hyper-parameters with best
results on the surrogate. The third applies those values to the real defined objective function.
The fourth updates the surrogate with this new real objective function result. Steps 2–4 are
repeated until the maximum number of iterations or the budgeted time is reached [29]. BO has two primary components: a probabilistic model and an acquisition function that decides the next place to evaluate.
Furthermore, BO trades off exploration and exploitation; instead of assessing the costly blackbox
function, the acquisition function is cheaply computed and optimized. There are many choices
for the acquisition function but here we use the most common – expected improvement (EI)
[30]. The goal of Bayesian reasoning is to become more accurate as more performance data
is acquired. The five-step process above is repeated to keep the surrogate model updated
after each evaluation of the objective function [15]. BO spends a little more time generating
sets of hyper-parameter choices that are likely to provide real improvements whilst keeping
calls to the actual objective function as low as possible. Practically, the time spent on choosing
the next hyper-parameters to evaluate is often trivial compared to the time spent on the (real)
objective function evaluation. BO can find better hyper-parameters than Random Search in
fewer iterations [25]. This is one of the issues we seek to address in our work: whether AHBO-
TPE can find better hyper-parameters than Random Search with fewer iterations, especially for our target domain – Windows PE file malware detection.
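The five-step loop can be sketched as below. This is a generic illustration using a Gaussian-process surrogate and an expected-improvement acquisition function on a cheap synthetic objective; it is not the TPE-based approach we actually use (Sections 3.4 and 3.5), and all names are illustrative.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def ei(mu, sigma, y_best):
    """Expected improvement for minimisation with threshold y_best."""
    sigma = np.maximum(sigma, 1e-12)
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def objective(x):                 # cheap stand-in for an expensive training/validation run
    return np.sin(3 * x) + 0.1 * x ** 2

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(3, 1))                 # a few initial evaluations
y = np.array([objective(v[0]) for v in X])

for _ in range(20):
    surrogate = GaussianProcessRegressor(normalize_y=True).fit(X, y)   # step 1: surrogate model
    candidates = rng.uniform(-3, 3, size=(500, 1))
    mu, sigma = surrogate.predict(candidates, return_std=True)
    x_next = candidates[np.argmax(ei(mu, sigma, y.min()))]             # step 2: best on surrogate
    y_next = objective(x_next[0])                                      # step 3: real objective
    X, y = np.vstack([X, x_next]), np.append(y, y_next)                # step 4: update history
    # step 5: repeat until the iteration budget is exhausted

print("best x:", X[np.argmin(y)], "best loss:", y.min())
```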
3.4. Sequential Model-Based optimisation (SMBO)
There are several options for the SMBO’s evaluation of the surrogate model 𝑃 (𝑦|𝑥) [15]. One
of the choices is to use Expected Improvement (EI) as defined in the equation below:
𝐸𝐼_{𝑦*}(𝑥) = ∫_{−∞}^{𝑦*} (𝑦* − 𝑦) 𝑃(𝑦|𝑥) d𝑦    (2)
Here 𝑦 * is the threshold value of the objective function, 𝑥 is the hyper-parameter, 𝑦 is the
actual value of the objective function using the hyper-parameters 𝑥, and 𝑃 (𝑦|𝑥) is the surrogate
probability model expressing the probability (density) of 𝑦 given 𝑥. The goal is to find the best
hyper-parameters under function 𝑃 (𝑦|𝑥). The threshold value 𝑦 * is the best objective value
obtained so far. We aim to improve on (i.e. get a lower value than) the best value obtained so far.
For such minimisation problems, if a value 𝑦 is greater than the threshold value, then it is not
an improvement. Only values less than the threshold are improvements. For a value 𝑦 less than the threshold, 𝑦* − 𝑦 is the improvement. Integrating over all such improvements, weighted by the density function 𝑃(𝑦|𝑥), gives the overall expected improvement given the vector of
parameter values 𝑥. When better values of 𝑥 are found (i.e. giving rise to actual improvements
in the real objective function) the threshold value 𝑦 * is updated. The above description is an
idealised view of Expected Improvement. In practice the choice of threshold value is more
flexible, i.e. 𝑦 * need not be the best objective value witnessed so far; this is actually the case for
the Tree-Parzen Estimator approach outlined immediately below.
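As a concrete check of Equation (2), the snippet below evaluates the expected improvement both by direct numerical integration and via the standard closed form that holds when the surrogate 𝑃(𝑦|𝑥) is assumed Gaussian (an assumption made purely for illustration; TPE, described next, models the densities differently). The numerical values are arbitrary.

```python
import numpy as np
from scipy.stats import norm

def ei_gaussian(mu, sigma, y_star):
    """Closed-form EI for minimisation when P(y|x) = N(mu, sigma^2)."""
    z = (y_star - mu) / sigma
    return (y_star - mu) * norm.cdf(z) + sigma * norm.pdf(z)

mu, sigma, y_star = 0.30, 0.05, 0.28     # arbitrary surrogate mean/std and threshold

# Direct numerical integration of Equation (2) over y < y_star
y = np.linspace(mu - 8 * sigma, y_star, 20001)
ei_numeric = np.trapz((y_star - y) * norm.pdf(y, mu, sigma), y)

print(ei_gaussian(mu, sigma, y_star), ei_numeric)   # both approximately 0.0115
```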
3.5. Tree-Structured Parzen Estimators (TPE)
TPE constructs its model using Bayes’ rule. Its surrogate 𝑃(𝑦|𝑥) is obtained from 𝑃(𝑥|𝑦), which is built from two component densities: one models hyper-parameter values whose objective scores are less than a threshold, the other models those whose scores are greater than or equal to it.
𝑃(𝑥|𝑦) = 𝑙(𝑥) if 𝑦 < 𝑦*,  𝑔(𝑥) if 𝑦 ≥ 𝑦*    (3)
TPE uses 𝑦 * to be some quantile 𝛾 of the observed 𝑦 values, i.e. such that 𝑃 (𝑦 < 𝑦 * ) = 𝛾 [3].
This allows data to be available to construct the indicated densities. 𝑙(𝑥) is a density based
on the set of evaluated values of 𝑥 that have been found to give objective values less than the
threshold. 𝑔(𝑥) is the density based on the remaining evaluated 𝑥 values. Here 𝑃(𝑥|𝑦) is the density of hyper-parameter 𝑥 given an objective function score of 𝑦. The surrogate 𝑃(𝑦|𝑥) is then expressed via Bayes’ rule as follows [15]:
𝑃(𝑦|𝑥) = 𝑃(𝑥|𝑦) 𝑃(𝑦) / 𝑃(𝑥)    (4)
Bergstra et al. (2012) also show that to maximise improvement we should seek parameters 𝑥
with high probability under 𝑙(𝑥) and low probability under 𝑔(𝑥). Thus, they seek to maximise 𝑙(𝑥)/𝑔(𝑥). The best such 𝑥 is then evaluated in the actual objective function and is expected to yield a better value. The surrogate model estimates the objective function; if the
hyper-parameter that is selected does not make an improvement, the model won’t be updated.
The updates are based upon previous history/trials of the objective function evaluation. As
mentioned before, previous trials are stored as (score, hyper-parameters) pairs by the algorithm after building the below-threshold density 𝑙(𝑥) and the above-threshold density 𝑔(𝑥). It
uses the history of these previous trials to improve the objective function with each iteration.
The motivation to use TPE with SMBO to reduce time and find better hyper-parameters came
from other papers [15][25][31][32]. We implement SMBO using Hyperopt [3], a Python library for BO/SMBO. Hyperopt makes SMBO an interchangeable component that can be applied to any search problem. Hyperopt supports several algorithms, but we use only TPE in
our work. Our contribution lies in the demonstration of the usefulness of SMBO using
TPE for malware classification purposes.
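A minimal, self-contained example of driving TPE through Hyperopt's fmin interface is sketched below, using a trivial synthetic loss so that it runs quickly; in our experiments the loss would instead be the cross-validated (1 − ROC AUC) objective of Section 3.1.

```python
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK

# Toy search space: a single continuous hyper-parameter
space = {"x": hp.uniform("x", -5.0, 5.0)}

def loss_fn(params):
    # Synthetic loss with minimum at x = 2; a real run would train and validate a model here
    loss = (params["x"] - 2.0) ** 2
    return {"loss": loss, "status": STATUS_OK}

trials = Trials()                                   # stores the (loss, hyper-parameters) history
best = fmin(fn=loss_fn, space=space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best, min(trials.losses()))
```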
4. Experiments
Here we outline the experiments carried out and provide sample data and execution environment
details. Discussion of results is given in the next section.
4.1. Execution Environment and Dataset
Our machine learning models evaluation used Scikit-learn [33] and Hyperopt [3]. The experi-
ments were carried out on a Windows 10 (64-bit) machine with 8 GB RAM and an AMD Ryzen 5 3550H with Radeon Vega Mobile Gfx at 2.10 GHz, and on a MacBook Air (macOS Catalina 10.15) with a 1.8 GHz dual-core Intel i5, 8 GB 1600 MHz DDR3 and Intel HD Graphics 6000 (1536 MB). Version 2018 of the EMBER dataset [34] was used. It contains
1M samples in total. We used 300k benign and 300k malicious samples for training with 100k
benign and 100k malicious samples for testing purposes. The 200k unlabelled examples of the
dataset were not used in our experiments. Furthermore, the results were obtained using Jupyter
Notebook version 6.1.0 and Python version 3.6.0.
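For reference, loading the vectorised EMBER 2018 features might look like the sketch below, assuming the ember helper package that accompanies the dataset (its read_vectorized_features routine returns pre-extracted feature matrices); the directory path is a placeholder.

```python
import ember  # helper package distributed with the EMBER dataset

# Labels in EMBER 2018 are 1 (malicious), 0 (benign) and -1 (unlabelled)
X_train, y_train, X_test, y_test = ember.read_vectorized_features("/path/to/ember2018/")

# Keep only the labelled samples, as in our experiments
train_mask, test_mask = y_train != -1, y_test != -1
X_train, y_train = X_train[train_mask], y_train[train_mask]
X_test, y_test = X_test[test_mask], y_test[test_mask]
```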
4.2. Experiments with Default Settings
Table 1 shows the results when various ML techniques are applied with default parameter
settings. The techniques include well-established approaches (Stochastic Gradient Descent
classifier (SGD), Logistic Regression classifier (LR), Gaussian Naïve Bayes (GNB), K-nearest
Neighbour (KNN), and Random Forest (RF) [33], [35]) and a state-of-the-art approach – Light-
GBM [36]. This technique has over a hundred parameters and so introduces major challenges
for hyper-parametrisation. Some of its categorical parameters (e.g. boosting type) give rise
to conditional parameters. For initial experiments we adopted the default parameter settings
adopted by the Scikit-Learn toolkit for all techniques other than LightGBM (which has its own
defaults). The evaluation metric is Area Under the Receiver Operating Characteristic
Curve (ROC AUC) [37]. ROC AUC plays an important role in many security classification
tasks, e.g. it also occurs frequently as an evaluation metric in intrusion detection research. Grid
Search results were obtained using the MacBook Air, while the rest (AHBO-TPE and Random
Search) were obtained under the Windows 10 operating system for faster performance.
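A sketch of the default-settings comparison is given below, assuming the data loading above and using the scikit-learn estimators named in Table 1 together with LightGBM's scikit-learn interface; it is illustrative rather than our exact experimental script.

```python
import lightgbm as lgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

models = {
    "GNB": GaussianNB(),
    "SGD": SGDClassifier(),
    "LightGBM": lgb.LGBMClassifier(),
    "RF": RandomForestClassifier(),
    "LR": LogisticRegression(),
    "KNN": KNeighborsClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)                      # default hyper-parameters throughout
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(X_test)[:, 1]   # probability of the malware class
    else:
        scores = model.decision_function(X_test)     # e.g. the default hinge-loss SGD
    print(name, roc_auc_score(y_test, scores))
```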
4.3. Hyper-parameter Optimisation
The most promising of the evaluated ML algorithms, taking into account functional perfor-
mance and speed of training, was LightGBM. We chose to further explore hyper-parameter optimisation on this technique. Since LightGBM has over 100 parameters, some of which
are continuous, we simply cannot do exhaustive search. Accordingly, we have had to select
parameters as a focus in this work. We focused on what we believe are the most important
parameters. For Grid Search we had to be especially selective in what we optimised. Moreover, for Random Search we specified a budget of 100 iterations. We examine
Grid Search, Random Search, and AHBO with Tree Parzen Estimators as HPO approaches. We
therefore provide comparison between model-free (blackbox) approaches (Grid Search and
Random Search), and AHBO-TPE, an approach that uses evaluation experience to continually
update its model and suggest the next hyper-parameter values. We applied AHBO-TPE in two phases: in the first we allowed it 3 iterations, while in the second it was allowed 100 more iterations for fair comparison with Random Search.
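To illustrate the two-phase AHBO-TPE set-up, the sketch below defines a LightGBM search space loosely mirroring the ranges in Tables 6 and 7 (using scikit-learn style parameter names, with a few bounds nudged to keep the sketch valid) and reuses the hpo_objective sketch from Section 3.1; passing the same Trials object to a second fmin call continues the search, since max_evals counts total evaluations.

```python
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK

space = {
    "boosting_type": hp.choice("boosting_type", ["gbdt", "goss"]),
    "learning_rate": hp.uniform("learning_rate", 0.005, 0.05),
    "num_leaves": hp.quniform("num_leaves", 2, 512, 1),
    "min_child_samples": hp.quniform("min_child_samples", 20, 500, 5),
    "reg_alpha": hp.uniform("reg_alpha", 0.0, 1.0),
    "reg_lambda": hp.uniform("reg_lambda", 0.0, 1.0),
    "colsample_bytree": hp.uniform("colsample_bytree", 0.1, 1.0),
    "subsample": hp.uniform("subsample", 0.5, 1.0),
    "n_estimators": hp.quniform("n_estimators", 10, 2000, 10),
    "is_unbalance": hp.choice("is_unbalance", [True, False]),
}

def lgbm_loss(params):
    for key in ("num_leaves", "min_child_samples", "n_estimators"):
        params[key] = int(params[key])               # quniform returns floats
    return {"loss": hpo_objective(params, X_train, y_train), "status": STATUS_OK}

trials = Trials()
fmin(lgbm_loss, space, algo=tpe.suggest, max_evals=3, trials=trials)     # phase 1: 3 evaluations
fmin(lgbm_loss, space, algo=tpe.suggest, max_evals=103, trials=trials)   # phase 2: 100 more
print("best ROC AUC:", 1.0 - min(trials.losses()))
```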
4.4. Additional Materials
Implementation details of our experiments can be found on our github repository [38].
5. Results and discussion
We have carried out our experiments on a benchmark dataset that has been assembled, in part,
to include examples that present challenges to ML classification approaches [34]. The results
given in Table 1 show that the major ML approaches vary hugely in their suitability for the static malware classification task. Our benchmark dataset comprises summaries
of Windows PE files. It would seem prudent for malware detection researchers to evaluate
multiple ML algorithms for non-Windows malware detection too. The results also show the
clear promise of using a particular state-of-the-art algorithm – LightGBM – for the malware
detection task. Taking both time and performance into account, Table 1 shows that LightGBM is
clearly the best performing approach. The subsequent tables summarise our attempts to apply
HPO approaches to the most promising of the original ML techniques. We can see in Table
2 that HPO can offer significant improvements. Random search performs very well. So
does AHBO-TPE, but the initial optimisation is far more efficient. Figure 1 further illustrates
how Random Search and AHBO-TPE evolve as iteration number increases. The tool default
parameter choices cannot be relied upon to produce the best or even good results. Tables 5, 6
and 7 illustrate the difficulty of manually tuning parameters for this task. In some cases the
defaults and the best found values are at the opposite ends of the parameter ranges,
e.g. the bagging fraction in Table 5. Many are significantly different to the default value, e.g.
𝑛𝑢𝑚_𝑙𝑒𝑎𝑣𝑒𝑠 in Tables 6 and 7 and 𝑛_𝑒𝑠𝑡𝑖𝑚𝑎𝑡𝑜𝑟𝑠 of Table 7. Some binary choices are reversed,
e.g. 𝑜𝑏𝑗𝑒𝑐𝑡𝑖𝑣𝑒 and 𝑖𝑠_𝑢𝑛𝑏𝑎𝑙𝑎𝑛𝑐𝑒 of Table 6. Further results are given in the appendix, where we use AHBO-TPE with the selected ML techniques. The results are shown with 10 iterations
(a constraint imposed for reasons of computational practicality) and 3 fold cross validation.
Moreover, the most promising hyper-parameters found for each ML model are given in Tables 7, 8, 9, 10 and 11. Also, Figure 2 gives a comparison between our search methods and the
benchmark model.
Table 1
Comparison of ML Models with Default Parameter Settings
ML Model | Time to train | Score (ROC AUC)
GNB | 11 min 56 s | 0.406
SGD | 19 min 11 s | 0.563
LightGBM Benchmark | 26 min | 0.922
RF | 57 min 52 s | 0.90
LR | 1 hr 44 min | 0.598
KNN | 3 hr 14 min 59 s | 0.745
Table 2
Search Method Score Comparison
Search Method | Best ROC AUC Score | Number of Iterations | Time to Complete Search
Benchmark LightGBM model (non-optimised) | 0.922 | 100 | 26 min
Grid Search optimisation | 0.944 | 965 | almost 3 months
Random Search optimisation | 0.955 | 60 | 15 days, 13 hr 12 min
AHBO-TPE initial optimisation | 0.955 | 3 | 4 hr
AHBO-TPE with extra 100-iteration optimisation | 0.957 | 26 | 27 days, 23 hr 11 min
Table 3
Selected ML Models with AHBO-TPE: Score Comparison
ML Model | Score (ROC AUC) | Score (ROC AUC) After Optimisation
GNB | 0.406 | same
SGD | 0.563 | 0.597
LightGBM Benchmark | 0.922 | 0.957
RF | 0.901 | 0.936
LR | 0.598 | 0.618
KNN | 0.745 | 0.774
Figure 1: Highest validation score across iterations for AHBO-TPE (yellow) and Random Search (blue) hyper-parameter optimisation; the star marks the best score achieved.
Table 4
Selected ML Models with AHBO-TPE: Performance Results
ML Model | Time to train | Training Time Reduction
GNB | 11 min 56 s | same
SGD | 4 min 35 s | 14 min 35 s
LightGBM Benchmark | 18 min 30 s | 8 min
RF | 31 min 14 s | 26 min
LR | 1 hr 5 min 37 s | 38 min
KNN | 4 hr 37 min 30 s | increased by 1 hr 23 min 29 s
6. Limitations and Conclusions
6.1. Limitations
A significant limitation is that the work uses a specific dataset (EMBER) concerned with Windows
PE files. This limits rigorously founded generalisation of results. Evaluation using further
malware datasets would build further insight into applicability beyond the direct case study
malware type reported here. Also we have used a single (albeit highly effective) ‘informed’
hyper-parametrisation approach. The use of other informed hyper-parametrisation approaches
would provide further insight and possible improvements. For practical purposes we informally
identified plausible parameters that should be subject to variation and allowed the remaining
ones to be set at the defaults. It is possible that improvements in results could be obtained by
allowing variation in those parameters fixed at their default values.
6.2. Conclusions
The results overall would suggest that researchers in malware and ML are missing a significant
opportunity to improve results attained by specific techniques of interest. The importance of
sound optimisation in this domain is considerable: every improvement matters to the security
of the system and there are major cost implications for gaining improvements. As we argued
earlier, sub-optimality may have major costs. Researchers who apply particular techniques to
malware detection are interested in gaining maximal performance and showing their technique
(accurately) at its best. We need a fair comparison of techniques at their best. Fair comparison
is also crucial to the development of the field. HPO has the ability to place ML-based malware
detection on a sound empirical footing. We recommend hyper-parameter optimisation for
malware detection and the ML community. We also recommend Bayesian optimisation as a
particularly promising approach, with informed approaches generally being an important avenue for future research. For future work, investigating the applicability of our approach in dynamic malware detection settings seems worthwhile. The major available dataset (EMBER)
contains unknown malware samples. Accordingly, applying HPO in a semi-supervised regime
seems a plausible avenue to pursue.
We have shown how Bayesian Optimisation can be harnessed to advantage. We have applied
a leading state-of-the-art approach to show how malware detection can be enhanced. We should
be alert to major developments in the HPO field to ensure we deliver the very best outcomes for
malware detection stakeholders. In summary: the parametrisation of ML approaches exerts
significant effect on the results obtained; Hyper-parameter Optimisation is an important device
for obtaining best results for any specific technique; it is arguably an essential component for
the rigorous comparison of techniques, which is a fundamental criterion for the development
of the topic. It would be very useful for the ML-based malware detection community to agree
comparison protocols. The above are our conclusions for ML applied to malware detection
(here malicious Windows PE files). We propose that HPO be an essential element of the ML
process for malware detection applications.
References
[1] A. K. Pandey, A. K. Tripathi, G. Kapil, V. Singh, M. W. Khan, A. Agrawal, R. Kumar, R. A.
Khan, Trends in malware attacks: Identification and mitigation strategies, in: Critical
Concepts, Standards, and Techniques in Cyber Forensics, IGI Global, 2020, pp. 47–60.
[2] A. Al-Sabaawi, K. Al-Dulaimi, E. Foo, M. Alazab, Addressing malware attacks on connected
and autonomous vehicles: Recent techniques and challenges, in: Malware Analysis Using
Artificial Intelligence and Deep Learning, Springer, 2021, pp. 97–119.
[3] J. Bergstra, B. Komer, C. Eliasmith, D. Yamins, D. D. Cox, Hyperopt: a python library for
model selection and hyperparameter optimization, Computational Science & Discovery 8
(2015) 014008.
[4] H. S. Anderson, P. Roth, Ember: an open dataset for training static pe malware machine
learning models, arXiv preprint arXiv:1804.04637 (2018).
[5] M. G. Schultz, E. Eskin, F. Zadok, S. J. Stolfo, Data mining methods for detection of new
malicious executables, in: Proceedings 2001 IEEE Symposium on Security and Privacy.
S&P 2001, IEEE, 2000, pp. 38–49.
[6] J. Z. Kolter, M. A. Maloof, Learning to detect and classify malicious executables in the
wild., Journal of Machine Learning Research 7 (2006).
[7] E. Raff, J. Barker, J. Sylvester, R. Brandon, B. Catanzaro, C. K. Nicholas, Malware detection
by eating a whole exe, in: Workshops at the Thirty-Second AAAI Conference on Artificial
Intelligence, 2018.
[8] H.-D. Pham, T. D. Le, T. N. Vu, Static pe malware detection using gradient boosting decision
trees algorithm, in: International Conference on Future Data and Security Engineering,
Springer, 2018, pp. 228–236.
[9] C. Fawcett, H. H. Hoos, Analysing differences between algorithm configurations through
ablation, Journal of Heuristics 22 (2016) 431–458.
[10] J. Bergstra, Y. Bengio, Random search for hyper-parameter optimization., Journal of
machine learning research 13 (2012).
[11] F. Hutter, H. Hoos, K. Leyton-Brown, An efficient approach for assessing hyperparameter
importance, in: International conference on machine learning, PMLR, 2014, pp. 754–762.
[12] J. N. Van Rijn, F. Hutter, Hyperparameter importance across datasets, in: Proceedings of
the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining,
2018, pp. 2367–2376.
[13] A. Biedenkapp, M. Lindauer, K. Eggensperger, F. Hutter, C. Fawcett, H. Hoos, Efficient
parameter importance analysis via ablation with surrogates, in: Proceedings of the AAAI
Conference on Artificial Intelligence, volume 31, 2017.
[14] K. Eggensperger, M. Lindauer, H. H. Hoos, F. Hutter, K. Leyton-Brown, Efficient bench-
marking of algorithm configurators via model-based surrogates, Machine Learning 107
(2018) 15–41.
[15] J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-parameter optimiza-
tion, in: 25th annual conference on neural information processing systems (NIPS 2011),
volume 24, Neural Information Processing Systems Foundation, 2011.
[16] P. Probst, A.-L. Boulesteix, B. Bischl, Tunability: Importance of hyperparameters of
machine learning algorithms., J. Mach. Learn. Res. 20 (2019) 1–32.
[17] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, N. De Freitas, Taking the human out of
the loop: A review of bayesian optimization, Proceedings of the IEEE 104 (2015) 148–175.
[18] M. Feurer, F. Hutter, Hyperparameter optimization, in: Automated Machine Learning,
Springer, Cham, 2019, pp. 3–33.
[19] B. Bischl, O. Mersmann, H. Trautmann, C. Weihs, Resampling methods for meta-model val-
idation with recommendations for evolutionary computation, Evolutionary computation
20 (2012) 249–275.
[20] C. Thornton, F. Hutter, H. H. Hoos, K. Leyton-Brown, Auto-weka: Combined selection
and hyperparameter optimization of classification algorithms, in: Proceedings of the 19th
ACM SIGKDD international conference on Knowledge discovery and data mining, 2013,
pp. 847–855.
[21] A. Klein, S. Falkner, S. Bartels, P. Hennig, F. Hutter, et al., Fast bayesian hyperparameter
optimization on large datasets, Electronic Journal of Statistics 11 (2017) 4945–4968.
[22] O. Maron, A. W. Moore, The racing algorithm: Model selection for lazy learners, Artificial
Intelligence Review 11 (1997) 193–225.
[23] R. Bellman, Dynamic Programming, Princeton University Press, Princeton, New Jersey, 1957.
[24] F. Hutter, H. Hoos, K. Leyton-Brown, An evaluation of sequential model-based optimization
for expensive blackbox functions, in: Proceedings of the 15th annual conference companion
on Genetic and evolutionary computation, 2013, pp. 1209–1216.
[25] J. Bergstra, D. Yamins, D. Cox, Making a science of model search: Hyperparameter opti-
mization in hundreds of dimensions for vision architectures, in: International conference
on machine learning, PMLR, 2013, pp. 115–123.
[26] S. Falkner, A. Klein, F. Hutter, Bohb: Robust and efficient hyperparameter optimization at
scale, in: International Conference on Machine Learning, PMLR, 2018, pp. 1437–1446.
[27] E. Brochu, V. M. Cora, N. De Freitas, A tutorial on bayesian optimization of expensive
cost functions, with application to active user modeling and hierarchical reinforcement
learning, arXiv preprint arXiv:1012.2599 (2010).
[28] I. Dewancker, M. McCourt, S. Clark, Bayesian optimization primer, 2015. URL:
https://static.sigopt.com/b/20a144d208ef255d3b981ce419667ec25d8412e2/static/pdf/
SigOpt_Bayesian_Optimization_Primer.pdf.
[29] M. Feurer, A. Klein, K. Eggensperger, J. T. Springenberg, M. Blum, F. Hutter, Auto-sklearn:
efficient and robust automated machine learning, in: Automated Machine Learning,
Springer, Cham, 2019, pp. 113–134.
[30] D. R. Jones, M. Schonlau, W. J. Welch, Efficient global optimization of expensive black-box functions, J. Global Optim. 13 (1998) 455–492.
[31] K. Eggensperger, M. Feurer, F. Hutter, J. Bergstra, J. Snoek, H. Hoos, K. Leyton-Brown,
Towards an empirical foundation for assessing bayesian optimization of hyperparameters,
in: NIPS workshop on Bayesian Optimization in Theory and Practice, volume 10, 2013,
p. 3.
[32] L. Yang, A. Shami, On hyperparameter optimization of machine learning algorithms:
Theory and practice, Neurocomputing 415 (2020) 295–316.
[33] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in python, the
Journal of machine Learning research 12 (2011) 2825–2830.
[34] H. S. Anderson, P. Roth, elastic/ember, 2021. URL: https://github.com/elastic/ember/blob/
master/README.md.
[35] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Pretten-
hofer, A. Gramfort, J. Grobler, et al., Api design for machine learning software: experiences
from the scikit-learn project, arXiv preprint arXiv:1309.0238 (2013).
[36] LightGBM documentation, URL: https://lightgbm.readthedocs.io/en/latest.
[37] sklearn.metrics.roc_auc_score, scikit-learn documentation, URL: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html.
[38] F. ALGorain, J. Clark, Bayesian hyper parameter optimization
for malware detection, 2021. URL: https://github.com/fahadgorain/
Bayesian-Hyper-Parameter-Optimization-for-Malware-Detection.
[39] M. Amin, T. A. Tanveer, M. Tehseen, M. Khan, F. A. Khan, S. Anwar, Static malware
detection and attribution in android byte-code through an end-to-end deep system, Future
Generation Computer Systems 102 (2020) 112–126.
[40] F. Cohen, Computer viruses: theory and experiments, Computers & security 6 (1987)
22–35.
[41] Willkoehrsen, Automated model tuning, 2018. URL: https://www.kaggle.com/willkoehrsen/
automated-model-tuning.
[42] S.-H. Zhang, C.-C. Kuo, C.-S. Yang, Static pe malware type classification using machine
learning techniques, in: 2019 International Conference on Intelligent Computing and its
Emerging Applications (ICEA), IEEE, 2019, pp. 81–86.
[43] R. G. Mantovani, T. Horváth, R. Cerri, J. Vanschoren, A. C. de Carvalho, Hyper-parameter
tuning of a decision tree induction algorithm, in: 2016 5th Brazilian Conference on
Intelligent Systems (BRACIS), IEEE, 2016, pp. 37–42.
[44] F. Hutter, H. H. Hoos, K. Leyton-Brown, Identifying key algorithm parameters and instance
features using forward selection, in: International Conference on Learning and Intelligent
Optimization, Springer, 2013, pp. 364–381.
[45] V. Nguyen, S. Gupta, S. Rana, C. Li, S. Venkatesh, Filtering bayesian optimization approach
in weakly specified search space, Knowledge and Information Systems 60 (2019) 385–413.
[46] B. Shahriari, A. Bouchard-Côté, N. Freitas, Unbounded bayesian optimization via regular-
ization, in: Artificial intelligence and statistics, PMLR, 2016, pp. 1168–1176.
[47] H. J. Escalante, M. Montes, L. E. Sucar, Particle swarm model selection., Journal of Machine
Learning Research 10 (2009).
[48] B. Komer, J. Bergstra, C. Eliasmith, Hyperopt-sklearn: automatic hyperparameter con-
figuration for scikit-learn, in: ICML workshop on AutoML, volume 9, Citeseer, 2014,
p. 50.
[49] F. Hutter, H. H. Hoos, K. Leyton-Brown, Sequential model-based optimization for gen-
eral algorithm configuration, in: International conference on learning and intelligent
optimization, Springer, 2011, pp. 507–523.
[50] J. Snoek, H. Larochelle, R. P. Adams, Practical bayesian optimization of machine learning
algorithms, arXiv preprint arXiv:1206.2944 (2012).
[51] G. E. Dahl, T. N. Sainath, G. E. Hinton, Improving deep neural networks for lvcsr using
rectified linear units and dropout, in: 2013 IEEE international conference on acoustics,
speech and signal processing, IEEE, 2013, pp. 8609–8613.
[52] G. Melis, C. Dyer, P. Blunsom, On the state of the art of evaluation in neural language
models, arXiv preprint arXiv:1707.05589 (2017).
[53] J. Singh, J. Singh, A survey on machine learning-based malware detection in executable
files, Journal of Systems Architecture (2020) 101861.
7. Appendix
Table 5
LightGBM Grid Search Hyper-parameter Results
Hyper-parameter | Grid Search Best Setting | Range | Default Value
boosting_type | GBDT | GBDT, DART, GOSS | GBDT
num_iteration | 1000 | 500:1000 | 100
learning_rate | 0.005 | 0.005:0.05 | 0.1
num_leaves | 512 | 512:2048 | 31
feature_fraction (subsample) | 1.0 | 0.5:1.0 | 1.0
bagging_fraction (subsample_freq) | 0.5 | 0.5:1.0 | 1.0
objective | binary | binary | None
Table 6
LightGBM Random Search Hyper-parameter Results
Hyper-parameter | Random Search Best Setting | Range | Default Value
boosting_type | GBDT | GBDT or GOSS | GBDT
num_iteration | 60 | 1:100 | 100
learning_rate | 0.0122281 | 0.005:0.05 | 0.1
num_leaves | 150 | 1:512 | 31
feature_fraction (subsample) | 0.8 | 0.5:1.0 | 1.0
bagging_fraction (subsample_freq) | 0.8 | 0.5:1.0 | 1.0
objective | binary | binary only | None
min_child_samples | 165 | 20:500 | 20
reg_alpha | 0.102041 | 0.0:1.0 | 0.0
reg_lambda | 0.632653 | 0.0:1.0 | 0.0
colsample_bytree | 1.0 | 0.0:1.0 | 1.0
subsample | 0.69697 | 0.5:1.0 | 1.0
is_unbalance | True | True or False | False
Table 7
LightGBM AHBO-TPE Search Hyper-parameter Results
Hyper-parameter | AHBO-TPE Best Setting | Range | Default Value
boosting_type | GBDT | GBDT or GOSS | GBDT
num_iteration | 26 | 1:100 | 100
learning_rate | 0.02469 | 0.005:0.05 | 0.1
num_leaves | 229 | 1:512 | 31
feature_fraction (subsample) | 0.78007 | 0.5:1.0 | 1.0
bagging_fraction (subsample_freq) | 0.93541 | 0.5:1.0 | 1.0
objective | binary | binary only | None
min_child_samples | 145 | 20:500 | 20
reg_alpha | 0.98803 | 0.0:1.0 | 0.0
reg_lambda | 0.45169 | 0.0:1.0 | 0.0
colsample_bytree | 0.89595 | 0.0:1.0 | 1.0
subsample | 0.63005 | 0.0:1.0 | 1.0
is_unbalance | True | True or False | False
n_estimators | 1227 | 1:2000 | 100
subsample_for_bin | 160000 | 2000:200000 | 200000
Table 8
SGD Model AHBO-TPE Search Hyper-parameter Results
Hyper-parameter | AHBO-TPE Best Setting | Range | Default Value
penalty | L2 | L1, L2, elasticnet | L1
loss | hinge | hinge, log, modified_huber, squared_hinge | hinge
max_iter | 10 | 10:200 | 1000
Table 9
RF Model AHBO-TPE Search Hyper-parameter Results
Hyper-parameter | AHBO-TPE Best Setting | Range | Default Value
n_estimators | 100 | 10:100 | 10
max_depth | 30 | 2:60 | None
max_features | auto | auto, log2, sqrt | auto
min_samples_split | 10 | 2:10 | 2
min_samples_leaf | 30 | 1:10 | 1
criterion | gini | gini, entropy | gini
Table 10
LR Model AHBO-TPE Search Hyper-parameter Results
Hyper-parameter | AHBO-TPE Best Setting | Range | Default Value
max_iter | 200 | 10:200 | 100
C | 8.0 | 0.0:20.0 | auto
solver | sag | liblinear, lbfgs, sag, saga | lbfgs
Table 11
KNN Model AHBO-TPE Search Hyper-parameter Results
Hyper-parameter | AHBO-TPE Best Setting | Range | Default Value
n_neighbors | 15 | 1:31 | 5
Figure 2: ROC AUC comparison for AHBO-TPE (cyan), Random Search (yellow) and the default benchmark model (red).