=Paper=
{{Paper
|id=Vol-3920/paper5
|storemode=property
|title=PEVuln: A Benchmark Dataset for Using Machine Learning to Detect Vulnerabilities in PE Malware
|pdfUrl=https://ceur-ws.org/Vol-3920/paper05.pdf
|volume=Vol-3920
|authors=Nathan Ross,Oluwafemi Olukoya,Jesús Martínez del Rincón,Domhnall Carlin
|dblpUrl=https://dblp.org/rec/conf/camlis/RossORC24
}}
==PEVuln: A Benchmark Dataset for Using Machine Learning to Detect Vulnerabilities in PE Malware==
Nathan Ross1,*, Oluwafemi Olukoya1, Jesús Martínez del Rincón1 and Domhnall Carlin1
1 Centre for Secure Information Technologies (CSIT), Queen’s University Belfast
Abstract
In this paper, we present a benchmark dataset for training and evaluating static PE malware machine learning
models, specifically for detecting known vulnerabilities in malware. Our goal is to enable further research
in defense against malware by exploiting their bugs or weaknesses. After recognising limitations in current
malware datasets regarding exploitable malware, our dataset addresses these gaps by utilizing the malware
vulnerability database Malvuln, and software vulnerability database ExploitDB to create a new malware dataset
with 684 vulnerable malware samples, 35,241 non-vulnerable malware samples, 1,425 vulnerable benign samples,
and 7,905 non-vulnerable benign samples, detailed with timestamps, families, threat mapping, vulnerability
mapping, and obfuscation analysis. This 4-class dataset lays the foundation for advancing future research in
analysis and vulnerability exploitation in malware using machine learning. We also provide baseline results using
state-of-the-art models for malware classification to benchmark the performance of the dataset, where the binary
tasks achieve F1 scores above 0.90, while the multi-class task attains an F1-Score of 0.958.
Keywords
Dataset, Machine Learning, Malware, Vulnerabilities
1. Introduction
Given the rapid evolution of cybersecurity and cyberattacks, relying solely on techniques such as signature-
based detection [1, 2, 3] is inadequate for detecting and preventing malware from infecting devices,
systems, and networks. This poses a significant threat to critical systems that contain private information.
When malware does infect a system, whether by exploitation of an open port, an unpatched software
bug, or user error such as falling for a phishing attack or misconfiguring a service, the cost can be significant,
both monetarily and reputationally, and that is assuming a security expert can identify the problem and halt
the spread of malware, close a backdoor, or reverse an encryption attack from ransomware. In this
context, AI-based malware detection approaches offer a powerful and effective line of defense.
They, however, rely on annotated and curated data repositories for their training, evaluation, and
updates.
Currently, some datasets enable malware detection through the use of machine learning (ML) such
as BODMAS[4], EMBER[5], SOREL-20M[6], and the Microsoft Malware Classification Challenge (BIG
2015)[7]. They allow the user to train a model to identify patterns and behaviors in the features
extracted from the malicious and benign samples. These datasets usually contain annotations on
malicious samples versus benign samples and information on the malware families and timestamps.
However, none of the datasets currently available to train ML approaches include information on the
exploitable vulnerabilities that exist in the malware. The availability of annotated malware samples with
known proof of concept (POC) vulnerabilities will not only create ML-based malware detection models
CAMLIS’24: Conference on Applied Machine Learning for Information Security, October 24–25, 2024, Arlington, VA
* Corresponding author.
Emails: nross12@qub.ac.uk (N. Ross); o.olukoya@qub.ac.uk (O. Olukoya); j.martinez-del-rincon@qub.ac.uk (J. M. d. Rincón); d.carlin@qub.ac.uk (D. Carlin)
Web: https://pure.qub.ac.uk/en/persons/nathan-ross (N. Ross); https://pure.qub.ac.uk/en/persons/oluwafemi-olukoya (O. Olukoya); https://pure.qub.ac.uk/en/persons/jesus-martinez-del-rincon (J. M. d. Rincón); https://pure.qub.ac.uk/en/persons/domhnall-carlin (D. Carlin)
ORCID: 0009-0001-7324-4837 (N. Ross); 0000-0003-2771-2553 (O. Olukoya); 0000-0002-9574-4138 (J. M. d. Rincón); 0000-0002-8424-2757 (D. Carlin)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
but will also support the identification of vulnerabilities within them. This approach would, as an initial
step, not only detect malware but also help stop its spread once it is in the system.
This paper addresses an existing shortfall in malware vulnerability datasets by creating the first
comprehensive dataset for static analysis of Windows portable executable (PE) malware vulnerabilities
for detection and analysis. Our dataset contains 45,255 samples comprised of malware and benign
samples that are labeled vulnerable and non-vulnerable to allow for the development of ML models for
detecting malware, vulnerabilities, or both. The creation of this dataset aims to underpin innovative
approaches to enhance the security of cyber systems.
The main contributions of this paper are:
• A comprehensive and curated malware vulnerability dataset, which is publicly available1 .
• Extensive annotations on the data including timestamps, families, threat details, vulnerabil-
ity details, CWEs, packer types and complexities, vulnerability payloads, and Mitre ATT&CK
framework mapping.
• A detailed, ready-to-use feature set that utilizes the widely used Ember feature extraction2 ,
which provides greater flexibility to researchers and developers.
• Data visualization of our dataset was performed using dimensionality reduction techniques to
depict the data distribution in our dataset.
• Timestamp collection, obfuscation and vulnerability analysis, and CWE and ATT&CK mappings
between benign and malware samples, were performed for accurate annotation and richer analysis
of the dataset and potential defenses.
• A comprehensive ML benchmark using our dataset for binary (malware vs. benign, vulnerable vs.
non-vulnerable) and multi-class classification tasks.
2. Background
In this section, we describe approaches for identifying vulnerabilities within malware in Section 2.1,
the application of machine learning techniques for malware classification and vulnerability detection in
Section 2.2, and finally, a comparative analysis of relevant datasets used in malware classification and
detection, which emphasizes the contribution of our dataset in Section 2.3.
2.1. Malware Vulnerabilities
Research into vulnerabilities within malware is a promising yet under-explored domain. The fact that
malware can harbour exploitable flaws offers a compelling defensive mechanism [8]. In 2010,
Caballero et al. [9] introduced a novel approach, termed stitched dynamic symbolic execution, aimed at
uncovering vulnerabilities within malware. They aimed to facilitate the discovery of exploitable
bugs as a significant step in automated malware analysis. Caballero et al. identified six distinct bugs
in four common malware families, highlighting their persistence across multiple evolutions. These
weaknesses were exploitable to disrupt or at least hinder their operations defensively. These findings
were corroborated in [10], where the authors highlight that malware is prone to exploitable bugs, just
like benign software. The authors suggest that most malware does not go through quality control
processes, meaning it is not only vulnerable to the common issues arising during the software development
lifecycle but is exposed to them to an even greater extent. Defensive exploitation of such vulnerabilities can
allow defenders to delay, mitigate or frustrate malware-based attacks.
1 https://github.com/nross12/PEVuln/
2 https://github.com/elastic/ember
Given that these vulnerabilities exist, they can be identified similarly to how software is labelled in
lists such as MITRE Common Weakness Enumeration (CWE)3 . The studies in [10, 11] also demonstrate
that no proper quality assurance is carried out when developing malware since it contains multiple
bugs persistent in families across a timeline. This exemplifies how inherited flaws within malware can
be exploited defensively. The research in [11] also highlights a bug that crashes Command & Control
(C2) servers in the Mirai code, and how the vulnerability persists in many variants. Similar case studies
have shown that a DLL-hijack vulnerability was found and exploited in the LockBit ransomware [12],
which worked against nearly every other ransomware family. This vulnerability has been likened to
a Pandora’s box of vulnerabilities [13, 14, 15]. Apart from the lack of quality control, code reuse is
likely another reason for the persistence of malware vulnerabilities. A study on malware evolution
and code reuse over four decades[16] found many instances of code reuse. Findings from research
conducted by Intezer and McAfee, which involved analyzing malware samples and cyber campaigns,
revealed substantial evidence of code reuse spanning 10 years (2007-2017) [17]. This investigation
helped uncover previously unknown connections among North Korea’s malware families, indicating
that the reuse of malware code is widespread in cybercrime. The lack of quality control, the rise of
malware variants, code reuse, and malware-as-a-service ensure that malware vulnerabilities remain.
One of the well-known cases of using exploitable vulnerabilities in malware for offensive security
in detection technology is WannaCry, the biggest ransomware attack in history, which spread within
days to more than 250,000 systems in 150 countries and was stopped by registering a web domain
found in the malware’s code [18]. Once the ransomware checked the URL and found it was active, it
shut itself down, buying precious time and giving organizations room to update their systems. Such
vulnerabilities can often persist in malware and its variants for a long time across different target
platforms [11, 19]. Similar studies have identified and exploited flaws in ransomware encryption
techniques, assisting victims of ransomware and defending against the threats [20, 21, 22]. Other studies
have investigated the identification and exploitation of malware vulnerabilities for covert monitoring
of C&C servers through protocol infiltration for botnet disruption and breakdown [23, 19, 24, 25]. To
develop an automated system that rapidly uncovers exploitable flaws and defects in malicious software
as a kill-switch approach, PE malware machine learning models for vulnerability identification and
detection could be designed and utilized, under the assumption that benchmark annotated datasets are
available for training.
2.2. Machine Learning
ML has become a cornerstone for malware detection and classification with various techniques being
applied by researchers to identify patterns and behaviors that can discern malware from typical benign
software [26, 27, 28]. ML models can be trained on features derived from three primary analytical
methods: static, dynamic and hybrid analysis [29]. Static analysis involves examining an executable
without explicitly executing it, making it one of the safest and most efficient methods to obtain
information [30]. Dynamic analysis, on the other hand, gathers features from the behavior the executable
exhibits while running on the system [31]. This method
is more intensive on the system and can take much longer to generate data, especially when working
with a large dataset. Hybrid analysis is a time-consuming and resource-intensive method that combines
the strength of static and dynamic techniques to provide a comprehensive understanding of malware
samples[32]. Additionally, there are memory analysis techniques that create a memory image of the
malware during dynamic execution for analysis [19, 33, 34]. This paper proposes specifically the
extraction of features by static analysis of PE files.
2.3. Datasets
Machine learning models for malware detection rely heavily on datasets that are curated, annotated and
comprehensive. Several benchmark datasets have been instrumental for malware detection, as can be seen
in Table 1, but none of them includes vulnerable executables or metadata about the vulnerabilities within them.
3 https://cwe.mitre.org/
Additionally, a comparison of the features, vulnerable samples and metadata proposed in our dataset
and other open datasets for PE malware is presented in Table 1.
Table 1
Analysis of the features in each dataset compared to the dataset we present in this paper (✗: absent; ✓: present).

| Features | BODMAS [4] | EMBER [5] | SOREL-20M [6] | Microsoft [7] | Our Dataset |
|---|---|---|---|---|---|
| # Samples | 134,435 | 2,100,000 | 19,724,997 | 10,868 | 45,255 |
| Threat Details | ✓ | ✓ | ✓ | ✗ | ✓ |
| Family | ✓ | ✓ | ✗ | ✓ | ✓ |
| Vulnerable Samples | ✗ | ✗ | ✗ | ✗ | ✓ |
| Vulnerability Details | ✗ | ✗ | ✗ | ✗ | ✓ |
| Vulnerability Payloads | ✗ | ✗ | ✗ | ✗ | ✓ |
| CWE | ✗ | ✗ | ✗ | ✗ | ✓ |
| Packer (Type/Complexity) | ✗ | ✗ | ✗ | ✗ | ✓ |
| ATT&CK Techniques | ✗ | ✗ | ✗ | ✗ | ✓ |
3. Dataset Description
We created this dataset by systematically collecting data from several malware repositories, data sources,
and techniques which are categorized into four main classes as outlined below:
• Vulnerable Malware (VM): The samples and metadata for vulnerable malware were collected
from four main sources - Malvuln4 , VirusTotal5 , VulDB6 and VirusShare7 . Malvuln is a resource
dedicated to malware security vulnerability research and provides information on identifying
and exploiting malware, including details such as the threat, vulnerability, description, family,
hash, exploit POC, etc. The metadata of each vulnerable malware sample is collected, and the
binary executables are downloaded from VirusTotal and VirusShare using the MD5 of the sample
and VulDB using the MVID allocated by Malvuln.
• Non-vulnerable Malware (NVM): An abundance of malware was collected, including samples
caught by honeypots and from large data dumps on VirusShare. Similarly to the vulnerable
malware data, the non-vulnerable malware data contains additional information extracted about
each sample from VirusTotal and VirusShare.
• Vulnerable Benign Software (VB): We leveraged Exploit-DB8 , which is an exploit database similar
to Malvuln, except that it holds labelled conventional software vulnerabilities with options for
downloading the vulnerable application, metadata, and exploitation payload.
4 https://malvuln.com/
5 https://www.virustotal.com/
6 https://vuldb.com/
7 https://virusshare.com/
8 https://www.exploit-db.com/
• Non-vulnerable Benign Software (NVB): To create a dataset of benign applications, we extracted
executables from a fresh installation of Windows 10 (located in C:\Windows\System32). This
is a popular approach in the malware community for creating a benign dataset [30]. Further
vulnerability scans were carried out to ensure that the executables were not vulnerable.
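To illustrate the collection step for the non-vulnerable benign class above, the sketch below enumerates and hashes the System32 executables. This is a minimal illustration rather than the authors' exact tooling; the recursive walk and the .exe/.dll filter are assumptions.

```python
import hashlib
from pathlib import Path

SYSTEM32 = Path(r"C:\Windows\System32")  # fresh Windows 10 installation, per the collection step above

def sha256_of(path: Path) -> str:
    """Hash a candidate benign executable so it can be cross-referenced in the metadata."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Collect PE files (.exe/.dll) and record their hashes for the benign class index.
benign_index = {}
for exe in SYSTEM32.rglob("*"):
    if exe.suffix.lower() in {".exe", ".dll"} and exe.is_file():
        try:
            benign_index[sha256_of(exe)] = str(exe)
        except OSError:
            continue  # skip files locked by the operating system
print(f"Collected {len(benign_index)} candidate non-vulnerable benign samples")
```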
The data sources and types of information collected are summarized in Table 2. Creating an augmentation
of an existing dataset or using a combination of multiple sources to create a dataset is consistent with
the literature. For instance, an enhanced version of the EMBER dataset, EMBERSim, was developed
in [35] to include similarity information, addressing the problem of binary code similarity search in
Windows PE files. The expanded tags were created using a combination of automatic tagging tools
such as AVClass[36, 37] to include class, family, behavior, and file properties. In this research, we
utilized AVClass for tagging our data which can provide additional context for addressing the issue of
vulnerabilities in PE malware by highlighting the prevalence of specific vulnerabilities across different
threat types or families. Table 1 expands on these classes with details on the number of samples in the
classes and the number of unique families, vulnerabilities, and CWEs. The vulnerable malware subset
is naturally the minority class given how little analysis exists of vulnerabilities in malware samples.
This results in a highly imbalanced dataset, and may preclude specific training strategies as we will
describe in Section 4.
This dataset will be regularly updated as new vulnerable PE malware and benign samples are collected
from the repository.
Table 2
Data Sources and Types of Information Collected (✗: absent; ✓: present).

| Source | PE File | MD5/SHA256 | Threat Type | Family | Vulnerability Details | CWE | Mitre ATT&CK |
|---|---|---|---|---|---|---|---|
| Malvuln | ✗ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
| ExploitDB | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ |
| VirusTotal | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| VirusShare | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| VulDB | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Honeypots | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
Table 3
Dataset Distribution.

| Class | # Samples | # Families | # Vulnerabilities | # CWE |
|---|---|---|---|---|
| VM | 684 | 128 | 112 | 35 |
| NVM | 35,241 | 520 | N/A | N/A |
| VB | 1,425 | N/A | 627* | 36 |
| NVB | 7,905 | N/A | N/A | N/A |

* Due to the specificity of some vulnerabilities in the Vulnerable Benign class extracted from ExploitDB, the number of unique vulnerability types is fairly high.
3.1. Framework Overview
We have developed a systematic approach, depicted in Figure 1, for the creation of this malware
vulnerability dataset: raw data is carefully sourced and passed through various stages of processing
and cleaning, followed by ML model training for both binary and multi-class classification tasks to
obtain preliminary results and data visualization. Firstly, both vulnerable and non-
vulnerable malware and benign software executables, with their corresponding metadata, were sourced
and subsequently pre-processed, ensuring that clean and well-structured outputs capture the attributes
and features necessary to perform a deeper analysis of the performance of the dataset. In the executable
pre-processing phase, feature extraction is performed using Ember[5] to derive meaningful features
from the executables for each class. Obfuscation scanning and static application security testing (SAST)
were performed to obtain additional information about the executables. Simultaneously, metadata
pre-processing focuses on the supplementary data associated with the executables. This involves
mapping hashes such as md5 and sha256 where appropriate so that the samples can be identified and
extracting timestamps, threat types, families, and various vulnerability details (such as the vulnerability
type and the payload to exploit it). Additionally, the Mitre ATT&CK framework and common weakness
enumeration (CWE) were mapped, and obfuscation details (such as the types of packer used) were
analyzed and categorized based on complexity. To generate an ML benchmark for the community, as
well as to prove useful models can be generated from our dataset, a comprehensive set of common
machine learning algorithms were trained, including LightGBM (LGBM), Random Forest (RF), k-Nearest
Neighbors (KNN), Support Vector Machines (SVM), and Artificial Neural Networks (ANN). These models
are trained to perform both binary and multi-class classification tasks to distinguish between benign
and malware data and further categorize if it is vulnerable or not. Full details on the ML baseline
models are given in Section 4. Finally, we also visualize our dataset and its analysis in Section 3.7. This
presents detailed visualizations of the full dataset by applying dimensionality reduction techniques,
namely Principal Component Analysis (PCA) and t-SNE. These techniques allow us to produce 2D
representations of our high-dimensional data. Data samples are depicted using different labels and
colors according to the previously mentioned metadata including vulnerabilities, obfuscation details,
CWEs, etc. Furthermore, we have visualized the Mitre ATT&CK mappings to better understand the
attack surfaces that the vulnerable malware targets and to relate these attack surfaces to the
vulnerabilities present.
[Figure 1 diagram: data sources (Malvuln, ExploitDB, VirusTotal, VirusShare, VulDB, honeypots) feed PE pre-processing (feature extraction, obfuscation scanning, static application security testing) and metadata pre-processing (timestamps, hashes, threat type, family, vulnerability types/details/payloads, CWEs, Mitre ATT&CK, obfuscation details), followed by preliminary model benchmarks (LGBM, Random Forest, KNN, SVM, ANN) on binary and multi-class classification tasks and dataset visualisation.]
Figure 1: Framework showing the steps to produce this dataset and how preliminary benchmark results were generated with common classifier implementations.
3.2. SAST Tools Scanning
When working with ML, the data needs to be curated and precisely annotated to avoid outliers, training
discrepancies and even potential poisoning and/or backdoor attacks. In our case, this is especially
relevant regarding the vulnerabilities that could theoretically exist in the non-vulnerable data as it
had not gone through rigorous testing in the compilation of this dataset. For this reason, source code
analysis tools (also known as Static Application Security Testing, SAST, tools) were applied to the
non-vulnerable class executables to find potential security flaws by means of signature-based pattern
matching, semantic analysis, and taint analysis. Samples flagged as potentially vulnerable were then
removed from the non-vulnerable classes, which, in theory, should make the vulnerable classes more
distinct when training models. To perform this SAST on the data, two tools were used: cve-bin-tool9
and Vulnscan10 .
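As a rough sketch of how such a filtering pass over the non-vulnerable pool could be wired up, the snippet below drives cve-bin-tool from Python. It is an illustration only: the directory name is made up, and the assumption that the tool signals detected CVEs through a non-zero exit status is ours, not a documented guarantee of the authors' pipeline.

```python
import subprocess
from pathlib import Path

def flag_potentially_vulnerable(sample: Path) -> bool:
    """Run cve-bin-tool on one executable and treat any reported CVE as grounds for
    dropping the sample from the non-vulnerable classes. The exit-code convention
    (non-zero when known CVEs are detected) is an assumption about the tool."""
    result = subprocess.run(
        ["cve-bin-tool", str(sample)],
        capture_output=True,
        text=True,
    )
    return result.returncode != 0

# Keep only samples that were not flagged by the SAST pass.
clean_samples = [p for p in Path("non_vulnerable_pool").glob("*.exe")
                 if not flag_potentially_vulnerable(p)]
```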
3.3. Packing in Vulnerable & Non-vulnerable Malware
As we are working with executables, particularly user software and malware, they can be affected by
evasion techniques, more specifically in our case obfuscation. To check for this, we used two tools:
Bintropy[38], a Python tool that detects obfuscation based on entropy, and Detect It Easy (DIE)11 which
works similarly to Bintropy 12 but also provides information on the packer, linker, and compiler. It
is possible to categorize the different types of packers based on their complexities which can be seen
in [39]. The types of packers range from Type-I to Type-VI, with higher values indicative of a greater
complexity. This method of categorization provides additional information which allows us to conduct
a more nuanced analysis of the different packers used in malware compared to benign software. This is
critical as it allows an investigation into whether malicious actors lean towards using more complex
packers with specialized techniques over others for evasion. It could also reveal a correlation between
the use of specific packers in malware with vulnerabilities compared to
those with no vulnerabilities.
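To illustrate the entropy-based side of this check, the sketch below flags likely-packed PEs from their section entropies in the spirit of Bintropy. It is not the authors' pipeline: the use of the pefile library and the exact thresholds (commonly cited values from the literature) are assumptions.

```python
import pefile  # pip install pefile

# Thresholds in the spirit of Bintropy / Lyda & Hamrock; the exact cut-offs here are
# an assumption, not the values used in the dataset's obfuscation analysis.
AVG_ENTROPY_THRESHOLD = 6.677
MAX_ENTROPY_THRESHOLD = 7.199

def looks_packed(path: str) -> bool:
    """Flag a PE as likely packed when its section entropies are unusually high."""
    pe = pefile.PE(path, fast_load=True)
    entropies = [s.get_entropy() for s in pe.sections if s.SizeOfRawData > 0]
    if not entropies:
        return False
    avg, peak = sum(entropies) / len(entropies), max(entropies)
    return avg > AVG_ENTROPY_THRESHOLD and peak > MAX_ENTROPY_THRESHOLD
```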
Table 4
Different Packers used on Malware samples.

| Packer Type | Packer | # V-Malware | # NV-Malware |
|---|---|---|---|
| Type-I | UPX | 107 | 2570 |
| Type-III | ASPack, ASProtect, FSG, NSPack, PE Compact, Upack | 43 | 2504 |
| Unknown | ezip, MEW, MoleBox, MPRESS, NeoLite, Petite, PKLITE, RLPack, DXPack, kkrunchy, PyInstaller, Packman, NakedPacker, SpoonStudio, BeRo, KByS, ASPPack, nPack, JDPack, .NETZ | 13 | 2363 |
| # Unique Packers | | 14 | 25 |
| # Packed | | 398 | 20998 |
As shown in Table 4 and Table 5, the most often used packers in the vulnerable and control classes of
the benign and malicious samples are rather simple, spanning from Type-I to Type-III based on the
packer taxonomy proposed by Ugarte-Pedrero et al.[39]. This packer distribution result is consistent
with multiple longitudinal studies [39, 19, 40, 30] investigating the complexity of custom and off-the-
shelf run-time packers in the wild. The implication of this study is two-fold. Firstly, packer complexity
in control (non-vulnerable) and vulnerable malware follows the same pattern, making it possible to
apply analysis and detection techniques for packed malware to packed vulnerable malware. Secondly,
there is an overlap between packers used in vulnerable and controlled malware samples, so we can
9 https://github.com/intel/cve-bin-tool
10 https://github.com/zhutoulala/vulnscan
11 https://github.com/horsicq/Detect-It-Easy
12 https://github.com/packing-box/bintropy
Table 5
Different Packers used on Benign samples.

| Packer Type | Packer | # V-Benign | # NV-Benign |
|---|---|---|---|
| Type-I | UPX | 127 | 1 |
| Type-III | ASPack, NOS Packer, PECompact | 11 | 0 |
| Unknown | Petite, PKLITE | 4 | 0 |
| # Unique Packers | | 6 | 1 |
| # Packed | | 2193 | 3400 |
develop classifiers that do not associate specific packers with vulnerability. This finding is significant,
as research on the impact of machine-learning-based malware detection on packed samples utilizing
static analysis features has observed that classifiers frequently link particular packers to malicious
activity because there is insufficient overlap between packers utilized in malicious and benign samples
[30]. Additionally, only one unique packer is observed in the control benign class, which can be attributed
to how we collected the data: Windows 10 likely has no need for a variety of packers, as its executables
and DLLs are distributed and managed with the Operating System (OS), and its consistent use of UPX
for all packed executables and DLLs is likely driven by several factors13 , mainly maintaining consistency
and widespread applicability across the various executable and DLL files. We can therefore conclude that
the packer is not a feature for determining whether an executable is control or vulnerable malware.
3.4. CWE Mapping
For richer vulnerability analysis, we have mapped the CWEs for both the vulnerable malware and
vulnerable benign classes in this dataset and conducted strategic similarity analysis between them so
that we can understand the crossover between vulnerabilities present in benign software or malware.
By doing this, we can further analyze the correlation between the vulnerabilities with other attributes
such as system architecture, programming languages used, attack vectors, or exploitation complexities.
Mapping CWEs in vulnerable malware is also crucial for linking vulnerabilities across certain malware
families to demonstrate both the persistence of vulnerabilities in newer malware of the same family and
the possible introduction of newer vulnerabilities. Another advantage of CWE mapping for malware
vulnerabilities is that it enables the creation of automated exploits based on the identified weakness
category[41]. This is similar to the automated exploit generation for vulnerabilities seen in commercial
and open-source programs, such as Stack-based buffer overflow(CWE-121)[42, 43], PHP object injection
vulnerabilities(CWE-502, CWE-915)[44] and XML injection vulnerabilities(CWE-91)[45].
In Table 6, an analysis of comprehensive weakness categories is presented for vulnerable malicious
and benign samples. The analysis reveals that the top 10 most frequent vulnerability categories in the
vulnerable malware dataset have a 20% similarity with the top 10 most frequent vulnerabilities in the
vulnerable benign dataset, with CWE-200 and CWE-120 being the common vulnerability categories.
This suggests that both classes contain vulnerabilities that can lead to information exposure and buffer
overflow. The findings provide insight into the categories of vulnerabilities present in vulnerable
malware and their prevalence. Additionally, the analysis highlights the overlap and differences be-
tween the vulnerabilities in legitimate software applications and malicious binaries, indicating that
vulnerable malicious binaries often exhibit different categories of weaknesses compared to their benign
counterparts.
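The top-10 overlap reported above can be computed directly from the per-class CWE counts. The sketch below illustrates the idea with a few counts abbreviated from Table 6; the variable names and the truncated counters are illustrative only.

```python
from collections import Counter

def top_k(cwe_counts: Counter, k: int = 10) -> set:
    """Return the k most frequent CWE identifiers."""
    return {cwe for cwe, _ in cwe_counts.most_common(k)}

# Abbreviated per-class CWE frequencies (a subset of the counts in Table 6).
malware_cwes = Counter({"CWE-275": 133, "CWE-912": 106, "CWE-120": 40, "CWE-200": 11})
benign_cwes = Counter({"CWE-119": 452, "CWE-20": 50, "CWE-120": 10, "CWE-200": 6})

shared = top_k(malware_cwes) & top_k(benign_cwes)
similarity = len(shared) / 10  # e.g. two shared categories -> 20% overlap of the top 10
print(shared, similarity)
```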
Approximately 71.69% of vulnerabilities in the malware samples are attributed to five main classes of
weaknesses. These classes include Permission Issues, Hidden Functionality, Buffer Overflows, Improper
Authentication, and Use of Hard-coded Credentials. They have consistently been the primary sources of
vulnerabilities, making them favoured targets for defenders seeking to exploit these security issues.
13 https://upx.github.io/
The Permission Issues category refers to weaknesses related to improper assignment or handling of
permissions. A comprehensive study on covert monitoring of C&C servers [19] found that over-
permissioned protocols are prevalent in the malware operational landscape, with nearly 1 in 3 malware
bots exhibiting this vulnerability, confirming the significance of this weakness category. Another
study on IoT botnets [23] highlighted weak and default passwords in C2 servers, aligning with the
identification of Improper Authentication and Use of Hard-coded Credentials as primary sources of
malware vulnerabilities. It is also expected that CWE-912, Hidden Functionality, is prevalent in our
dataset, as it can take the form of embedded malicious code and is useful for attacks that modify
the control flow of the application, aligning with malicious behavior. Another noteworthy weakness
category in the Top 10 for malware is CWE-426, a weakness category associated with DLL Hijacking,
which is prevalent and exploitable in most ransomware families for preventing file encryption [13, 14, 15].
In benign executables, 76.64% of vulnerabilities are attributed to two main classes of weaknesses. The
primary sources of vulnerabilities are Improper Restriction of Operations within the Bounds of a Memory
Buffer and Improper Input Validation. These weaknesses are commonly targeted by threat actors when
trying to exploit security issues. Our ranking, based on our dataset, is further supported by the 2023 CWE
Top 10 KEV weaknesses14 , which is a catalogue of Known Exploited Vulnerabilities maintained by the
Cybersecurity and Infrastructure Security Agency (CISA)15 for vulnerability management prioritization.
Notably, four of the Top 10 CWEs identified in the vulnerable benign dataset, CWE-787 (#3), CWE-20
(#4) and CWE-22 (#9), are among the top 10 class of weaknesses exploited by threat actors as observed
in the wild. This suggests that the vulnerabilities and prevalence identified in the dataset accurately
reflect real-world observations.
We have further conducted a detailed analysis of the weakness categories present in the vulnerable
samples by aligning the weaknesses with the OWASP Top 1016 and the 2023 CWE Top 2517 , as shown
in Table 7. The identified malware CWEs correspond to 8 of the OWASP Top 10 application security
risks, whereas benign CWEs correspond to 6 categories. Similarly, the malware CWEs align with 9 of
the top 25 most dangerous software weaknesses, while benign CWEs align with 8 categories. The CWE
classifies vulnerabilities in benign and malicious binaries into the same four categories in the OWASP
Top 10. This indicates that vulnerability detection and identification tools developed for these categories
in benign software may apply to malicious binaries due to the overlap. For example, the technique
of symbolic execution, used in TaintScope[46] to find bugs in benign software, was also applied to
identify bugs in prevalent families of bots and other malware in a study by Caballero et al[9]. Moreover,
the CWE from both classes sometimes maps to unique categories of critical risk in the OWASP Top
10 without overlapping. This supports the initial assertion about the differences in vulnerabilities in
both samples. For instance, only the malware samples have vulnerabilities associated with A02:2021 -
Cryptographic Failures. One characteristic of this category is a failure in the encryption mechanism,
which is significant as one common vulnerability exploited in ransomware is encryption failures[21, 22].
When the first ransomware bug bounty operation was launched, Locker Bugs, which covers encryption
errors, was prioritized[12]. This analysis demonstrates that although there is awareness of software
vulnerabilities, it is essential to recognize that malware possesses inherent vulnerabilities that are critical
and impactful. These vulnerabilities can be exploited for offensive security in defense technology.
14 https://cwe.mitre.org/top25/archive/2023/2023_kev_list.html
15 https://www.cisa.gov/known-exploited-vulnerabilities-catalog
16 https://owasp.org/www-project-top-ten/
17 https://cwe.mitre.org/top25/index.html
18 https://nvd.nist.gov/vuln/categories
Table 6
Number of observations per CWE in our vulnerable malware and benign dataset using the NVD CWE Slice18.

| CWE in Malware | CWE Description | Count | CWE in Benign | CWE Description | Count |
|---|---|---|---|---|---|
| CWE-275 | Permission Issues | 133 | CWE-119 | Improper Restriction of Operations within the Bounds of a Memory Buffer | 452 |
| CWE-912 | Hidden Functionality | 106 | CWE-20 | Improper Input Validation | 50 |
| CWE-120 | Buffer Copy without Checking Size of Input (’Classic Buffer Overflow’) | 40 | CWE-94 | Improper Control of Generation of Code (’Code Injection’) | 21 |
| CWE-287 | Improper Authentication | 35 | CWE-399 | Resource Management Errors | 20 |
| CWE-121 | Stack-based Buffer Overflow | 31 | CWE-22 | Improper Limitation of a Pathname to a Restricted Directory (’Path Traversal’) | 20 |
| CWE-798 | Use of Hard-coded Credentials | 27 | CWE-189 | Numeric Errors | 17 |
| CWE-404 | Improper Resource Shutdown or Release | 18 | CWE-787 | Out-of-bounds Write | 12 |
| CWE-259 | Use of Hard-coded Password | 18 | CWE-120 | Buffer Copy without Checking Size of Input (’Classic Buffer Overflow’) | 10 |
| CWE-426 | Untrusted Search Path | 17 | CWE-200 | Exposure of Sensitive Information to an Unauthorized Actor | 6 |
| CWE-200 | Exposure of Sensitive Information to an Unauthorized Actor | 11 | CWE-264 | Permissions, Privileges, and Access Controls | 5 |
| CWE-122 | Heap-based Buffer Overflow | 10 | CWE-416 | Use After Free | 5 |
| CWE-319 | Cleartext Transmission of Sensitive Information | 10 | CWE-358 | Improperly Implemented Security Check for Standard | 5 |
| CWE-428 | Unquoted Search Path or Element | 9 | CWE-352 | Cross-Site Request Forgery (CSRF) | 4 |
| CWE-284 | Improper Access Control | 8 | CWE-611 | Improper Restriction of XML External Entity Reference | 3 |
| CWE-306 | Missing Authentication for Critical Function | 8 | CWE-79 | Improper Neutralization of Input During Web Page Generation (’Cross-site Scripting’) | 3 |
| CWE-119 | Improper Restriction of Operations within the Bounds of a Memory Buffer | 6 | CWE-428 | Unquoted Search Path or Element | 2 |
| CWE-427 | Uncontrolled Search Path Element | 4 | CWE-36 | Absolute Path Traversal | 2 |
| CWE-77 | Improper Neutralization of Special Elements used in a Command (’Command Injection’) | 3 | CWE-824 | Access of Uninitialized Pointer | 2 |
| CWE-21 | DEPRECATED: Pathname Traversal and Equivalence Errors | 3 | CWE-89 | Improper Neutralization of Special Elements used in an SQL Command (’SQL Injection’) | 2 |
| CWE-79 | Improper Neutralization of Input During Web Page Generation (’Cross-site Scripting’) | 3 | CWE-476 | NULL Pointer Dereference | 2 |
| CWE-476 | NULL Pointer Dereference | 3 | CWE-134 | Use of Externally-Controlled Format String | 2 |
| CWE-918 | Server-Side Request Forgery (SSRF) | 2 | CWE-287 | Improper Authentication | 1 |
| CWE-312 | Cleartext Storage of Sensitive Information | 2 | CWE-918 | Server-Side Request Forgery (SSRF) | 1 |
| CWE-300 | Channel Accessible by Non-Endpoint | 2 | CWE-835 | Loop with Unreachable Exit Condition (’Infinite Loop’) | 1 |
| CWE-352 | Cross-Site Request Forgery (CSRF) | 2 | CWE-763 | Release of Invalid Pointer or Reference | 1 |
| CWE-23 | Relative Path Traversal | 2 | CWE-732 | Incorrect Permission Assignment for Critical Resource | 1 |
| CWE-256 | Plaintext Storage of a Password | 1 | CWE-770 | Allocation of Resources Without Limits or Throttling | 1 |
| CWE-89 | Improper Neutralization of Special Elements used in an SQL Command (’SQL Injection’) | 1 | CWE-190 | Integer Overflow or Wraparound | 1 |
| CWE-434 | Unrestricted Upload of File with Dangerous Type | 1 | CWE-522 | Insufficiently Protected Credentials | 1 |
| CWE-521 | Weak Password Requirements | 1 | CWE-121 | Stack-based Buffer Overflow | 1 |
| CWE-288 | Authentication Bypass Using an Alternate Path or Channel | 1 | CWE-88 | Improper Neutralization of Argument Delimiters in a Command (’Argument Injection’) | 1 |
| CWE-313 | Cleartext Storage in a File or on Disk | 1 | CWE-427 | Uncontrolled Search Path Element | 1 |
| CWE-759 | Use of a One-Way Hash without a Salt | 1 | CWE-255 | Credentials Management Errors | 1 |
| CWE-611 | Improper Restriction of XML External Entity Reference | 1 | CWE-269 | Improper Privilege Management | 1 |
| CWE-314 | Cleartext Storage in the Registry | 1 | NVD-CWE-Other* | Other | 291 |
| | | | NVD-CWE-noinfo** | Insufficient Information | 11 |

* NVD does not use the whole CWE for mapping; instead, it uses a subset of CWE that does not include the weakness type.
** Insufficient information to classify the issue caused by unknown or unspecified details.
3.5. MITRE ATT&CK Mapping
The Mitre ATT&CK framework is essential in understanding and leveraging tactics and techniques
for both vulnerability exploitation and defense mechanisms. Tactics19 within the ATT&CK framework
represent the strategic objectives or the “why" behind an adversary’s actions, whereas techniques20
detail the “how" by describing the specific methods used to accomplish these tactical goals. With this
19 https://attack.mitre.org/tactics/enterprise/
20 https://attack.mitre.org/techniques/enterprise/
Table 7
CWEs from Vulnerable Malware (M) and Benign samples (B) mapped to the OWASP Top 10 and CWE Top 25 Lists.

| OWASP Top 10 | CWE (Benign) | CWE (Malware) | 2023 CWE Top 25 |
|---|---|---|---|
| A01:2021 Broken Access Control | CWE-22, CWE-200, CWE-264, CWE-352 | CWE-23, CWE-200, CWE-275, CWE-284, CWE-352 | (B) CWE-22 (#8); (B) CWE-352 (#9); (M) CWE-352 (#9) |
| A02:2021 Cryptographic Failures | | CWE-319, CWE-759 | |
| A03:2021 Injection | CWE-20, CWE-79, CWE-88, CWE-89, CWE-94 | CWE-77, CWE-79, CWE-89 | (B) CWE-20 (#6); (M) CWE-77 (#16); (B) CWE-79 (#2); (M) CWE-79 (#2); (B) CWE-89 (#3); (M) CWE-89 (#3); (B) CWE-94 (#23) |
| A04:2021 Insecure Design | CWE-269, CWE-522 | CWE-256, CWE-312, CWE-313, CWE-434 | (B) CWE-269 (#22); (M) CWE-434 (#10) |
| A05:2021 Security Misconfiguration | CWE-611 | CWE-611 | |
| A07:2021 Identification and Authentication Failures | CWE-255, CWE-287 | CWE-259, CWE-287, CWE-288, CWE-300, CWE-306, CWE-521, CWE-798 | (B) CWE-287 (#13); (M) CWE-287 (#13); (M) CWE-306 (#20); (M) CWE-798 (#18) |
| A08:2021 Software and Data Integrity Failures | | CWE-426 | |
| A10:2021 Server Side Request Forgery | CWE-918 | CWE-918 | (B) CWE-918 (#19); (M) CWE-918 (#19) |
knowledge, potential attack paths in malware can be established and exploited, or conversely can
be used for counteracting adversarial actions in benign software. In our dataset, these tactics and
techniques are mapped according to the vulnerabilities in the vulnerable malware class, which can be
seen in Table 8.
Table 8 shows the top ATT&CK Tactics and Techniques adopted in the vulnerable malware, which
are overwhelmingly File and Directory Permissions Modification, Obtain Capabilities (malware), Brute
Force (Password Guessing), Hijack Execution Flow and Valid Accounts (Default Accounts), which together
make up over 82% of the techniques. These techniques show that the adversaries want to infiltrate the
victim’s network, evade detection throughout the compromise process, establish resources they can use
to support operations, steal account credentials, obtain higher-level permissions and maintain their
foothold. It is important to note that the tactics and techniques in Table 8 are only representative of our
dataset, which is influenced by the number of malware families and samples. Under different conditions,
the order will be different. In any case, the present tactics and techniques of vulnerable malware are the
ones you would expect from malware generally [47, 48], which shows that the vulnerability in malware
Table 8
Mitre Tactics and Techniques, mapped to their description with the number of times they appear in the dataset mapped to the Vulnerable Malware class.

| # Total | Technique ID | Technique | Tactics |
|---|---|---|---|
| 133 | T1222 | File and Directory Permissions Modification | TA0005 Defense Evasion |
| 106 | T1588.001 | Obtain Capabilities (malware) | TA0042 Resource Development |
| 27 | T1110.001 | Brute Force (Password Guessing) | TA0006 Credential Access |
| 21 | T1574 | Hijack Execution Flow | TA0003 Persistence; TA0004 Privilege Escalation; TA0005 Defense Evasion |
| 18 | T1499 | Endpoint Denial of Service | TA0040 Impact |
| 18 | T1078.001 | Valid Accounts (Default Accounts) | TA0001 Initial Access; TA0003 Persistence; TA0004 Privilege Escalation; TA0005 Defense Evasion |
| 11 | T1592 | Gather Victim Host Information | TA0043 Reconnaissance |
| 10 | T1040 | Network Sniffing | TA0006 Credential Access; TA0007 Discovery |
| 9 | T1574.009 | Hijack Execution Flow (Path Interception by Unquoted Path) | TA0003 Persistence; TA0004 Privilege Escalation; TA0005 Defense Evasion |
| 8 | T1068 | Exploitation for Privilege Escalation | TA0004 Privilege Escalation |
| 5 | T1006 | Direct Volume Access | TA0005 Defense Evasion |
| 4 | T1555 | Credentials from Password Stores | TA0006 Credential Access |
| 3 | T1202 | Indirect Command Execution | TA0005 Defense Evasion |
| 3 | T1059.007 | Command and Scripting Interpreter (JavaScript) | TA0002 Execution |
| 2 | T1557 | Adversary-in-the-Middle | TA0006 Credential Access; TA0009 Collection |
| 2 | T1552 | Unsecured Credentials | TA0006 Credential Access |
| 1 | T1608.002 | Stage Capabilities (Upload Tool) | TA0042 Resource Development |
| 1 | T1110.002 | Brute Force (Password Cracking) | TA0006 Credential Access |
| 1 | T1505 | Server Software Component | TA0003 Persistence |
is not some sophistication of tactics and techniques. Similarly to every piece of software code, malware
is also prone to vulnerabilities and weaknesses that can be exploited. Picus Labs [48] analyzed over
600,000 malware samples gathered between January 2023 and December 2023 to identify the TTPs they
contained. The study identified the top 10 MITRE ATT&CK techniques,
which differ from the top 10 ATT&CK techniques used in this study’s vulnerable malware samples.
A related study that looked into ATT&CK Trends and Techniques in 951 Windows malware families
between 2017 and 2018 revealed that the examined dataset had a varied number of observations for
each ATT&CK [47]. For example, only three of the dataset’s top ten ATT&CK techniques were adopted
by threat groups and malware, according to the report by Picus Labs. Top ATT&CK techniques can
vary depending on the specific malware family being considered, as different families have different
“whys” and “hows” stemming from their tactical goals and actions. For instance, the techniques used by ransomware
will differ from those used by spyware. For example, the Centre for Threat Informed Defense, MITRE
Engenuity [49], analyzed 22 ransomware groups over 3 years and compiled a list of the Top 10 ATT&CK
techniques. As anticipated, the Top 10 ATT&CK Techniques for ransomware only had three techniques
(T1486 - Data Encrypted for Impact, T1027 - Obfuscated Files or Information, T1055 - Process Injection) in
common with the top ATT&CK techniques identified by Picus Labs [48] and the trends in Windows
Malware [47].
Vulnerable malware is still malware and adheres to the same characteristics and propagation strategies.
Therefore, its vulnerabilities can persist in malware families and variants for a long time because the
authors either do not recognize the vulnerabilities, have poor operational security, or simply lack a quality
control process, a notable observation highlighted by Singh in [10].
Vulnerable or not, malware tactics and techniques are still adversarial. To prioritize ATT&CK techniques
for defending against malware attacks, creating a top ATT&CK techniques list in Table 8 can be a
helpful starting point. This approach is similar to MITRE’s methodology21 , which takes into account
the prevalence of techniques, common attack choke points, and actionability to help defenders focus
on the most relevant techniques for their organization. Including the number of observations per
technique allows us to measure how frequently an attacker uses a specific MITRE ATT&CK technique,
which can be useful in identifying important techniques when dealing with vulnerable malware for
exploitation. Additionally, defenders have the opportunity to mitigate or defend against each ATT&CK
based on publicly available threat intelligence derived from real-world observations. MITRE ATT&CK
offers additional information for mitigation, detection, procedure examples, and references for each
identified significant technique, which can serve as a knowledge base for offensive security and detection
technology.
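As a small illustration of this prioritization step, technique observation counts like those in Table 8 can be derived by tallying the per-sample ATT&CK annotations; the record structure below is hypothetical and only stands in for the dataset's metadata.

```python
from collections import Counter

# Hypothetical per-sample annotations: each vulnerable-malware record carries the
# ATT&CK technique IDs mapped from its vulnerability metadata.
samples = [
    {"md5": "sample-a", "techniques": ["T1222"]},
    {"md5": "sample-b", "techniques": ["T1588.001", "T1110.001"]},
]

# Count how often each technique is observed, then rank them for prioritization.
technique_counts = Counter(t for s in samples for t in s["techniques"])
for technique, observations in technique_counts.most_common(10):
    print(f"{technique}: {observations} observations")
```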
3.6. EMBER Feature Set
For the data to be used with ease as a benchmark for classification models, Ember, an open-source dataset
and feature extraction research project that uses the LIEF22 project to extract features from PE files for
static malware detection, was utilized to systematically generate a comprehensive set of vectorized
features. Using the version 2 extractor, the resulting feature vector for each PE file in our data contains
2381 elements. This feature set is widely applied in the ML malware detection literature
[35]. Thus, the extraction of static EMBER features from our dataset is consistent with other benchmark
datasets for malicious PE detection in the literature[4, 6]. For the features extracted using Ember, data
normalization is applied to allow for the values between the different classes to be distributed across a
common scale. This allows us to work with data within the desired range without losing much of the
original distribution. The MinMaxScaler is the method used to normalize input features to the [0, 1]
range based on the training data. This normalization technique has been applied for feature embedding
when working with other benchmark PE malware datasets such as the EMBER dataset to enhance ML
training [50, 51].
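A minimal sketch of this extraction and normalization step is shown below, assuming the open-source ember package and its PEFeatureExtractor; the file paths and the exact import layout are illustrative rather than the authors' exact code.

```python
import numpy as np
from ember.features import PEFeatureExtractor   # https://github.com/elastic/ember
from sklearn.preprocessing import MinMaxScaler

extractor = PEFeatureExtractor(feature_version=2)   # yields 2381-dimensional vectors

def extract(path: str) -> np.ndarray:
    """Read a PE file and return its vectorized Ember features."""
    with open(path, "rb") as fh:
        return np.array(extractor.feature_vector(fh.read()), dtype=np.float32)

# Illustrative paths; in practice these would be the dataset's training samples.
X_train = np.stack([extract(p) for p in ["sample_a.exe", "sample_b.exe"]])
scaler = MinMaxScaler().fit(X_train)          # fit on the training data only
X_train_scaled = scaler.transform(X_train)    # feature values mapped to [0, 1]
```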
3.7. Visualizations of Data Distributions
To visualize the data distribution and have a better understanding of our dataset, Principal Component
Analysis (PCA) was applied. PCA is a statistical method that reduces the dimensionality of large datasets
(in our case, four classes with a 2381-element Ember feature vector per sample) by transforming them into
a smaller set of linear and uncorrelated variables known as principal components. By reducing the
dimensionality to only 2 variables (the first two principal components), the dataset can be visualized
as an image. Similarly, t-distributed Stochastic Neighbor Embedding (t-SNE)[52] was also used, which
performs an unsupervised, non-linear reduction of dimensionality.
However, unlike PCA, t-SNE preserves the local structure of the data which allows for more visible
clusters.
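A sketch of how these projections can be produced with scikit-learn and matplotlib is given below, assuming X is the normalized Ember feature matrix and y the per-sample class labels (both prepared beforehand):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# X: (n_samples, 2381) normalized Ember feature matrix; y: class labels per sample.
y = np.asarray(y)
X_pca = PCA(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, emb, name in [(axes[0], X_pca, "PCA"), (axes[1], X_tsne, "t-SNE")]:
    for label in np.unique(y):
        pts = emb[y == label]                     # points belonging to one class
        ax.scatter(pts[:, 0], pts[:, 1], s=4, label=str(label))
    ax.set_title(f"Visualization of dataset using {name}")
    ax.legend()
plt.show()
```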
From Figure 2, it is clear that there is a discernible separation between the malware and benign
software classes. Notably, there is an overlap between the vulnerable and control malware. This is to be
expected as the patterns and behavior in malware should be consistent between samples. The same is
expected and seen in the vulnerable and control benign samples. There is, however, an existing overlap
between the vulnerable malware and vulnerable benign samples. This suggests that PCA may be grouping
these two classes based on what we hope are features related to their vulnerabilities.
The analysis derived from Figure 2 translates well to Figure 3. However, there is even more discernible
separation, especially between the different samples within each class, seen from the prominent clusters
formed. Similar to before, the malware and benign software classes are separated. However, the
grouping for vulnerable malware and vulnerable benign software is more noticeable, which again we
can assume to be features that capture their vulnerabilities.
21 https://top-attack-techniques.mitre-engenuity.org/methodology
22 https://github.com/lief-project/LIEF
[Figure 2 scatterplot: PCA projection (PCA1 vs PCA2) of the four classes nonvulnerable_benign, vulnerable_benign, vulnerable_malware, and nonvulnerable_malware.]
Figure 2: Visualization of the data distribution using PCA as a scatterplot.
From those visualizations, it can be concluded that our dataset represented as Ember feature vectors
will allow for the effective separation of malware and benign samples and the training of a malware
detector. In contrast, the identification of vulnerabilities within a program may prove a more
challenging problem.
4. Model Benchmarks
4.1. Model Setup
As previously described, a comprehensive set of common ML-based models were used, including
LightGBM (LGBM)23 , Random Forest (RF)24 , K-Nearest Neighbors (KNN)25 , Support Vector Machine
(SVM)26 , and an Artificial Neural Network (ANN), specifically a Multi-layer Perceptron (MLP)27 . By
using these models, which operate on different principles and assumptions, we can observe distinct
perspectives on the dataset and provide a comprehensive set of baseline models that the scientific
community can use to benchmark their novel approaches.
In our experimental setup, these classifiers were trained initially with an 80:20 train and test split.
Additionally, we implemented stratified K-fold28 as a way to mitigate any overfitting that may exist
in our models and provide a more comprehensive and realistic evaluation. The reasoning for using a
23 https://lightgbm.readthedocs.io/en/stable/
24 https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
25 https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
26 https://scikit-learn.org/stable/modules/svm.html
27 https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
28 https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
[Figure 3 scatterplot: t-SNE projection (TSNE1 vs TSNE2) of the four classes nonvulnerable_benign, vulnerable_benign, vulnerable_malware, and nonvulnerable_malware.]
Figure 3: Visualization of the data distribution using t-SNE as a scatterplot.
stratified K-fold is that it ensures each fold maintains approximately the same percentage of samples for
each target class as in the entire dataset, which is especially important given the class imbalance in which
the vulnerable malware class is underrepresented. We used K=5 folds, and the cross-validation model
provides train/test indices to split the data into train/test sets by default. After the results were calculated,
we took the mean across folds for each estimator to obtain the final values. Performance metrics such as accuracy,
weighted F1-score, weighted precision, and weighted recall were calculated to evaluate the efficacy of
the models on both default parameters and hyperparameter optimization. As for stratified K-fold, the
accuracy metric was replaced with balanced_accuracy, defined as the average of recall obtained on each
class, specifically designed for dealing with unbalanced classes.
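A minimal sketch of this evaluation setup with scikit-learn is shown below; it assumes X and y (the Ember feature matrix and labels for one of the tasks listed next) have been prepared, and it uses LightGBM as one representative baseline rather than reproducing the full model zoo.

```python
from sklearn.model_selection import StratifiedKFold, cross_validate
from lightgbm import LGBMClassifier

cv = StratifiedKFold(n_splits=5)   # preserves per-class proportions in every fold
scoring = {
    "balanced_accuracy": "balanced_accuracy",
    "f1": "f1_weighted",
    "precision": "precision_weighted",
    "recall": "recall_weighted",
}

# X, y: feature matrix and labels for one of the ten tasks listed below (assumed prepared).
results = cross_validate(LGBMClassifier(), X, y, cv=cv, scoring=scoring)
summary = {metric: results[f"test_{metric}"].mean() for metric in scoring}
print(summary)   # mean of each metric across the five folds
```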
A set of ten different tasks were addressed and validated using our dataset, which includes:
• Vulnerable Malware vs Non-vulnerable Malware in Table 12
• Vulnerable Malware vs Vulnerable Benign in Table 13
• Vulnerable Malware vs Non-vulnerable Benign in Table 14
• Vulnerable Benign vs Non-vulnerable Benign in Table 15
• Vulnerable Benign vs Non-vulnerable Malware in Table 16
• Non-vulnerable Benign vs Non-vulnerable Malware in Table 17
• Vulnerable Malware vs All Benign (VB+NVB) in Table 18
• Vulnerable Benign vs All Malware (VM+NVM) in Table 19
• Malware (VM+NVM) vs Benign (VB+NVB) in Table 10
• Multi-class (All four classes: VM, NVM, VB and NVB) in Table 11
Hyperparameter optimization was applied to provide the best possible performance for each machine
learning technique. While some studies use default settings and parameters, such as those in EMBER[5],
a better approach involves extensive hyperparameter tuning, as seen in the UCSB Packed Malware
dataset[30]. Our model parameter settings align with these approaches. Subsequently, we implemented
hyperparameter optimization using gridsearch29 for all models. These parameters can be seen in Table 9.
Additionally, we used the optimized parameters when implementing the estimators for cross-validation.
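As an illustration of this tuning step, the sketch below runs a grid search for the KNN baseline; the grid values are illustrative (they merely include the settings reported in Table 9), and the full grids for LGBM, RF, SVM and the MLP would follow the same pattern.

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Illustrative grid for the KNN baseline; not the authors' exact search space.
param_grid = {
    "n_neighbors": [3, 5, 7],
    "weights": ["uniform", "distance"],
    "metric": ["minkowski", "manhattan"],
}

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    scoring="f1_weighted",
    cv=StratifiedKFold(n_splits=5),
    n_jobs=-1,
)
search.fit(X_train, y_train)          # X_train/y_train from the 80:20 split (assumed prepared)
print(search.best_params_)            # e.g. the settings reported in Table 9
best_knn = search.best_estimator_     # reused as the estimator for cross-validation
```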
Table 9
Parameters used for both multi-class and binary tasks obtained from Hyperparameter Optimization.

| Classifier | Multi-class | Binary |
|---|---|---|
| LGBM | 'bagging_fraction': 0.8, 'feature_fraction': 0.9, 'learning_rate': 0.1, 'max_bin': 20, 'max_depth': 30, 'min_data_in_leaf': 20, 'min_sum_hessian_in_leaf': 0, 'n_estimators': 200, 'num_leaves': 24, 'objective': 'multiclass', 'subsample': 0.01 | 'bagging_fraction': 0.8, 'feature_fraction': 0.9, 'learning_rate': 0.1, 'max_bin': 20, 'max_depth': 5, 'min_data_in_leaf': 80, 'min_sum_hessian_in_leaf': 0, 'n_estimators': 200, 'num_leaves': 80, 'subsample': 0.01 |
| RF | 'bootstrap': False, 'max_depth': 70, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 800 | 'bootstrap': False, 'max_depth': 30, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400 |
| KNN | 'metric': 'manhattan', 'n_neighbors': 5, 'weights': 'distance' | 'metric': 'minkowski', 'n_neighbors': 5, 'weights': 'distance' |
| SVM | 'C': 100, 'gamma': 'scale', 'kernel': 'rbf' | 'C': 100, 'gamma': 'scale', 'kernel': 'rbf' |
| ANN | 'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'adam' | 'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'adam' |
4.2. Binary Classification
Table 10
Benchmarks on different models for a binary malware detection task. The classes used were all malware samples vs all benign samples.

| Technique | Model | Accuracy | F1-Score | Precision | Recall |
|---|---|---|---|---|---|
| Default Parameters | LGBM | 0.984 | 0.984 | 0.984 | 0.984 |
| | RF | 0.977 | 0.977 | 0.977 | 0.977 |
| | KNN | 0.979 | 0.979 | 0.979 | 0.979 |
| | SVM | 0.949 | 0.949 | 0.951 | 0.949 |
| | ANN | 0.976 | 0.976 | 0.976 | 0.976 |
| Hyperparameter Optimization | LGBM | 0.983 | 0.983 | 0.983 | 0.983 |
| | RF | 0.980 | 0.980 | 0.980 | 0.980 |
| | KNN | 0.981 | 0.981 | 0.981 | 0.981 |
| | SVM | 0.980 | 0.980 | 0.980 | 0.980 |
| | ANN | 0.974 | 0.974 | 0.975 | 0.974 |

| Technique | Model | Balanced Accuracy | F1-Score | Precision | Recall |
|---|---|---|---|---|---|
| Stratified K-Fold | LGBM | 0.952 | 0.950 | 0.963 | 0.952 |
| | RF | 0.951 | 0.950 | 0.963 | 0.952 |
| | KNN | 0.949 | 0.947 | 0.961 | 0.949 |
| | SVM | 0.948 | 0.947 | 0.960 | 0.948 |
| | ANN | 0.941 | 0.939 | 0.953 | 0.941 |
Firstly, combining all classes into a binary task (malware vs. benign) allows us to benchmark our
dataset against others that exist in [4, 5, 6, 7] for malware detection. This not only aligns with the
conventional practices seen in the literature but also simplifies this classification task so that we can
analyze a more straightforward model performance.
29 https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
4.3. Multi-class Classification
Table 11
Benchmarks on different models using default parameters, hyperparameter optimization and stratified K-Fold cross-validation with the dataset for a multi-class classification task.

| Technique | Model | Accuracy | F1-Score | Precision | Recall |
|---|---|---|---|---|---|
| Default Parameters | LGBM | 0.957 | 0.955 | 0.956 | 0.957 |
| | RF | 0.940 | 0.934 | 0.936 | 0.940 |
| | KNN | 0.942 | 0.942 | 0.940 | 0.942 |
| | SVM | 0.918 | 0.898 | 0.879 | 0.918 |
| | ANN | 0.934 | 0.925 | 0.931 | 0.934 |
| Hyperparameter Optimization | LGBM | 0.960 | 0.958 | 0.958 | 0.960 |
| | RF | 0.940 | 0.933 | 0.938 | 0.940 |
| | KNN | 0.935 | 0.933 | 0.931 | 0.935 |
| | SVM | 0.948 | 0.946 | 0.945 | 0.948 |
| | ANN | 0.938 | 0.936 | 0.934 | 0.938 |

| Technique | Model | Balanced Accuracy | F1-Score | Precision | Recall |
|---|---|---|---|---|---|
| Stratified K-Fold | LGBM | 0.855 | 0.956 | 0.957 | 0.958 |
| | RF | 0.790 | 0.935 | 0.939 | 0.942 |
| | KNN | 0.830 | 0.941 | 0.940 | 0.943 |
| | SVM | 0.843 | 0.948 | 0.948 | 0.951 |
| | ANN | 0.822 | 0.939 | 0.940 | 0.943 |
Subsequently, we can compare the results from Table 10 with the results we have obtained in Table 11
to examine how the ML classifiers cope with the added complexity of distinguishing vulnerabilities.
Across all tasks and techniques, the LGBM model consistently outperformed others, achieving an
average F1-Score of between 0.929 and 0.999. Nevertheless, all models demonstrated strong performance,
especially considering the notable variance in class sizes and the similar data structure between related
classes, i.e., the vulnerable and non-vulnerable malware classes or the vulnerable and non-vulnerable
benign classes. This tells us that
these models can find patterns in the data that can recognise if a sample is either 1) malware or benign
and 2) vulnerable or non-vulnerable. One alternative method for establishing model benchmarks could
involve utilizing representation learning[53, 50, 54, 55]. This approach aims to differentiate samples
that share similarities in the feature space. This could be a viable research direction as the t-SNE plots
in Figure 3 show some overlap among the VM, VB, and NVM classes.
5. Discussion and Conclusion
5.1. Finding Vulnerabilities in Malware: Applications and Ethical Considerations
The public disclosure of vulnerabilities in commercial and open-source programs has sparked conversa-
tions and led to the creation of responsible disclosure programs. These programs provide guidelines
for reporting and cataloguing vulnerabilities. For instance, CISA KEV[56] is a catalogue of Known
Exploited Vulnerabilities that adds an entry only when a vulnerability has been assigned a common
identifier for publicly known cybersecurity vulnerabilities, is being actively exploited, and has clear remedia-
tion guidance. On the other hand, VulnCheck KEV[57] adds vulnerabilities to its catalogue as long
as the vulnerability is publicly reported as exploited in the wild. Unlike CISA KEV, VulnCheck KEV
does not have additional criteria such as an identifier and clear remediation guidance. It is still unclear
whether there might be a similar requirement for vulnerability disclosure in malware. Currently, the
Malvuln project is leading that charge by cataloguing vulnerabilities found in malware and providing
exploitation details.
The dataset and benchmarks in this research focus on using exploitable vulnerabilities in malware for
offensive security in defense technology, rather than public disclosure of malware vulnerabilities. While
it is known that attackers exploit vulnerabilities in benign software, we are looking at the opposite
scenario, where defenders can identify vulnerabilities in malware to enhance threat intelligence for
cyber defense. The dataset aims to advance research on using exploitable malware vulnerabilities as a
defense layer against malware.
5.2. Future Work
The proof-of-concept (POC) vulnerabilities in malware within this dataset are supplied solely by Malvuln.
As this single-author database, although regularly updated, is currently the only source, future work
should focus on obtaining additional vulnerable samples so that the impact of sample size on the
performance metrics can be investigated, which is necessary for research into vulnerability detection in
malware.
One potential enhancement to the CWE mapping of malicious binaries from Section 3.4 is the
incorporation of CWE chains and composites [58], as sketched at the end of this section. This involves
identifying the relationships between the different CWEs in a vulnerable sample, which can be implicit,
named, or composite as defined by MITRE's knowledge base. By identifying these relationships, we can
understand how weaknesses combine to create vulnerabilities, which can in turn inform the process of
generating automated exploits; focusing on only one weakness in a chain, or one composite component,
limits a comprehensive understanding of the vulnerability. Red Hat has adopted the practice of chaining
multiple CWEs together for the root cause analysis of security flaws in its products, aiming to go
beyond tracking a single root cause, which is the current approach to CWEs in the industry [59, 60].
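As a purely illustrative sketch (this structure is not part of the released dataset), a chained CWE
annotation could record ordered cause-consequence pairs per sample, for example MITRE's named chain
CWE-680 (an integer overflow, CWE-190, leading to a buffer overflow, CWE-119):

# Illustrative sketch only (not part of the released dataset): annotating a sample
# with an ordered CWE chain rather than a single root-cause CWE.
from dataclasses import dataclass, field

@dataclass
class CWEChain:
    name: str                                   # named chain, e.g. "CWE-680", if one exists
    links: list = field(default_factory=list)   # ordered (cause, consequence) CWE pairs

annotation = {
    "sha256": "<sample hash placeholder>",
    "cwe_chain": CWEChain(name="CWE-680", links=[("CWE-190", "CWE-119")]),
}
print(annotation["cwe_chain"])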
5.3. Conclusion
This paper presents a new PE malware dataset that leverages the use of Malvuln to open new doors
in malware vulnerability research. Our dataset introduces information unavailable in the other com-
plementary PE malware datasets, such as the presence and annotation of vulnerabilities, obfuscation
techniques or ATT&CK. Our contribution brings together the vulnerabilities found in malware from
the Malvuln dataset and vulnerabilities in benign software from the ExploitDB database, CWE mapping
in both vulnerable classes, Mitre ATT&CK Tactics and Techniques from VulDB, and analysis of the use
of obfuscation across all classes. Additionally, our dataset utilizes Ember to extract the static PE features
from all compatible samples to form feature vectors across the four classes to be used for Machine
Learning. We obtained benchmark and baseline results on binary malware and benign classifiers,
vulnerability detection and multi-class (vulnerable/non-vulnerable, malware/benign) classifiers using
these features. From our results, we can draw several observations: across all tasks, the classifiers could
effectively and accurately distinguish 1) malware from benign software and, to a lesser degree, 2)
vulnerable from non-vulnerable samples. We also provide an
in-depth analysis of the threat and vulnerability mapping and the correlation of vulnerabilities between
malware and benign software. By providing these insights and a deep analysis of the importance of
exploitable malware vulnerabilities and the potential of their identification as a defensive mechanism,
we hope that our contributions encourage further studies into exploitability for defense.
Acknowledgments
We wish to acknowledge funding from the UK Government through the New Deal for Northern Ireland.
The funding is delivered on behalf of the Northern Ireland Office and the Department for Science,
Innovation and Technology by Innovate UK. Domhnall Carlin is funded by the UKRI EPSRC Research
Software Engineering Fellowship (EP/V052284/1).
References
[1] A. Elhadi, M. Maarof, A. Hamza Osman, Malware detection based on hybrid signature behaviour
application programming interface call graph, American Journal of Applied Sciences 9 (2012)
283–288.
[2] R. Wang, D.-G. Feng, Y. Yang, P.-R. Su, Semantics-based malware behavior signature extraction
and detection method, Journal of Software 23 (2012) 378–393. doi:10.3724/SP.J.1001.2012.
03953.
[3] A. Jalilian, Z. Narimani, E. Ansari, Static signature-based malware detection using opcode and
binary information, in: Data Science: From Research to Application, Springer, 2020, pp. 24–35.
doi:10.1007/978-3-030-37309-2_3.
[4] L. Yang, A. Ciptadi, I. Laziuk, A. Ahmadzadeh, G. Wang, Bodmas: An open dataset for learning
based temporal analysis of pe malware, in: 2021 IEEE Security and Privacy Workshops (SPW),
2021, pp. 78–84. doi:10.1109/SPW53761.2021.00020.
[5] H. S. Anderson, P. Roth, EMBER: an open dataset for training static PE malware machine learning
models, CoRR abs/1804.04637 (2018). URL: http://arxiv.org/abs/1804.04637. arXiv:1804.04637.
[6] R. E. Harang, E. M. Rudd, SOREL-20M: A large scale benchmark dataset for malicious PE detection,
CoRR abs/2012.07634 (2020). URL: https://arxiv.org/abs/2012.07634. arXiv:2012.07634.
[7] R. Ronen, M. Radu, C. Feuerstein, E. Yom-Tov, M. Ahmadi, Microsoft Malware Classification
Challenge, 2018. URL: http://arxiv.org/abs/1802.10135, arXiv:1802.10135 [cs].
[8] M. Henriquez, Bugs in malware creating backdoors for security
researchers, 2021. URL: https://www.securitymagazine.com/articles/
96348-bugs-in-malware-creating-backdoors-for-security-researchers, date accessed: 07-06-2024.
[9] J. Caballero, P. Poosankam, S. McCamant, D. Babić, D. Song, Input generation via decomposition
and re-stitching: Finding bugs in malware, in: Proceedings of the 17th ACM conference on
Computer and communications security, 2010, pp. 413–425. doi:10.1145/1866307.1866354.
[10] N. Singh, U. Pratap Singh, VB2021 paper: Bugs in malware – uncovering vulnerabilities
found in malware payloads, Virus Bulletin (2021) 1–14. URL: https://vblocalhost.com/uploads/
VB2021-Singh-Singh.pdf.
[11] A. Anubhav, Crash and Burn :: How to crash a Mirai C2 server & why it works., 2019. URL:
https://www.ankitanubhav.info/post/crash.
[12] L. Abrams, Vulnerabilities allow hijacking of most ransomware to prevent
file encryption, 2022. URL: https://www.bleepingcomputer.com/news/security/
lockbit-30-introduces-the-first-ransomware-bug-bounty-program/, date accessed: 15-06-
2024.
[13] J. Page, Ransomlord anti-ransomware exploit tool., 2024. URL: https://github.com/malvuln/
RansomLord, date accessed: 15-06-2024.
[14] I. Ilascu, Conti, revil, lockbit ransomware bugs exploited to block encryption, 2022.
URL: https://web.archive.org/web/20220601204439/https://www.bleepingcomputer.com/news/
security/conti-revil-lockbit-ransomware-bugs-exploited-to-block-encryption/, date accessed: 15-
06-2024.
[15] E. Kovacs, Vulnerabilities allow hijacking of most ransomware to prevent file encryp-
tion, 2022. URL: https://web.archive.org/web/20220504180432/https://www.securityweek.com/
vulnerabilities-allow-hijacking-most-ransomware-prevent-file-encryption/, date accessed: 15-06-
2024.
[16] A. Calleja, J. Tapiador, J. Caballero, The malsource dataset: Quantifying complexity and code
reuse in malware development, IEEE Transactions on Information Forensics and Security 14 (2018)
3175–3190. doi:10.1109/TIFS.2018.2885512.
[17] J. Rosenberg, C. Beek, Examining code reuse reveals undiscovered links among north ko-
rea’s malware families, 2018. URL: https://www.mcafee.com/blogs/other-blogs/mcafee-labs/
examining-code-reuse-reveals-undiscovered-links-among-north-koreas-malware-families/, date
Accessed: 09-07-2024.
[18] MalwareTech, Finding the kill switch to stop the spread of ransomware, 2017. URL: https://www.
ncsc.gov.uk/blog-post/finding-kill-switch-stop-spread-ransomware-0, date Accessed: 07-06-2024.
[19] J. Fuller, R. P. Kasturi, A. Sikder, H. Xu, B. Arik, V. Verma, E. Asdar, B. Saltaformaggio, C3po:
large-scale study of covert monitoring of c&c servers via over-permissioned protocol infiltration,
in: Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security,
2021, pp. 3352–3365. doi:10.1145/3460120.3484537.
[20] SecurityScorecard, When hackers get hacked: A cybersecurity triumph, 2023. URL: https://app.
daily.dev/posts/HpFtuXxKC, date Accessed: 07-06-2024.
[21] A. Gdanski, L. Kessem, From thanos to prometheus: When ransomware encryption goes wrong,
2021. URL: https://securityintelligence.com/posts/ransomware-encryption-goes-wrong/, date ac-
cessed: 07-06-2024.
[22] P. Arntz, Oops! black basta ransomware flubs encryption, 2024. URL: https://www.threatdown.
com/blog/oops-black-basta-ransomware-flubs-encryption/, date accessed: 07-06-2024.
[23] A. Anubhav, Several iot botnet c2s compromised by a threat actor due to weak credentials., 2019.
URL: https://www.ankitanubhav.info/post/c2bruting, date Accessed: 07-06-2024.
[24] S. Walla, C. Rossow, Malpity: Automatic identification and exploitation of tarpit vulnerabilities in
malware, in: 2019 IEEE European Symposium on Security and Privacy (EuroS&P), IEEE, 2019, pp.
590–605. doi:10.1109/EuroSP.2019.00049.
[25] H. Griffioen, C. Doerr, Could you clean up the internet with a pit of tar? investigating tarpit
feasibility on internet worms, in: 2023 IEEE Symposium on Security and Privacy (SP), IEEE, 2023,
pp. 2551–2565. doi:10.1109/SP46215.2023.10179467.
[26] G. Apruzzese, P. Laskov, E. Montes de Oca, W. Mallouli, L. Brdalo Rapa, A. V. Grammatopoulos,
F. Di Franco, The role of machine learning in cybersecurity, Digital Threats: Research and Practice
4 (2023) 1–38. doi:10.1145/3545574.
[27] J. Singh, J. Singh, A survey on machine learning-based malware detection in executable files,
Journal of Systems Architecture 112 (2021) 101861. URL: https://www.sciencedirect.com/science/
article/pii/S1383762120301442. doi:10.1016/j.sysarc.2020.101861.
[28] D. Ucci, L. Aniello, R. Baldoni, Survey of machine learning techniques for malware analysis,
Computers & Security 81 (2019) 123–147. doi:10.1016/j.cose.2018.11.001.
[29] R. Sihwail, K. Omar, K. Z. Ariffin, A survey on malware analysis techniques: Static, dynamic,
hybrid and memory analysis, Int. J. Adv. Sci. Eng. Inf. Technol 8 (2018) 1662–1671. doi:10.18517/
ijaseit.8.4-2.6827.
[30] H. Aghakhani, F. Gritti, F. Mecca, M. Lindorfer, S. Ortolani, D. Balzarotti, G. Vigna, C. Kruegel,
When malware is packin’heat; limits of machine learning classifiers based on static analysis
features, in: Network and Distributed Systems Security (NDSS) Symposium 2020, 2020.
[31] R. Sihwail, K. Omar, K. A. Zainol Ariffin, S. Al Afghani, Malware Detection Approach Based
on Artifacts in Memory Image and Dynamic Analysis, Applied Sciences 9 (2019) 3680. URL:
https://www.mdpi.com/2076-3417/9/18/3680. doi:10.3390/app9183680.
[32] K. A. Roundy, B. P. Miller, Hybrid analysis and control of malware, in: Recent Advances in Intrusion
Detection: 13th International Symposium, RAID 2010, Ottawa, Ontario, Canada, September 15-17,
2010. Proceedings 13, Springer, 2010, pp. 317–338. doi:10.1007/978-3-642-15512-3_17.
[33] B. Cheng, J. Ming, J. Fu, G. Peng, T. Chen, X. Zhang, J.-Y. Marion, Towards paving the way
for large-scale windows malware analysis: Generic binary unpacking with orders-of-magnitude
performance boost, in: Proceedings of the 2018 ACM SIGSAC Conference on Computer and
Communications Security, 2018, pp. 395–411. doi:10.1145/3243734.3243771.
[34] O. Alrawi, M. Ike, M. Pruett, R. P. Kasturi, S. Barua, T. Hirani, B. Hill, B. Saltaformaggio, Forecasting
malware capabilities from cyber attack memory images, in: 30th USENIX security symposium
(USENIX security 21), 2021, pp. 3523–3540.
[35] D. G. Corlatescu, A. Dinu, M. P. Gaman, P. Sumedrea, Embersim: A large-scale databank for
boosting similarity search in malware analysis, Advances in Neural Information Processing
Systems 36 (2024).
[36] M. Sebastián, R. Rivera, P. Kotzias, J. Caballero, Avclass: A tool for massive malware labeling,
in: Research in Attacks, Intrusions, and Defenses: 19th International Symposium, RAID 2016,
Paris, France, September 19-21, 2016, Proceedings 19, Springer, 2016, pp. 230–253. doi:10.1007/
978-3-319-45719-2_11.
[37] S. Sebastián, J. Caballero, Avclass2: Massive malware tag extraction from av labels, in: Proceedings
of the 36th Annual Computer Security Applications Conference, 2020, pp. 42–53. doi:10.1145/
3427228.3427261.
[38] R. Lyda, J. Hamrock, Using entropy analysis to find encrypted and packed malware, IEEE Security
& Privacy 5 (2007) 40–45. doi:10.1109/MSP.2007.48.
[39] X. Ugarte-Pedrero, D. Balzarotti, I. Santos, P. G. Bringas, Sok: Deep packer inspection: A longi-
tudinal study of the complexity of run-time packers, in: 2015 IEEE Symposium on Security and
Privacy, IEEE, 2015, pp. 659–673. doi:10.1109/SP.2015.46.
[40] T. Muralidharan, A. Cohen, N. Gerson, N. Nissim, File packing from the malware perspective:
techniques, analysis approaches, and directions for enhancements, ACM Computing Surveys 55
(2022) 1–45. doi:10.1145/3530810.
[41] T. Avgerinos, S. K. Cha, A. Rebert, E. J. Schwartz, M. Woo, D. Brumley, Automatic exploit generation,
Communications of the ACM 57 (2014) 74–84. doi:10.1145/2560217.2560219.
[42] S. Xu, Y. Wang, Bofaeg: Automated stack buffer overflow vulnerability detection and exploit
generation based on symbolic execution and dynamic analysis, Security and Communication
Networks 2022 (2022) 1251987. doi:10.1155/2022/1251987.
[43] V. A. Padaryan, V. Kaushan, A. Fedotov, Automated exploit generation for stack buffer over-
flow vulnerabilities, Programming and Computer Software 41 (2015) 373–380. doi:10.1134/
S0361768815060055.
[44] S. Park, D. Kim, S. Jana, S. Son, {FUGIO}: Automatic exploit generation for {PHP} object injection
vulnerabilities, in: 31st USENIX Security Symposium (USENIX Security 22), 2022, pp. 197–214.
[45] S. Jan, A. Panichella, A. Arcuri, L. Briand, Automatic generation of tests to exploit xml injection
vulnerabilities in web applications, IEEE Transactions on Software Engineering 45 (2017) 335–362.
doi:10.1109/TSE.2017.2778711.
[46] T. Wang, T. Wei, G. Gu, W. Zou, Taintscope: A checksum-aware directed fuzzing tool for automatic
software vulnerability detection, in: 2010 IEEE Symposium on Security and Privacy, IEEE, 2010,
pp. 497–512. doi:10.1109/SP.2010.37.
[47] K. Oosthoek, C. Doerr, Sok: Att&ck techniques and trends in windows malware, in: Security
and Privacy in Communication Networks: 15th EAI International Conference, SecureComm
2019, Orlando, FL, USA, October 23-25, 2019, Proceedings, Part I 15, Springer, 2019, pp. 406–425.
doi:10.1007/978-3-030-37228-6_20.
[48] Picus Security, Picus red report 2024: The top 10 most prevalent mitre att&ck techniques -
the rise of hunter-killer malware, 2024. URL: https://www.picussecurity.com/resource/report/
picus-red-report-2024.
[49] Centre for Threat Informed Defense, MITRE ENGENUITY., Top att&ck techniques, 2023. URL:
https://top-attack-techniques.mitre-engenuity.org/, date accessed: 11-06-2024.
[50] P. Xu, Y. Zhang, C. Eckert, A. Zarras, Hawkeye: cross-platform malware detection with representa-
tion learning on graphs, in: Artificial Neural Networks and Machine Learning–ICANN 2021: 30th
International Conference on Artificial Neural Networks, Bratislava, Slovakia, September 14–17,
2021, Proceedings, Part III 30, Springer, 2021, pp. 127–138. doi:10.1007/978-3-030-86365-4_
11.
[51] A. T. Nguyen, F. Lu, G. L. Munoz, E. Raff, C. Nicholas, J. Holt, Out of distribution data detection
using dropout bayesian neural networks, in: Proceedings of the AAAI Conference on Artificial
Intelligence, volume 36, 2022, pp. 7877–7885. doi:10.1609/aaai.v36i7.20757.
[52] L. Van der Maaten, G. Hinton, Visualizing data using t-sne., Journal of machine learning research
9 (2008).
[53] S. Chakraborty, R. Krishna, Y. Ding, B. Ray, Deep learning based vulnerability detection: Are we
there yet?, IEEE Transactions on Software Engineering 48 (2021) 3280–3296. doi:10.1109/TSE.
2021.3087402.
[54] T. Bilot, N. El Madhoun, K. Al Agha, A. Zouaoui, A survey on malware detection with graph
representation learning, ACM Computing Surveys (2023). doi:10.1145/3664649.
[55] Y. Gao, H. Hasegawa, Y. Yamaguchi, H. Shimada, Malware detection by control-flow graph level
representation learning with graph isomorphism network, IEEE Access 10 (2022) 111830–111841.
doi:10.1109/ACCESS.2022.3215267.
[56] CISA.gov, Reducing the significant risk of known exploited vulnerabilities, 2022. URL: https:
//www.cisa.gov/known-exploited-vulnerabilities, date accessed: 15-06-2024.
[57] Vulncheck Kev, Coverage criteria, 2024. URL: https://docs.vulncheck.com/community/
vulncheck-kev/coverage-criteria, date accessed: 15-06-2024.
[58] MITRE, Chains and composites, 2023. URL: https://cwe.mitre.org/data/reports/chains_and_
composites.html, date Accessed: 16-06-2024.
[59] Red Hat Product Security, cwe-toolkit - cwe chaining concept and tools, 2020. URL: https://github.
com/RedHatProductSecurity/cwe-toolkit, date Accessed: 16-06-2024.
[60] Red Hat Customer Portal, Cwe compatibility for red hat customer portal, 2024. URL: https://access.
redhat.com/articles/cwe_compatibility, date Accessed: 16-06-2024.
[61] K. Allix, T. F. Bissyandé, J. Klein, Y. Le Traon, Are your training datasets yet relevant? an
investigation into the importance of timeline in machine learning-based malware detection, in:
International Symposium on Engineering Secure Software and Systems, Springer, 2015, pp. 51–67.
doi:10.1007/978-3-319-15618-7_5.
[62] F. Pendlebury, F. Pierazzi, R. Jordaney, J. Kinder, L. Cavallaro, {TESSERACT}: Eliminating
experimental bias in malware classification across space and time, in: 28th USENIX security
symposium (USENIX Security 19), 2019, pp. 729–746.
[63] A. Guerra-Manzanares, M. Luckner, H. Bahsi, Concept drift and cross-device behavior: Challenges
and implications for effective android malware detection, Computers & Security 120 (2022) 102757.
doi:10.1016/j.cose.2022.102757.
[64] G. Chin, Correlation vs SHAP: Understanding Feature Impor-
tance in ML Models, 2024. URL: https://medium.com/@gawainchin/
correlation-vs-shap-understanding-feature-importance-in-ml-models-d6b52b1fba28.
[65] K. Främling, Feature Importance versus Feature Influence and What It Signifies for Explainable AI,
2023. URL: http://arxiv.org/abs/2308.03589, arXiv:2308.03589 [cs].
[66] R. Boemer, Selecting Features With Shapley Values, 2023. URL: https://medium.com/
the-ml-practitioner/selecting-features-with-shapley-values-b2da08b5b14c.
[67] R. Kübler, Shapley Values Clearly Explained, 2024. URL: https://towardsdatascience.com/
shapley-values-clearly-explained-a7f7ef22b104.
A. Appendix
A.1. Timestamps
It is important to consider the timeline of a dataset in machine-learning-based malware detection and
classification systems. This helps in capturing temporal dependencies within the data, thus avoiding
experimental bias or incorrect conclusions [61, 62]. Benchmark datasets such as BODMAS, EMBER, and
SOREL-20M release their samples with timestamps. For instance, BODMAS uses the first-seen time of a
sample based on VirusTotal reports, EMBER uses the timestamp from the COFF header to split its dataset,
and SOREL-20M curates the first- and last-seen times of the samples, ultimately using the first-seen
time for temporal splitting of the data. Similar to BODMAS and SOREL-20M, our dataset is annotated
using the first-seen time from VirusTotal reports, as it has been demonstrated to be robust to alteration,
reliable, and accurate compared to other timestamp metrics [63]. The time series graphs in Figures 4, 5,
6, and 7 are scaled between the years 2000 and 2024.
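A minimal sketch of a first-seen temporal split over the dataset's timestamp annotations follows; the
metadata file name and column names are assumptions made for illustration, not the released schema.

# Minimal sketch: temporal train/test split on VirusTotal first-seen timestamps.
# File name and column names are assumed for illustration only.
import pandas as pd

meta = pd.read_csv("pevuln_metadata.csv", parse_dates=["first_seen"])
meta = meta.sort_values("first_seen")
cutoff = meta["first_seen"].quantile(0.8)          # oldest 80% of samples for training
train_ids = meta.loc[meta["first_seen"] <= cutoff, "sha256"]
test_ids = meta.loc[meta["first_seen"] > cutoff, "sha256"]
print(len(train_ids), "training samples,", len(test_ids), "test samples")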
[Plot: frequency of Vulnerable Malware samples over time (x-axis: Timestamp, y-axis: Frequency), 2000 to 2024.]
Figure 4: Timeseries graph for Vulnerable Malware from 2000 to 2024.
[Plot: frequency of Vulnerable Benign samples over time (x-axis: Timestamp, y-axis: Frequency), 2000 to 2024.]
Figure 5: Timeseries graph for Vulnerable Benign from 2000 to 2024.
[Plot: frequency of Non-vulnerable Malware samples over time (x-axis: Timestamp, y-axis: Frequency), 2000 to 2024.]
Figure 6: Timeseries graph for Non-vulnerable Malware from 2000 to 2024.
[Plot: frequency of Non-vulnerable Benign samples over time (x-axis: Timestamp, y-axis: Frequency), 2000 to 2024.]
Figure 7: Timeseries graph for Non-vulnerable Benign from 2000 to 2024.
A.2. Binary Classification
Tables 12 to 19 report the benchmarks for each pairwise binary classification task between the four
classes, using the same models and evaluation techniques as above.
A.3. Feature Importance
Feature importance is a concept in machine learning that quantifies the contribution of each feature
to the model's prediction, so that we can explain which features influence the model's output. Various
studies use methods such as correlation analysis [64] and Shapley values [65, 66, 67] to assess feature
importance. Correlation measures the strength and direction of a linear relationship, on a scale from -1
to 1, between a feature and the target variable, offering a straightforward but overly simplistic view.
Shapley values, in contrast, originate in cooperative game theory and explain a model's output in terms
of its input features; they provide a more nuanced and comprehensive evaluation by considering all
possible combinations of features and their interactions, bounded only by the output magnitude range
of the model. This ensures a fair distribution of importance among features, which not only identifies
key predictive features but also aids in interpreting the model and its results. We focused on using
Shapley values (SHAP) when working with feature importance for our classes, as it also has many
methods for plotting graphs
Table 12
Benchmarks on different models for a binary classification task. The classes used were vulnerable malware and
non-vulnerable malware
Technique Model Accuracy F1-Score Precision Recall
Default LGBM 0.943 0.934 0.940 0.943
Parameters RF 0.941 0.929 0.943 0.941
KNN 0.921 0.912 0.909 0.921
SVM 0.907 0.865 0.885 0.910
ANN 0.920 0.918 0.916 0.920
Hyperparameter LGBM 0.945 0.938 0.942 0.945
Optimization RF 0.940 0.930 0.939 0.940
KNN 0.931 0.924 0.923 0.931
SVM 0.934 0.929 0.927 0.934
ANN 0.929 0.923 0.921 0.929
Technique Model Balanced Accuracy F1-Score Precision Recall
Stratified LGBM 0.754 0.942 0.946 0.948
K-Fold RF 0.699 0.929 0.939 0.940
KNN 0.718 0.923 0.922 0.930
SVM 0.770 0.938 0.938 0.943
ANN 0.725 0.928 0.932 0.935
Table 13
Benchmarks on different models for a binary classification task. The classes used were vulnerable malware and
vulnerable benign
Technique Model Accuracy F1-Score Precision Recall
Default LGBM 0.944 0.944 0.944 0.944
Parameters RF 0.934 0.933 0.934 0.934
KNN 0.929 0.929 0.928 0.929
SVM 0.909 0.907 0.908 0.909
ANN 0.922 0.920 0.922 0.922
Hyperparameter LGBM 0.932 0.931 0.931 0.932
Optimization RF 0.942 0.941 0.941 0.942
KNN 0.934 0.934 0.934 0.934
SVM 0.937 0.936 0.937 0.937
ANN 0.922 0.920 0.921 0.922
Technique Model Balanced Accuracy F1-Score Precision Recall
Stratified LGBM 0.913 0.929 0.930 0.929
K-Fold RF 0.910 0.926 0.927 0.926
KNN 0.899 0.914 0.916 0.914
SVM 0.912 0.931 0.931 0.932
ANN 0.909 0.926 0.927 0.926
in its library to better visualize the data. SHAP summary plots typically use a gradient colour scale
from red to blue that corresponds to the raw value of each feature for each instance, while the horizontal
position of each point is the SHAP value, i.e. the feature's impact on the model output (a minimal
sketch of producing such plots follows this list):
• Red points indicate high feature values; their SHAP value shows whether that high value pushed
the model's prediction up or down.
• Blue points indicate low feature values, interpreted in the same way.
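The sketch below shows how such summary plots can be generated for a multi-class LGBM model; the
data are synthetic stand-ins, and the return type of shap_values differs between SHAP versions.

# Minimal sketch: SHAP summary (beeswarm) plots for a multi-class LGBM model,
# one plot per class as in Figure 8. Synthetic stand-in data only.
import shap
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=50, n_informative=15,
                           n_classes=4, random_state=0)
model = LGBMClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
# shap_values: list of per-class arrays (older SHAP) or a single 3-D array (newer SHAP)
for cls in range(4):
    vals = shap_values[cls] if isinstance(shap_values, list) else shap_values[:, :, cls]
    shap.summary_plot(vals, X, max_display=15)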
Table 14
Benchmarks on different models for a binary classification task. The classes used were vulnerable malware and
non-vulnerable benign
Technique Model Accuracy F1-Score Precision Recall
Default LGBM 0.996 0.996 0.996 0.996
Parameters RF 0.996 0.996 0.996 0.996
KNN 0.999 0.999 0.999 0.999
SVM 0.995 0.995 0.995 0.995
ANN 0.995 0.995 0.995 0.995
Hyperparameter LGBM 0.997 0.997 0.997 0.997
Optimization RF 0.996 0.996 0.996 0.996
KNN 0.999 0.999 0.999 0.999
SVM 0.998 0.998 0.998 0.998
ANN 0.998 0.998 0.998 0.998
Technique Model Balanced Accuracy F1-Score Precision Recall
Stratified LGBM 0.989 0.996 0.996 0.996
K-Fold RF 0.967 0.991 0.991 0.991
KNN 0.983 0.995 0.995 0.995
SVM 0.990 0.996 0.996 0.996
ANN 0.990 0.996 0.996 0.996
Table 15
Benchmarks on different models for a binary classification task. The classes used were vulnerable benign and
non-vulnerable benign
Technique Model Accuracy F1-Score Precision Recall
Default LGBM 0.999 0.999 0.999 0.999
Parameters RF 0.996 0.996 0.996 0.996
KNN 0.999 0.999 0.999 0.999
SVM 0.997 0.997 0.997 0.997
ANN 0.999 0.999 0.999 0.999
Hyperparameter LGBM 0.999 0.999 0.999 0.999
Optimization RF 0.998 0.998 0.998 0.998
KNN 0.999 0.999 0.999 0.999
SVM 0.999 0.999 0.999 0.999
ANN 0.999 0.999 0.999 0.999
Technique Model Balanced Accuracy F1-Score Precision Recall
Stratified LGBM 0.998 0.998 0.998 0.998
K-Fold RF 0.993 0.996 0.996 0.996
KNN 0.995 0.997 0.997 0.997
SVM 0.995 0.996 0.996 0.996
ANN 0.995 0.996 0.996 0.996
It is worth noting that SHAP gives insights into the model's behaviour in the context of the data used,
but causal relationships or generalizable patterns on unseen data could be overlooked. SHAP can also
surface unexpected relationships between features and predictions, which may indicate data issues or
model artefacts.
From the SHAP results in Figure 8, which are derived from the LGBM model performing multi-class
classification on the data, several observations can be made. The first is that feature 637 had a high
positive impact on both the vulnerable and non-vulnerable malware predictions, whereas
Table 16
Benchmarks on different models for a binary classification task. The classes used were vulnerable benign and
non-vulnerable malware
Technique Model Accuracy F1-Score Precision Recall
Default LGBM 0.971 0.971 0.971 0.971
Parameters RF 0.958 0.958 0.958 0.958
KNN 0.958 0.958 0.958 0.958
SVM 0.946 0.945 0.945 0.946
ANN 0.952 0.953 0.958 0.952
Hyperparameter LGBM 0.970 0.970 0.970 0.970
Optimization RF 0.970 0.967 0.967 0.967
KNN 0.960 0.960 0.960 0.960
SVM 0.968 0.968 0.969 0.968
ANN 0.963 0.964 0.964 0.963
Technique Model Balanced Accuracy F1-Score Precision Recall
Stratified LGBM 0.953 0.973 0.973 0.973
K-Fold RF 0.931 0.962 0.962 0.962
KNN 0.933 0.961 0.961 0.962
SVM 0.953 0.968 0.969 0.968
ANN 0.923 0.956 0.956 0.956
Table 17
Benchmarks on different models for a binary classification task. The classes used were non-vulnerable benign vs
non-vulnerable malware
Technique Model Accuracy F1-Score Precision Recall
Default LGBM 0.998 0.998 0.998 0.998
Parameters RF 0.996 0.996 0.996 0.996
KNN 0.996 0.996 0.996 0.996
SVM 0.998 0.998 0.998 0.998
ANN 0.997 0.997 0.997 0.997
Hyperparameter LGBM 0.999 0.999 0.999 0.999
Optimization RF 0.998 0.998 0.998 0.998
KNN 0.996 0.996 0.996 0.996
SVM 0.998 0.998 0.998 0.998
ANN 0.997 0.997 0.997 0.997
Technique Model Balanced Accuracy F1-Score Precision Recall
Stratified LGBM 0.997 0.997 0.997 0.997
K-Fold RF 0.995 0.995 0.996 0.995
KNN 0.995 0.995 0.995 0.995
SVM 0.997 0.997 0.997 0.997
ANN 0.996 0.996 0.996 0.996
feature 637 had a significantly lower positive impact on the non-vulnerable benign prediction, with
more weight in the negative direction. After investigating this feature, we found that it comes from the
COFF File Header, specifically the machine type30. Two machine types were identified: I386 and AMD64.
I386 machines consistently dominate across all classes compared to AMD64, with 557/3 in VM, 1,408/4 in
VB, 32,615/213 in NVM, and 2,910/2,222 in NVB. From this distribution we can see that, while I386
machines are predominantly more common in the VM, VB, and NVM classes, the NVB
30 https://learn.microsoft.com/en-us/windows/win32/debug/pe-format
Table 18
Benchmarks on different models for a binary classification task. The classes used were vulnerable malware vs all
benign samples
Technique Model Accuracy F1-Score Precision Recall
Default LGBM 0.983 0.983 0.983 0.983
Parameters RF 0.979 0.979 0.979 0.979
KNN 0.974 0.975 0.975 0.974
SVM 0.971 0.971 0.971 0.971
ANN 0.980 0.980 0.981 0.980
Hyperparameter LGBM 0.980 0.980 0.981 0.980
Optimization RF 0.979 0.979 0.979 0.979
KNN 0.977 0.977 0.978 0.977
SVM 0.982 0.982 0.982 0.982
ANN 0.981 0.981 0.982 0.981
Technique Model Balanced Accuracy F1-Score Precision Recall
Stratified LGBM 0.930 0.937 0.979 0.922
K-Fold RF 0.902 0.932 0.974 0.917
KNN 0.921 0.935 0.977 0.920
SVM 0.926 0.940 0.978 0.926
ANN 0.920 0.934 0.978 0.919
Table 19
Benchmarks on different models for a binary classification task. The classes used were vulnerable benign vs all
malware samples
Technique Model Accuracy F1-Score Precision Recall
Default LGBM 0.967 0.967 0.967 0.967
Parameters RF 0.962 0.961 0.961 0.962
KNN 0.950 0.950 0.950 0.950
SVM 0.943 0.943 0.942 0.943
ANN 0.959 0.958 0.958 0.959
Hyperparameter LGBM 0.962 0.962 0.962 0.962
Optimization RF 0.962 0.962 0.962 0.962
KNN 0.954 0.954 0.954 0.954
SVM 0.960 0.960 0.960 0.960
ANN 0.950 0.951 0.953 0.950
Technique Model Balanced Accuracy F1-Score Precision Recall
Stratified LGBM 0.940 0.963 0.964 0.964
K-Fold RF 0.924 0.957 0.957 0.958
KNN 0.920 0.953 0.954 0.954
SVM 0.941 0.959 0.960 0.958
ANN 0.924 0.954 0.955 0.954
class has a relatively higher proportion of AMD64 machines, which explains the data distribution in
Figure 2. A broader comparison is made in Table 20, which shows the distribution of the top 15 SHAP
features for each class.
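For context, the underlying field can be read directly from a PE file; a minimal sketch using the lief
library (which EMBER uses for parsing), where the file path is a placeholder:

# Minimal sketch: reading the COFF machine type behind feature 637 with lief.
import lief

binary = lief.PE.parse("sample.exe")        # placeholder path
if binary is not None:
    print(binary.header.machine)            # e.g. MACHINE_TYPES.I386 or MACHINE_TYPES.AMD64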
[SHAP summary plots, one panel per class (clean_benign, vulnerable_benign, vulnerable_malware, clean_malware in lgbm): each panel ranks the top features by SHAP value (impact on model output), with points coloured by feature value from low to high.]
Figure 8: SHAP Summary Plot of Feature Contributions for the Multi-Class Classification Task in LGBM Model
Table 20
Distribution of the top 15 SHAP features per class, with the corresponding feature description.
Feature NVB VB NVM VM Feature Description
89 X ByteHistogram-88
91 X ByteHistogram-90
95 X ByteHistogram-94
106 X ByteHistogram-105
116 X ByteHistogram-115
132 X X ByteHistogram-131
182 X ByteHistogram-181
196 X ByteHistogram-195
204 X ByteHistogram-203
396 X ByteEntropyHistogram-139
502 X ByteEntropyHistogram-245
503 X ByteEntropyHistogram-246
505 X ByteEntropyHistogram-248
509 X ByteEntropyHistogram-252
520 X frequency of printable characters - $
589 X frequency of printable characters - i
590 X frequency of printable characters - j
601 X frequency of printable characters - u
612 X entropy of the byte histogram
613 X X occurrences of the string "c:\" (ignore case)
616 X occurrences of "MZ"
618 X virtual size of the lief parsed binary
623 X whether the binary has a "Resources" object
626 X X X X # of Symbols
637 X X X hash COFF machine type - 9
654 X hash optional subsystem - 6
657 X hash optional subsystem - 9
658 X X hash optional dll_characteristics - 0
677 X X X hash optional magic number - 9
679 X X minor_image_version
681 X minor_linker_version
683 X X minor_operating_system_version
691 X # of sections with empty name
757 X hash on pair of section name and entropy - 13
784 X hash on pair of section name and entropy - 40
968 X hash on list of unique imported libraries - 24
986 X hash on list of unique imported libraries - 42
1561 X hash on list of library:function - 361
1627 X hash on list of library:function - 427
1936 X hash on list of library:function - 736
2355 X virtual size of data directories - 1
2359 X X X virtual size of data directories - 3
2370 X size of data directories - 9
2371 X X virtual size of data directories - 9
2373 X virtual size of data directories - 10