=Paper=
{{Paper
|id=Vol-3920/paper10
|storemode=property
|title=Ransomware Evolution: Unveiling Patterns Using HDBSCAN
|pdfUrl=https://ceur-ws.org/Vol-3920/paper10.pdf
|volume=Vol-3920
|authors=Prajna Bhandary,Robert J. Joyce,Charles Nicholas
|dblpUrl=https://dblp.org/rec/conf/camlis/BhandaryJN24
}}
==Ransomware Evolution: Unveiling Patterns Using HDBSCAN==
<pdf width="1500px">https://ceur-ws.org/Vol-3920/paper10.pdf</pdf>
<pre>
                         Ransomware Evolution: Unveiling Patterns Using HDBSCAN⋆
                         Prajna Bhandary¹,*, Robert J. Joyce¹ and Charles Nicholas¹
                         ¹ University of Maryland, Baltimore County, 1000 Hilltop Cir, Maryland 21250


                                        Abstract
                                        This research presents an innovative approach to enhancing ransomware detection by leveraging Windows
                                        API calls and PE header information to develop precise signatures capable of identifying ransomware families.
                                        Our methodology introduces a novel application of hierarchical clustering using the HDBSCAN algorithm,
                                        in conjunction with the Jaccard similarity metric, to cluster ransomware into discrete families and generate
                                        corresponding signatures. This technique, to our knowledge, marks a pioneering effort in applying hierarchical
                                        density-based clustering to over 1.1 million malicious samples, specifically focusing on ransomware and using
                                        the clusters to automatically generate signatures.
                                            We show that identifying unique Windows API function patterns within these clusters enables the differenti-
                                        ation and characterization of various ransomware families. Furthermore, we conducted a case study focusing
                                        on the distinctive function combinations within prominent ransomware families such as GandCrab, WannaCry,
                                        Cerber, Gotango, and CryptXXX, unveiling unique behaviors and API function usage patterns. Our scalable
                                        implementation demonstrates the ability to efficiently cluster large volumes of malicious files and automatically
                                        generate robust, actionable function signatures for each. Validation of these signatures on an independent
                                        malware dataset yielded a precision rate of 98.34% and specificity rate of 99.72%, affirming their effectiveness
                                        in detecting known ransomware families with minimal error. These findings underscore the potential of our
                                        methodology in bolstering cybersecurity defenses against the evolving landscape of ransomware threats.

                                        Keywords
                                        Ransomware, HDBSCAN, API call


                         1. Introduction
                         Ransomware has emerged as a formidable and destructive threat to individuals, organizations, and even
                         governments worldwide. Designed to block access to files or a computer system until a sum of money is
                         paid, this type of malicious software has continued to grow in both sophistication and impact, resulting
                         in significant disruptions and financial losses. As of 2023, over 72% of businesses worldwide have been
                         affected by ransomware attacks [1]. A Palo Alto Networks study found a 50% surge in ransomware
                         attack announcements on leak sites in 2023, indicating a significant rise in ransomware incidents and
                         underscoring the continually-changing ransomware threat landscape [2].
                            The dynamic nature of ransomware calls for additional research and analysis to understand their
                         evolving patterns and to develop effective countermeasures. Ransomware actors continually update
                         their malware to add new functionality and to evade detection by antivirus and other endpoint security
                         products. New families of ransomware emerge frequently, and it is not uncommon for existing
                         ransomware families to be “rebranded”. Furthermore, it’s crucial to recognize that ransomware affiliates
                         may change their tactics, switch between families, or even manage several families concurrently. This is
                         done to avoid scrutiny and potentially target the same victims again under different guises [3]. Because
                         of these factors, the naming of ransomware families and the tracking of ransomware campaigns over
                         time is challenging and often a source of confusion in reporting.
                            To identify individual families of ransomware, the traditional approach has been antivirus signature
                         detection. Antivirus companies deploy signatures that search for unique patterns (typically byte
                         sequences) which occur within files belonging to a particular malware family [4]. Although they are
                         effective against known threats, antivirus signatures cannot accurately detect novel types of malware,
                         and they may fail to detect older malware families which utilize common evasion techniques (e.g.
                         polymorphism and packing). In these instances, antivirus products may fall back to heuristics and
                         machine learning detection methods, but these methods cannot always reliably identify ransomware
                         families.

                          CAMLIS’24: Conference on Applied Machine Learning for Information Security, October 24–25, 2024, Arlington, VA
                          * Corresponding author.
                          Email: prajnab1@umbc.edu (P. Bhandary); joyce8@umbc.edu (R. J. Joyce); nicholas@umbc.edu (C. Nicholas)
                          © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                          CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

   The core of our analysis revolves around the examination of Windows API functions, which serve
as indicators of ransomware behavior and intent. We observe that files within a given ransomware
family tend to import a common set of functions which can uniquely identify files belonging to that
family. Our work proposes a strategy for automatic ransomware signature generation which utilizes
unique combinations of imported functions rather than traditional byte patterns. We note that, unlike
traditional byte pattern signatures, our function signatures are robust against polymorphic evasion
techniques since they are unaffected by code alterations.
   Our methodology automatically identifies groups of malware belonging to the same family using
the HDBSCAN hierarchical clustering algorithm, complemented by a custom distance metric based
on function imports and PE metadata similarity [5]. We re-implemented large portions of an existing
HDBSCAN library to support sparse matrices, allowing us to cluster over 600,000 ransomware files [6, 7].
   This methodology enables the granular differentiation of ransomware families, laying the groundwork
for the creation of precise and actionable signatures based on unique sets of imported functions. Our
experiments show that our function signatures can accurately detect ransomware and distinguish
between different ransomware families.


2. Background
We briefly review the terminology we use in this paper, including ransomware categories, Windows
API functions, and PE header metadata.

2.1. Categories of Ransomware
Diverse ransomware families, each with unique characteristics and attack methodologies, have pro-
liferated across the digital landscape. Understanding the distinctions among these families is crucial,
as it enables cybersecurity professionals to develop targeted defense mechanisms. Familiarity with
the operational nuances of various ransomware strains, such as their encryption tactics, propagation
methods, and communication with command and control servers, provides essential insights necessary
for effective threat detection and mitigation.
   Ransomware can be broadly classified into several key types [8], [9]: file-encrypting
ransomware (e.g. Cryptolocker, WannaCry, and Locky), which encrypts valuable files and demands a
ransom for the decryption keys; locker ransomware (e.g. earlier versions of Petya and Satana), which
denies the user access to the infected device; scareware (e.g. Reveton), which feigns authority to extort
victims; and doxware (e.g. Jigsaw), which threatens to leak sensitive information. Additionally,
Ransomware as a Service (RaaS) platforms like GandCrab exemplify the commercialization of these attacks,
enabling a broader range of actors to deploy ransomware. Understanding these types aids in devising
targeted defense strategies against the multifaceted ransomware threat landscape.

2.2. The Windows API
The Windows Application Programming Interface (API), also known as WinAPI, is Microsoft’s core set
of API functionality on Windows. WinAPI consists of numerous functions that allow developers to
handle many low-level tasks, such as creating and managing windows and dialogues, processing user
input, managing files and directories, and controlling various aspects of the system’s operation. These
functions are grouped into different libraries, such as USER32.DLL for user interface operations and
KERNEL32.DLL for system operations [10].

Table 1
Comparison of previous research works
          Reference                    # of malware    Method              Ransomware focused
          Alazab et al. (2010) [11]    66,703          Static Analysis     -
          Walker et al. (2023) [12]    4,533           Dynamic Analysis    -
          Daeef et al. (2022) [13]     7,107           Dynamic Analysis    -
          Mowri et al. (2022) [14]     1,550           Dynamic Analysis    ✓
          Anand et al. (2022) [15]     653             Dynamic Analysis    ✓
          Our work                     627,298         Static Analysis     ✓


   The Windows API is important in malware analysis because it offers deep insights into how malware
interacts with the Windows operating system. Malware utilizes these APIs to perform a variety of
operations such as procuring system information, executing processes, and manipulating files or
registry entries. By leveraging specific API calls, malware can maintain persistence, propagate, and
evade detection, all while conducting its intended activities. Analyzing API calls used by malware
provides critical clues about its functionality and behavior, helping malware analysts understand the
malware’s objectives, measure its impact on the infected system, and identify strategies for mitigation
or removal.

2.3. The PE Header
The Portable Executable (PE) header plays a crucial role in malware analysis as it contains essential
information about the executable file’s structure and behavior. This header, specific to executable files
for the Windows operating system, includes metadata about a file, such as its size, architecture, and
the locations of its code and data segments. Malware analysts scrutinize the PE header to identify
suspicious characteristics that might indicate malicious intent. For instance, anomalies in the section
names, sizes, or the presence of unusual attributes can signal potential malware. Additionally, the PE
header includes information about the file’s dependencies (e.g. the libraries it requires), which can reveal
attempts to exploit particular vulnerabilities or perform malicious activities. By analyzing the PE header,
experts can gain insight into the malware’s functionality, origin, and potential impact, aiding in the
development of effective detection and mitigation strategies.


3. Related Work
In the rapidly evolving field of malware analysis, leveraging API call sequences to understand and
classify malware behavior has become increasingly prevalent. However, despite the advancements,
there remains a gap in the application of unsupervised clustering methodologies to these sequences,
especially on the scale of our dataset, which exceeds one million malicious files. Several studies have
specifically focused on ransomware detection using API calls similar to our study; however, none have
addressed it with a dataset of a size comparable to ours. Mowri et al. (2022) [14] harnessed supervised
machine learning to achieve a high accuracy rate in classifying Crypto and Locker types of ransomware
by extracting potential features through dynamic analysis. While their reported accuracy of 99.15% is
impressive, the scope of their dataset, despite being larger than some, still did not approach the size
of ours, potentially limiting the generalizability of their findings. Similarly, Anand et al. (2022) [15]
focused on identifying key API calls through feature importance methods to build robust classification
models. Their study employed both dynamic and static analysis techniques, highlighting the significance
of API call analysis in detecting ransomware, yet their dataset was considerably smaller.
   Broader malware analysis work using API calls includes Alazab et al. (2010) [11], which laid the
groundwork in this domain by analyzing malware behavior through API call sequences. Despite its
contributions, this study did not leverage machine learning techniques and was limited by a relatively
small dataset. Walker et al. (2023) [12] performed dynamic analysis on malware samples and employed
cosine similarity to discern variants within malware families, addressing the issue of duplicate API calls.
Nevertheless, these approaches did not incorporate unsupervised clustering, which might unveil more
nuanced insights from the data. Furthermore, the work of Daeef et al. (2022) [13] highlights the critical
role of API calls in malware classification, employing Jaccard similarity and visualization to extract
meaningful patterns from API sequences. Their API sequences were extracted by executing malware in
a sandbox. While this approach enhanced classification accuracy, it still did not leverage the potential
of unsupervised learning nor did it analyze data at the scale we have.
   Research has also explored the use of the HDBSCAN clustering technique. For instance, Azteni et al.
(2018) [16] used HDBSCAN to cluster Android malware, utilizing features from APK file metadata and
dynamic analysis. In a similar fashion, Rahat et al. (2024) employed HDBSCAN to cluster IoT malware by
analyzing static features, such as opcode sequences and control flow graphs, to effectively identify and
categorize various malware families. In contrast, our research approach relies on static analysis alone,
for the sake of scalability.
   Expanding our horizons beyond API call clustering, research by Raff et al. (2020) [17] on clustering
techniques for YARA rule generation emphasizes the versatility of unsupervised learning in cybersecu-
rity applications. Their findings underscore the potential of such methods to improve threat detection
mechanisms.
   In Table 1, we summarize the previous works related to API call analysis and the dataset sizes used.
Our work is distinguished by its scale and focus on ransomware. To our knowledge, no prior work has
applied hierarchical density-based clustering to a ransomware dataset of over 1.1 million malicious
files. By integrating diverse clustering methodologies, our research aims to bridge the gap in the current
literature, leveraging the scalability and robustness of unsupervised clustering techniques to analyze
large volumes of malware data. This approach not only enhances the detection and classification of
ransomware families but also provides actionable insights through the generation of robust function
signatures.


4. Methodology
This section presents a detailed account of the methodology used to automatically generate signatures
for ransomware. After extracting Windows API functions and PE metadata from a large corpus of
malware, our methodology follows three major steps, as shown in Figure 1.
   First, distances between pairs of files are measured as a function of the Jaccard similarity between
imported functions and PE metadata. We hypothesize that PE metadata, although subject to spurious
correlations, combined with imported API functions, provides a broader context of the malware’s
behavior and structure. Specifically, API calls are critical since ransomware, like any other malware,
ultimately relies on these calls to execute its malicious activities. Despite the potential for evasion
through dead code or runtime loading, our approach seeks to capture the core functionalities that
ransomware cannot avoid.
   Second, these distances are stored in a sparse matrix and clustered using HDBSCAN. Given the large
size of our dataset, with over 1.1 million malicious files, we re-implemented significant portions of the
HDBSCAN algorithm to support a sparse data format. This re-implementation allows us to handle the
high dimensionality and sparsity of our data more efficiently than the original dense format.
   Finally, a YARA rule is generated for each large cluster identified by HDBSCAN. The generated
signatures match ransomware files which share a unique combination of imported functions. The
YARA rule generation process is crucial for practical application, as it translates the clustering results
into actionable detection rules. For each cluster, we identify common API functions that appear in a
significant portion of the samples, specifically focusing on those functions that are not overly generic
but rather indicative of ransomware behavior.
   To clarify, the YARA rules include both “common” and “rare” functions within the cluster. Common
functions are those present in more than 5% of the samples, serving as the core indicators for the rule.
Conversely, functions imported by 5% or fewer of the malicious ransomware samples are considered to
be “rare”. We later show that this threshold, although initially an ad-hoc selection, permits an acceptable
balance between detection precision and recall.

Figure 1: Workflow diagram

   By integrating these steps, our methodology aims to leverage the robustness of unsupervised clus-
tering techniques to analyze and categorize an unprecedented volume of malware data, ultimately
enhancing our ability to detect and classify ransomware.

4.1. Dataset Preparation
We are using a subset of the Sophos-ReversingLabs 20 Million dataset (SoReL-20M) for this research
[18]. SoReL-20M is a large-scale dataset consisting of pre-extracted features, metadata, and labels for
approximately 20 million files. The SoReL dataset is roughly balanced between malicious
and benign files, and the malicious files are tagged according to 11 different behaviors. We selected
the 1,152,354 malicious files with the “ransomware” tag from SoReL-20M for use in the following
experiments. In addition, we queried the VirusTotal API to obtain antivirus scan results for each of these
malicious files. These API queries were made in April 2022. We then ran AVClass on these VirusTotal
scans to obtain malware family names for many of the files [19].

4.1.1. Extracting Unique Windows API Function Names
We began our methodology with the careful construction and cleaning of our training dataset.
   SoReL-20M includes lists of functions imported by each file, and the name of the DLL from which each
imported function originates. We identified 15,918 unique functions among the 1,152,354 ransomware
files in SoReL-20M. Functions that correspond to unrelated libraries (not in the Windows API), as well
as functions which were unidentifiable, too short, or had nonsensical names, were discarded. We then
assigned a numeric value to each remaining function. After data cleaning, we were left with 3,464
unique Windows API functions.

4.1.2. Extracting PE Header Information
The raw metadata in SoReL-20M also contains parsed PE header metadata. We selected the following
PE header fields: “Machine”, “Characteristics”, “MajorLinkerVersion”, “MinorLinkerVersion”,
“MajorOperatingSystemVersion”, “MinorOperatingSystemVersion”, “MajorImageVersion”, “MinorImageVersion”,
“MajorSubsystemVersion” and “MinorSubsystemVersion”. These PE header fields describe broad
properties of the file, such as the minimum version of Windows it runs on, the version of the linker,
and flags regarding whether debug, symbol, and relocation information was stripped from the file.
These fields were selected because they are less likely to be changed by minor alterations or by packing
than other parts of the header, such as fields representing offsets and sizes or PE section fields.

4.1.3. Dataset Statistics
After extracting PE header metadata and imported functions from the 1,152,354 ransomware samples
in SoReL, we discarded files which could not be used for our experiments. These included files for
which no PE metadata could be extracted, files with fewer than 10 imported functions, and files which
were not available on VirusTotal. After this process, our dataset for generating ransomware signatures
contained 627,298 ransomware files.

4.2. Measuring Jaccard Similarity Between Files
During the next stage of our methodology, we computed the distance between pairs of potentially-
similar files. The distance metric we developed is based on Jaccard similarity. Let 𝐴 and 𝐵 be two sets.
Then, the Jaccard similarity of 𝐴 and 𝐵 is defined as:
  \[ J(A, B) = \frac{|A \cap B|}{|A \cup B|} \]

  Let 𝑀𝑖 be the set of imports for file 𝑖 and let 𝑃𝑖 be the set of selected PE metadata fields for file 𝑖.
Then, define 𝑑𝑖𝑠𝑡(𝑖, 𝑗) as:
  \[ \mathrm{dist}(i, j) = 1 - \frac{J(M_i, M_j) + J(P_i, P_j)}{2} + \epsilon \]

   More informally, 𝑑𝑖𝑠𝑡(𝑖, 𝑗) is one minus the average of the Jaccard similarity between the sets of
imported functions and the Jaccard similarity between the sets of PE metadata, plus a small 𝜖 term.
This 𝜖 term is necessary for our implementation, where non-sparse elements must be non-zero and
positive. Our custom distance metric requires overlap in both the imported functions and the selected
PE metadata fields in order for a pair of files to be considered similar.
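
   As an illustration, the distance defined above can be sketched in a few lines of Python. This is a
minimal sketch and not the implementation used in our experiments: the value of EPSILON, the
representation of PE metadata as a set of (field, value) pairs, and the convention that two empty sets
have similarity 0 are all assumptions made for the example.

```python
EPSILON = 1e-6  # assumed value; the text only states that the term is small and positive


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| (taken to be 0.0 when both sets are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def dist(imports_i: set, imports_j: set, pe_i: set, pe_j: set,
         eps: float = EPSILON) -> float:
    """One minus the mean of the import and PE-metadata similarities, plus eps."""
    mean_similarity = (jaccard(imports_i, imports_j) + jaccard(pe_i, pe_j)) / 2.0
    return 1.0 - mean_similarity + eps


# Hypothetical files: 3 of 5 distinct imports shared, 1 of 3 distinct PE pairs shared.
m_i = {"CreateFileW", "ReadFile", "CryptEncrypt", "VerQueryValueA"}
m_j = {"CreateFileW", "ReadFile", "CryptEncrypt", "GetProcAddress"}
p_i = {("Machine", 332), ("MajorLinkerVersion", 9)}
p_j = {("Machine", 332), ("MajorLinkerVersion", 10)}
d_ij = dist(m_i, m_j, p_i, p_j)
```

For the two hypothetical files, the similarities are 3/5 and 1/3, so the distance is 1 − (3/5 + 1/3)/2
plus the 𝜖 term. Two identical files are at distance 𝜖 rather than zero.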
   Let 𝑋 be a distance matrix whose entries are 𝑋𝑖𝑗 = 𝑑𝑖𝑠𝑡(𝑖, 𝑗). Due to the size of our ransomware dataset,
a dense matrix of shape 627,298 × 627,298 could not be stored in memory. For the same reason, it would
also not be practical to compute the distances between every pair of files. Because of this, our approach
is selective in choosing pairs of files for which we compute distances. From our set of 3,464 observed
Windows API functions, we identified “rare functions”, which were imported by no more than 5% of the
ransomware samples. We only computed the distance between pairs of files which shared at
least one of these “rare functions”. This serves two purposes: it significantly reduces the number of
comparisons and it ensures that resulting signatures are not simply combinations of common Windows
API functions.
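
   The pair selection described above can be sketched with an inverted index from each rare function
to the files that import it. The sketch below is illustrative only: the function names, the toy data, and
the choice to treat a function imported by exactly 5% of files as rare are assumptions.

```python
from collections import defaultdict
from itertools import combinations

RARE_THRESHOLD = 0.05  # fraction of files; the boundary case (exactly 5%) is an assumption


def rare_functions(imports_by_file: dict, threshold: float = RARE_THRESHOLD) -> set:
    """Return the functions imported by at most `threshold` of all files."""
    counts = defaultdict(int)
    for funcs in imports_by_file.values():
        for func in funcs:
            counts[func] += 1
    n_files = len(imports_by_file)
    return {func for func, count in counts.items() if count / n_files <= threshold}


def candidate_pairs(imports_by_file: dict, rare: set) -> set:
    """Return the (i, j) pairs, with i < j, of files sharing at least one rare function."""
    index = defaultdict(list)  # rare function -> ids of the files importing it
    for file_id, funcs in imports_by_file.items():
        for func in funcs & rare:
            index[func].append(file_id)
    pairs = set()
    for file_ids in index.values():
        pairs.update(combinations(sorted(file_ids), 2))
    return pairs


# Toy data (hypothetical): 40 files that all import ReadFile.
toy = {i: {"ReadFile"} for i in range(40)}
toy[0] |= {"VerQueryValueA"}   # imported by 2 of 40 files: rare
toy[1] |= {"VerQueryValueA"}
for i in (2, 3, 4):
    toy[i] |= {"CryptGenKey"}  # imported by 3 of 40 files: not rare

toy_rare = rare_functions(toy)
toy_pairs = candidate_pairs(toy, toy_rare)
```

In the toy data only files 0 and 1 share a rare function, so a single distance would be computed
instead of the 780 distances a dense comparison of 40 files would need.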
   We used Compressed Sparse Row (CSR) matrices in our implementation for storing the distance
matrix 𝑋 in a sparse format.
   As a dense matrix, 𝑋 would have required ≈1.4 TB of memory with a 32-bit floating point
representation, but our sparse implementation required just 5.24 MB, a saving of more than 250,000×. For
scalability purposes, our implementation supports parallelized distance computations. Using 32 cores
on an AMD Threadripper 3970X CPU, we computed 𝑋 in 21.64 minutes.
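
   To make the storage argument concrete, the sketch below builds the three CSR arrays by hand for a
tiny matrix and reproduces the dense-size arithmetic. It is a pure-Python illustration under assumed
names; a real implementation would normally rely on a library type such as scipy.sparse.csr_matrix
rather than Python lists.

```python
def to_csr(n_rows: int, entries: dict):
    """Convert {(row, col): value} into CSR arrays (data, indices, indptr)."""
    rows = [[] for _ in range(n_rows)]
    for (i, j), value in entries.items():
        rows[i].append((j, value))
    data, indices, indptr = [], [], [0]
    for row in rows:
        for j, value in sorted(row):
            indices.append(j)       # column index of each stored value
            data.append(value)      # the stored (non-zero) value itself
        indptr.append(len(data))    # where the next row starts in data/indices
    return data, indices, indptr


def row_entries(csr, i: int):
    """Return the stored (col, value) pairs of row i."""
    data, indices, indptr = csr
    start, end = indptr[i], indptr[i + 1]
    return list(zip(indices[start:end], data[start:end]))


# Tiny symmetric example: only 4 of the 9 possible entries are stored.
csr = to_csr(3, {(0, 2): 0.5, (2, 0): 0.5, (1, 2): 0.25, (2, 1): 0.25})

# Dense-size arithmetic for the full dataset with 32-bit (4-byte) floats.
n_files = 627_298
dense_bytes = n_files * n_files * 4
dense_tib = dense_bytes / 2**40
```

Only the stored entries consume memory, which is why restricting comparisons to files sharing a rare
function shrinks the matrix so dramatically. The dense figure evaluates to roughly 1.43 TiB (about 1.57
decimal terabytes), consistent with the ≈1.4 TB quoted above.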

4.3. Application of HDBSCAN
Once distances for all eligible pairs of files have been computed, the sparse distance matrix 𝑋
encodes a network of the functional and structural overlaps between a massive collection of ransomware
files. We observe dense, inter-connected regions within this network where all files are members of the
same family. Our approach uses the Hierarchical Density-Based Spatial Clustering of Applications
with Noise (HDBSCAN) algorithm to extract these dense regions as clusters [20, 6].

4.4. Adapting HDBSCAN for Sparse Matrices
Unfortunately, the Python hdbscan library only supports dense matrices. It was necessary for us to
re-implement major portions of HDBSCAN ourselves in order to cluster the ransomware data. Below,
we describe our adaptations to the HDBSCAN algorithm.

4.4.1. HDBSCAN Mutual Reachability Distance
The initial step of HDBSCAN involves determining the “core distance” 𝑐𝑜𝑟𝑒_𝑘(𝑖) for each data point,
given by the distance to the 𝑘-th nearest point. HDBSCAN uses the core distance to approximate the
density of a region around a given data point. We selected 𝑘 = 2 in our approach.
   Once the core distances are established, our implementation calculates a second metric, the mutual
reachability distance, between each pair of points 𝑖, 𝑗 for which 𝑋𝑖𝑗 ≠ 0. The mutual reachability
distance 𝑟𝑒𝑎𝑐ℎ(𝑖, 𝑗) is defined as follows in [6]:
  \[ \mathrm{reach}(i, j) = \max \{ \mathrm{core}_k(i), \mathrm{core}_k(j), \mathrm{dist}(i, j) \} \]

  Mutual reachability in HDBSCAN considers the core distances of both points and the actual distance
between them, artificially increasing the distances between points that are not in dense regions. These
mutual reachability distances are then efficiently stored in a second sparse matrix 𝑍.
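
   A minimal sketch of these two quantities on a sparse neighbor structure follows. It is not our
implementation: the dict-of-dicts layout is an assumption, as are two conventions that the text leaves
open, namely that the point itself is not counted among its neighbors and that a point with fewer than
𝑘 stored neighbors receives an infinite core distance.

```python
import math


def core_distances(neighbors: dict, k: int = 2) -> dict:
    """Core distance of each point: its k-th smallest stored distance (inf if fewer than k)."""
    core = {}
    for i, nbrs in neighbors.items():
        stored = sorted(nbrs.values())
        core[i] = stored[k - 1] if len(stored) >= k else math.inf
    return core


def mutual_reachability(neighbors: dict, core: dict) -> dict:
    """reach(i, j) = max(core(i), core(j), dist(i, j)) for every stored pair with i < j."""
    reach = {}
    for i, nbrs in neighbors.items():
        for j, d in nbrs.items():
            if i < j:  # the structure is symmetric, so each pair is emitted once
                reach[(i, j)] = max(core[i], core[j], d)
    return reach


# Hypothetical symmetric sparse distances between four files.
toy_neighbors = {
    0: {1: 0.1, 2: 0.2},
    1: {0: 0.1, 2: 0.3},
    2: {0: 0.2, 1: 0.3, 3: 0.9},
    3: {2: 0.9},
}
toy_core = core_distances(toy_neighbors, k=2)
toy_reach = mutual_reachability(toy_neighbors, toy_core)
```

In this example the pair (0, 1) has a raw distance of 0.1 but a mutual reachability distance of 0.3,
because file 1 sits in a less dense neighborhood. File 3 has a single stored neighbor, so under the
assumed convention its edge is pushed out to infinity.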

4.4.2. HDBSCAN Robust Single Linkage
Each mutual reachability distance 𝑍𝑖𝑗 ∈ 𝑍 can be thought of as a weighted edge in a graph between
nodes 𝑖 and 𝑗. However, 𝑍 may not represent a fully-connected graph. Our implementation identifies the
connected components of 𝑍 and computes the minimum spanning tree (MST) of each connected com-
ponent. Each MST yields a backbone of connectivity based on the shortest possible mutual reachability
distances between points in the connected component.
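
   The component-wise MST computation can be sketched with Kruskal's algorithm and a union-find
structure, which yields one spanning tree per connected component in a single pass. The choice of
Kruskal's algorithm is an assumption made for illustration; the text does not state which MST
algorithm our implementation uses, and the function name is hypothetical.

```python
def minimum_spanning_forest(n_nodes: int, edges):
    """Kruskal's algorithm. `edges` holds (weight, i, j) tuples.

    Returns (forest_edges, n_components); the forest contains one minimum
    spanning tree for each connected component of the input graph.
    """
    parent = list(range(n_nodes))

    def find(x: int) -> int:
        # Find the set representative, halving the path as we walk up.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for weight, i, j in sorted(edges):   # Kruskal visits edges by increasing weight
        root_i, root_j = find(i), find(j)
        if root_i != root_j:             # the edge joins two different trees: keep it
            parent[root_i] = root_j
            forest.append((weight, i, j))
    # Every accepted edge merges two trees, so components = nodes - accepted edges.
    return forest, n_nodes - len(forest)


# Hypothetical mutual reachability graph with two components: {0, 1, 2} and {3, 4}.
toy_edges = [(0.3, 0, 1), (0.3, 0, 2), (0.5, 1, 2), (0.2, 3, 4)]
toy_forest, toy_components = minimum_spanning_forest(5, toy_edges)
```

The edge (1, 2) is rejected because files 1 and 2 are already connected through file 0 at a lower
cost, and the two components are never merged because no edge links them.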
   The edges of each MST are then sorted from highest to lowest distance. This sorting is preparatory
for the next step, where a single linkage tree is derived from the MST. The single linkage tree represents
hierarchical clustering, where clusters at one level are joined to form clusters at the next level, based on
the closest mutual reachability distance.
   Finally, our implementation uses the Python hdbscan library to condense the single linkage tree,
which involves trimming branches that do not contribute to cluster stability. Cluster stability is an
assessment of the strength and persistence of clusters over different scales of distance. This condensed
tree represents the final output of HDBSCAN, delineating stable clusters that are robust across different
data densities and separations.

4.4.3. Hyperparameter Selection
A series of experiments were conducted to empirically select key hyperparameters, namely the minimum
number of samples per cluster (min_samples) and the value of 𝑘 for core distance calculations. These
experiments were important in identifying the optimal settings that yielded the most coherent and
interpretable clustering results.

import "pe"

rule cluster_10 {
  meta:
    hash1 = "2f746653089765de4158b6dbda..."
    hash2 = "a2c12d3f0af37c9a5769dcb681..."
    hash3 = "3a1af65e3362d6371590099a5b..."
    ...
  condition:
    pe.imports("version.dll", "GetFileVersionInfoSizeA") and
    pe.imports("version.dll", "GetFileVersionInfoA") and
    pe.imports("version.dll", "VerQueryValueA") and
    ...
}

Listing 1: Example generated YARA rule


4.5. YARA Rule Signature Generation
The final stage of our methodology is the generation of function signatures for each cluster which
contains at least 10 files. We do this using YARA rules. Although YARA is typically used for matching
byte patterns in files, YARA rules can be written to match files based solely on their imported functions
using the YARA PE module.
   The process of generating a function signature begins by analyzing each cluster’s aggregated data
and identifying a core set of Windows API functions shared by each of the files in the cluster. In order
to generate a rule, all files in the cluster must import at least six common functions, with one of those
being a “rare function” that is imported by no more than 5% of the ransomware samples in our dataset.
Then, our implementation formats each function that is common to all files in the cluster (and the
name of the DLL from which the function was imported) into the “condition” section of a YARA rule.
Metadata, such as the hashes of each file in the cluster, is formatted into the “meta” section of the rule.
An example generated function signature is shown in Listing 1.
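
   The rule generation step can be sketched as follows. This is an illustrative sketch rather than our
generator: the function names, the argument layout, and the toy cluster are hypothetical, and the check
that a cluster contains at least 10 files is omitted for brevity. The only YARA features it relies on are
the "pe" module import and the pe.imports(dll, function) condition shown in Listing 1.

```python
def common_imports(cluster: list) -> list:
    """Sorted (dll, function) pairs imported by every file in a non-empty cluster."""
    return sorted(set.intersection(*cluster))


def make_rule(name: str, hashes: list, cluster: list, rare: set, min_common: int = 6):
    """Return YARA rule text, or None if the cluster does not qualify for a rule."""
    shared = common_imports(cluster)
    has_rare = any(func in rare for _, func in shared)
    if len(shared) < min_common or not has_rare:
        return None
    meta = "\n".join(f'    hash{n} = "{h}"' for n, h in enumerate(hashes, 1))
    condition = " and\n".join(
        f'    pe.imports("{dll}", "{func}")' for dll, func in shared
    )
    return (
        'import "pe"\n\n'
        f"rule {name} {{\n"
        f"  meta:\n{meta}\n"
        f"  condition:\n{condition}\n"
        "}\n"
    )


# Hypothetical two-file cluster sharing six imports, one of which is rare.
base = {
    ("version.dll", "GetFileVersionInfoSizeA"),
    ("version.dll", "GetFileVersionInfoA"),
    ("version.dll", "VerQueryValueA"),
    ("kernel32.dll", "CreateFileW"),
    ("kernel32.dll", "ReadFile"),
    ("kernel32.dll", "WriteFile"),
}
toy_cluster = [base | {("user32.dll", "MessageBoxW")},
               base | {("advapi32.dll", "RegOpenKeyExW")}]
toy_rule = make_rule("cluster_demo", ["aa11", "bb22"], toy_cluster,
                     rare={"VerQueryValueA"})
```

Imports that appear in only some files of the cluster (MessageBoxW and RegOpenKeyExW in the toy
data) are left out of the condition, and no rule is emitted when none of the shared functions is rare.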


5. Experimental Analysis and Results
In this section, we present a detailed analysis of our methodology’s cluster formation and signature
generation. In addition, we discuss the insights derived from testing our generated signatures on
further ransomware data.

5.1. Cluster Validation
Clustering our dataset of 627,298 ransomware files yielded a total of 751 clusters with size 10 or greater.
Figure 2 shows the cluster to which each sample belongs. We manually investigated each of the 751
clusters. Of these, 386 clusters predominantly featured various types of known ransomware. The remaining
clusters either had incorrect SoReL-20M tags (and after further investigation we found that these clusters
instead were better categorized as RATs, PUPs, trojans, file infectors, or worms) or were groups of generic
ransomware without a known family name.

Figure 2: Distribution of all samples with their cluster.

   We accepted the most common AVClass label as the family name of the cluster, and Figure 3 shows the
distribution of clusters with known ransomware families [19]. Most of the clusters we identified were
uniformly labeled by AVClass or had only minor variance in family labeling. In some cases, this variance
was due to false positives in our methodology, and in others it was due to erroneous AVClass outputs.

Figure 3: Distribution of ransomware and the clusters to which they belong.

   We note that approximately 267 clusters among the clusters featuring ransomware were identified as
containing solely GandCrab ransomware, and these clusters tended to be larger in size, as shown in
Figure 3. Furthermore, our study detected clusters containing CryptXXX, Exxroute, and Tovicrypt, which
appear to be variants of the same malware. Besides these, we also identified other frequently-occurring
ransomware families including Cerber, WannaCry, Satan, Titirez, and Gotango.
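
   The majority-label step used to name each cluster can be sketched as below. The function name and
the returned agreement fraction are additions made for illustration, and representing an unlabeled
file as None is an assumption about how missing AVClass outputs might be encoded.

```python
from collections import Counter


def cluster_family(labels: list):
    """Return (most common family label, fraction of labeled files that carry it).

    Files without a family label are passed as None and ignored; a cluster
    with no labeled files yields (None, 0.0).
    """
    known = [label for label in labels if label is not None]
    if not known:
        return None, 0.0
    family, count = Counter(known).most_common(1)[0]
    return family, count / len(known)


# Hypothetical cluster: two files labeled gandcrab, one cerber, one unlabeled.
toy_family, toy_agreement = cluster_family(["gandcrab", "gandcrab", "cerber", None])
```

An agreement fraction well below 1 flags clusters with mixed labels, which are the ones worth
inspecting manually for clustering errors or erroneous labels.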

5.2. Discussion on Ransomware Family Clusters
Let us discuss the ransomware families we identified in more detail. The fundamental behavior of
ransomware remains consistent: once it gains access to a system, it encrypts the system or files and
demands a ransom. However, our experiments have revealed that each ransomware variant employs
distinct sets of Windows API functions to infiltrate and encrypt files.

5.2.1. Ransomware Function Groupings
Following the approach of Alazab et al. [11] to group API functions into distinct behaviors, we
organized the frequently-occurring Windows API function calls identified across all ransomware
clusters into five function types. Each group of functions is related to a specific behavior common to
most ransomware. We describe each observed set of behaviors below, and more details are available in
Table 4 in Appendix A.

    • Function Type 1: System and Process. Ransomware uses these functions to manipulate system
      resources and processes, allowing it to control system behavior and execute malicious actions
      discreetly.
    • Function Type 2: File and Library Interaction. This category of functions enables ransomware
      to interact with files and to dynamically load functions from external libraries, facilitating tasks
      such as file encryption and enabling access to functionality not listed in the import address table.
    • Function Type 3: GUI and Window Functions. Ransomware employs these functions to display
      ransom notes and manipulate window behavior to pressure users into paying the ransom.
    • Function Type 4: Shell and COM Operations. Ransomware can leverage these functions to execute
      system commands, access system resources, and exploit vulnerabilities through interactions
      with COM objects, enabling deeper infiltration and control of the system.
    • Function Type 5: Utility Functions. This category provides ransomware with tools for memory
      management, string manipulation, and configuration tweaks, enhancing its capabilities in data
      manipulation, evasion, and persistence.
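
As a rough illustration of how the five function types can be applied to a single file, the sketch below
maps imported function names to a type number and counts how many imports fall under each type. The
mapping is only an excerpt of the examples listed in Table 4, and the names FUNCTION_TYPES and
type_profile are hypothetical.

```python
from collections import Counter

# Excerpt of Table 4: Windows API function name -> function type (1-5).
FUNCTION_TYPES = {
    "GetTickCount": 1, "TerminateThread": 1, "VirtualProtect": 1,
    "LoadLibraryW": 2, "LockResource": 2, "GetSystemDirectoryA": 2,
    "PostMessageW": 3, "CreateWindowExA": 3, "TransparentBlt": 3,
    "ShellExecuteW": 4, "CoGetCurrentProcess": 4,
    "GetMailslotInfo": 5, "GetLongPathNameW": 5, "RaiseException": 5,
}

def type_profile(imports):
    """Count how many of a file's imports fall under each function type.

    Functions that are not in the mapping are ignored.
    """
    counts = Counter(
        FUNCTION_TYPES[name] for name in imports if name in FUNCTION_TYPES
    )
    return dict(sorted(counts.items()))

# Hypothetical import list for one file.
sample = ["GetTickCount", "LoadLibraryW", "LockResource",
          "ShellExecuteW", "UnlistedFunction"]
print(type_profile(sample))
```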

5.2.2. Ransomware Family Investigations
In addition to observing these function groupings across all ransomware families in our study, we also
noticed trends in the prevalence of certain functions within each family. We measured the frequency with
which each Windows API function was imported by each of the five most common ransomware families in
our dataset. As we show in Figures 4-8, there are clear differences in how frequently certain functions
are imported by specific families. We provide a survey of the five most common families in our study,
describing which behaviors their commonly-imported Windows API functions may enable them to perform.

Figure 4: Function counts in Gandcrab files
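
A per-family import frequency of this kind can be computed as sketched below. This is an illustrative
sketch only: the input format (a family name paired with a set of imported function names) and the choice
to report a fraction of files rather than a raw count are assumptions, and the sample data is invented.

```python
from collections import Counter, defaultdict

def import_frequency(samples):
    """Fraction of files in each family that import each function.

    samples: iterable of (family, imported_function_names) pairs.
    """
    files_per_family = Counter()
    counts = defaultdict(Counter)
    for family, imports in samples:
        files_per_family[family] += 1
        # Use a set so a function is counted at most once per file.
        counts[family].update(set(imports))
    return {
        family: {fn: n / files_per_family[family] for fn, n in fn_counts.items()}
        for family, fn_counts in counts.items()
    }

# Hypothetical samples; real input would come from parsed import tables.
samples = [
    ("gandcrab", {"GetTickCount", "GetMailslotInfo"}),
    ("gandcrab", {"GetTickCount", "TerminateThread"}),
    ("wannacry", {"LockResource", "DeleteFileA"}),
]
freq = import_frequency(samples)
print(freq["gandcrab"]["GetTickCount"], freq["gandcrab"]["GetMailslotInfo"])
```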

GandCrab. The unique imports utilized by GandCrab are illustrated in Figure 4. An analysis of the API
calls indicates a predominance of behaviors 1, 2, and 5. Functions like “GetMailslotInfo”, “PostMessageW”,
“TerminateThread”, “SetProcessShutdownParameters”, and “GetTickCount” are the most frequent within these
clusters. This pattern suggests that GandCrab ransomware aims to access files and directories and to
establish a connection to its command and control server in order to encrypt the entire machine.

Figure 5: Function counts in WannaCry files

WannaCry. Figure 5 reveals that WannaCry ransomware also employs behaviors 2 and 3. However, WannaCry
distinguishes itself by utilizing additional API functions such as “LockResource”, “GetSystemMenu”,
“InterlockedExchange”, “DeleteFileA”, “RemoveDirectoryA”, and “FindWindowA”, which are not as common in
other ransomware families. WannaCry appears more likely to take destructive actions than other
ransomware, and open-source reporting corroborates our findings: it deletes and overwrites files on the
desktop and in the User folder [21].

Figure 6: Function counts in Cerber files

Cerber. As depicted in Figure 6, Cerber exhibits characteristics common to both GandCrab and
WannaCry. The observed frequency of all function names is consistently high, which suggests that Cerber
samples exhibit a wide variety of behaviors.

Gotango. Figure 7 displays the API functions that Gotango ransomware exploits, which include behaviors
2 and 4. Gotango engages in File and Library Management operations, using functions like “DeleteFileA”
and “GetSystemDirectoryA”, while showing a heightened focus on Shell and COM operations. Functions such
as “Process32Next” and “Process32First” allow the malware to check whether processes of interest are
currently running.

Figure 7: Function counts in Gotango files

CryptXXX. The API calls leveraged by CryptXXX predominantly exhibit behaviors 3 and 5. This ransomware
variant utilizes string-related functions such as “StrCmpLogicalW”, “StrStrNW”, “StrCmpW”, and
“IsBadStringPtrA”, among others. Additionally, it uses functions like “CryptVerifySignatureW” and makes
extensive use of graphics control commands, including “TransparentBlt” and “AlphaBlend”, suggesting a
diverse operational approach.

Figure 8: Function counts in CryptXXX files

5.3. Function Signature Validation

We employed the MalDICT-Behavior dataset to verify the accuracy of our function signature generation
approach [22]. MalDICT-Behavior includes 105,849 malicious files tagged as ransomware. We queried
the VirusTotal API for these files between February and April 2023, and we used AVClass to obtain
malware family labels for them [19]. MalDICT-Behavior does not exclusively contain Windows PE files.
We selected the 386 generated signatures corresponding to ransomware with known family names
(Gandcrab, Cerber, WannaCry, CryptXXX, Exxroute, Tovicrypt, Gotango, Satan and Titirez).
   Then, we scanned the 105,849 ransomware files selected from MalDICT-Behavior using these 386
signatures. A total of 6,667 unique files were matched, and there were 25,912 total signature matches.
This is an average of approximately 3.89 signature matches per detected file, because multiple signatures were generated for
most families. As we later discuss in more detail, it was very rare for files to be detected by signatures
for different families.
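
The scan statistics above (unique matched files, total signature matches, and their ratio) can be
illustrated with a toy scanner. The matching rule used here, that a signature fires when every function
it lists appears among a file's imports, is an assumption made for illustration and is not necessarily
how the generated signatures are evaluated. The data and the name scan are hypothetical.

```python
def scan(files, signatures):
    """Return {file_id: [family of each matching signature]}.

    files: {file_id: set of imported function names}
    signatures: list of (family, frozenset of required function names)
    """
    hits = {}
    for file_id, imports in files.items():
        # Assumed rule: a signature matches if all its functions are imported.
        matched = [family for family, required in signatures if required <= imports]
        if matched:
            hits[file_id] = matched
    return hits

signatures = [
    ("gandcrab", frozenset({"GetMailslotInfo", "TerminateThread"})),
    ("gandcrab", frozenset({"GetTickCount", "PostMessageW"})),
    ("wannacry", frozenset({"LockResource", "DeleteFileA"})),
]
files = {
    "file_a": {"GetMailslotInfo", "TerminateThread", "GetTickCount", "PostMessageW"},
    "file_b": {"LockResource", "DeleteFileA", "FindWindowA"},
    "file_c": {"CreateFileW"},
}

hits = scan(files, signatures)
total_matches = sum(len(families) for families in hits.values())
average = total_matches / len(hits)
print(len(hits), total_matches, average)
```

Applying the same ratio to the reported totals, 25,912 matches over 6,667 files, gives roughly 3.9
matches per detected file.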
   To measure the performance of the scans, we identified 62,308 Windows PE files with a valid Import
Address Table in our dataset of 105,849 files. Of these, 55,701 had AVClass labels. Because it was possible
for a file to be detected by multiple signatures, we strictly defined the conditions that must be met in
order to achieve a true positive:

    • True Positive (TP): The file is detected by at least one signature, and the AVClass family is correct
      (for all signatures, if there are multiple detections).
    • False Positive (FP): The file is detected by a signature, but the AVClass family does not match the
      family the signature is meant to detect.
    • False Negative (FN): The file belongs to one of the families of interest, but no signature detects it.
    • True Negative (TN): The file does not belong to any of the families of interest, and no signature
      matches it.
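
The four outcome definitions above can be expressed as a small decision function. The sketch below is one
reading of those definitions; the function name classify_outcome and the lowercase family names are
illustrative choices, and the set of families of interest is taken from the list given earlier in this
section.

```python
FAMILIES_OF_INTEREST = {
    "gandcrab", "cerber", "wannacry", "cryptxxx", "exxroute",
    "tovicrypt", "gotango", "satan", "titirez",
}

def classify_outcome(true_family, matched_families, families=FAMILIES_OF_INTEREST):
    """Classify one scanned file as TP, FP, FN, or TN.

    true_family: AVClass family label of the file.
    matched_families: family of every signature that matched the file.
    """
    if matched_families:
        # A true positive requires every matching signature to name the right family.
        if all(family == true_family for family in matched_families):
            return "TP"
        return "FP"
    # No signature matched: a miss only if the file belongs to a targeted family.
    return "FN" if true_family in families else "TN"

print(classify_outcome("gandcrab", ["gandcrab", "gandcrab"]))
print(classify_outcome("gandcrab", ["gandcrab", "titirez"]))
print(classify_outcome("cerber", []))
print(classify_outcome("emotet", []))
```

Under this reading, a file matched by signatures for two different families is counted as a false
positive even if one of the signatures names the correct family.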

Table 2
Contingency Table Results

                          Predicted Positive    Predicted Negative
    Actual Positive              6558                  9505
    Actual Negative               109                 39529


Table 3
Classifier Performance Metrics

    Measurement                          Value
    Precision (PPV)                      0.9834
    Recall (TPR)                         0.4082
    Specificity (TNR)                    0.9972
    F1 Score                             0.5765
    False Positive Rate (FPR)            0.0027
    False Negative Rate (FNR)            0.5918
    Positive Predictive Value (PPV)      0.9834
    Negative Predictive Value (NPV)      0.8053


  Table 2 is the resulting contingency table for the 55,701 scanned PE files with AVClass labels. We
computed multiple metrics to evaluate the performance of our generated signatures, shown in Table 3.
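
For reference, the standard definitions of these metrics can be applied directly to the counts in Table 2,
as in the sketch below. The function name confusion_metrics is an illustrative choice. Values recomputed
this way agree with Table 3 to within about 0.001; small differences in the last digits may reflect
rounding in the reported figures.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Derive standard binary-classification metrics from raw counts."""
    precision = tp / (tp + fp)      # also called positive predictive value
    recall = tp / (tp + fn)         # true positive rate
    specificity = tn / (tn + fp)    # true negative rate
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
        "npv": tn / (tn + fn),
    }

# Counts taken from Table 2.
metrics = confusion_metrics(tp=6558, fn=9505, fp=109, tn=39529)
for name, value in metrics.items():
    print(f"{name:12s} {value:.4f}")
```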

5.4. Discussion
In our analysis, we identified just 116 false positives, and only seven files were matched by signatures
for different families. Four files were matched by both Gandcrab and Titirez signatures and three files
were matched by both CryptXXX and Tovicrypt signatures. Since CryptXXX and Tovicrypt are variants,
these false positives are not unexpected.
   Based on the metrics in Table 3, we can conclude that our approach demonstrates satisfactory overall
performance in identifying known ransomware families. The high precision (98.34%) and specificity
(99.72%) values suggest that our signatures have a very low false positive rate. Due to the extreme class
imbalance of malware to benign files, antivirus products place high emphasis on this quality [23].
   Although our signature generation approach prioritizes minimizing false positives, users have the
option of changing multiple hyperparameters, which would increase recall at the expense of lower
precision. Additionally, we believe that some of the false negatives produced by our generated signatures
can be explained by our selected evaluation dataset. The SOREL-20M dataset used for generating
signatures only includes malware captured between 2017 and 2019 [18]. However, the MalDICT dataset
includes malware first captured between 2006 and 2023. There are many ransomware families in
MalDICT which do not appear in SOREL, and multiple ransomware families from our case study have
been under active development since 2019.


6. Conclusion
We have demonstrated a method for generating signatures for known ransomware families using unique
combinations of Windows API functions. Unlike traditional byte pattern signatures, our function signatures
are unaffected by polymorphism and other common evasion techniques which alter the malware’s
code. We created a scalable implementation which can efficiently cluster hundreds of thousands of
malicious files and automatically generate a robust and actionable function signature for each.
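
To make the clustering input concrete, the sketch below computes the Jaccard distance between the import
sets of a few files and assembles a pairwise distance matrix. It is a simplified, pure-Python
illustration with invented data; a matrix like this is quadratic in the number of files, so a scalable
implementation would need a more economical strategy than the one shown here.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets; two empty sets have distance 0."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def distance_matrix(import_sets):
    """Symmetric matrix of pairwise Jaccard distances."""
    n = len(import_sets)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = jaccard_distance(import_sets[i], import_sets[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix

# Hypothetical import sets for three files.
import_sets = [
    {"GetTickCount", "TerminateThread", "GetMailslotInfo"},
    {"TerminateThread", "GetMailslotInfo", "PostMessageW"},
    {"LockResource", "DeleteFileA"},
]
matrix = distance_matrix(import_sets)
print(matrix[0][1], matrix[0][2])
```

Files that share most of their imports end up close together (distance 0.5 for the first pair), while
files with no imports in common are at the maximum distance of 1.0.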
  We performed a case study on the unique function combinations within the well-known GandCrab,
WannaCry, Cerber, Gotango, and CryptXXX families, identifying unique behaviors and API function
usage patterns. Each ransomware family’s distinct approach to system infiltration and file manipulation
was highlighted, providing insights into their operational mechanisms and objectives.
   Ultimately, our contributions underscore the significance of meticulous experimental analysis in
understanding the evolving threat of ransomware. Furthermore, although the scope of our study was
limited to ransomware, our function signature generation method can be applied to any Windows
executable malware. This research helps broaden the understanding of malware trends and offers a
robust framework for automatically identifying and classifying these complex and ever-evolving
threats.


7. Future Work
The current study has demonstrated promising results in using HDBSCAN for clustering ransomware
samples and generating YARA rules. However, there are several areas for further investigation to
enhance the robustness and applicability of our methodology.
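
As a hypothetical illustration of what a YARA rule built from a set of imported functions could look
like, the sketch below renders rule text using the imports function of YARA's pe module. The rule name,
the rule layout, and the pairing of each function with a DLL are assumptions made for this example and do
not reproduce the rules generated in this study.

```python
def build_yara_rule(rule_name, imports):
    """Render YARA rule text requiring every (dll, function) import.

    imports: iterable of (dll_name, function_name) pairs.
    """
    conditions = [
        f'pe.imports("{dll}", "{func}")' for dll, func in sorted(imports)
    ]
    body = " and\n        ".join(conditions)
    return (
        'import "pe"\n\n'
        f"rule {rule_name}\n"
        "{\n"
        "    condition:\n"
        f"        {body}\n"
        "}\n"
    )

# Hypothetical function combination for one cluster.
rule = build_yara_rule(
    "example_cluster_signature",
    [
        ("kernel32.dll", "GetTickCount"),
        ("kernel32.dll", "TerminateThread"),
        ("user32.dll", "PostMessageW"),
    ],
)
print(rule)
```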
   While our signature generation method has shown good accuracy in detecting known ransomware
families, the detection error rate for new malware, especially in broad classes such as trojans and
backdoors, remains high. This indicates that while our approach is effective within the specific domain
of ransomware, its generalizability to other types of malware needs improvement. Future work will
focus on refining our algorithm to reduce this error rate, potentially through the incorporation of more
diverse features and advanced machine learning techniques.
   In addition, benchmarking our methodology against existing state-of-the-art approaches is essential
for placing our findings within the broader context of ransomware detection. Future work will involve
such benchmarking to validate our approach and highlight areas for further refinement. By addressing
these key areas, future research will build upon the foundation laid by our current study, enhancing the
robustness, accuracy, and comprehensiveness of ransomware detection methodologies.


References
 [1] A. Petrosyan, Global firms targeted by ransomware 2023, 2024. URL:
     https://www.statista.com/statistics/204457/businesses-ransomware-attack-rate/.
 [2] D. Santos, Ransomware Retrospective 2024: Unit 42 Leak Site Analysis, 2024. URL:
     https://unit42.paloaltonetworks.com/unit-42-ransomware-leak-site-data-analysis/.
 [3] C. Team, Ransomware Hit $1 Billion in 2023, 2024. URL:
     https://www.chainalysis.com/blog/ransomware-2024/.
 [4] M. Botacin, F. D. Domingues, F. Ceschin, R. Machnicki, M. A. Zanata Alves, P. L. De Geus, A. Grégio,
     AntiViruses under the microscope: A hands-on perspective, Computers & Security 112 (2022)
     102500. URL: https://linkinghub.elsevier.com/retrieve/pii/S0167404821003242.
     doi:10.1016/j.cose.2021.102500.
 [5] L. McInnes, J. Healy, S. Astels, hdbscan: Hierarchical density based clustering, The Journal of
     Open Source Software 2 (2017) 205.
 [6] hdbscan, The hdbscan Clustering Library - hdbscan 0.8.1 documentation, 2016. URL:
     https://hdbscan.readthedocs.io/en/latest/.
 [7] R. Campello, D. Moulavi, J. Sander, Density-based clustering based on hierarchical density
     estimates, volume 7819, 2013, pp. 160–172. doi:10.1007/978-3-642-37456-2_14.
 [8] Crowdstrike, 5 most common types of ransomware - crowdstrike, 2023. URL:
     https://www.crowdstrike.com/cybersecurity-101/ransomware/types-of-ransomware/.
 [9] Kaspersky, Ransomware Attacks and Types – How Encryption Trojans Differ, 2023. URL:
     https://www.kaspersky.com/resource-center/threats/ransomware-attacks-and-types.
[10] GrantMeStrength, Windows API index - Win32 apps, 2023. URL:
     https://learn.microsoft.com/en-us/windows/win32/apiindex/windows-api-list.
[11] M. Alazab, S. Venkataraman, P. Watters, Towards Understanding Malware Behaviour by the
     Extraction of API Calls, in: 2010 Second Cybercrime and Trustworthy Computing Workshop,
     IEEE, Ballarat, Australia, 2010, pp. 52–59. URL: http://ieeexplore.ieee.org/document/5615097/.
     doi:10.1109/CTC.2010.8.
[12] A. Walker, R. M. Shukla, T. Das, S. Sengupta, Runs in the Family: Malware Family Variants
     Identification through API Sequence and Frequency Analysis, in: 2023 International Conference
     on Multimedia Computing, Networking and Applications (MCNA), IEEE, Valencia, Spain, 2023,
     pp. 55–61. URL: https://ieeexplore.ieee.org/document/10185752/.
     doi:10.1109/MCNA59361.2023.10185752.
[13] A. Y. Daeef, A. Al-Naji, J. Chahl, Features Engineering for Malware Family Classification Based
     API Call, Computers 11 (2022) 160. URL: https://www.mdpi.com/2073-431X/11/11/160.
     doi:10.3390/computers11110160.
[14] R. A. Mowri, M. Siddula, K. Roy, Application of Explainable Machine Learning in Detecting and
     Classifying Ransomware Families Based on API Call Analysis, 2022. URL:
     http://arxiv.org/abs/2210.11235, arXiv:2210.11235 [cs].
[15] P. Mohan Anand, P. Sai Charan, S. K. Shukla, A Comprehensive API Call Analysis for Detecting
     Windows-Based Ransomware, in: 2022 IEEE International Conference on Cyber Security
     and Resilience (CSR), IEEE, Rhodes, Greece, 2022, pp. 337–344. URL:
     https://ieeexplore.ieee.org/document/9850320/. doi:10.1109/CSR54599.2022.9850320.
[16] A. Atzeni, F. Díaz, A. Marcelli, A. Sánchez, G. Squillero, A. Tonda, Countering android malware: A
     scalable semi-supervised approach for family-signature generation, IEEE Access 6 (2018)
     59540–59556. doi:10.1109/ACCESS.2018.2874502.
[17] E. Raff, R. Zak, G. L. Munoz, W. Fleming, H. S. Anderson, B. Filar, C. Nicholas, J. Holt, Automatic
     Yara Rule Generation Using Biclustering, in: Proceedings of the 13th ACM Workshop on Artificial
     Intelligence and Security, 2020, pp. 71–82. URL: http://arxiv.org/abs/2009.03779.
     doi:10.1145/3411508.3421372, arXiv:2009.03779 [cs, stat].
[18] R. Harang, E. M. Rudd, SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection,
     2020. URL: http://arxiv.org/abs/2012.07634, arXiv:2012.07634 [cs].
[19] M. Sebastián, R. Rivera, P. Kotzias, J. Caballero, AVclass: A Tool for Massive Malware Labeling,
     in: F. Monrose, M. Dacier, G. Blanc, J. Garcia-Alfaro (Eds.), Research in Attacks, Intrusions, and
     Defenses, volume 9854 of Lecture Notes in Computer Science, Springer International Publishing,
     Cham, 2016, pp. 230–253. URL: http://link.springer.com/10.1007/978-3-319-45719-2_11.
     doi:10.1007/978-3-319-45719-2_11.
[20] L. McInnes, J. Healy, S. Astels, Parameter selection for hdbscan - hdbscan 0.8.1 documentation,
     2016. URL: https://hdbscan.readthedocs.io/en/latest/parameter_selection.html#min-samples-label.
[21] C. Team, How Can Disk Drill Help with Wannacry Ransomware Attack?, 2017. URL:
     https://www.cleverfiles.com/help/wannacry-recovery.html.
[22] R. J. Joyce, E. Raff, C. Nicholas, J. Holt, MalDICT: Benchmark Datasets on Malware Behaviors,
     Platforms, Exploitation, and Packers, 2023. URL: http://arxiv.org/abs/2310.11706,
     arXiv:2310.11706 [cs].
[23] A. T. Nguyen, E. Raff, C. Nicholas, J. Holt, Leveraging Uncertainty for Improved Static Malware
     Detection Under Extreme False Positive Constraints, 2021. URL: http://arxiv.org/abs/2108.04081,
     arXiv:2108.04081 [cs].

A. Appendix

Table 4
API Calls Categorization Examples

 Function Behavior                   Windows API Calls
 Behavior 1: System and Process      GetTickCount (438), SetProcessShutdownParameters (371),
 Functions                           TerminateThread (365), CompareFileTime (221),
                                     GetProcessShutdownParameters (192), GetNativeSystemInfo (178),
                                     VirtualProtect (115), GetVersion (109)
 Behavior 2: File and Library        LoadLibraryW (279), LoadLibraryA (203), LockResource (154),
 Management                          GetHGlobalFromStream (143), GetModuleFileNameW (162),
                                     GetModuleFileNameA (121), GetSystemDirectoryA (109),
                                     CreateStreamOnHGlobal (109)
 Behavior 3: GUI and Window          PostMessageW (413), CreateWindowExA (193), EqualRect (193),
 Management                          ShellAboutA (156), TransparentBlt (142)
 Behavior 4: Shell and COM           ShellExecuteW (6436), ShellExecuteA (246), CoGetCurrentProcess (244),
 Operations                          CoRegisterMallocSpy (167), SHGetSpecialFolderLocation (125),
                                     SHCreateShellItem (125)
 Behavior 5: Utility Functions       GetMailslotInfo (470), GetLongPathNameW (218), GetLongPathNameA (192),
                                     InitCommonControls (187), MapVirtualKeyW (182), IsChild (167),
                                     FindFirstVolumeMountPointW (167), GetCPInfoExA (128),
                                     GetTempPathA (124), AreFileApisANSI (101), RaiseException (225)

</pre>