<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Conference on Applied Machine Learning for Information Security, October</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Navigating Concept Drift and Packing Complexity in Malware Family Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Numan Halit Guldemir</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oluwafemi Olukoya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jesús Martínez-del-Rincón</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Secure Information Technologies (CSIT), Queen's University Belfast</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>2</volume>
      <fpage>4</fpage>
      <lpage>25</lpage>
      <abstract>
        <p>In the rapidly evolving cybersecurity landscape, the classification of malware families presents significant challenges due to the dynamic nature of malware, a phenomenon known as concept drift. This research classifies Windows PE malware families using static analysis of raw opcode sequences. By leveraging Convolutional Neural Networks (CNNs) to extract unique features from these sequences, our approach achieves high classification accuracy rates of 97.93% and 89.55% on the Microsoft Malware Classification Challenge and BODMAS datasets, respectively. We also conducted a temporal analysis on BODMAS over 13 months to observe the evolution of malware families and identify periods where our model's accuracy decreases. We implemented a retraining strategy, allowing us to observe how retraining the model with new data helps it adapt to new malware patterns. The study also examined the impact of packed malware and different types of packers on the model's performance. Our findings indicate that packed malware significantly affects the model's accuracy, with some packers having a more pronounced impact than others. These results underscore the importance of regular model updates and specialized handling of packed malware to maintain robust detection capabilities.</p>
      </abstract>
      <kwd-group>
        <kwd>malware family classification</kwd>
        <kwd>concept drift</kwd>
        <kwd>Windows PE malware</kwd>
        <kwd>temporal analysis</kwd>
        <kwd>packed malware</kwd>
        <kwd>packer</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The importance of detecting malware cannot be overstated, and a major focus lies in identifying family-related threats. As
technology rapidly advances, so does our reliance on software, making us more vulnerable to cyberattacks.
This has led to an increase in harmful software created by attackers. A malware statistics and
trends report by AV-Test [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] indicates that approximately 300,000 new malware samples emerge daily. This rise is
due to the accessibility of malware construction toolkits, which empower even individuals without the
expertise to create malware samples [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These toolkits facilitate the implementation of sophisticated
methods like polymorphism and metamorphism. Consequently, the need to prevent the spread of these
harmful files and effectively detect them remains of utmost importance.
      </p>
      <p>
        One critical challenge in the field of malware detection is the continuous evolution of malware.
Building reliable detection models becomes increasingly challenging as malware develops and changes
over time. This is because when these models come across new or evolved types of malware, they
may have difficulty accurately identifying them. This phenomenon is referred to as concept drift [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
model aging [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or time decay [5] in the literature. Drifted samples are prone to being misclassified by
detection models. Therefore, it is essential to identify and monitor concept drift in these models, as it
enables the detection of when the model is becoming outdated and incapable of effectively identifying
new and emerging malware threats.
      </p>
      <p>
        In this study, the primary focus of the malware detection system is on Windows malicious applications,
commonly referred to as Windows PE malware. This focus is based on the observation that, despite
the increasing prevalence of malware targeting other platforms such as Android, Windows remains
the primary target for malware creators [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Furthermore, Windows PE malware exhibits distinct
characteristics that set it apart from other types of malware. These differences in behavior and structure
make it necessary to develop specialized detection models and strategies tailored to the unique challenges
posed by Windows PE malware.
      </p>
      <p>Research on malware detection spans a spectrum of approaches, from signature-based to
behavior-based methodologies [6]. Conventional approaches relied heavily on signatures and hashes to
identify binaries. Nevertheless, these techniques are susceptible to evasion, as malware creators can
effortlessly manipulate them; for instance, they can alter hashes using methods like polymorphism or
metamorphism [7]. It is important to note that, by adopting these methods, alterations can be made to
the malware’s binary structure, including its hash, without affecting the underlying malicious behavior.
This ability to evade signature-based detection emphasizes the need for more sophisticated approaches
to malware detection.</p>
      <p>Static analysis presents a rapid means to extract characteristics of PE malware. What distinguishes
this approach is its independence from execution, thus rendering it a secure option. Through static
analysis, a range of attributes can be derived from PE malware, encompassing API calls, byte sequences,
string patterns, opcodes, and the inherent characteristics of the PE malware itself [8]. On the other
hand, dynamic malware analysis involves executing malware in controlled environments like sandboxes
to examine its behavior. This method requires more time and resources than static analysis and faces
scalability challenges. Some malware can even detect when running in controlled environments and
adjust its behavior to avoid detection, making dynamic analysis less effective [9]. Another method
known as memory analysis has gained popularity in recent years. This technique has piqued the interest
of malware analysts due to its ability to provide a comprehensive assessment of malware [10, 11, 12, 13].
However, this method can be time-consuming, with generating a single memory dump taking up to 2
minutes, and the resulting large dump files exceeding 1 gigabyte. These challenges intensify resource
requirements and hinder scalability and efficiency in malware detection and analysis [10].</p>
      <p>This research leans towards utilizing static features, mainly because of the limitations of the alternative
methodologies discussed. Within these static attributes, the emphasis on using opcode sequences stems
from their potential to offer valuable insights into the operational characteristics of binary files, all
without necessitating their execution [14]. Opcodes are the fundamental building blocks of executables,
comprising a structured assembly of directives [15]. By examining opcode sequences, important patterns
of behavior can be extracted. This approach eliminates the necessity for real-time execution, making
the analysis process more efficient and secure.</p>
      <p>The contributions of this paper are highlighted below:</p>
      <list list-type="bullet">
        <list-item>
          <p>We perform an initial temporal analysis to observe how the performance of our model changes
over time, providing insights into the presence of potential concept drift.</p>
        </list-item>
        <list-item>
          <p>We conduct preliminary retraining of the model with random samples from specific months to
observe how its performance may improve as it encounters new data patterns over time.</p>
        </list-item>
        <list-item>
          <p>We examine the impact of packed malware and different packers on model performance.</p>
        </list-item>
      </list>
      <p>This paper is organized as follows: Section 2 reviews relevant research on malware detection, malware
family classification and concept drift. Section 3 describes our methodology, including data processing
and model design. Section 4 outlines the datasets and experimental setup. Section 5 discusses our results
and key findings, and Section 6 examines the effects of packed malware. Finally, Section 7 concludes
with insights and future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Opcode-based Malware Family Detection</title>
        <p>
          Opcodes, which represent the fundamental instructions executed by a binary, provide valuable insights
into its behavior and characteristics. One notable study is Bilar’s work, which made significant
contributions to malware detection through opcode analysis [16]. In this study, the author investigated
an approach for identifying malicious code through the statistical analysis of opcode distributions.
The results showed statistical distinction in opcode distributions between malware and non-malicious
software, with infrequent opcodes demonstrating a higher predictive capability. Parildi et al. [17]
utilized opcodes that are obtained at runtime to make a binary classification of PE samples. They used
Word2Vec to generate opcode word embeddings, followed by CNN, RNN and Transformers. McLaughlin
et al. [18] proposed an Android malware detection system using a deep convolutional neural network.
The system performs static analysis of raw opcode sequences from disassembled programs to learn
features indicative of malware. Lu [19] used opcodes of PE malware to determine the malware family
statistically, proposing a two-stage LSTM model for malware detection that uses two LSTM
layers and one mean-pooling layer to obtain feature representations of malware opcode sequences.
The experiments were conducted on a dataset of 969 malicious and 123 benign files. Kakisim et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]
introduced a malware detection method called SOEMD. This approach uses static analysis to detect
malware by creating a new feature space of sequential opcode embeddings through Random Walk and
Skip-Gram techniques. Kale et al. [20] investigated the use of word embedding techniques (HMM2Vec,
Word2Vec, BERT, and ELMo) for malware family classification. They employed four classification
algorithms (k-nearest neighbors, support vector machines, random forests, and convolutional neural
networks) on each word embedding technique. The experiments are conducted on opcode sequences
from seven malware families.
        </p>
        <p>In summary, all these methods demonstrated that combining opcode sequences with deep learning
techniques is powerful and effective for detecting malware as well as identifying the malware
family. However, since they rely on a training set collected at a given time, they are subject to
performance degradation over time. Additionally, these methods did not consider packed samples
and packer techniques, which can significantly impact the performance and effectiveness of malware
detection systems [8].</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Concept Drift</title>
        <p>
          Although our research on concept drift is preliminary, it is essential to highlight the significance of this
phenomenon and recognize the valuable contributions of recent studies in this area. Jordaney et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
designed a framework called TRANSCEND to identify when a classification model is consistently
misclassifying new data, indicating that concept drift has occurred. TRANSCEND uses
statistical metrics to detect concept drift: it employs a non-conformity measure to assess, via a
p-value, how well test samples fit. Building on this, Barbero et al. developed TRANSCENDENT, which
improves the efficiency and reduces the computational cost of the original TRANSCEND framework. Yang
et al. [22] developed a framework called CADE, whose primary objective is to identify
drifting samples that deviate from the existing classes and to explain why the drift occurs.
Ceschin et al. [23] investigated the impact of concept drift on Android malware detection, utilizing
datasets from DREBIN and AndroZoo. They examined how evolving malware characteristics affect
machine learning classifiers and proposed a data stream learning pipeline that updates both classifiers
and feature extractors. The authors of [24] proposed a three-step forensic exploration method to evaluate
concept drift and model performance over time. This approach involves comparing models trained
with different temporal windows, evaluating the influence of temporal information on model drift, and
conducting a detailed misclassification and feature analysis to identify the causes of drift and model
failure.
        </p>
        <p>To sum up, the objectives of our experiments revolve around addressing the following research
questions:</p>
        <p>RQ1: Given the use of a recent dataset, can opcode analysis still effectively provide insights into the
behavior and functionality of various malware families?</p>
        <p>RQ2: How does the performance of our opcode-based classification model change over time when applied
to Windows PE malware, and what patterns can we observe?</p>
        <p>RQ3: What initial steps or observations can be made towards understanding and addressing concept
drift in an opcode-based analysis of Windows PE malware?</p>
        <p>RQ4: How do packed samples and different packers influence the performance of malware family
classification models?</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>We employ a deep learning approach inspired by natural language processing techniques to process the
opcodes. First, we transform these opcodes into word embedding arrays. We then utilize a CNN model
to classify the malware using these transformed representations.</p>
      <sec id="sec-3-1">
        <title>3.1. Disassembly of PE Malware</title>
        <p>
          In this study, we employed the distorm3 disassembler to extract instructions from PE malware samples.
Distorm3 [25] is a disassembler tool that converts binary code into human-readable assembly language.
To streamline our analysis and focus on high-level behavioral patterns, we discarded the operands,
which specify the data to be manipulated by the instructions, from the extracted instructions, following [
          <xref ref-type="bibr" rid="ref2">18, 2, 20, 17</xref>
          ].
        </p>
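        <p>As a rough illustration of the operand-stripping step, the following Python sketch reduces textual instructions to their mnemonics. It is not the authors' code: the helper names mnemonic and opcode_sequence, the prefix list, the choice to skip prefixes, and the assumption that each instruction arrives as one line of text (as a disassembler would print it) are all hypothetical.</p>
        <preformat><![CDATA[
```python
# Hypothetical helper: reduce disassembled instruction text to opcode mnemonics.
# The prefix list and the one-instruction-per-line text format are assumptions.

PREFIXES = {"LOCK", "REP", "REPE", "REPZ", "REPNE", "REPNZ"}


def mnemonic(instruction: str) -> str:
    """Return the opcode mnemonic of one instruction, dropping operands."""
    tokens = instruction.strip().upper().split()
    if not tokens:
        return ""
    # Skip instruction prefixes so that "REP MOVSB" maps to "MOVSB".
    for token in tokens:
        if token not in PREFIXES:
            return token
    # The line consisted only of prefixes; keep the last one.
    return tokens[-1]


def opcode_sequence(instructions):
    """Map instruction strings to a list of mnemonics, skipping empty lines."""
    return [m for m in map(mnemonic, instructions) if m]


if __name__ == "__main__":
    listing = ["PUSH EBP", "MOV EBP, ESP", "REP MOVSB", "CALL 0x401000", "RET"]
    print(opcode_sequence(listing))
```
]]></preformat>
        <p>Whether prefixes should be merged with the following mnemonic or kept as separate tokens is a design choice that changes the vocabulary; the sketch simply picks one option.</p>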
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Packer Detection</title>
        <p>To detect packers in the BODMAS dataset, we employed Detect It Easy [26], a tool
known for its comprehensive capability to identify various types of packers embedded in PE files.
It provides detailed information regarding the presence of packers within PE binaries. Using
Detect It Easy, we were able to identify whether a sample was packed and, if so, which packer was used,
which enabled us to analyze the impact of packed samples and packers on our model’s
performance. This detection step is crucial for understanding how different packing techniques may
affect the performance of our malware classification model and for addressing the specific challenges
posed by these techniques. Because the BIG2015 dataset does not include the actual binaries, which
this tool requires, we were unable to apply this method to it.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Model Architecture</title>
        <p>Leveraging deep learning techniques, we automated the feature extraction process for analyzing
PE malware through opcode sequences. We integrated an opcode embedding layer into our model
architecture. Unlike integer-encoding and one-hot vector representations, which are limited in capturing
complex patterns, the embedding layer enables the model to learn patterns and semantic relationships
within the opcode data. Initially, the weights of the embedding layer are set randomly and updated
during training. This allows the embedding layer to represent opcodes in a continuous vector space,
where similar opcodes are positioned closer together. This approach reduces the dimensionality of the
input data, alleviating computational complexity and improving generalization across various opcode
sequences.</p>
        <p>Building upon the embedding layer, our model includes several key layers to enhance the analysis.
The embedding layer has an embedding dimension of 256. We applied a dropout layer with a rate of
0.25 after the embedding layer to prevent overfitting. The first convolutional layer uses 1024 filters with
a kernel size of 7 and ReLU activation to capture local patterns of consecutive instructions within the
opcode data. Following the convolutional layer, a global max pooling layer reduces dimensionality while
retaining significant features. A dense layer with 1024 units further interprets these features, learning
non-linear combinations. Another dropout layer with a rate of 0.25 is added before the final softmax
classification layer, which translates the learned features into probability distributions across different
malware categories. The model is trained using the Adam optimizer with a learning rate of 1e-4. This
comprehensive structure allows the model to provide insights into the most likely classification for a
given opcode sequence. The model structure is given in Figure 1.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Dataset and Experimental Setup</title>
      <p>This section describes the datasets used and the preprocessing procedures applied before training our
malware family classification model.</p>
      <sec>
        <title>4.1. Dataset 1: BIG2015</title>
        <p>The Microsoft Malware Classification Challenge (BIG 2015) dataset is a common benchmark in the field
[27]. The dataset comprises 10,868 training and 10,873 testing PE malware samples, which are
labelled into 9 different families. Each malware sample provided by Microsoft is represented through
two files: an assembly (.asm) file containing low-level language code, and a corresponding compiled
bytes file containing CPU machine code. Table 1 provides an overview of the distribution of samples
among the different malware families.</p>
      </sec>
      <sec id="sec-4-1">
        <title>4.2. Dataset 2: BODMAS</title>
        <p>The BODMAS dataset is an open dataset comprising 57,293 malware samples and 77,142
benign samples (a total of 134,435) [28]. In this dataset, malicious files were gathered between August
29, 2019, and September 30, 2020, and benign samples were collected from January 1, 2007, to September
30, 2020, aiming to represent the real-world distribution of PE binaries. The dataset also includes family
information for malicious samples, encompassing a total of 581 families. Additionally, the authors have
provided timestamp information indicating the first-seen (also referred to as first-submission) date,
which is based on VirusTotal reports.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.3. Experimental Setup</title>
        <p>We implemented several preprocessing steps before the training and testing process. In the BIG2015
dataset, opcodes were extracted from the .asm files using a regex pattern to isolate only the opcodes. For
the BODMAS dataset, which consists of actual PE binaries, the distorm3 disassembler [25] was employed
to perform the opcode extraction. Operands were removed from the instructions in both datasets to
simplify analysis and capture the essential behavior of the malware. Additionally, the input was limited
to the first 200 opcodes for each sample. This number was determined based on the results of our
preliminary experiments, as shown in Figure 2. This approach helped reduce complexity and provided
a consistent input size for our deep learning model.</p>
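        <p>The truncation to 200 opcodes and the conversion to a fixed-size model input can be sketched as follows. This is an illustrative Python outline rather than the authors' pipeline: the function names build_vocab and encode, the reserved index 0 for padding, and the reserved index 1 for opcodes not seen during training are assumptions.</p>
        <preformat><![CDATA[
```python
# Hypothetical preprocessing sketch: integer-encode opcodes and fix the length.
# Index 0 for padding and index 1 for unseen opcodes are assumptions.

MAX_LEN = 200  # number of leading opcodes kept per sample
PAD, UNK = 0, 1


def build_vocab(training_sequences):
    """Assign an integer id to every opcode seen in the training data."""
    vocab = {}
    for sequence in training_sequences:
        for opcode in sequence:
            vocab.setdefault(opcode, len(vocab) + 2)  # ids 0 and 1 are reserved
    return vocab


def encode(sequence, vocab, max_len=MAX_LEN):
    """Keep the first max_len opcodes, map them to ids, and right-pad with PAD."""
    ids = [vocab.get(opcode, UNK) for opcode in sequence[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


if __name__ == "__main__":
    vocab = build_vocab([["PUSH", "MOV", "CALL"], ["MOV", "RET"]])
    print(encode(["MOV", "XOR", "RET"], vocab, max_len=5))
```
]]></preformat>
        <p>Building the vocabulary from the training data only keeps test-time opcodes from leaking into the encoder; unseen opcodes fall back to the shared unknown index.</p>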
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>We conducted a series of four experiments on PE malware to evaluate our proposed approach
and to perform temporal analysis as well as concept drift analysis and mitigation.
The datasets for our analysis were sourced from the Microsoft Malware Classification Challenge
(BIG 2015) [27] and BODMAS [28]. We first applied our deep learning-based approach using opcode word
embeddings (see Subsection 3.3) to the opcodes of the Microsoft dataset. This initial
investigation aimed to establish a comparative benchmark for our baseline model’s performance in a
domain already replete with existing work. Shifting our focus to concept drift, we then applied our
word embedding model to the entirety of the BODMAS dataset. Finally, to measure the temporal
evolution of our model’s efficacy, we conducted our final experiments using the BODMAS dataset
separated into individual months.</p>
      <sec>
        <title>5.1. Experiment 1: BIG2015</title>
        <p>In this experiment, we evaluated the effectiveness of our model using the Microsoft Malware
Classification Challenge dataset. Despite being relatively old, the BIG2015 dataset was chosen to provide
an overall idea of our model’s performance. We extracted opcodes from the .asm files using regular
expressions to focus solely on opcode analysis. We enhanced the dataset quality by filtering out files
containing only repetitive instructions, excluding 54 samples. Due to the absence of testing labels, the
training dataset was used for both training and testing, employing a 5-fold cross-validation method.
This involved splitting the dataset into five subsets, with each subset serving as a testing set once, while
the remaining portions were used for training. The training dataset was thus split in an 80:20 ratio in each
fold, ensuring that every family was represented in both the training and test sets.</p>
        <p>The model described in Subsection 3.3 was used for our experiments, with results shown in the
normalized confusion matrix in Figure 3. In the confusion matrix, rows represent the true class labels,
and columns indicate the predicted labels. Each cell shows the normalized proportion of predictions,
with each row summing to 1, allowing for a straightforward comparison of the model’s performance
across classes. Additionally, Table 2 includes a comparative analysis with other state-of-the-art studies,
using the results reported in those papers. Although there might be minor variations in experimental setups, this test
confirms the effectiveness of our model as a reliable baseline for assessing concept drift.</p>
      </sec>
      <sec id="sec-5-1">
        <title>5.2. Experiment 2: BODMAS</title>
        <p>In this experiment, the BODMAS dataset is utilized [28]. While the dataset offers the actual binary files
for the malicious samples, benign files are not provided by the authors due to copyright concerns. Due to
limitations in obtaining benign binaries, our investigation focused solely on malicious samples for our
experiments.</p>
        <p>We employed the 20 most prominent malware families from the dataset. This choice ensured reliable
training by establishing a minimum sample size per family. In our experiments, we limited each sample to its
first 200 opcodes, guided by the experiments shown in Figure 2, as the highest overall accuracy was achieved
with an opcode length of 200.</p>
        <sec id="sec-5-1-1">
          <title>Full training on BODMAS</title>
          <p>In this experiment, we evaluated our model on the BODMAS dataset to understand its performance
on a more recent dataset. Given the availability of timestamps for each sample, we opted not to use
k-fold cross-validation. Instead, we followed the guidelines specified in [5] by splitting the dataset into
an 80:20 ratio based on the first-seen timestamp. This approach involved using older samples to train
the model and newer samples to test it, which helps eliminate temporal bias and simulate a realistic
setting. Specifically, the training set includes samples from August 2019 to June 2020, while the testing
set includes samples from June 2020 to September 2020. Additionally, 10% of the training data was
reserved as a validation set to ensure the model’s resilience. We used the same word embedding model
from our initial experiment with the BIG2015 dataset (see Subsection 5.1).</p>
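          <p>The chronological 80:20 split described above can be sketched in a few lines of Python. The record layout (a first-seen date paired with a family label) and the function name temporal_split are assumptions made for illustration; the dates in the example are made up.</p>
          <preformat><![CDATA[
```python
from datetime import date


def temporal_split(samples, train_fraction=0.8):
    """Split (first_seen, label) records chronologically into train and test.

    The oldest train_fraction of the records form the training set, so no
    training sample is newer than any test sample.
    """
    ordered = sorted(samples, key=lambda record: record[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


if __name__ == "__main__":
    # Synthetic records, not taken from BODMAS.
    records = [
        (date(2020, 9, 1), "c"),
        (date(2019, 8, 30), "a"),
        (date(2020, 6, 15), "b"),
        (date(2019, 12, 1), "a"),
        (date(2020, 3, 10), "b"),
    ]
    train, test = temporal_split(records)
    print(len(train), len(test))
```
]]></preformat>
          <p>Unlike a random or k-fold split, this ordering prevents the model from being trained on samples that appear later than those it is tested on.</p>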
          <p>The resulting testing dataset comprises 6,375 malware samples distributed across 20 different families.
The training dataset contained a total of 1013 distinct vocabulary items, representing unique opcodes.
As mentioned in Subsection 3.3, a word embedding size of 256 was employed. Reducing the embedding
size below 256 led to a slight drop in performance, while increasing it beyond that value did not change
performance despite the increased resource and computational demands. The achieved accuracy was
89.55% across the 20 malware families, and a detailed confusion matrix is given in Figure 4. Additionally,
Table 3 compares our model’s performance with other studies. To facilitate a straightforward comparison
with our monthly separated temporal analysis experiment, as described in Section 5.2, we used the
most prevalent 20 families according to our training month, September 2019.</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>Temporal analysis on BODMAS</title>
          <p>In this experiment, our primary goal was to conduct a temporal analysis of our model’s performance.
We grouped the malware samples by month over the 14 months from August 2019 to
September 2020. To ensure sufficient representation of each malware family for training,
we combined the data from August and September 2019 into the training set, because
August’s data alone would not provide enough samples. We then evaluated the model’s performance
over the subsequent 12 months, from October 2019 to September 2020, allowing us to thoroughly assess
the model’s adaptability and effectiveness over time.</p>
          <p>Our investigation focused on the top 20 malware families. While the model initially showed
a relatively strong accuracy of 83.14%, this accuracy varied significantly throughout the year. The
performance fluctuated, as illustrated in Figure 5, with notable dips. Specifically, the accuracy ranged
from a low of 65.49% to a high of 91.45%. This indicates variability in the model’s ability to correctly
classify malware samples over time, reflecting the possible impact of concept drift.</p>
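          <p>A per-month accuracy curve of this kind can be computed with a simple grouping step. The Python sketch below assumes each prediction is stored as a tuple of first-seen date string, true family, and predicted family; this layout and the name monthly_accuracy are hypothetical, and the example records are synthetic.</p>
          <preformat><![CDATA[
```python
from collections import defaultdict


def monthly_accuracy(records):
    """Compute accuracy per calendar month.

    records: iterable of (first_seen 'YYYY-MM-DD', true_family, predicted_family).
    Returns a dict mapping 'YYYY-MM' to the fraction of correct predictions.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for first_seen, truth, predicted in records:
        month = first_seen[:7]  # keep the 'YYYY-MM' part
        totals[month] += 1
        hits[month] += int(truth == predicted)
    return {month: hits[month] / totals[month] for month in sorted(totals)}


if __name__ == "__main__":
    # Synthetic predictions for illustration only.
    records = [
        ("2019-10-03", "upatre", "upatre"),
        ("2019-10-20", "wacatac", "upatre"),
        ("2019-11-02", "ganelp", "ganelp"),
    ]
    print(monthly_accuracy(records))
```
]]></preformat>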
          <p>When analyzing only the non-packed malware, the model’s performance demonstrated higher
consistency and generally better accuracy throughout the year. The accuracy for non-packed malware
remained above 85% for most months, suggesting that packed malware contributes significantly to the
observed variability and overall reduction in model performance. This highlights the importance of
effectively handling packed malware in malware classification tasks to maintain robust and reliable
performance.</p>
        </sec>
        <sec id="sec-5-1-3">
          <title>Concept drift mitigation</title>
          <p>To explore and address the concept drift phenomenon, we conducted a preliminary experiment to
understand how our dataset evolves. From October 2019 to September 2020, we selected 10 samples
from each malware family (totaling 200 samples) each month. These samples were used to fine-tune
our classification model, originally trained on August and September 2019 data. We then evaluated the
model’s performance on the entire dataset for each subsequent month.</p>
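          <p>The monthly selection of fine-tuning samples could look like the following Python sketch. It assumes random selection with a fixed seed and that a family with fewer than 10 samples in a month contributes all of them; neither detail is specified in the text, and the name pick_finetune_samples is hypothetical.</p>
          <preformat><![CDATA[
```python
import random
from collections import defaultdict


def pick_finetune_samples(month_samples, per_family=10, seed=0):
    """Choose up to per_family sample ids from each family for one month.

    month_samples: list of (sample_id, family) pairs for that month.
    Returns a dict mapping family to the list of chosen sample ids.
    """
    by_family = defaultdict(list)
    for sample_id, family in month_samples:
        by_family[family].append(sample_id)
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    chosen = {}
    for family in sorted(by_family):
        pool = by_family[family]
        # Assumption: small families contribute every available sample.
        chosen[family] = rng.sample(pool, min(per_family, len(pool)))
    return chosen


if __name__ == "__main__":
    # Synthetic sample ids for illustration only.
    samples = [(f"s{i}", "upatre") for i in range(30)]
    samples += [(f"t{i}", "ganelp") for i in range(4)]
    picked = pick_finetune_samples(samples)
    print({family: len(ids) for family, ids in picked.items()})
```
]]></preformat>
          <p>With 20 families and 10 samples each, this yields the 200 samples per month mentioned above whenever every family has at least 10 samples in that month.</p>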
          <p>The result, shown in Figure 6, illustrates the effectiveness of the fine-tuned models. The blue line
represents the base model, while the red line depicts the fine-tuned model. The captions indicate the
months during which additional samples were collected for retraining. The minimal difference between
the blue and red lines indicates that retraining the model with samples from October 2019 to January
2020 did not significantly improve performance. However, a significant improvement is observed when
the model is fine-tuned with data from February 2020. Specifically, the model’s accuracy on February
data increased and, importantly, the accuracy on subsequent months’ data also improved.
For example, the accuracy in March, which was initially the lowest at 71.5%, increased by
approximately 20 percentage points to 91.7% with the fine-tuned model, despite the model not being
trained on any March data.
This suggests that the model adapted well to new malware variants in February, which also benefited
the performance in later months. This pattern is also observed for models retrained with samples from
March, April, May, July, August, and September 2020, except for June, where the improvement was not
as pronounced. This exception may indicate that June did not introduce significant new variants. These
findings highlight the evolving nature of malware and underscore the importance of regularly updating
the model with new data to maintain its accuracy. This preliminary experiment informs future research
aimed at developing robust strategies for detecting and mitigating concept drift.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Impact of Packed Malware and Packer</title>
      <p>Packed malware and packers play a critical role in the field of malware analysis [8]. Packers are tools
used to compress or encrypt executable files, often to obfuscate the code and evade detection by security
software. When a file is packed, its original code is transformed into a more compact and unreadable
format, which is then decompressed or decrypted back into its original form during execution. Malware
authors commonly use this technique to hide malicious payloads and hinder reverse engineering efforts.
Analyzing packed malware, therefore, requires specialized techniques to unpack and examine the
underlying code, making it a challenging but essential aspect of modern cybersecurity practices.</p>
      <p>Our deeper analysis of incorrectly classified samples found that about 67% of these samples were
packed. This led us to investigate the role of packers and their efect on our model’s performance. We
wanted to see if the packing process caused misclassification or if other factors were at play. We also
noticed that not all packed malware were incorrectly classified. This suggested that the type of packer
used might significantly influence classification accuracy, which is also indicated in [ 8]. To understand
this better, we analyzed how diferent packers afected our model’s ability to identify malware families
correctly. Table 5 indicates the number of samples each packer has, how many were misclassified, and
the percentage of misclassified samples for each packer.</p>
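      <p>As an illustration, a per-packer summary of the kind reported in Table 5 could be derived along the following lines. This is a minimal sketch rather than the pipeline used in this study: the record layout (packer, true family, predicted family), the function name, and the toy labels are all assumptions made for the example.</p>
      <preformat>
```python
from collections import Counter


def misclassification_by_packer(records):
    """Summarize misclassifications per packer.

    `records` is an iterable of (packer, true_family, predicted_family)
    tuples; this layout is an assumption made for the example.
    Returns {packer: (total, misclassified, percent_misclassified)}.
    """
    totals = Counter()
    errors = Counter()
    for packer, true_family, predicted_family in records:
        totals[packer] += 1
        if true_family != predicted_family:
            errors[packer] += 1
    return {
        packer: (n, errors[packer], round(100.0 * errors[packer] / n, 2))
        for packer, n in totals.items()
    }


# Toy records with made-up family labels, for illustration only.
toy_records = [
    ("UPX", "FamilyA", "FamilyA"),
    ("UPX", "FamilyB", "FamilyA"),
    ("Petite", "FamilyB", "FamilyC"),
    ("Petite", "FamilyC", "FamilyC"),
    ("Petite", "FamilyA", "FamilyC"),
    ("none", "FamilyA", "FamilyA"),
]
summary = misclassification_by_packer(toy_records)
for packer, (total, wrong, pct) in sorted(summary.items()):
    print(f"{packer}: {wrong}/{total} misclassified ({pct}%)")
```
      </preformat>
      <p>In practice, the packer label for each sample would come from a packer identification tool, and the predictions from the trained classifier.</p>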
      <p>Across malware families, we observed significant variability in detection accuracy,
sample size, and the prevalence of packed samples. Table 4 provides a detailed overview of this variability
across different malware families. Notably, the families Dinwod, Fuerboos, Ganelp, and Wacatac
exhibited low average accuracy rates of 27.57%, 27.78%, 23.37%, and 59.90%, respectively. Many samples
from these families were packed, which likely contributed to the reduced classification accuracy and
suggests that the packing process significantly impairs the model’s ability to
classify these malware families accurately.</p>
      <p>To better understand the impact of different packers on misclassification, we analyzed the misclassified
samples based on the packer used. For each packer, we identified the total number of misclassified samples
and the classes into which these samples were incorrectly classified. For instance, of the
2045 misclassified samples packed with the Petite packer, 1850 were
assigned to Wacatac, accounting for about 90.5% of the misclassifications. Similarly, for the MPRESS
packer, 124 misclassified samples were classified as Upatre, which is 97% of that packer’s misclassifications. This
data highlights the significant influence of the type of packer on the model’s misclassification behaviour,
suggesting that certain packers may introduce features more strongly associated with specific malware
families, leading to higher misclassification rates.</p>
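      <p>The breakdown described above, namely which family absorbs most of a packer’s misclassified samples, can be sketched as follows. The record layout, the function name, and the toy data are assumptions for illustration and do not reproduce the study’s data or tooling.</p>
      <preformat>
```python
from collections import Counter, defaultdict


def dominant_confusion_target(records):
    """For each packer, find the family that absorbs most of its errors.

    `records` holds (packer, true_family, predicted_family) tuples
    (an assumed layout). Only misclassified samples are counted.
    Returns {packer: (family, count, share_percent)}.
    """
    wrong_predictions = defaultdict(Counter)
    for packer, true_family, predicted_family in records:
        if true_family != predicted_family:
            wrong_predictions[packer][predicted_family] += 1
    result = {}
    for packer, counts in wrong_predictions.items():
        family, count = counts.most_common(1)[0]
        share = round(100.0 * count / sum(counts.values()), 2)
        result[packer] = (family, count, share)
    return result


# Toy records: "PackerX" and the family names are hypothetical.
toy_records = [
    ("PackerX", "FamilyA", "FamilyC"),
    ("PackerX", "FamilyB", "FamilyC"),
    ("PackerX", "FamilyA", "FamilyC"),
    ("PackerX", "FamilyB", "FamilyA"),
    ("PackerX", "FamilyC", "FamilyC"),  # correct, so not counted
]
targets = dominant_confusion_target(toy_records)
for packer, (family, count, share) in targets.items():
    print(f"{packer}: {count} errors assigned to {family} ({share}%)")
```
      </preformat>
      <p>A share close to 100% for a single family, as reported for Petite and MPRESS, indicates that the packer rather than the payload is driving the prediction.</p>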
      <p>In conclusion, the presence of packed samples significantly impacts the model’s classification
performance. When one malware family accounts for most samples of a given packer in the training data, the classifier
tends to label any sample using that packer as belonging to that family. This results in misclassifications, where
malware from different families using the same packer is incorrectly identified as the dominant
family. Our study thus highlights a notable challenge in accurately distinguishing between malware families
that share a packer. Furthermore, some packers have a more pronounced effect on
misclassification than others, indicating that packers vary in their impact on the classifier’s performance.</p>
    </sec>
    <sec id="sec-7">
      <title>7. CONCLUSION AND FUTURE WORK</title>
      <p>In this study, we approached malware family classification by applying a CNN to opcode sequences.
We began with the widely used Microsoft Malware Classification Challenge dataset to establish a
benchmark for our CNN model. Building on this foundation, we
applied the same architecture to the BODMAS
dataset. To explore the temporal aspect, we trained our model on data from the initial period,
determined by the first submission (first-seen) date on VirusTotal, and then tested its efficacy
over the following months. Our observations revealed significant variations in the performance of our
model across different periods, reflecting the dynamic nature of malware behaviors over time. Our
results demonstrate that concept drift is a significant challenge in malware detection. As the model
encounters new data over time, its performance initially declines. However, periodic retraining with
recent data samples lets the model adapt to the new data patterns, resulting in improved
performance. This highlights the importance of continuous learning and model updating in maintaining
high detection accuracy.</p>
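      <p>The temporal protocol summarized above (train on an initial period, test month by month, optionally retrain) can be sketched in a model-agnostic way. The code below is an illustrative assumption, not the implementation used in this study: the function names, the cumulative retraining pool, and the majority-class stand-in for the CNN are all hypothetical.</p>
      <preformat>
```python
from collections import Counter
from datetime import date


def temporal_evaluation(samples, n_train_months, fit, predict, retrain=False):
    """Train on the earliest months, then test on each later month in order.

    samples: list of (first_seen_date, features, label) tuples.
    fit(pairs) returns a model; predict(model, features) returns a label.
    With retrain=True, each tested month is added to the training pool and
    the model is refitted before the next month (an assumed policy).
    Returns {(year, month): accuracy}.
    """
    by_month = {}
    for first_seen, features, label in samples:
        key = (first_seen.year, first_seen.month)
        by_month.setdefault(key, []).append((features, label))
    months = sorted(by_month)
    pool = [pair for m in months[:n_train_months] for pair in by_month[m]]
    model = fit(pool)
    accuracy = {}
    for m in months[n_train_months:]:
        batch = by_month[m]
        hits = sum(1 for x, y in batch if predict(model, x) == y)
        accuracy[m] = hits / len(batch)
        if retrain:
            pool.extend(batch)
            model = fit(pool)
    return accuracy


# Stand-in "model": always predicts the most frequent training label.
def fit_majority(pairs):
    return Counter(label for _, label in pairs).most_common(1)[0][0]


def predict_majority(model, features):
    return model


# Toy stream in which the dominant family changes after the first month.
toy = (
    [(date(2019, 12, d), None, "FamilyA") for d in (1, 2, 3)]
    + [(date(2020, 1, d), None, "FamilyB") for d in (1, 2, 3, 4)]
    + [(date(2020, 2, d), None, "FamilyB") for d in (1, 2, 3, 4)]
)
static_acc = temporal_evaluation(toy, 1, fit_majority, predict_majority)
retrained_acc = temporal_evaluation(
    toy, 1, fit_majority, predict_majority, retrain=True
)
print("static:", static_acc)
print("retrained:", retrained_acc)
```
      </preformat>
      <p>In this toy stream, the static model never recovers after the shift, whereas the retrained model does once the drifted month has entered the training pool.</p>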
      <p>We also examined the impact of packers on malware classification. Packers, which
compress or encrypt executable files, significantly affect classification performance by obfuscating
the malware’s features. Our analysis showed that certain packers led to higher misclassification rates,
suggesting that the type of packer used plays a crucial role in the effectiveness of malware detection
systems. Addressing the challenges posed by packed malware is essential for developing robust detection
models.</p>
      <p>In future work, we will explore drift detection techniques in depth
to enhance our model’s adaptability. This involves identifying changes in data distribution over time,
enabling us to detect when the model’s performance might be compromised. Additionally, we plan to
conduct a comprehensive study to explain the underlying reasons for the observed performance fluctuations
across different temporal segments. By understanding the specific features and patterns that contribute
to the behavior of malware families, including the impact of packers, we aim to strengthen our model’s
robustness.</p>
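      <p>One simple way to flag a change in data distribution, offered here only as an illustration and not as the method planned by this work, is to compare the opcode frequency distribution of a reference window with that of a newer window using the Jensen-Shannon divergence. The function names, the toy opcode sequences, and the threshold value below are assumptions.</p>
      <preformat>
```python
import math
from collections import Counter


def opcode_distribution(sequences):
    """Relative frequency of each opcode over a window of sequences."""
    counts = Counter(op for seq in sequences for op in seq)
    total = sum(counts.values())
    return {op: c / total for op, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Kullback-Leibler divergence of `a` from the mixture distribution.
        return sum(
            a[k] * math.log2(a[k] / mix[k]) for k in keys if a.get(k, 0.0) > 0
        )

    return 0.5 * kl(p) + 0.5 * kl(q)


# Toy windows of opcode sequences (hypothetical data).
reference_window = [["mov", "push", "call"], ["mov", "add"]]
new_window = [["xor", "xor", "jmp"], ["xor", "mov"]]

DRIFT_THRESHOLD = 0.1  # hypothetical value; would need tuning on real data

score = js_divergence(
    opcode_distribution(reference_window), opcode_distribution(new_window)
)
drift_suspected = score > DRIFT_THRESHOLD
print(f"JS divergence: {score:.3f}, drift suspected: {drift_suspected}")
```
      </preformat>
      <p>A score above the chosen threshold would only signal that the input distribution has moved. Whether accuracy has actually degraded still has to be confirmed on labeled samples.</p>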
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>Numan Halit Guldemir is supported by the Republic of Türkiye Ministry of National Education
(MoNE1416/YLSY).</p>
      <p>
[5] F. Pendlebury, F. Pierazzi, R. Jordaney, J. Kinder, L. Cavallaro, TESSERACT: Eliminating
experimental bias in malware classification across space and time, in: 28th USENIX Security
Symposium (USENIX Security 19), 2019, pp. 729–746.
[6] F. A. Aboaoja, A. Zainal, F. A. Ghaleb, B. A. S. Al-rimy, T. A. E. Eisa, A. A. H. Elnour, Malware
detection issues, challenges, and future directions: A survey, Applied Sciences 12 (2022) 8482.
[7] P. O’Kane, S. Sezer, K. McLaughlin, Obfuscation: The hidden malware, IEEE Security &amp; Privacy 9
(2011) 41–47.
[8] H. Aghakhani, F. Gritti, F. Mecca, M. Lindorfer, S. Ortolani, D. Balzarotti, G. Vigna, C. Kruegel,
When malware is packin’ heat; limits of machine learning classifiers based on static analysis
features, in: Network and Distributed Systems Security (NDSS) Symposium 2020, 2020.
[9] A. Afianian, S. Niksefat, B. Sadeghiyan, D. Baptiste, Malware dynamic analysis evasion techniques:
A survey, ACM Computing Surveys (CSUR) 52 (2019) 1–28.</p>
      <p>
[10] C. Rathnayaka, A. Jamdagni, An efficient approach for advanced malware analysis using memory
forensic technique, in: 2017 IEEE Trustcom/BigDataSE/ICESS, IEEE, 2017, pp. 1145–1150.
[11] R. Sihwail, K. Omar, K. A. Zainol Arifin, S. Al Afghani, Malware detection approach based on
artifacts in memory image and dynamic analysis, Applied Sciences 9 (2019) 3680.
[12] R. Sihwail, K. Omar, K. A. Z. Arifin, An effective memory analysis for malware detection and
classification, Computers, Materials &amp; Continua 67 (2021).
[13] T. Carrier, P. Victor, A. Tekeoglu, A. H. Lashkari, Detecting obfuscated malware using memory
feature engineering, in: ICISSP, 2022, pp. 177–188.
[14] S. Jeon, J. Moon, Malware-detection method with a convolutional recurrent neural network using
opcode sequences, Information Sciences 535 (2020) 1–15.
[15] X. Ling, L. Wu, J. Zhang, Z. Qu, W. Deng, X. Chen, Y. Qian, C. Wu, S. Ji, T. Luo, Adversarial attacks
against Windows PE malware detection: A survey of the state-of-the-art, Computers &amp; Security
(2023) 103134.
[16] D. Bilar, Opcodes as predictor for malware, International journal of electronic security and digital
forensics 1 (2007) 156–168.
[17] E. S. Parildi, D. Hatzinakos, Y. Lawryshyn, Deep learning-aided runtime opcode-based windows
malware detection, Neural Computing and Applications 33 (2021) 11963–11983.
[18] N. McLaughlin, J. Martinez del Rincon, B. Kang, S. Yerima, P. Miller, S. Sezer, Y. Safaei, E. Trickel,
Z. Zhao, A. Doupé, Deep android malware detection, in: Proceedings of the Seventh ACM on
Conference on Data and Application Security and Privacy, 2017, pp. 301–308.
[19] R. Lu, Malware detection with lstm using opcode language, arXiv preprint arXiv:1906.04593
(2019).
[20] A. S. Kale, V. Pandya, F. Di Troia, M. Stamp, Malware classification with word2vec, hmm2vec, bert,
and elmo, Journal of Computer Virology and Hacking Techniques 19 (2023) 1–16.
[21] F. Barbero, F. Pendlebury, F. Pierazzi, L. Cavallaro, Transcending transcend: Revisiting malware
classification in the presence of concept drift, in: 2022 IEEE Symposium on Security and Privacy
(SP), IEEE, 2022, pp. 805–823.
[22] L. Yang, W. Guo, Q. Hao, A. Ciptadi, A. Ahmadzadeh, X. Xing, G. Wang, CADE: Detecting and
Explaining Concept Drift Samples for Security Applications, in: USENIX Security Symposium,
2021, pp. 2327–2344.
[23] F. Ceschin, M. Botacin, H. M. Gomes, F. Pinagé, L. S. Oliveira, A. Grégio, Fast &amp; furious: On the
modelling of malware detection as an evolving data stream, Expert Systems with Applications 212
(2023) 118590.
[24] F. Zola, J. L. Bruse, M. Galar, Temporal analysis of distribution shifts in malware classification
for digital forensics, in: 2023 IEEE European Symposium on Security and Privacy Workshops
(EuroS&amp;PW), IEEE, 2023, pp. 439–450.
[25] G. Dabah, gdabah/distorm, 2023. URL: https://github.com/gdabah/distorm, accessed: 13 September
2023.
[26] Horsicq, Detect It Easy: Program for determining types of files for Windows, Linux and macOS,
2024. URL: https://github.com/horsicq/Detect-It-Easy, accessed: 01 July 2024.
[27] R. Ronen, M. Radu, C. Feuerstein, E. Yom-Tov, M. Ahmadi, Microsoft malware classification
challenge, arXiv preprint arXiv:1802.10135 (2018).
[28] L. Yang, A. Ciptadi, I. Laziuk, A. Ahmadzadeh, G. Wang, BODMAS: An open dataset for learning
based temporal analysis of PE malware, in: 2021 IEEE Security and Privacy Workshops (SPW),
IEEE, 2021, pp. 78–84.
[29] M. Alaeiyan, A. Dehghantanha, T. Dargahi, M. Conti, S. Parsa, A multilabel fuzzy relevance
clustering system for malware attack attribution in the edge layer of cyber-physical networks,
ACM Transactions on Cyber-Physical Systems 4 (2020) 1–22.
[30] Q. Lu, H. Zhang, H. Kinawi, D. Niu, Self-attentive models for real-time malware classification,
IEEE Access 10 (2022) 95970–95985.</p>
      <p>
[31] X. Zhu, J. Huang, B. Wang, C. Qi, Malware homology determination using visualized images and
feature fusion, PeerJ Computer Science 7 (2021) e494.
[32] M. Ahmadi, D. Ulyanov, S. Semenov, M. Trofimov, G. Giacinto, Novel feature extraction, selection
and fusion for effective malware family classification, in: Proceedings of the Sixth ACM Conference
on Data and Application Security and Privacy, 2016, pp. 183–194.
[33] J. Hao, S. Luo, L. Pan, EII-MBS: Malware family classification via enhanced adversarial instruction
behavior semantic learning, Computers &amp; Security 122 (2022) 102905.
[34] M. H. L. Louk, B. A. Tama, Tree-based classifier ensembles for PE malware analysis: A performance
revisit, Algorithms 15 (2022) 332.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>AV-Test</surname>
          </string-name>
          ,
          <source>Malware statistics &amp; trends report</source>
          ,
          <year>2023</year>
          . URL: https://www.av-test.org/en/statistics/malware/, accessed: 12 May
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Kakisim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gulmez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sogukpinar</surname>
          </string-name>
          ,
          <article-title>Sequential opcode embedding-based malware detection method</article-title>
          ,
          <source>Computers &amp; Electrical Engineering</source>
          <volume>98</volume>
          (
          <year>2022</year>
          )
          <fpage>107703</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Jordaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sharad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Dash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Papini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Nouretdinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cavallaro</surname>
          </string-name>
          ,
          <article-title>Transcend: Detecting concept drift in malware classification models</article-title>
          ,
          <source>in: 26th USENIX Security Symposium (USENIX Security 17)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>625</fpage>
          -
          <lpage>642</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Droidevolver: Self-evolving android malware detection system</article-title>
          ,
          <source>in: 2019 IEEE European Symposium on Security and Privacy (EuroS&amp;P)</source>
          ,
          <publisher-name>IEEE</publisher-name>
          ,
          <year>2019</year>
          , pp.
          <fpage>47</fpage>
          -
          <lpage>62</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>