Navigating Concept Drift and Packing Complexity in Malware Family Classification

Numan Halit Guldemir1, Oluwafemi Olukoya1 and Jesús Martínez-del-Rincón1
1 Centre for Secure Information Technologies (CSIT), Queen's University Belfast, United Kingdom

Abstract
In the rapidly evolving cybersecurity landscape, the classification of malware families presents significant challenges due to the dynamic nature of malware, a phenomenon known as concept drift. This research classifies Windows PE malware families using static analysis of raw opcode sequences. By leveraging Convolutional Neural Networks (CNNs) to extract unique features from these sequences, our approach achieves high classification accuracy rates of 97.93% and 89.55% on the Microsoft Malware Classification Challenge and BODMAS datasets, respectively. We also conducted a temporal analysis on BODMAS over 13 months to observe the evolution of malware families and identify periods where our model's accuracy decreases. We implemented a retraining strategy, allowing us to observe how retraining the model with new data helps it adapt to new malware patterns. The study also examined the impact of packed malware and different types of packers on the model's performance. Our findings indicate that packed malware significantly affects the model's accuracy, with some packers having a more pronounced impact than others. These results underscore the importance of regular model updates and specialized handling of packed malware to maintain robust detection capabilities.

Keywords
malware family classification, concept drift, Windows PE malware, temporal analysis, packed malware, packer

1. Introduction
The importance of detecting malware cannot be overstated, and a great deal of attention is devoted to identifying family-related threats. As technology grows rapidly, so does our reliance on software, making us vulnerable to cyberattacks. This has led to an increase in harmful software created by cyberattackers. A malware statistics and trends report by AV-Test [1] indicates that approximately 300,000 new malware samples emerge daily. This rise is due to the accessibility of malware construction toolkits, which enable even individuals without expertise to create malware samples [2]. These toolkits facilitate the implementation of sophisticated methods such as polymorphism and metamorphism. Consequently, the need to prevent the spread of these harmful files and to detect them effectively remains of utmost importance.
One critical challenge in the field of malware detection is the continuous evolution of malware. Building reliable detection models becomes increasingly challenging as malware develops and changes over time, because when these models encounter new or evolved types of malware, they may have difficulty accurately identifying them. This phenomenon is referred to as concept drift [3], model aging [4] or time decay [5] in the literature. Drifted samples are prone to being misclassified by detection models. Therefore, it is essential to identify and monitor concept drift, as this reveals when a model is becoming outdated and incapable of effectively identifying new and emerging malware threats.
In this study, the primary focus of the malware detection system is on Windows malicious applications, commonly referred to as Windows PE malware. This focus is based on the observation that, despite the increasing prevalence of malware targeting other platforms such as Android, Windows remains the primary target for malware creators [1].
Furthermore, Windows PE malware exhibits distinct characteristics that set it apart from other types of malware. These differences in behavior and structure make it necessary to develop specialized detection models and strategies tailored to the unique challenges posed by Windows PE malware.
Research on malware detection showcases a spectrum of approaches, from signature-based to behavior-based methodologies [6]. Conventional approaches relied heavily on signatures and hashes to identify binaries. Nevertheless, these techniques are susceptible to exploitation, as malware creators can effortlessly manipulate them; for instance, they can alter hashes using methods like polymorphism or metamorphism [7]. It is important to note that, by adopting these methods, alterations can be made to the malware's binary structure, including its hash, without affecting the underlying malicious behavior. This ability to evade signature-based detection emphasizes the need for more sophisticated approaches to malware detection.
Static analysis presents a rapid means to extract characteristics of PE malware. What distinguishes this approach is its independence from execution, rendering it a safe option. Through static analysis, a range of attributes can be derived from PE malware, encompassing API calls, byte sequences, string patterns, opcodes, and the inherent characteristics of the PE malware itself [8]. On the other hand, dynamic malware analysis involves executing malware in controlled environments such as sandboxes to examine its behavior. This method requires more time and resources than static analysis and faces scalability challenges. Some malware can even detect when it is running in a controlled environment and adjust its behavior to avoid detection, making dynamic analysis less effective [9]. Another method, known as memory analysis, has gained popularity in recent years. This technique has piqued the interest of malware analysts due to its ability to provide a comprehensive assessment of malware [10, 11, 12, 13]. However, it can be time-consuming: generating a single memory dump can take up to 2 minutes, and the resulting dump files can exceed 1 gigabyte. These challenges intensify resource requirements and hinder scalability and efficiency in malware detection and analysis [10].
This research leans towards utilizing static features, mainly because of the limitations of the alternative methodologies discussed. Within these static attributes, the emphasis on opcode sequences stems from their potential to offer valuable insights into the operational characteristics of binary files without necessitating their execution [14]. Opcodes are the fundamental building blocks of executables, comprising a structured assembly of directives [15]. By examining opcode sequences, important patterns of behavior can be extracted. This approach eliminates the necessity for real-time execution, making the analysis process more efficient and secure.
The contributions of this paper are highlighted below:
• We perform an initial temporal analysis to observe how the performance of our model changes over time, providing insights into the presence of potential concept drift.
• We conduct preliminary retraining of the model with random samples from specific months to observe how its performance may improve as it encounters new data patterns over time.
• We examine the impact of packed malware and different packers on model performance.
This paper is organized as follows: Section 2 reviews relevant research on malware detection, malware family classification and concept drift. Section 3 describes our methodology, including data processing and model design. Section 4 outlines the datasets and experimental setup. Section 5 discusses our results and key findings, and Section 6 examines the effects of packed malware. Finally, Section 7 concludes with insights and future directions.

2. RELATED WORK
2.1. Opcode-based Malware Family Detection
Opcodes, which represent the fundamental instructions executed by a binary, provide valuable insights into its behavior and characteristics. One notable study is Bilar's work, which made significant contributions to malware detection through opcode analysis [16]. In this study, the author investigated an approach for identifying malicious code through the statistical analysis of opcode distributions. The results showed a statistical distinction in opcode distributions between malware and non-malicious software, with infrequent opcodes demonstrating a higher predictive capability. Parildi et al. [17] utilized opcodes obtained at runtime to perform binary classification of PE samples. They used Word2Vec to generate opcode word embeddings, followed by CNN, RNN and Transformer models. McLaughlin et al. [18] proposed an Android malware detection system using a deep convolutional neural network. The system performs static analysis of raw opcode sequences from disassembled programs to learn features indicative of malware. Lu [19] employed opcodes of PE malware to determine the malware family, proposing a two-stage LSTM model for malware detection that utilizes two LSTM layers and one mean-pooling layer to obtain feature representations of the opcode sequences of malware. The experiments were conducted on a dataset of 969 malicious and 123 benign files. Kakisim et al. [2] introduced a malware detection method called SOEMD. This approach uses static analysis to detect malware by creating a new feature space of sequential opcode embeddings through Random Walk and Skip-Gram techniques. Kale et al. [20] investigated the use of word embedding techniques (HMM2Vec, Word2Vec, BERT, and ELMo) for malware family classification. They employed four classification algorithms (k-nearest neighbors, support vector machines, random forests, and convolutional neural networks) on each word embedding technique. The experiments were conducted on opcode sequences from seven malware families.
In summary, these methods demonstrate that opcode sequences combined with deep learning techniques are a powerful and effective combination for detecting malware as well as classifying malware families. However, since they rely on a training set collected at a given time, they are subject to performance degradation over time. Additionally, these methods did not consider packed samples and packer techniques, which can significantly impact the performance and effectiveness of malware detection systems [8].

2.2. Concept Drift
Although our research on concept drift is preliminary, it is essential to highlight the significance of this phenomenon and recognize the valuable contributions of recent studies in this area. Jordaney et al. [3] designed a framework called TRANSCEND to identify when a classification model is consistently misclassifying new data, indicating that concept drift has occurred. The TRANSCEND framework uses statistical metrics to detect concept drift: it employs a non-conformity measure to assess how well test samples fit, using a p-value. Building on this, Barbero et al. [21] developed TRANSCENDENT, which enhances the efficiency and reduces the computational cost of the original Transcend framework. Yang et al. [22] developed a framework called CADE. The primary objective of the framework is to identify drifting samples that deviate from the existing classes and to provide explanations for the occurrence of drift. Ceschin et al. [23] investigated the impact of concept drift on Android malware detection, utilizing datasets from DREBIN and AndroZoo. They examined how evolving malware characteristics affect machine learning classifiers and proposed a data stream learning pipeline that updates both classifiers and feature extractors. Zola et al. [24] proposed a three-step forensic exploration method to evaluate concept drift and model performance over time. This approach involves comparing models trained with different temporal windows, evaluating the influence of temporal information on model drift, and conducting a detailed misclassification and feature analysis to identify the causes of drift and model failure.
To sum up, the objectives of our experiments revolve around addressing the following research questions:
RQ1: Given the use of a recent dataset, can opcode analysis still effectively provide insights into the behavior and functionality of various malware families?
RQ2: How does the performance of our opcode-based classification model change over time when applied to Windows PE malware, and what patterns can we observe?
RQ3: What initial steps or observations can be made towards understanding and addressing concept drift in an opcode-based analysis of Windows PE malware?
RQ4: How do packed samples and different packers influence the performance of malware family classification models?

3. METHOD
We employ a deep learning approach inspired by natural language processing techniques to process the opcodes. First, we transform these opcodes into word embedding arrays. We then utilize a CNN model to classify the malware using these transformed representations.

3.1. Disassembly of PE Malware
In this study, we employed the distorm3 disassembler to extract instructions from PE malware samples. Distorm3 [25] is a disassembler tool that converts binary code into human-readable assembly language. To streamline our analysis and focus on high-level behavioral patterns, we discarded the operands, which specify the data to be manipulated by the instructions, from the extracted instructions, similar to [18, 2, 20, 17].
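For illustration, the following is a minimal sketch of how opcode mnemonics might be extracted from a PE sample with distorm3, using pefile to locate the code section; the choice of the .text section, the 32-bit decoding mode, and the helper name are assumptions for the example rather than a verbatim reproduction of our pipeline.

```python
import pefile    # assumed helper for locating the code section
import distorm3  # disassembler used in this work


def extract_opcodes(path, max_len=200):
    """Return up to `max_len` opcode mnemonics from a PE file's .text section."""
    pe = pefile.PE(path)
    # Pick the .text section; real samples may place code elsewhere (simplifying assumption).
    text = next(s for s in pe.sections if s.Name.rstrip(b"\x00") == b".text")
    code = text.get_data()

    opcodes = []
    # Decode as 32-bit code; 64-bit samples would need Decode64Bits instead.
    for _offset, _size, instruction, _hexdump in distorm3.Decode(0, code, distorm3.Decode32Bits):
        # Keep only the mnemonic (e.g. "MOV"), discarding operands as described in Section 3.1.
        opcodes.append(instruction.split()[0])
        if len(opcodes) == max_len:
            break
    return opcodes
```

The 200-mnemonic limit in the sketch mirrors the input length later selected in Section 4.3.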
3.2. Packer Detection
To detect packers in the BODMAS dataset, we employed Detect It Easy [26]. Detect It Easy is known for its comprehensive capability to identify various types of packers embedded in PE files, and it provides detailed information regarding the presence of packers within PE binaries. By utilizing the Detect It Easy tool, we were able to identify whether a sample was packed as well as the packer used, which enabled us to analyze the impact of packed samples and packers on our model's performance. This detection step is crucial for understanding how different packing techniques may affect the performance of our malware classification model and for addressing the specific challenges posed by these techniques. As the BIG2015 dataset does not include the actual binaries, we were unable to apply this method to it, since it requires the binaries themselves.

3.3. Model Architecture
Leveraging deep learning techniques, we automated the feature extraction process for analyzing PE malware through opcode sequences. We integrated an opcode embedding layer into our model architecture. Unlike integer-encoding and one-hot vector representations, which are limited in capturing complex patterns, the embedding layer enables the model to learn patterns and semantic relationships within the opcode data. Initially, the weights of the embedding layer are set randomly and updated during training. This allows the embedding layer to represent opcodes in a continuous vector space, where similar opcodes are positioned closer together. This approach reduces the dimensionality of the input data, alleviating computational complexity and improving generalization across various opcode sequences.
Building upon the embedding layer, our model includes several key layers to enhance the analysis. The embedding layer has an embedding dimension of 256. We applied a dropout layer with a rate of 0.25 after the embedding layer to prevent overfitting. The first convolutional layer uses 1024 filters with a kernel size of 7 and ReLU activation to capture local patterns of consecutive instructions within the opcode data. Following the convolutional layer, a global max pooling layer reduces dimensionality while retaining significant features. A dense layer with 1024 units further interprets these features, learning non-linear combinations. Another dropout layer with a rate of 0.25 is added before the final softmax classification layer, which translates the learned features into probability distributions across the different malware categories. The model is trained using the Adam optimizer with a learning rate of 1e-4. This structure allows the model to provide insights into the most likely classification for a given opcode sequence. The model structure is given in Figure 1.
Figure 1: Architecture of the opcode-based convolutional neural network model for malware family classification.
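For concreteness, a minimal Keras sketch of the architecture described above is given below; the vocabulary size, the ReLU activation of the dense layer, and the loss function are assumptions, as they are not stated explicitly in the text.

```python
import tensorflow as tf

VOCAB_SIZE = 1014   # distinct opcodes plus a padding index (assumed; see Section 5.2)
NUM_CLASSES = 20    # malware families in the BODMAS experiments

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=256),  # opcode embedding, dim 256
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Conv1D(filters=1024, kernel_size=7, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(1024, activation="relu"),  # activation assumed
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Adam with learning rate 1e-4 as stated in Section 3.3; the loss function is assumed.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```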
4. Dataset and Experimental Setup
This section describes the datasets utilized and the preprocessing procedures applied before training our malware family classification model.

4.1. Dataset 1: BIG2015
The Microsoft Malware Classification Challenge (BIG2015) dataset is a common benchmark in the field [27]. The dataset comprises 10,868 training and 10,873 testing PE malware samples labelled into 9 different families. Each malware sample provided by Microsoft is represented through two files: an assembly (.asm) file containing low-level language code, as well as a corresponding compiled bytes file containing CPU machine code. Table 1 provides an overview of the distribution of samples among the different malware families.

Table 1 Sample Distribution Across Malware Families in the BIG2015 Dataset.
Name             Samples   Type
Ramnit           1541      Worm
Lollipop         2478      Adware
Kelihos_ver3     2942      Backdoor
Vundo            475       Trojan
Simda            42        Backdoor
Tracur           751       Trojan Downloader
Kelihos_ver1     398       Backdoor
Obfuscator.ACY   1228      Obfuscated Malware
Gatak            1013      Backdoor

4.2. Dataset 2: BODMAS
The BODMAS dataset is an open dataset comprising 57,293 malware samples and 77,142 benign samples (a total of 134,435) [28]. In this dataset, malicious files were gathered between August 29, 2019, and September 30, 2020, and benign samples were collected from January 1, 2007, to September 30, 2020, aiming to represent the real-world distribution of PE binaries. The dataset also includes family information for the malicious samples, encompassing a total of 581 families. Additionally, the authors have provided timestamp information indicating the first-seen (also referred to as first-submission) date, which is based on VirusTotal reports.

4.3. Experimental Setup
We implemented several preprocessing steps before the training and testing process. In the BIG2015 dataset, opcodes were extracted from the .asm files using a regex pattern to isolate only the opcodes. For the BODMAS dataset, which consists of actual PE binaries, the distorm3 disassembler [25] was employed to perform the opcode extraction. Operands were removed from the instructions in both datasets to simplify analysis and capture the essential behavior of the malware. Additionally, the input was limited to the first 200 opcodes for each sample. This number was determined based on the results of our preliminary experiments, as shown in Figure 2. This approach helped reduce complexity and provided a consistent input size for our deep learning model.
Figure 2: Model accuracy for varying opcode sequence lengths (BODMAS dataset).
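As an illustration of how the extracted mnemonics could be turned into fixed-length model inputs, the sketch below builds an opcode vocabulary and pads or truncates each sequence to 200 tokens; the vocabulary construction and post-padding scheme are assumptions, since the text only specifies the 200-opcode limit.

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 200  # input length selected in Section 4.3


def build_vocab(opcode_sequences):
    """Map each distinct opcode mnemonic to an integer id (0 is reserved for padding)."""
    vocab = {}
    for seq in opcode_sequences:
        for op in seq:
            if op not in vocab:
                vocab[op] = len(vocab) + 1
    return vocab


def encode(opcode_sequences, vocab):
    """Convert mnemonic lists into padded/truncated integer arrays of length MAX_LEN."""
    encoded = [[vocab.get(op, 0) for op in seq] for seq in opcode_sequences]
    return pad_sequences(encoded, maxlen=MAX_LEN, padding="post", truncating="post")
```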
5. EVALUATION
In the context of our study on PE malware, we conducted a series of four experiments to comprehensively evaluate our proposed approach and to perform temporal analysis as well as concept drift analysis and mitigation. These experiments were designed to delve into our methodology's performance. The foundational datasets for our analysis were the Microsoft Malware Classification Challenge (BIG2015) [27] and BODMAS [28]. We adopted our deep learning-based approach using opcode word embeddings (see Section 3) to analyze the opcodes within the Microsoft dataset. These initial investigations aimed to establish a comparative benchmark for our baseline model's performance in a domain already replete with existing works. Shifting our focus to concept drift, we first applied our word embedding model to the entirety of the BODMAS dataset. Furthermore, to measure the temporal evolution of our model's efficacy, we conducted our final experiments using the BODMAS dataset separated into individual months.

5.1. Experiment 1: BIG2015
In this experiment, we evaluated the effectiveness of our model using the Microsoft Malware Classification Challenge dataset. Despite being relatively older, the BIG2015 dataset was chosen to provide an overall idea of our model's performance. We extracted opcodes from the .asm files using regular expressions to focus solely on opcode analysis. We enhanced the dataset quality by filtering out files containing only repetitive instructions, excluding 54 samples. Due to the absence of testing labels, the training dataset was used for both training and testing, employing a 5-fold cross-validation method. This involved splitting the dataset into five subsets, with each subset serving as the testing set once while the remaining portions were used for training. The dataset was thus split into an 80:20 ratio in each fold, ensuring that every family was represented in both the training and test sets. The model described in Subsection 3.3 was used for our experiments, with results shown in the normalized confusion matrix in Figure 3. In the confusion matrix, rows represent the true class labels, and columns indicate the predicted labels. Each cell shows the normalized proportion of predictions, with each row summing to 1, allowing for a straightforward comparison of the model's performance across classes. Additionally, Table 2 includes a comparative analysis with other state-of-the-art studies, using the results as cited in those papers. Although there might be minor variations in experimental setups, this test confirms the effectiveness of our model as a reliable baseline for assessing concept drift.
Figure 3: Confusion matrix for malware family classification using the BIG2015 dataset.

Table 2 Performance Comparison of Various Models on the BIG2015 Dataset.
Algorithm             Features                   Accuracy
Alaeiyan et al. [29]  Opcode frequencies         97.52%
Lu et al. [30]        Byte sequences             96.96%
Zhu et al. [31]       Byte & opcode sequences    99.05%
Ahmadi et al. [32]    A set of static features   99.76%
Our model             Opcode sequences           97.93%

5.2. Experiment 2: BODMAS
In this experiment, the BODMAS dataset is utilized [28]. While the dataset offers the actual binary files for the malicious samples, benign files are not provided by the authors due to copyright concerns. Due to these limitations in obtaining benign binaries, our investigation focused solely on malicious samples. We employed the 20 most prominent malware families from the dataset. This choice ensured precise training by establishing a minimum sample size. In our experimental phase, we constrained the opcode length to 200, guided by the experiments shown in Figure 2, as the highest overall accuracy score was achieved using a 200-opcode length.

Full training on BODMAS
In this experiment, we evaluated our model on the BODMAS dataset to understand its performance on a more recent dataset. Given the availability of timestamps for each sample, we opted not to use k-fold cross-validation. Instead, we followed the guidelines specified in [5] by splitting the dataset into an 80:20 ratio based on the first-seen timestamp. This approach involved using older samples to train the model and newer samples to test it, which helps eliminate temporal bias and simulates a realistic setting. Specifically, the training set includes samples from August 2019 to June 2020, while the testing set includes samples from June 2020 to September 2020. Additionally, 10% of the training data was reserved as a validation set to ensure the model's resilience. We used the same word embedding model from our initial experiment with the BIG2015 dataset (see Subsection 5.1).
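A minimal sketch of the timestamp-based split described above is shown below, assuming the sample metadata is available as a pandas DataFrame; the `first_seen` column name and the placement of the validation hold-out at the end of the training portion are assumptions for the example.

```python
import pandas as pd


def temporal_split(df: pd.DataFrame, train_frac: float = 0.8, val_frac: float = 0.1):
    """Chronological split: oldest 80% for training, newest 20% for testing,
    with 10% of the training portion held out for validation (Section 5.2)."""
    df = df.sort_values("first_seen")          # hypothetical first-seen timestamp column
    cut = int(len(df) * train_frac)
    train, test = df.iloc[:cut], df.iloc[cut:]
    val_cut = int(len(train) * (1 - val_frac))
    return train.iloc[:val_cut], train.iloc[val_cut:], test
```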
The resulting testing dataset comprises 6,375 malware samples distributed across 20 different families. In the training dataset, there were a total of 1013 distinct vocabulary items, representing unique opcodes. As mentioned in Subsection 3.3, a word embedding size of 256 was employed; reducing the embedding size below 256 led to a slight drop in performance, while increasing it beyond that value did not impact performance despite the increased resource and computational demands. The achieved accuracy was 89.55% across the 20 malware families, and a detailed confusion matrix is given in Figure 4. Additionally, Table 3 compares our model's performance with other studies. To facilitate a straightforward comparison with the monthly separated temporal analysis experiment described below, we used the 20 most prevalent families according to our training month, September 2019.
Figure 4: Confusion matrix for malware family classification using the BODMAS dataset.

Table 3 Performance Comparison of Various Models on the BODMAS Dataset.
Research             Accuracy   Type     Number of classes   Temporal analysis
Lu et al. [30]       96.96%     Multi    11                  No
Lu et al. [30]       93.42%     Multi    49                  No
Hao et al. [33]      99.29%     Multi    14                  No
Louk and Tama [34]   99.97%     Binary   -                   No
Our work             89.55%     Multi    20                  Yes

Temporal analysis on BODMAS
In this experiment, our primary goal was to conduct a temporal analysis of our model's performance. We categorized the malware samples based on their occurrence over 14 months, from August 2019 to September 2020. To ensure sufficient representation of each malware family for comprehensive training, we combined the data from August and September 2019; using August's data alone would not provide enough samples. We then evaluated the model's performance over the subsequent 12 months, from October 2019 to September 2020, allowing us to thoroughly assess the model's adaptability and effectiveness over time. Our investigation focused on the top 20 malware families. While the model initially showed a relatively strong accuracy of 83.14%, this accuracy varied significantly throughout the year. The performance fluctuated, as illustrated in Figure 5, with notable dips. Specifically, the accuracy ranged from a low of 65.49% to a high of 91.45%. This indicates variability in the model's ability to correctly classify malware samples over time, reflecting the possible impact of concept drift.
When analyzing only the non-packed malware, the model's performance demonstrated higher consistency and generally better accuracy throughout the year. The accuracy for non-packed malware remained above 85% for most months, suggesting that packed malware contributes significantly to the observed variability and overall reduction in model performance. This highlights the importance of effectively handling packed malware in malware classification tasks to maintain robust and reliable performance.
Figure 5: Temporal accuracy trends for malware classification on the BODMAS dataset, comparing packed vs. non-packed malware.

Concept drift mitigation
To explore and address the concept drift phenomenon, we conducted a preliminary experiment to understand how our dataset evolves. From October 2019 to September 2020, we selected 10 samples from each malware family (totaling 200 samples) each month. These samples were used to fine-tune our classification model, originally trained on August and September 2019 data. We then evaluated the model's performance on the entire dataset for each subsequent month.
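A minimal sketch of this monthly fine-tuning step is given below, assuming pandas metadata with hypothetical `month` and `family` columns and the Keras model sketched in Section 3.3; the number of fine-tuning epochs and the reduced learning rate are assumptions, as they are not reported in the text.

```python
import tensorflow as tf


def fine_tune_on_month(model, meta, X, y, month, samples_per_family=10, epochs=5):
    """Fine-tune the base model on a small per-family sample drawn from one month."""
    idx = (
        meta[meta["month"] == month]        # hypothetical month column
        .groupby("family")
        .head(samples_per_family)           # 10 samples per family, as in the experiment
        .index.to_numpy()
    )
    # A lowered learning rate is assumed so the update does not erase earlier knowledge.
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    model.fit(X[idx], y[idx], epochs=epochs, verbose=0)
    return model
```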
The results, shown in Figure 6, illustrate the effectiveness of the fine-tuned models. The blue line represents the base model, while the red line depicts the fine-tuned model, and the subcaptions indicate the months from which additional samples were collected for retraining. The minimal difference between the blue and red lines indicates that retraining the model with samples from October 2019 to January 2020 did not significantly improve performance. However, a significant improvement is observed when the model is fine-tuned with data from February 2020. Specifically, the model's accuracy on February data increased and, importantly, the accuracy on subsequent months' data also improved. For example, the accuracy in March, which initially had the lowest value of 71.5%, increased by approximately 20% to 91.7% with the fine-tuned model, despite the model not being trained on any March data. This suggests that the model adapted well to new malware variants appearing in February, which also benefited the performance in later months. This pattern is also observed for models retrained with samples from March, April, May, July, August, and September 2020, except for June, where the improvement was not as pronounced. This exception may indicate that June did not introduce significant new variants. These findings highlight the evolving nature of malware and underscore the importance of regularly updating the model with new data to maintain its accuracy. This preliminary experiment informs future research aimed at developing robust strategies for detecting and mitigating concept drift.
Figure 6: Fine-tuned model performance over time, with one panel per retraining month from (a) October 2019 to (l) September 2020. Blue line: base model trained on August/September 2019 data. Red line: model fine-tuned with selected samples from the month indicated in the subtitle.

6. Impact of Packed Malware and Packer
Packed malware and packers play a critical role in the field of malware analysis [8]. Packers are tools used to compress or encrypt executable files, often to obfuscate the code and evade detection by security software. When a file is packed, its original code is transformed into a more compact and unreadable format, which is then decompressed or decrypted back into its original form during execution. Malware authors commonly use this technique to hide malicious payloads and hinder reverse engineering efforts. Analyzing packed malware, therefore, requires specialized techniques to unpack and examine the underlying code, making it a challenging but essential aspect of modern cybersecurity practices.
Our deeper analysis of incorrectly classified samples found that about 67% of these samples were packed. This led us to investigate the role of packers and their effect on our model's performance; we wanted to see whether the packing process caused the misclassifications or whether other factors were at play. We also noticed that not all packed malware was incorrectly classified. This suggested that the type of packer used might significantly influence classification accuracy, which is also indicated in [8]. To understand this better, we analyzed how different packers affected our model's ability to identify malware families correctly. Table 5 indicates the number of samples for each packer, how many were misclassified, and the percentage of misclassified samples for each packer. In our analysis of the various malware families, we observed significant variability in detection accuracy, sample size, and the prevalence of packed samples. Table 4 provides a detailed overview of this variability across the different malware families. Notably, families such as Dinwod, Fuerboos, Ganelp, and Wacatac exhibited low average accuracy rates of 27.57%, 27.78%, 23.37%, and 59.90%, respectively.
Many samples from these families were packed, likely contributing to the reduced classification accuracy. This high prevalence of packed samples suggests that the packing process significantly impacts the ability to classify these malware families accurately.

Table 4 Summary of Detection Accuracy, Sample Size, and Packed Malware Proportion for Each Malware Family from October 2019 to September 2020, Highlighting the Primary Packer Type.
Family      Average Accuracy   Total Samples   Packed Samples (%)   Majority Packer (%)
autoit      98.32%             834             37.53%               UPX (100.00%)
autorun     79.33%             571             34.50%               PECompact (44.67%)
berbew      99.10%             1674            0.48%                DxPack (100.00%)
ceeinject   96.63%             1099            0.45%                UPX (100.00%)
delf        96.93%             749             1.20%                AsPack (55.56%)
dinwod      27.57%             1788            85.57%               DxPack (47.58%)
fuerboos    27.78%             198             90.40%               UPX (84.36%)
gandcrab    99.22%             769             2.21%                UPX (94.12%)
ganelp      23.37%             2148            76.63%               Petite (100.00%)
gepys       84.28%             935             13.69%               MPRESS (51.56%)
mira        98.15%             1889            0.95%                PECompact (94.44%)
padodor     9.91%              646             7.89%                DxPack (100.00%)
qqpass      95.50%             200             79.00%               Petite (43.67%)
sillyp2p    99.87%             1510            4.44%                UPX (100.00%)
small       77.90%             3059            3.04%                UPX (98.92%)
unruy       99.18%             734             0.41%                UPX (100.00%)
upatre      95.07%             3447            21.26%               MPRESS (54.16%)
urelas      79.25%             583             13.21%               PECompact (49.35%)
wabot       97.96%             3521            1.90%                MPRESS (100.00%)
wacatac     59.90%             4242            79.00%               UPX (51.12%)

To better understand the impact of different packers on misclassification, we analyzed the misclassified samples based on the packer used. We identified the total number of misclassified samples for each packer and the classes into which these samples were incorrectly classified. For instance, among the samples packed with the Petite packer, we found that out of 2045 misclassified samples, 1850 were misclassified as Wacatac, accounting for 90.45% of the misclassifications. Similarly, for the MPRESS packer, 124 misclassified samples were classified as Upatre, which is 97% of the total misclassifications. This highlights the significant influence of the type of packer on the model's misclassification behaviour, suggesting that certain packers may introduce features more strongly associated with specific malware families, leading to higher misclassification rates.

Table 5 Misclassification Rates by Packer Type in Malware Detection.
Packer      Total Samples   Number of Misclassified Samples   Ratio of Misclassification
ASPack      83              32                                38.55%
DxPack      1346            858                               63.74%
MPRESS      617             133                               21.56%
PECompact   280             126                               45.00%
Petite      3280            2054                              62.62%
UPX         3022            1379                              45.63%
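As an illustration of how the per-packer figures in Table 5 might be derived, the sketch below assumes a pandas DataFrame of test predictions with hypothetical `packer`, `true_family`, and `predicted_family` columns; this is not the exact analysis script used in this work.

```python
import pandas as pd


def packer_misclassification_summary(results: pd.DataFrame) -> pd.DataFrame:
    """Per packer: sample count, misclassified count, misclassification ratio, and
    the family most often assigned to misclassified samples (e.g. Wacatac for Petite)."""
    results = results.copy()
    results["misclassified"] = results["true_family"] != results["predicted_family"]

    def summarise(group: pd.DataFrame) -> pd.Series:
        wrong = group[group["misclassified"]]
        return pd.Series({
            "samples": len(group),
            "misclassified": len(wrong),
            "ratio": len(wrong) / len(group),
            "most_common_wrong_label": wrong["predicted_family"].mode().iat[0] if len(wrong) else None,
        })

    return results.groupby("packer").apply(summarise)
```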
In conclusion, the presence of packed samples significantly impacts the model's classification performance. When a particular malware family predominantly uses a packer in the training data, the classifier labels any sample using that packer as belonging to that family. This results in misclassifications, where malware from different families using the same packer is incorrectly identified as the dominant family. Our study highlights a notable challenge in accurately distinguishing between malware families based on their packer associations. Furthermore, some packers have a more pronounced effect on misclassification, indicating that different packers vary in their impact on the classifier's performance.

7. CONCLUSION AND FUTURE WORK
In this study, our approach to malware family classification centered on utilizing opcodes to detect malware families through the effective implementation of a CNN. We began by employing the widely recognized Microsoft dataset to establish a robust benchmark for our CNN model. Building on this foundation, we extended the applicability of our model by applying the same architecture to classify the BODMAS dataset. To explore the temporal aspect, we trained our model using data from the initial period, determined by the first-submission (first-seen) date on VirusTotal, and subsequently tested its efficacy over the following months. Our observations revealed significant variations in the performance of our model across different periods, indicating the dynamic nature of malware behaviors over time.
Our results demonstrate that concept drift is a significant challenge in malware detection. As the model encounters new data over time, it initially shows a decline in performance. However, the model adapts to the new data patterns through periodic retraining with recent data samples, resulting in improved performance. This highlights the importance of continuous learning and model updating in maintaining high detection accuracy.
The impact of packers on malware classification was also thoroughly examined. Packers, which compress or encrypt executable files, significantly affect classification performance by obfuscating the malware's features. Our analysis showed that certain packers led to higher misclassification rates, suggesting that the type of packer used plays a crucial role in the effectiveness of malware detection systems. Addressing the challenges posed by packed malware is essential for developing robust detection models.
For future work, our research will focus on an in-depth exploration of drift detection techniques to enhance our model's adaptability. This involves identifying changes in data distribution over time, enabling us to detect when the model's performance might be compromised. Additionally, we plan to conduct a comprehensive study to explain the underlying reasons for the observed performance fluctuations across different temporal segments. By understanding the specific features and patterns contributing to the behavior of malware families, including the impact of packers, we aim to fortify our model's robustness.

Acknowledgments
Numan Halit Guldemir is supported by the Republic of Türkiye Ministry of National Education (MoNE-1416/YLSY).

References
[1] AV-Test, Malware statistics & trends report, 2023. URL: https://www.av-test.org/en/statistics/malware/, accessed: 12 May 2024.
[2] A. G. Kakisim, S. Gulmez, I. Sogukpinar, Sequential opcode embedding-based malware detection method, Computers & Electrical Engineering 98 (2022) 107703.
[3] R. Jordaney, K. Sharad, S. K. Dash, Z. Wang, D. Papini, I. Nouretdinov, L. Cavallaro, Transcend: Detecting concept drift in malware classification models, in: 26th USENIX Security Symposium (USENIX Security 17), 2017, pp. 625–642.
[4] K. Xu, Y. Li, R. Deng, K. Chen, J. Xu, Droidevolver: Self-evolving android malware detection system, in: 2019 IEEE European Symposium on Security and Privacy (EuroS&P), IEEE, 2019, pp. 47–62.
[5] F. Pendlebury, F. Pierazzi, R. Jordaney, J. Kinder, L. Cavallaro, TESSERACT: Eliminating experimental bias in malware classification across space and time, in: 28th USENIX Security Symposium (USENIX Security 19), 2019, pp. 729–746.
[6] F. A. Aboaoja, A. Zainal, F. A. Ghaleb, B. A. S. Al-rimy, T. A. E. Eisa, A. A. H. Elnour, Malware detection issues, challenges, and future directions: A survey, Applied Sciences 12 (2022) 8482.
[7] P. O'Kane, S. Sezer, K. McLaughlin, Obfuscation: The hidden malware, IEEE Security & Privacy 9 (2011) 41–47.
[8] H. Aghakhani, F. Gritti, F. Mecca, M. Lindorfer, S. Ortolani, D. Balzarotti, G. Vigna, C. Kruegel, When malware is packin' heat; limits of machine learning classifiers based on static analysis features, in: Network and Distributed Systems Security (NDSS) Symposium 2020, 2020.
[9] A. Afianian, S. Niksefat, B. Sadeghiyan, D. Baptiste, Malware dynamic analysis evasion techniques: A survey, ACM Computing Surveys (CSUR) 52 (2019) 1–28.
[10] C. Rathnayaka, A. Jamdagni, An efficient approach for advanced malware analysis using memory forensic technique, in: 2017 IEEE Trustcom/BigDataSE/ICESS, IEEE, 2017, pp. 1145–1150.
[11] R. Sihwail, K. Omar, K. A. Zainol Ariffin, S. Al Afghani, Malware detection approach based on artifacts in memory image and dynamic analysis, Applied Sciences 9 (2019) 3680.
[12] R. Sihwail, K. Omar, K. A. Z. Arifin, An effective memory analysis for malware detection and classification, Computers, Materials & Continua 67 (2021).
[13] T. Carrier, P. Victor, A. Tekeoglu, A. H. Lashkari, Detecting obfuscated malware using memory feature engineering, in: ICISSP, 2022, pp. 177–188.
[14] S. Jeon, J. Moon, Malware-detection method with a convolutional recurrent neural network using opcode sequences, Information Sciences 535 (2020) 1–15.
[15] X. Ling, L. Wu, J. Zhang, Z. Qu, W. Deng, X. Chen, Y. Qian, C. Wu, S. Ji, T. Luo, Adversarial attacks against Windows PE malware detection: A survey of the state-of-the-art, Computers & Security (2023) 103134.
[16] D. Bilar, Opcodes as predictor for malware, International Journal of Electronic Security and Digital Forensics 1 (2007) 156–168.
[17] E. S. Parildi, D. Hatzinakos, Y. Lawryshyn, Deep learning-aided runtime opcode-based windows malware detection, Neural Computing and Applications 33 (2021) 11963–11983.
[18] N. McLaughlin, J. Martinez del Rincon, B. Kang, S. Yerima, P. Miller, S. Sezer, Y. Safaei, E. Trickel, Z. Zhao, A. Doupé, Deep android malware detection, in: Proceedings of the Seventh ACM on Conference on Data and Application Security and Privacy, 2017, pp. 301–308.
[19] R. Lu, Malware detection with lstm using opcode language, arXiv preprint arXiv:1906.04593 (2019).
[20] A. S. Kale, V. Pandya, F. Di Troia, M. Stamp, Malware classification with word2vec, hmm2vec, bert, and elmo, Journal of Computer Virology and Hacking Techniques 19 (2023) 1–16.
[21] F. Barbero, F. Pendlebury, F. Pierazzi, L. Cavallaro, Transcending transcend: Revisiting malware classification in the presence of concept drift, in: 2022 IEEE Symposium on Security and Privacy (SP), IEEE, 2022, pp. 805–823.
[22] L. Yang, W. Guo, Q. Hao, A. Ciptadi, A. Ahmadzadeh, X. Xing, G. Wang, CADE: Detecting and explaining concept drift samples for security applications, in: USENIX Security Symposium, 2021, pp. 2327–2344.
[23] F. Ceschin, M. Botacin, H. M. Gomes, F. Pinagé, L. S. Oliveira, A. Grégio, Fast & furious: On the modelling of malware detection as an evolving data stream, Expert Systems with Applications 212 (2023) 118590.
[24] F. Zola, J. L. Bruse, M. Galar, Temporal analysis of distribution shifts in malware classification for digital forensics, in: 2023 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), IEEE, 2023, pp. 439–450.
[25] G. Dabah, gdabah/distorm, 2023. URL: https://github.com/gdabah/distorm, accessed: 13 September 2023.
[26] Horsicq, Detect It Easy: Program for determining types of files for Windows, Linux and MacOS, 2024. URL: https://github.com/horsicq/Detect-It-Easy, accessed: 01 July 2024.
[27] R. Ronen, M. Radu, C. Feuerstein, E. Yom-Tov, M. Ahmadi, Microsoft malware classification challenge, arXiv preprint arXiv:1802.10135 (2018).
[28] L. Yang, A. Ciptadi, I. Laziuk, A. Ahmadzadeh, G. Wang, BODMAS: An open dataset for learning based temporal analysis of PE malware, in: 2021 IEEE Security and Privacy Workshops (SPW), IEEE, 2021, pp. 78–84.
[29] M. Alaeiyan, A. Dehghantanha, T. Dargahi, M. Conti, S. Parsa, A multilabel fuzzy relevance clustering system for malware attack attribution in the edge layer of cyber-physical networks, ACM Transactions on Cyber-Physical Systems 4 (2020) 1–22.
[30] Q. Lu, H. Zhang, H. Kinawi, D. Niu, Self-attentive models for real-time malware classification, IEEE Access 10 (2022) 95970–95985.
[31] X. Zhu, J. Huang, B. Wang, C. Qi, Malware homology determination using visualized images and feature fusion, PeerJ Computer Science 7 (2021) e494.
[32] M. Ahmadi, D. Ulyanov, S. Semenov, M. Trofimov, G. Giacinto, Novel feature extraction, selection and fusion for effective malware family classification, in: Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, 2016, pp. 183–194.
[33] J. Hao, S. Luo, L. Pan, EII-MBS: Malware family classification via enhanced adversarial instruction behavior semantic learning, Computers & Security 122 (2022) 102905.
[34] M. H. L. Louk, B. A. Tama, Tree-based classifier ensembles for PE malware analysis: A performance revisit, Algorithms 15 (2022) 332.