Learning to Embed Byte Sequences with Convolutional Autoencoders

Doug Sibley
Cisco Talos
dosibley@cisco.com

Abstract

We propose a self-supervised approach to generating features for arbitrary byte sequences by training a convolutional autoencoder directly on raw bytes. The limited vocabulary of this task (256 values) makes it viable to train on sequences of at least 1MB in size. We evaluate this approach to byte-level feature engineering by first examining how accurately the autoencoder can reconstruct a variety of datasets, then testing the approach specifically on SOREL malware samples, extracting the learned features and comparing them against the EMBER V2 features on the task of malware tagging. Our results suggest that the learned features from the convolutional autoencoder rival those of the human-engineered set without requiring domain-specific preprocessing of Portable Executable files.

1. Introduction

Byte sequences possessing discernible structure and hierarchy, such as Portable Executable (PE) files, can be processed by domain experts to produce features amenable to machine learning tasks. The EMBER V2 features as presented in the EMBER[1] and SOREL[2] datasets are derived from processing such executable files and extracting features that researchers and data engineers in the information security space have identified as salient for classification. However, this process is knowledge intensive, requiring domain expertise on the specifics of the data contained within a given format.

Earlier natural language processing (NLP) work[3] showed that it is possible to learn high-level concepts, such as sentiment, by simply predicting the next character in a text sequence. Krčál et al.[4] had success training a Convolutional Neural Network (CNN) to detect malware from input samples consisting of a raw byte stream. Expanding on their work, we posit that it would be desirable to develop a method of extracting features from raw byte sequences without first requiring expert domain knowledge or a deep understanding of the format in which any potential input data may be stored.

2. Methodology

We leverage the self-supervised task of autoencoding to train a convolutional network whose goal is to reconstruct an input sequence of bytes by predicting each byte value. Because a byte can take only one of 256 values, the vocabulary for our prediction is small enough to calculate the cross-entropy loss directly, without resorting to approximate methods. This ceiling on our vocabulary size also provides a favorable performance envelope on modern GPUs, making it feasible to train models on sequences hundreds of thousands of bytes long. As we are reconstructing the raw input byte sequence, no preprocessing of the data is necessary other than converting each byte to its integer value.

Our contribution to the field is the observation that this byte reconstruction task is computationally viable. Furthermore, through this implementation we can design an autoencoder that has useful properties for downstream tasks.
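To make the objective concrete, the following is a minimal sketch of the input handling and loss described above. This is our illustration rather than code from the paper; PyTorch is assumed, and the function names are our own.

```python
# A file's raw bytes become an integer sequence with no further preprocessing,
# and the model is scored with per-position cross-entropy over the 256
# possible byte values (a vocabulary small enough for an exact softmax).
import numpy as np
import torch
import torch.nn.functional as F

def bytes_to_tensor(path: str) -> torch.Tensor:
    """Read a file and map each byte to its integer value in [0, 255]."""
    data = np.fromfile(path, dtype=np.uint8)
    return torch.from_numpy(data).long()      # shape: (seq_len,)

def reconstruction_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Average cross-entropy over every byte position.

    logits: (seq_len, 256) per-position predictions from the autoencoder.
    target: (seq_len,) original byte values the model must reconstruct.
    """
    return F.cross_entropy(logits, target)
```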
Given the potentially large length of input sequences, the priority for model design is good computational efficiency and throughput. We chose to focus on convolutional networks over recurrence because convolutions allow us to process an entire sequence in one training step and can be greatly accelerated on GPUs. While various recurrent neural network designs could also be applied to this task, issues with long-term credit assignment and training speed suggest that those architectures would be a poorer fit. Our autoencoder has three main components:

1. A bi-directional temporal CNN which produces an output for every position in the input sequence.
2. A global CNN which produces a fixed-length output based on the output of the temporal CNN.
3. A decoder network which accepts the temporal and global CNN outputs and attempts to predict the value of each byte.

Temporal convolutions[5] are a type of one-dimensional convolution in which the output at sequence position T depends only on information prior to T, in contrast with a regular convolution, which sees information from before, at, and after T. Stacking multiple layers of temporal convolutions increases the receptive field while still constraining the network to information prior to T. A bi-directional temporal CNN applies this concept in both directions: at position T the network has information from both before and after T, but crucially not from T itself. Due to this design, when the model attempts to predict the byte at position T, it must use information from the rest of the sequence, rather than simply learning to copy the value it sees at that location. Since every position in the input sequence has this constraint, we can attempt to predict every byte in a single training step, producing a high-quality training signal.

The second component of our autoencoder is a global CNN. This part of the model starts with the output from the temporal CNN, passes the data through several layers of convolutions and pooling, and finishes with a global max pooling layer. This produces a fixed-length vector whose size is a hyperparameter of the network and which can be treated as an embedding of the entire input sequence. The authors of MalConv[6] found success with a large initial receptive field followed by a global pooling layer. We similarly view the global pooling layer as critical to extracting high-level features that may be present at any position within the byte sequence.

Finally, the decoder component is a dense neural network. It takes the full sequence output of the temporal CNN and the fixed vector of the global CNN, producing a prediction for each byte position of the input sequence. The loss for the overall network is the average cross-entropy loss over the predicted bytes. The intent of this design is that the temporal CNN learns features in a local context for predicting byte values, while the global CNN learns features representing the entirety of the sequence. We can then embed any variable-length input sequence into a fixed-length feature vector by extracting just the global CNN output from our model.
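The leave-one-out constraint at the heart of the bi-directional temporal CNN can be implemented with a pad-and-shift construction, sketched below in PyTorch. This is our reading of the design, not the paper's code: the class name is ours, the channel counts are illustrative, and the dense connectivity of the real model is omitted.

```python
# Each direction is a causal convolution shifted by one position, so the
# features produced at position T depend on bytes strictly before (forward
# branch) or strictly after (backward branch) T, never on byte T itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeaveOneOutBiTemporalConv(nn.Module):
    def __init__(self, channels: int, width: int, dilation: int = 1):
        super().__init__()
        self.fwd = nn.Conv1d(channels, channels, width, dilation=dilation)
        self.bwd = nn.Conv1d(channels, channels, width, dilation=dilation)
        # A kernel of this width and dilation spans this many positions.
        self.reach = (width - 1) * dilation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, seq_len)
        # Left-pad by reach+1 and drop the last output, so output T covers
        # input positions T-reach-1 .. T-1 only.
        fwd = self.fwd(F.pad(x, (self.reach + 1, 0)))[..., :-1]
        # Mirror the trick: output T covers input positions T+1 .. T+reach+1.
        bwd = self.bwd(F.pad(x, (0, self.reach + 1)))[..., 1:]
        return torch.cat([fwd, bwd], dim=1)    # (batch, 2*channels, seq_len)
```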
Model Architecture

The core requirement for a model on this autoencoding task is that it return a prediction sequence of the same length as the input byte sequence. All experiments in this paper were conducted with the same model design, described below. The design has approximately 700k parameters and was trained on a single NVIDIA V100 GPU at a speed of 2.4 MB/s. Unless otherwise noted, all convolutions in the network are implemented in the densely connected style[7].

The bi-directional temporal CNN consists of two identical networks, one in each temporal direction, whose final outputs are concatenated. The input to the network is the byte sequence, with each byte embedded as a length-8 vector. Each direction consists of 6 layers of 16 features with a width of 15; the dilation rates, in order, are 1, 1, 5, 9, 13, 1. The output of the temporal CNN is a sequence of the same length as the input.

The global CNN takes as input the output of the bi-directional temporal CNN concatenated with the embedding vectors of the input byte sequence. Its primary purpose is to produce a fixed-length vector from the variable-length input, which it accomplishes with a global max pooling layer. It consists of 9 convolutional layers of 32 features with a width of 7. Layers 1 and 3 have a stride of 3, and a width-3 average pooling layer follows layer 6. Layers 5 and 8 have a dilation of 3, while layers 6 and 9 have a dilation of 5.

The decoder component takes as input the output of the bi-directional temporal CNN as well as the output of the global CNN, which is broadcast back to the shape of the temporal CNN. It additionally uses a single width-17 convolutional layer with 64 features, applied solely to the embedded input byte sequence. This layer is constrained during training such that the weights at position T are fixed to 0, effectively allowing the layer to see the 8 bytes before and after position T. Failing to constrain the weights in this way, or stacking multiple such layers together, would allow the decoder to observe the exact input byte it is attempting to predict and would short-circuit the training objective. These three inputs are passed through a 4-layer fully connected network with 64 features per layer, with the final output being a sequence of the same length as the input, used to calculate the per-byte cross-entropy prediction loss.
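One way to realize the decoder's zeroed-center constraint is to clear the middle tap of the kernel before every forward pass. The sketch below is our illustration in PyTorch; the paper does not specify the exact mechanism, and the padding choice here simply keeps the output length equal to the input length.

```python
# A width-17 Conv1d whose center weight is forced to zero, so the output at
# position T sees the 8 embedded bytes on either side but never byte T itself.
import torch
import torch.nn as nn

class CenterMaskedConv1d(nn.Conv1d):
    def __init__(self, in_channels: int, out_channels: int, width: int = 17):
        assert width % 2 == 1, "width must be odd so there is a single center tap"
        super().__init__(in_channels, out_channels, width, padding=width // 2)
        self.center = width // 2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Re-zero the center tap before applying the convolution, so the
        # constraint holds no matter what the optimizer wrote there.
        with torch.no_grad():
            self.weight[:, :, self.center] = 0.0
        return super().forward(x)
```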
3. Autoencoder Evaluation

To evaluate the efficacy of this design on our autoencoding task, we trained a set of models with fixed hyperparameters on a variety of datasets and observed the accuracy of the autoencoder at predicting the correct byte values across multiple domains. The tested datasets were:

1. Wikipedia site dumps in XML format, containing natural language articles inside XML structure, roughly 1GB of data for each selected language. This data tests the autoencoder's ability to represent single- and multi-byte character sets.
2. MNIST images represented either as pixel values in a numpy array (ideal format) or as bytes from saved PNG/JPG images. This data tests the autoencoder's ability to capture information when the underlying bytes are not well modeled by a 1D CNN, or when the information is compressed.
3. SOREL 20M malware PE files. This data tests the ability of the autoencoder to represent complex real-world inputs.
4. Byte values drawn from a uniform random distribution. This tests that the model is being forced to predict byte values from the surrounding context rather than by passing through the input byte.

Below we report the average accuracy at predicting the byte value after the model has been trained:

Dataset             Accuracy
English Wiki        69%
German Wiki         68%
Greek Wiki          82%
Hebrew Wiki         74%
Japanese Wiki       71%
Russian Wiki        80%
Chinese Wiki        64%
MNIST Ideal         88%
MNIST PNG           32%
MNIST JPG           7%
SOREL Malware PE    17%
Random Uniform      0.39%

Accuracy on the Wikipedia data is the highest among the tested datasets, as natural language and XML both provide strong context for a given byte in its closest surrounding bytes. Accuracy on the ideal MNIST data is also very high; however, most pixels in MNIST have the value 0. Excluding predictions for byte value 0, the autoencoder reached roughly 36% accuracy on the ideal data format. The same adjustment to the PNG accuracy yields 25%, while the accuracy on the JPG dataset is the same whether 0 is included or excluded. We thus observe a loss of accuracy as the structure of the signal in the images becomes more complex in the PNG dataset, and accuracy degrades further in the JPG dataset due to the inherent entropy of lossy compression.

Accuracy on the SOREL malware samples shows that the reconstruction task is more challenging than natural language but easier than the most challenging MNIST format. As PE files can contain several types of information, such as headers, executable code, and arbitrary data such as strings, the model faces a varying challenge even within the same sample.

The random uniform data serves as an implementation correctness test, as we want the model to learn to predict byte values from the surrounding context without using the byte's identity directly. If there were an error in the implementation, such as misaligned temporal convolutions, the model would achieve near-perfect accuracy by passing through the input. The results confirm that our model is properly constrained to the surrounding context and limited to an accuracy of 1/256.

4. Learned Feature Evaluation

To evaluate whether the autoencoding task forces the network to learn interesting features, trained versions of the MNIST and SOREL models were used to produce fixed-length feature vectors for their respective training sets by extracting the output values from the global encoder portion of the autoencoder. These features were then used to train a Random Forest classifier to predict MNIST digit classes or SOREL malware tags, allowing us to evaluate whether the features learned from the purely self-supervised training task are salient to a known classification task for the data.

MNIST Evaluation

The MNIST model was evaluated on all three data formats in order to understand how increasing the complexity of the input data format impacts the quality of the learned representation. For each model, we embed the MNIST test set and split it into a new training and test set to evaluate the accuracy of predicting the associated digit. This is done both before and after the autoencoder model is trained on the MNIST training set, so that we can observe whether the self-supervised training causes the fixed feature representation to better capture the classes in the downstream classifier.
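A minimal sketch of this evaluation protocol follows. It is our own illustration: `global_encoder` is a hypothetical handle to the global CNN component of a trained autoencoder, and scikit-learn defaults stand in for Random Forest settings the paper does not report.

```python
# Embed each sample with the frozen global encoder, then fit a Random Forest
# on the resulting fixed-length vectors and score it on held-out samples.
import torch
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def embed(autoencoder, byte_tensors):
    """Map variable-length byte sequences to fixed-length feature vectors."""
    autoencoder.eval()
    with torch.no_grad():
        return torch.stack([
            autoencoder.global_encoder(x.unsqueeze(0)).squeeze(0)
            for x in byte_tensors      # one sample at a time; lengths vary
        ]).numpy()

# Hypothetical usage, with X_* as lists of byte tensors and y_* as labels:
# clf = RandomForestClassifier(n_estimators=100).fit(embed(model, X_train), y_train)
# print(accuracy_score(y_test, clf.predict(embed(model, X_test))))
```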
Data Format   Untrained Autoencoder Accuracy   Trained Autoencoder Accuracy
Ideal         71.6%                            79.3%
PNG           26.9%                            59%
JPG           26.3%                            48.7%

For all three models, we observe that the accuracy of predicting the associated digit improves when using representations from the trained version of the autoencoder, even though the autoencoding task uses no information about the digit classes. The untrained autoencoder on the ideal data format is still relatively skillful; in this case it is essentially functioning as an extreme learning machine, yet it still improves with the self-supervised training. For the PNG and JPG data the improvement is more pronounced, as the input data is more compressed and less directly representative of the underlying class than the ideal pixel format. It is also interesting that the relative performance of the classification models matches the order seen in autoencoder training. However, the drop-off in reconstruction accuracy was more pronounced than the drop-off in classification accuracy, showing that even when the autoencoder cannot achieve high reconstruction accuracy, it is still capable of learning features relevant in a downstream classification context. Figure 1 shows a tSNE embedding of the autoencoder feature vectors extracted with the PNG model, revealing that even in PNG format the autoencoder is able to learn features which are salient to the classes present in the data.

Figure 1: tSNE embedding of 10000 MNIST PNG images using the Autoencoder feature vector.

SOREL Evaluation

To evaluate how effective our autoencoder is on the SOREL malware samples, a set of samples not used in the self-supervised task was held out and embedded with the trained SOREL autoencoder. Three Random Forest models were trained: one on the autoencoder feature vector, one on the provided EMBERv2 features, and one on both the autoencoder and EMBERv2 features. The training data for the classifier consisted entirely of SOREL malware samples, with the models attempting to predict any of the malware tags associated with a given sample.

Figure 2: tSNE embedding of 50000 SOREL Malware samples using the Ember features.

            EMBERv2 Model   Autoencoder Model   Joint Model
Accuracy    83.97%          83.96%              83.97%
Precision   96.88%          96.88%              96.87%
Recall      92.92%          92.93%              92.94%

For the reported metrics, we observe that the model using only features from the autoencoder performs comparably to the model using the human-engineered features from EMBERv2. Figures 2 and 3 visualize a tSNE embedding of the same set of 50k samples using the Ember and autoencoder features, showing that both can partition the data with respect to the malware tags to a certain extent. Performance between all three models is quite close. We attribute this to label noise in the malware tags, which are themselves derived from machine learning[8] and likely contain some level of error. With this in mind, we do not claim that the autoencoder features outperform the human-engineered set, but rather that the autoencoder can learn a representation with similar performance without requiring human domain expertise.

Figure 3: tSNE embedding of 50000 SOREL Malware samples using the Autoencoder features.
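The tSNE projections in Figures 1-3 follow the standard recipe of reducing the fixed-length feature vectors to two dimensions and coloring points by class or tag. A minimal sketch with scikit-learn and matplotlib follows; it is our illustration, with perplexity and other tSNE settings left at their defaults, as the paper does not report them.

```python
# Project feature vectors to 2D with tSNE and scatter-plot them by label.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    """features: (n_samples, n_features) array; labels: one tag per sample."""
    coords = TSNE(n_components=2).fit_transform(features)
    for tag in sorted(set(labels)):
        mask = [label == tag for label in labels]
        plt.scatter(coords[mask, 0], coords[mask, 1], s=2, label=str(tag))
    plt.legend(markerscale=4)
    plt.title(title)
    plt.show()
```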
We further validate this claim by examining the joint Random Forest model, where we find that 42 of the top 100 most important features were produced by the autoencoder. Figure 4 shows all feature importances from the model, divided into the 10 subsections that comprise the Ember feature set, alongside the autoencoder features.

Figure 4: Feature importance graph from the joint Random Forest model, comparing the Autoencoder features to the 10 sections of Ember features. Individual features for each subsection are plotted along the X axis, the order of which is not relevant. The Y axis is the feature importance reported by the model; higher is more important.

5. Conclusion

Our experiments suggest that learning representations of bytes with a convolutional autoencoder can be effective for identifying salient features present in a set of data. Evaluating this approach on the SOREL dataset shows that the method rivals the efficacy of human-engineered features, with the added advantages that it requires no domain knowledge and can be applied to raw input sequences. Additionally, because the majority of the autoencoder is implemented with convolutions, our model architecture benefits from the industry's prior work on optimizing the processing speed of convolutions on GPUs.

References

[1] H. S. Anderson, P. Roth, EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models, 2018.
[2] R. Harang, E. M. Rudd, SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection, 2020.
[3] A. Radford, R. Jozefowicz, I. Sutskever, Learning to Generate Reviews and Discovering Sentiment, 2017.
[4] M. Krčál, O. Švec, O. Jašek, M. Bálek, Deep Convolutional Malware Classifiers Can Learn from Raw Executables and Labels Only, 2017.
[5] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu, WaveNet: A Generative Model for Raw Audio, 2016.
[6] E. Raff, J. Barker, J. Sylvester, R. Brandon, B. Catanzaro, C. Nicholas, Malware Detection by Eating a Whole EXE, 2017.
[7] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely Connected Convolutional Networks, 2016.
[8] F. N. Ducau, E. M. Rudd, T. M. Heppner, A. Long, K. Berlin, Automatic Malware Description via Attribute Tagging and Similarity Embedding, 2019.