=Paper=
{{Paper
|id=Vol-3095/paper1
|storemode=property
|title=BETH Dataset: Real Cybersecurity Data for Unsupervised Anomaly Detection Research
|pdfUrl=https://ceur-ws.org/Vol-3095/paper1.pdf
|volume=Vol-3095
|authors=Kate Highnam,Kai Arulkumaran,Zachary Hanif,Nicholas R. Jennings
}}
==BETH Dataset: Real Cybersecurity Data for Unsupervised Anomaly Detection Research==
Kate Highnam∗†1,2, Kai Arulkumaran∗1,3, Zachary Hanif∗4, and Nicholas R. Jennings5

1 Imperial College London; 2 The Alan Turing Institute; 3 ARAYA Inc.; 4 University of Maryland, College Park; 5 Loughborough University

===Abstract===
We present the BETH cybersecurity dataset for anomaly detection and out-of-distribution analysis. With real “anomalies” collected using a novel low-level tracking system, our dataset contains over eight million data points tracking 23 hosts. Each host has captured benign activity and, at most, a single attack, enabling cleaner behavioural analysis. In addition to being one of the most modern and extensive cybersecurity datasets available, BETH enables the development of anomaly detection algorithms on heterogeneously-structured real-world data, with clear downstream applications. We give details on the data collection, suggestions on pre-processing, and analysis with initial anomaly detection benchmarks on a subset of the data.

===1 Introduction===
When deploying machine learning (ML) models in the real world, anomalous data points and shifts in the data distribution are inevitable. From a cyber security perspective, these anomalies and dataset shifts are driven by both defensive and adversarial advancement. To withstand the cost of critical system failure, the development of robust models is therefore key to the performance, protection, and longevity of deployed defensive systems.

Current research into the robustness of ML models tends to be based on modifying common datasets, and extrapolating performance to disparate anomaly detection applications. For out-of-distribution (OoD) estimation, researchers combine pairs of existing datasets, such as MNIST-FashionMNIST, CIFAR10-CelebA, or CIFAR10-ImageNet32 [21], to mimic data distribution shift.
Other data for evaluating robustness is constructed by modifying samples within datasets; for example, corrupting, perturbing, or shifting samples [6, 22] to generate anomalies. These primarily use image datasets [13, 3, 31, 10, 27, 17], and sometimes text datasets [14, 12, 7]. Such experiments are said to be motivated by domains such as security, but it is unknown exactly how well new methods (in particular, those centred around deep learning (DL)) may generalise beyond these input modalities in other real-world applications. The difficulty with real datasets is that complexity and domain expertise remain a barrier to entry for ML researchers.

In this paper, we present the BPF-extended tracking honeypot (BETH) dataset (footnote 1) as the first cybersecurity dataset for robustness benchmarking in unsupervised anomaly detection. Collected using a novel honeypot tracking system, our dataset has the following properties that make it attractive for the development of robust ML methods: 1) at over eight million data points, this is one of the largest cyber security datasets available; 2) it contains modern host activity and attacks; 3) it is fully labelled (for verification); 4) it contains highly structured but heterogeneous features; and 5) each host contains benign activity and at most a single malicious user, which is ideal for behavioural analysis and other research tasks. In addition to the described dataset, further data is currently being collected and analysed to add alternative attack vectors to the dataset.

Footnotes: ∗ Equal contribution. † Corresponding author: kwh19@ic.ac.uk. 1 Available at https://www.kaggle.com/katehighnam/beth-dataset

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Proceedings of the Conference on Applied Machine Learning for Information Security, 2021.
There are several existing cybersecurity datasets commonly used in ML research: the KDD Cup 1999 Data [8], the 1998 DARPA Intrusion Detection Evaluation Dataset [11, 15], the ISCX IDS 2012 dataset [29], and the NSL-KDD dataset [30]. The KDD’99 data is an abstracted view of network activity from the DARPA’98 data; the NSL-KDD dataset is the KDD’99 data after, primarily, removing duplicates. Each includes millions of records of realistic activity for enterprise applications, with labels for attacks or benign activity. The KDD1999, NSL-KDD, and ISCX datasets contain network traffic, while the DARPA1998 dataset also includes limited process calls. However, even the most recent of these datasets is almost a decade old, and all were collected on on-premises servers. In contrast, BETH contains modern host activity collected from cloud services, making it relevant for current real-world deployments. In addition, some datasets include artificial user activity [29], while BETH contains only real activity. BETH is also one of the few datasets to include both kernel-process and network logs, providing a holistic view of malicious behaviour. Refer to Table 1 for further comparison between BETH and these other cybersecurity datasets.

This paper begins with a description of the data collection process and the relevance of the available features. We then perform an analysis of the first set of kernel-level process logs collected, including anomaly detection benchmarks (see footnote 2). Our benchmarks include both traditional baselines [26, 28, 16] and a state-of-the-art deep-learning-based method [21]. In summary, the isolation forest (iForest) [16] achieves the highest area under the receiver operating characteristic curve (AUROC) on the initial, labelled subset of our data. We believe the scale and range of attacks available in our full dataset will pose a challenge for all current anomaly detection methods.
===2 The BETH Dataset===
The BETH dataset currently represents 8,004,918 events collected over 23 stand-alone honeypots, running for about five non-contiguous hours on a major cloud provider. For benchmarking and discussion, we selected the initial subset of the process logs. This subset was further divided into training, validation, and testing sets with a rough 60/20/20 split based on host, quantity of logs generated, and the activity logged. Only the test set includes malicious activity, as expected when training unsupervised anomaly detection models. Table 2 provides a summary of the dataset, while Table 3 and Table 4 provide a description of the kernel-process and DNS log features, respectively. In this section, we first detail the log collection methodology, followed by a description of the overall dataset. The final subsection discusses potential research questions that could be investigated using our dataset.

Footnote 2: https://github.com/jinxmirror13/BETH_Dataset_Analysis

{| class="wikitable"
|+ Table 1: Comparison of our BETH dataset against other cybersecurity datasets used in ML research. These attributes signify the ideal features of a robust model benchmark and reduce the barrier to entry for ML researchers looking for a real-world dataset. The attributes were manually verified and compared with Ring et al.’s [25] table of network-based datasets.
! Dataset Name !! Size !! Kernel Processes !! Re-created Traffic !! Single User Activity !! Simple Network Environment !! Cloud Infrastructure
|-
| DARPA1998 [11] || Not stated, 4GB compressed || ✓ || ✓ || ✗ || ✗ || ✗
|-
| KDD1999 [8] || 7+ million records || ✗ || ✓ || ✗ || ✗ || ✗
|-
| NSL-KDD [30] || ∼2 million records || ✗ || ✓ || ✗ || ✗ || ✗
|-
| ISCX IDS2012 [29] || ∼2 million flows and ∼81.1GB compressed || ✗ || ✓ || ✗ || ✗ || ✗
|-
| BETH || ∼8 million records || ✓ || ✗ || ✓ || ✓ || ✓
|}

{| class="wikitable"
|+ Table 2: General characteristics of the kernel-process logs, including our initial benchmark subset.
! Dataset !! Length !! % of subset !! # of hosts
|-
| Training || 763,144 || 66.88% || 8
|-
| Validation || 188,967 || 16.56% || 4
|-
| Testing || 188,967 || 16.56% || 1
|-
| Subset Total || 1,141,078 || 100% || 13
|-
| Total || 8,004,918 || - || 23
|}

{| class="wikitable"
|+ Table 3: The description and type of each feature within the kernel-level process logs, tracking every create, clone, and kill process call. Starred features were included in the model baselines and converted as described in Appendix A.
! Feature !! Description
|-
| timestamp || Seconds since system boot
|-
| processId* || Integer label for the process spawning this log
|-
| threadId || Integer label for the thread spawning this log
|-
| parentProcessId* || Parent’s integer label for the process spawning this log
|-
| userId* || Login integer ID of user spawning this log
|-
| mountNamespace* || Set mounting restrictions this process log works within
|-
| processName || String command executed
|-
| hostName || Name of host server
|-
| eventId* || ID for the event generating this log
|-
| eventName || Name of the event generating this log
|-
| argsNum* || Length of args
|-
| returnValue* || Value returned from this event log (usually 0)
|-
| stackAddresses || Memory values relevant to the process
|-
| args || List of arguments passed to this process
|-
| sus || Binary label as a suspicious event (1 is suspicious, 0 is not)
|-
| evil || Binary label as a known malicious event (0 is benign, 1 is not)
|}

====2.1 Collection Methodology====
The challenge of crafting a honeypot is two-fold: make it tempting enough to infiltrate, and track activity without being detected. The former is typically done by providing “free” resources to an attacker, i.e., easily accessible computer power. Our implementation currently runs hosts with a single SSH vulnerability: any password will be accepted to log in. This is enough to entrap automated, naive SSH scanning and brute force methods. In the future we plan to deploy hosts with other vulnerabilities, with which we hope to observe other attack vectors.

To log activity in real time, each host runs Ubuntu 18.04 with a Docker container [20]. This encapsulates our two-sensor monitoring system utilising the extended Berkeley Packet Filter (eBPF) [5], which runs isolated programs in a Linux kernel sandbox to directly monitor the packets that it processes. Due to various checks and restrictions placed on the design of these programs, they can be placed in a live kernel without needing to restart it. The eBPF enables tracking of any user-level activity at pre-specified points.

The first sensor is set to listen and exfiltrate relevant data packets resulting from internal activity. In particular, this sensor tracks all Linux system calls to create, clone, and kill processes.
The second sensor logs network traffic, specifically DNS queries and responses from all processes on the host machine, including those processes running within the hosted Docker containers. When the desired packet appears, it is parsed out to pre-defined fields and then transmitted to a collection server.

These process and network data packets are gathered on a separate node hosting a message queue server. Sensor clients connect to this server and transmit their individual data packets as events occur, without batching. This allows the collection and re-transmission of data at an atomic level. The data is then stored locally in two correlated files with respective logs, allowing analysts to understand both local and network effects that occur jointly.

====2.2 Dataset Characteristics====
The dataset is composed of two sensor logs: kernel-level process calls and network traffic. As the initial benchmark subset only includes process logs, this section only covers these; a description of the network logs can be found in Table 4.

{| class="wikitable"
|+ Table 4: The description and type of each feature within the DNS logs.
! Feature !! Description
|-
| timestamp || Date and time in the format “YYYY-MM-DDTHH:MM:SSZ” for when the packet was sent or received
|-
| SourceIP || Source IP address of the packet
|-
| DestinationIP || Destination IP address of the packet
|-
| DnsQuery || Sent DNS query (e.g. the URL submitted - "google.com")
|-
| DnsAnswer || DNS response; can be null
|-
| DnsAnswerTTL || The time to live of the DNS answer; a list of integers sent as strings, can be null
|-
| DnsQueryNames || Name of the requested resource
|-
| DnsQueryClass || Class code for the resource query
|-
| DnsQueryType || Type of resource record (A, AAAA, MX, TXT, etc.)
|-
| NumberOfAnswers || Number of answer headers in the packet
|-
| DnsOpCode || Header information regarding which operation this packet was sent (e.g. standard query is 0)
|-
| SensorId || Name of host server; same as the hostName in the process records
|-
| sus || Binary label as a suspicious event (1 is suspicious, 0 is not)
|-
| evil || Binary label as a known malicious event (0 is benign, 1 is not)
|}

Each process call consists of 14 raw features and 2 labels, described in Table 3. These largely contain categorical features with some containing large integers, necessitating further processing. Thus, for our benchmarking, we converted several fields to binary variables based on field expertise, as described in Appendix A. We note that this conversion process creates more duplicates due to the removal of critical fields, such as processName and args, which have no widely-accepted encoding scheme for any machine learning model to utilise. There are some inherent duplicates which we keep within the dataset prior to conversion for the accuracy of real logs and to not limit users of the data.

Each record in the process logs and DNS logs was manually labelled suspicious (sus) or evil to assist post-training analysis (footnote 3). Logs marked suspicious indicate unusual activity or outliers in the data distribution, such as an external userId with a systemd process (footnote 4), infrequent daemon process calls (e.g. acpid or accounts-daemon), or calls to close processes that we did not observe as being started. Evil (footnote 5) indicates a malicious external presence not inherent to the system, such as a bash execution call to list the computer’s memory information, remove other users’ SSH access, or un-tar an added file. Events marked evil are considered “out of distribution,” as they are generated from a data distribution not seen during training.
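As a minimal sketch of how the two labels can be inspected before any modelling, the snippet below counts each (sus, evil) combination and its share of the records. It assumes the logs have been exported to CSV with `sus` and `evil` columns holding 0 or 1; the function names and the idea of reading from a CSV path are illustrative assumptions, not part of the published tooling.

```python
import csv
from collections import Counter


def label_breakdown(rows):
    """Return {(sus, evil): (count, percentage)} for an iterable of records.

    Each record is a mapping with "sus" and "evil" entries (ints or numeric
    strings, as csv.DictReader would produce).
    """
    counts = Counter((int(row["sus"]), int(row["evil"])) for row in rows)
    total = sum(counts.values())
    return {combo: (n, 100.0 * n / total) for combo, n in counts.items()}


def breakdown_from_csv(path):
    """Convenience wrapper; `path` is a hypothetical CSV export of the logs."""
    with open(path, newline="") as handle:
        return label_breakdown(csv.DictReader(handle))
```

Running `label_breakdown` separately on the training, validation, and testing records gives a quick check that only the testing subset contains evil events.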
A subset of the kernel process logs was divided into a typical 60/20/20 split for training, validation, and testing, based on the amount of activity recorded and the evil labels from each host. As is typical in unsupervised anomaly detection, our training and validation sets are each composed of logs generated from multiple hosts which only recorded activity from the OS and cloud infrastructure management. Because there are no signs of infiltration in these process logs, activity resulting from these hosts is considered benign, and the events generated by the OS and cloud infrastructure management are considered “in-distribution”.

Our initial testing dataset contains all activity on a single exploited host, including its OS and cloud infrastructure management. The first attack we logged is an attempt to set up a botnet; more details are available in Appendix B. The full dataset contains other malicious activity performed within our honeypots, including cryptomining and lateral movement (between servers). These various attacks may also be compared to answer alternative research questions with our data, as discussed in Subsection 2.3.

Figure 1: UMAP visualisations of the training and testing dataset using the pre-processed features (see Appendix A). Panel (a), fitted on Training and transformed Testing, shows the overlap between the training and testing dataset; panel (b), the same as (a) but coloured with evil labels, highlights the trails of evil events.

Footnotes: 3 These labels are provided from a single reviewer and should not be solely relied upon for industry systems. 4 In the scope of our honeypot any external user traffic is suspicious, but some of these events were initiated by the cloud provider. 5 We note that presence in this dataset does not constitute a “conviction”, as no real damage was done.
As each exploited host only contains a single staged attack, with no artificial noise in the benign activity, BETH is one of the cleanest cyber security datasets available to distinguish between malicious and benign activity.

As an initial investigation of the data, we visualised the (pre-processed) training and testing datasets with uniform manifold approximation and projection (UMAP) [19]. UMAP was first fitted to the training set before being used to project the testing set into the same space. As can be seen in Figure 1, the data from both sets forms several large clusters in the centre, surrounded by many smaller clusters, with both benign and malicious activity spread across the entire space. The first image shows significant overlap between the training and testing sets. The second image shows that evil events appear in distinct areas. This indicates that unsupervised methods could potentially detect a large portion of the “anomalous” events.

====2.3 Research Questions====
The BETH dataset could answer cyber security questions beyond OoD analysis. Unlike logs within real deployed systems, which contain no labels for malicious events, our BETH dataset contains (recently recorded) real data with labels. One use for this dataset would be to profile the behaviour of an attacker or malware [2]. For instance, the known evil events could form a unique fingerprint, a method of uniquely identifying the tactic used by the attacker, to link an attack to its family or appropriate resolution strategy [1]. One could also use graph analysis of process relationships to find malicious cliques [4], or use time series analysis of execution sequences to profile process names (processName) on a modern OS. This latter topic is particularly interesting as some attackers rename malicious processes to benign process names to trick systems into running malicious code. The logs would present a benign process name, even if the arguments or events were inconsistent with normal activity.
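To make the last research direction concrete, here is one possible starting point, offered as a sketch rather than a method from the paper: build a profile of consecutive eventName pairs (bigrams) for each processName from benign logs, then list the pairs in a new sequence that the profile has never seen. The tuple layout and the event names used in the usage note are hypothetical stand-ins for the timestamp, processName, and eventName fields of Table 3.

```python
from collections import Counter, defaultdict


def bigram_profiles(events):
    """Count consecutive eventName pairs for each processName.

    `events` is an iterable of (timestamp, processName, eventName) tuples;
    they are ordered by timestamp before the pairs are formed.
    """
    sequences = defaultdict(list)
    for _timestamp, process_name, event_name in sorted(events):
        sequences[process_name].append(event_name)
    return {
        name: Counter(zip(seq, seq[1:])) for name, seq in sequences.items()
    }


def unseen_bigrams(profile, sequence):
    """Return the pairs in `sequence` that never occur in `profile`."""
    return [pair for pair in zip(sequence, sequence[1:]) if pair not in profile]
```

For example, if a benign profile for a process only ever contains the pairs ("clone", "execve") and ("execve", "exit"), then a log carrying the same process name but the sequence clone, execve, kill would return [("execve", "kill")], which is the kind of inconsistency a renamed malicious process might produce. A fuller analysis would presumably also group by hostName and processId so that interleaved processes do not contaminate each other's sequences.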
===3 Anomaly Detection Baselines===
In this section, we provide anomaly detection benchmarks on our initial subset of logs. We chose both standard anomaly detection baselines [23, 32], which include robust covariance [26], one-class support vector machine (SVM) [28] and iForest [16], as well as density of states estimation (DoSE) [21], which is based on deep generative models. As per [21], we report AUROC, using an ensemble of 5 models for each method.

Robust covariance [26] fits an “ellipsoid with the smallest volume or with the smallest covariance determinant” [24] around the central data points; the tightness is controlled assuming a given level of contamination with anomalies (which we set to 0.05). The anomalies are then scored using the Mahalanobis distance. Similarly, the one-class SVM fits a hyperplane to discriminate between the support of the in-distribution data and OoD data [28]. As kernelised SVMs scale with O(N²) and our data subsets are substantial in size, we instead utilised scikit-learn’s linear SVM with stochastic gradient descent, after whitening the data. In contrast to the other methods, the iForest [16] tries to characterise anomalous points in the data distribution using an ensemble of “isolation trees”.

Given the scale of the dataset, we also considered DL-based OoD detection methods. In particular, DoSE uses summary statistics (such as the log-likelihood or posterior entropy), calculated over the training set by a trained generative model, in order to characterise the “typical set”. In our work we train a variational autoencoder (VAE) [9], consisting of two 2-layer neural networks with 64 hidden units and ReLU activation functions for the encoder and decoder. The first layer of the encoder concatenates learned embeddings of all input features.
The final layer of the decoder outputs a set of logits for categorical distributions for all features. We use a 2D latent representation. Each VAE is trained using the AdamW optimiser [18] with learning rate 0.003 and a weight decay of 0.1; early stopping was used with the validation loss. We picked the hidden size ∈ {64, 128, 256}, learning rate ∈ {0.003, 0.0003, 0.00003}, and weight decay ∈ {0, 0.01, 0.1}, using a grid search on the validation loss. Other than modelling the observations as a product of categorical distributions, our setup is largely the same as the original paper [21]. However, due to the size of the training set, we were only able to use DoSE with a linear one-class SVM trained using SGD (as opposed to a kernel SVM).

{| class="wikitable"
|+ Table 5: OoD AUROC results.
! Method !! AUROC
|-
| Robust Covariance || 0.519
|-
| One-Class SVM || 0.605
|-
| iForest || 0.850
|-
| VAE+DoSE-SVM || 0.698
|}

As seen in Table 5, the iForest performs best at differentiating sus events from the benign in our testing dataset. We attribute this to the small set of discrete features available and the conspicuous nature of the attack. DL-based models are less competitive on these sets of features, but have the potential to deal with more raw categorical and even text-based features, which we hope to explore in future work. Finally, we note that imbalanced labelling, summarised in Table 6, necessitates further investigation of what each model predicts is benign or not.

===4 Conclusions===
In this paper, we present our BETH cybersecurity dataset for anomaly detection and OoD analysis. The data was sourced from our novel honeypot tracking system recording both kernel-level process events and DNS network traffic. It contains real-world attacks in the presence of benign modern OS and cloud provider traffic, without the added complexity of noisy artificial user activity.
This cleanliness is ideal for OoD analysis, such that each host in the dataset only contains one or two data-generating distributions. We also include baselines for anomaly detection trained on a subset of the BETH dataset: robust covariance, one-class SVM, iForest, and DoSE-SVM (with a VAE). For future work, we plan to collect and publish more attacks for alternative testing datasets. This will also allow investigations in comparing attacks or perhaps testing in a continual learning setting.

===Acknowledgements===
We are thankful for the many incredible academics who supported this work. Dr. Arinbjörn Kolbeinsson inspired the included UMAP visualisations. Professor Sergio Maffeis provided incredible advice on the presentation and edits to the paper. Also, thank you to the reviewers for their positive feedback.

This work was also supported by The Alan Turing Institute, during an author’s participation in the Turing Enrichment Program (October 2021 - June 2022), under the EPSRC grant EP/N510129/1.

===References===
[1] D. Brumley, J. Caballero, Z. Liang, J. Newsome, and D. Song. Towards automatic discovery of deviations in binary implementations with applications to error detection and fingerprint generation. In USENIX Security Symposium, page 15, 2007.
[2] Q. Chen, S. R. Islam, H. Haswell, and R. A. Bridges. Automated ransomware behavior analysis: Pattern extraction and early detection. In International Conference on Science of Cyber Security, pages 199–214. Springer, 2019.
[3] G. Cohen, S. Afshar, J. Tapson, and A. Van Schaik. EMNIST: Extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 2921–2926. IEEE, 2017.
[4] A. A. Elhadi, M. A. Maarof, and A. H. Osman. Malware detection based on hybrid signature behaviour application programming interface call graph. American Journal of Applied Sciences, 9(3):283, 2012.
[5] B. Gregg. BPF Performance Tools: Linux System and Application Observability. Addison-Wesley Professional, 1st edition, 2019. ISBN 0136554822.
[6] D. Hendrycks and T. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. Proceedings of the International Conference on Learning Representations, 2019.
[7] D. Hendrycks, X. Liu, E. Wallace, A. Dziedzic, R. Krishnan, and D. Song. Pretrained transformers improve out-of-distribution robustness. arXiv preprint arXiv:2004.06100, 2020.
[8] S. Hettich and S. Bay. The UCI KDD archive [http://kdd.ics.uci.edu]. Irvine, CA: University of California. Department of Information and Computer Science, 152, 1999.
[9] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[10] A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10 (Canadian Institute for Advanced Research). URL http://www.cs.toronto.edu/~kriz/cifar.html.
[11] M. L. Labs. 1998 DARPA intrusion detection evaluation dataset, 1998. URL https://www.ll.mit.edu/r-d/datasets/1998-darpa-intrusion-detection-evaluation-dataset.
[12] K. Lang. Newsweeder: Learning to filter netnews. In Machine Learning Proceedings 1995, pages 331–339. Elsevier, 1995.
[13] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[14] D. D. Lewis. Reuters-21578 text categorization collection data set, 1997.
[15] R. P. Lippmann, D. J. Fried, I. Graf, J. W. Haines, K. R. Kendall, D. McClung, D. Weber, S. E. Webster, D. Wyschogrod, R. K. Cunningham, et al. Evaluating intrusion detection systems: The 1998 DARPA off-line intrusion detection evaluation. In Proceedings DARPA Information Survivability Conference and Exposition. DISCEX’00, volume 2, pages 12–26. IEEE, 2000.
[16] F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pages 413–422. IEEE, 2008.
[17] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
[18] I. Loshchilov and F. Hutter. Fixing weight decay regularization in Adam. 2018.
[19] L. McInnes, J. Healy, and J. Melville. UMAP: Uniform manifold approximation and projection for dimension reduction, 2018. URL http://arxiv.org/abs/1802.03426. Reference implementation available at http://github.com/lmcinnes/umap.
[20] D. Merkel. Docker: lightweight Linux containers for consistent development and deployment. Linux Journal, 2014(239):2, 2014.
[21] W. Morningstar, C. Ham, A. Gallagher, B. Lakshminarayanan, A. Alemi, and J. Dillon. Density of states estimation for out of distribution detection. In International Conference on Artificial Intelligence and Statistics, pages 3232–3240. PMLR, 2021.
[22] Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. V. Dillon, B. Lakshminarayanan, and J. Snoek. Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. arXiv preprint arXiv:1906.02530, 2019.
[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[24] D. Peña and F. J. Prieto. Multivariate outlier detection and robust covariance matrix estimation. Technometrics, 43(3):286–310, 2001.
[25] M. Ring, S. Wunderlich, D. Scheuring, D. Landes, and A. Hotho. A survey of network-based intrusion detection data sets. Computers & Security, 86:147–167, 2019.
[26] P. J. Rousseeuw. Least median of squares regression. Journal of the American Statistical Association, 79(388):871–880, 1984.
[27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
[28] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.
[29] A. Shiravi, H. Shiravi, M. Tavallaee, and A. A. Ghorbani. Toward developing a systematic approach to generate benchmark datasets for intrusion detection. Computers & Security, 31(3):357–374, 2012. ISSN 0167-4048. doi: https://doi.org/10.1016/j.cose.2011.12.012. URL https://www.sciencedirect.com/science/article/pii/S0167404811001672.
[30] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani. A detailed analysis of the KDD CUP 99 data set. In 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, pages 1–6. IEEE, 2009.
[31] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.
[32] S. W. Yahaya, A. Lotfi, and M. Mahmud. Towards a data-driven adaptive anomaly detection system for human activity. Pattern Recognition Letters, 145:200–207, 2021. ISSN 0167-8655. doi: https://doi.org/10.1016/j.patrec.2021.02.006. URL https://www.sciencedirect.com/science/article/pii/S0167865521000611.
===A Pre-Processing===
In this section, we provide more details on the raw features in the dataset, as well as pre-processing suggestions, which we used in our baselines:

timestamp: We left this field out to consider the dataset as a sample from a distribution rather than a time series. We recommend either using the values as they are or leaving them out, depending on the method chosen.

processId: Process IDs 0, 1, and 2 are meaningful since these values are always used by the OS, but otherwise a random number is assigned to the process upon creation. We recommend replacing processId with a binary variable indicating whether or not processId is 0, 1, or 2.

threadId: While this value did not appear useful in our analysis, it might suggest how to link process calls if obfuscated in the system. No conversion is recommended at this time.

parentProcessId: As with processId, the same mapping to a binary variable should suffice.

userId: The default in Linux systems is to assign OS activity to some number below 1000 (typically 0). As users log in, they are assigned IDs starting at 1000, incrementally. This can be altered by a user, but none of the current logs gave evidence that an attacker did this. We used a binary variable to indicate userId < 1000 or userId ≥ 1000. Alternatively, one could use an ordinal mapping that buckets all userId < 1000 at zero and then increments upwards for each new user. Also, no more than four logins were viewed per host in our current datasets.

mountNamespace: This field is somewhat consistent across our hosts and determines the access a certain process has to various mount points. The most common value for this feature is 4026531840 or 0xF0000000, which is for the mnt/ directory where all manually mounted points are linked. It is noted that all logs with userId ≥ 1000 had a mountNamespace of 4026531840, while some OS userId traffic used different mountNamespace values. We converted this feature into a binary mapping for whether or not mountNamespace = 4026531840.

processName: This is a string field of variable length (ranging from one to fifteen characters). When manually analysing the data, this was a critical field in conjunction with the eventName. For our baselines, we refrained from utilising this, although the model should be given an encoding of this using a hash, or the ability to learn a useful encoding on its own. It is noted that attackers can easily change the processName to override a benign one so their traffic looks regular. This was not observed within the current dataset.

hostName: This field is useful for grouping the dataset into related subsets of data generated from the same honeypot. Host names do not transfer between the model development subsets described in this paper.

eventId: Linux systems assign an integer corresponding to the eventName. We include this field as-is for our benchmarks.

eventName: Event names uniquely map to eventId, so we drop eventName from training.

argsNum: This raw feature is included as-is, since, at this time, adequately parsing args requires either more sophisticated pre-processing or a more complex ML model.

returnValue: This is also called the exit status and can be used to determine whether a call completed successfully or not. Mappings for this can vary, as this value is decided between the parent and child process. We mapped returnValue into three values based on the common usage of the field: -1 when negative (error), 0 when zero (success), and 1 when positive (success and signalling something to the parent process).

stackAddresses: This feature is difficult to interpret during manual analysis, and its large values, held in a variable-size list, make automatic processing difficult without encoding or learning an extra embedding. This field was therefore dropped from training our baselines.

args: There are many options in this variable-length list of dictionaries. For simplicity, we refrain from utilising any of these values. However, more features can and should be created from them in future work.

Finally, BETH contains two binary, manually-labelled flags: sus and evil. Examples and the explanation of how these labels were created are detailed in Section 2.2. A breakdown of these labels within the subsets for model development is given in Table 6.

Table 6: Breakdown of sus and evil labels by training, validation, and testing subsets.

{| class="wikitable"
! Dataset !! sus=0, evil=0 !! sus=1, evil=0 !! sus=1, evil=1
|-
| Training || 761875 (99.8%) || 1269 (0.2%) || 0 (0.00%)
|-
| Validation || 188181 (99.6%) || 786 (0.4%) || 0 (0.00%)
|-
| Testing || 17508 (9.27%) || 13027 (6.89%) || 158432 (83.84%)
|}

B Testing Dataset Details

This testing dataset was extracted from a single honeypot. The overall attack appears to be instantiating a botnet node. The timeline of the recorded events is provided in Figure 2 and follows the typical attack pattern: the server is initially accessed; the attacker may run a few setup operations in the environment to send some details to its Command and Control (C2) for a customised attack; it sleeps for a while, intermittently checking in with the C2 or a clock; and it then launches its attack until complete.

Figure 2: The timeline of the attack captured in the testing dataset, displayed as a histogram of the number of events against seconds from the “boot” (start-up) of the machine.

In this case, the honeypot is first accessed at around 411 seconds from booting. Several thousand lines are then recorded in the process logs denoting the setup of the new user profile. This happens within milliseconds; these are detailed logs of everything the OS does during the short pause before the terminal opens for user entry when ssh-ing into a server. This user then sleeps, pausing all user activity for some number of seconds. This appears to happen at random intervals, a more sophisticated technique than using consistent intervals, which would give a clear signature of automated activity. After a few minutes, it sets up an SFTP server to download a file called dota3.tar.gz (known botnet malware) and scopes out the system using common commands such as whoami, ls, and cat /proc/cpuinfo. After about 7.5 minutes, it unpacks dota3.tar.gz and runs over a hundred threads, all attempting to connect with different servers.

Proceedings of the Conference on Applied Machine Learning for Information Security, 2021
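The pre-processing suggestions in Appendix A can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dictionary-per-event representation, and the example record are hypothetical, and the choice of which side of each binary split maps to 1 is an assumption, since the text only states that a binary variable was used. threadId is left out here as a further assumption; timestamp, processName, hostName, eventName, stackAddresses, and args are dropped as described above.

```python
# Hypothetical sketch of the Appendix A pre-processing suggestions.
# Field names follow the dataset's feature names; everything else is illustrative.

OS_PROCESS_IDS = {0, 1, 2}            # process IDs always used by the OS
DEFAULT_MOUNT_NAMESPACE = 4026531840  # 0xF0000000, the most common value
FIRST_LOGIN_USER_ID = 1000            # login users are assigned IDs from 1000


def bucket_return_value(value: int) -> int:
    """Map returnValue to -1 (negative), 0 (zero), or 1 (positive)."""
    return (value > 0) - (value < 0)


def preprocess_event(event: dict) -> dict:
    """Convert one raw event record into the reduced feature set."""
    return {
        # Assumption: 1 means "is an OS process ID", 0 otherwise.
        "processId": int(event["processId"] in OS_PROCESS_IDS),
        "parentProcessId": int(event["parentProcessId"] in OS_PROCESS_IDS),
        # Assumption: 1 means "login user" (userId >= 1000), 0 means OS activity.
        "userId": int(event["userId"] >= FIRST_LOGIN_USER_ID),
        # Assumption: 1 means the namespace equals the most common value.
        "mountNamespace": int(event["mountNamespace"] == DEFAULT_MOUNT_NAMESPACE),
        # Included as-is.
        "eventId": event["eventId"],
        "argsNum": event["argsNum"],
        "returnValue": bucket_return_value(event["returnValue"]),
    }


# Made-up record for illustration only; the values are not taken from BETH.
example_event = {
    "timestamp": 411.2,
    "processId": 7341,
    "threadId": 7341,
    "parentProcessId": 1,
    "userId": 1001,
    "mountNamespace": 4026531840,
    "processName": "example",
    "hostName": "example-host",
    "eventId": 3,
    "eventName": "example_event",
    "argsNum": 2,
    "returnValue": -2,
    "stackAddresses": [],
    "args": [],
}

features = preprocess_event(example_event)
print(features)
```

The ordinal alternative for userId mentioned in Appendix A would replace the binary split with a per-host counter that maps every userId below 1000 to zero and each new login user to the next integer.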