<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Multi-Layer Model and Training Method for Information-Extreme Malware Traffic Detector</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Viacheslav</forename><surname>Moskalenko</surname></persName>
							<email>v.moskalenko@cs.sumdu.edu.ua</email>
							<affiliation key="aff0">
								<orgName type="institution">Sumy State University</orgName>
								<address>
									<addrLine>Rimsky-Korsakov st., 2</addrLine>
									<postCode>40007</postCode>
									<settlement>Sumy</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Alona</forename><surname>Moskalenko</surname></persName>
							<email>a.moskalenko@cs.sumdu.edu.ua</email>
							<affiliation key="aff0">
								<orgName type="institution">Sumy State University</orgName>
								<address>
									<addrLine>Rimsky-Korsakov st., 2</addrLine>
									<postCode>40007</postCode>
									<settlement>Sumy</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Multi-Layer Model and Training Method for Information-Extreme Malware Traffic Detector</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">2AAD205B6A89A98A20AFB23D515C7F30</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T04:23+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>malware detection system</term>
					<term>convolutional sparse coding network</term>
					<term>growing neural gas</term>
					<term>tree ensembles</term>
					<term>random forest regression</term>
					<term>information criterion</term>
					<term>information-extreme machine learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>A model based on a multilayer convolutional sparse coding feature extractor and information-extreme decision rules for malware traffic detection is presented in the paper. Growing sparse coding neural gas algorithms are used for unsupervised pre-training of the feature extractor. To speed up the inference mode, a random forest regression model is proposed as the student in knowledge distillation from the sparse coding layers. An information-extreme learning method is proposed, based on binary encoding with tree ensembles and class separation with a radial basis function in binary Hamming space. The information-extreme classifier is characterized by low computational complexity and high generalization ability on small labeled training sets. Simulation results with the optimized model on open test datasets confirm the suitability of the proposed algorithms for practical application.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Existing malware traffic detection systems still do not provide highly reliable solutions, as the number and variety of new sources of malware traffic constantly grow while relevant labeled data remain scarce <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>. The use of handcrafted features to describe observations reduces the informativeness of the feature description and the effectiveness of learning the decision rules of the malware traffic detection system <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3]</ref>. Therefore, the most promising approach to synthesizing a feature extractor is to apply machine learning ideas and methods for hierarchical (deep) representation of observations on unlabeled data <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5]</ref>. Conventional approaches to deep supervised machine learning require a significant amount of labeled training examples and computational resources <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref>. In addition, models trained with a supervisor based on gradient descent and its modifications are vulnerable to adversarial attacks, noise and data novelty. To increase the informativeness of the feature representation of observations, it is promising to use ideas and methods of sparse coding and unsupervised competitive learning <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9]</ref>. This allows a large volume of unlabeled data to be used with maximum efficiency. Among the ways to increase the generalization ability of decision rules are ensemble algorithms, error-correcting codes and methods of class separation within the geometric approach. 
Also, the high speed of packet flows in modern networks requires highly productive traffic analysis algorithms. To reduce the computational complexity of data analysis models, various methods of model pruning and knowledge distillation are used. However, model hybridization and the integrated use of different methods introduce some uncertainty into the final result, so this approach requires research and verification. In this case, information criteria are considered the best metrics for validation and verification of the result, because they directly characterize the reduction of uncertainty in decision-making and are less sensitive to outliers and imbalances in the data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Formal Problem Statement</head><p>Let the CTU-Mixed and CTU-13 datasets be data collections gathered from a real network environment by CTU researchers from 2011 to 2015 and stored as pcap files <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5]</ref>. The first, CTU-Mixed, can be used for training a feature extractor. The second, CTU-13, contains labeled flows and can be used to train the decision rules for detecting malware network traffic.</p><p>It is necessary to build an informative feature extractor and reliable decision rules using the labeled and unlabeled datasets through optimization of model parameters. In the process of training, it is necessary to maximize the information efficiency criterion of the malware traffic detector</p><formula xml:id="formula_0">E* = (1/M) Σ_{m=1}^{M} max_{ {k} } E_m^{(k)},</formula><p>where E_m^{(k)} is the information efficiency criterion of recognition of class X_m^o at the k-th step of training, and {k} is the ordered set of training steps. When the malware traffic detector functions in its inference mode, it must provide computational efficiency for high-speed traffic.</p></div>
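As a plain-language sketch (our own illustration, not the authors' code), the criterion above takes, for each class, the best value of the efficiency criterion reached across training steps, then averages over the classes:

```python
# Hedged sketch of the training-goal criterion: for each class m, take the
# best value of E_m reached over the ordered training steps {k}, then
# average over the M recognition classes.
def overall_criterion(E):
    """E[m][k]: efficiency criterion of class m at training step k."""
    M = len(E)
    return sum(max(per_class) for per_class in E) / M

# Two classes, three training steps each:
print(overall_criterion([[0.1, 0.59, 0.4], [0.2, 0.3, 0.597]]))  # ≈ 0.5935
```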
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Literature Review</head><p>Convolutional multi-layer neural networks allow forming an informative hierarchical feature representation of input observations <ref type="bibr" target="#b5">[6]</ref>. They have already shown high efficiency in solving problems of machine vision and time series analysis <ref type="bibr" target="#b5">[6]</ref>, <ref type="bibr" target="#b6">[7]</ref>. Meanwhile, supervised training requires a large amount of labeled data, whose labeling may be expensive or unobtainable in a reasonable amount of time. Unsupervised training of convolutional networks aims at efficient use of unlabeled examples, which are usually plentiful. It is typically carried out with an autoencoder or a Restricted Boltzmann machine, which requires a large amount of training data and a long learning time to obtain an acceptable result <ref type="bibr" target="#b7">[8]</ref>. In <ref type="bibr" target="#b8">[9]</ref> an alternative approach based on the k-means cluster-analysis algorithm is proposed to speed up feature set training. However, k-means is characterized by slow convergence and sub-optimal results due to the hard-competitive nature of its learning scheme and its sensitivity to cluster initialization.</p><p>In <ref type="bibr" target="#b9">[10]</ref> a combination of the principles of neural gas and sparse coding is proposed for feature set training on unlabeled data. This approach is characterized by a soft-competitive learning scheme that facilitates robust convergence to near-optimal feature distributions over the training sample. At the same time, embedding sparse coding methods can increase the immunity to interference and the generalization ability of the feature representation. 
Also, it is well known that sparse representations of the input data are a crucial tool for combating adversarial attacks and for producing de-correlated features as a result of the explaining-away effect. However, the size of the feature set is unknown beforehand and is selected by the developer, which increases the optimization time.</p><p>The required size of the feature set in each layer of the hierarchical representation is difficult to predict in advance, so a promising approach to feature set learning is to use the principles of growing neural gas, which automatically determines the required number of neurons (features) <ref type="bibr" target="#b10">[11]</ref>. The mechanism for adding new neurons, as well as removing excessive old ones, makes the algorithm more flexible than the classical neural gas, but it also has serious disadvantages. Small values of the period λ between iterations of new-neuron generation lead to instability of the learning process and distortion of the formed structures, since new neurons are added excessively often. A high value of the period λ provides the expected effect, but at the same time significantly slows down the algorithm. However, in <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12]</ref> it was shown that learning stability can be achieved by setting a "radius of reach" for the neurons, which replaces the parameter λ with a threshold on the maximum distance of a neuron from each point of the training set attributed to it. 
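A toy illustration of the "radius of reach" idea (our own simplification, not the algorithm from [11, 12]): instead of inserting nodes on a fixed period, a new node is spawned only when the best-matching node lies farther than the threshold from the input:

```python
def grow_nodes(data, v):
    """Toy 1-D growth rule: if the nearest node is farther than the reach
    threshold v, spawn a new node at the input; otherwise nudge the winner."""
    nodes = [data[0]]
    for x in data[1:]:
        i = min(range(len(nodes)), key=lambda j: abs(nodes[j] - x))
        if abs(nodes[i] - x) > v:
            nodes.append(x)                   # input not covered: grow
        else:
            nodes[i] += 0.1 * (x - nodes[i])  # soft update toward the input
    return nodes

# Three well-separated clusters yield three nodes, with no insertion period:
print(len(grow_nodes([0.0, 0.05, 1.0, 1.02, 5.0], v=0.5)))  # 3
```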
However, the mechanisms for updating neurons and for assessing the remoteness of points in the input space from the neurons have not yet been revised to adapt the learning process to sparse coding of observations.</p><p>The main disadvantage of sparse coding in representation learning is the use of an iterative procedure during inference, which slows down the recognition process. One of the popular ways to accelerate models is knowledge distillation, where a redundant model acting as a teacher is replaced by a lightweight model acting as a student <ref type="bibr" target="#b12">[13]</ref>. An ensemble of decision trees is a flexible and computationally efficient model, which can potentially be used as a student model to approximate the sparse coder <ref type="bibr" target="#b13">[14]</ref>. However, no such research has been conducted and the effectiveness of this approach is unknown, which underscores the relevance of the issue.</p><p>In addition, the decision rules are an important component of malware detection systems. As a rule, they are represented by a trainable classifier, and the effectiveness of training a classifier is often considered a measure of the effectiveness of the feature extractor <ref type="bibr" target="#b4">[5]</ref>. The most popular algorithm for classification analysis is the support vector machine, where decision rules are trained within a geometric approach by constructing a linearly separable hypersurface in the secondary feature space <ref type="bibr" target="#b14">[15]</ref>. However, this algorithm requires a lot of hyper-parameter adjustment and its performance depends on the complexity of the kernel functions. 
In <ref type="bibr" target="#b15">[16]</ref>, the construction of decision rules by adaptive binary encoding of the input features and information-sense optimization of a radial-basis separating hyper-surface in binary Hamming space was proposed. Such a classifier has high operational efficiency, since it uses only low-complexity operations such as comparison and logical XOR.</p></div>
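The low-complexity claim can be illustrated with a minimal sketch (names and container values are ours, not from [16]): a decision reduces to XOR-based Hamming distances between a binary code and class containers, each given by a center and a radius:

```python
def hamming(a, b):
    """Hamming distance between equal-length bit tuples via XOR."""
    return sum(x ^ y for x, y in zip(a, b))

def classify(code, containers):
    """Return the index of the first class container (center, radius)
    that contains the code, or None if it falls outside all of them."""
    for z, (center, radius) in enumerate(containers):
        if hamming(code, center) <= radius:
            return z
    return None

# Two hypothetical class containers in a 4-bit Hamming space:
containers = [((0, 0, 0, 0), 1), ((1, 1, 1, 1), 1)]
print(classify((0, 1, 0, 0), containers))  # 0
```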
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Model and Training Method for Malware Traffic Detector</head><p>The internal characteristics of a unit of traffic (packet stream or session) are best reflected in the front part of its bytes, which contains connection data and some content data. Converting a pcap file into a training data set involves three main steps: separating the traffic into discrete units at some granularity, cleaning the traffic by removing empty and duplicate units, and forming training images. When dividing traffic into discrete units, the following granularities can be considered: TCP connection, flow, session, service, and host. In this paper, it is proposed to divide the incoming traffic into flows, where a number of packets share the same five-element tuple: source and destination IP address, source and destination port, and protocol number. The length of the flow is limited to 784 bytes, so longer flows are cropped and shorter ones are padded with zero bytes. As a result, we obtain an image of 28x28 pixels, which is fed to the input of the feature extractor. The brightness of each pixel is normalized to the range [0, 1]. The architecture of the feature extractor is based on the convolutional network known as LeNet-5 <ref type="bibr" target="#b4">[5]</ref>; the main modification is an unfixed number of convolutional filters, which is determined during layer-wise training. The pixel activation of each channel of the feature map is computed with the greedy-L0 Orthogonal Matching Pursuit algorithm (OMP) or the L1-regularized least angle regression algorithm (LARS) with a ReLU activation function <ref type="bibr" target="#b16">[17]</ref>. 
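The preprocessing described above can be sketched as follows (a hypothetical helper, not the authors' code): crop or zero-pad the flow's leading bytes to 784, normalize the brightness, and reshape to a 28x28 image:

```python
import numpy as np

FLOW_BYTES = 784  # 28 x 28 pixels, one byte per pixel

def flow_to_image(payload: bytes) -> np.ndarray:
    """Crop the flow's leading bytes to 784 (or zero-pad shorter flows),
    normalize brightness to [0, 1] and reshape to a 28x28 float image."""
    buf = payload[:FLOW_BYTES].ljust(FLOW_BYTES, b"\x00")
    img = np.frombuffer(buf, dtype=np.uint8).astype(np.float32) / 255.0
    return img.reshape(28, 28)

img = flow_to_image(b"\xff" * 100)  # a short flow, zero-padded
print(img.shape, float(img.max()), float(img.min()))  # (28, 28) 1.0 0.0
```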
In order to accelerate the model in the inference mode, the computationally intensive search for sparse coefficients can be replaced by a non-iterative approximating encoder (Figure <ref type="figure">1</ref>). According to the knowledge distillation principle, the training set for the approximating encoder is formed from the input of the layer and pseudo-labels from the output of the layer. In this case, the pseudo-labels are obtained by the OMP or LARS algorithms.</p><p>It is proposed to implement sparse coding with the OMP and LARS algorithms, with a stop criterion based on reaching 30% non-zero entries in the sparse code. A Local Contrast Normalization layer, placed after the sub-sampling layer and before the next layer, amplifies the informative features and weakens the remaining pixels of the feature map. In step 13 of the growing neural gas algorithm (Figure 1), all edges in the graph whose age exceeds a_max are removed; if some nodes are left without incident edges (become isolated), they are also removed. The feature extractor can be fine-tuned with the backpropagation algorithm using a temporary or permanent neural classifier at the model output <ref type="bibr" target="#b16">[17]</ref>. Since under non-stationarity the informativeness of features cannot be known in advance, fine-tuning is not provided in our algorithm. The purpose of the feature extractor is to disentangle explanatory factors. The information-extreme classifier requires a binary representation of the input signal to build error-correcting decision rules. An ensemble of decision trees is a computationally effective method for inducing informative binary features of observations (Figure <ref type="figure">2</ref>). The nodes of the decision trees are numbered, and the nonzero bits of the resulting binary code correspond to the numbers of the nodes through which the decision path lies <ref type="bibr" target="#b15">[16]</ref>.</p><p>Under the inference mode, the information-extreme classifier makes a decision on the membership of an input datapoint in one of the classes. The classifier is trained as follows: 2. For k = 1,…, K do: 3. 
Bootstrap D_k from D using the probability distribution</p><formula xml:id="formula_1">P(X = x_j) = w_j.</formula><p>4. Train decision tree T_k on D_k, using the entropy criterion to measure the quality of a split.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Binary encoding</head><p>Binary encoding of each datapoint x_j from D using the concatenation of the results from trees T_1, …, T_K. The output of this step is a binary matrix</p><formula xml:id="formula_2">{ b_{s,i}^{(z)} | i = 1, N_2; s = 1, n_z; z = 1, Z },</formula><p>where N_2 is the number of induced binary features and n_z is the number of samples corresponding to class X_z^o. Hence the condition n_z = n is met (equal class sizes).</p><p>6. Build information-extreme decision rules in the radial basis of binary Hamming space and compute the optimal information criterion:</p><formula xml:id="formula_3">E_z* = max_{d} E_z(d),<label>(2)</label></formula><p>where</p><formula xml:id="formula_4">{d} = { 0, 1, …, d(b_z ⊕ b_c) − 1 }</formula><p>is the set of concentric radii centered at the support vector b_z of the data distribution in class X_z^o, with b_c the center of the nearest neighboring class; the support vector is computed by the rule</p><formula xml:id="formula_6">b_{z,i} = Θ( (1/n_z) Σ_{s=1}^{n_z} b_{s,i}^{(z)} − (1/(Z·n)) Σ_{c=1}^{Z} Σ_{s=1}^{n_c} b_{s,i}^{(c)} ),<label>(3)</label></formula><p>where Θ is the Heaviside step function and E_z is the training efficiency criterion of the decision rule for class X_z^o, computed as a normalized modification of S. Kullback's information measure <ref type="bibr" target="#b15">[16]</ref>:</p><formula xml:id="formula_7">E_z = (1 − (α_z + β_z)) · log_2[ (2 − (α_z + β_z) + ς) / ((α_z + β_z) + ς) ] / log_2[ (2 − ς) / ς ],</formula><formula xml:id="formula_8"><label>(4)</label></formula><p>where α_z and β_z are the false-positive and false-negative rates of classifying input vectors as belonging to class X_z^o, and ς is a small non-negative number introduced to avoid division by zero. Thus, the resulting model consists of several layers of tree ensembles with decision rules at the output that are optimal in the information sense.</p></div>
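A hedged reading of rules (2) and (4) in code (our own sketch: the per-radius false-positive and false-negative rates are assumed to be given, and the exact normalization of the Kullback measure is our interpretation of the garbled source):

```python
import math

def kullback_E(alpha, beta, s=0.001):
    """Our reading of criterion (4): a normalized Kullback-style measure of
    the alpha (false-positive) and beta (false-negative) rates of a class."""
    t = alpha + beta
    return (1 - t) * math.log2((2 - t + s) / (t + s)) / math.log2((2 - s) / s)

def best_radius(rates):
    """rates: {radius d: (alpha, beta)}; rule (2) picks the radius that
    maximizes the criterion over the set of concentric radii."""
    return max(rates, key=lambda d: kullback_E(*rates[d]))

# Hypothetical error rates at three candidate radii:
rates = {10: (0.4, 0.3), 26: (0.05, 0.05), 40: (0.0, 0.5)}
print(best_radius(rates))  # 26
```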
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Result and discussion</head><p>The training sample formed from CTU-Mixed for training the feature extractor contains 10,000 instances. To train the information-extreme classifier, 1000 instances per class were formed for both the training and test datasets. In the growing sparse coding neural gas algorithm, the following parameters were chosen:</p><formula xml:id="formula_9">ε_b = 0.5, ε_n = 0.05, a_max = 100, λ_0 = 1 and λ_final = 0.01.</formula><p>The neuron-fixation threshold ν and the maximum number of trees K of the classifier are adjusted by sweeping over their values. Table <ref type="table" target="#tab_0">1</ref> shows the dependence of the number of neurons in the first (M_1) and second (M_2) layers of the feature extractor, the training efficiency criterion averaged over the classes E, and the validation accuracy, on the parameter ν. In the tree ensembles, the maximum depth is set to 5 and the maximum number of features is set to N_1. The analysis of Table <ref type="table" target="#tab_0">1</ref> shows that increasing the threshold ν increases the number of neurons produced by unsupervised training of the feature extractor. At the same time, increasing the threshold from 0.8 to 0.9 has practically no effect on the accuracy of the decision rules. This means that the value ν* = 0.8 is optimal and yields a more compact (compressed) feature representation, while ν = 0.9 yields a sparse representation based on an overcomplete basis. Knowledge distillation is implemented with Random Forest regression as the student model, with the number of decision trees limited to 150. The obtained model has equivalent accuracy. 
In this case, the inference time is reduced by a factor of 65.</p><p>Figure <ref type="figure">3</ref> shows a graph of the changes of the maxima of the information criterion (4), averaged over the set of classes, in dependence on the number of decision trees in the information-extreme classifier with ν* = 0.8. In this case, the maximum number of trees is limited to K = 100. Thus, the proposed training algorithm allows the optimal number of neurons at each layer to be determined automatically. At the same time, approximation of the sparse encoder by the non-iterative Random Forest regression model accelerates the inference mode.</p><p>The results of simulation on data from the CTU-Mixed and CTU-13 datasets show that the obtained result is superior to the results from <ref type="bibr" target="#b3">[4]</ref> and <ref type="bibr" target="#b4">[5]</ref> and is acceptable for practical applications. The scientific novelty of the obtained results is as follows: ─ the algorithm of growing sparse coding neural gas is proposed for the first time, which allows unsupervised learning of the optimal set of neurons for each layer of the convolutional sparse coding feature extractor; ─ for the first time, the principle of knowledge distillation is applied to reduce the computational cost of sparse coding algorithms through approximation by a random forest model, which in the inference mode is non-iterative and computationally efficient; ─ for the first time, an information-extreme supervised learning algorithm is proposed for constructing the decision rules of a malware network traffic detector.</p><p>The practical value of the obtained results for malware traffic detection systems is the development of a new learning method that effectively uses both labeled and unlabeled training sets. The results of simulation using the CTU-Mixed and CTU-13 datasets confirm the effectiveness of the obtained decision rules in identifying malware in test traffic samples. 
In this case, the accuracy of the decision rules of the malware traffic detector is 96.1%.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1.</head><label>1</label><figDesc>Fig. 1. Knowledge distillation diagram for each layer of the feature extractor. The dataset for training a feature extraction layer is formed by decomposing images or activation maps into patches. These patches are reshaped into 1D vectors, which are fed to the input of the growing sparse coding neural gas algorithm, whose main steps are given below [16]. 1. Initialization of the counter of training vectors: t := 0. 2. Two initial nodes (neurons) w_a and w_b are assigned by random selection from the training set; they are connected by an edge whose age is zero and are considered non-fixed. 3. The next vector x is selected from the dataset and normalized to unit length (L2-normalization). 4. Each basis vector w_k, k = 1, …, M is normalized to unit length (L2-normalization). 5. Calculation of the similarity of the input vector x to the basis vectors w_k ∈ W.</figDesc><graphic coords="5,135.07,147.36,146.55,125.21" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>step 15; otherwise the step counter is incremented, t := t + 1, and the algorithm proceeds to step 3. 15. If all neurons are fixed, the execution of the algorithm stops; otherwise it proceeds to step 3 and a new epoch of learning begins (a repetition of the training set).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>of input datapoint x with appropriate binary representation b to one class from</head><label></label><figDesc></figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>Fig. 2. Classifier Architecture</figDesc></figure>
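The path-code idea behind the classifier architecture can be sketched with a toy hand-built tree (structure and values are ours, not from the paper): the binary code has a 1 at the id of every numbered node the sample's decision path visits:

```python
# Toy hand-built tree: internal node = (node_id, feature, threshold, left,
# right); leaf = (node_id, None). Node ids double as bit positions.
TREE = (0, 0, 0.5,
        (1, 1, 0.5, (3, None), (4, None)),
        (2, None))

def path_code(x, tree, n_nodes=5):
    """Set a 1 at the id of every node the sample's decision path visits."""
    code = [0] * n_nodes
    node = tree
    while True:
        code[node[0]] = 1
        if node[1] is None:                     # reached a leaf
            return code
        _, feat, thr, left, right = node
        node = left if x[feat] <= thr else right

print(path_code([0.2, 0.9], TREE))  # [1, 1, 0, 0, 1]
```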
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>7.</head><label>7</label><figDesc>Test the obtained information-extreme rules on dataset D and compute the error rate for each sample from D. Under the inference mode, the decision on the membership of the binary representation b of an input datapoint x in class X_z^o is made using the optimal container, which has support vector b_z* and radius d_z*; &lt; K/2 abort loop, where  = 0.001.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Fig. 3.</head><label>3</label><figDesc>Fig. 3. A graph of the change of the average information criterion (4) in dependence on the number of decision trees in the information-extreme classifier. The analysis of Figure 3 shows that the optimal value of the hyper-parameter K* is equal to 185. Further increase of the parameter K does not increase the accuracy of the decision rules. At the optimal parameters of the extractor and the classifier, the accuracy of detection of malware traffic is 96.1%, which indicates the informative nature of the feature description of observations. Figure 4 shows the dependence of the information criterion (4) on the code radius of the container of each class. The analysis of Figure 4 shows that the maximum values of the information criterion of learning for the first and second classes are E_1* = 0.590 and E_2* = 0.597, respectively, and the optimal radii of the corresponding containers of the recognition classes are d_1* = 26 and d_2* = 32 (in code units). In this case, the inter-center Hamming distance is 65, indicating compactness of the feature vector distributions and the clarity of the partition in the binary Hamming space. Thus, the proposed training algorithm allows the optimal number of neurons at each layer to be determined automatically. At the same time, approximation of the sparse encoder by the non-iterative Random Forest regression model accelerates the inference mode. The results of simulation on data from the CTU-Mixed and CTU-13 datasets show that the obtained result is superior to the results from <ref type="bibr" target="#b3">[4]</ref> and <ref type="bibr" target="#b4">[5]</ref> and is acceptable for practical applications.</figDesc><graphic coords="10,203.40,180.72,188.28,188.28" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Fig. 4.</head><label>4</label><figDesc>Fig. 4. Charts of the dependency of the information criterion (4) on the radii of the class containers: a) class of normal traffic; b) class of malware traffic</figDesc><graphic coords="11,149.64,147.48,146.28,126.96" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1.</head><label>1</label><figDesc>Dependence of information criteria and number of neurons on model parameters</figDesc><table><row><cell>ν</cell><cell>M_1</cell><cell>M_2</cell><cell>E</cell><cell>Validation accuracy, %</cell></row><row><cell>0.10</cell><cell>15</cell><cell>11</cell><cell>0.106</cell><cell>74</cell></row><row><cell>0.15</cell><cell>17</cell><cell>13</cell><cell>0.138</cell><cell>77</cell></row><row><cell>0.20</cell><cell>23</cell><cell>13</cell><cell>0.138</cell><cell>77</cell></row><row><cell>0.25</cell><cell>25</cell><cell>13</cell><cell>0.138</cell><cell>77</cell></row><row><cell>0.30</cell><cell>27</cell><cell>15</cell><cell>0.149</cell><cell>78</cell></row><row><cell>0.35</cell><cell>27</cell><cell>15</cell><cell>0.220</cell><cell>83</cell></row><row><cell>0.40</cell><cell>33</cell><cell>17</cell><cell>0.255</cell><cell>85</cell></row><row><cell>0.45</cell><cell>34</cell><cell>22</cell><cell>0.255</cell><cell>85</cell></row><row><cell>0.50</cell><cell>40</cell><cell>25</cell><cell>0.366</cell><cell>90</cell></row><row><cell>0.55</cell><cell>49</cell><cell>31</cell><cell>0.459</cell><cell>93.0</cell></row><row><cell>0.60</cell><cell>66</cell><cell>43</cell><cell>0.466</cell><cell>93.2</cell></row><row><cell>0.65</cell><cell>70</cell><cell>45</cell><cell>0.501</cell><cell>94.1</cell></row><row><cell>0.70</cell><cell>99</cell><cell>45</cell><cell>0.550</cell><cell>95.2</cell></row><row><cell>0.75</cell><cell>145</cell><cell>57</cell><cell>0.554</cell><cell>95.3</cell></row><row><cell>0.80</cell><cell>161</cell><cell>120</cell><cell>0.591</cell><cell>96.1</cell></row><row><cell>0.85</cell><cell>220</cell><cell>147</cell><cell>0.603</cell><cell>95.4</cell></row><row><cell>0.90</cell><cell>322</cell><cell>238</cell><cell>0.611</cell><cell>95.0</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgment</head><p>The work was performed in the laboratory of intellectual systems of the computer science department at Sumy State University with the financial support of the Ministry of Education and Science of Ukraine in the framework of state budget scientific and research work of DR No. 0117U003934.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Flow Based Algorithm for Malware Traffic Detection</title>
		<author>
			<persName><forename type="first">M</forename><surname>Skrzewski</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="271" to="280" />
		</imprint>
	</monogr>
	<note type="report_type">Computer Networks</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Malware traffic detection using tamper resistant features</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Berkay Celik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Walls</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mcdaniel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Swami</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">MILCOM</title>
		<imprint>
			<biblScope unit="volume">2015</biblScope>
			<date type="published" when="2015">2015</date>
			<publisher>IEEE Military Communications Conference</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Analysis of network traffic features for anomaly detection</title>
		<author>
			<persName><forename type="first">F</forename><surname>Iglesias</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zseby</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">101</biblScope>
			<biblScope unit="page" from="59" to="84" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Autoencoder-based feature learning for cyber security applications</title>
		<author>
			<persName><forename type="first">M</forename><surname>Yousefi-Azar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Varadharajan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hamey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Tupakula</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Joint Conference on Neural Networks (IJCNN)</title>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Malware traffic classification using convolutional neural network for representation learning</title>
		<author>
			<persName><forename type="first">Wei</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ming</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xuewen</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiaozhou</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yiqiang</forename><surname>Sheng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Information Networking (ICOIN)</title>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Going deeper with convolutions</title>
		<author>
			<persName><forename type="first">C</forename><surname>Szegedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wei</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yangqing</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sermanet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Reed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Anguelov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Erhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vanhoucke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rabinovich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Convolutional neural networks for time series classification</title>
		<author>
			<persName><forename type="first">B</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Wu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Systems Engineering and Electronics</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page" from="162" to="169" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Compressed auto-encoder building block for deep learning network</title>
		<author>
			<persName><forename type="first">Qiying</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Long</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Informative and Cybernetics for Computational Social Systems (ICCSS)</title>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Weed identification based on K-means feature learning combined with convolutional neural network</title>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Xin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computers and Electronics in Agriculture</title>
		<imprint>
			<biblScope unit="volume">135</biblScope>
			<biblScope unit="page" from="63" to="70" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Sparse Coding Neural Gas: Learning of overcomplete data representations</title>
		<author>
			<persName><forename type="first">K</forename><surname>Labusch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Barth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Martinetz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neurocomputing</title>
		<imprint>
			<biblScope unit="volume">72</biblScope>
			<biblScope unit="page" from="1547" to="1555" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Image Classification with Growing Neural Networks</title>
		<author>
			<persName><forename type="first">I</forename><surname>Mrazova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kukacka</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer Theory and Engineering</title>
		<imprint>
			<biblScope unit="page" from="422" to="427" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">The Growing Hierarchical Neural Gas Self-Organizing Neural Network</title>
		<author>
			<persName><forename type="first">E</forename><surname>Palomo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Lopez-Rubio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Neural Networks and Learning Systems</title>
		<imprint>
			<biblScope unit="page" from="1" to="10" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Layer-Level Knowledge Distillation for Deep Neural Network Learning</title>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chiang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied Sciences</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page">1966</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hooker</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/1808.07573" />
		<title level="m">Approximation Trees: Statistical Stability in Model Distillation</title>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Deep learning of support vector machines with class probability output networks</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural Networks</title>
		<imprint>
			<biblScope unit="volume">64</biblScope>
			<biblScope unit="page" from="19" to="28" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">The Model and Training Algorithm of Compact Drone Autonomous Visual Navigation System</title>
		<author>
			<persName><forename type="first">V</forename><surname>Moskalenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Moskalenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Korobov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Semashko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Data</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page">4</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Deep Sparse-coded Network (DSN)</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Gwon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">T</forename><surname>Kung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2016 23rd International Conference on Pattern Recognition (ICPR)</title>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
