<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Multi-Layer Model and Training Method for Information-Extreme Malware Traffic Detector</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Viacheslav</forename><surname>Moskalenko</surname></persName>
							<email>v.moskalenko@cs.sumdu.edu.ua</email>
							<affiliation key="aff0">
								<orgName type="institution">Sumy State University</orgName>
								<address>
									<addrLine>Rimsky-Korsakov st., 2</addrLine>
									<postCode>40007</postCode>
									<settlement>Sumy</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Alona</forename><surname>Moskalenko</surname></persName>
							<email>a.moskalenko@cs.sumdu.edu.ua</email>
							<affiliation key="aff0">
								<orgName type="institution">Sumy State University</orgName>
								<address>
									<addrLine>Rimsky-Korsakov st., 2</addrLine>
									<postCode>40007</postCode>
									<settlement>Sumy</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Multi-Layer Model and Training Method for Information-Extreme Malware Traffic Detector</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">2AAD205B6A89A98A20AFB23D515C7F30</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T04:23+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>malware detection system</term>
					<term>convolutional sparse coding network</term>
					<term>growing neural gas</term>
					<term>tree ensembles</term>
					<term>random forest regression</term>
					<term>information criterion</term>
					<term>information-extreme machine learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>A model based on a multilayer convolutional sparse coding feature extractor and information-extreme decision rules for malware traffic detection is presented in the paper. Growing sparse coding neural gas algorithms are used for unsupervised pre-training of the feature extractor. To speed up the inference mode, a random forest regression model is proposed as the student in knowledge distillation from the sparse coding layers. An information-extreme learning method is proposed, based on binary encoding with tree ensembles and class separation with a radial basis function in binary Hamming space. The information-extreme classifier is characterized by low computational complexity and high generalization ability on small labeled training sets. Simulation results with the optimized model on open test datasets confirm the suitability of the proposed algorithms for practical application.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Existing malware traffic detection systems still do not provide highly reliable solutions, as the number and variety of new sources of malware traffic constantly grow while relevant labeled data remain scarce <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>. The use of handcrafted features to describe observations reduces the informativeness of the feature description and the effectiveness of learning the decision rules of the malware traffic detection system <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3]</ref>. Therefore, the most promising approach to synthesizing a feature extractor is to apply machine learning ideas and methods for hierarchical (deep) representation of observations on unlabeled data <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5]</ref>. Conventional approaches to deep supervised machine learning require a significant amount of labeled training examples and computational resources <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref>. In addition, models trained with a supervisor based on gradient descent and its modifications are vulnerable to adversarial attacks, noise and data novelty. To increase the informativeness of the feature representation of observations, it is promising to use ideas and methods of sparse coding and unsupervised competitive learning <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9]</ref>. This allows a large volume of unlabeled data to be used with maximum efficiency. Among the ways to increase the generalization ability of decision rules are ensemble algorithms, error-correcting codes and methods of class separation within the geometric approach. 
Also, the high speed of packet flows in modern networks requires highly productive traffic analysis algorithms. To reduce the computational complexity of data analysis models, various methods of model pruning and knowledge distillation are used. However, model hybridization and the integrated use of different methods introduce some uncertainty into the final result, so this approach requires research and verification. In this case, information criteria are considered the best metrics for validation and verification of the result, because they directly characterize the reduction of uncertainty in decision-making and are less sensitive to outliers and imbalances in the data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Formal Problem Statement</head><p>Let the CTU-Mixed and CTU-13 datasets be data collections gathered from a real network environment by CTU researchers from 2011 to 2015 and stored as pcap files <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5]</ref>. The first, CTU-Mixed, can be used for training a feature extractor. The second, CTU-13, contains labeled flows and can be used to train the decision rules for detecting malware network traffic.</p><p>It is necessary to build an informative feature extractor and reliable decision rules using the labeled and unlabeled datasets through optimization of model parameters. In the process of training, it is necessary to maximize the information efficiency criterion of the malware traffic detector</p><formula xml:id="formula_0">E* = (1/M) Σ_{m=1}^{M} max_{ {k} } E_m^{(k)},</formula><p>where E_m^{(k)} is the information efficiency criterion of recognition of class X_m^o at the k-th step of training, and {k} is the ordered set of training steps. When the malware traffic detector functions in its inference mode, it must provide computational efficiency for high-speed traffic.</p></div>
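As a plain-language sketch (our own illustration, not the authors' code), the criterion above takes, for each class, the best value of the efficiency criterion reached across training steps, then averages over the classes:

```python
# Hedged sketch of the training-goal criterion: for each class m, take the
# best value of E_m reached over the ordered training steps {k}, then
# average over the M recognition classes.
def overall_criterion(E):
    """E[m][k]: efficiency criterion of class m at training step k."""
    M = len(E)
    return sum(max(per_class) for per_class in E) / M

# Two classes, three training steps each:
print(overall_criterion([[0.1, 0.59, 0.4], [0.2, 0.3, 0.597]]))  # ≈ 0.5935
```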
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Literature Review</head><p>Convolutional multi-layer neural networks allow forming an informative hierarchical feature representation of input observations <ref type="bibr" target="#b5">[6]</ref>. They have already shown high efficiency in solving problems of machine vision and time series analysis <ref type="bibr" target="#b5">[6]</ref>, <ref type="bibr" target="#b6">[7]</ref>. Meanwhile, supervised training requires a large amount of labeled data, whose labeling may be expensive or unobtainable in a reasonable amount of time. Unsupervised training of convolutional networks aims at efficient use of unlabeled examples, which are usually plentiful. It is typically carried out with an autoencoder or a Restricted Boltzmann machine, which requires a large amount of training data and a long learning time to obtain an acceptable result <ref type="bibr" target="#b7">[8]</ref>. In <ref type="bibr" target="#b8">[9]</ref> an alternative approach based on the k-means cluster-analysis algorithm is proposed to speed up feature set training. However, k-means is characterized by slow convergence and sub-optimal results due to the hard-competitive nature of its learning scheme and its sensitivity to cluster initialization.</p><p>In <ref type="bibr" target="#b9">[10]</ref> a combination of the principles of neural gas and sparse coding is proposed for feature set training on unlabeled data. This approach is characterized by a soft-competitive learning scheme that facilitates robust convergence to near-optimal feature distributions over the training sample. At the same time, embedding sparse coding methods can increase the immunity to interference and the generalization ability of the feature representation. 
Also, it is well known that sparse representations of the input data are a crucial tool for combating adversarial attacks and for producing de-correlated features as a result of the explaining-away effect. However, the size of the feature set is unknown beforehand and is selected by the developer, which increases the optimization time.</p><p>The required size of the feature set in each layer of the hierarchical representation is difficult to predict in advance, so a promising approach to feature set learning is to use the principles of growing neural gas, which automatically determines the required number of neurons (features) <ref type="bibr" target="#b10">[11]</ref>. The mechanism for adding new neurons, as well as removing excessive old ones, makes the algorithm more flexible than the classical neural gas, but it also has serious disadvantages. Small values of the period λ between iterations of new-neuron generation lead to instability of the learning process and distortion of the formed structures, since new neurons are added excessively often. A high value of the period λ provides the expected effect, but at the same time significantly slows down the algorithm. However, in <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12]</ref> it was shown that learning stability can be achieved by setting a "radius of reach" for the neurons, which replaces the parameter λ with a threshold on the maximum distance of a neuron from each point of the training set attributed to it. 
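A toy illustration of the "radius of reach" idea (our own simplification, not the algorithm from [11, 12]): instead of inserting nodes on a fixed period, a new node is spawned only when the best-matching node lies farther than the threshold from the input:

```python
def grow_nodes(data, v):
    """Toy 1-D growth rule: if the nearest node is farther than the reach
    threshold v, spawn a new node at the input; otherwise nudge the winner."""
    nodes = [data[0]]
    for x in data[1:]:
        i = min(range(len(nodes)), key=lambda j: abs(nodes[j] - x))
        if abs(nodes[i] - x) > v:
            nodes.append(x)                   # input not covered: grow
        else:
            nodes[i] += 0.1 * (x - nodes[i])  # soft update toward the input
    return nodes

# Three well-separated clusters yield three nodes, with no insertion period:
print(len(grow_nodes([0.0, 0.05, 1.0, 1.02, 5.0], v=0.5)))  # 3
```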
However, the mechanisms for updating neurons and for assessing the remoteness of points in the input space from the neurons have not yet been revised to adapt the learning process to sparse coding of observations.</p><p>The main disadvantage of sparse coding in representation learning is the use of an iterative procedure during inference, which slows down the recognition process. One of the popular ways to accelerate models is knowledge distillation, where a redundant model acting as a teacher is replaced by a lightweight model acting as a student <ref type="bibr" target="#b12">[13]</ref>. An ensemble of decision trees is a flexible and computationally efficient model, which can potentially be used as a student model to approximate the sparse coder <ref type="bibr" target="#b13">[14]</ref>. However, no such research has been conducted and the effectiveness of this approach is unknown, which underscores the relevance of the issue.</p><p>In addition, the decision rules are an important component of malware detection systems. As a rule, they are represented by a trainable classifier, and the effectiveness of training a classifier is often considered a measure of the effectiveness of the feature extractor <ref type="bibr" target="#b4">[5]</ref>. The most popular algorithm for classification analysis is the support vector machine, where decision rules are trained within a geometric approach by constructing a linearly separable hypersurface in the secondary feature space <ref type="bibr" target="#b14">[15]</ref>. However, this algorithm requires a lot of hyper-parameter adjustment and its performance depends on the complexity of the kernel functions. 
In <ref type="bibr" target="#b15">[16]</ref>, the construction of decision rules by adaptive binary encoding of the input features and information-sense optimization of a radial-basis separating hyper-surface in binary Hamming space was proposed. Such a classifier has high operational efficiency, since it uses only low-complexity operations such as comparison and logical XOR.</p></div>
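The low-complexity claim can be illustrated with a minimal sketch (names and container values are ours, not from [16]): a decision reduces to XOR-based Hamming distances between a binary code and class containers, each given by a center and a radius:

```python
def hamming(a, b):
    """Hamming distance between equal-length bit tuples via XOR."""
    return sum(x ^ y for x, y in zip(a, b))

def classify(code, containers):
    """Return the index of the first class container (center, radius)
    that contains the code, or None if it falls outside all of them."""
    for z, (center, radius) in enumerate(containers):
        if hamming(code, center) <= radius:
            return z
    return None

# Two hypothetical class containers in a 4-bit Hamming space:
containers = [((0, 0, 0, 0), 1), ((1, 1, 1, 1), 1)]
print(classify((0, 1, 0, 0), containers))  # 0
```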
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Model and Training Method for Malware Traffic Detector</head><p>The internal characteristics of a unit of traffic (packet stream or session) are best reflected in the front part of its bytes, which contains connection data and some content data. Converting a pcap file into a training data set involves three main steps: separating the traffic into discrete units at some granularity, cleaning the traffic by removing empty and duplicate units, and forming training images. When dividing traffic into discrete units, the following granularities can be considered: TCP connection, flow, session, service, and host. In this paper, it is proposed to divide the incoming traffic into flows, where a number of packets share the same five-element tuple: source and destination IP address, source and destination port, and protocol number. The length of the flow is limited to 784 bytes, so longer flows are cropped and shorter ones are padded with zero bytes. As a result, we obtain an image of 28x28 pixels, which is fed to the input of the feature extractor. The brightness of each pixel is normalized to the range [0, 1]. The architecture of the feature extractor is based on the convolutional network known as LeNet-5 <ref type="bibr" target="#b4">[5]</ref>; the main modification is an unfixed number of convolutional filters, which is determined during layer-wise training. The pixel activation of each channel of the feature map is computed with the greedy-L0 Orthogonal Matching Pursuit algorithm (OMP) or the L1-regularized least angle regression algorithm (LARS) with a ReLU activation function <ref type="bibr" target="#b16">[17]</ref>. 
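The preprocessing described above can be sketched as follows (a hypothetical helper, not the authors' code): crop or zero-pad the flow's leading bytes to 784, normalize the brightness, and reshape to a 28x28 image:

```python
import numpy as np

FLOW_BYTES = 784  # 28 x 28 pixels, one byte per pixel

def flow_to_image(payload: bytes) -> np.ndarray:
    """Crop the flow's leading bytes to 784 (or zero-pad shorter flows),
    normalize brightness to [0, 1] and reshape to a 28x28 float image."""
    buf = payload[:FLOW_BYTES].ljust(FLOW_BYTES, b"\x00")
    img = np.frombuffer(buf, dtype=np.uint8).astype(np.float32) / 255.0
    return img.reshape(28, 28)

img = flow_to_image(b"\xff" * 100)  # a short flow, zero-padded
print(img.shape, float(img.max()), float(img.min()))  # (28, 28) 1.0 0.0
```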
In order to accelerate the model in the inference mode, the computationally intensive search for sparse coefficients can be replaced by a non-iterative approximating encoder (Figure <ref type="figure">1</ref>). According to the knowledge distillation principle, the training set for the approximating encoder is formed from the input of the layer and pseudo-labels from the output of the layer. In this case, the pseudo-labels are obtained by the OMP or LARS algorithms.</p><p>It is proposed to implement sparse coding with the OMP and LARS algorithms, with a stop criterion based on reaching 30% non-zero entries in the sparse code. A Local Contrast Normalization layer, placed after the sub-sampling layer and before the next layer, amplifies the informative features and weakens the remaining pixels of the feature map. In step 13 of the growing neural gas algorithm (Figure 1), all edges in the graph whose age exceeds a_max are removed; if some nodes are left without incident edges (become isolated), they are also removed. The feature extractor can be fine-tuned with the backpropagation algorithm using a temporary or permanent neural classifier at the model output <ref type="bibr" target="#b16">[17]</ref>. Since under non-stationarity the informativeness of features cannot be known in advance, fine-tuning is not provided in our algorithm. The purpose of the feature extractor is to disentangle explanatory factors. The information-extreme classifier requires a binary representation of the input signal to build error-correcting decision rules. An ensemble of decision trees is a computationally effective method for inducing informative binary features of observations (Figure <ref type="figure">2</ref>). The nodes of the decision trees are numbered, and the nonzero bits of the resulting binary code correspond to the numbers of the nodes through which the decision path lies <ref type="bibr" target="#b15">[16]</ref>.</p><p>Under the inference mode, the information-extreme classifier makes a decision on the membership of an input datapoint in one of the classes. The classifier is trained as follows: 2. For k = 1,…, K do: 3. 
Bootstrap D_k from D using the probability distribution</p><formula xml:id="formula_1">P(X = x_j) = w_j.</formula><p>4. Train decision tree T_k on D_k, using the entropy criterion to measure the quality of a split.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Binary encoding</head><p>Binary encoding of each datapoint x_j from D using the concatenation of the results from trees T_1, …, T_K. The output of this step is a binary matrix</p><formula xml:id="formula_2">{ b_{s,i}^{(z)} | i = 1, N_2; s = 1, n_z; z = 1, Z },</formula><p>where N_2 is the number of induced binary features and n_z is the number of samples corresponding to class X_z^o. Hence the condition n_z = n is met (equal class sizes).</p><p>6. Build information-extreme decision rules in the radial basis of binary Hamming space and compute the optimal information criterion:</p><formula xml:id="formula_3">E_z* = max_{d} E_z(d),<label>(2)</label></formula><p>where</p><formula xml:id="formula_4">{d} = { 0, 1, …, d(b_z ⊕ b_c) − 1 }</formula><p>is the set of concentric radii centered at the support vector b_z of the data distribution in class X_z^o, with b_c the center of the nearest neighboring class; the support vector is computed by the rule</p><formula xml:id="formula_6">b_{z,i} = Θ( (1/n_z) Σ_{s=1}^{n_z} b_{s,i}^{(z)} − (1/(Z·n)) Σ_{c=1}^{Z} Σ_{s=1}^{n_c} b_{s,i}^{(c)} ),<label>(3)</label></formula><p>where Θ is the Heaviside step function and E_z is the training efficiency criterion of the decision rule for class X_z^o, computed as a normalized modification of S. Kullback's information measure <ref type="bibr" target="#b15">[16]</ref>:</p><formula xml:id="formula_7">E_z = (1 − (α_z + β_z)) · log_2[ (2 − (α_z + β_z) + ς) / ((α_z + β_z) + ς) ] / log_2[ (2 − ς) / ς ],</formula><formula xml:id="formula_8"><label>(4)</label></formula><p>where α_z and β_z are the false-positive and false-negative rates of classifying input vectors as belonging to class X_z^o, and ς is a small non-negative number introduced to avoid division by zero. Thus, the resulting model consists of several layers of tree ensembles with decision rules at the output that are optimal in the information sense.</p></div>
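A hedged reading of rules (2) and (4) in code (our own sketch: the per-radius false-positive and false-negative rates are assumed to be given, and the exact normalization of the Kullback measure is our interpretation of the garbled source):

```python
import math

def kullback_E(alpha, beta, s=0.001):
    """Our reading of criterion (4): a normalized Kullback-style measure of
    the alpha (false-positive) and beta (false-negative) rates of a class."""
    t = alpha + beta
    return (1 - t) * math.log2((2 - t + s) / (t + s)) / math.log2((2 - s) / s)

def best_radius(rates):
    """rates: {radius d: (alpha, beta)}; rule (2) picks the radius that
    maximizes the criterion over the set of concentric radii."""
    return max(rates, key=lambda d: kullback_E(*rates[d]))

# Hypothetical error rates at three candidate radii:
rates = {10: (0.4, 0.3), 26: (0.05, 0.05), 40: (0.0, 0.5)}
print(best_radius(rates))  # 26
```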
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Result and discussion</head><p>The training sample formed from CTU-Mixed for training the feature extractor contains 10,000 instances. To train the information-extreme classifier, 1000 instances per class were formed for both the training and test datasets. In the growing sparse coding neural gas algorithm, the following parameters were chosen:</p><formula xml:id="formula_9">ε_b = 0.5, ε_n = 0.05, a_max = 100, λ_0 = 1 and λ_final = 0.01.</formula><p>The neuron-fixation threshold ν and the maximum number of trees K of the classifier are adjusted by sweeping over their values. Table <ref type="table" target="#tab_0">1</ref> shows the dependence of the number of neurons in the first (M_1) and second (M_2) layers of the feature extractor, the training efficiency criterion averaged over the classes E, and the validation accuracy, on the parameter ν. In the tree ensembles, the maximum depth is set to 5 and the maximum number of features is set to N_1. The analysis of Table <ref type="table" target="#tab_0">1</ref> shows that increasing the threshold ν increases the number of neurons produced by unsupervised training of the feature extractor. At the same time, increasing the threshold from 0.8 to 0.9 has practically no effect on the accuracy of the decision rules. This means that the value ν* = 0.8 is optimal and yields a more compact (compressed) feature representation, while ν = 0.9 yields a sparse representation based on an overcomplete basis. Knowledge distillation is implemented with Random Forest regression as the student model, with the number of decision trees limited to 150. The obtained model has equivalent accuracy. 
In this case, the inference time is reduced by a factor of 65.</p><p>Figure <ref type="figure">3</ref> shows a graph of the changes of the maxima of the information criterion (4), averaged over the set of classes, in dependence on the number of decision trees in the information-extreme classifier with ν* = 0.8. In this case, the maximum number of trees is limited to K = 100. Thus, the proposed training algorithm allows the optimal number of neurons at each layer to be determined automatically. At the same time, approximation of the sparse encoder by the non-iterative Random Forest regression model accelerates the inference mode.</p><p>The results of simulation on data from the CTU-Mixed and CTU-13 datasets show that the obtained result is superior to the results from <ref type="bibr" target="#b3">[4]</ref> and <ref type="bibr" target="#b4">[5]</ref> and is acceptable for practical applications. The scientific novelty of the obtained results is as follows: ─ the algorithm of growing sparse coding neural gas is proposed for the first time, which allows unsupervised learning of the optimal set of neurons for each layer of the convolutional sparse coding feature extractor; ─ for the first time, the principle of knowledge distillation is applied to reduce the computational cost of sparse coding algorithms through approximation by a random forest model, which in the inference mode is non-iterative and computationally efficient; ─ for the first time, an information-extreme supervised learning algorithm is proposed for constructing the decision rules of a malware network traffic detector.</p><p>The practical value of the obtained results for malware traffic detection systems is the development of a new learning method that effectively uses both labeled and unlabeled training sets. The results of simulation using the CTU-Mixed and CTU-13 datasets confirm the effectiveness of the obtained decision rules in identifying malware in test traffic samples. 
In this case, the accuracy of the decision rules of the malware traffic detector is 96.1%.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1.</head><label>1</label><figDesc>Fig. 1. Knowledge distillation diagram for each layer of the feature extractor. The dataset for training a feature extraction layer is formed by decomposing images or activation maps into patches. These patches are reshaped into 1D vectors, which are fed to the input of the growing sparse coding neural gas algorithm, whose main steps are given below [16]. 1. Initialization of the counter of training vectors: t := 0. 2. Two initial nodes (neurons) w_a and w_b are assigned by random selection from the training set; they are connected by an edge whose age is zero and are considered non-fixed. 3. The next vector x is selected from the dataset and normalized to unit length (L2-normalization). 4. Each basis vector w_k, k = 1, …, M is normalized to unit length (L2-normalization). 5. Calculation of the similarity of the input vector x to the basis vectors w_k ∈ W.</figDesc><graphic coords="5,135.07,147.36,146.55,125.21" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>step 15; otherwise the step counter is incremented, t := t + 1, and the algorithm proceeds to step 3. 15. If all neurons are fixed, the execution of the algorithm stops; otherwise it proceeds to step 3 and a new epoch of learning begins (a repetition of the training set).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>of input datapoint x with appropriate binary representation b to one class from</head><label></label><figDesc></figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>Fig. 2. Classifier Architecture</figDesc></figure>
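The path-code idea behind the classifier architecture can be sketched with a toy hand-built tree (structure and values are ours, not from the paper): the binary code has a 1 at the id of every numbered node the sample's decision path visits:

```python
# Toy hand-built tree: internal node = (node_id, feature, threshold, left,
# right); leaf = (node_id, None). Node ids double as bit positions.
TREE = (0, 0, 0.5,
        (1, 1, 0.5, (3, None), (4, None)),
        (2, None))

def path_code(x, tree, n_nodes=5):
    """Set a 1 at the id of every node the sample's decision path visits."""
    code = [0] * n_nodes
    node = tree
    while True:
        code[node[0]] = 1
        if node[1] is None:                     # reached a leaf
            return code
        _, feat, thr, left, right = node
        node = left if x[feat] <= thr else right

print(path_code([0.2, 0.9], TREE))  # [1, 1, 0, 0, 1]
```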
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>7.</head><label>7</label><figDesc>Test the obtained information-extreme rules on dataset D and compute the error rate for each sample from D. Under the inference mode, the decision on the membership of the binary representation b of an input datapoint x in class X_z^o is made using the optimal container, which has support vector b_z* and radius d_z*; &lt; K/2 abort loop, where  = 0.001.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Fig. 3.</head><label>3</label><figDesc>Fig. 3. A graph of the change of the average information criterion (4) in dependence on the number of decision trees in the information-extreme classifier. The analysis of Figure 3 shows that the optimal value of the hyper-parameter K* is equal to 185. Further increase of the parameter K does not increase the accuracy of the decision rules. At the optimal parameters of the extractor and the classifier, the accuracy of detection of malware traffic is 96.1%, which indicates the informative nature of the feature description of observations. Figure 4 shows the dependence of the information criterion (4) on the code radius of the container of each class. The analysis of Figure 4 shows that the maximum values of the information criterion of learning for the first and second classes are E_1* = 0.590 and E_2* = 0.597, respectively, and the optimal radii of the corresponding containers of the recognition classes are d_1* = 26 and d_2* = 32 (in code units). In this case, the inter-center Hamming distance is 65, indicating compactness of the feature vector distributions and the clarity of the partition in the binary Hamming space. Thus, the proposed training algorithm allows the optimal number of neurons at each layer to be determined automatically. At the same time, approximation of the sparse encoder by the non-iterative Random Forest regression model accelerates the inference mode. The results of simulation on data from the CTU-Mixed and CTU-13 datasets show that the obtained result is superior to the results from <ref type="bibr" target="#b3">[4]</ref> and <ref type="bibr" target="#b4">[5]</ref> and is acceptable for practical applications.</figDesc><graphic coords="10,203.40,180.72,188.28,188.28" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Fig. 4.</head><label>4</label><figDesc>Fig. 4. Charts of the dependency of the information criterion (4) on the radii of the class containers: a) class of normal traffic; b) class of malware traffic</figDesc><graphic coords="11,149.64,147.48,146.28,126.96" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1.</head><label>1</label><figDesc>Dependence of information criteria and number of neurons on model parameters</figDesc><table><row><cell>ν</cell><cell>M_1</cell><cell>M_2</cell><cell>E</cell><cell>Validation accuracy, %</cell></row><row><cell>0.10</cell><cell>15</cell><cell>11</cell><cell>0.106</cell><cell>74</cell></row><row><cell>0.15</cell><cell>17</cell><cell>13</cell><cell>0.138</cell><cell>77</cell></row><row><cell>0.20</cell><cell>23</cell><cell>13</cell><cell>0.138</cell><cell>77</cell></row><row><cell>0.25</cell><cell>25</cell><cell>13</cell><cell>0.138</cell><cell>77</cell></row><row><cell>0.30</cell><cell>27</cell><cell>15</cell><cell>0.149</cell><cell>78</cell></row><row><cell>0.35</cell><cell>27</cell><cell>15</cell><cell>0.220</cell><cell>83</cell></row><row><cell>0.40</cell><cell>33</cell><cell>17</cell><cell>0.255</cell><cell>85</cell></row><row><cell>0.45</cell><cell>34</cell><cell>22</cell><cell>0.255</cell><cell>85</cell></row><row><cell>0.50</cell><cell>40</cell><cell>25</cell><cell>0.366</cell><cell>90</cell></row><row><cell>0.55</cell><cell>49</cell><cell>31</cell><cell>0.459</cell><cell>93.0</cell></row><row><cell>0.60</cell><cell>66</cell><cell>43</cell><cell>0.466</cell><cell>93.2</cell></row><row><cell>0.65</cell><cell>70</cell><cell>45</cell><cell>0.501</cell><cell>94.1</cell></row><row><cell>0.70</cell><cell>99</cell><cell>45</cell><cell>0.550</cell><cell>95.2</cell></row><row><cell>0.75</cell><cell>145</cell><cell>57</cell><cell>0.554</cell><cell>95.3</cell></row><row><cell>0.80</cell><cell>161</cell><cell>120</cell><cell>0.591</cell><cell>96.1</cell></row><row><cell>0.85</cell><cell>220</cell><cell>147</cell><cell>0.603</cell><cell>95.4</cell></row><row><cell>0.90</cell><cell>322</cell><cell>238</cell><cell>0.611</cell><cell>95.0</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgment</head><p>The work was performed in the laboratory of intellectual systems of the computer science department at Sumy State University with the financial support of the Ministry of Education and Science of Ukraine in the framework of state budget scientific and research work of DR No. 0117U003934.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Flow Based Algorithm for Malware Traffic Detection</title>
		<author>
			<persName><forename type="first">M</forename><surname>Skrzewski</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="271" to="280" />
		</imprint>
	</monogr>
	<note type="report_type">Computer Networks</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Malware traffic detection using tamper resistant features</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Berkay Celik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Walls</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mcdaniel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Swami</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">MILCOM</title>
		<imprint>
			<biblScope unit="volume">2015</biblScope>
			<date type="published" when="2015">2015</date>
			<publisher>IEEE Military Communications Conference</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Analysis of network traffic features for anomaly detection</title>
		<author>
			<persName><forename type="first">F</forename><surname>Iglesias</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zseby</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">101</biblScope>
			<biblScope unit="page" from="59" to="84" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Autoencoder-based feature learning for cyber security applications</title>
		<author>
			<persName><forename type="first">M</forename><surname>Yousefi-Azar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Varadharajan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hamey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Tupakula</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Joint Conference on Neural Networks (IJCNN)</title>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Malware traffic classification using convolutional neural network for representation learning</title>
		<author>
			<persName><forename type="first">Wei</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ming</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xuewen</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiaozhou</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yiqiang</forename><surname>Sheng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Information Networking (ICOIN)</title>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Going deeper with convolutions</title>
		<author>
			<persName><forename type="first">C</forename><surname>Szegedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wei</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yangqing</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sermanet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Reed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Anguelov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Erhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vanhoucke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rabinovich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Convolutional neural networks for time series classification</title>
		<author>
			<persName><forename type="first">B</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Wu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Systems Engineering and Electronics</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page" from="162" to="169" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Compressed auto-encoder building block for deep learning network</title>
		<author>
			<persName><forename type="first">Qiying</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Long</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Informative and Cybernetics for Computational Social Systems (ICCSS)</title>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Weed identification based on K-means feature learning combined with convolutional neural network</title>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Xin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computers and Electronics in Agriculture</title>
		<imprint>
			<biblScope unit="volume">135</biblScope>
			<biblScope unit="page" from="63" to="70" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Sparse Coding Neural Gas: Learning of overcomplete data representations</title>
		<author>
			<persName><forename type="first">K</forename><surname>Labusch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Barth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Martinetz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neurocomputing</title>
		<imprint>
			<biblScope unit="volume">72</biblScope>
			<biblScope unit="page" from="1547" to="1555" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Image Classification with Growing Neural Networks</title>
		<author>
			<persName><forename type="first">I</forename><surname>Mrazova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kukacka</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer Theory and Engineering</title>
		<imprint>
			<biblScope unit="page" from="422" to="427" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">The Growing Hierarchical Neural Gas Self-Organizing Neural Network</title>
		<author>
			<persName><forename type="first">E</forename><surname>Palomo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Lopez-Rubio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Neural Networks and Learning Systems</title>
		<imprint>
			<biblScope unit="page" from="1" to="10" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Layer-Level Knowledge Distillation for Deep Neural Network Learning</title>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chiang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied Sciences</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page">1966</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hooker</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/1808.07573" />
		<title level="m">Approximation Trees: Statistical Stability in Model Distillation</title>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Deep learning of support vector machines with class probability output networks</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural Networks</title>
		<imprint>
			<biblScope unit="volume">64</biblScope>
			<biblScope unit="page" from="19" to="28" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">The Model and Training Algorithm of Compact Drone Autonomous Visual Navigation System</title>
		<author>
			<persName><forename type="first">V</forename><surname>Moskalenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Moskalenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Korobov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Semashko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Data</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page">4</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Deep Sparse-coded Network (DSN)</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Gwon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">T</forename><surname>Kung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2016 23rd International Conference on Pattern Recognition (ICPR)</title>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
