<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Two Semi-supervised Approaches to Malware Detection with Neural Networks</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Jan</forename><surname>Koza</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Information Technology</orgName>
								<orgName type="institution">Czech Technical University</orgName>
								<address>
									<addrLine>Thákurova 9</addrLine>
									<settlement>Prague</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Marek</forename><surname>Krčál</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Rossum Czech Republic</orgName>
								<address>
									<addrLine>Dobratická 523</addrLine>
									<settlement>Prague</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Martin</forename><surname>Holeňa</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Information Technology</orgName>
								<orgName type="institution">Czech Technical University</orgName>
								<address>
									<addrLine>Thákurova 9</addrLine>
									<settlement>Prague</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Two Semi-supervised Approaches to Malware Detection with Neural Networks</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">4D27A73C51FB01F5902109A0BEE05763</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T08:53+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Semi-supervised learning is characterized by using additional information from unlabeled data. In this paper, we compare two semi-supervised algorithms for deep neural networks on a large real-world malware dataset. Specifically, we evaluate the performance of a rather straightforward method called Pseudo-labeling, which uses high-confidence predictions on unlabeled samples as if they were the actual labels. The second approach is based on the idea of increasing the consistency of the network's predictions under altered circumstances. We implemented such an algorithm, called the Π-model, which compares outputs under different data augmentations and dropout settings. As a baseline, we also provide results of the same deep network trained in the fully supervised mode using only the labeled data. We analyze the prediction accuracy of the algorithms in relation to the size of the labeled part of the training dataset.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>One of the application domains that pay the most attention to the progress of and new developments in machine learning is malware detection. Vendors of antivirus software cannot keep up with the increasing number of malicious programs and their increasingly sophisticated obfuscation and polymorphism without using more and more advanced machine learning methods, most importantly, methods for anomaly detection, classification and pattern recognition.</p><p>The most successful machine learning methods for classification and pattern recognition definitely include artificial neural networks (ANN), especially deep networks. However, they have a high number of degrees of freedom, thus requiring a large amount of labeled training data, whereas most of the data for malware detection is unlabeled because its labeling requires the expensive involvement of human experts. One possible way to tackle the lack of training data is semi-supervised learning. In a narrow sense, this means supervised learning that, in addition to labels, also uses some information from additional unlabeled data; in a broad sense, it means any combination of supervised learning and unlabeled data, e.g., unsupervised learning followed by supervised learning. In the context of malware detection, however, semi-supervised ANN learning is only emerging <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b20">21]</ref>. The work in progress reported in this paper is a small contribution to it. It restricts attention to only two methods of semi-supervised ANN learning; approaches not relying on neural networks are outside its scope. (Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).)</p><p>The next section briefly reviews the use of ANN in malware detection and the overlapping area of network intrusion detection. 
In Section 3, several important methods for semi-supervised ANN learning are recalled, two of which have been implemented for our research. The core Section 4 describes several experiments with a real-world malware dataset, and reports their results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Neural Networks in Malware and Network Intrusion Detection</head><p>As malware detection is strongly interconnected with and closely related to network intrusion detection, the use of ANN will be reviewed here in both areas. Probably the first proposal to use neural networks in them was made in 1990 by Lunt <ref type="bibr" target="#b14">[15]</ref> and was implemented two years later <ref type="bibr" target="#b3">[4]</ref> in a network trained on inputs from audit log files. The authors of <ref type="bibr" target="#b24">[25]</ref> employed user commands as input, but rather than trying to learn benign and malicious command sequences, they were detecting anomalies in frequency histograms of user commands calculated for each user.</p><p>The paper by Cannady <ref type="bibr" target="#b2">[3]</ref> summarised ANN advantages and disadvantages for misuse detection. It views the flexibility with respect to incomplete, distorted and noisy data, together with the generalization ability, as the two main advantages, and the black-box nature of ANN as the main disadvantage.</p><p>In the late 1990s and early 2000s, self-organizing maps were quite popular in this context <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b23">24]</ref>. In particular, Depren et al. <ref type="bibr" target="#b4">[5]</ref> used a hierarchical model where misuse detection based on self-organizing maps (SOMs) was coupled with a random-forest-based rule system to provide not only high precision, but also some sort of explanation.</p><p>Much research has been devoted to comparing different kinds of ANN, or more generally, different classifiers including one or more kinds of ANN, on real-world malware detection or intrusion detection data. Probably the most popular among such data is an extensive intrusion detection dataset that was used at the 1999 KDD Cup <ref type="bibr" target="#b28">[29]</ref>. Zhang et al. 
<ref type="bibr" target="#b32">[33]</ref> compared five different kinds of ANN. Mukkamala et al. <ref type="bibr" target="#b18">[19]</ref> compared a multilayer perceptron (MLP) with support vector machines.</p><p>Among more recent ANN applications to malware and network intrusion detection, <ref type="bibr" target="#b13">[14]</ref> should be mentioned for using synthetically generated attack samples to train an MLP, as well as <ref type="bibr" target="#b29">[30]</ref> for malware detection with recurrent networks. Expectedly, the kinds of ANN applied to these two areas during the last decade are most often deep networks <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b9">10]</ref>. In <ref type="bibr" target="#b15">[16]</ref>, deep learning was used together with spectral clustering to improve the detection rate of low-frequency network attacks. An important advantage of deep networks in this context is their ability to process raw inputs and learn their own features. Saxe et al. <ref type="bibr" target="#b25">[26]</ref> employed a convolutional neural network (CNN) to extract features that were subsequently used as the input for an MLP detecting malicious activities. CNNs seem to be particularly suitable for learning spatial features of network traffic <ref type="bibr" target="#b30">[31,</ref><ref type="bibr" target="#b31">32]</ref>. In <ref type="bibr" target="#b30">[31]</ref>, a CNN was in addition combined with a long short-term memory network learning temporal features from multiple network packets.</p><p>To the best of our knowledge, there have so far been only two particular ANN applications to malware or network intrusion detection that included semi-supervised learning in the narrow sense. In <ref type="bibr" target="#b19">[20]</ref>, various settings of semi-supervised ladder networks (see Section 3) were compared on the above mentioned intrusion detection dataset <ref type="bibr" target="#b28">[29]</ref>. In <ref type="bibr" target="#b20">[21]</ref> (cf. 
also the thesis <ref type="bibr" target="#b26">[27]</ref>), skipgram networks <ref type="bibr" target="#b16">[17]</ref> extended with semi-supervised learning based on Pseudo-labels (see Section 3) were used for Android malware detection. Skipgrams are neural networks embedding large sets of structured non-numeric data into low-dimensional vector spaces. Whereas in <ref type="bibr" target="#b16">[17]</ref>, skipgrams were proposed for the embedding of text (word2vec), the input set in <ref type="bibr" target="#b20">[21]</ref> is the set of rooted subgraphs around every node of three dependency graphs representing the API dependencies, permission dependencies, and information source and sink dependencies of the considered Android application. However, skipgrams were not used directly for malware detection in <ref type="bibr" target="#b20">[21]</ref>, only for representation learning of the structured input, whereas the malware detection itself was performed by a support vector machine. So far, no semi-supervised neural networks have been used directly for malware detection, and none have been used with unstructured inputs simply listing the values of the evaluated features, which are encountered much more frequently than dependency matrices.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Semi-supervised Learning of Neural Networks</head><p>According to the overview paper <ref type="bibr" target="#b21">[22]</ref>, the following approaches are most important for semi-supervised learning of neural networks, especially deep networks: (i) Pseudo-labels <ref type="bibr" target="#b12">[13]</ref>, which are ANN predictions of the correct class for unlabeled data, provided the network has sufficient confidence in such a prediction. Formally, a prediction serves as a pseudo-label for an unlabeled input x if</p><formula xml:id="formula_0">max c∈C f c (x) ≥ ϑ ∑ c∈C f c (x),<label>(1)</label></formula><p>where C denotes the set of classes, f c (x) the activity of the output neuron corresponding to the class c ∈ C for the input x, and ϑ ∈ (0, 1) is a given threshold.</p><p>(ii) Increasing the consistency of predictions for the same input between two instances of a neural network differing through a random perturbation. Such a perturbation is typically introduced through random noise or through dropout. The overall loss function minimized during semi-supervised learning is then the superposition of the loss of supervised learning and a loss reflecting the inconsistency of the considered ANN instances. This approach was first applied in <ref type="bibr" target="#b22">[23]</ref> to ladder networks, which are basically chained denoising autoencoders. In <ref type="bibr" target="#b11">[12]</ref>, two similar kinds of neural networks using this approach to semi-supervised ANN learning were proposed that can be viewed as simplifications of ladder networks. The first kind, called the Π-model, evaluates both randomly differing ANN instances on each minibatch of data.</p><p>The second kind, called temporal ensembling, evaluates only one of them and then uses its predictions in the inconsistency loss. 
As a compensation, predictions from multiple previous network evaluations are aggregated into an ensemble prediction.</p><p>(iii) Because the targets change only once per epoch, temporal ensembling becomes unwieldy when learning on large datasets. To overcome this problem, an approach called mean teacher has been proposed in <ref type="bibr" target="#b27">[28]</ref>. Instead of aggregating predictions, it aggregates weights, more precisely, averages them.</p><p>(iv) In <ref type="bibr" target="#b17">[18]</ref>, the most sophisticated among the four considered approaches has been proposed, called virtual adversarial training, due to using a loss function proposed by Goodfellow et al. to train networks against adversarial inputs <ref type="bibr" target="#b7">[8]</ref>, known as adversarial loss:</p><formula xml:id="formula_1">L adv (x, θ ) = D[q(•|x), p(•|x + r adv ; θ )],<label>(2)</label></formula><p>where</p><formula xml:id="formula_2">r adv = arg max ∥r∥≤ε D[q(•|x), p(•|x + r; θ )].<label>(3)</label></formula><p>In (<ref type="formula" target="#formula_1">2</ref>)-(<ref type="formula" target="#formula_2">3</ref>), q(•|x) represents our knowledge of the true conditional distribution of labels given a particular input x, whereas p(•|x; θ ) represents the corresponding distribution implied by the neural network for particular values of its parameters θ , ε &gt; 0, and D is some non-negative function on pairs of probability distributions, such as the cross entropy used in <ref type="bibr" target="#b17">[18]</ref>. The term "virtual" refers to the fact that in semi-supervised learning, this loss is minimized on unlabeled inputs instead of on adversarial ones.</p><p>So far, we have managed to implement the first two of these approaches, the second in both variants, Π-model and temporal ensembling. Some details of our implementation are given below.</p></div>
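The pseudo-label selection criterion (1) can be sketched in a few lines. This is a minimal numpy illustration, not the paper's implementation; the sample outputs and the threshold value 0.9 are assumed for the example.

```python
import numpy as np

def select_pseudo_labels(outputs, threshold):
    """Return (indices, labels) of samples whose most confident class
    activation reaches at least `threshold` times the total activation,
    following criterion (1)."""
    outputs = np.asarray(outputs, dtype=float)
    confidence = outputs.max(axis=1) / outputs.sum(axis=1)
    selected = np.nonzero(confidence >= threshold)[0]
    return selected, outputs[selected].argmax(axis=1)

# Two confident predictions and one uncertain one, with threshold 0.9:
outputs = [[0.95, 0.03, 0.02],   # confident: class 0
           [0.40, 0.35, 0.25],   # uncertain: skipped
           [0.01, 0.01, 0.98]]   # confident: class 2
idx, labels = select_pseudo_labels(outputs, threshold=0.9)
```

Only the first and third samples pass the criterion, receiving pseudo-labels 0 and 2 respectively; the uncertain middle sample is left out of the unsupervised loss.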
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Our Implementation of ANN Learning</head><p>Most parts of the two algorithms we used share the same implementation. Fundamentally, they differ only in the way they compute the unsupervised component of the loss function. Firstly, both methods use the same MLP architecture with ReLU as the activation function in the hidden layers and utilize the same optimization algorithm Adam <ref type="bibr" target="#b10">[11]</ref> with the initial learning rate set to 0.001, β 1 = 0.99, and β 2 = 0.999. As was shown above, the optimized loss function is defined as a weighted sum of the supervised and unsupervised loss, L = L S + w(t)L U . The weight w(t) depends on the ratio between the number of labeled and all data, and on the current epoch. Following a proposal in <ref type="bibr" target="#b21">[22]</ref>, we ramp up the value of the weight using a Gaussian curve:</p><formula xml:id="formula_3">w(t) = w max (|L | / (|L | + |U |)) exp(−5(1 − t) 2 ),</formula><p>where t = min(e / r u , 1), e is the number of the current epoch, r u is the length of the ramp-up period, and w max is a parameter specifying the maximum weight. Increasing the weight of the unsupervised loss during the training is necessary because the network needs to learn to classify the supervised data first; only later can it learn to incorporate the unlabeled information as well. Similarly, in the later phase of the training, the learning rate and the β 1 parameter of the Adam optimizer are decreased to improve exploitation: lr e = w d lr e−1 and β 1 = 0.4w d + 0.5, where</p><formula xml:id="formula_4">w d = exp(−12.5t 2 ), t = min(e / r d , 1),</formula><p>and r d is the length of the ramp-down period. 
We also included a type of elitism: we select the resulting model with the lowest total loss per epoch, calculated with the maximal weight for the unsupervised component instead of the weight in the current epoch.</p><p>The unsupervised loss in the Pseudo-labeling algorithm is calculated using the cross entropy between the network's predictions and the pseudo-labels, but only for predictions with confidence above a specified threshold ϑ (cf. (<ref type="formula" target="#formula_0">1</ref>)). We compute the vector of pseudo-labels y for every data sample x from the corresponding network output f (x) in the following manner:</p><formula xml:id="formula_5">y i = 1 if i = arg max j f j (x), and y i = 0 otherwise.<label>(4)</label></formula><p>The resulting cross-entropy formula for the unsupervised loss component L U of a particular data sample x is then:</p><formula xml:id="formula_6">L U (x) = − |C| ∑ i=1 y i log( f i (x)),<label>(5)</label></formula><p>where |C| is the number of classes. We also implemented two variants of the consistency-preserving, self-ensembling algorithms: the Π-model and temporal ensembling. Both approaches use the mean squared error (MSE) to compute the unsupervised loss; they differ in the target against which the MSE is evaluated. The Π-model compares two predictions of the same state of the network using different inputs and different dropped-out neurons. To augment the data for the second prediction, we multiplied the input feature vector with noise sampled from the normal distribution N (1, σ 2 ). We chose to multiply the data with the noise instead of adding it because the perturbation then scales with each feature and is thus invariant to the differing variances of the individual features.</p><p>The second variant, temporal ensembling, compares the prediction of the network in the current epoch with the predictions obtained in the previous epoch. Dropout and data augmentation can be used as well. 
So the unsupervised loss L U for this approach is calculated as follows:</p><formula xml:id="formula_8">L U (x) = |C| ∑ i=1 (y i − ỹ i ) 2 ,<label>(6)</label></formula><p>where y is the current output of the network in the training step and ỹ is the output of the network in a different state or for an augmented input. Our open-source implementation is publicly available at https://github.com/c0zzy/semi-supervised-ann.</p></div>
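The unsupervised loss components (5)-(6) and the Gaussian ramp-up weight can be sketched as follows. This is a minimal numpy sketch under our reading of the formulas (in particular, reading the ramp-up as t = min(e / r_u, 1)); it is illustrative, not the published code.

```python
import numpy as np

def pseudo_label_loss(predictions):
    """Cross entropy (5) of each prediction against its own one-hot
    pseudo-label (4): -log of the most confident class probability."""
    predictions = np.asarray(predictions, dtype=float)
    return -np.log(predictions.max(axis=1))

def pi_model_loss(y, y_tilde):
    """Squared-error consistency loss (6) between two predictions of the
    same sample under different augmentation and dropout."""
    y = np.asarray(y, dtype=float)
    y_tilde = np.asarray(y_tilde, dtype=float)
    return ((y - y_tilde) ** 2).sum(axis=1)

def rampup_weight(epoch, ramp_len, w_max, n_labeled, n_total):
    """Gaussian ramp-up w(t) = w_max * |L|/(|L|+|U|) * exp(-5(1-t)^2),
    with t = min(epoch / ramp_len, 1)."""
    t = min(epoch / ramp_len, 1.0)
    return w_max * (n_labeled / n_total) * np.exp(-5.0 * (1.0 - t) ** 2)
```

After the ramp-up period the weight settles at w_max scaled by the labeled fraction; before that, it grows along the Gaussian curve, so the supervised loss dominates early training.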
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Validation Using a Simple Artificial Experiment</head><p>Firstly, we tried our implementations of the two semi-supervised methods mentioned above and a fully supervised baseline on a two-dimensional example. We chose simple generated moon-shaped data, which are often used for testing classification or clustering algorithms. The data consist of two classes that are linearly inseparable but do not overlap, so that the classification can be performed with no error. The advantage is that we can easily visualize the classification decision border in two dimensions and examine the behavior of the algorithms. For every method in this experiment, we used the same MLP architecture with two hidden layers, the first having 64 neurons and the second 32 neurons.</p><p>In Figure <ref type="figure" target="#fig_0">1</ref>, we present two different arrangements of labeled and unlabeled data, each solved by fully supervised learning, Pseudo-labeling, and the Π-model. In the first experiment, we tested the ability of the algorithms to learn from a small amount of data: there are two moon-shaped clusters, each having 1000 samples, of which only 16 in each are labeled. We let each network train for 300 epochs. Even though the supervised learning had available samples distributed over the whole cluster, it was not able to learn the correct shape using only 32 samples. The Pseudo-labeling algorithm could not improve the results using the unlabeled data. However, the results of the Π-model are notably better, as it managed to capture the moon shape quite well.</p><p>In the second experiment, we tested whether the algorithms can deal with a drift in the training data. This time we used clusters with 10,000 samples and labeled, for each class, only 1000 points that lie near the center. We trained the networks for 100 epochs, as having them run longer did not improve the results of any of the methods. 
The supervised algorithm could only use the labeled data, which are linearly separable. So it learned to classify the labeled data with zero error, and we present it only as a baseline for comparison. Pseudo-labeling again failed to use the information contained in the unlabeled data, and its accuracy was similar to that of fully supervised learning. Also in this task, the Π-model was able to use the smoothness of the data and performed the best of the three methods. To quantify the results, we summarized the prediction accuracy tested on the whole clusters in Table <ref type="table" target="#tab_0">1</ref>. Completing these experiments, we observed that the results of Pseudo-labeling correspond to the idea behind the algorithm. It makes the network's decisions more confident, as it uses the interim predictions as if they were the true labels. Also, the decision border did not seem to converge to a stable final state throughout the learning. It kept shifting closer to one or the other class, roughly in the range where the confidence of the supervised learning was low. We managed to get decent results using the Π-model, and it proved able to capture the smooth distribution of the data. However, the algorithm was susceptible to inappropriate hyperparameter settings. It often happened that one class became dominant during the training, and the Π-model could not recover from that.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments with a Real-World Malware Dataset</head></div>
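The moon-shaped validation data used in the experiment of Section 3.2 can be generated and partially de-labeled with a short sketch. The generator below is a common construction of the two-moons set, assumed for illustration rather than taken from our code; unlabeled samples are marked with -1.

```python
import numpy as np

def make_moons(n_per_class, noise=0.08, seed=0):
    """Two interleaving half-circles: the classic linearly inseparable
    two-class toy dataset used for the validation experiment."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0, np.pi, size=n_per_class)
    upper = np.column_stack([np.cos(t), np.sin(t)])
    lower = np.column_stack([1 - np.cos(t), 0.5 - np.sin(t)])
    X = np.vstack([upper, lower]) + rng.normal(0, noise, size=(2 * n_per_class, 2))
    y = np.repeat([0, 1], n_per_class)
    return X, y

def mask_labels(y, n_labeled_per_class, seed=0):
    """Keep only n_labeled_per_class random labels per class; mark the
    rest as unlabeled (-1), as in the semi-supervised setup."""
    rng = np.random.default_rng(seed)
    masked = np.full_like(y, -1)
    for c in np.unique(y):
        idx = rng.choice(np.nonzero(y == c)[0], n_labeled_per_class, replace=False)
        masked[idx] = c
    return masked

# First arrangement: 1000 samples per cluster, only 16 labeled in each.
X, y = make_moons(1000)
y_semi = mask_labels(y, 16)
```

The second arrangement (labels concentrated near the cluster centers) would select indices by distance instead of at random, but the masking mechanism is the same.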
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Data</head><p>We tested our implementation using a large real-world malware detection dataset containing anonymized data provided by the company Avast. The data concern Windows Portable Executable (PE) files, which were collected over 380 weeks. The dataset consists of 540 real-valued features derived directly from the binary PE files. Unfortunately, the company did not reveal the semantics of the individual features. Each file is labeled with one of five classes: malware, adware, infected, potentially unwanted program, and clean. There were some features with zero or very low variance in the dataset. Therefore, we used principal component analysis (PCA) to reduce the dimensionality of the feature space and speed up the training. First, we min-max normalized the data to values between 0 and 1, and then we projected them onto the subspace spanned by the first 128 principal components, which retains more than 99 % of the explained variance.</p></div>
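The preprocessing pipeline described above (min-max normalization followed by PCA projection) can be sketched with numpy. Since the real 540-feature dataset is not available, the sketch runs on random stand-in data; the function names and dimensions below are illustrative assumptions.

```python
import numpy as np

def preprocess(X, n_components=128):
    """Min-max normalize each feature to [0, 1], then project onto the
    leading principal components via SVD; returns the projected data and
    the fraction of variance the kept components explain."""
    X = np.asarray(X, dtype=float)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                      # guard zero-variance features
    Xn = (X - X.min(axis=0)) / span
    Xc = Xn - Xn.mean(axis=0)                  # center before PCA
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (s[:n_components] ** 2).sum() / (s ** 2).sum()
    return Xc @ Vt[:n_components].T, explained

# Stand-in data: 200 samples with 10 features, reduced to 5 components.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Z, explained = preprocess(X, n_components=5)
```

In the paper's setting, one would keep 128 of the 540 components and check that `explained` exceeds 0.99 before training.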
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Experimental Design</head><p>At first, we analyzed the hyperparameters of each algorithm and optimized those that we expected to have the greatest impact on the results during early tests of our implementation. We chose the data from the five weeks between the 50th and the 55th week. We performed stratified random sampling and selected 10,000 training and 5000 testing records. We kept only 5 % of the labels in the training set; the rest remained unlabeled. Using these data, we evaluated the classification accuracy for various sets of hyperparameters.</p><p>For the Pseudo-labeling algorithm, we optimized the threshold ϑ and the maximal weight w max for the unsupervised loss component. For the consistency-preserving algorithms, we optimized the standard deviation σ of the noise used in data augmentation and again the parameter w max . Furthermore, we repeated the search for all six combinations of algorithm variants: Π-model or temporal ensembling, each with dropout, with augmentation, or with both. We took the parameters from the following sets: w max ∈ {0.1, 1, 2, 5, 10, 15, 20, 30, 50}, σ ∈ {0.01, 0.05, 0.1, 0.15, 0.2, 0.3, 0.5}, ϑ ∈ {0.5, 0.7, 0.8, 0.9, 0.95, 0.98, 0.99}. However, because of the high time requirements, we restricted attention, among the two similar models proposed in <ref type="bibr" target="#b11">[12]</ref>, to the Π-model only. For the same reason, we did not perform a full factorial search through all possible combinations. Instead, we optimized only one parameter at a time, keeping the others at their default values, which were w max = 30, σ = 0.1 and ϑ = 0.9. Among all these tuned hyperparameters, the most critical from the point of view of predictive accuracy were the maximal weight and the standard deviation of the Π-model noise. 
For the remaining hyperparameters, we used the values stated in the original papers, or modified them slightly based on our observations, since the domain of our dataset is entirely different. The final values of the chosen hyperparameters used in the experiments are given in Table <ref type="table" target="#tab_1">2</ref>. For the fully supervised training, we enabled dropout and data augmentation in the same manner as with the Π-model. In every experiment, we used the same MLP architecture with five layers and the topology 128-96-64-32-5.</p><p>Then we measured the performance of Pseudo-labeling, the Π-model, and the purely supervised baseline for different proportions of labeled data. We varied the ratio r = |L | : (|L | + |U |) in the set of values {0.5%, 1%, 2%, 5%, 10%, 25%, 50%, 75%}. As the training union of labeled and unlabeled data, we took 10,000 stratified samples from 5 consecutive weeks and split them in the considered ratios. Then we trained 20 separate instances of the network and calculated their average accuracy on a stratified test set of size 5000. We repeated this experiment for four arbitrarily chosen distinct groups of weeks: 1-5, 51-55, 101-105, and 151-155. We also evaluated the performance of the trained networks on the data from all of the following weeks. This is particularly interesting from the point of view of the considered application domain. Because the structure of malware changes over time, the prediction accuracy on newer data tends to get worse. If semi-supervised learning could mitigate this problem, it would therefore be beneficial. Hence, we tried to take data from periods newer than the labeled weeks as the unlabeled training set, i.e., we trained the network using labeled data together with unlabeled data from several weeks later. Unfortunately, we did not manage to outperform standard fully supervised learning this way using any of the implemented methods, so we abandoned this approach. 
We present the results of these experiments in the following section.</p></div>
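The one-parameter-at-a-time search described in Section 4.2 can be sketched generically. The `toy_eval` objective below is a stand-in with a known optimum, used only to make the sketch runnable; the real evaluation would train and score a network for each parameter setting.

```python
def one_at_a_time_search(evaluate, grids, defaults):
    """Optimize each hyperparameter separately, keeping all the others at
    their default values, and keep the best value found for each.
    `evaluate` maps a full parameter dict to a validation score."""
    best = dict(defaults)
    for name, values in grids.items():
        scores = {}
        for value in values:
            params = dict(defaults)   # the other parameters stay at defaults
            params[name] = value
            scores[value] = evaluate(params)
        best[name] = max(scores, key=scores.get)
    return best

# Stand-in objective whose optimum happens to sit at the default values:
def toy_eval(p):
    return -(p["w_max"] - 30) ** 2 - (p["sigma"] - 0.1) ** 2

grids = {"w_max": [0.1, 1, 2, 5, 10, 15, 20, 30, 50],
         "sigma": [0.01, 0.05, 0.1, 0.15, 0.2, 0.3, 0.5]}
best = one_at_a_time_search(toy_eval, grids, {"w_max": 30, "sigma": 0.1})
```

Unlike a full factorial search, this requires only the sum, not the product, of the grid sizes in evaluations, at the cost of ignoring interactions between parameters.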
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Results and Their Discussion</head><p>Using the hyperparameter settings presented in the previous section, we measured the average test accuracy of 20 training runs of our three implementations in relation to the proportion of labeled data in the training dataset. The results can be found in Table <ref type="table" target="#tab_2">3</ref>. We can see that the performance of fully supervised learning depends on the amount of labeled data, as it is the only learning source for the network. The results of the semi-supervised algorithms Pseudo-labeling and Π-model are more interesting. Both algorithms bring a slight increase in accuracy at low ratios of labeled data. The most noticeable improvement occurs when only around 1 or 2 % of the data are labeled. When the ratio gets above 10 %, the accuracy gain is negligible, and for higher values, the semi-supervised effect is even negative. Also, it seems that the Π-model outperforms Pseudo-labeling, as its accuracy is higher in most of the measurements.</p><p>To verify our observations, we tested whether the distributions of predictive accuracy achieved by the three considered methods significantly differ from each other. Those distributions are shown, for the considered ratios of labeled to all data, in Figure <ref type="figure">2</ref>, but -due to lack of space -only for the networks trained on data from the first five weeks. Firstly, we applied the Friedman test <ref type="bibr" target="#b5">[6]</ref> to reject the hypothesis that all three methods can be considered equal. Then we performed a post hoc pairwise test to find out between which of them there were differences at the 5 % level of family-wise significance with the Holm <ref type="bibr" target="#b8">[9]</ref> correction. We took the data from all of the following weeks and evaluated the accuracy for all considered ratios of labeled and all data, training 20 models for each of them. 
A significant difference between the compared methods was found for 80 of the 96 compared pairs corresponding to the 32 combinations of training weeks and ratios. We summarized the results in Table <ref type="table">4</ref>, where we compared the average accuracy of the three implemented methods. When we consider only the tests with a ratio of up to 5 %, where the improvement was visible, Pseudo-labeling was significantly better than supervised learning in 3 cases, and the Π-model in 11 cases. Pseudo-labeling was significantly better than the Π-model in only 3 out of 14 significant comparisons.</p><p>We also visualized the progress of the classification accuracy over time for networks trained during three arbitrarily chosen sequences of 5 contiguous weeks in Figure 3. To capture the variance of the results, we plotted three quartiles. Because the accuracy oscillated greatly through the individual weeks, we used a moving average with a window size of five weeks to smooth the curves (the accuracy during the first five weeks, for which a window of that size has not been available, is dashed). We can see that both semi-supervised algorithms slightly improved the accuracy of the network during roughly the first 30 weeks. Pseudo-labeling is around 1 or 2 % better than supervised learning, while the Π-model gets another 1 or 2 % above Pseudo-labeling. However, all three trained networks share the trend of decreasing predictive accuracy during the early weeks when the moving average has been applied, though the number of such weeks is network-specific. After around 40 weeks, the results of all three methods are very similar. As the properties of the data shift over time, the overall results on the data beyond 50 weeks got considerably worse and fluctuated more for all methods.</p></div>
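The five-week moving average used to smooth the weekly accuracy curves can be computed as below; the sample accuracy values are purely illustrative.

```python
import numpy as np

def moving_average(series, window=5):
    """Smooth a weekly accuracy curve with a trailing moving average;
    with mode='valid', the first window-1 weeks produce no output, which
    is why those weeks are drawn dashed in the figure."""
    series = np.asarray(series, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(series, kernel, mode="valid")

# Illustrative weekly accuracies over six weeks:
acc = [0.8, 0.9, 0.7, 0.8, 0.9, 0.6]
smoothed = moving_average(acc, window=5)
```

For six weekly values and a window of five, only two smoothed points remain: the means of the first and the last five weeks.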
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Figure <ref type="figure">2</ref>: Boxplots summarizing the distributions of predictive accuracy achieved by supervised learning (S), Pseudo-labeling (P) and the Π-model (Π) for the considered ratios of labeled to all data and the networks trained on data from the first five weeks</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>In this paper, we presented an application of semi-supervised learning of deep neural networks to malware data. At the beginning, we recalled the current state of detecting malware with artificial neural networks and introduced the principles of neural semi-supervised learning. Then we outlined four semi-supervised approaches to deep learning. We covered two semi-supervised algorithms, Pseudo-labeling and the Π-model, in more detail and compared them with a fully supervised baseline. We evaluated the classification accuracy on a real-world malware dataset divided into 380 weeks by the time of the first recording of the respective binary file. Although both methods had been developed for the classification of image data, the results showed that they could improve the performance of a neural network on malware data. However, the implemented algorithms have the limitation of being beneficial only when the proportion of labeled data is low, ideally around 1 %.</p><p>We have also found that these semi-supervised methods can increase the accuracy on data newer than the training set, for which drift in structure is likely to occur, but only to a certain extent. Based on our experiments, the slightly more complex Π-model algorithm achieved slightly better results than Pseudo-labeling in most cases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgement</head><p>The research reported in this paper has been supported by the Czech Science Foundation (GA ČR) grant 18-18080S. The employed data and the work of M. Krčál were supported through an Avast fellowship, which is gratefully acknowledged. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the program "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated. Table <ref type="table">4</ref>: Multiple comparisons test of the three methods for different ratios of labeled to all data, tested on the data from all of the following weeks until the end. Each cell contains a triplet of symbols representing the results of three post hoc pairwise tests. The order of the comparisons is: supervised to Pseudo-labeling, supervised to Π-model, and Pseudo-labeling to Π-model. A dash means that the difference was not statistically significant; the letters S, P, and Π mark whether supervised learning, Pseudo-labeling, or the Π-model was significantly better than the other compared algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Ratio</head><p>Training weeks: 1-5 | 51-55 | 101-105 | 151-155. 0.5 %: P, -, Π | P, S, Π | P, -, Π | P, -, Π. 1 %: P, -, - | S, Π, Π | P, -, Π | P, Π, Π. 2 %: S, S, Π | S, S, Π | P, Π, P | S, S, -. 5 %: -, S, Π | -, S, P | P, Π, P | P, S, Π. 10 %: -, -, Π | P, S, Π | S, S, P | S, S, Π. 25 %: P, Π, Π | S, S, Π | S, P, P | S, S, Π. 50 %: -, Π, Π | -, S, Π | S, Π, P | S, S, Π. 75 %: S, Π, - | S, Π, - | S, -, P | S, S, Π.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: A comparison of the decision border of three algorithms on simple moon-shaped data. The decision border is visualized as a transition from blue to red. The saturation expresses the classification confidence of the network. The labeled data are shown as cyan or orange circles, while unlabeled data are drawn in gray. On the left side, we randomly labeled only 16 samples out of 2000 from each class. On the right side, we labeled 1000 samples close to the center out of 5000 from each class.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: The progression of the classification accuracy in later weeks using Pseudo-labeling, the Π-model, and fully supervised learning, trained on a set with 1 % of labels. For each plot, three quartiles are visualized; the median is drawn with a solid line, while the first and the third quartiles are dotted. The curves correspond to a moving average with a window size of five weeks. In the first five dashed weeks, the means of all previous weeks are shown. The first five weeks at the beginning of each plot were used for training.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>A summary of test accuracy on the moon-shaped data. The table compares Pseudo-labeling, the Π-model, and fully supervised learning on test data covering the whole moon cluster. Results of two experiments are shown. In the first one, only 16 points out of 1000 were uniformly selected and labeled for both classes. In the second, we labeled 1000 points in the center out of 10,000 samples for both classes.</figDesc><table><row><cell>Method</cell><cell>16 pts uniform</cell><cell>1000 pts in center</cell></row><row><cell>Supervised</cell><cell>89.1 %</cell><cell>46.2 %</cell></row><row><cell>Pseudo-label</cell><cell>85.4 %</cell><cell>42.9 %</cell></row><row><cell>Π-model</cell><cell>95.7 %</cell><cell>76.0 %</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Final setting of model hyperparameters.</figDesc><table><row><cell>Common</cell><cell></cell></row><row><cell>Number of training epochs</cell><cell>100</cell></row><row><cell>Training batch size</cell><cell>100</cell></row><row><cell>Weight ramp-up period r u</cell><cell>70</cell></row><row><cell>Optimizer ramp-down period r d</cell><cell>20</cell></row><row><cell>Initial learning rate</cell><cell>0.001</cell></row><row><cell>Pseudo-Labeling</cell><cell></cell></row><row><cell>Pseudo-labeling threshold ϑ</cell><cell>0.9</cell></row><row><cell>Maximal weight w max</cell><cell>10</cell></row><row><cell>Consistency preserving</cell><cell></cell></row><row><cell>Consistency preserving variant</cell><cell>Π-model</cell></row><row><cell>Use dropout</cell><cell>Yes</cell></row><row><cell>Use data augmentation</cell><cell>Yes</cell></row><row><cell>Maximal weight w max</cell><cell>20</cell></row><row><cell cols="2">Standard deviation σ of the noise 0.2</cell></row></table></figure>
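Table 2 gives the ramp-up and ramp-down periods but not the exact schedules. A minimal sketch, assuming the Gaussian schedules of Laine and Aila (2017) that are customary with the Π-model (the function names and the exact formulas are assumptions, not taken from the paper):

```python
import math

def rampup_weight(epoch, w_max, rampup=70):
    # Unsupervised-loss weight for the current epoch, ramped up over
    # the first `rampup` epochs with the Gaussian curve
    # w_max * exp(-5 * (1 - epoch/rampup)^2), then held at w_max.
    if epoch >= rampup:
        return w_max
    t = 1.0 - epoch / rampup
    return w_max * math.exp(-5.0 * t * t)

def rampdown_lr(epoch, lr0=0.001, epochs=100, rampdown=20):
    # Learning-rate ramp-down over the final `rampdown` epochs,
    # again following the Laine-Aila Gaussian shape.
    if epoch >= epochs - rampdown:
        t = (epoch - (epochs - rampdown)) / rampdown
        return lr0 * math.exp(-12.5 * t * t)
    return lr0
```

With the values from Table 2, `rampup_weight(epoch, 20, 70)` would drive the Π-model consistency term and `rampdown_lr(epoch)` the Adam learning rate over the 100 training epochs.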
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Comparison of the Π-model, Pseudo-labeling, and the supervised baseline for different ratios of labeled to all data. The table depicts the percentage of the average testing accuracy in four different periods. The S columns contain the results of the supervised baseline; the ∆ Ps and ∆ Π columns show the difference when using Pseudo-labeling and the Π-model, respectively.</figDesc><table><row><cell>Ratio</cell><cell cols="3">Weeks: 1-5</cell><cell cols="3">Weeks: 51-55</cell><cell cols="3">Weeks: 101-105</cell><cell cols="3">Weeks: 151-155</cell></row><row><cell></cell><cell>S</cell><cell>∆ Ps</cell><cell>∆ Π</cell><cell>S</cell><cell>∆ Ps</cell><cell>∆ Π</cell><cell>S</cell><cell>∆ Ps</cell><cell>∆ Π</cell><cell>S</cell><cell>∆ Ps</cell><cell>∆ Π</cell></row><row><cell>0.5 %</cell><cell>67.9</cell><cell>+0.4</cell><cell>+3.1</cell><cell>63.9</cell><cell>+2.8</cell><cell>+3.5</cell><cell>56.8</cell><cell>+5.2</cell><cell>+6.8</cell><cell>67.6</cell><cell>+0.3</cell><cell>+1.9</cell></row><row><cell>1 %</cell><cell>71.0</cell><cell>+1.7</cell><cell>+4.5</cell><cell>67.1</cell><cell>+5.0</cell><cell>+6.4</cell><cell>61.8</cell><cell>+6.9</cell><cell>+9.0</cell><cell>70.3</cell><cell>+5.7</cell><cell>+6.4</cell></row><row><cell>2 %</cell><cell>76.8</cell><cell>+1.3</cell><cell>+2.5</cell><cell>73.9</cell><cell>+3.4</cell><cell>+5.7</cell><cell>69.7</cell><cell>+5.9</cell><cell>+6.2</cell><cell>76.6</cell><cell>+1.9</cell><cell>+2.0</cell></row><row><cell>5 %</cell><cell>82.4</cell><cell>-0.1</cell><cell>+1.1</cell><cell>82.2</cell><cell>+0.4</cell><cell>+2.4</cell><cell>77.8</cell><cell>+3.7</cell><cell>+3.3</cell><cell>80.4</cell><cell>+0.2</cell><cell>+0.8</cell></row><row><cell>10 %</cell><cell>85.1</cell><cell>+0.0</cell><cell>+1.1</cell><cell>86.1</cell><cell>-0.4</cell><cell>+0.8</cell><cell>83.2</cell><cell>+1.0</cell><cell>+1.1</cell><cell>81.7</cell><cell>+0.3</cell><cell>+0.6</cell></row><row><cell>25 %</cell><cell>88.3</cell><cell>-0.4</cell><cell>+0.3</cell><cell>89.2</cell><cell>-0.5</cell><cell>+0.1</cell><cell>87.4</cell><cell>+0.7</cell><cell>+0.3</cell><cell>83.1</cell><cell>+0.3</cell><cell>+0.3</cell></row><row><cell>50 %</cell><cell>89.9</cell><cell>-0.4</cell><cell>-0.1</cell><cell>90.6</cell><cell>-0.7</cell><cell>+0.0</cell><cell>89.8</cell><cell>-0.3</cell><cell>-0.2</cell><cell>84.2</cell><cell>-0.1</cell><cell>-0.2</cell></row><row><cell>75 %</cell><cell>90.4</cell><cell>-0.1</cell><cell>-0.1</cell><cell>91.2</cell><cell>-0.3</cell><cell>-0.3</cell><cell>90.7</cell><cell>-0.1</cell><cell>-0.4</cell><cell>84.4</cell><cell>+0.3</cell><cell>-0.2</cell></row><row><cell>100 %</cell><cell>90.9</cell><cell></cell><cell></cell><cell>91.4</cell><cell></cell><cell></cell><cell>91.3</cell><cell></cell><cell></cell><cell>84.8</cell><cell></cell><cell></cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Nonlinear dimensionality reduction for intrusion detection using auto-encoder bottleneck features</title>
		<author>
			<persName><forename type="first">B</forename><surname>Abolhasanzadeh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IKT: IEEE 7th Conference on Information and Knowledge Technology</title>
				<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Neural networks applied in intrusion detection systems</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Bonifacio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Cansian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C P L F</forename><surname>De Carvalho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">S</forename><surname>Moreira</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE International Joint Conference on Neural Networks</title>
				<imprint>
			<date type="published" when="1998">1998</date>
			<biblScope unit="page" from="205" to="210" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Artificial neural networks for misuse detection</title>
		<author>
			<persName><forename type="first">J</forename><surname>Cannady</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">National Information Systems Security Conference</title>
				<imprint>
			<date type="published" when="1998">1998</date>
			<biblScope unit="page" from="368" to="381" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A neural network component for an intrusion detection system</title>
		<author>
			<persName><forename type="first">H</forename><surname>Debar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Becker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Siboni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Computer Society Symposium on Research in Security and Privacy</title>
				<imprint>
			<date type="published" when="1992">1992</date>
			<biblScope unit="page" from="240" to="250" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">An intelligent intrusion detection system (IDS) for anomaly and misuse detection in computer networks</title>
		<author>
			<persName><forename type="first">O</forename><surname>Depren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Topallar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Anarim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">K</forename><surname>Ciliz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Expert Systems with Applications</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="page" from="713" to="722" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">The use of ranks to avoid the assumption of normality implicit in the analysis of variance</title>
		<author>
			<persName><forename type="first">M</forename><surname>Friedman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Statistical Association</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="issue">200</biblScope>
			<biblScope unit="page" from="675" to="701" />
			<date type="published" when="1937">1937</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">An intrusion detection model based on deep belief networks</title>
		<author>
			<persName><forename type="first">N</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Second International Conference on Advanced Cloud and Big Data</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="247" to="252" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Explaining and harnessing adversarial examples</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">J</forename><surname>Goodfellow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shlens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Szegedy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICLR</title>
		<imprint>
			<biblScope unit="page" from="1" to="11" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A simple sequentially rejective multiple test procedure</title>
		<author>
			<persName><forename type="first">S</forename><surname>Holm</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Scandinavian Journal of Statistics</title>
		<imprint>
			<biblScope unit="page" from="65" to="70" />
			<date type="published" when="1979">1979</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Long short term memory recurrent neural network classifier for intrusion detection</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">T</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Thu</surname></persName>
		</author>
		<author>
			<persName><surname>Kim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">PlatCon: IEEE International Conference on Platform Technology and Service</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Adam: A method for stochastic optimization</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ba</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1412.6980</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">Preprint</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Temporal ensembling for semi-supervised learning</title>
		<author>
			<persName><forename type="first">S</forename><surname>Laine</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Aila</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICLR</title>
		<imprint>
			<biblScope unit="page" from="1" to="13" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">H</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">WREPL: ICML Workshop Challenges in Representation Learning</title>
				<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Neural network based intrusion detection system for critical infrastructures</title>
		<author>
			<persName><forename type="first">O</forename><surname>Linda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Vollmer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Manic</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Joint Conference on Neural Networks</title>
				<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="1827" to="1834" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">IDES: An intelligent system for detecting intruders</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">F</forename><surname>Lunt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Symposium on Computer Security, Threat and Countermeasures</title>
				<imprint>
			<date type="published" when="1990">1990</date>
			<biblScope unit="page" from="30" to="45" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">A hybrid spectral clustering and deep neural network ensemble algorithm for intrusion detection in sensor networks</title>
		<author>
			<persName><forename type="first">T</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Sensors</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page">1701</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note>article no</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Distributed representations of words and phrases and their compositionality</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">NIPS</title>
				<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="1" to="9" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Virtual adversarial training: A regularization method for supervised and semi-supervised learning</title>
		<author>
			<persName><forename type="first">T</forename><surname>Miyato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">I</forename><surname>Maeda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Koyama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ishii</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="page" from="1979" to="1993" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Intrusion detection using neural networks and support vector machines</title>
		<author>
			<persName><forename type="first">S</forename><surname>Mukkamala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Janoski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Joint Conference on Neural Networks</title>
				<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="1702" to="1707" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Semi-supervised deep neural network for network intrusion detection</title>
		<author>
			<persName><forename type="first">M</forename><surname>Nadeem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Marshall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Fang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yuan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference on Cybersecurity Education, Research and Practice</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="0" to="11" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Apk2vec: Semi-supervised multi-view representation learning for profiling Android applications</title>
		<author>
			<persName><forename type="first">A</forename><surname>Narayanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Soh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Data Mining</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="357" to="366" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Realistic evaluation of deep semi-supervised learning algorithms</title>
		<author>
			<persName><forename type="first">A</forename><surname>Oliver</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Odena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">D</forename><surname>Cubuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">J</forename><surname>Goodfellow</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">NIPS</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1" to="19" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Semi-supervised learning with ladder networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Rasmus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Valpola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Honkala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Berglund</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Raiko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">NIPS</title>
				<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="1" to="9" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Multiple self-organizing maps for intrusion detection</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">C</forename><surname>Rhodes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Mahaffey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Cannady</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">23rd National Information Systems Security Conference</title>
				<imprint>
			<date type="published" when="2000">2000</date>
			<biblScope unit="page" from="16" to="19" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Intrusion detection with neural networks</title>
		<author>
			<persName><forename type="first">J</forename><surname>Ryan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Miikkulainen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 10</title>
				<imprint>
			<date type="published" when="1998">1998</date>
			<biblScope unit="page" from="943" to="949" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Saxe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Berlin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1702.08568</idno>
		<title level="m">eXpose: A character-level convolutional neural network with embeddings for detecting malicious URLs, file paths and registry keys</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">Arxiv preprint</note>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<title level="m" type="main">Program Analysis and Machine Learning Techniques for Mobile Security</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">Z Y</forename><surname>Soh</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
		<respStmt>
			<orgName>Nanyang Technological University, Singapore</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">PhD thesis</note>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results</title>
		<author>
			<persName><forename type="first">A</forename><surname>Tarvainen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Valpola</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">NIPS</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1" to="16" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">A detailed analysis of the KDD cup 99 data set</title>
		<author>
			<persName><forename type="first">M</forename><surname>Tavallaee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Bagheri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Ghorbani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Symposium on Computational Intelligence for Security and Defense Applications</title>
				<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="288" to="293" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">An analysis of recurrent neural networks for botnet detection behavior</title>
		<author>
			<persName><forename type="first">P</forename><surname>Torres</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Catania</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Garcia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Garino</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ARGENCON: IEEE biennial congress of Argentina</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">HAST-IDS: Learning hierarchical spatialtemporal features using deep neural networks to improve intrusion detection</title>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="1792" to="1806" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Malware traffic classification using convolutional neural network for representation learning</title>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sheng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICOIN: IEEE International Conference on Information Networking</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="712" to="717" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">HIDE: A hierarchical network intrusion detection system using statistical preprocessing and neural network classification</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">N</forename><surname>Manikopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Jorgenson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ucles</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Workshop on Information Assurance and Security</title>
				<imprint>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="85" to="90" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
