=Paper=
{{Paper
|id=Vol-2718/paper03
|storemode=property
|title=Two Semi-supervised Approaches to Malware Detection with Neural Networks
|pdfUrl=https://ceur-ws.org/Vol-2718/paper03.pdf
|volume=Vol-2718
|authors=Jan Koza,Marek Krčál,Martin Holeňa
|dblpUrl=https://dblp.org/rec/conf/itat/KozaKH20
}}
==Two Semi-supervised Approaches to Malware Detection with Neural Networks==
Jan Koza¹, Marek Krčál², Martin Holeňa¹
¹ Faculty of Information Technology, Czech Technical University, Thákurova 9, Prague, Czech Republic
² Rossum Czech Republic, Dobratická 523, Prague, Czech Republic
Abstract: Semi-supervised learning is characterized by using the additional information from the unlabeled data. In this paper, we compare two semi-supervised algorithms for deep neural networks on a large real-world malware dataset. Specifically, we evaluate the performance of a rather straightforward method called Pseudo-labeling, which uses unlabeled samples, classified with high confidence, as if they were the actual labels. The second approach is based on the idea of increasing the consistency of the network's prediction under altered circumstances. We implemented such an algorithm, called the Π-model, which compares outputs under different data augmentation and different dropout settings. As a baseline, we also provide results of the same deep network, trained in the fully supervised mode using only the labeled data. We analyze the prediction accuracy of the algorithms in relation to the size of the labeled part of the training dataset.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

One of the application domains that pay the most attention to the progress of and new developments in machine learning is malware detection. Vendors of antivirus software cannot keep up with the increasing number of malicious programs and their increasingly sophisticated obfuscation and polymorphism without using more and more advanced machine learning methods, most importantly, methods for anomaly detection, classification, and pattern recognition.

The most successful machine learning methods for classification and pattern recognition definitely include artificial neural networks (ANN), especially deep networks. However, they have a high number of degrees of freedom, thus requiring a large amount of labeled training data, whereas most of the data for malware detection is unlabeled because its labeling requires the expensive involvement of human experts. One possible way to tackle the lack of training data is semi-supervised learning. In a narrow sense, this means supervised learning that, in addition to the labels, also uses some information from additional unlabeled data; in a broad sense, it is any combination of supervised learning and unlabeled data, e.g., unsupervised learning followed by supervised learning. In the context of malware detection, however, semi-supervised ANN learning is only emerging [20, 21]. The work in progress reported in this paper is a small contribution to it. It restricts attention only to two methods of semi-supervised ANN learning; approaches not relying on neural networks are outside its scope.

The next section briefly reviews the use of ANNs in malware detection and in the overlapping area of network intrusion detection. In Section 3, several important methods for semi-supervised ANN learning are recalled, two of which have been implemented for our research. The core Section 4 describes several experiments with a real-world malware dataset and reports their results.

2 Neural Networks in Malware and Network Intrusion Detection

As malware detection is strongly interconnected with and closely related to network intrusion detection, the use of ANNs will be reviewed here in both areas. Probably the first proposal to use neural networks in them was made in 1990 by Lunt [15] and was implemented two years later [4] in a network trained on inputs from audit log files.

The authors of [25] employed user commands as input, but rather than trying to learn benign and malicious command sequences, they detected anomalies in frequency histograms of user commands calculated for each user.

The paper by Cannady [3] summarised the advantages and disadvantages of ANNs for misuse detection. As the two main advantages, the flexibility with respect to incomplete, distorted and noisy data and the generalization ability are viewed, whereas the main disadvantage is the black-box nature of ANNs.

In the late 1990s and early 2000s, self-organizing maps (SOMs) were quite popular in this context [2, 5, 24]. In particular, Depren et al. [5] used a hierarchical model where misuse detection based on SOMs was coupled with a random forest-based rule system to provide not only high precision but also some sort of explanation.

Much research has been devoted to comparing different kinds of ANN, or more generally, different classifiers including one or more kinds of ANN, on real-world malware detection or intrusion detection data. Probably the most popular among such data is an extensive intrusion detection dataset that was used at the 1999 KDD Cup [29]. Zhang et al. [33] compared five different kinds of ANN. Mukkamala et al. [19] compared a multilayer perceptron (MLP) with support vector machines.

Among more recent ANN applications to malware and network intrusion detection, [14] should be mentioned for using synthetically generated attack samples to train an MLP, as well as [30] for malware detection with recurrent networks. Expectedly, the kinds of ANN applied to these two areas during the last decade are most often deep networks [1, 7, 10], valued for their ability to process raw inputs and learn their own features. In [16], deep learning was used together with spectral clustering to improve the detection rate of low-frequency network attacks. Saxe et al. [26] employed a convolutional neural network (CNN) to extract features that were subsequently used as the input for an MLP detecting malicious activities. CNNs seem to be particularly suitable for learning spatial features of network traffic [31, 32]. In [31], a CNN was in addition combined with a long short-term memory network learning temporal features from multiple network packets.

To our best knowledge, there have so far been only two particular ANN applications to malware or network intrusion detection that included semi-supervised learning in the narrow sense. In [20], various settings of semi-supervised ladder networks (see Section 3) were compared on the above-mentioned intrusion detection dataset [29]. In [21] (cf. also the thesis [27]), skipgram networks [17] extended with semi-supervised learning based on Pseudo-labels (see Section 3) were used for Android malware detection. Skipgrams are neural networks embedding large sets of structured non-numeric data into low-dimensional vector spaces. Whereas in [17], skipgrams were proposed for the embedding of text (word2vec), the input set in [21] is the set of rooted subgraphs around every node of three dependency graphs representing the API dependencies, permission dependencies, and information source and sink dependencies of the considered Android application. However, skipgrams were not used directly for malware detection in [21], only for representation learning of the structured input, whereas the malware detection itself was performed by a support vector machine. So far, no semi-supervised neural networks have been used directly for malware detection, and also none have been used with unstructured inputs simply listing the values of the evaluated features, which are encountered much more frequently than dependency matrices.

3 Semi-supervised Learning of Neural Networks

According to the overview paper [22], the following approaches are most important for semi-supervised learning of neural networks, especially deep networks:

(i) Pseudo-labels [13], which are ANN predictions of the correct class for unlabeled data, provided the network has sufficient confidence in such a prediction. Formally, a prediction serves as a pseudo-label for an unlabeled input x if

    max_{c∈C} f_c(x) ≥ ϑ ∑_{c∈C} f_c(x),    (1)

where C denotes the set of classes, f_c(x) the activity of the output neuron corresponding to the class c ∈ C for the input x, and ϑ ∈ (0, 1) is a given threshold.

(ii) Increasing the consistency of predictions for the same input between two instances of a neural network differing through a random perturbation. Such a perturbation is typically introduced through random noise or through dropout. The overall loss function minimized during semi-supervised learning is then the superposition of the loss of supervised learning and a loss reflecting the inconsistency of the considered ANN instances. This approach was first applied in [23] to ladder networks, which are basically chained denoising autoencoders. In [12], two similar kinds of neural networks using this approach to semi-supervised ANN learning were proposed that can be viewed as simplifications of ladder networks. The first kind, called the Π-model, evaluates both randomly differing ANN instances on each minibatch of data. The second kind, called temporal ensembling, evaluates only one of them and then uses its predictions in the inconsistency loss. As a compensation, predictions from multiple previous network evaluations are aggregated into an ensemble prediction.

(iii) Due to the targets changing only once per epoch, temporal ensembling becomes unwieldy when learning large datasets. To overcome this problem, an approach called mean teacher has been proposed in [28]. Instead of aggregating predictions, it aggregates weights; more precisely, it averages them.

(iv) In [18], the most sophisticated among the four considered approaches has been proposed, called virtual adversarial training, due to using a loss function proposed by Goodfellow et al. to train networks against adversarial inputs [8], known as the adversarial loss:

    L_adv(x, θ) = D[q(·|x), p(·|x + r_adv; θ)],    (2)
    where r_adv = arg max_{‖r‖≤ε} D[q(·|x), p(·|x + r; θ)].    (3)

In (2)–(3), q(·|x) represents our knowledge of the true conditional distribution of labels given a particular input x, whereas p(·|x; θ) represents the corresponding distribution implied by the neural network for particular values of its parameters θ, ε > 0, and D is some non-negative function on pairs of probability distributions, such as the cross entropy, which was used in [18]. The term "virtual" refers to the fact that in semi-supervised learning, this loss is minimized on unlabeled inputs instead of on adversarial ones.

So far, we have managed to implement the first two of those approaches, the second in both variants, the Π-model and temporal ensembling. Some details of our implementation are given below.
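As a concrete illustration of approach (i), the selection rule (1) and a cross-entropy loss against the resulting one-hot pseudo-labels could be sketched as follows. This is our minimal NumPy sketch, not the authors' implementation; it assumes the network outputs are non-negative class activities (e.g., softmax values).

```python
import numpy as np

def pseudo_label_loss(outputs, threshold=0.9):
    """Cross-entropy against one-hot pseudo-labels.

    outputs: (n_samples, n_classes) non-negative activities f_c(x).
    Samples whose most active class does not reach the confidence
    threshold of rule (1) contribute zero loss.
    """
    outputs = np.asarray(outputs, dtype=float)
    totals = outputs.sum(axis=1)                           # sum_c f_c(x)
    confident = outputs.max(axis=1) >= threshold * totals  # rule (1)
    winners = outputs.argmax(axis=1)                       # one-hot pseudo-label index
    probs = outputs / totals[:, None]
    # -sum_i y'_i log(y_i) reduces to -log of the winning activity
    losses = -np.log(np.clip(probs[np.arange(len(outputs)), winners], 1e-12, None))
    return np.where(confident, losses, 0.0)
```

For instance, with threshold 0.9 the prediction (0.95, 0.05) yields a pseudo-label and a loss, while (0.6, 0.4) is discarded.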
3.1 Our Implementation of ANN Learning

Most parts of the two algorithms we used share the same implementation. Fundamentally, they only differ in the way they compute the unsupervised component of the loss function. Both methods use the same MLP architecture with ReLU as the activation function in the hidden layers and utilize the same optimization algorithm, Adam [11], with the initial learning rate set to 0.001, β1 = 0.99, and β2 = 0.999. As was shown above, the optimized loss function is defined as a weighted sum of the supervised and unsupervised loss, L = L_S + w(t) L_U. The weight w(t) depends on the ratio between the number of labeled and all data, and on the current epoch. Following a proposal in [22], we ramp up the value of the weight using a Gaussian curve:

    w(t) = w_max · |L| / (|L| + |U|) · exp(−5(1 − t)²),

where t = min(e / r_u, 1), e is the number of the current epoch, r_u is the length of the ramp-up period, and w_max is a parameter specifying the maximum weight. Increasing the weight of the unsupervised loss during the training is necessary, as the network needs to learn to classify the supervised data first. Eventually, it can learn to incorporate the unlabeled information as well. Similarly, in the later phase of the training, the learning rate and the β1 parameter of the Adam optimizer are decreased to improve the exploitation: lr_e = w_d · lr_{e−1} and β1 = 0.4 w_d + 0.5, where

    w_d = exp(−12.5 t²), t = min(e / r_d, 1),

and r_d is the length of the ramp-down period. We also included a type of elitism to select the resulting model with the lowest total loss per epoch, calculated with the maximal weight for the unsupervised component instead of the weight in the current epoch.

The unsupervised loss in the Pseudo-labeling algorithm is calculated using the cross entropy between the network's predictions and the pseudo-labels, but only for predictions with confidence above a specified threshold ϑ (cf. (1)). We compute the vector of pseudo-labels y′ for every data sample x using the corresponding network output f(x) in the following manner:

    y′_i = 1 if i = arg max_{i′} f_{i′}(x), and y′_i = 0 otherwise.    (4)

Then the resulting formula based on the cross entropy for the unsupervised loss component L_U of a particular data sample x is:

    L_U(x) = − ∑_{i=1}^{|C|} y′_i log(y_i),    (5)

where |C| is the number of classes.

We also implemented two variants of the consistency-preserving, self-ensembling algorithms: the Π-model and temporal ensembling. Both approaches use the mean squared error (MSE) to compute the unsupervised loss. What differs is the target for which the MSE is evaluated. The Π-model compares two predictions of the same state of the network using different inputs and different dropped-out neurons. To augment the data for the second prediction, we multiplied the input feature vector with noise sampled from the normal distribution N(1, σ²). We chose to multiply the data with the noise instead of adding it because multiplicative noise is invariant to the differing variances of the individual features.

The second variant, temporal ensembling, compares the prediction of the network in the current epoch with the predictions obtained in the previous epoch. The dropout and data augmentation can be used as well. So the unsupervised loss L_U for this approach is calculated as follows:

    L_U(x) = ∑_{i=1}^{|C|} (y_i − ỹ_i)²,    (6)

where y is the current output of the network in the training step and ỹ is the output of the network in a different state or for an augmented input.

Our open-source implementation is publicly available at https://github.com/c0zzy/semi-supervised-ann.

3.2 Validation using a simple artificial experiment

First, we tried our implementations of the two semi-supervised methods mentioned above and a fully supervised baseline on a two-dimensional example. We chose simple generated moon-shaped data, which are often used for testing classification or clustering algorithms. The data consist of two classes that are linearly inseparable but do not overlap, so that the classification can be performed with no error. The advantage is that we can easily visualize the classification decision border in two dimensions and examine the behavior of the algorithm. For every method in this experiment, we used the same MLP architecture with two hidden layers, the first having 64 neurons and the second 32 neurons.

In Figure 1, we present two different arrangements of labeled and unlabeled data, each solved by the fully supervised learning, Pseudo-labeling, and the Π-model. In the first experiment, we tested the ability of the algorithms to learn from a small amount of data: there are two moon-shaped clusters, each having 1000 samples, of which only 16 per cluster are labeled. We let each network train for 300 epochs. Even though the supervised learning had available samples distributed over the whole cluster, it was not able to learn the correct shape using only 32 samples. The Pseudo-labeling algorithm could not improve the results using the unlabeled data. However, the results of the Π-model are notably better, as it managed to capture the moon shape quite well.

In the second experiment, we tried whether the algorithms can deal with a drift in the training data. This time we used clusters with 10,000 samples and labeled only 1000 points per class that lie near the center. We trained the networks for 100 epochs, as running them longer did not improve the results of either of the methods. The supervised algorithm could only use the labeled data, which are linearly separable. So it learned to classify the labeled data with zero error, and we present it only as a baseline for comparison. Pseudo-labeling again failed to use the information contained in the unlabeled data, and its accuracy was similar to the fully supervised learning. Also in this task, the Π-model was able to use the smoothness of the data and performed the best of the three methods. To quantify the results, we summarized the prediction accuracy tested on the whole clusters in Table 1.

Table 1: A summary of test accuracy on the moon-shaped data. The table compares Pseudo-labeling, the Π-model, and fully supervised learning on test data covering the whole moon cluster. There are results of two experiments. In the first one, only 16 points out of 1000 were uniformly selected and labeled for both classes. In the second, we labeled 1000 points in the center out of 10,000 samples for both classes.

    Method        | 16 pts uniform | 1000 pts in center
    Supervised    | 89.1 %         | 46.2 %
    Pseudo-label  | 85.4 %         | 42.9 %
    Π-model       | 95.7 %         | 76.0 %

Completing these experiments, we observed that the results of the Pseudo-labeling correspond to the idea behind the algorithm. It makes the network's decisions more confident, as it uses the interim predictions as if they were the true labels. Also, the decision border did not seem to converge to a stable final state throughout the learning. It kept shifting closer to one or the other class, roughly in the range where the confidence of the supervised learning was low. We managed to get decent results using the Π-model, and it proved able to capture the smooth distribution of the data. However, the algorithm was susceptible to an inappropriate setting of hyperparameters. It often happened that one class became dominant during the training, and the Π-model could not recover from that.

4 Experiments with a Real-World Malware Dataset

4.1 Data

We tested our implementation using a large real-world malware detection dataset containing anonymized data provided by the company Avast. The data concern Windows Portable Executable (PE) files, which were collected during 380 weeks. The dataset consists of 540 real-valued features derived directly from the binary PE files. Unfortunately, the company did not reveal the semantics of the individual features. Each file is labeled with one of five classes: malware, adware, infected, potentially unwanted program, and clean. There were some features with zero or very low variance in the dataset. Therefore, we used principal component analysis (PCA) to reduce the dimensionality of the feature space and speed up the training. First, we min-max normalized the data between 0 and 1, and then we projected them onto the subspace spanned by the 128 main components while keeping more than 99 % of the explained variance.

4.2 Experimental Design

At first, we analyzed the hyperparameters of each algorithm and, during early tests of our implementation, optimized those that we expected to have the greatest impact on the results. We chose the data from the five weeks between the 50th and 55th week. We performed stratified random sampling and selected 10,000 training and 5000 testing records. We kept only 5 % of the labels in the training set, and the rest remained unlabeled. Using these data, we evaluated the classification accuracy for various sets of hyperparameters.

For the Pseudo-labeling algorithm, we optimized the threshold ϑ and the maximal weight w_max of the unsupervised loss component. For the consistency-preserving algorithms, we optimized the standard deviation σ of the noise used in data augmentation and again the parameter w_max. Furthermore, we repeated the search of parameters for all six combinations of variants of the algorithm, which were: the Π-model or temporal ensembling, and whether to use dropout, augmentation, or both. We took the parameters from the following sets:

    w_max ∈ {0.1, 1, 2, 5, 10, 15, 20, 30, 50},
    σ ∈ {0.01, 0.05, 0.1, 0.15, 0.2, 0.3, 0.5},
    ϑ ∈ {0.5, 0.7, 0.8, 0.9, 0.95, 0.98, 0.99}.

However, because of the high time requirements, we restricted attention among the two similar models proposed in [12] only to the Π-model. For the same reason, we did not perform a full factorial search through all possible combinations. Instead, we optimized only one parameter at a time, keeping the others at their default values, which were: w_max = 30, σ = 0.1, and ϑ = 0.9. Among all these tuned hyperparameters, the most critical from the point of view of the predictive accuracy were the maximal weight and the standard deviation of the Π-model noise. The rest of the hyperparameters we used as stated in the original papers, or we modified them slightly according to our observations, because the domain of our dataset is entirely different. The final values of the chosen hyperparameters used in the experiments are listed in Table 2. For the fully supervised training, we enabled the dropout and the data augmentation in the same manner as with the Π-model. In every experiment, we used the same MLP architecture with five layers and the topology 128-96-64-32-5.

Then we measured the performance of the Pseudo-labeling, the Π-model, and the purely supervised baseline
[Figure 1 panels: (a), (b) Supervised; (c), (d) Pseudo-labeling; (e), (f) Π-model]
Figure 1: A comparison of the decision border of three algorithms on simple moon-shaped data. The decision border
is visualized as a transition from blue to red. The saturation expresses the classification confidence of the network. The
labeled data are shown as cyan or orange circles, while unlabeled are drawn in gray. On the left side, we randomly labeled
only 16 samples out of 2000 from each class. On the right side, we labeled 1000 samples close to the center out of 5000
from each class.
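The Π-model borders shown in Figure 1 result from minimizing the consistency loss (6) of Section 3.1, weighted by the ramped-up w(t). The two ingredients could be sketched as follows; this is our NumPy illustration, not the authors' code, and `net` stands for any (possibly stochastic, dropout-enabled) forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

def ramp_up_weight(epoch, r_u, w_max, n_labeled, n_unlabeled):
    """w(t) = w_max * |L|/(|L|+|U|) * exp(-5 (1 - t)^2), t = min(e/r_u, 1)."""
    t = min(epoch / r_u, 1.0)
    return w_max * n_labeled / (n_labeled + n_unlabeled) * np.exp(-5.0 * (1.0 - t) ** 2)

def pi_model_loss(net, x, sigma=0.2):
    """Loss (6): squared difference between two evaluations of the same
    network, the second on an input multiplied by N(1, sigma^2) noise.
    With dropout active inside `net`, the two passes also differ internally."""
    y = net(x)                                              # first prediction
    y_tilde = net(x * rng.normal(1.0, sigma, size=x.shape)) # augmented input
    return float(np.mean(np.sum((y - y_tilde) ** 2, axis=1)))
```

After the ramp-up period (epoch ≥ r_u), the weight settles at w_max scaled by the labeled fraction, e.g. 2.0 for w_max = 20 with 10 % labels.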
for different proportions of labeled data. We varied the ratio r = |L| : (|L| + |U|) over the set of values {0.5 %, 1 %, 2 %, 5 %, 10 %, 25 %, 50 %, 75 %}. As the training union of labeled and unlabeled data, we took 10,000 stratified samples from 5 consecutive weeks and split them in the considered ratios. Then we trained 20 separate instances of the network and calculated their average accuracy on a stratified test set of size 5000. We repeated this experiment for four arbitrarily chosen distinct groups of weeks: 1–5, 51–55, 101–105, and 151–155. We also evaluated the performance of the trained networks on the data from all of the following weeks. This is particularly interesting from the point of view of the considered application domain. Because the structure of malware changes over time, the prediction accuracy on newer data tends to get worse. That means that if semi-supervised learning could overcome this problem, it could be beneficial. Therefore, we tried to take the data from newer periods than the labeled weeks as the unlabeled training set. So we trained the network using labeled data together with unlabeled data from several weeks later. Unfortunately, we did not manage to outperform the standard fully supervised learning this way using any of the implemented methods, so we refrained from it. We present the results of these experiments in the following section.

Table 2: Final setting of model hyperparameters.

    Common
    Number of training epochs           100
    Training batch size                 100
    Weight ramp-up period r_u           70
    Optimizer ramp-down period r_d      20
    Initial learning rate               0.001

    Pseudo-labeling
    Pseudo-labeling threshold ϑ         0.9
    Maximal weight w_max                10

    Consistency preserving
    Consistency preserving variant      Π-model
    Use dropout                         Yes
    Use data augmentation               Yes
    Maximal weight w_max                20
    Standard deviation σ of the noise   0.2

4.3 Results and Their Discussion

Using the hyperparameter settings presented in the previous section, we measured the average test accuracy of 20 training runs of our three implementations in relation to the proportion of labeled data in the training data set. The results can be found in Table 3. We can see that the performance of the fully supervised learning depends on the number of labeled data, as they are its only learning source. The results of the semi-supervised algorithms Pseudo-labeling and Π-model are more interesting. Both algorithms bring a slight increase in the accuracy for low ratios of the labels. The most noticeable improvement occurs when there are only around 1 or 2 % of labels. When the ratio gets above 10 %, the accuracy gain is negligible, and for higher values, the semi-supervised effect is even negative. Also, it seems that the Π-model outperforms Pseudo-labeling, as its accuracy is higher in most of the measurements.

To verify our observations, we tested whether the distributions of predictive accuracy achieved by the three considered methods significantly differ from each other. Those distributions are shown in Figure 2 for the considered ratios of labeled to all data, but – due to lack of space – only for the networks trained on data from the first five weeks. Firstly, we applied the Friedman test [6] to reject the hypothesis that all three methods can be considered equal. Then we performed a post hoc pairwise test to find out between which of them there were differences at the 5 % level of family-wise significance with the Holm [9] correction. We took the data from all of the following weeks and evaluated the accuracy for all considered ratios of labeled to all data, training 20 models for each of them. A significant difference between the compared methods was found for 80 among the 96 compared pairs corresponding to the 32 combinations of training weeks and ratios. We summarized the results in Table 4, where we compared the average accuracy of the three implemented methods. When we consider only the tests with a ratio of up to 5 %, where the improvement was visible, Pseudo-labeling was significantly better than supervised learning in 3 cases and the Π-model in 11 cases. Pseudo-labeling was significantly better than the Π-model in only 3 out of 14 significant comparisons.

We also visualized the progress of the classification accuracy over time for networks trained during three arbitrarily chosen sequences of 5 contiguous weeks in Figure 3. To capture the variance of the results, we plotted three quartiles. Because the accuracy oscillated greatly through the individual weeks, we used a moving average with a window size of five weeks to smooth the curves (the accuracy during the first five weeks, for which a window of that size has not been available, is dashed). We can see that both semi-supervised algorithms slightly improved the accuracy of the network on roughly the first 30 weeks. The Pseudo-labeling is around 1 or 2 % better than supervised learning, while the Π-model gets another 1 or 2 % above the Pseudo-labeling. However, all three trained networks share the trend of decreasing predictive accuracy during the early weeks when the moving average has been applied, though the number of such weeks is network-specific. After around 40 weeks, the results of all three methods are very similar. As the properties of the data shift over time, the overall results on the data beyond 50 weeks got considerably worse and fluctuated more for all methods.
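The Holm correction [9] used in the post hoc pairwise tests can be sketched in a few lines. This is our illustration of the step-down procedure, not the authors' testing script: the p-values are sorted in ascending order, the k-th smallest is compared against α/(m − k), and the procedure stops at the first failure.

```python
def holm_correction(p_values, alpha=0.05):
    """Holm's step-down procedure for m hypotheses.

    Returns a list of booleans (reject the null hypothesis?) in the
    original order of p_values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if p_values[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break  # all larger p-values are retained as well
    return reject
```

For example, with p-values (0.01, 0.04, 0.03) at α = 0.05, only the first hypothesis is rejected: 0.01 ≤ 0.05/3, but the next smallest, 0.03, exceeds 0.05/2.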
Table 3: Comparison of the Π-model, Pseudo-labeling, and the supervised baseline for different ratios of labeled and all data. The table depicts the percentage of the average testing accuracy in four different periods. The S columns contain the results of the supervised baseline; the ∆Ps and ∆Π columns show the difference using Pseudo-labeling and the Π-model, respectively.

    Weeks:   1–5              51–55            101–105          151–155
    Ratio    S    ∆Ps  ∆Π    S    ∆Ps  ∆Π    S    ∆Ps  ∆Π    S    ∆Ps  ∆Π
    0.5 %    67.9 +0.4 +3.1  63.9 +2.8 +3.5  56.8 +5.2 +6.8  67.6 +0.3 +1.9
    1 %      71.0 +1.7 +4.5  67.1 +5.0 +6.4  61.8 +6.9 +9.0  70.3 +5.7 +6.4
    2 %      76.8 +1.3 +2.5  73.9 +3.4 +5.7  69.7 +5.9 +6.2  76.6 +1.9 +2.0
    5 %      82.4 −0.1 +1.1  82.2 +0.4 +2.4  77.8 +3.7 +3.3  80.4 +0.2 +0.8
    10 %     85.1 +0.0 +1.1  86.1 −0.4 +0.8  83.2 +1.0 +1.1  81.7 +0.3 +0.6
    25 %     88.3 −0.4 +0.3  89.2 −0.5 +0.1  87.4 +0.7 +0.3  83.1 +0.3 +0.3
    50 %     89.9 −0.4 −0.1  90.6 −0.7 +0.0  89.8 −0.3 −0.2  84.2 −0.1 −0.2
    75 %     90.4 −0.1 −0.1  91.2 −0.3 −0.3  90.7 −0.1 −0.4  84.4 +0.3 −0.2
    100 %    90.9            91.4            91.3            84.8

Figure 2: Boxplots summarizing the distributions of predictive accuracy achieved by supervised learning (S), Pseudo-labeling (P), and the Π-model (Π) for the considered ratios of labeled to all data and the networks trained on data from the first five weeks.
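The preprocessing behind all of these results (Section 4.1: min-max normalization to [0, 1] followed by projection onto the leading principal components) can be sketched as follows. This is our NumPy approximation, not the authors' pipeline; the component count is a parameter, set to 128 in the paper.

```python
import numpy as np

def preprocess(features, n_components=128):
    """Min-max normalize each feature to [0, 1], then project the centered
    data onto the leading principal components obtained via SVD.

    Returns the projected data and the fraction of explained variance."""
    x = np.asarray(features, dtype=float)
    span = x.max(axis=0) - x.min(axis=0)
    span[span == 0] = 1.0  # guard against zero-variance features
    x = (x - x.min(axis=0)) / span
    centered = x - x.mean(axis=0)
    # principal directions come from the SVD of the centered data matrix
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    k = min(n_components, vt.shape[0])
    explained = (s[:k] ** 2).sum() / (s ** 2).sum()
    return centered @ vt[:k].T, explained
```

The returned explained-variance fraction makes it easy to check a criterion like the paper's "more than 99 %".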
5 Conclusion

In this paper, we presented an application of semi-supervised learning of deep neural networks to malware data. At the beginning, we recalled the current state of detecting malware with artificial neural networks and introduced the principles of neural semi-supervised learning. Then we outlined four semi-supervised approaches to deep learning. We covered two semi-supervised algorithms, Pseudo-labeling and the Π-model, in more detail and compared them with the fully supervised baseline. We evaluated the classification accuracy on a real-world malware dataset divided into 380 weeks by the time of the first recording of the respective binary file. Despite both methods having been developed for the classification of image data, the results showed that they could improve the performance of a neural network on malware data. However, the implemented algorithms have the limitation of being beneficial only when the proportion of labeled data is low, ideally around 1 %. We have also found that these semi-supervised methods can increase the accuracy on data newer than the training set, for which drift in structure is likely to occur, but only to a certain extent. Based on our experiments, the slightly more complex Π-model algorithm obtained slightly better results than Pseudo-labeling in most cases.

Acknowledgement

The research reported in this paper has been supported by the Czech Science Foundation (GAČR) grant 18-18080S. For the employed data and the work of M. Krčál, his support through the Avast fellowship is appreciated. Access to the computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the program "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated.
[Figure 3: three accuracy-quartile plots over weeks 0–50, 50–100, and 100–150, comparing supervised learning, Pseudo-labeling, and the Π-model; the plotted images are not reproduced here.]

Figure 3: The progression of the classification accuracy on later weeks using Pseudo-labeling, the Π-model, and fully supervised learning, trained using a set with 1 % of labels. In each plot, three quartiles are visualized: the median is drawn with a solid line, while the first and the third quartiles are dotted. The curves correspond to a moving average with a window size of five weeks; the first five weeks, drawn dashed, are means of all previous weeks. The five weeks at the beginning of each plot were used for the training.
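The smoothing used for the curves in Figure 3 amounts to a trailing moving average. A pure-Python sketch (ours, not the authors' plotting code), where the first weeks without a full window fall back to the mean of all weeks so far, as described in the caption:

```python
def moving_average(values, window=5):
    """Smooth a weekly accuracy series with a trailing moving average.

    For the first weeks, where a full window is not yet available,
    the mean of all weeks so far is used instead."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - window + 1)
        smoothed.append(sum(values[lo:i + 1]) / (i + 1 - lo))
    return smoothed
```

For example, moving_average([1, 2, 3, 4, 5, 6], window=5) yields [1.0, 1.5, 2.0, 2.5, 3.0, 4.0].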
Table 4: Multiple comparisons test of the three methods for different ratios of labeled to all data, tested on the data from all of the following weeks till the end. Each cell contains a triplet of symbols representing the results of three post hoc pairwise tests. The order of the comparisons is: supervised to Pseudo-labeling, supervised to Π-model, and Pseudo-labeling to Π-model. A dash means that the difference was not statistically significant, and the letters S, P, and Π mark whether supervised, Pseudo-labeling, or Π-model was significantly better than the other compared algorithm.

          Training weeks
Ratio     1–5        51–55      101–105    151–155
0.5 %     P, –, Π    P, S, Π    P, –, Π    P, –, Π
1 %       P, –, –    S, Π, Π    P, –, Π    P, Π, Π
2 %       S, S, Π    S, S, Π    P, Π, P    S, S, –
5 %       –, S, Π    –, S, P    P, Π, P    P, S, Π
10 %      –, –, Π    P, S, Π    S, S, P    S, S, Π
25 %      P, Π, Π    S, S, Π    S, P, P    S, S, Π
50 %      –, Π, Π    –, S, Π    S, Π, P    S, S, Π
75 %      S, Π, –    S, Π, –    S, –, P    S, S, Π
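The table caption describes each cell as the outcome of three post hoc pairwise tests. Assuming a Holm-style step-down correction of the three pairwise p-values (the exact procedure is not spelled out here, and the helper names below are purely illustrative), one cell could be encoded as:

```python
def holm_reject(pvals, alpha=0.05):
    """Holm step-down procedure: compare the k-th smallest of m p-values
    against alpha/(m-k), stopping at the first non-rejection."""
    m = len(pvals)
    order = sorted(range(m), key=lambda k: pvals[k])
    reject = [False] * m
    for rank, k in enumerate(order):
        if pvals[k] <= alpha / (m - rank):
            reject[k] = True
        else:
            break
    return reject

def table4_cell(pvals, means, alpha=0.05):
    """Encode one cell: pvals are the p-values of the pairwise tests
    (S vs P, S vs Pi, P vs Pi), means the mean accuracies of
    (supervised, Pseudo-labeling, Pi-model). A dash marks a non-significant
    difference; otherwise the symbol of the better method is emitted."""
    names = ('S', 'P', 'Π')
    pairs = [(0, 1), (0, 2), (1, 2)]
    cell = []
    for (i, j), rej in zip(pairs, holm_reject(pvals, alpha)):
        if not rej:
            cell.append('–')
        else:
            cell.append(names[i] if means[i] > means[j] else names[j])
    return ', '.join(cell)
```

For instance, `table4_cell([0.001, 0.2, 0.04], [0.60, 0.70, 0.65])` yields `'P, –, –'`: only the supervised-to-Pseudo-labeling difference survives the correction, and Pseudo-labeling has the higher mean accuracy.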