=Paper= {{Paper |id=Vol-2718/paper03 |storemode=property |title=Two Semi-supervised Approaches to Malware Detection with Neural Networks |pdfUrl=https://ceur-ws.org/Vol-2718/paper03.pdf |volume=Vol-2718 |authors=Jan Koza,Marek Krčál,Martin Holeňa |dblpUrl=https://dblp.org/rec/conf/itat/KozaKH20 }}
Two Semi-supervised Approaches to Malware Detection with Neural Networks

                                                Jan Koza1 , Marek Krčál2 , Martin Holeňa1
                1   Faculty of Information Technology, Czech Technical University, Thákurova 9, Prague, Czech Republic
                                    2 Rossum Czech Republic, Dobratická 523, Prague, Czech Republic


Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract: Semi-supervised learning is characterized by using additional information from unlabeled data. In this paper, we compare two semi-supervised algorithms for deep neural networks on a large real-world malware dataset. Specifically, we evaluate the performance of a rather straightforward method called Pseudo-labeling, which uses unlabeled samples, classified with high confidence, as if they carried the actual labels. The second approach is based on the idea of increasing the consistency of the network's predictions under altered circumstances. We implemented such an algorithm, called Π-model, which compares outputs under different data augmentations and different dropout settings. As a baseline, we also provide results of the same deep network trained in the fully supervised mode using only the labeled data. We analyze the prediction accuracy of the algorithms in relation to the size of the labeled part of the training dataset.

1 Introduction

One of the application domains that pay the most attention to the progress of and new developments in machine learning is malware detection. Vendors of antivirus software cannot keep up with the increasing number of malicious programs and their increasingly sophisticated obfuscation and polymorphism without using more and more advanced machine learning methods, most importantly methods for anomaly detection, classification and pattern recognition.

The most successful machine learning methods for classification and pattern recognition definitely include artificial neural networks (ANN), especially deep networks. However, they have a high number of degrees of freedom, thus requiring a large amount of labeled training data, whereas most of the data for malware detection is unlabeled because its labeling requires the expensive involvement of human experts. One possible way to tackle the lack of training data is semi-supervised learning. In a narrow sense, this means supervised learning that, in addition to labels, also uses some information from additional unlabeled data; in a broad sense, it means any combination of supervised learning and unlabeled data, e.g., unsupervised learning followed by supervised learning. In the context of malware detection, however, semi-supervised ANN learning is only emerging [20, 21]. The work in progress reported in this paper is a small contribution to it. It restricts attention to only two methods of semi-supervised ANN learning; approaches not relying on neural networks are outside its scope.

The next section briefly reviews the use of ANN in malware detection and the overlapping area of network intrusion detection. In Section 3, several important methods for semi-supervised ANN learning are recalled, two of which have been implemented for our research. The core Section 4 describes several experiments with a real-world malware dataset and reports their results.

2 Neural Networks in Malware and Network Intrusion Detection

As malware detection is strongly interconnected with and closely related to network intrusion detection, the use of ANN will be reviewed here in both areas. Probably the first proposal to use neural networks in these areas was made in 1990 by Lunt [15] and was implemented two years later [4] in a network trained on inputs from audit log files.

The authors of [25] employed user commands as input, but rather than trying to learn benign and malicious command sequences, they were detecting anomalies in frequency histograms of user commands calculated for each user.

The paper by Cannady [3] summarised ANN advantages and disadvantages for misuse detection. As the two main advantages, the flexibility with respect to incomplete, distorted and noisy data and the generalization ability are viewed, whereas the black-box nature of ANNs is viewed as the main disadvantage.

In the late 1990s and early 2000s, self-organizing maps (SOMs) were quite popular in this context [2, 5, 24]. In particular, Depren et al. [5] used a hierarchical model where misuse detection based on SOMs was coupled with a random-forest-based rule system to provide not only high precision, but also some sort of explanation.

Much research has been devoted to comparing different kinds of ANN, or more generally, different classifiers including one or more kinds of ANN, on real-world malware detection or intrusion detection data. Probably the most popular among such data is an extensive intrusion detection dataset that was used at the 1999 KDD Cup [29]. Zhang et al. [33] compared five different kinds of ANN. Mukkamala et al. [19] compared a multilayer perceptron (MLP) with support vector machines.

Among more recent ANN applications to malware and network intrusion detection, [14] should be mentioned for
using synthetically generated attack samples to train an MLP, as well as [30] for malware detection with recurrent networks. Expectedly, the kinds of ANN applied to these two areas during the last decade are most often deep networks [1, 7, 10]. In [16], deep learning was used together with spectral clustering to improve the detection rate of low-frequency network attacks. Deep networks are valued in this context also for their ability to process raw inputs and learn their own features. Saxe et al. [26] employed a convolutional neural network (CNN) to extract features that were subsequently used as the input for an MLP detecting malicious activities. CNNs seem to be particularly suitable to learn spatial features of network traffic [31, 32]. In [31], a CNN was in addition combined with a long short-term memory network learning temporal features from multiple network packets.

To the best of our knowledge, there have so far been only two particular ANN applications to malware or network intrusion detection that included semi-supervised learning in the narrow sense. In [20], various settings of semi-supervised ladder networks (see Section 3) were compared on the above-mentioned intrusion detection dataset [29]. In [21] (cf. also the thesis [27]), skipgram networks [17] extended with semi-supervised learning based on Pseudo-labels (see Section 3) were used for Android malware detection. Skipgrams are neural networks embedding large sets of structured non-numeric data into low-dimensional vector spaces. Whereas in [17], skipgrams were proposed for the embedding of text (word2vec), the input set in [21] is the set of rooted subgraphs around every node of three dependency graphs representing the API dependencies, permission dependencies, and information source and sink dependencies of the considered Android application. However, skipgrams were not used directly for malware detection in [21], only for representation learning of the structured input, whereas the malware detection itself was performed by a support vector machine. So far, no semi-supervised neural networks have been used directly for malware detection, and also none have been used with unstructured inputs simply listing values of the evaluated features, which are encountered much more frequently than dependency matrices.

3 Semi-supervised Learning of Neural Networks

According to the overview paper [22], the following approaches are most important for semi-supervised learning of neural networks, especially deep networks:

(i) Pseudo-labels [13], which are ANN predictions of the correct class for unlabeled data, provided the network has a sufficient confidence in such a prediction. Formally, a prediction serves as a pseudo-label for an unlabeled input x if

    max_{c∈C} f_c(x) ≥ ϑ · ∑_{c∈C} f_c(x),    (1)

where C denotes the set of classes, f_c(x) the activity of the output neuron corresponding to the class c ∈ C for the input x, and ϑ ∈ (0, 1) is a given threshold.

(ii) Increasing the consistency of predictions for the same input between two instances of a neural network differing through a random perturbation. Such a perturbation is typically introduced through random noise or through dropout. The overall loss function minimized during semi-supervised learning is then the superposition of the loss of supervised learning and a loss reflecting the inconsistency of the considered ANN instances. This approach was first applied in [23] to ladder networks, which are basically chained denoising autoencoders. In [12], two similar kinds of neural networks using this approach to semi-supervised ANN learning were proposed that can be viewed as simplifications of ladder networks. The first kind, called Π-model, evaluates both randomly differing ANN instances on each minibatch of data. The second kind, called temporal ensembling, evaluates only one of them and then uses its predictions in the inconsistency loss. As a compensation, predictions from multiple previous network evaluations are aggregated into an ensemble prediction.

(iii) Due to targets changing only once per epoch, temporal ensembling becomes unwieldy when learning large datasets. To overcome this problem, an approach called mean teacher has been proposed in [28]. Instead of aggregating predictions, it aggregates weights, more precisely, averages them.

(iv) In [18], the most sophisticated among the four considered approaches has been proposed, called virtual adversarial training, due to using a loss function proposed by Goodfellow et al. to train networks against adversarial inputs [8], known as adversarial loss:

    L_adv(x, θ) = D[q(·|x), p(·|x + r_adv; θ)],    (2)
    where r_adv = arg max_{‖r‖≤ε} D[q(·|x), p(·|x + r; θ)].    (3)

In (2)–(3), q(·|x) represents our knowledge of the true conditional distribution of labels given a particular input x, whereas p(·|x; θ) represents the corresponding distribution implied by the neural network for particular values of its parameters θ, ε > 0, and D is some non-negative function on pairs of probability distributions, such as cross-entropy, which was used in [18]. The term "virtual" refers to the fact that, unlike in supervised learning, this loss needs to be minimized on unlabeled inputs instead of on adversarial ones.

So far, we have managed to implement the first two of those approaches, the second in both variants, Π-model and temporal ensembling. Some details of our implementation are given below.
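Rule (1) for selecting pseudo-labels can be sketched in a few lines of plain Python (a minimal illustration assuming each network output is a vector of non-negative class activities; the function name and the example activities are ours, not taken from the published implementation):

```python
def select_pseudo_labels(outputs, threshold=0.9):
    """For each output vector f(x), emit a one-hot pseudo-label if the most
    active class reaches `threshold` times the total activity (rule (1));
    otherwise emit None, i.e. the sample stays unlabeled in this step."""
    pseudo = []
    for f in outputs:
        total = sum(f)                                  # sum over c of f_c(x)
        best = max(range(len(f)), key=f.__getitem__)    # arg max over c
        if f[best] >= threshold * total:
            pseudo.append([1.0 if i == best else 0.0 for i in range(len(f))])
        else:
            pseudo.append(None)
    return pseudo

# With ϑ = 0.9, only the first of these two predictions is confident enough:
labels = select_pseudo_labels([[0.95, 0.03, 0.02],
                               [0.60, 0.30, 0.10]], threshold=0.9)
```

With softmax outputs the total activity is 1, so the rule reduces to thresholding the maximal class probability.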
3.1 Our Implementation of ANN Learning

Most parts of the two algorithms we used share the same implementation. Fundamentally, they only differ in the way they compute the unsupervised component of the loss function. Firstly, both methods use the same MLP architecture with ReLU as the activation function in the hidden layers and utilize the same optimization algorithm, Adam [11], with the initial learning rate set to 0.001, β1 = 0.99, and β2 = 0.999. As was shown above, the optimized loss function is defined as a weighted sum of the supervised and unsupervised losses, L = L_S + w(t)·L_U. The weight w(t) depends on the ratio between the number of labeled and all data, and on the current epoch. Following a proposal in [22], we ramp up the value of the weight using a Gaussian curve:

    w(t) = w_max · |L|/(|L| + |U|) · exp(−5(1 − t)²),

where t = min(e/r_u, 1), e is the number of the current epoch, r_u is the length of the ramp-up period, and w_max is a parameter specifying the maximum weight. Increasing the weight of the unsupervised loss during the training is necessary as the network needs to learn to classify the supervised data first. Eventually, it can learn to incorporate the unlabeled information as well. Similarly, in the later phase of the training, the learning rate and the β1 parameter of the Adam optimizer are decreased to improve the exploitation: lr_e = w_d · lr_{e−1} and β1 = 0.4·w_d + 0.5, where w_d = exp(−12.5 t²), t = min(e/r_d, 1), and r_d is the length of the ramp-down period. We also included a type of elitism: the resulting model is selected as the one with the lowest total loss per epoch, calculated with the maximal weight for the unsupervised component instead of the weight in the current epoch.

The unsupervised loss in the Pseudo-labeling algorithm is calculated using the cross-entropy between the network's predictions and the pseudo-labels, but only for predictions with confidence above a specified threshold ϑ (cf. (1)). We compute the vector of pseudo-labels y′ for every data sample x using the corresponding network output f(x) in the following manner:

    y′_i = 1 if i = arg max_{i′} f_{i′}(x), and y′_i = 0 otherwise.    (4)

Then the resulting formula based on the cross-entropy for the unsupervised loss component L_U of a particular data sample x is

    L_U(x) = − ∑_{i=1}^{|C|} y′_i log(y_i),    (5)

where |C| is the number of classes.

We also implemented two variants of the consistency-preserving, self-ensembling algorithms: the Π-model and temporal ensembling. Both approaches use the mean squared error (MSE) to compute the unsupervised loss. What differs is the target for which the MSE is evaluated. The Π-model compares two predictions of the same state of the network using different inputs and different dropped-out neurons. To augment the data for the second prediction, we multiplied the input feature vector with noise sampled from the normal distribution N(1, σ²). We chose to multiply the data with the noise instead of adding it because multiplicative noise is invariant to the differing variances of the individual features.

The second variant, temporal ensembling, compares the prediction of the network in the current epoch with the predictions obtained in the previous epoch. Dropout and data augmentation can be used as well. The unsupervised loss L_U for this approach is calculated as follows:

    L_U(x) = ∑_{i=1}^{|C|} (y_i − ỹ_i)²,    (6)

where y is the current output of the network in the training step and ỹ is the output of the network in a different state or for an augmented input.

Our open-source implementation is publicly available at https://github.com/c0zzy/semi-supervised-ann.

3.2 Validation Using a Simple Artificial Experiment

Firstly, we tried our implementations of the two semi-supervised methods mentioned above and a fully supervised baseline on a two-dimensional example. We chose simple generated moon-shaped data, which are often used for testing classification or clustering algorithms. The data consist of two classes that are linearly inseparable but do not overlap, so that the classification can be performed with no error. The advantage is that we can easily visualize the classification decision border in two dimensions and examine the behavior of the algorithms. For every method in this experiment, we used the same MLP architecture with two hidden layers, the first having 64 neurons and the second 32 neurons.

In Figure 1, we present two different arrangements of labeled and unlabeled data, each solved by fully supervised learning, Pseudo-labeling, and the Π-model. In the first experiment, we tested the ability of the algorithms to learn from a small amount of data: there are two moon-shaped clusters, each having 1000 samples, of which only 16 from each are labeled. We let each network train for 300 epochs. Even though the supervised learning had available samples distributed over the whole cluster, it was not able to learn the correct shape using only 32 samples. The Pseudo-labeling algorithm could not improve the results using the unlabeled data. However, the results of the Π-model are notably better, as it managed to capture the moon shape quite well.

In the second experiment, we tested whether the algorithms can deal with a drift in the training data. This time we used clusters with 10,000 samples and labeled only 1000 points that lie near the center, for each class. We trained the networks for 100 epochs, as running them longer did not improve the results of either of the methods. The supervised
algorithm could only use the labeled data, which are linearly separable. So it learned to classify the labeled data with zero error, and we present it only as a baseline for comparison. Pseudo-labeling again failed to use the information contained in the unlabeled data, and its accuracy was similar to that of fully supervised learning. Also in this task, the Π-model was able to use the smoothness of the data and performed the best of the three methods. To quantify the results, we summarize the prediction accuracy tested on the whole clusters in Table 1.

Table 1: A summary of test accuracy on the moon-shaped data. The table compares Pseudo-labeling, the Π-model, and fully supervised learning on test data covering the whole moon cluster. There are results of two experiments. In the first one, only 16 points out of 1000 were uniformly selected and labeled for both classes. In the second, we labeled 1000 points in the center out of 10,000 samples for both classes.

                               Test case
    Method          16 pts uniform    1000 pts in center
    Supervised          89.1 %             46.2 %
    Pseudo-label        85.4 %             42.9 %
    Π-model             95.7 %             76.0 %

Completing these experiments, we observed that the results of Pseudo-labeling correspond to the idea behind the algorithm. It makes the network's decisions more confident, as it uses the interim predictions as if they were the true labels. Also, the decision border did not seem to converge to a stable final state throughout the learning. It kept shifting closer to one or the other class, roughly in the range where the confidence of the supervised learning was low. We managed to get decent results using the Π-model, and it proved able to capture the smooth distribution of the data. However, the algorithm was susceptible to an inappropriate setting of hyperparameters. It often happened that one class became dominant during the training, and the Π-model could not recover from that.

4 Experiments with a Real-World Malware Dataset

4.1 Data

We tested our implementation using a large real-world malware detection dataset containing anonymized data provided by the company Avast. The data concern Windows Portable Executable (PE) files, which were collected during 380 weeks. Each sample consists of 540 real-valued features derived directly from the binary PE files. Unfortunately, the company did not reveal the semantics of the individual features. Each file is labeled with one of five classes: malware, adware, infected, potentially unwanted program, and clean. There were some features with zero or very low variance in the dataset. Therefore, we used principal component analysis (PCA) to reduce the dimensionality of the feature space and speed up the training. First, we min-max normalized the data between 0 and 1, and then we projected them onto the subspace spanned by the 128 main components, keeping more than 99 % of the explained variance.

4.2 Experimental Design

At first, we analyzed the hyperparameters of each algorithm and optimized those that we expected to have the greatest impact on the results during early tests of our implementation. We chose the data from the five weeks between the 50th and 55th week. We performed stratified random sampling and selected 10,000 training and 5000 testing records. We kept the labels of only 5 % of the training set, and the rest remained unlabeled. Using these data, we evaluated the classification accuracy for various sets of hyperparameters.

For the Pseudo-labeling algorithm, we optimized the threshold ϑ and the maximal weight w_max of the unsupervised loss component. For the consistency-preserving algorithms, we optimized the standard deviation σ of the noise used in data augmentation and again the parameter w_max. Furthermore, we repeated the search of parameters for all six combinations of variants of the algorithm, which were: Π-model or temporal ensembling, and whether to use dropout, augmentation, or both. We took the parameters from the following sets:

    w_max ∈ {0.1, 1, 2, 5, 10, 15, 20, 30, 50},
    σ ∈ {0.01, 0.05, 0.1, 0.15, 0.2, 0.3, 0.5},
    ϑ ∈ {0.5, 0.7, 0.8, 0.9, 0.95, 0.98, 0.99}.

However, because of the high time requirements, we restricted attention among the two similar models proposed in [12] only to the Π-model. For the same reason, we did not perform a full factorial search through all possible combinations. Instead, we optimized only one parameter at a time, keeping the others at the default values, which were: w_max = 30, σ = 0.1, and ϑ = 0.9. Among all these tuned hyperparameters, the most critical from the point of view of the predictive accuracy were the maximal weight and the standard deviation of the Π-model noise. The rest of the hyperparameters we used as stated in the original papers, or we modified them slightly according to our observations, because the domain of our dataset is entirely different. The final values of the chosen hyperparameters used in the experiments are listed in Table 2. For the fully supervised training, we enabled dropout and data augmentation in the same manner as with the Π-model. In every experiment, we used the same MLP architecture with five layers and the topology 128-96-64-32-5.

Then we measured the performance of Pseudo-labeling, the Π-model, and the purely supervised baseline
[Figure 1: six panels — (a) Supervised, (b) Supervised; (c) Pseudo-labeling, (d) Pseudo-labeling; (e) Π-model, (f) Π-model]

Figure 1: A comparison of the decision borders of the three algorithms on simple moon-shaped data. The decision border is visualized as a transition from blue to red. The saturation expresses the classification confidence of the network. The labeled data are shown as cyan or orange circles, while the unlabeled ones are drawn in gray. On the left side, we randomly labeled only 16 samples out of 2000 from each class. On the right side, we labeled 1000 samples close to the center out of 5000 from each class.
Table 2: Final setting of model hyperparameters.

    Common
    Number of training epochs             100
    Training batch size                   100
    Weight ramp-up period r_u             70
    Optimizer ramp-down period r_d        20
    Initial learning rate                 0.001

    Pseudo-labeling
    Pseudo-labeling threshold ϑ           0.9
    Maximal weight w_max                  10

    Consistency preserving
    Consistency-preserving variant        Π-model
    Use dropout                           Yes
    Use data augmentation                 Yes
    Maximal weight w_max                  20
    Standard deviation σ of the noise     0.2

for different proportions of labeled data. We varied the ratio r = |L| : (|L| + |U|) over the set of values {0.5 %, 1 %, 2 %, 5 %, 10 %, 25 %, 50 %, 75 %}. As the training union of labeled and unlabeled data, we took 10,000 stratified samples from 5 consecutive weeks and split them according to the considered ratios. Then we trained 20 separate instances of the network and calculated their average accuracy on a stratified test set of size 5000. We repeated this experiment for four arbitrarily chosen distinct groups of weeks: 1-5, 51-55, 101-105, and 151-155. We also evaluated the performance of the trained networks on the data from all of the following weeks. This is particularly interesting from the point of view of the considered application domain. Because the structure of malware changes over time, the prediction accuracy on newer data tends to get worse. That means that if semi-supervised learning could overcome this problem, it could be beneficial. Therefore, we tried to take the data from newer periods than the labeled weeks as the unlabeled training set. So we trained the network using labeled data together with

for the network. The results of the semi-supervised algorithms, Pseudo-labeling and the Π-model, are more interesting. Both algorithms bring a slight increase in accuracy for low ratios of labels. The improvement is most noticeable when only around 1 or 2 % of the labels are available. When the ratio gets above 10 %, the accuracy gain is negligible, and for higher values, the semi-supervised effect is even negative. Also, it seems that the Π-model outperforms Pseudo-labeling, as its accuracy is higher in most of the measurements.

To verify our observations, we tested whether the distributions of the predictive accuracy achieved by the three considered methods significantly differ from each other. Those distributions are shown, for the considered ratios of labeled to all data, in Figure 2, but, due to lack of space, only for the networks trained on data from the first five weeks. Firstly, we applied the Friedman test [6] to reject the hypothesis that all three methods can be considered equal. Then we performed a post hoc pairwise test to find out between which of them there were differences at the 5 % level of family-wise significance with the Holm [9] correction. We took the data from all of the following weeks and evaluated the accuracy for all considered ratios of labeled to all data, training 20 models for each of them. A significant difference between the compared methods was found for 80 among the 96 compared pairs corresponding to the 32 combinations of training weeks and ratios. We summarize the results in Table 4, where we compare the average accuracy of the three implemented methods. When we consider only the tests with a ratio of up to 5 %, where the improvement was visible, Pseudo-labeling was significantly better than supervised learning in 3 cases and the Π-model in 11 cases. Pseudo-labeling was significantly better than the Π-model only in 3 out of 14 significant comparisons.

We also visualized the progress of the classification accuracy over time for networks trained during three arbitrarily chosen sequences of 5 contiguous weeks in Figure 3. To capture the variance of the results, we plotted three quartiles. Because the accuracy oscillated greatly through the individual weeks, we used a moving average with a window size of five weeks to smooth the curves
unlabeled data from several weeks later. Unfortunately,        (the accuracy during the first five weeks, for which a win-
we did not manage to outperform the standard fully su-         dow of that size has not been available, is dashed). We
pervised learning this way using any of the implemented        can see that both semi-supervised algorithms slightly im-
methods, so we refrained from it. We present the results       proved the accuracy of the network on the roughly first
of these experiments in the following section.                 30 weeks. The Pseudo-labeling is around 1 or 2 % bet-
                                                               ter than supervised learning, while Π-model gets another
4.3   Results and Their Discussion                             1 or 2 % above the Pseudo-labeling. However, all three
                                                               trained networks share the trend of decreasing predictive
Using the hyperparameters setting presented in the previ-      accuracy during the early weeks when the moving aver-
ous section, we measured the average test accuracy of 20       age has been applied, though the number of such weeks
training runs of our three implementations in relation to      is network-specific. After around 40 weeks, the results of
the proportion of the labeled data in the training data set.   all three methods are very similar. As the properties of the
The results can be found in Table 3. We can see that the       data shift over time, the overall results on the data beyond
performance of the fully supervised learning depends on        50 weeks got considerably worse and fluctuated more for
the number of labeled data as it is the only learning source   all methods.
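To make the two semi-supervised objectives concrete, the following minimal NumPy sketch shows how the unsupervised loss terms behind the hyperparameters above could be computed. The ramp-up schedule exp(−5(1 − t)²) is the one commonly used for the Π-model [12]; the function names are illustrative and not taken from our implementation.

```python
import numpy as np

def ramp_up(epoch, max_epochs, w_max):
    """Ramp the unsupervised loss weight from ~0 up to w_max,
    following the exp(-5 * (1 - t)^2) schedule of Laine and Aila [12]."""
    t = np.clip(epoch / max_epochs, 0.0, 1.0)
    return w_max * np.exp(-5.0 * (1.0 - t) ** 2)

def pseudo_label_loss(probs, theta=0.9):
    """Cross-entropy of unlabeled samples against their own predicted
    class; samples whose top softmax probability is below the
    confidence threshold theta are ignored."""
    conf = probs.max(axis=1)
    mask = conf >= theta
    if not mask.any():
        return 0.0
    # CE against the hard pseudo-label reduces to -log of the top prob.
    return float(-np.log(conf[mask]).mean())

def pi_consistency_loss(probs_a, probs_b):
    """Mean squared difference between two stochastic forward passes
    of the same batch (different dropout masks, input noise with
    standard deviation sigma, augmentation)."""
    return float(((probs_a - probs_b) ** 2).mean())
```

In a training loop, the total loss would be the supervised cross-entropy plus `ramp_up(epoch, max_epochs, w_max)` times the respective unsupervised term, with `w_max` = 10 for Pseudo-labeling and 20 for the Π-model as in the table above.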
Table 3: Comparison of the Π-model, Pseudo-labeling and the supervised baseline, for different ratios of labeled and all data. The table depicts the percentage of the average testing accuracy in four different periods. The S columns contain the results of the supervised baseline, the ∆ Ps and ∆ Π columns show the difference using Pseudo-labeling and Π-model, respectively.

            Weeks: 1–5            Weeks: 51–55          Weeks: 101–105        Weeks: 151–155
  Ratio    S    ∆ Ps   ∆ Π       S    ∆ Ps   ∆ Π       S    ∆ Ps   ∆ Π       S    ∆ Ps   ∆ Π
  0.5 %  67.9   +0.4  +3.1     63.9   +2.8  +3.5     56.8   +5.2  +6.8     67.6   +0.3  +1.9
   1 %   71.0   +1.7  +4.5     67.1   +5.0  +6.4     61.8   +6.9  +9.0     70.3   +5.7  +6.4
   2 %   76.8   +1.3  +2.5     73.9   +3.4  +5.7     69.7   +5.9  +6.2     76.6   +1.9  +2.0
   5 %   82.4   -0.1  +1.1     82.2   +0.4  +2.4     77.8   +3.7  +3.3     80.4   +0.2  +0.8
  10 %   85.1   +0.0  +1.1     86.1   -0.4  +0.8     83.2   +1.0  +1.1     81.7   +0.3  +0.6
  25 %   88.3   -0.4  +0.3     89.2   -0.5  +0.1     87.4   +0.7  +0.3     83.1   +0.3  +0.3
  50 %   89.9   -0.4  -0.1     90.6   -0.7  +0.0     89.8   -0.3  -0.2     84.2   -0.1  -0.2
  75 %   90.4   -0.1  -0.1     91.2   -0.3  -0.3     90.7   -0.1  -0.4     84.4   +0.3  -0.2
 100 %   90.9                  91.4                  91.3                  84.8

[Figure 2: eight boxplot panels, one per ratio (0.5 %, 1 %, 2 %, 5 %, 10 %, 25 %, 50 %, 75 %); y-axis: Accuracy (0.2–0.8); x-axis: S, P, Π]
Figure 2: Boxplots summarizing the distributions of predictive accuracy achieved by supervised learning (S), pseudo-
labeling (P) and the Π-model (Π) for the considered ratios of labeled to all data and the networks trained on data from the
first five weeks
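The testing procedure behind Figure 2 and Table 4 – a Friedman test followed by Holm-corrected post hoc pairwise comparisons at the 5 % family-wise level – can be sketched as follows. The accuracies below are synthetic placeholders, not the measured values, and the choice of the Wilcoxon signed-rank test as the pairwise test is an assumption for illustration, since the text does not name the pairwise test used.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Synthetic accuracies of 20 training runs per method (placeholders).
rng = np.random.default_rng(1)
sup = rng.normal(0.68, 0.01, 20)
ps = sup + rng.normal(0.02, 0.005, 20)   # Pseudo-labeling
pi = sup + rng.normal(0.03, 0.005, 20)   # Pi-model

# Step 1: Friedman test of "all three methods perform equally".
_, p_friedman = friedmanchisquare(sup, ps, pi)

# Step 2: Holm's step-down correction of the pairwise p-values:
# sort them ascending and compare the k-th smallest to 0.05 / (m - k).
pvals = {"S vs P": wilcoxon(sup, ps).pvalue,
         "S vs Pi": wilcoxon(sup, pi).pvalue,
         "P vs Pi": wilcoxon(ps, pi).pvalue}
m = len(pvals)
significant = []
for k, (pair, p) in enumerate(sorted(pvals.items(), key=lambda kv: kv[1])):
    if p > 0.05 / (m - k):
        break                 # step-down: stop at the first failure
    significant.append(pair)

print(f"Friedman p = {p_friedman:.2e}, significant pairs: {significant}")
```

The same two-step test is repeated for every combination of training weeks and labeled-data ratio, which yields the 96 pairwise comparisons summarized in Table 4.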


5   Conclusion

In this paper, we presented an application of semi-supervised learning of deep neural networks to malware data. At the beginning, we recalled the current state of detecting malware with artificial neural networks and introduced the principles of neural semi-supervised learning. Then we outlined four semi-supervised approaches to deep learning. We covered two semi-supervised algorithms, Pseudo-labeling and the Π-model, in more detail and compared them with the fully supervised baseline. We evaluated the classification accuracy on a real-world malware dataset divided into 380 weeks by the time of the first recording of the respective binary file. Although both methods were developed for the classification of image data, the results showed that they could improve the performance of a neural network on malware data. However, the implemented algorithms have the limitation of being beneficial only when the proportion of labeled data is low, ideally around 1 %. We have also found that these semi-supervised methods can increase the accuracy on data newer than the training set, for which drift in structure is likely to occur, but only to a certain extent. Based on our experiments, the slightly more complex Π-model algorithm achieved slightly better results than Pseudo-labeling in most cases.

Acknowledgement

The research reported in this paper has been supported by the Czech Science Foundation (GAČR) grant 18-18080S. For the employed data and the work of M. Krčál, his support through the Avast fellowship is appreciated. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the program "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated.
[Figure 3: three panels of accuracy quartiles (Supervised, Pseudo-labeling, Π-model) versus week number, covering weeks 0–50, 50–100, and 100–150; the training period is marked at the start of each panel]
Figure 3: The progression of the classification accuracy on later weeks using Pseudo-labeling, the Π-model, and fully supervised learning, trained on a set with 1 % of labels. Each plot visualizes three quartiles: the median is drawn with a solid line, while the first and the third quartiles are dotted. The curves correspond to a moving average with a window size of five weeks; the dashed first five weeks of each curve are means of all the preceding weeks. The first five weeks at the beginning of each plot were used for training.
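The smoothing used for the curves in Figure 3 can be sketched as a trailing moving average; the handling of the first weeks (mean of all weeks seen so far) matches the dashed part of the curves. The function name is illustrative.

```python
import numpy as np

def moving_average(acc, window=5):
    """Smooth a per-week accuracy series with a trailing moving average.
    For the first `window - 1` weeks, where a full window is not yet
    available, the mean of all weeks so far is used instead."""
    acc = np.asarray(acc, dtype=float)
    out = np.empty_like(acc)
    for i in range(len(acc)):
        start = max(0, i - window + 1)
        out[i] = acc[start:i + 1].mean()
    return out
```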
Table 4: Multiple comparisons test of the three methods for different ratios of labeled to all data, tested on the data from all of the following weeks till the end. Each cell contains a triplet of symbols representing the results of three post hoc pairwise tests. The order of the comparisons is: supervised to Pseudo-labeling, supervised to Π-model, and Pseudo-labeling to Π-model. A dash means that the difference was not statistically significant, and the letters S, P, and Π mark whether supervised learning, Pseudo-labeling, or the Π-model was significantly better than the other compared algorithm.

                        Training weeks
  Ratio      1–5        51–55      101–105     151–155
  0.5 %    P, –, Π    P, S, Π     P, –, Π     P, –, Π
   1 %     P, –, –    S, Π, Π     P, –, Π     P, Π, Π
   2 %     S, S, Π    S, S, Π     P, Π, P     S, S, –
   5 %     –, S, Π    –, S, P     P, Π, P     P, S, Π
  10 %     –, –, Π    P, S, Π     S, S, P     S, S, Π
  25 %     P, Π, Π    S, S, Π     S, P, P     S, S, Π
  50 %     –, Π, Π    –, S, Π     S, Π, P     S, S, Π
  75 %     S, Π, –    S, Π, –     S, –, P     S, S, Π

References

 [1] B. Abolhasanzadeh. Nonlinear dimensionality reduction for intrusion detection using auto-encoder bottleneck features. In IKT: IEEE 7th Conference on Information and Knowledge Technology, pages 1–5, 2015.
 [2] J.M. Bonifacio, A.M. Cansian, A.C.P.L.F. De Carvalho, and E.S. Moreira. Neural networks applied in intrusion detection systems. In IEEE International Joint Conference on Neural Networks, pages 205–210, 1998.
 [3] J. Cannady. Artificial neural networks for misuse detection. In National Information Systems Security Conference, pages 368–381, 1998.
 [4] H. Debar, M. Becker, and D. Siboni. A neural network component for an intrusion detection system. In IEEE Computer Society Symposium on Research in Security and Privacy, pages 240–250, 1992.
 [5] O. Depren, M. Topallar, E. Anarim, and M.K. Ciliz. An intelligent intrusion detection system (IDS) for anomaly and misuse detection in computer networks. Expert Systems with Applications, 29:713–722, 2005.
 [6] M. Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32(200):675–701, 1937.
 [7] N. Gao, L. Gao, Q. Gao, and H. Wang. An intrusion detection model based on deep belief networks. In IEEE Second International Conference on Advanced Cloud and Big Data, pages 247–252, 2014.
 [8] I.J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR, pages 1–11, 2015.
 [9] S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, pages 65–70, 1979.
[10] J. Kim, J. Kim, H.L.T. Thu, and H. Kim. Long short term memory recurrent neural network classifier for intrusion detection. In PlatCon: IEEE International Conference on Platform Technology and Service, pages 1–5, 2016.
[11] D.P. Kingma and J. Ba. Adam: A method for stochastic optimization. Preprint arXiv:1412.6980, 2014.
[12] S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. In ICLR, pages 1–13, 2017.
[13] D.H. Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In WREPL: ICML Workshop Challenges in Representation Learning, pages 1–6, 2013.
[14] O. Linda, T. Vollmer, and M. Manic. Neural network based intrusion detection system for critical infrastructures. In International Joint Conference on Neural Networks, pages 1827–1834, 2009.
[15] T.F. Lunt. IDES: An intelligent system for detecting intruders. In Symposium on Computer Security, Threat and Countermeasures, pages 30–45, 1990.
[16] T. Ma, F. Wang, J. Cheng, Y. Yu, and X. Chen. A hybrid spectral clustering and deep neural network ensemble algorithm for intrusion detection in sensor networks. Sensors, 16: article no. 1701, 2016.
[17] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 1–9, 2013.
[18] T. Miyato, S.I. Maeda, M. Koyama, and S. Ishii. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41:1979–1993, 2019.
[19] S. Mukkamala, G. Janoski, and A. Sung. Intrusion detection using neural networks and support vector machines. In International Joint Conference on Neural Networks, pages 1702–1707, 2002.
[20] M. Nadeem, O. Marshall, S. Singh, X. Fang, and X. Yuan. Semi-supervised deep neural network for network intrusion detection. In Conference on Cybersecurity Education, Research and Practice, pages 0–11, 2016.
[21] A. Narayanan, C. Soh, L. Chen, Y. Liu, and L. Wang. Apk2vec: Semi-supervised multi-view representation learning for profiling Android applications. In IEEE International Conference on Data Mining, pages 357–366, 2018.
[22] A. Oliver, A. Odena, C. Raffel, E.D. Cubuk, and I.J. Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In NIPS, pages 1–19, 2018.
[23] A. Rasmus, H. Valpola, M. Honkala, M. Berglund, and T. Raiko. Semi-supervised learning with ladder networks. In NIPS, pages 1–9, 2015.
[24] B.C. Rhodes, J.A. Mahaffey, and J.D. Cannady. Multiple self-organizing maps for intrusion detection. In 23rd National Information Systems Security Conference, pages 16–19, 2000.
[25] J. Ryan, M.J. Lin, and R. Miikkulainen. Intrusion detection with neural networks. In Advances in Neural Information Processing Systems 10, pages 943–949, 1998.
[26] J. Saxe and K. Berlin. eXpose: A character-level convolutional neural network with embeddings for detecting malicious URLs, file paths and registry keys. Preprint arXiv:1702.08568, 2017.
[27] C.Z.Y. Soh. Program Analysis and Machine Learning
     Techniques for Mobile Security. PhD thesis, Nanyang Tech-
     nological University, Singapore, 2019.
[28] A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS, pages 1–16, 2017.
[29] M. Tavallaee, E. Bagheri, W. Lu, and A.A. Ghorbani. A
     detailed analysis of the KDD cup 99 data set. In IEEE
     Symposium on Computational Intelligence for Security and
     Defense Applications, pages 288–293, 2009.
[30] P. Torres, C. Catania, S. Garcia, and C.G. Garino. An analysis of recurrent neural networks for botnet detection behavior. In ARGENCON: IEEE Biennial Congress of Argentina, pages 1–6, 2016.
[31] W. Wang, Y. Sheng, J. Wang, X. Zeng, X. Ye, Y. Huang,
     and M. Zhu. HAST-IDS: Learning hierarchical spatial-
     temporal features using deep neural networks to improve
     intrusion detection. IEEE Access, 6:1792–1806, 2017.
[32] W. Wang, M. Zhu, X. Zeng, X. Ye, and Y. Sheng. Mal-
     ware traffic classification using convolutional neural net-
     work for representation learning. In ICOIN: IEEE Interna-
     tional Conference on Information Networking, pages 712–
     717, 2017.
[33] Z. Zhang, J. Li, C.N. Manikopoulos, J. Jorgenson, and
     J. Ucles. HIDE: A hierarchical network intrusion detection
     system using statistical preprocessing and neural network
     classification. In IEEE Workshop on Information Assur-
     ance and Security, pages 85–90, 2001.