<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Two Semi-supervised Approaches to Malware Detection with Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Koza</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marek Krčál</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Holeňa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Information Technology, Czech Technical University</institution>
          ,
          <addr-line>Thákurova 9, Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Rossum Czech Republic</institution>
          ,
          <addr-line>Dobratická 523, Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Semi-supervised learning is characterized by using additional information from unlabeled data. In this paper, we compare two semi-supervised algorithms for deep neural networks on a large real-world malware dataset. Specifically, we evaluate the performance of a rather straightforward method called Pseudo-labeling, which uses high-confidence predictions on unlabeled samples as if they were the actual labels. The second approach is based on the idea of increasing the consistency of the network's predictions under altered circumstances. We implemented such an algorithm, called P-model, which compares outputs under different data augmentations and different dropout settings. As a baseline, we also provide results of the same deep network trained in the fully supervised mode using only the labeled data. We analyze the prediction accuracy of the algorithms in relation to the size of the labeled part of the training dataset.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>One of the application domains that pay the closest attention to progress and new developments in machine learning is malware detection. Vendors of antivirus software cannot keep up with the increasing number of malicious programs and their increasingly sophisticated obfuscation and polymorphism without using more and more advanced machine learning methods, most importantly methods for anomaly detection, classification, and pattern recognition.</p>
      <p>
        The most successful machine learning methods for classification and pattern recognition definitely include artificial neural networks (ANN), especially deep networks. However, they have a high number of degrees of freedom and thus require a large amount of labeled training data, whereas most of the data for malware detection are unlabeled, because their labeling requires the expensive involvement of human experts. One possible way to tackle the lack of training data is semi-supervised learning. In a narrow sense, this means supervised learning that, in addition to labels, also uses some information from additional unlabeled data; in a broad sense, it means any combination of supervised learning and unlabeled data, e.g., unsupervised learning followed by supervised learning. In the context of malware detection, however, semi-supervised ANN learning is only emerging [
        <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
        ]. The work in progress reported in this paper is a small contribution to it. It restricts attention to two methods of semi-supervised ANN learning; approaches not relying on neural networks are outside its scope.
      </p>
      <p>The next section briefly reviews the use of ANN in malware detection and in the overlapping area of network intrusion detection. In Section 3, several important methods for semi-supervised ANN learning are recalled, two of which have been implemented for our research. The core Section 4 describes several experiments with a real-world malware dataset and reports their results.</p>
    </sec>
    <sec id="sec-2">
      <title>Neural Networks in Malware and Network Intrusion Detection</title>
      <p>
        As malware detection is strongly interconnected with and closely related to network intrusion detection, the use of ANN will be reviewed here in both areas. Probably the first proposal to use neural networks in these areas was made in 1990 by Lunt [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and was implemented two years later [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] in a
network trained on inputs from audit log files.
      </p>
      <p>
        The authors of [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] employed user commands as
input, but rather than trying to learn benign and malicious
command sequences, they were detecting anomalies in
frequency histograms of user commands calculated for each
user.
      </p>
      <p>
        The paper by Cannady [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] summarised the advantages and disadvantages of ANN for misuse detection. The two main advantages are seen in the flexibility with respect to incomplete, distorted, and noisy data and in the generalization ability, whereas the main disadvantage is the black-box nature of ANN.
      </p>
      <p>
        In the late 1990s and early 2000s, self-organizing maps
were quite popular in this context [
        <xref ref-type="bibr" rid="ref2 ref24 ref5">2, 5, 24</xref>
        ]. In particular,
Depren et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] used a hierarchical model in which misuse detection based on self-organizing maps (SOMs) was coupled with a random-forest-based rule system to provide not only high precision, but also some sort of explanation.
      </p>
      <p>
        Much research has been devoted to comparing
different kinds of ANN, or more generally, different classifiers
including one or more kinds of ANN, on real-world
malware detection or intrusion detection data. Probably the
most popular among such data is an extensive intrusion
detection dataset that was used at the 1999 KDD Cup [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ].
Zhang et al. [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ] compared five different kinds of ANN.
Mukkamala et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] compared a multilayer perceptron
(MLP) with support vector machines.
      </p>
      <p>
        Among more recent ANN applications to malware and
network intrusion detection, [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] should be mentioned for
using synthetically generated attack samples to train an
MLP, as well as [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] for malware detection with recurrent networks. As expected, the kinds of ANN applied to these two areas during the last decade have most often been deep networks [
        <xref ref-type="bibr" rid="ref1 ref10 ref7">1, 7, 10</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], deep learning was used together with spectral clustering to improve the detection rate of low-frequency network attacks. An important asset of deep networks is their ability to process raw inputs and learn their own features. Saxe et al. [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]
employed a convolutional neural network (CNN) to
extract features that were subsequently used as the input for
an MLP detecting malicious activities. CNNs seem to be
particularly suitable to learn spatial features of network
traffic [
        <xref ref-type="bibr" rid="ref31 ref32">31, 32</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], a CNN was additionally combined with a long short-term memory network learning temporal features from multiple network packets.
      </p>
      <p>
        To the best of our knowledge, there have so far been only two particular ANN applications to malware or network intrusion detection that included semi-supervised learning in the narrow sense. In [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], various settings of semi-supervised ladder networks (see Section 3) were compared on the above-mentioned intrusion detection dataset [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ].
In [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] (cf. also the thesis [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]), skipgram networks [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]
extended with semi-supervised learning based on Pseudo-labels (see Section 3) were used for Android malware detection. Skipgrams are neural networks embedding large sets of structured non-numeric data into low-dimensional vector spaces. Whereas in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], skipgrams were
proposed for the embedding of text (word2vec), the input set
in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] is the set of rooted subgraphs around every node
of three dependency graphs representing the API
dependencies, permission dependencies, and information source
and sink dependencies of the considered Android
application. However, skipgrams were not used directly for
malware detection in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], only for the representation learning of the structured input, whereas the malware detection itself was performed by a support vector machine. So far, no semi-supervised neural networks have been used directly for malware detection, and none have been used with unstructured inputs simply listing the values of the evaluated features, which are encountered much more frequently than dependency matrices.
    </sec>
    <sec id="sec-4">
      <title>Semi-supervised Learning of Neural Networks</title>
      <p>
        According to the overview paper [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], the following
approaches are most important for semi-supervised learning
of neural networks, especially deep networks:
(i) Pseudo-labels [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], which are ANN predictions of
the correct class for unlabeled data, provided the
network has a sufficient confidence in such a prediction.
Formally, a prediction serves as a pseudo-label for an
unlabeled input x if
max_{c∈C} f_c(x) ≥ J · ∑_{c∈C} f_c(x),   (1)
where C denotes the set of classes, f_c(x) the activity of the output neuron corresponding to the class c ∈ C for the input x, and J ∈ (0, 1) is a given threshold.
(ii) Increasing the consistency of predictions for the same
input between two instances of a neural network
differing through a random perturbation. Such a
perturbation is typically introduced through random
noise or through dropout. The overall loss function
minimized during semi-supervised learning is then
the superposition of the loss of supervised learning
and a loss reflecting the inconsistency of the
considered ANN instances. This approach was first
applied in [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] to ladder networks, which are basically
chained denoising autoencoders. In [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], two similar
kinds of neural networks using this approach to
semi-supervised ANN learning were proposed that can be
viewed as simplifications of ladder networks. The
first kind, called P-model, evaluates both randomly
differing ANN instances on each minibatch of data.
The second kind, called temporal ensembling,
evaluates only one of them and then uses its predictions
in the inconsistency loss. As a compensation,
predictions from multiple previous network evaluations are
aggregated into an ensemble prediction.
(iii) Due to targets changing only once per epoch,
temporal ensembling becomes unwieldy when learning
large datasets. To overcome this problem, an
approach called mean teacher has been proposed in
[
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. Instead of aggregating predictions, it aggregates
weights, more precisely, averages them.
(iv) In [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], the most sophisticated among the four
considered approaches has been proposed, called virtual
adversarial training, due to using a loss function
proposed by Goodfellow et al. to train networks against
adversarial inputs [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and known as adversarial loss:
L_adv(x; θ) = D[q(·|x), p(·|x + r_adv; θ)],   (2)
where r_adv = arg max_{‖r‖ ≤ ε} D[q(·|x), p(·|x + r; θ)].   (3)
In (2)–(3), q(·|x) represents our knowledge of the true conditional distribution of labels given a particular input x, whereas p(·|x; θ) represents the corresponding distribution implied by the neural network for particular values of its parameters θ, ε &gt; 0, and D is some non-negative function on pairs of probability distributions, such as the cross entropy, which was used in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
The term “virtual” refers to the fact that in semi-supervised learning, this loss is minimized on unlabeled inputs instead of on adversarial ones.
      </p>
      <p>
        So far, we have managed to implement the first two of
those approaches, the second in both variants P-model and
temporal ensembling. Some details of our implementation
are given below.
Most parts of the two algorithms we used share the same
implementation. Fundamentally, they only differ in the
way they compute the unsupervised component of the loss
function. Firstly, both methods use the same MLP
architecture with ReLU as the activation function in the
hidden layers and utilize the same optimization algorithm, Adam [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] with the initial learning rate set to 0.001, β₁ = 0.99, and β₂ = 0.999. As was shown above, the optimized loss function is defined as a weighted sum of the supervised and unsupervised losses, L = L_S + w(t)·L_U. The weight w(t) depends on the ratio between the number of labeled and all data, and on the current epoch. Following a proposal in [
proposal in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], we ramp up the value of the weight using a Gaussian curve: w(t) = w_max · |L| / (|L| + |U|) · exp(−5(1 − t)²), where t = min(e/r_u, 1), e is the number of the current epoch, r_u is the length of the ramp-up period, and w_max is a parameter specifying the maximum weight. Increasing the weight of the unsupervised loss during the training is necessary because the network needs to learn to classify the supervised data first; only later can it learn to incorporate the unlabeled information as well. Similarly, in the later phase of the training, the learning rate and the β₁ parameter of the Adam optimizer are decreased to improve exploitation: lr_e = w_d · lr_{e−1} and β₁ = 0.4·w_d + 0.5, where w_d = exp(−12.5t²), t = min(e/r_d, 1), and r_d is the length of the ramp-down period. We also included a type of elitism, selecting the resulting model with the lowest total loss per epoch, calculated with the maximal weight for the unsupervised component instead of the weight in the current epoch.</p>
      <p>The unsupervised loss in the Pseudo-labeling algorithm is calculated using the cross entropy between the network’s predictions and the pseudo-labels, but only for predictions with confidence above a specified threshold J (cf. (1)). We compute the vector of pseudo-labels y′ for every data sample x from the corresponding network output f(x) in the following manner:
y′_i = 1 if i = arg max_{j∈C} f_j(x), and y′_i = 0 otherwise.   (4)
The resulting cross-entropy formula for the unsupervised loss component L_U of a particular data sample x is then:
L_U(x) = − ∑_{i=1}^{|C|} y′_i log(y_i),   (5)
where |C| is the number of classes.</p>
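A minimal sketch of this unsupervised loss, assuming `outputs` holds rows of softmax outputs and using a hypothetical function name:

```python
import numpy as np

def pseudo_label_loss(outputs, threshold):
    """Cross entropy against one-hot pseudo-labels, computed only for
    samples whose most confident prediction exceeds the threshold J."""
    losses = []
    for y in outputs:
        if y.max() >= threshold:          # confidence gate, cf. (1)
            i = int(np.argmax(y))         # pseudo-label index, cf. (4)
            losses.append(-np.log(y[i]))  # cross entropy, cf. (5)
        else:
            losses.append(0.0)            # sample contributes no unsupervised loss
    return np.array(losses)
```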
      <p>We also implemented two variants of the consistency
preserving, self-ensembling algorithms: The P-model and
the temporal ensembling. Both approaches use mean
squared error (MSE) to compute unsupervised loss. What
is different is the target for which MSE is evaluated. The
P-model compares two predictions of the same state of
the network using different inputs and different dropped-out neurons. To augment the data for the second prediction, we multiplied the input feature vector by noise sampled from the normal distribution N(1, σ²). We chose to multiply the data by the noise instead of adding it, because multiplicative noise is invariant to the differing variances of the individual features.</p>
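The multiplicative augmentation can be sketched as follows (a minimal illustration; the function name and the use of NumPy's default generator are our assumptions):

```python
import numpy as np

def augment_multiplicative(x, sigma, rng=None):
    """Scale each feature by noise drawn from N(1, sigma^2); the size of
    the perturbation follows each feature's own scale, unlike additive noise."""
    if rng is None:
        rng = np.random.default_rng()
    return x * rng.normal(loc=1.0, scale=sigma, size=x.shape)
```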
      <p>The second variant, temporal ensembling, compares the
prediction of the network in the current epoch with the
predictions obtained in the previous epoch. The dropout and
data augmentation can be used as well. So the
unsupervised loss L_U for this approach is calculated as follows:
L_U(x) = ∑_{i=1}^{|C|} (y_i − ỹ_i)²,   (6)
where y is the current output of the network in the training step and ỹ is the output of the network in a different state or for an augmented input.</p>
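Equation (6) amounts to a squared error between two prediction vectors; as a sketch (with a hypothetical helper name):

```python
import numpy as np

def consistency_loss(y, y_tilde):
    """Unsupervised consistency loss of eq. (6): the squared error between
    the current prediction y and a second prediction y_tilde obtained under
    different dropout/augmentation (P-model) or in a previous epoch
    (temporal ensembling)."""
    y, y_tilde = np.asarray(y, dtype=float), np.asarray(y_tilde, dtype=float)
    return np.sum((y - y_tilde) ** 2, axis=-1)
```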
      <p>Our open-source implementation is publicly available at
https://github.com/c0zzy/semi-supervised-ann.</p>
      <sec id="sec-5-1">
        <title>Validation using a simple artificial experiment</title>
        <p>Firstly, we tried our implementations of two
semi-supervised methods mentioned above and a fully supervised
baseline on a two-dimensional example. We chose simple generated moon-shaped data, which are often used for testing classification or clustering algorithms. The data consist of two classes that are linearly inseparable but do not overlap, so that the classification can be performed with no error. The advantage is that we can easily visualize the
classification decision border in two dimensions and
examine the behavior of the algorithm. For every method in
this experiment, we used the same MLP architecture with
two hidden layers, the first having 64 neurons and the
second 32 neurons.</p>
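Data of this kind can be generated, e.g., with scikit-learn's `make_moons`; the following hand-rolled stand-in (parameter names and the labeling scheme are our assumptions) sketches the setup of the first experiment:

```python
import numpy as np

def moons_with_few_labels(n_per_class=1000, n_labeled=16, noise=0.05, seed=0):
    """Two interleaving moon-shaped clusters; labels are kept for only
    n_labeled samples per class, the rest are treated as unlabeled."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, np.pi, size=n_per_class)
    upper = np.column_stack([np.cos(t), np.sin(t)])              # class 0
    lower = np.column_stack([1.0 - np.cos(t), 0.5 - np.sin(t)])  # class 1
    X = np.vstack([upper, lower]) + rng.normal(0.0, noise, (2 * n_per_class, 2))
    y = np.repeat([0, 1], n_per_class)
    labeled = np.zeros(2 * n_per_class, dtype=bool)
    labeled[rng.choice(n_per_class, n_labeled, replace=False)] = True
    labeled[n_per_class + rng.choice(n_per_class, n_labeled, replace=False)] = True
    return X, y, labeled
```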
        <p>In Figure 1, we present two different arrangements of
labeled and unlabeled data, each solved by the fully
supervised learning, Pseudo-labeling, and P-model. In the
first experiment, we tested the ability of the algorithms to learn from a small amount of data: there are two moon-shaped clusters, each having 1000 samples, of which only 16 per cluster are labeled. We let each network train for 300 epochs. Even though the supervised learning had available samples distributed over the whole cluster, it was not able to learn the correct shape using only 32 samples. The Pseudo-labeling algorithm could not improve the results using the unlabeled data. However, the results of the P-model are notably better, as it managed to capture the moon shape quite well.</p>
        <p>In the second experiment, we tested whether the algorithms can deal with a drift in the training data. This time we used clusters with 10,000 samples and, for each class, labeled only the 1000 points that lie near the center. We trained the
networks for 100 epochs as having it run longer did not
improve the results of either of the methods. The supervised
algorithm could only use the labeled data that are linearly
separable. So it learned to classify the labeled data with
zero error, and we present it only as a baseline for
comparison. Pseudo-labeling again failed to use the information
contained in the unlabeled data, and its accuracy was
similar to the fully supervised learning. Also in this task, the
P-model was able to use the smoothness of the data and performed the best of the three methods. To quantify the results, we summarized the prediction accuracy tested on the whole clusters in Table 1.</p>
        <p>Completing these experiments, we observed that the
results of the Pseudo-labeling correspond to the idea behind
the algorithm. It makes the network’s decision more
confident as it uses the interim predictions as if they were the
true labels. Also, the decision border did not seem to converge to a stable final state throughout the learning. It kept shifting closer to one or the other class, roughly in the range where the confidence of the supervised learning was low. We managed to get decent results using the P-model, and it proved able to capture the smooth distribution of the data. However, the algorithm was sensitive to inappropriate settings of hyperparameters. It often happened that one class became dominant during the training, and the P-model could not recover from that.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Experiments with a Real-World Malware Dataset</title>
      <sec id="sec-7-1">
        <title>Data</title>
        <p>We tested our implementation using a large real-world
malware detection dataset containing anonymized data
provided by the company Avast. The data concern Windows Portable Executable (PE) files, which were collected over 380 weeks. The dataset consists of 540 real-valued features derived directly from the binary PE files. Unfortunately, the company did not reveal the semantics of the
individual features. Each file is labeled with one of the five
classes: malware, adware, infected, potentially unwanted
program, and clean. There were some features with zero
or very low variance in the dataset. Therefore we used
principal component analysis (PCA) to reduce the
dimensionality of the feature space and speed up the training.
First, we min-max normalized the data between 0 and 1, and then we projected them onto the subspace spanned by the first 128 principal components, keeping more than 99 % of the explained variance.</p>
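The preprocessing can be sketched as follows (a minimal NumPy-based illustration; the authors' actual code is not shown, and the function name is ours):

```python
import numpy as np

def minmax_pca(X, n_components=128):
    """Min-max normalize each feature to [0, 1], then project the centered
    data onto the first n_components principal components via SVD."""
    X = np.asarray(X, dtype=float)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                      # guard zero-variance features
    Xn = (X - X.min(axis=0)) / span
    Xc = Xn - Xn.mean(axis=0)
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    return Xc @ vt[:n_components].T, explained[n_components - 1]
```

On the malware dataset, 128 components retained more than 99 % of the variance; the second return value reports that fraction.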
      </sec>
      <sec id="sec-7-2">
        <title>Experimental Design</title>
        <p>At first, we analyzed the hyperparameters of each
algorithm and optimized those that we expected to have the
greatest impact on the results during early tests of our
implementation. We chose the data from five weeks
between 50th and 55th week. We performed stratified
random sampling and selected 10,000 training and 5000
testing records. We kept labels for only 5 % of the training set, and the rest remained unlabeled. Using this
data, we evaluated the classification accuracy for various
sets of hyperparameters.</p>
        <p>For the Pseudo-labeling algorithm, we optimized the
threshold J and the maximal weight wmax for the
unsupervised loss component. For the consistency preserving
algorithms, we optimized the standard deviation s of the
noise used in data augmentation and again the parameter
wmax. Furthermore, we repeated the search of parameters
for all six combinations of variants of the algorithm, which
were: P-model or temporal ensembling and whether to
use dropout, augmentation or both. We took the
parameters from the following sets:</p>
        <p>w_max ∈ {0.1, 1, 2, 5, 10, 15, 20, 30, 50},
σ ∈ {0.01, 0.05, 0.1, 0.15, 0.2, 0.3, 0.5},</p>
        <p>
          J ∈ {0.5, 0.7, 0.8, 0.9, 0.95, 0.98, 0.99}.
However, because of the high time requirements, we
restricted attention among the two similar models proposed
in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] only to the P-model. For the same reason, we did
not perform the full factorial search through all possible
combinations. Instead, we optimized only one parameter at a time, keeping the others at default values, which were: w_max = 30, σ = 0.1, and J = 0.9. Among all these tuned
hyperparameters, the most critical from the point of view
of the predictive accuracy were the maximal weight, and
the standard deviation of the P-model noise. The rest of the hyperparameters we used as stated in the original papers, or we modified them slightly according to our observations, because the domain of our dataset is entirely different. The final values of the chosen hyperparameters used in the experiments are given in Table 2. For the fully supervised
training, we enabled the dropout and the data
augmentation in the same manner as with the P-model. In every
experiment, we used the same MLP architecture with five
layers and the topology 128-96-64-32-5.
        </p>
        <p>Then we measured the performance of the Pseudo-labeling, the P-model, and the purely supervised baseline for different proportions of labeled data. We varied the ratio r = |L| : (|L| + |U|) over the set of values {0.5 %, 1 %, 2 %, 5 %, 10 %, 25 %, 50 %, 75 %}. As the
training union of labeled and unlabeled data, we took 10,000
stratified samples from 5 consecutive weeks and split them into the considered ratios. Then we trained 20 separate
instances of the network and calculated the average
accuracy on a stratified test set of size 5000 for them. We
repeated this experiment for four arbitrarily chosen distinct
groups of weeks: 1-5, 51-55, 101-105, and 151-155. We
also evaluated the performance of trained networks on the
data from all of the following weeks. This is particularly
interesting from the point of view of the considered
application domain. Because the structure of malware changes
over time, the prediction accuracy of the newer data tends
to get worse. That means that if semi-supervised
learning could overcome this problem, it could be beneficial.
Therefore, we tried to take the data from newer periods
than the labeled weeks as the unlabeled training set. So
we trained the network using labeled data together with
unlabeled data from several weeks later. Unfortunately,
we did not manage to outperform the standard fully
supervised learning this way using any of the implemented
methods, so we abandoned this approach. We present the results of these experiments in the following section.</p>
      </sec>
      <sec id="sec-7-3">
        <title>Results and Their Discussion</title>
        <p>Using the hyperparameters setting presented in the
previous section, we measured the average test accuracy of 20
training runs of our three implementations in relation to
the proportion of the labeled data in the training data set.
The results can be found in Table 3. We can see that the
performance of the fully supervised learning depends on
the number of labeled data as it is the only learning source
for the network. The results of the semi-supervised
algorithms Pseudo-labeling and P-model are more interesting.
Both algorithms bring a slight increase in accuracy for low ratios of labels. The most noticeable improvement occurs when there are only around 1 or 2 % of labels. When
the ratio gets above 10 %, the accuracy gain is
negligible, and for the higher values, the semi-supervised effect
is even negative. Also, it seems that P-model outperforms
Pseudo-labeling, as its accuracy is higher in most of the
measurements.</p>
        <p>
          To verify our observations, we tested whether the
distributions of predictive accuracy achieved by the three
considered methods significantly differ from each other.
Those distributions are shown in Figure 2 for the considered ratios of labeled to all data, but – due to lack of space – only for the networks trained on the data from the first five weeks. Firstly, we applied the Friedman test [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] to reject
the hypothesis that all three methods can be considered
equal. Then we performed a post hoc pairwise test to find
out among which of them there were differences at the 5
% level of family-wise significance with Holm [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]
correction. We took the data from all of the following weeks and
evaluated the accuracy for all considered ratios of labeled
and all data, training for each of them 20 models. A
significant difference between the compared methods was found
for 80 among the 96 compared pairs corresponding to the
32 combinations of training weeks and ratios. We
summarized the results in Table 4, where we compared the
average accuracy for the three implemented methods. When
we consider only tests with ratio up to 5 %, where the
improvement was visible, then the Pseudo-labeling was
significantly better than supervised learning in 3 cases and the
P-model in 11 cases. Pseudo-labeling was significantly
better than P-model only in 3 out of 14 significant
comparisons.
        </p>
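For reference, Holm's step-down correction used above is simple to state; the following pure-Python sketch is our own illustration, not the authors' code:

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure: compare the k-th smallest p-value
    against alpha/(m-k) and stop at the first non-rejection.
    Returns rejection decisions in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if p_values[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break
    return reject
```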
        <p>We also visualized the progress of the classification
accuracy over time for networks trained during three
arbitrarily chosen sequences of 5 contiguous weeks in
Figure 3. To capture the variance of the results, we plotted
three quartiles. Because the accuracy oscillated greatly
through the individual weeks, we used a moving average
with a window size of five weeks to smooth the curves
(the accuracy during the first five weeks, for which a
window of that size has not been available, is dashed). We
can see that both semi-supervised algorithms slightly
improved the accuracy of the network on the roughly first
30 weeks. The Pseudo-labeling is around 1 or 2 %
better than supervised learning, while P-model gets another
1 or 2 % above the Pseudo-labeling. However, all three
trained networks share the trend of decreasing predictive
accuracy during the early weeks when the moving
average has been applied, though the number of such weeks
is network-specific. After around 40 weeks, the results of
all three methods are very similar. As the properties of the
data shift over time, the overall results on the data beyond
50 weeks got considerably worse and fluctuated more for all methods.</p>
        <p>In this paper, we presented an application of semi-supervised learning of deep neural networks to malware data.
At the beginning, we recalled the current state of detecting
malware with artificial neural networks and introduced the
principles of neural semi-supervised learning. Then we
outlined four semi-supervised approaches to deep
learning. We covered two semi-supervised algorithms,
Pseudo-labeling and P-model, in more detail and compared them
with the fully supervised baseline. We evaluated the
classification accuracy on a real-world malware dataset divided
into 380 weeks by the time of the first recording of the
respective binary file. Although both methods were originally developed for the classification of image data, the results showed that they could improve the performance of a neural network on malware data. However, the implemented algorithms have the limitation of being beneficial only when the proportion of labeled data is low, ideally around 1 %.
We have also found that these semi-supervised methods
can increase the accuracy on data newer than the training
set, for which drift in structure is likely to occur, but only
to a certain extent. Based on our experiments, the slightly more complex P-model algorithm achieved slightly better results than Pseudo-labeling in most cases.</p>
      </sec>
      <sec id="sec-7-4">
        <title>Acknowledgement</title>
        <p>The research reported in this paper has been supported by
the Czech Science Foundation (GAČR) grant 18-18080S. For the employed data and the work of M. Krčál, his support through an Avast fellowship is appreciated. Access
to computing and storage facilities owned by parties and
projects contributing to the National Grid Infrastructure
MetaCentrum provided under the program "Projects of
Large Research, Development, and Innovations
Infrastructures" (CESNET LM2015042), is greatly appreciated.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Abolhasanzadeh</surname>
          </string-name>
          .
          <article-title>Nonlinear dimensionality reduction for intrusion detection using auto-encoder bottleneck features</article-title>
          .
          <source>In IKT: IEEE 7th Conference on Information and Knowledge Technology</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.M.</given-names>
            <surname>Bonifacio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.M.</given-names>
            <surname>Cansian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.C.P.L.F.</given-names>
            <surname>De Carvalho</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.S.</given-names>
            <surname>Moreira</surname>
          </string-name>
          .
          <article-title>Neural networks applied in intrusion detection systems</article-title>
          .
          <source>In IEEE International Joint Conference on Neural Networks</source>
          , pages
          <fpage>205</fpage>
          -
          <lpage>210</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cannady</surname>
          </string-name>
          .
          <article-title>Artificial neural networks for misuse detection</article-title>
          .
          <source>In National Information Systems Security Conference</source>
          , pages
          <fpage>368</fpage>
          -
          <lpage>381</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Debar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Becker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Siboni</surname>
          </string-name>
          .
          <article-title>A neural network component for an intrusion detection system</article-title>
          .
          <source>In IEEE Computer Society Symposium on Research in Security and Privacy</source>
          , pages
          <fpage>240</fpage>
          -
          <lpage>250</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Depren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Topallar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Anarim</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.K.</given-names>
            <surname>Ciliz</surname>
          </string-name>
          .
          <article-title>An intelligent intrusion detection system (IDS) for anomaly and misuse detection in computer networks</article-title>
          .
          <source>Expert Systems with Applications</source>
          ,
          <volume>29</volume>
          :
          <fpage>713</fpage>
          -
          <lpage>722</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <article-title>The use of ranks to avoid the assumption of normality implicit in the analysis of variance</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          ,
          <volume>32</volume>
          (
          <issue>200</issue>
          ):
          <fpage>675</fpage>
          -
          <lpage>701</lpage>
          ,
          <year>1937</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Gao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>An intrusion detection model based on deep belief networks</article-title>
          .
          <source>In IEEE Second International Conference on Advanced Cloud and Big Data</source>
          , pages
          <fpage>247</fpage>
          -
          <lpage>252</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.J.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shlens</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          .
          <article-title>Explaining and harnessing adversarial examples</article-title>
          .
          <source>In ICLR</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Holm</surname>
          </string-name>
          .
          <article-title>A simple sequentially rejective multiple test procedure</article-title>
          .
          <source>Scandinavian Journal of Statistics</source>
          , pages
          <fpage>65</fpage>
          -
          <lpage>70</lpage>
          ,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Le Thi Thu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <article-title>Long short term memory recurrent neural network classifier for intrusion detection</article-title>
          .
          <source>In PlatCon: IEEE International Conference on Platform Technology and Service</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Laine</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Aila</surname>
          </string-name>
          .
          <article-title>Temporal ensembling for semi-supervised learning</article-title>
          .
          <source>In ICLR</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.H.</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks</article-title>
          .
          <source>In WREPL: ICML Workshop Challenges in Representation Learning</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>O.</given-names>
            <surname>Linda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Vollmer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Manic</surname>
          </string-name>
          .
          <article-title>Neural network based intrusion detection system for critical infrastructures</article-title>
          .
          <source>In International Joint Conference on Neural Networks</source>
          , pages
          <fpage>1827</fpage>
          -
          <lpage>1834</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.F.</given-names>
            <surname>Lunt</surname>
          </string-name>
          .
          <article-title>IDES: An intelligent system for detecting intruders</article-title>
          .
          <source>In Symposium on Computer Security, Threat and Countermeasures</source>
          , pages
          <fpage>30</fpage>
          -
          <lpage>45</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>A hybrid spectral clustering and deep neural network ensemble algorithm for intrusion detection in sensor networks</article-title>
          .
          <source>Sensors</source>
          ,
          <volume>16</volume>
          :article no.
          <issue>1701</issue>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In NIPS</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>T.</given-names>
            <surname>Miyato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.I.</given-names>
            <surname>Maeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koyama</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ishii</surname>
          </string-name>
          .
          <article-title>Virtual adversarial training: A regularization method for supervised and semi-supervised learning</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <volume>41</volume>
          :
          <fpage>1979</fpage>
          -
          <lpage>1993</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mukkamala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Janoski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Sung</surname>
          </string-name>
          .
          <article-title>Intrusion detection using neural networks and support vector machines</article-title>
          .
          <source>In International Joint Conference on Neural Networks</source>
          , pages
          <fpage>1702</fpage>
          -
          <lpage>1707</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nadeem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Marshall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Fang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Yuan</surname>
          </string-name>
          .
          <article-title>Semi-supervised deep neural network for network intrusion detection</article-title>
          .
          <source>In Conference on Cybersecurity Education, Research and Practice</source>
          , pages
          <fpage>0</fpage>
          -
          <lpage>11</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Soh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Apk2vec: Semi-supervised multi-view representation learning for profiling Android applications</article-title>
          .
          <source>In IEEE International Conference on Data Mining</source>
          , pages
          <fpage>357</fpage>
          -
          <lpage>366</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Odena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.D.</given-names>
            <surname>Cubuk</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.J.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          .
          <article-title>Realistic evaluation of deep semi-supervised learning algorithms</article-title>
          .
          <source>In NIPS</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rasmus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Valpola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Honkala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Berglund</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Raiko</surname>
          </string-name>
          .
          <article-title>Semi-supervised learning with ladder networks</article-title>
          .
          <source>In NIPS</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>B.C.</given-names>
            <surname>Rhodes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.A.</given-names>
            <surname>Mahaffey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.D.</given-names>
            <surname>Cannady</surname>
          </string-name>
          .
          <article-title>Multiple self-organizing maps for intrusion detection</article-title>
          .
          <source>In 23rd National Information Systems Security Conference</source>
          , pages
          <fpage>16</fpage>
          -
          <lpage>19</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ryan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.J.</given-names>
            <surname>Lin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Miikkulainen</surname>
          </string-name>
          .
          <article-title>Intrusion detection with neural networks</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          <volume>10</volume>
          , pages
          <fpage>943</fpage>
          -
          <lpage>949</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Saxe</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Berlin</surname>
          </string-name>
          .
          <article-title>eXpose: A character-level convolutional neural network with embeddings for detecting malicious URLs, file paths and registry keys</article-title>
          .
          <source>arXiv preprint arXiv:1702.08568</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>C.Z.Y.</given-names>
            <surname>Soh</surname>
          </string-name>
          .
          <article-title>Program Analysis and Machine Learning Techniques for Mobile Security</article-title>
          .
          <source>PhD thesis</source>
          , Nanyang Technological University, Singapore,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tarvainen</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Valpola</surname>
          </string-name>
          .
          <article-title>Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results</article-title>
          .
          <source>In NIPS</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tavallaee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bagheri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.A.</given-names>
            <surname>Ghorbani</surname>
          </string-name>
          .
          <article-title>A detailed analysis of the KDD cup 99 data set</article-title>
          .
          <source>In IEEE Symposium on Computational Intelligence for Security and Defense Applications</source>
          , pages
          <fpage>288</fpage>
          -
          <lpage>293</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>P.</given-names>
            <surname>Torres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Catania</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Garcia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.G.</given-names>
            <surname>Garino</surname>
          </string-name>
          .
          <article-title>An analysis of recurrent neural networks for botnet detection behavior</article-title>
          .
          <source>In ARGENCON: IEEE biennial congress of Argentina</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>HAST-IDS: Learning hierarchical spatial-temporal features using deep neural networks to improve intrusion detection</article-title>
          .
          <source>IEEE Access</source>
          ,
          <volume>6</volume>
          :
          <fpage>1792</fpage>
          -
          <lpage>1806</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ye</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheng</surname>
          </string-name>
          .
          <article-title>Malware traffic classification using convolutional neural network for representation learning</article-title>
          .
          <source>In ICOIN: IEEE International Conference on Information Networking</source>
          , pages
          <fpage>712</fpage>
          -
          <lpage>717</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.N.</given-names>
            <surname>Manikopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jorgenson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ucles</surname>
          </string-name>
          .
          <article-title>HIDE: A hierarchical network intrusion detection system using statistical preprocessing and neural network classification</article-title>
          .
          <source>In IEEE Workshop on Information Assurance and Security</source>
          , pages
          <fpage>85</fpage>
          -
          <lpage>90</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>