<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semi-supervised and Active Learning in Video Scene Classification from Statistical Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tomáš Šabata</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Petr Pulc</string-name>
          <email>pulc@cs.cas.cz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Holeňa</string-name>
          <email>martin@cs.cas.cz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Information Technology, Czech Technical University in Prague</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Computer Science of the Czech Academy of Sciences</institution>
          ,
          <addr-line>Prague, Czech Republic</addr-line>
        </aff>
      </contrib-group>
      <fpage>24</fpage>
      <lpage>35</lpage>
      <abstract>
        <p>In multimedia classification, the background is usually considered an unwanted part of the input data and is often modeled only to be removed in later processing. Contrary to that, we believe that a background model (i.e., the scene in which the picture or video shot is taken) should be included as an essential feature for both indexing and follow-up content processing. Information about the image background, however, is not usually the main target of the labeling process, and the number of annotated samples is very limited. Therefore, we propose to use a combination of semi-supervised and active learning to improve the performance of our scene classifier, specifically a combination of self-training with uncertainty sampling. As a result, we utilize a combination of a statistical feature extractor, a feed-forward neural network and a support vector machine classifier, which consistently achieves higher accuracy on less diverse data. With the proposed approach, we are currently able to achieve precision over 80% on a dataset trained on a single series of a popular TV show.</p>
      </abstract>
      <kwd-group>
        <kwd>video data</kwd>
        <kwd>scene classification</kwd>
        <kwd>semi-supervised learning</kwd>
        <kwd>active learning</kwd>
        <kwd>colour statistics</kwd>
        <kwd>feedforward neural networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Automatic multimedia content labeling is still a comparatively difficult domain
for machine learning. High input data dimensionality requires large training data
sets, especially for approaches that are designed without prior assumptions on
the data properties.</p>
      <p>Moreover, the increasing resolution of image sensors brings higher detail (and
thus, at least in theory, more information), but poses a significant issue for
training phases of virtually all machine learning algorithms.</p>
      <p>
        Many approaches, therefore, have to introduce a trade-off concerning the
number of involved parameters, the number of distinct output labels (classes)
[
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] and the resolution of the input imagery [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Alternatively, they have to use
only the statistical properties of the input data (as [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and many others).
We also need to tackle the limitation on the amount of labeled training data.
      </p>
      <p>
        Recent trends in video content processing include a task usually called Video to Text. The primary objective of such processing is to take multimedia content and describe its main features in a human-comprehensible text. Such a representation may contain gathered information on the scene, actors, objects and the actions in which they are involved, for example the single-image description “baseball player is throwing ball in game,” as presented in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>Current approaches, however, commonly omit the information concerning
the visual appearance of the background in complex multimedia content – even
though such information might provide substantial contextual information for
the object detection and event description itself. Approaches that use neural
networks are mostly data-driven and require large amounts of data to adapt to each
selected class. This requirement is, however, seldom met in smaller multimedia
collections, such as home video, university lecture recordings, movie studios or
corporate media databases.</p>
      <p>We also want to reflect that a particular scene can be recalled by a human
from a couple of static frames. Therefore, manual scene labeling is a relatively
easy task as opposed to event labeling that may need the full video sequence
or object labeling that commonly requires drawing a bounding box around the
annotated object.</p>
      <p>To use the limited human involvement in scene labeling as efficiently as
possible, we employ semi-supervised learning to allow making use of unlabeled data,
which are substantially easier to obtain, whereas simultaneously selecting the
data for annotation using active learning methods.</p>
      <p>The rest of this paper is organized as follows: In Section 2, we briefly
summarize the state of the art in scene classification in the context of single images
without significant obstruction by foreground objects, as well as the state of the
art in combining semi-supervised learning (SL) and active learning (AL). Section
3 describes our approach to scene recognition in video content. In Section 4, we
compare the accuracy of our method for different approaches to feature selection
and different classifiers.</p>
    </sec>
    <sec id="sec-2">
      <title>State of the Art</title>
      <p>Scene recognition is rather simple from the human perspective. Whether the scene is the same as one previously visited is recognized from the overall layout of the space, the presence and distribution of distinct objects, and their texture and color. Other sensory organs can provide even more information and allow faster recall. Scenes not visited beforehand may, after a thorough exploration, fall into one of the broader categories based on the similarity of such features.</p>
      <p>Multimedia content, however, does not allow such space exploration directly.
It is constrained to the color information of individual pixels at a rather small
resolution. Video content resolves this issue only partially with a motion of
the camera, which, on the other hand, introduces more degrees of freedom in
background modeling and increases its complexity.</p>
      <sec id="sec-2-1">
        <title>Single Image Scene Classifiers Based on Colour Statistics</title>
        <p>
          The early scene classifiers, including the Indoor/Outdoor problem [
          <xref ref-type="bibr" rid="ref22 ref27">22,27</xref>
          ], and
also the more recent approaches mentioned below are directly based on the
overall color information contained in the picture. The vital decision in this
particular case is the selection of color space and the granularity of the considered
histograms.
        </p>
        <p>
          RGB (red, green and blue components) is the primary color space of
multimedia acquisition and processing. However, it does not directly encode the
quality of the color perceived by a human. By qualities of color, we primarily
mean the color shade (hue). In HSV encoding (hue, saturation and value of the
black/white range components, the last of them related to the overall lightness
of the color), hue is commonly sampled with finer precision (narrower bins in
histogram approaches) than saturation and lightness [
          <xref ref-type="bibr" rid="ref5 ref8">5,8</xref>
          ].
        </p>
        <p>
          Mainly because of memory consumption and model size, statistical features
of the individual images are commonly used for image processing, including basic
scene classification. Other approaches are based on object detection [
          <xref ref-type="bibr" rid="ref11 ref15">11,15</xref>
          ], on
interest point description [
          <xref ref-type="bibr" rid="ref2 ref3">3,2</xref>
          ], or in recent years they use deep convolutional
neural networks [
          <xref ref-type="bibr" rid="ref26 ref29 ref32">26,29,32</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Multi-label Extension</title>
        <p>Often, a single image contains multiple semantic features – such as sea, beach and mountains. A crisp classification into only one class would, however, have to take only the dominant class, which might be different from the selection of the annotator. A possible extension is to create a new crisp class for each encountered combination of labels, but this would have a substantial impact in areas where the amount of labeled content is not sufficient to enable proper training on such sub-classes.</p>
        <p>Another possibility is to organize the labels into a hierarchical structure. If
the described scenery shares multiple features, the parent label may be preferred
for content description. When the scene classifier detects only a specific part of
the scenery, we should not consider it a full miss.</p>
        <p>Statistical approach. One of the common assumptions in scene classification is that, during a single shot, the background will be visible for a more extended period than the foreground object. Therefore, we may process each frame in a single shot by a scene recognition algorithm and vote among the proposed labels. The statistical approach to background modeling applies if we assume a static camera shot. When such an assumption is met, all frames are perfectly aligned, and the background model can be extracted from the long-term pixel averages.</p>
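<p>Under the static-camera assumption above, the long-term pixel average can be computed directly. The following is a minimal NumPy sketch; the frame array and its dimensions are illustrative, not from the paper's pipeline:</p>

```python
import numpy as np

def background_model(frames):
    """Estimate a static-camera background as the long-term
    per-pixel average over all frames of a shot.

    frames: array of shape (T, H, W, C) holding pixel intensities.
    """
    frames = np.asarray(frames, dtype=np.float64)
    return frames.mean(axis=0)

# Toy shot: a constant background disturbed by transient noise.
rng = np.random.default_rng(0)
bg = np.full((4, 4, 3), 128.0)
shot = np.stack([bg + rng.normal(0, 5, bg.shape) for _ in range(100)])
model = background_model(shot)  # close to the true background
```

<p>Averaging suppresses short-lived foreground pixels, so the estimate approaches the true background as the shot length grows.</p>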
      </sec>
      <sec id="sec-2-3">
        <title>Semi-supervised Learning and Active Learning</title>
        <p>
          Semi-supervised learning (cf. the survey [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ]) is a technique that benefits from making use of easily obtainable unlabeled data for training. In this paper, we mainly focus on the self-training approach to semi-supervised learning [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. It is a simple and efficient method in which samples with the most confidently predicted labels (pseudo-labels) are added to the training dataset, and the model is retrained in each iteration. Other approaches to semi-supervised learning are co-training [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and multiview training [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], which benefit from agreement among multiple learners.
        </p>
        <p>
          Active learning (cf. the survey [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]) is related to semi-supervised learning in that it is also used in machine learning problems where obtaining unlabeled data is cheap and manual labeling is expensive but possible. Its goal is to spend a given annotation budget only on the most informative instances of the unlabeled data. Most commonly, it is performed as pool-based sampling [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], assuming a small set of labeled data and a large set of unlabeled data. Samples that were found to be the most informative are given to an annotator and are moved into the labeled set. The considered machine learning model (e.g., a classifier) is retrained, and the algorithm iterates until the budget is exhausted or the performance of the model is satisfactory.
        </p>
        <p>
          Pool-based sampling needs to evaluate a utility function that estimates the usefulness of knowing the label of a particular sample. There are various ways of defining the utility function: for example, as a measure of uncertainty in uncertainty sampling [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], as the number of disagreements within an ensemble of diverse models in a method called query-by-committee [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ], as the expected
model change [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], the expected error [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] or only the variance part of the model
error [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>
          Semi-supervised and active learning can be quite naturally combined, since they address the unlabeled data set from opposite ends. For example, self-training turns the most confidently predicted samples into labeled samples, whereas uncertainty sampling queries the most uncertain samples and obtains their labels from an annotator. Such a combination was used for various problems [
          <xref ref-type="bibr" rid="ref16 ref21 ref31">16,21,31</xref>
          ]. Successful combinations with active learning exist also for multiview training [
          <xref ref-type="bibr" rid="ref17 ref18 ref30">17,18,30</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Multimedia Histogram Processing with Feed-Forward Neural Network using SVM</title>
      <p>In the reported research, our main concern is to enable an automatic annotation of small datasets with a generally small variation within the individual classes. For example, we are not particularly interested in recognizing a broader scenery concept (such as a living room); rather, we aim at classifying that the video shot was captured in one specific living room.</p>
      <p>One of the possible applications, on which we will demonstrate our approach
in the next section, is the classification of individual scenes in long-running shows
and sit-coms. However, our approach is designed to be versatile and enable,
for example, disambiguation of individual television news studios or well-known
sites.</p>
      <p>Another concern of ours is that the training of the classifier should require a minimal amount of resources, to enable connection into more complex systems of multimedia content description as a simple high-level scene disambiguation module.</p>
      <p>Therefore, we revise the traditional approaches in scene classification and
propose the use of color histograms, possibly with partial spatial awareness. To
demonstrate our reasoning behind this step, we refer to Figure 1.</p>
      <p>Fig. 1: Representative frames from two distinct living rooms ((a) Room 4A, (b) Room 4B) and a comparison of the proposed histograms (c). Although both of these pictures depict a living room, the distribution of colours is different. Source images courtesy of CBS Entertainment.</p>
      <p>We choose a feed-forward neural network as the base classifier. In particular, we use a network with two hidden layers of 100 and 50 neurons and the logistic sigmoid as the activation function. The output layer uses the softmax activation function. The network is trained using backpropagation with a negative log-likelihood loss function and a stochastic gradient descent optimizer. The network topology, activation function and optimizer were found through a simple grid search, in which we also considered other activation functions, such as ReLU or the hyperbolic tangent, and another optimizer based on adaptive estimates of the first and second moments of the gradients [?].</p>
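<p>As a rough illustration, the described topology can be reproduced with scikit-learn; this is a sketch in which synthetic data (make_classification) stand in for the averaged-histogram features, and the hyperparameter values beyond the topology are our own choices:</p>

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the averaged-histogram features.
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=10, n_classes=3, random_state=0)

# Two hidden layers of 100 and 50 neurons with logistic sigmoid
# activations, trained by stochastic gradient descent; scikit-learn's
# log-loss on the softmax output plays the role of the negative
# log-likelihood loss.
fnn = MLPClassifier(hidden_layer_sizes=(100, 50), activation='logistic',
                    solver='sgd', learning_rate_init=0.1,
                    max_iter=500, random_state=0)
fnn.fit(X, y)
```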
      <p>
        For the scene classification task, we can use the trained neural network directly. However, we introduce an improvement inspired by transfer learning. Transfer learning is usually used in deep convolutional neural nets, where the convergence of all parameters is slower [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. However, we would like to demonstrate that transfer learning can bring a substantial benefit also in shallow neural networks, especially in combination with a support vector machine (SVM) classifier.
      </p>
      <p>In our scenario, we freeze the parameters of the first layers and use the network as a feature extractor. For the classification stage, the original softmax layer is then replaced with a linear support vector machine. This brings us a rather small but consistent improvement in the final accuracy.</p>
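<p>A sketch of this transfer step with scikit-learn on synthetic stand-in data: the hidden layers of a fitted MLPClassifier are applied as a frozen feature extractor (the helper hidden_features below is our own, reading the fitted weight matrices), and a linear SVM replaces the softmax layer:</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_features(mlp, X):
    """Forward pass through the two frozen hidden layers only."""
    h = sigmoid(X @ mlp.coefs_[0] + mlp.intercepts_[0])
    return sigmoid(h @ mlp.coefs_[1] + mlp.intercepts_[1])

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=10, n_classes=3, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(100, 50), activation='logistic',
                    solver='sgd', learning_rate_init=0.1,
                    max_iter=500, random_state=0).fit(X, y)

# The 50-dimensional embedding from the frozen layers feeds a linear SVM.
emb = hidden_features(mlp, X)
svm = LinearSVC().fit(emb, y)
```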
      <p>For an overall structure of our proposed network, please refer to Figure 2.
In the figure, red arrows represent the first learning phase in which parameters
of the net are found using a backpropagation. Blue arrows represent the second
learning phase – transfer learning. In the second phase, the first two layers of the
already trained neural net are used for training dataset generation. After that,
a linear SVM classifier is trained. Green arrows represent the prediction of new
samples.</p>
      <p>Fig. 2: Structure of the proposed network: Data (InputDim) → Linear Layer (InputDim × 100) → Sigmoid → Linear Layer (100 × 50) → Sigmoid → Linear Layer (50 × outputDim) → Softmax with a negative log-likelihood loss; a Linear SVM trainer and Linear SVM prediction are attached after the second Sigmoid. Legend: back propagation, SVM learning, prediction.</p>
      <p>Finally, the model performance was improved by using a combination of SL+AL. We have chosen a combination of uncertainty sampling with pseudo-labeling through self-training. In the experimental evaluation, the utility functions least confident (eq. 1), margin (eq. 2) and entropy (eq. 3) were included:</p>
      <p>φ_LC(x) = P_θ(y₁*|x), (1)</p>
      <p>φ_M(x) = P_θ(y₁*|x) − P_θ(y₂*|x), (2)</p>
      <p>φ_E(x) = −Σᵢᴺ P_θ(yᵢ|x) log P_θ(yᵢ|x), (3)</p>
      <p>where y₁* and y₂* denote the most and the second most probable class labels under the model θ.</p>
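<p>The three utility measures can be written down in a few lines of NumPy; the sample probabilities below are made up for illustration. Note that for the least-confident and margin measures low values indicate uncertain samples, while for the entropy measure the most uncertain samples are those with the highest value:</p>

```python
import numpy as np

def least_confident(p):
    """phi_LC (eq. 1): posterior of the most probable class."""
    return np.max(p, axis=1)

def margin(p):
    """phi_M (eq. 2): gap between the two most probable classes."""
    s = np.sort(p, axis=1)
    return s[:, -1] - s[:, -2]

def entropy(p, eps=1e-12):
    """phi_E (eq. 3): Shannon entropy of the predicted posterior."""
    return -np.sum(p * np.log(p + eps), axis=1)

# Softmax outputs for three hypothetical samples.
p = np.array([[0.90, 0.05, 0.05],   # confident
              [0.50, 0.30, 0.20],   # moderately uncertain
              [0.34, 0.33, 0.33]])  # most uncertain
```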
      <p>In each iteration, the n samples with the lowest utility function were queried to be annotated. At the same time, samples with a utility function higher than a threshold were predicted using the current version of the model, and these predictions were then used to train the next version of the model. The utility functions were calculated from the output of the softmax layer of the neural net. The number of samples n was chosen to be 5 in each iteration. The threshold value was tuned to keep the number of wrong labels getting into the training data as low as possible.</p>
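<p>One iteration of the described loop might be sketched as follows; the function name, the oracle callback (standing in for the human annotator) and the classifier interface (fit/predict_proba) are our assumptions, not the paper's implementation:</p>

```python
import numpy as np

def sl_al_iteration(model, X_lab, y_lab, X_unlab, oracle,
                    n_queries=5, threshold=0.99):
    """Combined uncertainty sampling (AL) and self-training (SL) step."""
    proba = model.predict_proba(X_unlab)
    conf = proba.max(axis=1)              # least-confident utility

    # Active learning: query the n least confident samples.
    query_idx = np.argsort(conf)[:n_queries]
    y_query = oracle(X_unlab[query_idx])

    # Self-training: pseudo-label samples above the confidence threshold.
    pseudo_idx = np.where(conf > threshold)[0]
    y_pseudo = proba[pseudo_idx].argmax(axis=1)

    used = np.union1d(query_idx, pseudo_idx)
    keep = np.setdiff1d(np.arange(len(X_unlab)), used)
    X_new = np.vstack([X_lab, X_unlab[query_idx], X_unlab[pseudo_idx]])
    y_new = np.concatenate([y_lab, y_query, y_pseudo])
    model.fit(X_new, y_new)               # retrain on the grown set
    return model, X_new, y_new, X_unlab[keep]
```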
      <sec id="sec-4-1">
        <title>Weighted accuracy</title>
        <p>The scene description in our experiment is constructed hierarchically, so there are three different levels of the label. The first level describes the building name, the second level describes a room, and the last level describes a detail in the room. For instance, if the camera shot captures the whole living room of the flat “4A” in the “main” building, we use a label such as main.4a. If only a specific portion of the room is shown, we use a more detailed level of the label, such as main.4a.couch.</p>
        <p>To take into account the label hierarchy, we introduce the weighted accuracy of a classifier F predicting ŷ₁, …, ŷₙ for training data (x₁, y₁), …, (xₙ, yₙ):</p>
        <p>WA(F) = (1/n) Σᵢ₌₁ⁿ f(yᵢ, ŷᵢ),</p>
        <p>f(yᵢ, ŷᵢ) = 1 if 1(yᵢ = ŷᵢ, 3); 0.5 if 1(yᵢ = ŷᵢ, 2); 0 otherwise,</p>
        <p>where 1(yᵢ = ŷᵢ, k) is the truth function of equality of all components of yᵢ and ŷᵢ on the k-th or a higher level of the component hierarchy.</p>
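<p>For concreteness, the weighted accuracy can be computed from dot-separated labels as follows (a sketch; the label strings are illustrative):</p>

```python
def level_match(y_true, y_pred, k):
    """Truth function 1(y = yhat, k): the labels agree on the first k
    levels of the building.room.detail hierarchy."""
    return y_true.split('.')[:k] == y_pred.split('.')[:k]

def f_score(y_true, y_pred):
    if level_match(y_true, y_pred, 3):
        return 1.0
    if level_match(y_true, y_pred, 2):
        return 0.5
    return 0.0

def weighted_accuracy(y_true, y_pred):
    """WA(F) = (1/n) * sum_i f(y_i, yhat_i)."""
    return sum(f_score(t, p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Exact match -> 1, same room but wrong detail -> 0.5, wrong room -> 0.
wa = weighted_accuracy(
    ['main.4a.couch', 'main.4a.couch', 'main.4a.couch'],
    ['main.4a.couch', 'main.4a.door', 'main.4b'])
# wa == (1 + 0.5 + 0) / 3 == 0.5
```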
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental evaluation</title>
      <p>
        For the evaluation of all the following approaches, we prepared our dataset [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]
from the first series of the sit-com The Big Bang Theory. This particular show uses only a couple of scenes, and as of 2018 new series were still being produced. The dataset was chosen for a proof-of-concept experiment; new datasets
should follow in future experiments. The multimedia content was automatically
segmented into individual camera shots by PySceneDetect [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] using the content
detector.
      </p>
      <p>A middle frame from each detected shot was stored as a reference for human annotation and convolutional neural network processing. Due to copyright protection, these stored frames are not contained in the dataset. They were divided into 80% training and 20% test data along the time axis.</p>
      <p>For the statistical approach experiments, the following histograms, averaged over the respective frame area and shot duration, were obtained: RGB 8x8x8 (flattened histogram over 8 × 8 × 8 bins), H (hue histogram with 180 bins), HSV (concatenation of 180-bin H, 256-bin S and 256-bin V histograms) and HSV 20x4x4 2*2 (flattened histogram over 20 × 4 × 4 bins in each of the 4 parts of the frame introduced by its prior division into a 2 × 2 grid).</p>
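<p>The spatially aware variant, HSV 20x4x4 2*2, can be sketched with NumPy alone (assuming a frame already converted to HSV, with H in [0, 180) and S, V in [0, 256) as in OpenCV's convention; the random frame is a stand-in for real data):</p>

```python
import numpy as np

def hsv_histogram_2x2(hsv_frame, bins=(20, 4, 4)):
    """Flattened 20x4x4 HSV histogram computed separately in each
    quadrant of a 2x2 grid and concatenated: 4 * 20 * 4 * 4 = 1280
    dimensions, each quadrant normalised by its pixel count."""
    h, w, _ = hsv_frame.shape
    parts = []
    for rows in (slice(0, h // 2), slice(h // 2, h)):
        for cols in (slice(0, w // 2), slice(w // 2, w)):
            pix = hsv_frame[rows, cols].reshape(-1, 3)
            hist, _ = np.histogramdd(
                pix, bins=bins, range=((0, 180), (0, 256), (0, 256)))
            parts.append(hist.ravel() / pix.shape[0])
    return np.concatenate(parts)

# Hypothetical HSV frame with uniform random pixels.
rng = np.random.default_rng(0)
frame = rng.uniform(0, 1, (32, 32, 3)) * np.array([180, 256, 256])
feat = hsv_histogram_2x2(frame)
```

<p>In the paper the features are additionally averaged over the shot duration, which amounts to averaging this vector over all frames of a shot.</p>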
      <sec id="sec-5-1">
        <title>Combinations of histograms and classifiers</title>
        <p>We have compared combinations of the above-described histograms with the following classifiers: linear SVM, k nearest neighbours (k-NN), naive Bayes (NB) and the feed-forward neural nets (FNNs) described in Section 3, i.e., FNN alone and FNN+SVM. A full comparison of the unweighted accuracy of all 16 combinations is given in Table 1.</p>
        <p>It is noticeable that the HSV 20x4x4 2*2 feature dominates over all other variants. Therefore, we used HSV 20x4x4 2*2 in the subsequent experiments. On the other hand, adding an SVM as the last layer of the FNN brings only a small improvement.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Comparison with an inception style neural network</title>
        <p>State-of-the-art approaches in image scene classification usually use residual deep convolutional neural networks with inception-style layers. They are typically combined with multi-scale processing of the input imagery.</p>
        <p>
          With these key features in mind, we used the winner of the 2016 LSUN
challenge [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] as the reference method for scene classification on our dataset.
        </p>
        <p>The results are, however, worse than expected. The accuracy progress (see
Figure 3) shows that the network training is very unstable. The testing accuracy
achieves a maximum of 32.4% in the 801st epoch.</p>
        <p>Fig. 3: Evolution of the test-data accuracy of the 2016 LSUN winner during training.</p>
        <p>As we are unable to interpret the inner state of the neural network directly, we may only assume that the main issue with using the multi-resolution convolutional neural network is the small dataset size. However, this is exactly the issue we need to mitigate.</p>
        <p>As was shown in Subsection 4.1, the use of the feed-forward neural network itself brings a substantial increase in classification metrics. As Table 2 indicates, the SVM layer provides an additional improvement, as does using part of the unlabeled dataset with SL+AL. Although the improvement is not large, we believe that a more sophisticated combination of SL+AL could bring us even further.</p>
        <p>The initial labeled dataset contained 5315 samples. An unlabeled dataset with 26528 samples was used for both active and semi-supervised learning. A human annotator was asked five queries at each of ten iterations.</p>
        <p>In this paper, we sketched how semi-supervised learning combined with active learning can be applied to scene recognition. In addition, we proposed to use neural networks for further feature enhancement.</p>
        <p>The resulting features extracted from the proposed neural network provide
a substantial improvement over the engineered features on input. Especially, if
the extracted features are used as a data embedding for a linear SVM classifier.</p>
        <p>This allows us to achieve an accuracy of almost 79% on a small dataset, which is significantly higher than that of the reference method (32.4%).</p>
        <p>Several descriptors are, however, still hard to recognize even for a human
annotator (e.g. staircase floor number). In these situations, one may benefit from
the context of the previous and following shot and consequently improve the
classification accuracy. Therefore, we would like to try context-based classifiers,
such as HMM, CRF or BI-LSTM-CRF as a next step of our research.</p>
        <p>Last but not least, we would like to use a transductive SVM in the top layer of the final classifier and provide further experiments in combination with semi-supervised and active learning, primarily with active multiview training.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>The reported research has been supported by the grant 18-18080S of the Czech Science Foundation (GAČR).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Blum</surname>
          </string-name>
          , A., Mitchell, T.:
          <article-title>Combining labeled and unlabeled data with co-training</article-title>
          .
          <source>In: Proceedings of the eleventh annual conference on Computational learning theory</source>
          . pp.
          <fpage>92</fpage>
          -
          <lpage>100</lpage>
          . ACM (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bosch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Muñoz, X.:
          <article-title>Scene classification via pLSA</article-title>
          .
          <source>In: European conference on computer vision</source>
          . pp.
          <fpage>517</fpage>
          -
          <lpage>530</lpage>
          . Springer (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bosch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Muñoz, X.:
          <article-title>Scene classification using a hybrid generative/discriminative approach</article-title>
          .
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>30</volume>
          (
          <issue>4</issue>
          ),
          <fpage>712</fpage>
          -
          <lpage>727</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Castellano</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          : Pyscenedetect. https://github.com/Breakthrough/ PySceneDetect (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>L.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lai</surname>
            ,
            <given-names>Y.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liao</surname>
            ,
            <given-names>H.Y.M.:</given-names>
          </string-name>
          <article-title>Movie scene segmentation using background information</article-title>
          .
          <source>Pattern Recognition</source>
          <volume>41</volume>
          (
          <issue>3</issue>
          ),
          <fpage>1056</fpage>
          -
          <lpage>1065</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Cohn</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Atlas</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ladner</surname>
          </string-name>
          , R.:
          <article-title>Improving generalization with active learning</article-title>
          .
          <source>Machine learning 15(2)</source>
          ,
          <fpage>201</fpage>
          -
          <lpage>221</lpage>
          (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Imagenet: A large-scale hierarchical image database</article-title>
          .
          <source>In: Computer Vision and Pattern Recognition</source>
          ,
          <year>2009</year>
          .
          <article-title>CVPR 2009</article-title>
          . IEEE Conference on. pp.
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          . IEEE (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elmagarmid</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aref</surname>
            ,
            <given-names>W.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Classview: hierarchical video shot classification, indexing, and accessing</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          <volume>6</volume>
          (
          <issue>1</issue>
          ),
          <fpage>70</fpage>
          -
          <lpage>86</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Farquhar</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hardoon</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meng</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shawe-Taylor</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szedmak</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Two view learning: Svm-2k, theory and practice</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>355</fpage>
          -
          <lpage>362</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Grandvalet</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Semi-supervised learning by entropy minimization</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>529</fpage>
          -
          <lpage>536</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Video scene change detection using convolution neural network</article-title>
          .
          <source>In: Proceedings of the 2017 International Conference on Information Technology</source>
          . pp.
          <fpage>116</fpage>
          -
          <lpage>119</lpage>
          . ACM (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Karpathy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Deep visual-semantic alignments for generating image descriptions</article-title>
          .
          <source>In: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          . pp.
          <fpage>3128</fpage>
          -
          <lpage>3137</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>D.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Catlett</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Heterogeneous uncertainty sampling for supervised learning</article-title>
          .
          <source>In: Machine Learning Proceedings</source>
          <year>1994</year>
          , pp.
          <fpage>148</fpage>
          -
          <lpage>156</lpage>
          . Elsevier (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>D.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gale</surname>
            ,
            <given-names>W.A.</given-names>
          </string-name>
          :
          <article-title>A sequential algorithm for training text classifiers</article-title>
          .
          <source>In: Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          . pp.
          <fpage>3</fpage>
          -
          <lpage>12</lpage>
          . Springer-Verlag New York, Inc. (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xing</surname>
            ,
            <given-names>E.P.</given-names>
          </string-name>
          :
          <article-title>Object bank: A high-level image representation for scene classification &amp; semantic feature sparsification</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>1378</fpage>
          -
          <lpage>1386</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jun</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghosh</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>A self-training approach to cost sensitive uncertainty sampling</article-title>
          .
          <source>Machine Learning</source>
          <volume>76</volume>
          (
          <issue>2</issue>
          ),
          <fpage>257</fpage>
          -
          <lpage>270</lpage>
          (
            <year>2009</year>
          ). https://doi.org/10.1007/s10994-009-5131-9
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Mao</surname>
            ,
            <given-names>C.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parikh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>S.Y.</given-names>
          </string-name>
          :
          <article-title>Semi-supervised co-training and active learning based approach for multi-view intrusion detection</article-title>
          .
          <source>In: Proceedings of the 2009 ACM Symposium on Applied Computing</source>
          . pp.
          <fpage>2042</fpage>
          -
          <lpage>2048</lpage>
          . SAC '09, ACM, New York, NY, USA (
          <year>2009</year>
          ). https://doi.org/10.1145/1529282.1529735
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Muslea</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Minton</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knoblock</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          :
          <article-title>Active + semi-supervised learning = robust multi-view learning</article-title>
          .
          <source>In: Proceedings of the Nineteenth International Conference on Machine Learning</source>
          . pp.
          <fpage>435</fpage>
          -
          <lpage>442</lpage>
          . ICML '02, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (
          <year>2002</year>
          ), http://dl.acm.org/citation.cfm?id=645531.655845
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Pulc</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Replication data for: Feed-forward neural networks for video scene classification from statistical features</article-title>
          (
          <year>2018</year>
          ). https://doi.org/10.7910/DVN/MPZGWO
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Roy</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Toward optimal active learning through monte carlo estimation of error reduction</article-title>
          . ICML, Williamstown, pp.
          <fpage>441</fpage>
          -
          <lpage>448</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Sabata</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Borovicka</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holena</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>K-best Viterbi semi-supervized active learning in sequence labelling</article-title>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Serrano</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Savakis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A computationally efficient approach to indoor/outdoor scene classification</article-title>
          .
          <source>In: Pattern Recognition, 2002. Proceedings. 16th International Conference on. vol. 4</source>
          , pp.
          <fpage>146</fpage>
          -
          <lpage>149</lpage>
          . IEEE (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Settles</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Active learning</article-title>
          .
          <source>Synthesis Lectures on Artificial Intelligence and Machine Learning</source>
          <volume>6</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>114</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Settles</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Craven</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ray</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Multiple-instance active learning</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>1289</fpage>
          -
          <lpage>1296</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Seung</surname>
            ,
            <given-names>H.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Opper</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sompolinsky</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Query by committee</article-title>
          .
          <source>In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory</source>
          . pp.
          <fpage>287</fpage>
          -
          <lpage>294</lpage>
          . COLT '92, ACM, New York, NY, USA (
          <year>1992</year>
          ). https://doi.org/10.1145/130385.130417
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seff</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop</article-title>
          .
          <source>arXiv preprint arXiv:1506.03365</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Szummer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Picard</surname>
            ,
            <given-names>R.W.</given-names>
          </string-name>
          :
          <article-title>Indoor-outdoor image classification</article-title>
          .
          <source>In: Content-Based Access of Image and Video Database, 1998. Proceedings., 1998 IEEE International Workshop on</source>
          . pp.
          <fpage>42</fpage>
          -
          <lpage>51</lpage>
          . IEEE (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Deep learning using linear support vector machines</article-title>
          .
          <source>arXiv preprint arXiv:1306.0239</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiong</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qiao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Knowledge guided disambiguation for large-scale scene classification with multi-resolution cnns</article-title>
          .
          <source>IEEE Transactions on Image Processing</source>
          <volume>26</volume>
          (
          <issue>4</issue>
          ),
          <fpage>2055</fpage>
          -
          <lpage>2068</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>Z.H.</given-names>
          </string-name>
          :
          <article-title>On multi-view active learning and the combination with semi-supervised learning</article-title>
          .
          <source>In: Proceedings of the 25th international conference on Machine learning</source>
          . pp.
          <fpage>1152</fpage>
          -
          <lpage>1159</lpage>
          . ACM (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Combining self learning and active learning for chinese named entity recognition</article-title>
          .
          <source>Journal of Software</source>
          <volume>5</volume>
          (
          <issue>5</issue>
          )
          ,
          <fpage>530</fpage>
          -
          <lpage>537</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lapedriza</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khosla</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Places: A 10 million image database for scene recognition</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Semi-supervised learning literature survey</article-title>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>