=Paper=
{{Paper
|id=Vol-2192/ialatecml_paper1
|storemode=property
|title=Semi-supervised and Active Learning in Video Scene Classification from Statistical Features
|pdfUrl=https://ceur-ws.org/Vol-2192/ialatecml_paper1.pdf
|volume=Vol-2192
|authors=Tomáš Šabata,Petr Pulc,Martin Holena
|dblpUrl=https://dblp.org/rec/conf/pkdd/SabataPH18
}}
==Semi-supervised and Active Learning in Video Scene Classification from Statistical Features==
Tomáš Šabata (1), Petr Pulc (2), and Martin Holeňa (2)

(1) Faculty of Information Technology, Czech Technical University in Prague, Prague, Czech Republic, tomas.sabata@fit.cvut.cz

(2) Institute of Computer Science of the Czech Academy of Sciences, Prague, Czech Republic, {pulc,martin}@cs.cas.cz

Abstract. In multimedia classification, the background is usually considered an unwanted part of the input data and is often modeled only to be removed in later processing. Contrary to that, we believe that a background model (i.e., the scene in which the picture or video shot is taken) should be included as an essential feature for both indexing and follow-up content processing. Information about the image background, however, is not usually the main target of the labeling process, and the number of annotated samples is very limited. Therefore, we propose to use a combination of semi-supervised and active learning to improve the performance of our scene classifier, specifically a combination of self-training with uncertainty sampling. As a result, we utilize a combination of a statistical feature extractor, a feed-forward neural network and a support vector machine classifier, which consistently achieves higher accuracy on less diverse data. With the proposed approach, we are currently able to achieve precision over 80% when training on a single series of a popular TV show.

Keywords: video data, scene classification, semi-supervised learning, active learning, colour statistics, feedforward neural networks

===1 Introduction===
Automatic multimedia content labeling is still a comparatively difficult domain for machine learning. High input data dimensionality requires large training data sets, especially for approaches that are designed without prior assumptions on the data properties.
Moreover, the increasing resolution of image sensors brings higher detail (and thus, at least in theory, more information), but poses a significant issue for the training phases of virtually all machine learning algorithms. Many approaches, therefore, have to introduce a trade-off concerning the number of involved parameters, the number of distinct output labels (classes) [26] and the resolution of the input imagery [7]. Alternatively, they have to use only the statistical properties of the input data (as [3] and many others).

We also need to tackle the limitation on the amount of labeled training data. Recent trends in video content processing include a task usually called Video to Text. The primary objective of such processing is to take multimedia content and describe its main features in a human-comprehensible text. Such a representation may contain gathered information on the scene, actors, objects and the actions in which they are involved. An example is the single-image description “baseball player is throwing ball in game,” as presented in [12].

Current approaches, however, commonly omit the information concerning the visual appearance of the background in complex multimedia content, even though such information might provide substantial contextual information for the object detection and event description itself. Approaches that use neural networks are mostly data-driven and require large amounts of data to adapt to each selected class. This requirement is, however, seldom met in smaller multimedia collections, such as home video, university lecture recordings, movie studios or corporate media databases.

We also want to reflect that a particular scene can be recalled by a human from a couple of static frames.
Therefore, manual scene labeling is a relatively easy task, as opposed to event labeling, which may need the full video sequence, or object labeling, which commonly requires drawing a bounding box around the annotated object.

To use the limited human involvement in scene labeling as efficiently as possible, we employ semi-supervised learning to make use of unlabeled data, which are substantially easier to obtain, while simultaneously selecting the data for annotation using active learning methods.

The rest of this paper is organized as follows: In Section 2, we briefly summarize the state of the art in scene classification in the context of single images without significant obstruction by foreground objects, as well as the state of the art in combining semi-supervised learning (SL) and active learning (AL). Section 3 describes our approach to scene recognition in video content. In Section 4, we compare the accuracy of our method for different approaches to feature selection and different classifiers.

===2 State of the Art===
Scene recognition is rather simple from the human perspective. Whether the scene is the same as one previously visited is recognized by the overall layout of the space, the presence and distribution of distinct objects, their texture, and color. Other sensory organs can provide even more information and allow faster recall. Scenes not visited beforehand may, after a thorough exploration, fall into one of the broader categories based on the similarity of such features.

Multimedia content, however, does not allow such space exploration directly. It is constrained to the color information of individual pixels at a rather small resolution. Video content resolves this issue only partially through the motion of the camera, which, on the other hand, introduces more degrees of freedom in background modeling and increases its complexity.
====2.1 Single Image Scene Classifiers Based on Colour Statistics====
The early scene classifiers, including those for the Indoor/Outdoor problem [22,27], and also the more recent approaches mentioned below are directly based on the overall color information contained in the picture. The vital decision in this particular case is the selection of the color space and the granularity of the considered histograms.

RGB (red, green and blue components) is the primary color space of multimedia acquisition and processing. However, it does not directly encode the quality of the color perceived by a human. By qualities of color, we primarily mean the color shade (hue). In HSV encoding (hue, saturation and value of the black/white range components, the last of them related to the overall lightness of the color), hue is commonly sampled with finer precision (narrower bins in histogram approaches) than saturation and lightness [5,8].

Mainly because of memory consumption and model size, statistical features of the individual images are commonly used for image processing, including basic scene classification. Other approaches are based on object detection [11,15] or on interest point description [2,3], or, in recent years, use deep convolutional neural networks [26,29,32].

====2.2 Multi-label Extension====
Often, a single image contains multiple semantic features, such as sea, beach and mountains. A crisp classification into only one class would, however, have to take only the dominant class, which might be different from the selection of the annotator. One possible extension is to create a new crisp class for each encountered combination of the labels, but this would have a substantial impact in areas where the amount of labeled content is not sufficient to enable proper training on such sub-classes.
Another possibility is to organize the labels into a hierarchical structure. If the described scenery shares multiple features, the parent label may be preferred for content description. When the scene classifier detects only a specific part of the scenery, we should not consider it a full miss.

Statistical approach. One of the common assumptions in scene classification is that, during a single shot, the background will be visible for a more extended period than the foreground object. Therefore, we may process each frame in a single shot by a scene recognition algorithm and vote among the proposed labels. The statistical approach to background modeling applies if we assume a static camera shot. When such an assumption is met, all frames are perfectly aligned, and the background model can be extracted from the long-term pixel averages.

====2.3 Semi-supervised Learning and Active Learning====
Semi-supervised learning (cf. the survey [33]) is a technique that benefits from making use of easily obtainable unlabeled data for training. In this paper, we mainly focus on the self-training approach to semi-supervised learning [10]. It is a simple and efficient method, in which we add samples with the most confidently predicted labels (pseudo-labels) to the training dataset. This is typically done iteratively, with the model retrained in each iteration. Other approaches to semi-supervised learning are co-training [1] and multiview training [9], which benefit from agreement among multiple learners.

Active learning (cf. the survey [23]) is related to semi-supervised learning in that it is also used in machine learning problems where obtaining unlabeled data is cheap and manual labeling is expensive but possible. Its goal is to spend a given annotation budget only on the most informative instances of the unlabeled data.
Most commonly, it is performed as pool-based sampling [14], assuming a small set of labeled data and a large set of unlabeled data. Samples that were found to be the most informative are given to an annotator and are moved into the labeled set. The considered machine learning model (e.g., a classifier) is retrained, and the algorithm iterates until the budget is exhausted or the performance of the model is satisfactory. Pool-based sampling needs to evaluate a utility function that estimates the usefulness of knowing the label of a particular sample. There are various ways of defining the utility function: for example, as a measure of uncertainty in uncertainty sampling [13], as the number of disagreements within an ensemble of diverse models in a method called query-by-committee [25], as the expected model change [24], the expected error [20] or only the variance part of the model error [6].

Semi-supervised and active learning can be quite naturally combined, since they address the unlabeled data set from opposite ends. For example, self-training turns the most certain samples into labeled samples, whereas uncertainty sampling queries the most uncertain samples and obtains their labels from an annotator. Such a combination was used for various problems [16,21,31]. Successful combinations with active learning also exist for multiview training [17,18,30].

===3 Multimedia Histogram Processing with Feed-Forward Neural Network using SVM===
In the reported research, our main concern is to enable automatic annotation of small datasets with a generally small variation within the individual classes. For example, we are not particularly interested in the recognition of a broader scenery concept (such as a living room); instead, we aim to classify that the video shot was captured in one specific living room.
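As an aside on the combination of self-training and uncertainty sampling outlined in Section 2.3, one selection step over the unlabeled pool can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `split_pool`, the use of the maximum class probability as the confidence score, and the example threshold are all assumptions made for the sketch.

```python
def split_pool(pool_probs, n_queries, threshold):
    """One combined SL+AL selection step over an unlabeled pool.

    pool_probs: list of class-probability lists, one per unlabeled sample
                (hypothetical model output, e.g. from a softmax layer).
    Returns (query_idx, pseudo_labels):
      query_idx     - indices of the n_queries least confident samples,
                      to be sent to a human annotator (uncertainty sampling)
      pseudo_labels - {index: predicted class} for the remaining samples
                      whose confidence reaches the threshold (self-training)
    """
    # Confidence of a sample = probability of its most likely class.
    confidence = [max(p) for p in pool_probs]
    # Least confident samples first.
    order = sorted(range(len(pool_probs)), key=lambda i: confidence[i])
    query_idx = order[:n_queries]
    queried = set(query_idx)
    pseudo_labels = {
        i: pool_probs[i].index(confidence[i])
        for i in range(len(pool_probs))
        if i not in queried and confidence[i] >= threshold
    }
    return query_idx, pseudo_labels


# Toy pool of five samples with three classes each (made-up numbers).
demo_probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.34, 0.33, 0.33],
    [0.60, 0.30, 0.10],
    [0.05, 0.95, 0.00],
]
demo_queries, demo_pseudo = split_pool(demo_probs, n_queries=2, threshold=0.85)
```

In a full loop, the queried samples would receive human labels, the pseudo-labeled samples would be added with their predicted classes, and the model would be retrained before the next iteration.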
One of the possible applications, on which we will demonstrate our approach in the next section, is the classification of individual scenes in long-running shows and sitcoms. However, our approach is designed to be versatile and to enable, for example, disambiguation of individual television news studios or well-known sites.

Another concern of ours is that training the classifier should require minimal resources, so that it can be plugged into more complex systems of multimedia content description as a simple high-level scene disambiguation module.

Therefore, we revise the traditional approaches in scene classification and propose the use of color histograms, possibly with partial spatial awareness. To demonstrate our reasoning behind this step, we refer to Figure 1.

Fig. 1: Representative frames from two distinct living rooms, (a) Room 4A and (b) Room 4B, and (c) a comparison of the proposed histograms (relative pixel count per histogram bin). Although both of these pictures depict a living room, the distribution of colours is different. Source images courtesy of CBS Entertainment.

We choose a feed-forward neural network as the base classifier. In particular, we use a network with two hidden layers of 100 and 50 neurons and the logistic sigmoid as the activation function. The output layer uses the softmax activation function. The network is trained using backpropagation with a negative log-likelihood loss function and a stochastic gradient descent optimizer.
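The forward pass and loss of such a network can be sketched in plain Python as below. This is only an illustration under stated assumptions: the input dimension of 1280, the number of classes (10), the weight initialization and all function names are hypothetical, and the training itself (backpropagation with stochastic gradient descent) is omitted.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic sigmoid.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(z):
    # Subtract the maximum for numerical stability.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def linear(x, weights, bias):
    # weights holds one row of input weights per output neuron.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    # Small random weights; the initialization scheme is an assumption.
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, params):
    """Return (class probabilities, activations of the second hidden layer)."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = [sigmoid(v) for v in linear(x, w1, b1)]   # 100 sigmoid units
    h2 = [sigmoid(v) for v in linear(h1, w2, b2)]  # 50 sigmoid units
    return softmax(linear(h2, w3, b3)), h2


def nll_loss(probs, target):
    # Negative log-likelihood of the true class.
    return -math.log(probs[target])


rng = random.Random(0)
INPUT_DIM, N_CLASSES = 1280, 10  # hypothetical sizes for illustration
params = [init_layer(INPUT_DIM, 100, rng),
          init_layer(100, 50, rng),
          init_layer(50, N_CLASSES, rng)]
x = [1.0 / INPUT_DIM] * INPUT_DIM  # a flat, normalized dummy histogram
probs, hidden = forward(x, params)
loss = nll_loss(probs, target=0)
```

The second hidden layer's activations are returned alongside the class probabilities so that they can be inspected or reused as a 50-dimensional representation of the input.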
The network topology, activation function and optimizer were found through a simple grid search, in which we also considered other activation functions, such as ReLU or the hyperbolic tangent, and another optimizer, based on adaptive estimates of the first and second moments of the gradients [?].

For the scene classification task, we can use the trained neural network directly. However, we introduce an improvement inspired by transfer learning. Transfer learning is usually used in deep convolutional neural nets, where the convergence of all parameters is slower [28]. However, we would like to demonstrate that transfer learning can also bring a substantial benefit in shallow neural networks, especially in combination with a support vector machine (SVM) classifier.

In our scenario, we freeze the parameters of the first layers and use the network as a feature extractor. For the classification stage, the original softmax layer is then replaced with a linear support vector machine. This brings us a rather small but consistent improvement in the final accuracy.

For the overall structure of our proposed network, please refer to Figure 2. In the figure, red arrows represent the first learning phase, in which the parameters of the net are found using backpropagation. Blue arrows represent the second learning phase, transfer learning. In the second phase, the first two layers of the already trained neural net are used for training dataset generation. After that, a linear SVM classifier is trained. Green arrows represent the prediction of new samples.

[Figure 2 diagram blocks: Data (InputDim), Linear Layer (InputDim x 100), Sigmoid, Linear Layer (100 x 50), Sigmoid, Linear Layer (50 x outputDim), Softmax, Negative log likelihood loss, Linear SVM Trainer, Linear SVM, Prediction. Legend: back propagation, SVM learning, prediction.]

Fig. 2: The architecture of the proposed neural net.
Red arrows represent the first learning phase, blue arrows represent the second learning phase with SVM, and green arrows represent the prediction phase.

Finally, the model performance was improved by using a combination of SL and AL. We have chosen a combination of uncertainty sampling with pseudo-labeling through self-training. In the experimental evaluation, the utility functions least confident (eq. 1), margin (eq. 2) and entropy (eq. 3) were included:

<math>\phi_{LC}(x) = P_\theta(y_1^* \mid x), \qquad (1)</math>

<math>\phi_{M}(x) = P_\theta(y_1^* \mid x) - P_\theta(y_2^* \mid x), \qquad (2)</math>

<math>\phi_{E}(x) = -\sum_{i}^{N} P_\theta(y_i \mid x) \log P_\theta(y_i \mid x), \qquad (3)</math>

where <math>y_1^*</math> and <math>y_2^*</math> denote the most probable and the second most probable class under the current model.

In each iteration, n samples with the lowest value of the utility function were queried to be annotated. At the same time, samples with a utility function value higher than a threshold were predicted using the current version of the model, and these predictions were then used to train the next version of the model. Utility functions were calculated from the output of the softmax layer of the neural net. The number of samples n was chosen to be 5 in each iteration. The threshold value was tuned to keep the number of wrong labels getting into the training data as low as possible.

====3.1 Weighted accuracy====
The scene description in our experiment is constructed hierarchically, so there are three different levels of the label. The first level describes the building name, the second level describes a room, and the last level describes a detail in the room. For instance, if the camera shot captures the whole living room of the flat “4A” in the “main” building, we use a label such as main.4a. If only a specific portion of the room is shown, we use a more detailed level of the label, such as main.4a.couch.

To take the label hierarchy into account, we introduce the weighted accuracy of a classifier F predicting <math>\hat{y}_1, \dots, \hat{y}_n</math> for training data <math>(x_1, y_1), \dots, (x_n, y_n)</math>:

<math>\mathrm{WA}(F) = \frac{1}{n} \sum_{i=1}^{n} f(y_i, \hat{y}_i),</math>

<math>f(y_i, \hat{y}_i) = \begin{cases} 1 & \text{if } \mathbf{1}(y_i = \hat{y}_i, 3), \\ 0.5 & \text{if } \mathbf{1}(y_i = \hat{y}_i, 2), \\ 0 & \text{otherwise,} \end{cases}</math>

where <math>\mathbf{1}(y_i = \hat{y}_i, k)</math> is the truth function of equality of all components of <math>y_i</math> and <math>\hat{y}_i</math> on the k-th or a higher level of the component hierarchy.

===4 Experimental evaluation===
For the evaluation of all the following approaches, we prepared our dataset [19] from the first series of the sitcom The Big Bang Theory. This particular show uses only a couple of scenes and, as of 2018, new series are still being produced. The dataset was chosen for a proof-of-concept experiment, and new datasets should follow in future experiments. The multimedia content was automatically segmented into individual camera shots by PySceneDetect [4] using the content detector. A middle frame from each detected shot was stored as a reference for human annotation and convolutional neural network processing. Due to copyright protection, these stored frames are not contained in the dataset. The data were divided into 80% training and 20% test data along the time axis.

For the statistical approach experiments, the following histograms, averaged by the respective frame area and shot duration, were obtained: RGB 8x8x8 (flattened histogram over 8 × 8 × 8 bins), H (hue histogram with 180 bins), HSV (concatenation of the 180-bin H, 256-bin S and 256-bin V histograms) and HSV 20x4x4 2*2 (flattened histogram over 20 × 4 × 4 bins in each of the 4 parts of the frame obtained by its prior division in a 2 × 2 grid).

====4.1 Combinations of histograms and classifiers====
We have compared combinations of the above described histograms with the following classifiers: linear SVM, k nearest neighbours (k-NN), naive Bayes (NB) and the feedforward neural nets (FNNs) described in Section 3, i.e., FNN alone and FNN+SVM. A full comparison of the unweighted accuracy of all 20 combinations is carried out in Table 1.
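To make the HSV 20x4x4 2*2 descriptor from the beginning of this section more concrete, the following sketch computes a flattened 20 × 4 × 4 HSV histogram in each cell of a 2 × 2 grid, normalized by the frame area, and averages the per-frame histograms over a shot. It is a hedged illustration only: the function names, the use of Python's `colorsys` module (with all HSV components scaled to the range 0 to 1) and the bin ordering are assumptions, not the authors' extraction code.

```python
import colorsys


def hsv_grid_histogram(pixels, width, height, bins=(20, 4, 4), grid=(2, 2)):
    """Flattened HSV histogram per grid cell, normalized by the frame area.

    pixels: row-major list of (r, g, b) tuples with components in 0..255.
    Returns a list of bins[0] * bins[1] * bins[2] * grid[0] * grid[1] values
    (1280 for the default setting) that sum to 1.
    """
    h_bins, s_bins, v_bins = bins
    grid_x, grid_y = grid
    cell_size = h_bins * s_bins * v_bins
    hist = [0.0] * (cell_size * grid_x * grid_y)
    for idx, (r, g, b) in enumerate(pixels):
        x, y = idx % width, idx // width
        # Index of the grid cell (frame part) this pixel falls into.
        part = (y * grid_y // height) * grid_x + (x * grid_x // width)
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Clamp so that a component equal to 1.0 lands in the last bin.
        hi = min(int(h * h_bins), h_bins - 1)
        si = min(int(s * s_bins), s_bins - 1)
        vi = min(int(v * v_bins), v_bins - 1)
        hist[part * cell_size + (hi * s_bins + si) * v_bins + vi] += 1.0
    total = float(len(pixels))
    return [count / total for count in hist]


def average_over_shot(frame_histograms):
    """Element-wise mean of the per-frame histograms of one shot."""
    n = float(len(frame_histograms))
    return [sum(column) / n for column in zip(*frame_histograms)]


# Toy 4x4 frame: left half pure red, right half pure blue.
frame = [(255, 0, 0) if x < 2 else (0, 0, 255)
         for y in range(4) for x in range(4)]
features = hsv_grid_histogram(frame, width=4, height=4)
shot_features = average_over_shot([features, features])
```

With the default settings the descriptor has 20 · 4 · 4 · 4 = 1280 components, which fits the bin range visible on the horizontal axis of the histogram comparison in Figure 1.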
{| class="wikitable"
|+ Table 1: Accuracy of combining the considered four kinds of histograms with the following classifiers: linear SVM, k-NN, NB, FNN and FNN+SVM. For each classifier, the highest accuracy with respect to the different kinds of histograms is in italics, and the highest accuracy with respect to different classifiers is in bold
! Accuracy [%] !! Linear SVM !! k-NN !! NB !! FNN !! FNN+SVM
|-
| RGB 8x8x8 || 18.1 || 32.42 || 26.1 || 54.5 || '''60.0'''
|-
| H || 12.4 || 30.7 || 26.1 || 56.0 || '''58.9'''
|-
| HSV || 14.1 || 32.6 || 32.4 || 63.3 || '''65.7'''
|-
| HSV 20x4x4 2*2 || ''46.0'' || ''45.4'' || ''33.9'' || ''77.2'' || '''''78.8'''''
|}

The HSV 20x4x4 2*2 feature clearly dominates all other variants. Therefore, we used HSV 20x4x4 2*2 in the subsequent experiments. On the other hand, adding an SVM as the last layer of the FNN brings only a smaller improvement.

====4.2 Comparison with an inception style neural network====
State-of-the-art approaches in image scene classification usually use residual deep convolutional neural networks with inception-style layers. They are typically combined with multi-scale processing of the input imagery.

With these key features in mind, we used the winner of the 2016 LSUN challenge [29] as the reference method for scene classification on our dataset. The results are, however, worse than expected. The accuracy progress (see Figure 3) shows that the network training is very unstable. The testing accuracy achieves a maximum of 32.4% in the 801st epoch.

Fig. 3: Accuracy of the inception-style winner of the LSUN challenge [29] on the testing set (plot: evolution of the test-data accuracy of the 2016 LSUN winner during training; horizontal axis: epochs, 0 to 1000; vertical axis: accuracy, 0.00 to 0.30).

As we are unable to interpret the inner state of the neural network directly, we may only assume that the main issue with using the multi-resolution convolutional neural network is the small dataset size.
However, this is exactly the issue we need to mitigate.

====4.3 Including semi-supervised and active learning====
As was shown in Subsection 4.1, the use of the feed-forward neural network itself brings a substantial increase in classification metrics. As Table 2 indicates, the SVM layer provides an additional improvement, as does using part of the unlabeled dataset with SL+AL. Although the improvement is not high, we believe that using a more sophisticated combination of SL+AL could bring us even further.

The initial labeled dataset contained 5315 samples. An unlabeled dataset with 26528 samples was used for both active and semi-supervised learning. A human annotator was asked five queries at each of ten iterations.

{| class="wikitable"
|+ Table 2: Final achieved accuracy, weighted accuracy, precision, recall and F1 score with the HSV 20x4x4 2*2 histogram. For each of these classifier performance measures, the highest value among the considered classifiers is in bold
! Classifier !! Acc !! Weighted acc !! Precision !! Recall !! F1
|-
| FNN || 0.7723 || 0.8518 || 0.7813 || 0.7590 || 0.7578
|-
| FNN-SVM || 0.7883 || '''0.8626''' || 0.8026 || 0.7837 || 0.7842
|-
| FNN-SVM with SL+AL || '''0.7895''' || 0.8617 || '''0.8037''' || '''0.8022''' || '''0.7978'''
|}

===5 Conclusions and Future Work===
In this paper, we sketched how semi-supervised learning combined with active learning can be applied to scene recognition. In addition, we propose to use neural networks for further feature enhancement. The resulting features extracted from the proposed neural network provide a substantial improvement over the engineered features on input, especially if the extracted features are used as a data embedding for a linear SVM classifier. This allows us to achieve an accuracy of almost 79% on a small dataset, which is significantly higher than that of the reference method (32.4%).

Several descriptors are, however, still hard to recognize even for a human annotator (e.g. staircase floor number).
In these situations, one may benefit from the context of the previous and following shot and consequently improve the classification accuracy. Therefore, we would like to try context-based classifiers, such as HMM, CRF or BI-LSTM-CRF, as the next step of our research.

Last but not least, we would like to use a transductive SVM in the top layer of the final classifier and provide further experiments in combination with semi-supervised and active learning, primarily with active multiview training.

===Acknowledgements===
The reported research has been supported by the grant 18-18080S of the Czech Science Foundation (GAČR).

===References===
# Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proceedings of the eleventh annual conference on Computational learning theory. pp. 92–100. ACM (1998)
# Bosch, A., Zisserman, A., Muñoz, X.: Scene classification via plsa. In: European conference on computer vision. pp. 517–530. Springer (2006)
# Bosch, A., Zisserman, A., Muñoz, X.: Scene classification using a hybrid generative/discriminative approach. IEEE transactions on pattern analysis and machine intelligence 30(4), 712–727 (2008)
# Castellano, B.: Pyscenedetect. https://github.com/Breakthrough/PySceneDetect (2017)
# Chen, L.H., Lai, Y.C., Liao, H.Y.M.: Movie scene segmentation using background information. Pattern Recognition 41(3), 1056–1065 (2008)
# Cohn, D., Atlas, L., Ladner, R.: Improving generalization with active learning. Machine learning 15(2), 201–221 (1994)
# Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. pp. 248–255. IEEE (2009)
# Fan, J., Elmagarmid, A.K., Zhu, X., Aref, W.G., Wu, L.: Classview: hierarchical video shot classification, indexing, and accessing. IEEE Transactions on Multimedia 6(1), 70–86 (2004)
# Farquhar, J., Hardoon, D., Meng, H., Shawe-taylor, J.S., Szedmak, S.: Two view learning: Svm-2k, theory and practice. In: Advances in neural information processing systems. pp. 355–362 (2006)
# Grandvalet, Y., Bengio, Y.: Semi-supervised learning by entropy minimization. In: Advances in neural information processing systems. pp. 529–536 (2005)
# Han, S., Kim, J.: Video scene change detection using convolution neural network. In: Proceedings of the 2017 International Conference on Information Technology. pp. 116–119. ACM (2017)
# Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3128–3137 (2015)
# Lewis, D.D., Catlett, J.: Heterogeneous uncertainty sampling for supervised learning. In: Machine Learning Proceedings 1994, pp. 148–156. Elsevier (1994)
# Lewis, D.D., Gale, W.A.: A sequential algorithm for training text classifiers. In: Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval. pp. 3–12. Springer-Verlag New York, Inc. (1994)
# Li, L.J., Su, H., Fei-Fei, L., Xing, E.P.: Object bank: A high-level image representation for scene classification & semantic feature sparsification. In: Advances in neural information processing systems. pp. 1378–1386 (2010)
# Liu, A., Jun, G., Ghosh, J.: A self-training approach to cost sensitive uncertainty sampling. Machine Learning 76(2), 257–270 (Sep 2009). https://doi.org/10.1007/s10994-009-5131-9
# Mao, C.H., Lee, H.M., Parikh, D., Chen, T., Huang, S.Y.: Semi-supervised co-training and active learning based approach for multi-view intrusion detection. In: Proceedings of the 2009 ACM Symposium on Applied Computing. pp. 2042–2048. SAC ’09, ACM, New York, NY, USA (2009). https://doi.org/10.1145/1529282.1529735, http://doi.acm.org/10.1145/1529282.1529735
# Muslea, I., Minton, S., Knoblock, C.A.: Active + semi-supervised learning = robust multi-view learning. In: Proceedings of the Nineteenth International Conference on Machine Learning. pp. 435–442. ICML ’02, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2002), http://dl.acm.org/citation.cfm?id=645531.655845
# Pulc, P.: Replication data for: Feed-forward neural networks for video scene classification from statistical features (2018). https://doi.org/10.7910/DVN/MPZGWO
# Roy, N., McCallum, A.: Toward optimal active learning through monte carlo estimation of error reduction. ICML, Williamstown pp. 441–448 (2001)
# Sabata, T., Borovicka, T., Holena, M.: K-best viterbi semi-supervized active learning in sequence labelling (2017)
# Serrano, N., Savakis, A., Luo, A.: A computationally efficient approach to indoor/outdoor scene classification. In: Pattern Recognition, 2002. Proceedings. 16th International Conference on. vol. 4, pp. 146–149. IEEE (2002)
# Settles, B.: Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6(1), 1–114 (2012)
# Settles, B., Craven, M., Ray, S.: Multiple-instance active learning. In: Advances in neural information processing systems. pp. 1289–1296 (2008)
# Seung, H.S., Opper, M., Sompolinsky, H.: Query by committee. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory. pp. 287–294. COLT ’92, ACM, New York, NY, USA (1992). https://doi.org/10.1145/130385.130417, http://doi.acm.org/10.1145/130385.130417
# Yu, F., Zhang, Y., Song, S., Seff, A., Xiao, J.: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365 (2015)
# Szummer, M., Picard, R.W.: Indoor-outdoor image classification. In: Content-Based Access of Image and Video Database, 1998. Proceedings., 1998 IEEE International Workshop on. pp. 42–51. IEEE (1998)
# Tang, Y.: Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239 (2013)
# Wang, L., Guo, S., Huang, W., Xiong, Y., Qiao, Y.: Knowledge guided disambiguation for large-scale scene classification with multi-resolution cnns. IEEE Transactions on Image Processing 26(4), 2055–2068 (2017)
# Wang, W., Zhou, Z.H.: On multi-view active learning and the combination with semi-supervised learning. In: Proceedings of the 25th international conference on Machine learning. pp. 1152–1159. ACM (2008)
# Yao, L., Sun, C., Wang, X., Wang, X.: Combining self learning and active learning for chinese named entity recognition. Journal of software 5(5), 530–537 (2010)
# Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017)
# Zhu, X.: Semi-supervised learning literature survey (2005)