=Paper=
{{Paper
|id=Vol-1739/MediaEval_2016_paper_19
|storemode=property
|title=BigVid at MediaEval 2016: Predicting Interestingness in Images and Videos
|pdfUrl=https://ceur-ws.org/Vol-1739/MediaEval_2016_paper_19.pdf
|volume=Vol-1739
|dblpUrl=https://dblp.org/rec/conf/mediaeval/XuFJ16
}}
==BigVid at MediaEval 2016: Predicting Interestingness in Images and Videos==
Baohan Xu¹³, Yanwei Fu²³, Yu-Gang Jiang¹³
¹School of Computer Science, Fudan University, China
²School of Data Science, Fudan University, China
³Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, China
{bhxu14, yanweifu, ygj}@fudan.edu.cn

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, October 20-21, 2016, Hilversum, Netherlands.

ABSTRACT

Despite growing research interest, predicting the interestingness of images and videos remains an open challenge. The main obstacles come from the diversity and complexity of video content, as well as the highly subjective and varying judgements of interestingness across different viewers. In the MediaEval 2016 Predicting Media Interestingness Task, our team BigVid@Fudan submitted five runs exploring various methods of extracting and modeling low-level features (from visual and audio modalities) and hundreds of high-level semantic attributes, and of fusing these features for classification. We investigated not only the SVM (Support Vector Machine) model but also recent deep learning methods. We submitted five runs, using SVM/Ranking-SVM (Run1, Run3 and Run4) and Deep Neural Networks (Run2 and Run5). We achieved a mean average precision of 0.23 for the image subtask and 0.15 for the video subtask. Furthermore, our experiments revealed some interesting and potentially useful insights into this task; for example, our results show that visual features and high-level attributes are complementary to each other.

1. INTRODUCTION

The problem of automatically predicting the interestingness of images and videos has started to receive increasing attention. Interestingness prediction has a number of real-world applications, such as interestingness-based video recommendation systems for social media platforms.

MediaEval introduced the "2016 Predicting Media Interestingness Task". This task requires participants to automatically select the images and/or video segments that are considered the most interesting for a common viewer. Interestingness of the media is to be judged based on visual appearance, audio information and text accompanying the data. To solve the task, participants are strongly encouraged to deploy multimodal approaches. For the definitions, dataset and evaluation protocol of the task, please refer to the official task description [2].

This paper describes the first participation of the BigVid@Fudan team in MediaEval 2016. For this task we developed an approach to investigate how features and classifiers affect interestingness prediction in images and videos. Both visual features and high-level attributes were explored in our framework. We also compared SVM with deep neural networks to further study the relations between different features.

2. SYSTEM DESCRIPTION

Figure 1 gives an overview of our system. The whole system is composed of two key components: feature extraction and classifiers.

Figure 1: An overview of the key components in our proposed methods. We use DNN and SVM for both subtasks, while Ranking-SVM is used only in the video subtask. The face feature is computed according to the movement of faces in the video shot, and is also used only in the video subtask.

2.1 Feature Extraction

Several pre-computed features are provided by the organizers, such as denseSIFT [3], pre-trained CNN fc7-layer features from an ImageNet model, and face features. To enlarge the useful information in the data, we also consider two other types of high-level features, which have been shown to be very useful for aesthetics and interestingness prediction in [5] and [1]. Average pooling of the descriptors from all sampled frames is used to form the video-level representation for each feature modality.
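For a single feature modality, this pooling step amounts to a mean over the per-frame descriptors; the following is a minimal numpy sketch (the array shapes and the function name are illustrative assumptions, not released task code):

```python
import numpy as np

def video_level_descriptor(frame_descriptors):
    """Average-pool per-frame descriptors into one video-level vector.

    frame_descriptors: array of shape (num_frames, dim), e.g. CNN fc7
    activations or denseSIFT bag-of-words histograms computed on the
    frames sampled from one video shot.
    """
    return np.mean(frame_descriptors, axis=0)

# e.g. 30 sampled frames with 4096-d fc7 features -> one 4096-d vector
pooled = video_level_descriptor(np.random.rand(30, 4096))
```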
Style Attributes: We considered the photographic style attributes of [5] as high-level descriptors; these attributes have been shown to be highly related to aesthetics and interestingness. The descriptor is formed by concatenating the classification outputs of 14 photographic styles (e.g., Complementary Colors, Duotones, Rule of Thirds, Vanishing Point, etc.).

SentiBank: There are 1,200 concepts in SentiBank, each defined as an adjective-noun pair, e.g., "crazy cat" and "lovely girl", where the adjective is strongly related to emotions and the noun corresponds to objects and scenes that are expected to be automatically detectable. Models for detecting the concepts were trained on Flickr images [1]. This set of attributes is intuitively effective for emotion-related objects and scenes. Since interesting images/videos are often related to strong emotions, these attributes are expected to be a very helpful clue for predicting interestingness.

2.2 Classifiers

Several classifiers are investigated here in order to be robust to the diversity and complexity of similar visual content. In particular, we use SVM, Ranking-SVM and Deep Neural Networks (DNN) for feature fusion and classification. We explain them as follows.

SVM: A χ² kernel was adopted for the bag-of-words features (denseSIFT), and a Gaussian RBF kernel was used for the others. For feature fusion, kernel-level average fusion was used, which linearly combines the kernels computed on different features.
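As a hedged illustration of kernel-level average fusion, the sketch below builds one kernel matrix per feature type and averages them before training a precomputed-kernel SVM. The feature dimensions, the equal fusion weights, and the use of scikit-learn's exponential χ² kernel are our own assumptions, not the exact competition setup:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel, rbf_kernel

def fused_kernel(bow_a, bow_b, cnn_a, cnn_b):
    """Kernel-level average fusion: one kernel per feature type,
    linearly combined with (here) equal weights."""
    k_bow = chi2_kernel(bow_a, bow_b)  # chi^2 kernel for bag-of-words (denseSIFT)
    k_cnn = rbf_kernel(cnn_a, cnn_b)   # Gaussian RBF kernel for the other features
    return (k_bow + k_cnn) / 2.0

# toy data: 100 training shots, 20 test shots (dimensions are illustrative)
bow_tr, cnn_tr = np.random.rand(100, 300), np.random.rand(100, 4096)
bow_te, cnn_te = np.random.rand(20, 300), np.random.rand(20, 4096)
y_tr = np.random.randint(0, 2, 100)  # interesting vs. not interesting

clf = SVC(kernel="precomputed")
clf.fit(fused_kernel(bow_tr, bow_tr, cnn_tr, cnn_tr), y_tr)
scores = clf.decision_function(fused_kernel(bow_te, bow_tr, cnn_te, cnn_tr))
```

The decision scores can then be used directly to rank the test images/shots by predicted interestingness.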
Ranking-SVM: Since the interestingness level also affects the classification result, we consider training a model to compare the interestingness of different images/videos. We therefore adopt Joachims' Ranking SVM [4] to enhance the final results. To make full use of the training data, we organized it into pairs, with ground-truth labels indicating which item of each pair is more interesting. Score-level average late fusion was adopted to combine the results of the SVM and the Ranking-SVM.

DNN: We also adopted a DNN-based classifier proposed in our recent work [6]. The fusion methods for the SVM classifiers may take advantage of different features; however, they often neglect the hidden relations shared among features. We proposed a regularized DNN to explore the relationships between distinct features, which is found useful for image/video classification. Specifically, for each input feature, a layer of neurons is first used to perform feature abstraction. Feature fusion is then performed by another layer with carefully designed structural-norm regularization on the network weights, so that the relationships between features are taken into account. The fused representation is finally used to build a classification model in the last layer. With this network, we are able to fuse features by considering both feature correlation and feature diversity, and to perform classification simultaneously. Please see [6] for more details.
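The full model is specified in [6]; purely as a hedged sketch of the topology described above, the following PyTorch snippet wires per-feature abstraction layers into a fusion layer carrying a structural-norm penalty and a final classifier. The layer sizes, ReLU activations, the choice of an ℓ2,1 row norm as the structural regularizer, and all names here are our illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class FusionDNN(nn.Module):
    """Sketch: per-feature abstraction layers, a structurally
    regularized fusion layer, and a classification layer."""

    def __init__(self, feat_dims, abstract_dim=256, fused_dim=128, n_classes=2):
        super().__init__()
        # one abstraction layer per input feature modality
        self.abstract = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, abstract_dim), nn.ReLU()) for d in feat_dims]
        )
        self.fuse = nn.Linear(abstract_dim * len(feat_dims), fused_dim)
        self.classify = nn.Linear(fused_dim, n_classes)

    def forward(self, feats):  # feats: one (batch, dim_i) tensor per modality
        hidden = [layer(x) for layer, x in zip(self.abstract, feats)]
        fused = torch.relu(self.fuse(torch.cat(hidden, dim=1)))
        return self.classify(fused)

    def structural_penalty(self):
        # l2,1-style norm on the fusion weights (sum of row-wise l2 norms),
        # a stand-in for the carefully designed regularizer of [6]
        return self.fuse.weight.norm(p=2, dim=1).sum()

# toy usage: CNN fc7 (4096-d), SentiBank (1200-d) and style (14-d) inputs
model = FusionDNN(feat_dims=[4096, 1200, 14])
feats = [torch.randn(8, d) for d in (4096, 1200, 14)]
loss = nn.functional.cross_entropy(model(feats), torch.randint(0, 2, (8,)))
loss = loss + 1e-4 * model.structural_penalty()
loss.backward()
```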
3. SUBMITTED RUNS AND RESULTS

There are two subtasks in this year's evaluation, namely predicting video interestingness and predicting image interestingness. We submitted 5 runs for official evaluation, of which 2 are for the image subtask and 3 for the video subtask. Run1 and Run4 used SVM for the video and image subtasks respectively; Run2 and Run5 used DNN for the video and image subtasks respectively; Run3 used SVM fused with Ranking-SVM for the video subtask.

Figure 2 summarizes the results of all the submissions. The official performance measure is MAP for both the video and image subtasks.

Figure 2: Performance of our 5 submitted runs on both video and image subtasks. AP is computed on a per-trailer basis over the top N best-ranked images/video shots, and MAP is averaged over all trailers. MAP values: VideoTask-Run1 0.148, VideoTask-Run2 0.151, VideoTask-Run3 0.154, ImageTask-Run4 0.179, ImageTask-Run5 0.229.

For the image subtask, the DNN (Run5) significantly outperforms the SVM classifier (Run4), since feature correlation plays an important role in feature fusion for the interestingness task. This clearly confirms the effectiveness of our proposed deep networks. Our experiments also verify that the high-level attributes are complementary to the visual and CNN features.

For the video subtask, besides the visual and high-level features, we combined the face features. The experiments show that these features are complementary to each other, i.e., both visual features and high-level attributes contribute to determining whether a video clip is interesting or not. We found that adding an audio feature such as MFCC (Mel-Frequency Cepstral Coefficients) may cause worse results, possibly because the video shots are very short and cannot provide continuous and useful audio information. We also considered adding ranking information for the video subtask (Run3); it shows a slight improvement over SVM (Run1) and DNN (Run2). This result also indicates that the interestingness level may further improve the results.

It is also worth mentioning that the results of the image subtask are better than those of the video subtask. This may be caused by the fact that averaging frame features weakens the weight of the interesting information. How to fully use the effective information in video clips is a future direction.

4. CONCLUSIONS

We have explored both the SVM model and DNN to achieve better classification of image and video interestingness. Our experiments have shown that the DNN-based method outperforms the SVM model by considering feature correlation. Additionally, the high-level attributes are complementary to the visual and CNN features in predicting interestingness. Nevertheless, our experimental results indicate that the visual and audio features may lack discriminative power for interestingness. Thus, as future work on predicting interestingness, we will consider extracting from images and videos the text information that may contain textual descriptions of interestingness (from a linguistic perspective).

5. REFERENCES

[1] D. Borth, T. Chen, R. Ji, and S.-F. Chang. SentiBank: Large-scale ontology and classifiers for detecting sentiment and emotions in visual content. In ACM MM, 2013.
[2] C.-H. Demarty, M. Sjöberg, B. Ionescu, T.-T. Do, H. Wang, N. Q. Duong, and F. Lefebvre. MediaEval 2016 Predicting Media Interestingness Task. In Proc. of the MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 20-21, 2016.
[3] Y.-G. Jiang, Q. Dai, T. Mei, Y. Rui, and S.-F. Chang. Super fast event recognition in internet videos. IEEE TMM, 17(8):1–13, 2015.
[4] T. Joachims. Optimizing search engines using clickthrough data. In ACM SIGKDD, 2002.
[5] N. Murray, L. Marchesotti, and F. Perronnin. AVA: A large-scale database for aesthetic visual analysis. In CVPR, 2012.
[6] Z. Wu, Y.-G. Jiang, J. Wang, J. Pu, and X. Xue. Exploring inter-feature and inter-class relationships with deep neural networks for video classification. In ACM MM, 2014.