=Paper=
{{Paper
|id=Vol-1984/Mediaeval_2017_paper_29
|storemode=property
|title=EURECOM @MediaEval 2017: Media Genre Inference for Predicting Media Interestingness
|pdfUrl=https://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_29.pdf
|volume=Vol-1984
|authors=Olfa Ben-Ahmed,Jonas Wacker,Alessandro Gaballo,Benoit Huet
|dblpUrl=https://dblp.org/rec/conf/mediaeval/AhmedWGH17
}}
==EURECOM @MediaEval 2017: Media Genre Inference for Predicting Media Interestingness==
Olfa Ben-Ahmed, Jonas Wacker, Alessandro Gaballo, Benoit Huet
EURECOM, Sophia Antipolis, France
olfa.ben-ahmed@eurecom.fr, jonas.wacker@eurecom.fr, alessandro.gaballo@eurecom.fr, benoit.huet@eurecom.fr

ABSTRACT

In this paper, we present EURECOM's approach to address the MediaEval 2017 Predicting Media Interestingness Task. We developed models for both the image and video subtasks. In particular, we investigate the usage of media genre information (i.e., drama, horror, etc.) to predict interestingness. Our approach is related to the affective impact of media content and is shown to be effective in predicting interestingness for both video shots and key-frames.

1 INTRODUCTION

Multimedia interestingness prediction aims to automatically analyze media data and identify the most attractive content. Previous works have focused on predicting media interestingness directly from the multimedia content [3, 6–8]. However, media interestingness prediction is still an open challenge in the computer vision community [4, 5] due to the gap between low-level perceptual features and high-level human perception of the data. Recent research has shown that perceived interestingness is highly correlated with the emotional content of the data [9, 14]. Indeed, humans may prefer "affective decisions" to find interesting content because emotional factors directly reflect the viewer's attention. Hence, an affective representation of video content will be useful for identifying the most important parts in a movie. In this work, we hypothesize that the emotional impact of the movie genre can be a factor for the perceived interestingness of a video for a given viewer. Therefore, we adopt a mid-level representation based on video genre recognition. We propose to represent each sample as a distribution over genres (action, drama, horror, romance, sci-fi). For instance, a high confidence for the horror label inside the shot genre distribution could be perceived as more emotional (scary in this case). Therefore, this shot might be more characteristic, and thus more interesting, than a neutral genre that could appear in any shot.

The Predicting Media Interestingness challenge is organized at MediaEval 2017. The task consists of two subtasks for the prediction of image and video interestingness respectively. The first one involves predicting the most interesting key frames. The second one involves the automatic prediction of interestingness for different shots in a trailer. For more details about the task description, the related dataset and the experimental setting, we refer the reader to the task overview paper [2]. The rest of the paper is organized as follows: Section 2 describes our proposed method, Section 3 presents experiments and results, and finally Section 4 concludes the work and gives some perspectives.

2 METHOD

Extracting genre information from movie scenes results in an intermediate representation that may be quite useful for further classification tasks. In this section, we briefly present our method for media interestingness prediction. Figure 1 gives a brief overview of the entire framework. At first, we extract deep visual and acoustic features for each shot. We then obtain a genre prediction for each modality, which is finally used to train an interestingness classifier.

Figure 1: Framework of the proposed interestingness prediction method.

2.1 Media Genre representation

The genre prediction model is based on audio-visual deep features. Using these features, we trained two genre classifiers: a Deep Neural Network (DNN) on deep visual features and an SVM on deep acoustic features.

The dataset [12] used to train our genre model originally contains 4 different movie genres: action, drama, horror and romance. We extended the dataset with an additional genre to obtain a more sophisticated genre representation for each movie trailer shot. Our final dataset comprises 415 movie trailers of 5 genres (69 trailers for action, 95 for drama, 99 for horror, 80 for romance and 72 for sci-fi). Each movie trailer is segmented into visual shots using the PySceneDetect tool (http://pyscenedetect.readthedocs.io/en/latest/). The visual shots are automatically obtained by comparing HSV histograms of consecutive video frames (a high histogram distance results in a shot boundary). We also segment each video into audio shots using the OpenSmile Voice Activity Detection tool (https://github.com/naxingyu/opensmile/tree/master/scripts/vad). The tool automatically determines speaker cues in the audio stream, which we use as acoustic shot boundaries. In total, we trained our two genre predictor models on 29,151 visual and 26,144 audio shots. The visual shots are represented by key-frames; we select the middle frame of a shot as its key-frame. Visual features are extracted from these key-frames using a pretrained VGG-16 network [11]. By removing the last 2 layers, the output is a 4096-dimensional feature vector for each key-frame. This single feature vector represents the visual information that we obtain for each shot/key-frame.
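As a rough illustration, the key-frame descriptor described above could be computed as in the following minimal sketch, assuming TensorFlow/Keras and its ImageNet-pretrained VGG-16; the helper name keyframe_descriptor is ours and not part of the original pipeline.

<pre>
# Minimal sketch of the key-frame feature extraction, assuming TensorFlow/Keras
# and its ImageNet-pretrained VGG-16. Dropping the last two layers of VGG-16
# leaves the 4096-d activation of the first fully connected layer ('fc1').
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image

base = VGG16(weights="imagenet")                         # full 1000-class network
extractor = Model(inputs=base.input,
                  outputs=base.get_layer("fc1").output)  # 4096-d descriptor

def keyframe_descriptor(path):
    """Load one key-frame image and return its 4096-d VGG-16 descriptor."""
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(x)[0]                       # shape: (4096,)
</pre>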
Figure 2: DNN architecture for key-frame genre prediction.

2.1.1 Visual feature learning. We use the DNN architecture proposed in [12] to make genre predictions on visual features; the architecture is shown in Figure 2. Dropout regularization is used to avoid overfitting and to optimize training performance. The output is squashed into a probability vector over the 5 genres using Softmax. We use mini-batch stochastic gradient descent with a batch size of 32 to train the network. Categorical cross entropy is used as the loss function and we train the network over 50 epochs.
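A minimal sketch of such a genre classifier in Keras is given below. The training settings (SGD, batch size 32, categorical cross-entropy, 50 epochs, dropout, softmax over 5 genres) follow the text above, while the hidden layer sizes are illustrative assumptions, since the exact architecture of [12] is only given in Figure 2.

<pre>
# Sketch of the key-frame genre DNN, assuming TensorFlow/Keras.
# Hidden layer sizes are illustrative assumptions; the optimizer, loss,
# dropout, softmax output and training schedule follow Section 2.1.1.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import SGD

def build_genre_dnn(input_dim=4096, num_genres=5):
    model = Sequential([
        Dense(512, activation="relu", input_shape=(input_dim,)),
        Dropout(0.5),                             # regularization against overfitting
        Dense(128, activation="relu"),
        Dropout(0.5),
        Dense(num_genres, activation="softmax"),  # probability vector over the 5 genres
    ])
    model.compile(optimizer=SGD(), loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_genre_dnn()
# model.fit(train_vgg_features, train_genre_labels, batch_size=32, epochs=50)
</pre>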
2.1.2 Acoustic feature learning. Audio information plays an important role for content analysis in videos. Most of the approaches in related work only focus on hand-crafted audio features such as the Mel Frequency Cepstrum Coefficients (MFCC) or spectrograms, with either traditional or deep classifiers. However, those audio features are rather low-level representations and are not designed for semantic video analysis. Instead of using such classical audio features, we extract deep audio features from a pretrained model called SoundNet [1]. The latter has been learned by transferring knowledge from vision to sound to ultimately recognize objects and scenes in sound data. According to the work of Aytar et al. [1], an audio feature representation using SoundNet reaches state-of-the-art accuracy on three standard acoustic scene classification datasets. In our work, features are extracted from the fifth convolutional layer of the 8-layer version of the SoundNet model. For the training on audio features, we used a probabilistic SVM with a linear kernel and a regularization value of C = 1.0.
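As a small sketch (assuming scikit-learn, with the SoundNet conv5 activations already extracted into an array), the acoustic genre classifier could look as follows; the variable names are ours.

<pre>
# Probabilistic linear SVM on SoundNet conv5 features, assuming scikit-learn.
# The linear kernel and C = 1.0 follow Section 2.1.2; feature extraction itself
# is taken as given (one SoundNet conv5 vector per acoustic shot).
from sklearn.svm import SVC

audio_genre_svm = SVC(kernel="linear", C=1.0, probability=True)
# audio_genre_svm.fit(soundnet_features_train, genre_labels_train)
# audio_genre_probs = audio_genre_svm.predict_proba(soundnet_features_test)  # (n_shots, 5)
</pre>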
2.2 Interestingness classification

Our genre model can be used for both the image and video subtasks. Indeed, we train two separate genre classifiers (i.e., one based on audio and one based on visual features). Therefore, we end up with two probability vector outputs for the visual and audio inputs respectively. In order to obtain the final genre distribution for the video shots, we simply take the mean of both probability vectors. This probabilistic genre distribution is our mid-level representation and thus serves as the input for the actual interestingness classifier. A Support Vector Machine (SVM) binary classifier is then trained on these features to predict, with a confidence score, whether a shot/image is considered interesting or not. For the video subtask, we also performed experiments using only the visual information of the video shots. For this we used the genre prediction model based on the VGG features extracted from the video key-frames. To evaluate the performance of our interestingness model, we tested several SVM kernels (linear, RBF and sigmoid) with different parameters on the development dataset. Many of the configurations explored by grid search over the kernel parameters tended to classify almost all samples as non-interesting. This may be due to the imbalanced labels of the training data. Hence, we opted for a weighted version of SVM classification where the minority class receives a higher misclassification penalty. We also take into account the confidence scores of the development set samples during training by giving a larger penalty to samples with high confidence scores, and a smaller penalty to samples with low confidence scores.
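A minimal sketch of this fusion and weighted training step with scikit-learn is shown below; class_weight="balanced" is one concrete way of penalizing the minority class more heavily, and the sigmoid-kernel parameters are those of our best video run in Table 1, so both should be read as illustrative choices rather than the exact implementation.

<pre>
# Sketch of the interestingness classifier of Section 2.2, assuming scikit-learn.
import numpy as np
from sklearn.svm import SVC

def fuse_genre_distributions(visual_probs, audio_probs):
    """Mean of the visual and acoustic genre probability vectors (n_samples x 5)."""
    return (np.asarray(visual_probs) + np.asarray(audio_probs)) / 2.0

# class_weight="balanced" gives the minority (interesting) class a larger penalty;
# the kernel setting mirrors the best video run reported in Table 1.
interestingness_svm = SVC(kernel="sigmoid", gamma=0.2, C=100,
                          class_weight="balanced", probability=True)

# genre_repr_dev = fuse_genre_distributions(visual_probs_dev, audio_probs_dev)
# interestingness_svm.fit(genre_repr_dev, labels_dev, sample_weight=confidence_dev)
# scores = interestingness_svm.predict_proba(genre_repr_test)[:, 1]  # interestingness confidence
</pre>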
3 EXPERIMENTS AND RESULTS

The evaluation results of our models on the test data provided by the organizers are shown below. We submitted two runs for the image classification task and five for the video classification task. Table 1 reports the MAP and MAP@10 scores for our various model configurations as returned by the task organizers.

Table 1: Official evaluation results on test data

  Task   Run  Classifier                                MAP     MAP@10
  Image  1    SVM, sigmoid kernel                       0.2029  0.0587
  Image  2    SVM, linear kernel                        0.2016  0.0579
  Video  1    SVM, sigmoid kernel (gamma=0.5, C=100)    0.2034  0.0717
  Video  2    SVM, polynomial kernel (degree=3)         0.1960  0.0732
  Video  3    SVM, polynomial kernel (degree=2)         0.1964  0.0640
  Video  4    SVM, sigmoid kernel (gamma=0.2, C=100)    0.2094  0.0827
  Video  5    SVM, sigmoid kernel (gamma=0.3, C=100)    0.2002  0.0774

For the image subtask, the MAP values are quite similar for the linear and sigmoid SVM kernels. For the video subtask, decent MAP values are already achieved with visual key-frame classification alone (runs 2 and 3). When using both modalities (runs 1, 4 and 5), averaging audio and video genre predictions, results show a slight performance gain. However, we obtain a larger improvement when looking at the MAP@10 scores: here, employing both modalities outperforms the pure key-frame classification. Overall, an SVM with a sigmoid kernel seems more effective for the audio-visual submission than a linear or polynomial kernel. Yet, we have only looked at SVM models in our experiments. Further improvements could be achieved by trying out different models, as has been done in related work [10, 13, 15]. Also, it would be interesting to apply genre prediction on all/multiple shot frames instead of employing a single key-frame. In general, we have shown that our approach is capable of making useful scene suggestions, even if we do not consider it ready for commercial use yet.

4 CONCLUSION

In this paper, we presented a framework for predicting image and video interestingness that includes a genre recognition system as a mid-level representation of the data. Our best results on the test set were MAP scores of 0.2029 and 0.2094 for the image and video subtasks respectively. The obtained results are promising, especially for the video subtask. Future work includes the joint learning of audio-visual features and the integration of temporal information to describe the evolution of audio-visual features over video frames.

ACKNOWLEDGMENTS

The research leading to this paper was partially supported by Bpifrance within the NexGenTV Project (F1504054U). The Titan Xp used for this research was donated by the NVIDIA Corporation.

REFERENCES

[1] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. 2016. SoundNet: Learning sound representations from unlabeled video. In Proceedings of Advances in Neural Information Processing Systems. 892–900.
[2] Claire-Hélène Demarty, Mats Viktor Sjöberg, Bogdan Ionescu, Thanh-Toan Do, Hanli Wang, Ngoc Q. K. Duong, Frédéric Lefebvre, and others. 2017. Media interestingness at MediaEval 2017. In Proceedings of the MediaEval 2017 Workshop, Dublin, Ireland, September 13-15, 2017.
[3] Yanwei Fu, Timothy M. Hospedales, Tao Xiang, Shaogang Gong, and Yuan Yao. 2014. Interestingness prediction by robust learning to rank. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, September 6-12, 2014. 488–503.
[4] Michael Gygli, Helmut Grabner, Hayko Riemenschneider, Fabian Nater, and Luc Van Gool. 2013. The interestingness of images. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, December 1-8, 2013. 1633–1640.
[5] Michael Gygli, Helmut Grabner, and Luc Van Gool. 2015. Video summarization by learning submodular mixtures of objectives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Santiago, Chile, December 11-18, 2015. 3090–3098.
[6] Michael Gygli and Mohammad Soleymani. 2016. Analyzing and Predicting GIF Interestingness. In Proceedings of ACM Multimedia, Amsterdam, The Netherlands, October 15-19, 2016. New York, NY, USA, 122–126.
[7] Yu-Gang Jiang, Yanran Wang, Rui Feng, Xiangyang Xue, Yingbin Zheng, and Hanfang Yang. 2013. Understanding and Predicting Interestingness of Videos. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, Bellevue, Washington, July 14-18, 2013.
[8] Yang Liu, Zhonglei Gu, Yiu-ming Cheung, and Kien A. Hua. 2017. Multi-view Manifold Learning for Media Interestingness Prediction. In Proceedings of the ACM International Conference on Multimedia Retrieval, Bucharest, Romania, June 6-9, 2017. New York, NY, USA, 308–314.
[9] Soheil Rayatdoost and Mohammad Soleymani. 2016. Ranking Images and Videos on Visual Interestingness by Visual Sentiment Features. In Proceedings of the MediaEval 2016 Workshop, Hilversum, Netherlands, October 20-21, 2016.
[10] G. S. Simoes, J. Wehrmann, R. C. Barros, and D. D. Ruiz. 2016. Movie genre classification with Convolutional Neural Networks. In 2016 International Joint Conference on Neural Networks (IJCNN). 259–266.
[11] K. Simonyan and A. Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. Technical report, CoRR abs/1409.1556 (2014).
[12] K. S. Sivaraman and Gautam Somappa. 2016. MovieScope: Movie trailer classification using Deep Neural Networks. University of Virginia (2016).
[13] John R. Smith, Dhiraj Joshi, Benoit Huet, Winston Hsu, and Jozef Cota. 2017. Harnessing A.I. for Augmenting Creativity: Application to Movie Trailer Creation. In Proceedings of ACM Multimedia, Mountain View, CA, USA, October 23-27, 2017.
[14] Mohammad Soleymani. 2015. The Quest for Visual Interest. In Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, Australia, October 26-30, 2015. New York, NY, USA, 919–922.
[15] Sejong Yoon and Vladimir Pavlovic. 2014. Sentiment Flow for Video Interestingness Prediction. In Proceedings of the 1st ACM International Workshop on Human Centered Event Understanding from Multimedia (HuEvent '14). New York, NY, USA, 29–34.