Technicolor@MediaEval 2016 Predicting Media Interestingness Task

Yuesong Shen (1,2), Claire-Hélène Demarty (2), Ngoc Q. K. Duong (2)
(1) École polytechnique, France
(2) Technicolor, France
yuesong.shen@polytechnique.edu
claire-helene.demarty@technicolor.com
quang-khanh-ngoc.duong@technicolor.com

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, October 20-21, 2016, Hilversum, Netherlands.

ABSTRACT

This paper presents the work done at Technicolor regarding the MediaEval 2016 Predicting Media Interestingness Task, which aims at predicting the interestingness of individual images and video segments extracted from Hollywood movies. We participated in both the image and video subtasks.

1. INTRODUCTION

The MediaEval 2016 Predicting Media Interestingness Task aims at predicting the level of interestingness of multimedia content, i.e., frames and/or video excerpts from Hollywood-type movies. The task is divided into two subtasks depending on the type of content, i.e., images or video segments. A complete description of the task can be found in [2].

For the image subtask, Technicolor's contribution is twofold: a Support Vector Machine (SVM)-based system is compared with several deep neural network (DNN) structures: multi-layer perceptrons (MLP), residual networks (ResNet) and highway networks. For the video subtask, we compare different DNN-based systems, including both the existing Long Short-Term Memory (LSTM) combined with ResNet blocks and a proposed architecture named Circular State-Passing Recurrent Neural Network (CSP-RNN).

The paper is divided into two main parts corresponding to the two subtasks. Before this, Section 2 describes the features used. In each subtask's section, we present the systems built, then give some details on the derived runs. The results for the two subtasks are discussed in Section 5.

2. FEATURES

For both subtasks, input features for the visual modality are the CNN features extracted from the fc7 layer of the pre-trained CaffeNet model [5]. They were computed for each image of the dataset, after the image was resized so that its smaller dimension is 227 (preserving the aspect ratio) and then center-cropped. The mean image that comes with the Caffe model was also subtracted for normalization purposes. The final feature dimension is 4096 per image, or per video frame for the video subtask.

For the audio modality, 60 Mel Frequency Cepstral Coefficients (MFCCs) concatenated with their first and second derivatives were extracted from windows of size 80 ms centered around each frame, resulting in an audio feature vector of size 180 per video frame. The mean value of the feature vector over the whole dataset is then subtracted for normalization purposes.

3. IMAGE SUBTASK

For the image subtask, the philosophy was to experiment with several DNN structures and to compare them with an SVM-based baseline. For all system types and both subtasks (image and video), the best parameter configurations were chosen by splitting the development set (either MediaEval data or external data) into a training set (80%) and a validation set (20%). As the MediaEval development dataset is not very large, we performed a final training on the whole development set when building the final models, except for the CSP-based runs, for which this final training was omitted.

SVM-based system. We tested SVMs with different kernels (linear, RBF and polynomial) and with different parameter settings on the development dataset. We observed that the validation accuracy varies from one run to another, even for the same parameter setting, as the training samples change due to the random partition into training and validation sets. This may suggest that the dataset is not large enough. Also, because of the class imbalance, the validation accuracy used for training makes it difficult to choose the best SVM configuration during grid search, whereas the actual target is the optimization of the official MAP metric.

MLP-based system. Several variations of network structures were tested, with different numbers of layers, layer sizes, activation functions and topologies, among which simple MLP, residual [3] and highway [7] networks. These structures were first trained on a balanced dataset of 200,000 images, extracted using the Flickr interestingness API [1]. As this API uses some social metadata associated with the content, it may lead to a social definition of interestingness instead of a more content-driven one, which may bias the system performance on the official dataset. The best performance in terms of accuracy on the Flickr dataset was obtained by a simple structure: a first dense layer of size 1000 with rectified linear unit (ReLU) activation and a dropout of 0.5, followed by a final softmax layer. This structure was then re-trained on the MediaEval image development dataset, with the addition of some resampling or upsampling steps on the training data to cope with the imbalance of the two classes in the official dataset. During resampling, a training sample is selected randomly from one class or the other depending on a probability fixed beforehand. Upsampling consists of putting multiple occurrences of each interesting sample into the list of training data, so that interesting samples are potentially used multiple times during training. In both cases, different probabilities (0.3 to 0.6 for the interesting class) or upsampling proportions (5 to 13 times more interesting samples) were tested.

Run #1: SVM-based. An SVM from the Python Scikit-learn package (http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC) is used, with an RBF kernel, gamma = 1/n_features, C = 0.1 and default parameter settings elsewhere. An upsampling strategy that enlarges the set of interesting samples by a factor of 11 is used.

Run #2: MLP-based. A simple structure with 2 layers of sizes (1000, 2) (cf. Section 3) was selected for its performance on the Flickr dataset. The best performance with this structure was obtained with a learning rate of 0.1, a decay rate of 0.1, the ReLU activation function, the Adadelta optimization method and a maximum of 10 epochs. Resampling with a probability of 0.6 for the interesting class gives the best MAP value on the MediaEval development set.

Table 1: Results in terms of MAP values.

Runs - image       MAP       Runs - video       MAP
Run #1             0.2336    Run #3             0.1465
Run #2             0.2315    Run #4             0.1365
                             Run #5             0.1618
Random baseline    0.1656    Random baseline    0.1496

4. VIDEO SUBTASK

Different DNN structures capable of handling the temporal evolution of the data were tested, with variations of size and depth. We also investigated the performance of the different modalities taken separately vs. in a multimodal approach.

Systems based on existing structures

Different simple RNN and LSTM [4] structures were tested by varying their number and size of layers, as they are well known to be able to handle the temporal aspect of the data. We also experimented with the idea of ResNet (recently proposed for CNNs) in our implementation with RNN and LSTM. Monomodal systems (audio-only, visual-only) were also compared to multimodal (audio+visual) ones. For the latter, a mid-level multimodal fusion was implemented, i.e., each modality was first processed independently through one or more layers before being merged and processed by some additional layers, as illustrated in Figure 1. The best structures and sets of parameters were chosen while training on the Flickr part of the Jiang dataset [6]. Contrary to the MediaEval dataset, this dataset contains 1200 longer videos, equally balanced between the two classes. Once the structures and parameters were chosen, some upsampling/resampling process was applied while re-training on the MediaEval dataset.

Figure 1: Proposed multimodal architecture.

Run #3: LSTM-ResNet-based. The best structure obtained while validating on the Jiang dataset corresponds to Figure 1, with a multimodal processing part composed of a residual block built upon 2 LSTM layers of size 436. When re-training on the MediaEval dataset, upsampling with a factor of 9 was applied to the input samples.

Systems based on the proposed CSP

Figure 2 illustrates the philosophy of this new structure, which can be seen as a generalization of the traditional RNN: at a time instance t, N samples of the input sequence go through N recurrent nodes arranged in a circular structure (which makes it possible to take into account both the past and the future over a temporal window of size N) to produce N internal outputs. These outputs are then combined to form a final output at t. This architecture targets a better modeling of temporal events in videos by considering several successive temporal samples and states to produce an output at each time instance, whereas an RNN takes one input sample and one state to produce an output at each time instance. The CSP-RNN was trained directly on the MediaEval dataset, and for three fixed configurations only: audio, video and multimodal.

Figure 2: Proposed CSP-RNN structure.

Run #4: video and CSP-based. For the audio and video configurations, the features were used directly as input to the CSP network, with a dimension reduction from 4096 to 256 for the video, obtained with a simple dense layer. No upsampling/resampling was applied. During validation on the MediaEval dataset, the audio-only CSP configuration gave lower performance than the video and multimodal configurations. We therefore kept the video system for Run #4.

Run #5: multimodal and CSP-based. For the multimodal configuration, the same framework as in Figure 1 was kept, with the multimodal processing part replaced by a CSP of size 5. No upsampling/resampling was tested.

5. RESULTS AND DISCUSSION

The obtained results are reported in terms of MAP in Table 1, with some baseline values computed from a random ranking of the test data. At least on the image subtask, our systems perform significantly better than the baseline. For the video subtask, MAP values are lower, and we may wonder whether these performances come in part from the difficulty of the task itself or from the dataset, which contains a significant number of very short shots that were certainly difficult to annotate.

For the image subtask, we observed that simple SVM systems perform similarly to (development set) and even slightly better than (test set) more sophisticated DNNs, leading to the conclusion that the dataset was probably not large enough for DNN training. This is also supported by our test on the external Flickr dataset containing 200,000 images, for which the DNN reached more than 80% accuracy.

For the video subtask, several conclusions may be drawn. First, multimodality seems to bring some benefit to the task (this was confirmed by some additional runs that were not submitted). Second, the new CSP structure seems to be able to capture the temporal evolution of the videos better than classic RNN and more sophisticated LSTM-ResNet structures, and this independently of the monomodal branches, which were the same in both cases. These very first results support the need for further testing of this new structure in the future.

6. REFERENCES

[1] Flickr interestingness API. https://www.flickr.com/services/api/flickr.interestingness.getList.html.
[2] C.-H. Demarty, M. Sjöberg, B. Ionescu, T.-T. Do, H. Wang, N. Q. Duong, and F. Lefebvre. MediaEval 2016 predicting media interestingness task. In Proc. of the MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 20-21, 2016.
[3] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[4] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, Nov. 1997.
[5] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[6] Y.-G. Jiang, Y. Wang, R. Feng, X. Xue, Y. Zheng, and H. Yang. Understanding and predicting interestingness of videos. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, AAAI'13, pages 1113–1119. AAAI Press, 2013.
[7] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. CoRR, abs/1505.00387, 2015.
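
As an illustration of the two class-balancing strategies described in Section 3 (upsampling and resampling), the following is a minimal sketch on toy data, not the implementation used for the submitted runs. The function names `upsample` and `resample`, the toy samples, and the interpretation of an upsampling factor k as "each interesting sample appears k times in the training list" are assumptions made for this example only.

```python
import random


def upsample(samples, labels, factor):
    """Return a training list where each interesting sample (label 1)
    appears `factor` times and each other sample appears once."""
    out = []
    for x, y in zip(samples, labels):
        copies = factor if y == 1 else 1
        out.extend([(x, y)] * copies)
    return out


def resample(samples, labels, p_interesting, n_draws, rng):
    """Draw `n_draws` training samples. Each draw first picks the class
    (interesting with probability `p_interesting`), then picks a sample
    uniformly at random within that class."""
    pos = [x for x, y in zip(samples, labels) if y == 1]
    neg = [x for x, y in zip(samples, labels) if y == 0]
    drawn = []
    for _ in range(n_draws):
        if rng.random() < p_interesting:
            drawn.append((rng.choice(pos), 1))
        else:
            drawn.append((rng.choice(neg), 0))
    return drawn


# Toy imbalanced data: 2 interesting samples out of 10 (hypothetical values).
samples = list(range(10))
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Upsampling with factor 11 (the factor reported for Run #1).
upsampled = upsample(samples, labels, factor=11)

# Resampling with probability 0.6 for the interesting class (as in Run #2).
rng = random.Random(0)  # fixed seed so the sketch is reproducible
resampled = resample(samples, labels, p_interesting=0.6, n_draws=1000, rng=rng)
```

In this sketch, upsampling changes the composition of a fixed training list, whereas resampling fixes the class proportion seen during training regardless of the original imbalance.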
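
To make the MAP values of Table 1 and the random-ranking baseline of Section 5 more concrete, the sketch below computes a standard average precision over a ranked list and averages it over groups of items. It is only an illustration: the exact definition of the official metric is given in the task description [2], and the grouping of items (for example, one group per movie), the function names and the toy scores are assumptions of this example.

```python
import random


def average_precision(scores, labels):
    """Average precision of a ranking: items are sorted by decreasing
    score, and precision is averaged over the ranks of interesting items."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(groups):
    """Mean of the average precision over groups of (scores, labels)."""
    aps = [average_precision(scores, labels) for scores, labels in groups]
    return sum(aps) / len(aps)


def random_baseline_map(label_groups, rng):
    """MAP obtained when every item receives a random score."""
    groups = [([rng.random() for _ in labels], labels) for labels in label_groups]
    return mean_average_precision(groups)


# Interesting items at ranks 1 and 3: AP = (1/1 + 2/3) / 2 = 5/6.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])

# Random ranking on two hypothetical groups of labels.
baseline = random_baseline_map(
    [[1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 1, 0]], random.Random(0)
)
```

A single random ranking gives a noisy estimate; averaging `random_baseline_map` over several seeds would give a more stable baseline value.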