Technicolor@MediaEval 2016 Predicting Media Interestingness Task

Yuesong Shen

Claire-Hélène Demarty

Ngoc Q. K. Duong

École polytechnique

France

Technicolor

MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands

This paper presents the work done at Technicolor regarding the MediaEval 2016 Predicting Media Interestingness Task, which aims at predicting the interestingness of individual images and video segments extracted from Hollywood movies. We participated in both the image and video subtasks.

1. INTRODUCTION

The MediaEval 2016 Predicting Media Interestingness Task aims at predicting the level of interestingness of multimedia content, i.e., frames and/or video excerpts from Hollywood-type movies. The task is divided into two subtasks depending on the type of content, i.e., images or video segments. A complete description of the task can be found in [2].

For the image subtask, Technicolor's contribution is twofold: a Support Vector Machine (SVM)-based system is compared with several deep neural network (DNN) structures, namely multi-layer perceptrons (MLP), residual networks (ResNet) and highway networks. For the video subtask, we compare different DNN-based systems, including both an existing structure, Long Short-Term Memory (LSTM) with ResNet blocks, and a proposed architecture named Circular State-Passing Recurrent Neural Network (CSP-RNN).

The paper is divided into two main parts corresponding to the two subtasks. Before this, Section 2 describes the features used. In each subtask's section, we present the systems built, then give some details on the derived runs. The results for the two subtasks are discussed in Section 5.

2. FEATURES

For both subtasks, input features for the visual modality are the CNN features extracted from the fc7 layer of the pre-trained CaffeNet model [5]. They were computed for each image of the dataset, after the image was resized so that its smaller dimension is 227 pixels, preserving the aspect ratio, and then center-cropped. The mean image that comes with the Caffe model was also subtracted for normalization purposes. The final feature dimension is 4096 per image, or per video frame for the video subtask.
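The resizing and cropping geometry described above can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, and the 227x227 crop size is assumed from the input size CaffeNet expects rather than stated in the text.

```python
def resize_and_center_crop_box(width, height, target=227):
    """Return the resized (width, height) and the (left, top, right, bottom) crop box."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # One scale factor for both sides, so the aspect ratio is preserved
    # and the smaller side ends up at `target` pixels.
    scale = target / min(width, height)
    # max() guards against rounding the smaller side just below the target.
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    # Center the square crop window inside the resized image.
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)


# A 640x480 frame is resized to 303x227, then cropped to the central 227x227 region.
size, box = resize_and_center_crop_box(640, 480)
```

The crop only removes pixels along the longer side, so square images pass through unchanged.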

For the audio modality, 60 Mel Frequency Cepstral Coefficients (MFCCs) concatenated with their first and second derivatives were extracted from windows of size 80 ms centered around each frame, resulting in an audio feature vector of size 180 per video frame. The mean value of the feature vector over the whole dataset is then subtracted for normalization purposes.
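As a rough illustration of the windowing and normalization just described, the sketch below computes the audio sample range of an 80 ms window centered on a video frame, the resulting feature dimension, and the dataset-level mean subtraction. The frame rate and sampling rate used in the example (25 fps, 16 kHz) are assumptions, as are all function names; the MFCC extraction itself is not shown.

```python
N_MFCC = 60
# Static coefficients plus their first and second derivatives.
FEATURE_DIM = N_MFCC * 3


def audio_window(frame_index, fps, sample_rate, window_ms=80):
    """Return the [start, end) audio sample range centered on a video frame."""
    center = round(frame_index / fps * sample_rate)
    half = round(window_ms * sample_rate / 2000)  # half the window, in samples
    return max(0, center - half), center + half   # clamp at the start of the track


def subtract_dataset_mean(features):
    """Subtract the per-dimension mean computed over all feature vectors."""
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / len(features) for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in features]


# Frame 25 of a 25 fps video with 16 kHz audio (both values assumed).
start, end = audio_window(25, fps=25, sample_rate=16000)
```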

3. IMAGE SUBTASK

For the image subtask, the philosophy was to experiment with several DNN structures and to compare them with an SVM-based baseline. For all system types and both subtasks (image and video), the best parameter configurations were chosen by splitting the development set (either MediaEval data or external data) into training (80%) and validation (20%) sets. As the MediaEval development dataset is not very large, we performed a final training on the whole development set when building the final models, except for the CSP-based runs, for which this final training was omitted.
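A minimal sketch of such an 80%/20% random partition is given below. The function name and the seeding are assumptions made for reproducibility of the example; the text does not say whether the split was drawn per sample or per video.

```python
import random


def split_dev_set(samples, train_fraction=0.8, seed=None):
    """Randomly partition samples into training and validation lists."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical development set of 100 sample identifiers.
train, val = split_dev_set(range(100), seed=0)
```

Changing the seed changes which samples land in each part, which is the source of the run-to-run variation one can expect on a small dataset.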

SVM-based system. We tested SVMs with different kernels (linear, RBF and polynomial) and different parameter settings on the development dataset. We observed that the validation accuracy varies from one run to another, even for the same parameter setting, because the training samples change with the random partition into training and validation sets. This may suggest that the dataset is not large enough. Also, because of the class imbalance, relying on the validation accuracy makes it difficult to choose the best SVM configuration during grid search, whose actual target is the official MAP metric.
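The effect of class imbalance on accuracy can be illustrated with a toy example. The class proportion below (10% interesting samples) is hypothetical and chosen only for the illustration; it is not the proportion of the MediaEval dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive samples that were predicted positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical validation set: 10 interesting samples out of 100.
y_true = [1] * 10 + [0] * 90
# Trivial classifier that labels everything as non-interesting.
all_negative = [0] * 100

acc = accuracy(y_true, all_negative)  # high, despite the classifier being useless
rec = recall(y_true, all_negative)    # no interesting sample is retrieved
```

A configuration can therefore score well on accuracy while being of no help for a ranking metric such as MAP.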

MLP-based system. Several variations of network structures were tested, with different numbers of layers, layer sizes, activation functions and topologies, among which simple MLP, residual [3] and highway [7] networks. These structures were first trained on a balanced dataset of 200,000 images, extracted using the Flickr interestingness API [1]. As this API uses some social metadata associated with the content, it may lead to a social definition of interestingness, instead of a more content-driven interestingness, which may bias the system performance on the official dataset. The best performance in terms of accuracy on the Flickr dataset was obtained with a simple structure: a first dense layer of size 1000 with rectified linear unit (ReLU) activation and a dropout of 0.5, followed by a final softmax layer. This structure was then re-trained on the MediaEval image development dataset, with the addition of a resampling or upsampling step on the training data to cope with the imbalance of the two classes in the official dataset. During resampling, a training sample is selected randomly from one class or the other according to a probability fixed beforehand. Upsampling consists of putting multiple occurrences of each interesting sample into the list of training data, so that interesting samples are potentially used multiple times during training. In both cases, different probabilities (0.3 to 0.6 for the interesting class) or upsampling proportions (5 to 13 times more interesting samples) were tested.
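The two strategies can be sketched as follows. This is one possible reading of the description, not the implementation used for the runs: the function names are hypothetical, labels are assumed to be 1 (interesting) and 0 (non-interesting), and the upsampling factor is interpreted as the number of copies of each interesting sample.

```python
import random


def upsample(samples, labels, factor, positive=1):
    """Repeat every interesting sample `factor` times in the training list."""
    out_samples, out_labels = [], []
    for sample, label in zip(samples, labels):
        copies = factor if label == positive else 1
        out_samples.extend([sample] * copies)
        out_labels.extend([label] * copies)
    return out_samples, out_labels


def resample(samples, labels, n_draws, p_interesting, seed=None):
    """Draw training samples, picking the interesting class with a fixed probability."""
    rng = random.Random(seed)
    pos = [s for s, l in zip(samples, labels) if l == 1]
    neg = [s for s, l in zip(samples, labels) if l == 0]
    drawn = []
    for _ in range(n_draws):
        if rng.random() < p_interesting:
            drawn.append((rng.choice(pos), 1))
        else:
            drawn.append((rng.choice(neg), 0))
    return drawn


# Toy data: one interesting sample among three.
up_samples, up_labels = upsample(["a", "b", "c"], [1, 0, 0], factor=3)
draws = resample(["a", "b", "c"], [1, 0, 0], n_draws=1000, p_interesting=0.6, seed=0)
```

Upsampling changes the class proportions deterministically, whereas resampling controls them only in expectation.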

Run #1: SVM-based. An SVM from the Python Scikit-learn package (see footnote 1) with an RBF kernel, gamma = 1/n_features, C = 0.1 and default settings for the other parameters is used. An upsampling strategy that enlarges the set of interesting samples by a factor of 11 is applied.
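To make the kernel setting concrete, the sketch below evaluates an RBF kernel, k(x, y) = exp(-gamma * ||x - y||^2), with gamma defaulting to 1/n_features. It is a plain-Python illustration with a hypothetical function name, not the Scikit-learn code; in Scikit-learn's SVC this choice of gamma corresponds to the `gamma="auto"` setting.

```python
import math


def rbf_kernel(x, y, gamma=None):
    """RBF kernel value between two feature vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of features")
    if gamma is None:
        gamma = 1.0 / len(x)  # gamma = 1 / n_features
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Two 2-dimensional toy vectors: gamma = 0.5 and squared distance = 2.
k = rbf_kernel([0.0, 0.0], [1.0, 1.0])
```

With 4096-dimensional fc7 features, this setting gives gamma = 1/4096, i.e., a rather wide kernel.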

Run #2: MLP-based. A simple structure with 2 layers of sizes (1000, 2) (cf. Section 3) was selected for its performance on the Flickr dataset. The best performance with this structure was obtained with a learning rate of 0.1, a decay rate of 0.1, the ReLU activation function, the Adadelta optimization method and a maximum of 10 epochs. Resampling with a probability of 0.6 for the interesting class gives the best MAP value on the MediaEval development set.
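The inference-time forward pass of such a two-layer structure (dense layer with ReLU, then softmax over the two classes) can be sketched in plain Python. The helper names are hypothetical, the weights are toy values, and dropout is omitted because it is only active during training; the actual runs use a 4096-dimensional input and a hidden layer of size 1000.

```python
import math


def dense(x, weights, bias):
    """Affine layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Hidden dense layer with ReLU, followed by a softmax output layer."""
    hidden = relu(dense(x, w_hidden, b_hidden))
    return softmax(dense(hidden, w_out, b_out))


# Toy example with 2 inputs, 2 hidden units and 2 output classes.
identity = [[1.0, 0.0], [0.0, 1.0]]
probs = mlp_forward([1.0, -1.0], identity, [0.0, 0.0], identity, [0.0, 0.0])
```

The second output component can serve as the interestingness score used for ranking, assuming class index 1 denotes the interesting class.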

4. VIDEO SUBTASK

Different DNN structures capable of handling the temporal evolution of the data were tested, with variations in size and depth. We also investigated the performance of the different modalities separately versus in a multimodal approach.

Systems based on existing structures

Different simple RNN and LSTM [4] structures were tested by varying the number and size of their layers, as they are well known to be able to handle the temporal aspect of the data. We also experimented with the idea of ResNet (recently proposed for CNNs) in our implementation with RNN and LSTM. Monomodal systems (audio-only, visual-only) were also compared with multimodal (audio+visual) ones. For the latter, a mid-level multimodal fusion was implemented, i.e., each modality was first processed independently through one or more layers before merging and processing by some additional layers, as illustrated in Figure 1. The best structures and sets of parameters were chosen while training on the Flickr part of the Jiang dataset [6]. Contrary to the MediaEval dataset, this dataset contains 1200 longer videos, equally balanced between the two classes. Once the structures and parameters were chosen, an upsampling/resampling process was applied while re-training on the MediaEval dataset.
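The data flow of a mid-level fusion can be sketched as below. The branches and the joint part are passed in as callables; the toy functions used here are hypothetical stand-ins for the learned layers, and merging by concatenation is an assumption, since the text only speaks of merging.

```python
def mid_level_fusion(audio, visual, audio_branch, visual_branch, joint):
    """Process each modality on its own, then merge and process jointly."""
    audio_repr = audio_branch(audio)      # modality-specific layers (audio)
    visual_repr = visual_branch(visual)   # modality-specific layers (visual)
    merged = list(audio_repr) + list(visual_repr)  # merge by concatenation
    return joint(merged)                  # additional joint layers


# Toy stand-ins for the learned layers (for illustration only).
def toy_audio_branch(x):
    return [sum(x) / len(x)]


def toy_visual_branch(x):
    return [max(x), min(x)]


def toy_joint(z):
    return sum(z)


score = mid_level_fusion([0.2, 0.4], [1.0, 3.0],
                         toy_audio_branch, toy_visual_branch, toy_joint)
```

In contrast, an early fusion would concatenate the raw features before any layer, and a late fusion would combine the scores of two separate monomodal systems.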

Run #3: LSTM-ResNet-based. The best structure obtained while validating on the Jiang dataset corresponds to Figure 1, with a multimodal processing part composed of a residual block built upon 2 LSTM layers of size 436. After re-training on the MediaEval dataset, upsampling with a factor of 9 was applied to the input samples.
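The residual principle, output = input + F(input), applied to a sequence can be sketched as follows. The layers are generic callables mapping a sequence to a sequence; the doubling layer in the example is a toy stand-in and not an LSTM. The identity shortcut shown here requires the block input and output to have the same size, which is an assumption about the structure used in the run.

```python
def residual_block(sequence, layers):
    """Add the block input to the output of the stacked layers, step by step."""
    out = sequence
    for layer in layers:
        out = layer(out)
    same_shape = len(out) == len(sequence) and all(
        len(o) == len(s) for o, s in zip(out, sequence)
    )
    if not same_shape:
        raise ValueError("the residual branch must preserve the sequence shape")
    return [
        [s + o for s, o in zip(step_in, step_out)]
        for step_in, step_out in zip(sequence, out)
    ]


# Toy sequence layer (stand-in for a recurrent layer): doubles every value.
def double(seq):
    return [[2.0 * v for v in step] for step in seq]


result = residual_block([[1.0, 2.0], [3.0, 4.0]], [double])
```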

Systems based on the proposed CSP

Figure 2 illustrates the philosophy of this new structure, which can be seen as a generalization of the traditional RNN: at a time instance t, N samples of the input sequence go through N recurrent nodes arranged in a circular structure (which allows both the past and the future to be taken into account over a temporal window of size N) to produce N internal outputs. These outputs are then combined to form a final output at t. This architecture targets a better modeling

Footnote 1: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC

Table 1: Results in terms of MAP.

Runs - image       MAP
Run #1             (missing)
Run #2             (missing)
Random baseline    0.1656

Runs - video       MAP
Run #3             (missing)
Run #4             (missing)
Run #5             (missing)
Random baseline    (missing)

5. RESULTS AND DISCUSSION

The obtained results are reported in terms of MAP in Table 1, together with baseline values computed from a random ranking of the test data. At least on the image subtask, our systems perform significantly better than the baseline. For the video subtask, MAP values are lower, and we may wonder whether this performance comes in part from the difficulty of the task itself or from the dataset, which contains a significant number of very short shots that were certainly difficult to annotate.
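A random-ranking baseline of this kind can be estimated as sketched below: the labels are shuffled repeatedly and the average precision of each random ranking is averaged. The function names, the number of trials and the toy class proportion (10% interesting items) are assumptions; the exact averaging used by the official metric is described in [2] and is not reproduced here.

```python
import random


def average_precision(ranked_labels):
    """Average precision of a ranked list of binary labels (1 = interesting)."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)  # precision at each interesting item
    return sum(precisions) / len(precisions) if precisions else 0.0


def random_baseline_map(labels, n_trials=200, seed=0):
    """Mean average precision over random rankings of the given labels."""
    rng = random.Random(seed)
    labels = list(labels)
    total = 0.0
    for _ in range(n_trials):
        rng.shuffle(labels)
        total += average_precision(labels)
    return total / n_trials


# Toy test set with 100 interesting items out of 1000.
baseline = random_baseline_map([1] * 100 + [0] * 900)
```

For a random ranking, the average precision stays close to the fraction of interesting items, so the baseline mostly reflects the class balance of the test data.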

For the image subtask, we observed that simple SVM systems perform similarly to (development set), and even slightly better than (test set), more sophisticated DNNs, leading to the conclusion that the dataset was probably not large enough for DNN training. This is also supported by our test on the external Flickr dataset containing 200,000 images, for which the DNN reached more than 80% accuracy.

For the video subtask, several conclusions may be drawn. First, multimodality seems to benefit the task (this was confirmed by some additional runs that were not submitted). Second, the new CSP structure seems to be able to capture the temporal evolution of the videos better than classic RNN and more sophisticated LSTM-ResNet structures, independently of the monomodal branches, which were the same in both cases. These very first results support the need for further testing of this new structure in the future.

6. REFERENCES

[1] Flickr interestingness API. https://www.flickr.com/services/api/flickr.interestingness.getList.html.

[2] C.-H. Demarty, M. Sjoberg, B. Ionescu, T.-T. Do, H. Wang, N. Q. K. Duong, and F. Lefebvre. MediaEval 2016 Predicting Media Interestingness Task. In Proc. of the MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 20-21, 2016.

[3] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

[4] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735-1780, Nov. 1997.

[5] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[6] Y.-G. Jiang, Y. Wang, R. Feng, X. Xue, Y. Zheng, and H. Yang. Understanding and predicting interestingness of videos. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, AAAI'13, pages 1113-1119. AAAI Press, 2013.

[7] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. CoRR, abs/1505.00387, 2015.