=Paper=
{{Paper
|id=Vol-1984/Mediaeval_2017_paper_19
|storemode=property
|title=Multimodality and Deep Learning when Predicting Media Interestingness
|pdfUrl=https://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_19.pdf
|volume=Vol-1984
|authors=Eloïse Berson,Claire-Hélène Demarty,Ngoc Q.K. Duong
|dblpUrl=https://dblp.org/rec/conf/mediaeval/BersonDD17
}}
==Multimodality and Deep Learning when Predicting Media Interestingness==
Eloïse Berson, Claire-Hélène Demarty, Ngoc Q. K. Duong
Technicolor, France
eloise.berson@gmail.com, {claire-helene.demarty,quang-khanh-ngoc.duong}@technicolor.com

ABSTRACT
This paper summarizes the computational models that Technicolor proposes to predict the interestingness of images and videos within the MediaEval 2017 Predicting Media Interestingness Task. Our systems are based on deep learning architectures and exploit both semantic and multimodal features. Based on the obtained results, we discuss our findings and draw some scientific perspectives for the task.

1 INTRODUCTION
Understanding the interestingness of media content such as images and videos has gained significant attention from the research community recently, as it offers numerous practical applications in, e.g., content selection or recommendation [1, 2, 5]. Following the success of the 2016 edition [4], the MediaEval 2017 Predicting Media Interestingness Task extends the benchmark to larger datasets, annotated with a greater human annotation effort. A complete description of the task can be found in [3].

For both subtasks, Technicolor's motivation was to build incrementally on last year's systems [11], i.e., to re-use similar features and DNN architectures while adding some contextual information about the content. To this end, two new features were added so as to capture additional semantic information related to the content, following a similar idea as in [8]. These new features (Section 2) were expected to bring contextual information related to the content. In a second step (Section 3), and for the video subtask only, several embeddings of this semantic information at different network levels were experimented with. The aim was to investigate how this choice influenced the temporal modeling of the new information.

2 MULTIMODALITY AND CONTEXTUAL FEATURES
As in 2016, CNN features from the fc7 layer of the pre-trained CaffeNet model (image modality, both subtasks) and MFCCs concatenated with their first and second derivatives (audio modality, video subtask) were extracted following the protocol described in [11]. The dimensions of these features are 4096 and 180, respectively.

To capture additional semantic information, Image-Captioning-Based (ICB) features [7] were computed for each image or frame, depending on the subtask. These features correspond to the projection of an image into a visual-semantic embedding space [7], obtained from a model jointly trained on images and captions and dedicated to automatic captioning. In this embedded space, where semantic distances between projected image and caption features are minimized, the resulting representations are more likely to contain semantic information than the CNN features alone. The dimension of the ICB feature is 1024.

To go further in this vein of adding semantic and contextual information, textual metadata was extracted directly from ImDB (see http://www.imdb.com), exploiting the fact that the MediaEval 2017 dataset was built from Hollywood-like movie extracts. Except for 3 movies (for 2 of them, a short summary was built from descriptions found on the internet; for the last one, the description was left empty), ImDB information was available: each movie description and/or storyline was fed to the RAKE algorithm [10] for keyword extraction. Several keywords were thus extracted per movie, from which we derived a textual feature of dimension 300, classically taking the Word2Vec [9] representations (pretrained on the GoogleNews dataset) of this batch of keywords and averaging them.
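A minimal sketch of this keyword-to-feature step is given below. It assumes the rake-nltk and gensim packages and an illustrative local path to the pretrained GoogleNews embeddings; the paper does not specify the authors' actual implementation, so this is only one plausible realization.

```python
# Sketch: RAKE keywords from a movie description, averaged into a 300-d Word2Vec feature.
# Assumes rake-nltk and gensim are installed and NLTK stopwords are available.
import numpy as np
from rake_nltk import Rake
from gensim.models import KeyedVectors

# Pretrained 300-d GoogleNews vectors; the path is illustrative.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def movie_textual_feature(description: str) -> np.ndarray:
    """Extract RAKE keywords from a description/storyline and average
    the Word2Vec vectors of their words into a single 300-d feature."""
    rake = Rake()
    rake.extract_keywords_from_text(description)
    keywords = rake.get_ranked_phrases()

    vectors = []
    for phrase in keywords:
        for word in phrase.split():
            if word in w2v:              # skip out-of-vocabulary words
                vectors.append(w2v[word])
    if not vectors:                      # e.g., the movie left with an empty description
        return np.zeros(300, dtype=np.float32)
    return np.mean(vectors, axis=0)
```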
3 DNN ARCHITECTURES
Global workflows for all submitted runs and for both subtasks are shown in Figure 1. As stated in the introduction, most components used to build the systems' architectures for both subtasks are the same as in [11]. Thus, to cope with the imbalance of the dataset, some resampling of the data was applied during training. Several parameter configurations were investigated by splitting the development set into 80% for training and 20% for validation. A final retraining of the best model was then applied on the complete development set.

Figure 1: Workflows for all run submissions: (a) image subtask; (b) video subtask. *: block swapping.

For the image subtask, different concatenations of the features were investigated to understand the contribution of each modality and to assess the added value of contextual information for the task. Each submitted run therefore differs from the others by its input features and by the adaptation of the layer sizes, while the DNN architecture remains the same: a single MLP layer with rectified linear unit (ReLU) activation and a dropout of 0.5. All submitted runs are summarized in Figure 1a, with different colors depending on the feature concatenation; Run#1, corresponding to the 2016 best system, serves as a baseline.

For the video subtask, three levels of embedding for the W2V features were investigated (see Figure 1b), except for Run#1, which re-uses one of last year's systems (Run#3 in [11]; see Figure 1 for the layers used, each with a ReLU activation function followed by a dropout of 0.5). Run#1's architecture is kept for the other runs, with some adaptation of the multimodal block depending on the input feature sizes (one or two LSTM layers, with a residual block). In Run#1, our baseline, only the audio and video modalities are used. For the image channel, a first modality-specific learning step is implemented with an MLP layer followed by an LSTM layer. For the audio channel, a single LSTM layer is used. After merging, both channels serve as input to two LSTM layers with a residual part (ResNet [6]).

In Run#2, the W2V features were simply merged with the result of the temporal modeling of the other modalities, whereas for Run#3 and Run#4 they were duplicated for each frame and merged into the workflow either in parallel to the audio and video channels (Run#4) or after a first merging of these two modalities (Run#3) (see Figure 1). For each run, some additional processing steps were added to realize the merge with the other modalities: either additional LSTM-ResNet layers when temporal modeling was possible (Runs#3 and #4), or simple MLP layers otherwise (Run#2). These steps were followed by a simple concatenation of the features from the different modalities.
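As an illustration of the Run#1-style video architecture described above, here is a minimal PyTorch sketch. The hidden sizes, the use of concatenation for the merge, and the per-frame softmax followed by a time-domain average are assumptions; the paper only specifies the layer types and the input feature dimensions.

```python
# Hedged sketch of a Run#1-style video model: MLP+LSTM image channel, LSTM audio
# channel, concatenation, two LSTM layers with a residual connection, per-frame
# softmax, then time-domain averaging. Hidden sizes are illustrative guesses.
import torch
import torch.nn as nn

class VideoInterestingnessNet(nn.Module):
    def __init__(self, img_dim=4096, audio_dim=180, hidden=256):
        super().__init__()
        # Image channel: modality-specific MLP layer followed by an LSTM layer.
        self.img_mlp = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU(), nn.Dropout(0.5))
        self.img_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        # Audio channel: a single LSTM layer.
        self.audio_lstm = nn.LSTM(audio_dim, hidden, batch_first=True)
        # After merging, two LSTM layers with a residual (skip) connection.
        self.merge_lstm1 = nn.LSTM(2 * hidden, 2 * hidden, batch_first=True)
        self.merge_lstm2 = nn.LSTM(2 * hidden, 2 * hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, 2)   # interesting / not interesting

    def forward(self, img_feats, audio_feats):
        # img_feats:   (batch, frames, 4096)  CNN fc7 features
        # audio_feats: (batch, frames, 180)   MFCCs + first and second derivatives
        img, _ = self.img_lstm(self.img_mlp(img_feats))
        audio, _ = self.audio_lstm(audio_feats)
        merged = torch.cat([img, audio], dim=-1)
        out, _ = self.merge_lstm1(merged)
        out2, _ = self.merge_lstm2(out)
        out = out + out2                              # residual part (ResNet-style)
        logits = self.classifier(out)                 # per-frame logits
        return logits.softmax(dim=-1).mean(dim=1)     # time-domain average of the softmax
```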
Run#5 is similar to Run#4, except that the Time Domain Average and Softmax steps were swapped. The motivation for this last run was to test whether the location of the decision step (softmax) had an influence on the performance.

4 RESULTS AND DISCUSSION
Results are summarized in Table 1.

Runs    Image Subtask        Video Subtask
        MAP      MAP@10      MAP      MAP@10
Run#1   0.2615   0.1028      0.1856   0.0589
Run#2   0.2525   0.1054      0.1768   0.0465
Run#3   0.2244   0.0693      0.1825   0.0563
Run#4   0.2382   0.0875      0.1878   0.0641
Run#5   0.2347   0.0861      0.1918   0.0609

Table 1: Results on both subtasks (official metric: MAP@10).

The first runs for both subtasks show slightly improved MAP values compared to last year's results. As these systems remain the same over the two years, this tends to show that the increase in dataset size and/or the refinement of the annotations had an effect on the modeling performance. Unexpectedly, the MAP@10 values are very low (lower than during the validation process, when MAP@10 values were in the same range as, or slightly lower than, the MAP values for both tasks). Another unexpected result is that the contextual features, either ICB or W2V, did not bring any improvement to the image subtask, although we had reached the opposite conclusion during validation on the development set, with MAP values of 0.36 and 0.38 for those features, respectively (for comparison, we obtained 0.31 with CNN features). This suggests that using more features might have led to over-fitting, probably because of the small size of the dataset during training. This over-fitting might also have been reinforced by the fact that, because of a lack of computational resources, cross-validation was done with only one fold. In the future, a cross-validation process with more folds might lead to a better system. However, once the test set ground truth is released, further analysis of the differences between the development and test sets should be carried out to better understand this observation.

For the video subtask, as expected, the W2V features slightly improved MAP and MAP@10 when used as a frame-based feature. Although they are simply repeated for each frame, i.e., each frame of a given video shares the same textual feature, the concatenation of this new information did bring some useful information for the video subtask. This difference between the two subtasks reinforces the difference between image and video interestingness which was already observed last year. Run#5 of the video subtask suggests that keeping the classification as the final step of the system maximizes performance, which is understandable as it allows continuous values to be kept as long as possible before switching to a binary classification. It also corresponds better to the annotation protocol, where the annotation is done for each video segment as a whole; thus the softmax prediction should also be made for the whole segment and not for every single frame.
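To make the swap concrete, the snippet below contrasts the two aggregation orderings discussed above. The ordering attributed to each run is inferred from the text (Run#5 keeps the softmax as the final, segment-level step); the exact tensors and placement in the authors' pipeline are assumptions.

```python
# Hedged illustration of the Run#4 vs. Run#5 difference: where the softmax sits
# relative to the time-domain average over frames.
import torch

def run4_style(frame_logits: torch.Tensor) -> torch.Tensor:
    # frame_logits: (frames, 2). Softmax per frame, then average the probabilities.
    return frame_logits.softmax(dim=-1).mean(dim=0)

def run5_style(frame_logits: torch.Tensor) -> torch.Tensor:
    # Average the continuous per-frame values first, softmax as the final
    # decision step for the whole segment.
    return frame_logits.mean(dim=0).softmax(dim=-1)
```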
As a conclusion, many of our findings from the validation step differed from those obtained on the test set. We definitely need to understand what differs between these two sets and is responsible for the differences in performance. For example, some new, significantly longer and thus more meaningful segments (243 out of 2435) were added to the test set only, representing 46 min out of a total duration of 87 min, i.e., more than half of the test set.

REFERENCES
[1] Sharon Lynn Chu, Elena A. Fedorovskaya, Francis K. H. Quek, and Jeffrey Snyder. 2013. The effect of familiarity on perceived interestingness of images. In Human Vision and Electronic Imaging.
[2] Claire-Hélène Demarty, Mats Sjöberg, Gabriel Constantin, Ngoc Q. K. Duong, Bogdan Ionescu, Thanh-Toan Do, and Hanli Wang. 2017. Predicting Interestingness of Visual Content.
[3] Claire-Hélène Demarty, Mats Sjöberg, Bogdan Ionescu, Thanh-Toan Do, Michael Gygli, and Ngoc Q. K. Duong. 2017. MediaEval 2017 Predicting Media Interestingness Task. MediaEval 2017 Workshop (September 2017).
[4] Claire-Hélène Demarty, Mats Sjöberg, Bogdan Ionescu, Thanh-Toan Do, Hanli Wang, Ngoc Q. K. Duong, and Frederic Lefebvre. 2016. MediaEval 2016 Predicting Media Interestingness Task. MediaEval 2016 Workshop (October 2016).
[5] Helmut Grabner, Fabian Nater, Michel Druey, and Luc Van Gool. 2013. Visual Interestingness in Image Sequences. In Proceedings of the 21st ACM International Conference on Multimedia (MM '13). ACM, New York, NY, USA, 1017–1026. https://doi.org/10.1145/2502081.2502109
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385 (2015).
[7] Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539 (2014).
[8] Bryan A. Plummer, Matthew Brown, and Svetlana Lazebnik. 2017. Enhancing Video Summarization via Vision-Language Embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017). IEEE.
[9] Xin Rong. 2014. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738 (2014).
[10] Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley. 2010. Automatic keyword extraction from individual documents. Text Mining: Applications and Theory (2010), 1–20.
[11] Yuesong Shen, Claire-Hélène Demarty, and Ngoc Q. K. Duong. 2016. Technicolor@MediaEval 2016 Predicting Media Interestingness Task. In MediaEval 2016 Workshop.