ETH-CVL @ MediaEval 2016: Textual-Visual Embeddings and Video2GIF for Video Interestingness

Arun Balajee Vasudevan, CVLab, ETH Zurich, arunv@student.ethz.ch
Michael Gygli, CVLab, ETH Zurich, gygli@vision.ee.ethz.ch
Anna Volokitin, CVLab, ETH Zurich, anna.volokitin@vision.ee.ethz.ch
Luc Van Gool, CVLab, ETH Zurich, vangool@vision.ee.ethz.ch

ABSTRACT
This paper presents the methods that underlie our submission to the Predicting Media Interestingness Task at MediaEval 2016. Our contribution relies on two main approaches: (i) a similarity metric between image and text and (ii) a generic video highlight detector. In particular, we develop a method for learning the similarity of text and images by projecting them into the same embedding space. This embedding allows us to find video frames that are both canonical and relevant w.r.t. the title of the video. We present the results of different configurations and give insights into when our best performing method works well and where it has difficulties.

Figure 1: Visual Semantic Embedding Model

1. INTRODUCTION
The number of online video uploads has been growing for many years (https://www.youtube.com/yt/press/statistics.html). In today's fast-moving world, social media clearly favors shortened or condensed forms of videos over their complete versions, as they are more easily consumable. This increases the importance of extracting attractive keyframes or automatically finding the best segments of a video. Such a condensed form of video may improve the viewer experience [1] as well as video search [2].

In the following we detail our approach for tackling this difficult prediction problem and present our results on the MediaEval 2016 challenge on Predicting Media Interestingness [3]. The goal of this task is to predict the frame and segment interestingness of Hollywood-like movie trailers. This, in turn, helps a user to make a better decision about whether he or she might be interested in a movie. The dataset provided for this task consists of a development set of 52 trailers and a test set of 26 trailers. More information on the task can be found in [3].

There are many conventional works for extracting frames based on visual content alone [5, 9, 13, 16]. More recently, several works have presented models that rely on semantic information associated with the videos, such as the title of the video [14] or a user query [12], to find relevant and interesting frames. The use of semantic side information allows building a strong, video-specific interestingness model [11, 14]. Liu et al. [11], for example, use the title of a video to retrieve photos from Flickr. The interestingness of a video frame is then measured by computing the visual similarity between the frame and the retrieved photo set.

In this work, we rely on two models: (i) a frame-based model that uses textual side information and (ii) a generic predictor for finding video highlights in the form of segments [6]. For our frame-based model, we follow the work of [12] and learn a joint embedding space for images and text, which allows us to measure the relevance of a frame w.r.t. some text such as the video title. For selecting video segments based on their interestingness, we use the work of Gygli et al. [6], which trained a deep RankNet to rank the segments of a video based on their suitability as animated GIFs.

2. VISUAL-SEMANTIC EMBEDDING
The structure of our Visual Semantic Embedding model is shown in Figure 1. In our model, we have two parallel networks for images and texts, which are jointly trained with a common loss function. The network is built in an end-to-end fashion for training and inference and is trained on the MSR Clickture dataset [7]. The aim of our model is to map images and queries into the same textual-visual embedding space. In this space, semantic proximity between texts and images can easily be computed as the cosine similarity of their representations [4, 10]. We train the network with positive and negative examples of query-image pairs from the MSR dataset and learn to score the positive pair higher than the negative one, i.e. we pose it as a ranking problem. Thus, we optimize an objective that requires the embedding of a query to have a higher cosine similarity with a related image than with a randomly selected image. Let h(q, v) be the score of the model for a text query q and some image v, let v^+ be a text-relevant (positive) image and v^- a non-relevant (negative) image. Our objective function is then

    h(q, v^+) > h(q, v^-).    (1)

We use a Huber rank loss [8] to optimize this objective, similar to [6].

In the inference stage, for a given movie title and given keyframes, we embed the title and the keyframes into the same space. Then, we rank the list of keyframes based on the proximity of the frame embeddings to the text embedding.
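To make the objective above concrete, the following is a minimal PyTorch sketch of such a two-branch embedding. The encoder architectures, feature and embedding dimensions, margin and Huber delta are illustrative assumptions (the paper does not specify them); only the cosine-similarity score h(q, v) and the Huber-smoothed ranking objective of Eq. (1) follow the description above.

```python
# Minimal sketch of a two-branch textual-visual embedding. Not the authors'
# exact architecture: backbones, dimensions and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualSemanticEmbedding(nn.Module):
    def __init__(self, img_feat_dim=4096, txt_feat_dim=300, embed_dim=512):
        super().__init__()
        # Two parallel branches projecting image and query features
        # into a shared embedding space.
        self.img_branch = nn.Sequential(nn.Linear(img_feat_dim, embed_dim), nn.ReLU(),
                                        nn.Linear(embed_dim, embed_dim))
        self.txt_branch = nn.Sequential(nn.Linear(txt_feat_dim, embed_dim), nn.ReLU(),
                                        nn.Linear(embed_dim, embed_dim))

    def score(self, query_feat, image_feat):
        # h(q, v): cosine similarity between the two embeddings.
        q = F.normalize(self.txt_branch(query_feat), dim=-1)
        v = F.normalize(self.img_branch(image_feat), dim=-1)
        return (q * v).sum(dim=-1)

def huber_rank_loss(s_pos, s_neg, margin=1.0, delta=1.5):
    # Huber-smoothed margin ranking loss: violations of h(q, v+) > h(q, v-)
    # are penalised quadratically when small and linearly when large.
    u = torch.clamp(margin - s_pos + s_neg, min=0.0)
    quadratic = 0.5 * u ** 2
    linear = delta * (u - 0.5 * delta)
    return torch.where(u <= delta, quadratic, linear).mean()

# Training step on (query, relevant image, random image) triplets (dummy features).
model = VisualSemanticEmbedding()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
q, v_pos, v_neg = torch.randn(32, 300), torch.randn(32, 4096), torch.randn(32, 4096)
loss = huber_rank_loss(model.score(q, v_pos), model.score(q, v_neg))
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Inference: rank a video's keyframes by cosine proximity to the title embedding.
title_feat = torch.randn(1, 300)
keyframe_feats = torch.randn(100, 4096)
ranking = model.score(title_feat.expand(100, -1), keyframe_feats).argsort(descending=True)
```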
Figure 2: Qualitative results: three pairs of highly ranked keyframes followed by the ground truth, for the videos titled Captives, After Earth and Stonewall (from left). Blue text shows the prediction score, green the ground-truth score.

Figure 3: Results: (a) comparison of the submitted runs (mAP); (b) Precision-Recall curves of the two subtasks.

    (a) Runs comparison
    Subtask   Run     mAP
    Image     Run-1   0.1866
    Image     Run-2   0.1952
    Image     Run-3   0.1858
    Video     Run-1   0.1362
    Video     Run-2   0.1574

3. VIDEO HIGHLIGHT DETECTOR
We use Video2GIF [6] as a generic video highlight detector. To capture the spatio-temporal visual features of video segments, 3D convolutional neural networks (C3D) [15] are used. The model consists of C3D followed by two fully connected layers and outputs a score. It is trained on the Video2GIF dataset [6] to score segments that were used for GIFs higher than the non-selected segments of the same video; thus, it also uses a ranking loss for training. The scores given by the model are not absolute but ordinal, i.e. a segment with a higher score is more interesting than a segment with a lower score. These scores therefore allow ranking the segments by interestingness: given the segments of a video, the model ranks all of them based on their suitability as a GIF, which is generally a short, appealing part of the video.

4. EXPERIMENTS
For the Image Interestingness subtask, we use the Visual Semantic Embedding model and then fine-tune it on the MediaEval development set for domain adaptation. We submit three runs for this subtask: 1) Run-1: the Visual Semantic Embedding model trained on 0.5M query(text)-image pairs of the MSR Clickture dataset; 2) Run-2: the Run-1 model fine-tuned on the development set; 3) Run-3: the Run-1 model but trained on 8M query-image pairs.

For the Video Interestingness subtask, we use Video2GIF [6]. However, Video2GIF does not consider any meta information for scoring and ranking the video segments. Hence, we propose to combine the Visual Semantic Embedding scores with the Video2GIF scores. For this, we extract the middle frame of each video segment and score that frame using the Visual Semantic Embedding model. We then combine this score with the Video2GIF score for the same segment by averaging; a sketch of this combination is given below. We submit two runs for this subtask: 1) Run-1: Video2GIF [6]; 2) Run-2: averaging the prediction scores of Run-1 of the video subtask and Run-2 of the image subtask. The combined score ranks the segments better than the Video2GIF model alone, as seen in Figure 3.
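The snippet below sketches the Run-2 combination: it averages the Video2GIF score of each segment with the title-relevance score of the segment's middle frame. The video2gif_model and embedding_model objects and their score methods are hypothetical interfaces standing in for the two trained models; the paper does not state whether the scores are normalised to a common range before averaging, so they are averaged directly here.

```python
# Sketch of the Run-2 score combination for the video subtask.
# video2gif_model / embedding_model are hypothetical wrappers around the
# trained Video2GIF and Visual Semantic Embedding models.
import numpy as np

def rank_segments(segments, title, video2gif_model, embedding_model):
    """Rank video segments by averaging the generic Video2GIF score with the
    title-relevance score of each segment's middle frame."""
    combined = []
    for seg in segments:                                   # seg: sequence of frames
        gif_score = video2gif_model.score(seg)             # generic highlight score
        middle_frame = seg[len(seg) // 2]                  # representative frame
        rel_score = embedding_model.score(title, middle_frame)  # title relevance
        combined.append(0.5 * (gif_score + rel_score))
    # Higher combined score = more interesting segment.
    return np.argsort(combined)[::-1]
```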
5. RESULTS AND DISCUSSION
We evaluate our models on the MediaEval 2016 Predicting Media Interestingness Task [3].

Figure 3 shows the Precision-Recall curves of the image and video interestingness subtasks for our models. We observe that Run-2 of the image interestingness subtask performs better than the other two runs. Our initial model is trained on images, which differ from video frames in quality and content; thus, fine-tuning on the development set to adapt to the video domain improves mAP. Qualitatively, in Figure 2, we observe that for the first two examples the keyframes selected by the model are quite close to the ground truth. This is because the movie titles (Captives, After Earth) give a clear visual hint of what an appealing frame should contain. The third example is a failure case, as the title (Stonewall) is misleading: the movie is about a protest movement, not a wall. Thus, our model has difficulties picking the right keyframes in this case.

For the video interestingness subtask, we observe that Run-2 performs better than Run-1. Combining the prediction scores of Video2GIF (Run-1) with Run-2 of the image interestingness subtask significantly improves performance. This is because Video2GIF does not take the relevance of the movie title into account when scoring segments, in contrast to the query-relevant scoring of keyframes by the Visual Semantic Embedding model. Hence, the combination of both models outperforms Video2GIF alone (Run-1).
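Since the runs are compared by mean average precision, a generic sketch of the metric is given below. The official MediaEval evaluation defines the exact protocol (e.g. how the graded interestingness annotations are binarised) [3]; the code only illustrates our assumption of the metric's general form, computed per trailer from the ranked scores and averaged over the test set.

```python
# Generic mAP sketch; the official evaluation tool defines the exact protocol.
import numpy as np

def average_precision(scores, labels):
    """AP for one trailer: labels are 1 for interesting frames/segments, 0 otherwise."""
    order = np.argsort(scores)[::-1]            # rank by predicted interestingness
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)
    precision_at_k = hits / (np.arange(len(labels)) + 1)
    return (precision_at_k * labels).sum() / max(labels.sum(), 1)

def mean_average_precision(per_video):
    """per_video: list of (scores, labels) pairs, one per test trailer."""
    return float(np.mean([average_precision(s, l) for s, l in per_video]))
```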
6. REFERENCES
[1] S. Bakhshi, D. Shamma, L. Kennedy, Y. Song, P. de Juan, and J. Kaye. Fast, Cheap, and Good: Why Animated GIFs Engage Us. In ACM Conference on Human Factors in Computing Systems, 2016.
[2] L. Ballan, M. Bertini, G. Serra, and A. Del Bimbo. A data-driven approach for tag refinement and localization in web videos. Computer Vision and Image Understanding, 2015.
[3] C.-H. Demarty, M. Sjöberg, B. Ionescu, T.-T. Do, H. Wang, N. Q. K. Duong, and F. Lefebvre. MediaEval 2016 Predicting Media Interestingness Task. In MediaEval 2016, 2016.
[4] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, 2013.
[5] M. Gygli, H. Grabner, H. Riemenschneider, F. Nater, and L. Van Gool. The Interestingness of Images. In The IEEE International Conference on Computer Vision (ICCV), 2013.
[6] M. Gygli, Y. Song, and L. Cao. Video2GIF: Automatic Generation of Animated GIFs from Video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[7] X.-S. Hua, L. Yang, J. Wang, J. Wang, M. Ye, K. Wang, Y. Rui, and J. Li. Clickage: Towards bridging semantic and intent gaps via mining click logs of search engines. In Proceedings of the 21st ACM International Conference on Multimedia, 2013.
[8] P. J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
[9] Y.-G. Jiang, Y. Wang, R. Feng, X. Xue, Y. Zheng, and H. Yang. Understanding and predicting interestingness of videos. In AAAI, 2013.
[10] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
[11] F. Liu, Y. Niu, and M. Gleicher. Using Web Photos for Measuring Video Frame Interestingness. In IJCAI, 2009.
[12] W. Liu, T. Mei, Y. Zhang, C. Che, and J. Luo. Multi-task deep visual-semantic embedding for video thumbnail selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[13] M. Soleymani. The quest for visual interest. In Proceedings of the 23rd Annual ACM Conference on Multimedia, 2015.
[14] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes. TVSum: Summarizing Web Videos Using Titles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[15] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. arXiv preprint arXiv:1412.0767, 2014.
[16] Y. Wang, Z. Lin, X. Shen, R. Mech, G. Miller, and G. W. Cottrell. Event-specific image importance. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.