ETH-CVL @ MediaEval 2016: Textual-Visual Embeddings and Video2GIF for Video Interestingness

Arun Balajee Vasudevan, CVLab, ETH Zurich, arunv@student.ethz.ch
Michael Gygli, CVLab, ETH Zurich, gygli@vision.ee.ethz.ch
Anna Volokitin, CVLab, ETH Zurich, anna.volokitin@vision.ee.ethz.ch
Luc Van Gool, CVLab, ETH Zurich, vangool@vision.ee.ethz.ch

ABSTRACT
This paper presents the methods that underlie our submission to the Predicting Media Interestingness Task at MediaEval 2016. Our contribution relies on two main approaches: (i) a similarity metric between image and text and (ii) a generic video highlight detector. In particular, we develop a method for learning the similarity of text and images by projecting them into the same embedding space. This embedding allows us to find video frames that are both canonical and relevant w.r.t. the title of the video. We present the results of different configurations and give insights into when our best performing method works well and where it has difficulties.

Figure 1: Visual Semantic Embedding Model

1. INTRODUCTION
The number of online video uploads has been growing for many years (https://www.youtube.com/yt/press/statistics.html). In today's fast-moving world, social media clearly favors shortened or condensed forms of videos over their complete versions, as they are more easily consumable. This increases the importance of extracting attractive keyframes or automatically finding the best segments of a video. Such a condensed form of video may improve the viewer experience [1] as well as video search [2].

In the following we detail our approach for tackling this difficult prediction problem and present our results on the MediaEval 2016 challenge on Predicting Media Interestingness [3]. The goal of this task is to predict the frame and segment interestingness of Hollywood-like movie trailers. This, in turn, helps a user to make a better decision about whether he or she might be interested in a movie. The dataset provided for this task consists of a development set of 52 trailers and a test set of 26 trailers. More information on the task can be found in [3].

There are many conventional works for extracting frames based on visual content alone [5, 9, 13, 16]. More recently, several works have presented models that rely on semantic information associated with the videos, such as the title of the video [14] or a user query [12], to find relevant and interesting frames. The use of semantic side information allows building a strong, video-specific interestingness model [11, 14]. Liu et al. [11], for example, use the title of a video to retrieve photos from Flickr. The interestingness of a video frame is then measured by computing the visual similarity between the frame and the retrieved photo set.

In this work, we rely on two models: (i) a frame-based model that uses textual side information and (ii) a generic predictor for finding video highlights in the form of segments [6]. For our frame-based model, we follow the work of [12] and learn a joint embedding space for images and text, which allows us to measure the relevance of a frame w.r.t. some text such as the video title. For selecting video segments based on their interestingness, we use the work of Gygli et al. [6], which trained a deep RankNet to rank the segments of a video based on their suitability as animated GIFs.

2. VISUAL-SEMANTIC EMBEDDING
The structure of our Visual Semantic Embedding model is shown in Figure 1. In our model, we have two parallel networks for images and texts, which are jointly trained with a common loss function. The network is built in an end-to-end fashion for training and inference and is trained on the MSR Clickture dataset [7]. The aim of our model is to map images and queries into the same textual-visual embedding space. In this space, semantic proximity between texts and images can easily be computed as the cosine similarity of their representations [4, 10]. We train the network with positive and negative examples of query-image pairs from the MSR dataset and learn to score the positive pair higher than the negative one, i.e. we pose it as a ranking problem. Thus, we optimize an objective that requires the embedding of a query to have a higher cosine similarity with a related image than with a randomly selected image. Let h(q, v) be the score of the model for a text query q and some image v, let v^+ be a text-relevant (positive) image and v^- a non-relevant (negative) image. Our objective function is then

    h(q, v^+) > h(q, v^-).    (1)

We use a Huber rank loss [8] to optimize this objective, similar to [6].

In the inference stage, for a given movie title and given keyframes, we embed the title and the keyframes into the same space. Then, we rank the list of keyframes based on the proximity of the frame embeddings to the text embedding.
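To make the objective above concrete, the following is a minimal PyTorch sketch of such a two-branch embedding. The encoder architectures, feature and embedding dimensions, margin and Huber delta are illustrative assumptions (the paper does not specify them); only the cosine-similarity score h(q, v) and the Huber-smoothed ranking objective of Eq. (1) follow the description above.

```python
# Minimal sketch of a two-branch textual-visual embedding. Not the authors'
# exact architecture: backbones, dimensions and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualSemanticEmbedding(nn.Module):
    def __init__(self, img_feat_dim=4096, txt_feat_dim=300, embed_dim=512):
        super().__init__()
        # Two parallel branches projecting image and query features
        # into a shared embedding space.
        self.img_branch = nn.Sequential(nn.Linear(img_feat_dim, embed_dim), nn.ReLU(),
                                        nn.Linear(embed_dim, embed_dim))
        self.txt_branch = nn.Sequential(nn.Linear(txt_feat_dim, embed_dim), nn.ReLU(),
                                        nn.Linear(embed_dim, embed_dim))

    def score(self, query_feat, image_feat):
        # h(q, v): cosine similarity between the two embeddings.
        q = F.normalize(self.txt_branch(query_feat), dim=-1)
        v = F.normalize(self.img_branch(image_feat), dim=-1)
        return (q * v).sum(dim=-1)

def huber_rank_loss(s_pos, s_neg, margin=1.0, delta=1.5):
    # Huber-smoothed margin ranking loss: violations of h(q, v+) > h(q, v-)
    # are penalised quadratically when small and linearly when large.
    u = torch.clamp(margin - s_pos + s_neg, min=0.0)
    quadratic = 0.5 * u ** 2
    linear = delta * (u - 0.5 * delta)
    return torch.where(u <= delta, quadratic, linear).mean()

# Training step on (query, relevant image, random image) triplets (dummy features).
model = VisualSemanticEmbedding()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
q, v_pos, v_neg = torch.randn(32, 300), torch.randn(32, 4096), torch.randn(32, 4096)
loss = huber_rank_loss(model.score(q, v_pos), model.score(q, v_neg))
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Inference: rank a video's keyframes by cosine proximity to the title embedding.
title_feat = torch.randn(1, 300)
keyframe_feats = torch.randn(100, 4096)
ranking = model.score(title_feat.expand(100, -1), keyframe_feats).argsort(descending=True)
```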
Figure 2: Qualitative results: three pairs of highly ranked keyframes followed by the ground truth, for the videos titled Captives, After Earth and Stonewall (from left). Blue text shows the prediction score, green the ground-truth score.

Figure 3: Results: (a) comparison of the submitted runs (mAP); (b) Precision-Recall curves of the two subtasks.

    (a) Runs comparison
    Subtask   Run     mAP
    Image     Run-1   0.1866
    Image     Run-2   0.1952
    Image     Run-3   0.1858
    Video     Run-1   0.1362
    Video     Run-2   0.1574

3. VIDEO HIGHLIGHT DETECTOR
We use Video2GIF [6] as a generic video highlight detector. To capture the spatio-temporal visual features of video segments, 3D convolutional neural networks (C3D) [15] are used. The model consists of C3D followed by two fully connected layers and outputs a score. It is trained on the Video2GIF dataset [6] to score segments that were used for GIFs higher than the non-selected segments of the same video; thus, it also uses a ranking loss for training. The scores given by the model are not absolute but ordinal, i.e. a segment with a higher score is more interesting than a segment with a lower score. These scores therefore allow ranking the segments by interestingness: given the segments of a video, the model ranks all of them based on their suitability as a GIF, which is generally a short, appealing part of the video.

4. EXPERIMENTS
For the Image Interestingness subtask, we use the Visual Semantic Embedding model and then fine-tune it on the MediaEval development set for domain adaptation. We submit three runs for this subtask: 1) Run-1: the Visual Semantic Embedding model trained on 0.5M query(text)-image pairs of the MSR Clickture dataset; 2) Run-2: the Run-1 model fine-tuned on the development set; 3) Run-3: the Run-1 model but trained on 8M query-image pairs.

For the Video Interestingness subtask, we use Video2GIF [6]. However, Video2GIF does not consider any meta information for scoring and ranking the video segments. Hence, we propose to combine the Visual Semantic Embedding scores with the Video2GIF scores. For this, we extract the middle frame of each video segment and score that frame using the Visual Semantic Embedding model. We then combine this score with the Video2GIF score for the same segment by averaging; a sketch of this combination is given below. We submit two runs for this subtask: 1) Run-1: Video2GIF [6]; 2) Run-2: averaging the prediction scores of Run-1 of the video subtask and Run-2 of the image subtask. The combined score ranks the segments better than the Video2GIF model alone, as seen in Figure 3.
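The snippet below sketches the Run-2 combination: it averages the Video2GIF score of each segment with the title-relevance score of the segment's middle frame. The video2gif_model and embedding_model objects and their score methods are hypothetical interfaces standing in for the two trained models; the paper does not state whether the scores are normalised to a common range before averaging, so they are averaged directly here.

```python
# Sketch of the Run-2 score combination for the video subtask.
# video2gif_model / embedding_model are hypothetical wrappers around the
# trained Video2GIF and Visual Semantic Embedding models.
import numpy as np

def rank_segments(segments, title, video2gif_model, embedding_model):
    """Rank video segments by averaging the generic Video2GIF score with the
    title-relevance score of each segment's middle frame."""
    combined = []
    for seg in segments:                                   # seg: sequence of frames
        gif_score = video2gif_model.score(seg)             # generic highlight score
        middle_frame = seg[len(seg) // 2]                  # representative frame
        rel_score = embedding_model.score(title, middle_frame)  # title relevance
        combined.append(0.5 * (gif_score + rel_score))
    # Higher combined score = more interesting segment.
    return np.argsort(combined)[::-1]
```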
5. RESULTS AND DISCUSSION
We evaluate our models on the MediaEval 2016 Predicting Media Interestingness Task [3].

Figure 3 shows the Precision-Recall curves of the image and video interestingness subtasks for our models. We observe that Run-2 of the image interestingness subtask performs better than the other two runs. Our initial model is trained on images, which differ from video frames in quality and content; thus, fine-tuning on the development set to adapt to the video domain improves mAP. Qualitatively, in Figure 2, we observe that for the first two examples the keyframes selected by the model are quite close to the ground truth. This is because the movie titles (Captives, After Earth) give a clear visual hint of what an appealing frame should contain. The third example is a failure case, as the title (Stonewall) is misleading: the movie is about a protest movement, not a wall. Thus, our model has difficulties picking the right keyframes in this case.

For the video interestingness subtask, we observe that Run-2 performs better than Run-1. Combining the prediction scores of Video2GIF (Run-1) with Run-2 of the image interestingness subtask significantly improves performance. This is because Video2GIF does not take the relevance of the movie title into account when scoring segments, in contrast to the query-relevant scoring of keyframes by the Visual Semantic Embedding model. Hence, the combination of both models outperforms Video2GIF alone (Run-1).
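Since the runs are compared by mean average precision, a generic sketch of the metric is given below. The official MediaEval evaluation defines the exact protocol (e.g. how the graded interestingness annotations are binarised) [3]; the code only illustrates our assumption of the metric's general form, computed per trailer from the ranked scores and averaged over the test set.

```python
# Generic mAP sketch; the official evaluation tool defines the exact protocol.
import numpy as np

def average_precision(scores, labels):
    """AP for one trailer: labels are 1 for interesting frames/segments, 0 otherwise."""
    order = np.argsort(scores)[::-1]            # rank by predicted interestingness
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)
    precision_at_k = hits / (np.arange(len(labels)) + 1)
    return (precision_at_k * labels).sum() / max(labels.sum(), 1)

def mean_average_precision(per_video):
    """per_video: list of (scores, labels) pairs, one per test trailer."""
    return float(np.mean([average_precision(s, l) for s, l in per_video]))
```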
6. REFERENCES
[1] S. Bakhshi, D. Shamma, L. Kennedy, Y. Song, P. de Juan, and J. Kaye. Fast, Cheap, and Good: Why Animated GIFs Engage Us. In ACM Conference on Human Factors in Computing Systems, 2016.
[2] L. Ballan, M. Bertini, G. Serra, and A. Del Bimbo. A data-driven approach for tag refinement and localization in web videos. Computer Vision and Image Understanding, 2015.
[3] C.-H. Demarty, M. Sjöberg, B. Ionescu, T.-T. Do, H. Wang, N. Q. K. Duong, and F. Lefebvre. MediaEval 2016 Predicting Media Interestingness Task. In MediaEval 2016, 2016.
[4] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, 2013.
[5] M. Gygli, H. Grabner, H. Riemenschneider, F. Nater, and L. Van Gool. The Interestingness of Images. In The IEEE International Conference on Computer Vision (ICCV), 2013.
[6] M. Gygli, Y. Song, and L. Cao. Video2GIF: Automatic Generation of Animated GIFs from Video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[7] X.-S. Hua, L. Yang, J. Wang, J. Wang, M. Ye, K. Wang, Y. Rui, and J. Li. Clickage: Towards bridging semantic and intent gaps via mining click logs of search engines. In Proceedings of the 21st ACM International Conference on Multimedia, 2013.
[8] P. J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
[9] Y.-G. Jiang, Y. Wang, R. Feng, X. Xue, Y. Zheng, and H. Yang. Understanding and predicting interestingness of videos. In AAAI, 2013.
[10] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
[11] F. Liu, Y. Niu, and M. Gleicher. Using Web Photos for Measuring Video Frame Interestingness. In IJCAI, 2009.
[12] W. Liu, T. Mei, Y. Zhang, C. Che, and J. Luo. Multi-task deep visual-semantic embedding for video thumbnail selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[13] M. Soleymani. The quest for visual interest. In Proceedings of the 23rd Annual ACM Conference on Multimedia, 2015.
[14] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes. TVSum: Summarizing Web Videos Using Titles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[15] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. arXiv preprint arXiv:1412.0767, 2014.
[16] Y. Wang, Z. Lin, X. Shen, R. Mech, G. Miller, and G. W. Cottrell. Event-specific image importance. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.