      HUCVL at MediaEval 2016: Predicting Interesting Key
                 Frames with Deep Models

                                   Goksu Erdogan, Aykut Erdem, Erkut Erdem
                                    Hacettepe Computer Vision Lab (HUCVL)
                     Department of Computer Engineering, Hacettepe University, Ankara, Turkey
                              {goksuerdogan, aykut, erkut}@cs.hacettepe.edu.tr


ABSTRACT
In MediaEval 2016, we focus on the image interestingness subtask, which
involves predicting interesting key frames of a video in the form of a
movie trailer. We specifically propose three different deep models for
this subtask. The first two models are based on fine-tuning two pretrained
models, namely AlexNet and MemNet, where we cast the interestingness
prediction as a regression problem. Our third deep model, on the other
hand, depends on a triplet network which is comprised of three instances
of the same feedforward network with shared weights, and is trained
according to a triplet ranking loss. Our experiments demonstrate that all
these models provide relatively similar and promising results on the image
interestingness subtask.

1. INTRODUCTION
Understanding and predicting the interestingness of images or video shots
has recently been proposed as a problem in the computer vision literature
[7, 6, 2], and it finds many applications such as video summarization [3]
or automatic generation of animated GIFs [4]. The MediaEval 2016
Predicting Media Interestingness Task is introduced as a new task which
consists of two subtasks, on the image and video levels, respectively. In
our work, we concentrate only on the image subtask, which involves
identifying interesting keyframes of a given movie trailer video, and
where we process each frame independently. Details about this subtask,
including the related dataset and the experimental setting, can be found
in the overview paper [1].

2. METHODS
Deep convolutional neural networks (CNNs) have revolutionized the computer
vision field in recent years, obtaining state-of-the-art results in many
different problem domains. In our submission, we tested three different
CNN models, which are all based on the popular AlexNet architecture [9].
All of our networks have five convolutional layers and three fully
connected layers, with a final layer returning a scalar interestingness
score. The detailed descriptions of our models are given in Sections
2.1-2.3, and in Section 2.4 we explain how we convert interestingness
scores into labels.

2.1 AlexNet
For our first model, we fine-tune AlexNet [9], which is trained on the
ImageNet ILSVRC 2012 task to classify more than a thousand object
categories. Image interestingness requires predicting a single real-valued
output, so we replace the last soft-max layer with a regression layer and
use a Euclidean loss layer to fine-tune the model. In our experiments, we
only fine-tune the last fully connected layer, while the weights of the
other layers are not updated. Training lasted approximately 2000 epochs.
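
As an illustration of this setup, the following is a minimal PyTorch-style
sketch (not the authors' original code, which likely used a different
framework in 2016): the 1000-way softmax classifier of a pretrained AlexNet
is swapped for a single-output regression layer, all other weights are
frozen, and the model is trained with a Euclidean (MSE) loss. The data
loader name is a placeholder.

    import torch
    import torch.nn as nn
    from torchvision import models

    # Load AlexNet pretrained on ImageNet (ILSVRC 2012).
    model = models.alexnet(pretrained=True)

    # Freeze everything; only the newly added regression layer is trained.
    for param in model.parameters():
        param.requires_grad = False

    # Replace the 1000-way softmax classifier with a single-output
    # regression layer.
    model.classifier[6] = nn.Linear(4096, 1)

    criterion = nn.MSELoss()  # Euclidean loss on interestingness scores
    optimizer = torch.optim.SGD(model.classifier[6].parameters(),
                                lr=1e-3, momentum=0.9)

    # train_loader is a hypothetical DataLoader of (key frame, score) pairs.
    for frames, scores in train_loader:
        optimizer.zero_grad()
        preds = model(frames).squeeze(1)  # real-valued interestingness scores
        loss = criterion(preds, scores)
        loss.backward()
        optimizer.step()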

2.2 MemNet
Our second model is based on the recently proposed MemNet model [8], which
is trained for the image memorability task. Although memorability and
interestingness are not exactly the same [5], we think that fine-tuning a
model related to an intrinsic property of images could help us learn
better high-level features for the interestingness task. In our
experiments, we only update the weights of the fully connected layers, and
the training lasted nearly 3000 epochs.

2.3 Triplet Loss
Our third model also follows the AlexNet architecture, but differs from
our previous models in that we employ a different training procedure.
Specifically, we consider a deep triplet network which is composed of
three instances of the AlexNet model where the weights are shared across
the instances. We employ a ranking loss similar to that of [4]. However,
while the authors of [4] consider a siamese network and a pairwise ranking
loss, here we utilize a triplet ranking loss [10] within our network. Once
the training is finished, we use a single instance of the feedforward
network to predict the interestingness score of a given keyframe.
  Considering a triplet network allows us to learn a 1D embedding space
for images, where the triplet ranking loss function enforces an
interesting frame to be close to other interesting frames and far away
from the uninteresting ones:

    L(x, x+, x−) = max(0, D(x, x+) − D(x, x−) + M)                      (1)

where x, x+, x− denote the anchor, positive and negative samples given as
inputs, D(·) represents the distance between the interestingness scores,
and M represents the margin.
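
For concreteness, Eq. (1) can be written down directly on the scalar scores
produced by the shared network. The following is a hedged PyTorch-style
sketch (the variable names and margin value are ours, not from the paper),
taking D as the absolute difference between interestingness scores:

    import torch

    def triplet_ranking_loss(s_anchor, s_pos, s_neg, margin=0.5):
        # s_anchor, s_pos, s_neg: tensors of shape (batch,) holding the
        # scores of the anchor, positive (interesting) and negative
        # (uninteresting) frames produced by the shared network.
        d_pos = torch.abs(s_anchor - s_pos)  # D(x, x+)
        d_neg = torch.abs(s_anchor - s_neg)  # D(x, x-)
        return torch.clamp(d_pos - d_neg + margin, min=0).mean()

    # Because the three instances share weights, a single network is simply
    # evaluated three times per triplet, e.g.:
    # loss = triplet_ranking_loss(net(x).squeeze(1),
    #                             net(x_pos).squeeze(1),
    #                             net(x_neg).squeeze(1))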

In terms of optimization, one critical point is the triplet selection
procedure, since we observe that using all possible triplets in training
is costly and might lead to poor local minima. For this reason, we use a
hard negative mining strategy, which is commonly used in similar works.
During training, we only fine-tune the fully connected layers as in our
previous models, while all the weights are initialized with AlexNet
weights. We interrupted training at the 10000th epoch.
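
One simple way to realize such a selection step is sketched below under our
own assumptions (this is an illustrative reading of hard negative mining,
not the authors' exact procedure): for each anchor, keep only the candidate
negatives whose scores currently lie closest to it and therefore violate
the margin most strongly.

    import torch

    def mine_hard_negatives(s_anchor, s_negatives, k=1):
        # s_anchor:    shape (batch,), scores of the anchor frames.
        # s_negatives: shape (batch, n_neg), scores of candidate negatives.
        # Returns indices of the k hardest negatives per anchor: (batch, k).
        dists = torch.abs(s_negatives - s_anchor.unsqueeze(1))
        # Small distance to the anchor = hard negative (likely margin
        # violation).
        _, hard_idx = torch.topk(dists, k, dim=1, largest=False)
        return hard_idx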

2.4 Interestingness Classification
Thus far, we have described our CNN models, which can be used to compute
real-valued interestingness scores for each key frame of a given video
sequence, where these interestingness scores correspond to confidence
values. However, the task also requires classifying a frame as interesting
or not, in addition to predicting its interestingness score.
  A simple and straightforward way to convert real-valued outputs to class
labels is to introduce a thresholding procedure. However, choosing a
single appropriate threshold value is not easy; in fact, we observe that
it is very video-sequence dependent. Figure 1 shows the ground truth
distributions of the confidence values for the interesting (blue) and
uninteresting (orange) frames over all video sequences in the training
set. As can be seen, these distributions have a large overlap,
demonstrating that a single threshold value won't work.

Figure 1: Distributions of the confidence values for
interesting/uninteresting frames.

  Next, we analyzed the ratio between interesting and uninteresting key
frames for each training video. As shown in Table 1, the ratio is, on
average, about 1:9. Hence, given a test video sequence, we sort all its
key frames according to their predicted interestingness scores and
classify the top 10% of frames as interesting.

Table 1: Statistics for the confidence values for interesting and
uninteresting frames over the training data.
    frames          mean   std
    interesting     0.11   0.08
    uninteresting   0.89   0.08
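
In code, the per-video top-10% labeling described above amounts to sorting
the predicted scores and labeling the top decile. The sketch below uses
hypothetical names and NumPy, and its rounding choice (at least one
interesting frame per video) is our own assumption:

    import numpy as np

    def label_top_fraction(scores, fraction=0.10):
        # scores: 1D array of predicted interestingness scores for the key
        # frames of a single test video.
        n_interesting = max(1, int(round(fraction * len(scores))))
        order = np.argsort(scores)[::-1]  # highest score first
        labels = np.zeros(len(scores), dtype=int)
        labels[order[:n_interesting]] = 1  # 1 = interesting
        return labels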

3. RESULTS AND DISCUSSION
We submit three different runs for the image subtask. While the first run
uses our fine-tuned AlexNet, the second one uses predictions from our
fine-tuned MemNet model. Lastly, the third run includes the results of our
proposed triplet network. All these models are trained using the provided
training data; however, to deal with overfitting, we split it into
training and validation sets with a ratio of 80% to 20%. In our
experiments, as the size of the training data is relatively small, we
decided to update the weights of only the fully connected layers of the
pretrained models.
  The performance of our models is evaluated by considering the accuracy
and the mean average precision (mAP) scores. Table 2 summarizes our
results on the test set. As can be seen, the mAP scores of all the
proposed models are not very high, demonstrating that interestingness
prediction is not a trivial task. We note that the high accuracy values
are somewhat misleading since the training data is highly unbalanced (see
Table 1). Hence, we additionally show the confusion matrices for all of
our runs in Table 3.

Table 2: Evaluation results on the test set.
    Run    mAP      accuracy
    1      0.2125   0.8224
    2      0.2121   0.8275
    3      0.2001   0.8249

Table 3: Confusion matrices for our runs.
    Run 1        Run 2        Run 3
    1890  211    1896  205    1893  208
     205   36     199   42     202   39
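
For reference, one plausible way to compute metrics of this kind is to take
the average precision of each test video's ranking and the overall accuracy
of the binary labels; the snippet below is a hedged sketch using
scikit-learn's average_precision_score and is not the official MediaEval
evaluation script:

    import numpy as np
    from sklearn.metrics import average_precision_score

    def evaluate(per_video_results):
        # per_video_results: list of (y_true, y_score, y_pred) triples, one
        # per test video, with binary ground truth, predicted scores and
        # predicted labels. Assumes each video has >= 1 interesting frame.
        aps = [average_precision_score(y_true, y_score)
               for y_true, y_score, _ in per_video_results]
        correct = sum((np.asarray(y_true) == np.asarray(y_pred)).sum()
                      for y_true, _, y_pred in per_video_results)
        total = sum(len(y_true) for y_true, _, _ in per_video_results)
        return float(np.mean(aps)), correct / total  # (mAP, accuracy)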

To sum up, the growth of visual media on the Internet has led to an
increased need for understanding and predicting the interestingness of
images and video shots. In this work, within the proposed deep models, we
treat each key frame of a given video as an independent sample. One
possible future direction could be to process each key frame in the
context of a local temporal neighborhood or the whole video, by extending
our models to process multiple key frames simultaneously. Another
extension could be to consider a multi-task learning scheme, which
involves jointly classifying key frames as interesting or not and
estimating an interestingness score based on a regression loss function,
eliminating the need for post-processing the regression scores.

Acknowledgement.
This work is partially supported by the Scientific and Technological
Research Council of Turkey (Award #113E497).

Copyright is held by the author/owner(s).
MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

4. REFERENCES
 [1] C.-H. Demarty, M. Sjöberg, B. Ionescu, T.-T. Do, H. Wang,
     N. Q. K. Duong, and F. Lefebvre. MediaEval 2016 Predicting Media
     Interestingness Task. In Proc. of the MediaEval 2016 Workshop, 2016.
 [2] M. Gygli, H. Grabner, H. Riemenschneider, F. Nater, and L. Van Gool.
     The interestingness of images. In Proc. International Conference on
     Computer Vision, pages 1633–1640, 2013.
 [3] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool. Creating
     summaries from user videos. In Proc. European Conference on Computer
     Vision, pages 505–520, 2014.
 [4] M. Gygli, Y. Song, and L. Cao. Video2GIF: Automatic generation of
     animated GIFs from video. In Proc. Computer Vision and Pattern
     Recognition, 2016.
 [5] P. Isola, J. Xiao, D. Parikh, A. Torralba, and
     A. Oliva. What makes a photograph memorable?
     IEEE Transactions on Pattern Analysis and Machine Intelligence,
     36(7):1469–1482, 2014.
 [6] Y.-G. Jiang, Y. Wang, R. Feng, X. Xue, Y. Zheng,
     and H. Yang. Understanding and predicting
     interestingness of videos. In Proc. Association for the
     Advancement of Artificial Intelligence Conference,
     pages 1113–1119, 2013.
 [7] H. Katti, K. Y. Bin, T. S. Chua, and M. Kankanhalli.
     Pre-attentive discrimination of interestingness in
     images. In Proc. IEEE International Conference on
     Multimedia and Expo, pages 1433–1436, 2008.
 [8] A. Khosla, A. S. Raju, A. Torralba, and A. Oliva.
     Understanding and predicting image memorability at
     a large scale. In Proc. International Conference on
     Computer Vision, pages 2390–2398, 2015.
 [9] A. Krizhevsky, I. Sutskever, and G. E. Hinton.
     Imagenet classification with deep convolutional neural
     networks. In F. Pereira, C. J. C. Burges, L. Bottou,
     and K. Q. Weinberger, editors, Advances in Neural
     Information Processing Systems, pages 1097–1105,
     2012.
[10] X. Wang and A. Gupta. Unsupervised learning of
     visual representations using videos. In Proc.
     International Conference on Computer Vision, pages
     2794–2802, 2015.