HUCVL at MediaEval 2016: Predicting Interesting Key Frames with Deep Models

Goksu Erdogan, Aykut Erdem, Erkut Erdem
Hacettepe Computer Vision Lab (HUCVL)
Department of Computer Engineering, Hacettepe University, Ankara, Turkey
{goksuerdogan, aykut, erkut}@cs.hacettepe.edu.tr

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT
In MediaEval 2016, we focus on the image interestingness subtask, which involves predicting interesting key frames of a video in the form of a movie trailer. We specifically propose three different deep models for this subtask. The first two models are based on fine-tuning two pretrained models, namely AlexNet and MemNet, where we cast interestingness prediction as a regression problem. Our third deep model, on the other hand, depends on a triplet network, which is comprised of three instances of the same feedforward network with shared weights and is trained according to a triplet ranking loss. Our experiments demonstrate that all these models provide relatively similar and promising results on the image interestingness subtask.

1. INTRODUCTION
Understanding and predicting the interestingness of images or video shots has recently been proposed as a problem in the computer vision literature [7, 6, 2], and it finds many applications such as video summarization [3] or the automatic generation of animated GIFs [4]. The MediaEval 2016 Predicting Media Interestingness Task is introduced as a new task which consists of two subtasks, on the image and video levels, respectively. In our work, we concentrate only on the image subtask, which involves identifying interesting key frames of a given video of a movie trailer, and where we process each frame independently. Details about this subtask, including the related dataset and the experimental setting, can be found in the overview paper [1].

2. METHODS
Deep convolutional neural networks (CNNs) have revolutionized the computer vision field in recent years, obtaining state-of-the-art results in many different problem domains. In our submission, we tested three different CNN models, which are all based on the popular AlexNet architecture [9]. All of our networks have five convolutional layers and three fully connected layers, with a final layer returning a scalar interestingness score. The detailed descriptions of our models are given in Sections 2.1-2.3, and in Section 2.4 we explain how we convert interestingness scores into labels.

2.1 AlexNet
For our first model, we fine-tune AlexNet [9], which is trained on the ILSVRC 2012 ImageNet task to classify 1,000 object categories. Image interestingness requires predicting a single real-valued output, so we replace the last soft-max layer with a regression layer and use a Euclidean loss layer to fine-tune the model. In our experiments, we only fine-tune the last fully connected layer, while the weights of the other layers are not updated. Training lasted approximately 2000 epochs.
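The recipe of Section 2.1 can be illustrated with the following minimal sketch. It is written in PyTorch purely for illustration (the paper does not name the framework; the "Euclidean loss layer" terminology suggests Caffe), and it shows one way to swap the final soft-max classifier for a single-output regression layer and update only that layer with a Euclidean (MSE) loss.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load AlexNet pretrained on ImageNet and freeze all pretrained weights.
model = models.alexnet(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# Replace the last fully connected (1000-way soft-max) layer with a
# single-output regression layer; only this layer will be fine-tuned.
model.classifier[6] = nn.Linear(4096, 1)

criterion = nn.MSELoss()  # Euclidean loss on the predicted score
optimizer = torch.optim.SGD(model.classifier[6].parameters(), lr=1e-3, momentum=0.9)

def train_step(images, scores):
    """One fine-tuning step on a batch of key frames (images) and their
    ground-truth interestingness scores (both torch tensors)."""
    preds = model(images).squeeze(1)   # (batch,) predicted scores
    loss = criterion(preds, scores)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```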
2.2 MemNet
Our second model is based on the recently proposed MemNet model [8], which is trained for the image memorability task. Although memorability and interestingness are not exactly the same [5], we think that fine-tuning a model related to an intrinsic property of images could help us learn better high-level features for the interestingness task. In our experiments, we only update the weights of the fully connected layers, and training lasted nearly 3000 epochs.

2.3 Triplet Loss
Our third model also follows the AlexNet architecture, but differs from our previous models in that we employ a different training procedure. Specifically, we consider a deep triplet network which is composed of three instances of the AlexNet model, where the weights are shared across the instances. We employ a ranking loss similar to that of [4]. However, while the authors of [4] consider a siamese network and a pairwise ranking loss, here we utilize a triplet ranking loss [10] within our network. Once training is finished, we use a single instance of the feedforward network to predict the interestingness score of a given key frame.

Considering a triplet network allows us to learn a 1D embedding space for images, where the triplet ranking loss enforces an interesting frame to lie close to other interesting frames and far away from the uninteresting ones:

    L(x, x+, x−) = max(0, D(x, x+) − D(x, x−) + M)    (1)

where x, x+, x− denote the anchor, positive and negative samples given as inputs, D(·) represents the distance between the interestingness scores, and M represents the margin. In terms of optimization, one critical point is the triplet selection procedure, since we observe that using all possible triplets in training is costly and might lead to a local minimum. For this reason, we use a hard negative mining strategy, which is commonly used in similar works. During training, we only fine-tune the fully connected layers as in our previous models, while all the weights are initialized with AlexNet weights. We interrupted training at the 10000th epoch.
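As a concrete illustration of Eq. (1), the following is a minimal Python sketch. It assumes D(·) is the squared difference between the scalar scores produced by the shared network; the exact form of D and the framework are not specified in the text, so both are assumptions here.

```python
import torch.nn.functional as F

def triplet_ranking_loss(s_anchor, s_pos, s_neg, margin=1.0):
    """Eq. (1): mean over the batch of max(0, D(x, x+) - D(x, x-) + M),
    where the inputs are scalar interestingness scores from the shared network."""
    d_pos = (s_anchor - s_pos) ** 2   # distance to an interesting (positive) frame
    d_neg = (s_anchor - s_neg) ** 2   # distance to an uninteresting (negative) frame
    return F.relu(d_pos - d_neg + margin).mean()

# All three triplet members pass through the same network instance (shared weights):
#   loss = triplet_ranking_loss(net(x), net(x_pos), net(x_neg))
```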
2.4 Interestingness Classification
Thus far, we have described our CNN models, which can be used to compute real-valued interestingness scores for each key frame of a given video sequence, where these scores correspond to confidence values. However, in addition to predicting interestingness scores, the task also requires classifying each frame as interesting or not. A simple and straightforward way to convert real-valued outputs to class labels is to introduce a thresholding procedure. However, choosing a single appropriate threshold value is not easy; in fact, we observe that it is very video-sequence dependent. Figure 1 shows the ground-truth distributions of the confidence values for the interesting (blue) and uninteresting (orange) frames over all video sequences in the training set. As can be seen, these distributions have a large overlap, demonstrating that a single threshold value won't work.

Figure 1: Distributions of the confidence values for interesting/uninteresting frames.

Next, we analyzed the ratio between interesting and uninteresting key frames for each training video. As shown in Table 1, the ratio is, on average, about 1:9. Hence, given a test video sequence, we sort all its key frames according to their predicted interestingness scores and classify the top 10% of frames as interesting.

Table 1: Statistics for the confidence values for interesting and uninteresting frames over the training data.

    frames          mean    std
    interesting     0.11    0.08
    uninteresting   0.89    0.08
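A sketch of this post-processing rule is given below, assuming the per-video predictions are available as a plain Python list of scores; the data handling is hypothetical, and only the top-10% rule itself comes from the text.

```python
def classify_keyframes(scores, ratio=0.10):
    """Sort the key frames of one video by predicted interestingness score
    and label the top `ratio` fraction as interesting (1), the rest as 0."""
    n_interesting = max(1, round(len(scores) * ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    labels = [0] * len(scores)
    for i in ranked[:n_interesting]:
        labels[i] = 1
    return labels

# Example: for a 20-frame video, the 2 highest-scoring frames are kept.
print(classify_keyframes([0.3, 0.9, 0.1, 0.8] + [0.2] * 16))
```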
3. RESULTS AND DISCUSSION
We submit three different runs for the image subtask. While the first run uses our fine-tuned AlexNet, the second one uses predictions from our fine-tuned MemNet model. Lastly, the third run includes the results of our proposed triplet network. All these models are trained by using the provided training data; however, we split it into training and validation sets using a ratio of 80% to 20% to deal with overfitting. In our experiments, as the size of the training data is relatively small, we decided to update the weights of only the fully connected layers of the pretrained models.

The performances of our models are evaluated by considering the accuracy and the mean average precision (mAP) scores. Table 2 summarizes our results on the test set. As can be seen, the mAP scores of all the proposed models are not very high, demonstrating that interestingness prediction is not a trivial task. We note that the high accuracy values are somewhat misleading since the training data is highly unbalanced (see Table 1). Hence, we additionally show the confusion matrices for all of our runs in Table 3.

Table 2: Evaluation results on the test set.

    Run    mAP      accuracy
    1      0.2125   0.8224
    2      0.2121   0.8275
    3      0.2001   0.8249

Table 3: Confusion matrices for our runs.

    Run 1          Run 2          Run 3
    1890   211     1896   205     1893   208
     205    36      199    42      202    39

To sum up, the growth of visual media on the Internet has led to an increased need for understanding and predicting the interestingness of images and video shots. In this work, within the proposed deep models, we treat each key frame of a given video as an independent sample. One possible future direction could be to process each key frame in the context of a local temporal neighborhood or the whole video, by extending our models to process multiple key frames simultaneously. Another extension could be to consider a multi-task learning scheme, which involves jointly classifying key frames as interesting or not and estimating an interestingness score based on a regression loss, which would eliminate the need for post-processing the regression scores.

Acknowledgement.
This work is partially supported by the Scientific and Technological Research Council of Turkey (Award #113E497).

4. REFERENCES
[1] C.-H. Demarty, M. Sjöberg, B. Ionescu, T.-T. Do, H. Wang, N. Q. K. Duong, and F. Lefebvre. MediaEval 2016 Predicting Media Interestingness task. In Proc. of the MediaEval 2016 Workshop, 2016.
[2] M. Gygli, H. Grabner, H. Riemenschneider, F. Nater, and L. Van Gool. The interestingness of images. In Proc. International Conference on Computer Vision, pages 1633–1640, 2013.
[3] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool. Creating summaries from user videos. In Proc. European Conference on Computer Vision, pages 505–520, 2014.
[4] M. Gygli, Y. Song, and L. Cao. Video2GIF: Automatic generation of animated GIFs from video. In Proc. Computer Vision and Pattern Recognition, 2016.
[5] P. Isola, J. Xiao, D. Parikh, A. Torralba, and A. Oliva. What makes a photograph memorable? IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1469–1482, 2014.
[6] Y.-G. Jiang, Y. Wang, R. Feng, X. Xue, Y. Zheng, and H. Yang. Understanding and predicting interestingness of videos. In Proc. Association for the Advancement of Artificial Intelligence Conference, pages 1113–1119, 2013.
[7] H. Katti, K. Y. Bin, T. S. Chua, and M. Kankanhalli. Pre-attentive discrimination of interestingness in images. In Proc. IEEE International Conference on Multimedia and Expo, pages 1433–1436, 2008.
[8] A. Khosla, A. S. Raju, A. Torralba, and A. Oliva. Understanding and predicting image memorability at a large scale. In Proc. International Conference on Computer Vision, pages 2390–2398, 2015.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[10] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In Proc. International Conference on Computer Vision, pages 2794–2802, 2015.