=Paper=
{{Paper
|id=Vol-1984/Mediaeval_2017_paper_22
|storemode=property
|title=TCNJ-CS@MediaEval 2017 Predicting Media Interestingness Task
|pdfUrl=https://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_22.pdf
|volume=Vol-1984
|authors=Sejong Yoon
|dblpUrl=https://dblp.org/rec/conf/mediaeval/Yoon17
}}
==TCNJ-CS@MediaEval 2017 Predicting Media Interestingness Task==
Sejong Yoon, The College of New Jersey, USA (yoons@tcnj.edu)

Copyright held by the owner/author(s). MediaEval'17, 13-15 September 2017, Dublin, Ireland.

ABSTRACT
In this paper, we present our approach and investigation on the MediaEval 2017 Predicting Media Interestingness Task. We used most of the visual and auditory features provided. The standard kernel fusion technique was applied to combine features, and we used the ranking support vector machine to learn the classification model. No extra data was introduced to train the model. Official results, as well as our investigation of the task data, are provided at the end.

1 INTRODUCTION
MediaEval 2017 Predicting Media Interestingness [2] consists of two subtasks. In the first subtask, the system should predict whether common viewers will consider a given image interesting or not. In the second subtask, the same prediction should be made for a video segment. In both subtasks, the system should predict both the binary decision of whether the media is interesting and the ranking of the image frame/video segment among all image frames/video segments within the same movie. The data consists of 108 video clips. In total, 7,396 key-frames and the same number of video segments are provided in the development set, and 2,436 key-frames and the same number of video segments are reserved for the test set. In this work, we used most of the features provided by the task organizers and did not introduce any external data, e.g., meta-data, ratings, or reviews of the movies.

2 APPROACH
In this section, we first describe the features we employed and then present our classification method.

2.1 Features
We used features from different modalities. All features were provided by the task organizers.

Visual Features. We used nearly all features provided, including the color histogram in HSV space, GIST [9], Dense SIFT [7], HOG 2x2 [1], Local Binary Pattern (LBP) [8], the prob layer (fc8, probabilities of the 1,000 predicted object labels) of AlexNet [5], and C3D [10].

Audio Features. We used the provided Mel-frequency Cepstral Coefficients (MFCC) features. An MFCC descriptor (60 dimensions) is computed over every 32 ms temporal window with a 16 ms shift. The first and second derivatives of the cepstral vectors are also included in the MFCC descriptors.

For the image prediction task, we vectorized each feature per frame. For the video prediction task, we took the mean of the raw feature values of all frames in the segment. Given the original feature f_{t,n} for the n-th frame in the t-th segment, we compute the summarized feature for segment t as

x_t = \frac{1}{N} \sum_{n=1}^{N} f_{t,n}    (1)

where N denotes the total number of frames in the segment. We used the prob (fc8) layer to incorporate semantic information of the training data that can be extracted from the deep neural network.
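The paper gives no implementation details for this step; the following is a minimal sketch of the segment-level summarization in Eq. (1), assuming per-frame descriptors are available as a NumPy array (function and variable names are hypothetical):

```python
import numpy as np

def summarize_segment(frame_features):
    """Mean-pool per-frame descriptors into one segment-level descriptor,
    i.e. x_t = (1/N) * sum_n f_{t,n} as in Eq. (1)."""
    frame_features = np.asarray(frame_features, dtype=np.float64)  # shape (N, D)
    return frame_features.mean(axis=0)                             # shape (D,)

# Usage with hypothetical data: a segment of 5 frames, 128-dim features each.
frames = np.random.rand(5, 128)
segment_descriptor = summarize_segment(frames)
print(segment_descriptor.shape)  # (128,)
```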
2.2 Classification
We applied the standard kernel fusion approach: we compute a kernel for each type of feature and combine the kernels either by addition or by multiplication. We used multiplication within the same modality and addition across different modalities. For the image prediction subtask, we used the following combination of kernels:

K_1 = K_{chist} \cdot K_{gist},    (2)
K_2 = K_{dhist} \cdot K_{hog} \cdot K_{lbp},    (3)
K_{all} = K_1 + K_2 + K_{prob}.    (4)

The rationale behind this choice was to treat features based on global histograms and features using spatial pyramids [6] as different modalities. We present results for the different kernel combinations on the development set in the following section. The CNN probability layer, K_prob, is also considered another modality since it conveys semantic information (objects in the images). For the video prediction subtask, we used the following combination of kernels:

K_{all} = K_1 + K_2 + K_{prob} + K_{c3d} + K_{mfcc}.    (5)

Since the C3D and MFCC features model the temporal aspect of the input, we consider them modalities distinct from the visual features.

For the kernel choice, we used the RBF kernel, with the median of the training data used to set the hyper-parameter. For the classification model, we used the ranking support vector machine: we used SVM^rank [4] to learn pair-wise ranking patterns from the development set data, following prior work [3].
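The exact fusion implementation is not given in the paper; the sketch below illustrates Eqs. (2)-(4) under the stated assumptions: one RBF kernel per feature type with the bandwidth set by the median heuristic mentioned above, element-wise multiplication within a modality, and addition across modalities. The feature matrices and names here are placeholders, not the task features themselves.

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel_median(X):
    """RBF kernel on training data X (shape (n, d)), with the bandwidth
    set to the median pairwise Euclidean distance (median heuristic)."""
    dists = cdist(X, X, metric="euclidean")
    sigma = np.median(dists)
    return np.exp(-dists ** 2 / (2.0 * sigma ** 2))

# Placeholder per-feature matrices, each of shape (num_images, feature_dim).
rng = np.random.default_rng(0)
feats = {name: rng.random((100, 64))
         for name in ["chist", "gist", "dhist", "hog", "lbp", "prob"]}
K = {name: rbf_kernel_median(X) for name, X in feats.items()}

# Eqs. (2)-(4): multiply kernels within a modality, add across modalities.
K1 = K["chist"] * K["gist"]              # global-histogram modality
K2 = K["dhist"] * K["hog"] * K["lbp"]    # spatial-pyramid modality
K_all = K1 + K2 + K["prob"]              # CNN probability layer as a third modality
```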
3 RESULTS AND ANALYSIS
The official evaluation metric is the mean average precision at 10 (MAP@10), computed over all videos and over the top 10 best ranked images/video segments (a computational sketch of this metric is given at the end of this section). First, we present the different kernel combinations we tested on the development set. Table 1 describes the different kernel fusion formulas we used in the experiments. We report both MAP and MAP@10 results in Table 2. As one can see, there are no significant differences among the kernel fusion choices. We used a 50-50 split, i.e., 39 movies each for the train and test splits of the development set.

Table 1: Different visual feature combinations
Combined kernel | Fusion formula
V1 | K_1 · K_2 · K_prob
V2 | K_1 · K_2 + K_prob
V3 | K_1 + K_2 + K_prob

Table 2: Results of all subtasks on the development set
Subtask | Measure | Result | Kernel
Image | MAP | 0.3065 | V1
Image | MAP@10 | 0.0123 | V1
Image | MAP | 0.3013 | V2
Image | MAP@10 | 0.0094 | V2
Image | MAP | 0.3003 | V3
Image | MAP@10 | 0.0074 | V3
Video | MAP | 0.3052 | V1 + K_c3d + K_mfcc
Video | MAP@10 | 0.0084 | V1 + K_c3d + K_mfcc
Video | MAP | 0.3055 | V2 + K_c3d + K_mfcc
Video | MAP@10 | 0.0082 | V2 + K_c3d + K_mfcc
Video | MAP | 0.3038 | V3 + K_c3d + K_mfcc
Video | MAP@10 | 0.0082 | V3 + K_c3d + K_mfcc

We also report both MAP and MAP@10 results on the test set, as provided by the task organizers, in Table 3. As described in the previous section, we used the visual feature combination of Eq. (4) for the image prediction task and the multi-modal combination of Eq. (5) for the video prediction task. SVM^rank takes the ranking information as the label of the input data and generates pairwise constraints. All ranking information provided in the development set was used for training the SVM^rank model, with the image snapshots and video segments of each movie grouped together.

Table 3: Results of all subtasks on the test set
Subtask | Measure | Result | Kernel
Image | MAP | 0.1331 | V3
Image | MAP@10 | 0.0126 | V3
Video | MAP | 0.1774 | V3 + K_c3d + K_mfcc
Video | MAP@10 | 0.0524 | V3 + K_c3d + K_mfcc

As can be seen, the system shows low performance in both the image and video subtasks. This is not surprising given the very simple nature of the approach we applied to the task. What was not expected is that the video prediction result is much better (although still not reaching the level of good performance) than the image prediction result, which was not observable on the development set. This is interesting because we used the same set of features for the image and video prediction subtasks, and the only differences are the two additional features modeling the temporal aspect of the data (C3D, MFCC). We believe this reiterates a known understanding of the task: we must somehow incorporate temporal information to improve video interestingness prediction.
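For reference, the sketch below shows one common way to compute AP@10 and MAP@10 from ranked binary interestingness labels. The official evaluation script may use a different normalization, so this is only an approximation of the task metric, with hypothetical example data.

```python
import numpy as np

def average_precision_at_k(ranked_labels, k=10):
    """AP@k for one movie: ranked_labels are binary ground-truth labels
    ordered by the system's predicted ranking (best first)."""
    labels = np.asarray(ranked_labels[:k], dtype=float)
    if labels.sum() == 0:
        return 0.0
    precision_at_i = np.cumsum(labels) / (np.arange(labels.size) + 1)
    # Normalize by the number of relevant items retrieved in the top k.
    return float((precision_at_i * labels).sum() / labels.sum())

def map_at_k(rankings_per_movie, k=10):
    """MAP@k: the mean of per-movie AP@k values."""
    return float(np.mean([average_precision_at_k(r, k) for r in rankings_per_movie]))

# Hypothetical example with two movies.
print(map_at_k([[1, 0, 0, 1, 0], [0, 1, 1, 0, 0]], k=10))
```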
4 DISCUSSION AND OUTLOOK
One of the major challenges in video interestingness prediction is to fill the semantic gap. Initially, we intended to fill this gap by capturing the expected emotional state of viewers and connecting it to the notion of interestingness. Table 4 shows our categorization of the most interesting segments in each movie clip, gathered during the course of the work. As can be seen, many of the categories are closely related to key emotional states that modern, existing affect prediction methods can predict. This is particularly true for violence, horror, and joy, which make up a large proportion of the most interesting video segments. On the other hand, there are many other video segments for which one cannot readily identify the root of the interest stimuli. These typically require a higher-level understanding of the context. The best example is the third movie in the Others category, which requires fusion of all modalities plus reading a sentence shown on the image frame.

Table 4: Key-frames of the most interesting segments in some development set movies, categorized into types of interest stimuli (the key-frame images are omitted here): Violence; Nudity; Horror / Surprise; Romantic mood; Facial expression; Joyful, Fun, Humor; Open view / scenery; Others (context).

In the future, we hope to tackle the media interestingness prediction problem in this direction. Perhaps the most promising approach at this point is to understand human activities and link them to emotions and interestingness.

ACKNOWLEDGMENTS
This work was supported in part by The College of New Jersey under the Support Of Scholarly Activity (SOSA) 2017-2019 grant.

REFERENCES
[1] N. Dalal and B. Triggs. 2005. Histograms of Oriented Gradients for Human Detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Vol. 1, 886–893. https://doi.org/10.1109/CVPR.2005.177
[2] Claire-Hélène Demarty, Mats Sjöberg, Bogdan Ionescu, Thanh-Toan Do, Michael Gygli, and Ngoc Q. K. Duong. 2017. Predicting Media Interestingness Task at MediaEval 2017. In Proc. of the MediaEval 2017 Workshop, Dublin, Ireland, Sept. 13-15, 2017.
[3] Yu-Gang Jiang, Yanran Wang, Rui Feng, Xiangyang Xue, Yingbin Zheng, and Hanfang Yang. 2013. Understanding and Predicting Interestingness of Videos. In AAAI.
[4] Thorsten Joachims. 2006. Training Linear SVMs in Linear Time. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD).
[5] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25 (NIPS).
[6] S. Lazebnik, C. Schmid, and J. Ponce. 2006. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), Vol. 2, 2169–2178. https://doi.org/10.1109/CVPR.2006.68
[7] David Lowe. 2004. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision (2004).
[8] Timo Ojala, Matti Pietikäinen, and Topi Mäenpää. 2002. Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24, 7 (July 2002), 971–987. https://doi.org/10.1109/TPAMI.2002.1017623
[9] Aude Oliva and Antonio Torralba. 2001. Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope. Int. J. Comput. Vision 42, 3 (May 2001), 145–175. https://doi.org/10.1023/A:1011139631724
[10] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. In ICCV.