=Paper=
{{Paper
|id=Vol-1984/Mediaeval_2017_paper_17
|storemode=property
|title=DA-IICT at MediaEval 2017: Objective Prediction of Media Interestingness
|pdfUrl=https://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_17.pdf
|volume=Vol-1984
|authors=Rashi Gupta,Manish Narwaria
|dblpUrl=https://dblp.org/rec/conf/mediaeval/GuptaN17
}}
==DA-IICT at MediaEval 2017: Objective Prediction of Media Interestingness==
Rashi Gupta, Manish Narwaria
Dhirubhai Ambani Institute of Information and Communication Technology
rashi.8496@gmail.com, manish_narwaria@daiict.ac.in

Copyright held by the owner/author(s).
MediaEval'17, 13-15 September 2017, Dublin, Ireland

ABSTRACT

Interestingness is defined as the power of engaging and holding curiosity. While humans can almost effortlessly rank and judge the interestingness of a scene, automated prediction of interestingness for an arbitrary scene is a challenging problem. In this work, we attempt to develop a computational model for this problem. Our approach is based on identifying and extracting context-specific features from video clips. These features are subsequently used in a predictor model to produce continuous scores that can be related to the interestingness of the scene in question. Such computational models can be useful in the automated analysis of videos (e.g. a movie, CCTV footage or a clip from an advertisement).

1 INTRODUCTION

The aim of the task is to select the content (images and video clips) considered most interesting for a common viewer. This is a challenging task because the interestingness of media is highly subjective and can depend on multiple aspects including personal preferences, emotional state and the content itself. Therefore, as a first step, our goal in this task is to understand and extract signal-related features which may, for instance, quantify visual appearance and audio information. Such features can then be mapped to an interestingness score via machine learning. Further details about the task and dataset can be found in [1].

The task falls under the broad areas of multimedia signal processing (image, video and audio) and machine learning. The former focuses on the analysis and extraction of context-specific features from the signal; these may include color, contrast, complexity, audio characteristics, etc. The primary goal of feature extraction is to obtain a more meaningful signal representation from the viewpoint of capturing useful information pertaining to media interestingness. These features are subsequently used as input to a regressor (e.g. linear regression or a multilayer perceptron). As the target value of this regression problem is known (the interestingness score given by a panel of human subjects), this is a supervised learning problem.

We note that a similar approach has been used in previous works such as [2, 3]. However, the key difference lies in the features used, and this is one of the contributions of this work. The results also shed light on some aspects of interestingness that may not be fully captured by the current set of features. This can be used to improve feature extraction and, in the process, to predict objective media interestingness scores that are closer to human judgments.
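The supervised set-up described above, mapping per-image feature vectors to human interestingness scores, can be sketched as follows. This is a minimal illustration with synthetic data; the actual features and dataset come from the task, and the five-feature layout is taken from Section 2.2.

```python
import numpy as np

# Toy setup: each row of X is a feature vector for one image
# (colorfulness, contrast, complexity mean, complexity std, attention),
# and y holds the human-annotated interestingness scores.
rng = np.random.default_rng(0)
X = rng.random((50, 5))          # 50 images, 5 features (synthetic)
y = rng.random(50)               # synthetic target scores in [0, 1]

# Linear regression via least squares: append a bias column and solve.
Xb = np.hstack([X, np.ones((X.shape[0], 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Predict a continuous interestingness score for a new feature vector.
x_new = np.append(rng.random(5), 1.0)
score = float(x_new @ w)
```

The same interface extends to the video subtask by concatenating the audio feature vector to `X` and swapping the least-squares solve for a multilayer perceptron.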
2 APPROACH

2.1 Interestingness Features Computation

The following features are extracted from each image for the image subtask: colorfulness, contrast, complexity and visual attention. For the video subtask, along with these features, the audio feature Mel-frequency cepstral coefficients (MFCCs) is also computed, to take the audio of the clip into account.

Colorfulness: We measure colorfulness as proposed by [6]. The red-green and yellow-blue opponent color spaces are used, where α = R − G and β = 0.5(R + G) − B, and σ_α², σ_β², μ_α, μ_β denote the variances and means along these two opponent color axes, defined (for the α axis, and analogously for β) as:

μ_α = (1/N) Σ_{p=1..N} α_p  and  σ_α² = (1/N) Σ_{p=1..N} (α_p² − μ_α²)

The measure combines the ratio of the variance to the average chrominance in each of the opponent components:

colorfulness = 0.02 × log(σ_α² / |μ_α|^0.2) × log(σ_β² / |μ_β|^0.2)
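The colorfulness measure above can be sketched directly from the formula. The small `eps` guard against zero means or variances is our addition, not part of the original definition:

```python
import numpy as np

def colorfulness(img):
    """Colorfulness of an RGB image (H x W x 3), following the
    opponent-color-space measure described above."""
    r, g, b = (img[..., i].astype(float) for i in range(3))
    alpha = r - g                       # red-green opponent axis
    beta = 0.5 * (r + g) - b            # yellow-blue opponent axis
    var_a, var_b = alpha.var(), beta.var()
    mu_a, mu_b = abs(alpha.mean()), abs(beta.mean())
    eps = 1e-8                          # guard against log(0) / division by zero
    return 0.02 * np.log(var_a / (mu_a ** 0.2 + eps) + eps) \
                * np.log(var_b / (mu_b ** 0.2 + eps) + eps)
```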
Contrast: We measure contrast as proposed by [5]. The main idea is to compute local contrast factors at various resolutions and then build a weighted average to obtain the global contrast factor. Let k denote the original pixel value, k = 0, 1, ..., 254, 255. The first step is to apply gamma correction with γ = 2.2 and scale the input values to the [0, 1] range; the linear luminance is l = (k/255)^γ. The perceptual luminance L can be approximated by the square root of the linear luminance: L = 100 × √l.

Once the perceptual luminances are computed, we compute the local contrast. For each pixel, we compute the average difference of L between the pixel and its four neighboring pixels:

lc_i = (|L_i − L_{i−1}| + |L_i − L_{i+1}| + |L_i − L_{i−w}| + |L_i − L_{i+w}|) / 4

The average local contrast C_i for the current resolution is computed as the average of lc_i over the whole image, where the image is w pixels wide and h pixels high:

C_i = (1/(w × h)) × Σ_{i=1..w×h} lc_i

The C_i are computed at various resolutions. Once C_i for the original image is computed, we build a smaller-resolution image by combining 4 original pixels into one superpixel, so that the image width and height are each half the original. The process continues until only a few huge superpixels remain in the image. With the average local contrasts C_i computed at the N resolutions, the global contrast factor is the weighted sum:

GCF = Σ_{i=1..N} w_i × C_i
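The multi-resolution scheme above can be sketched as follows. Note two assumptions on our part: border pixels are still divided by 4 even though they have fewer neighbors, and the resolution weights w_i are taken as uniform, whereas [5] uses specifically tuned weights:

```python
import numpy as np

def global_contrast_factor(gray):
    """Global contrast factor (GCF) sketch for a gray-scale image with
    pixel values in [0, 255], per the multi-resolution scheme above."""
    L = 100.0 * np.sqrt((gray.astype(float) / 255.0) ** 2.2)
    cs = []                                         # per-resolution averages C_i
    while min(L.shape) >= 2:
        # Average absolute difference of L to the 4 neighbours.
        d = np.zeros_like(L)
        d[:, 1:] += np.abs(L[:, 1:] - L[:, :-1])    # left neighbour
        d[:, :-1] += np.abs(L[:, :-1] - L[:, 1:])   # right neighbour
        d[1:, :] += np.abs(L[1:, :] - L[:-1, :])    # neighbour above
        d[:-1, :] += np.abs(L[:-1, :] - L[1:, :])   # neighbour below
        cs.append((d / 4.0).mean())
        # Combine 2x2 pixel blocks into one superpixel (halve each dimension).
        h, w = L.shape[0] // 2 * 2, L.shape[1] // 2 * 2
        L = (L[:h:2, :w:2] + L[1:h:2, :w:2] +
             L[:h:2, 1:w:2] + L[1:h:2, 1:w:2]) / 4.0
    weights = np.ones(len(cs)) / len(cs)            # assumed uniform weights w_i
    return float(np.dot(weights, cs))
```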
Complexity: We measure complexity by calculating the spatial information as proposed by [7]. Let s_h and s_v denote the gray-scale images filtered with horizontal and vertical Sobel kernels, respectively. SI_r = √(s_h² + s_v²) represents the magnitude of the spatial information at every pixel. The mean and standard deviation of SI_r are used to characterize the complexity of an image:

SI_mean = (1/P) Σ SI_r  and  SI_stdev = √((1/P) Σ (SI_r² − SI_mean²))

where P is the number of pixels in the image.

Visual attention: We calculate the attention of an image by computing a saliency map for it; the implementation of [4] is used for the saliency map computation. The mean of the saliency map over all pixels is the attention value.
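The spatial-information complexity measure can be sketched as below. The zero-padded border handling in the small convolution helper is our choice; [7] does not prescribe it:

```python
import numpy as np

def spatial_information(gray):
    """Mean and standard deviation of spatial information (SI) for a
    gray-scale image, per the Sobel-based definition above."""
    kh = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)  # horizontal Sobel
    kv = kh.T                                                   # vertical Sobel

    def conv2(img, k):
        # Same-size correlation with zero padding at the borders.
        p = np.pad(img.astype(float), 1)
        out = np.zeros(img.shape, dtype=float)
        for dy in range(3):
            for dx in range(3):
                out += k[dy, dx] * p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
        return out

    si = np.sqrt(conv2(gray, kh) ** 2 + conv2(gray, kv) ** 2)
    return si.mean(), si.std()   # si.std() equals the SI_stdev formula above
```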
Audio extraction: For the audio features, Mel-frequency cepstral coefficients (MFCCs) are computed, and the mean and standard deviation over the time frames are calculated. This constitutes the feature vector for the audio.

Novelty: We propose a method to calculate novelty from the saliency maps of consecutive frames. Saliency maps are first computed for the images and smoothed with an 8 × 8 averaging filter. For each block of two consecutive saliency maps, if the average saliency of both blocks is less than a threshold (0.1), the block is discarded. Otherwise, the mean squared error (MSE) between the two blocks is calculated; if the MSE is less than the threshold, the block is likewise ignored, else it is retained. Finally, the mean of the MSEs over the retained blocks is computed. The higher this value, the more action has happened between the two consecutive frames.
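Our reading of the block-wise novelty scheme can be sketched as follows. The saliency computation itself (e.g. via [4]) and the 8 × 8 smoothing are assumed to happen upstream; treating 8 × 8 as the block size and reusing 0.1 as both thresholds follows the description above, but other interpretations are possible:

```python
import numpy as np

def novelty(sal_prev, sal_next, block=8, thresh=0.1):
    """Novelty between two consecutive saliency maps (values in [0, 1]),
    a sketch of the block-wise MSE scheme described above."""
    h = sal_prev.shape[0] // block * block
    w = sal_prev.shape[1] // block * block
    mses = []
    for y in range(0, h, block):
        for x in range(0, w, block):
            a = sal_prev[y:y + block, x:x + block]
            b = sal_next[y:y + block, x:x + block]
            if a.mean() < thresh and b.mean() < thresh:
                continue                     # both blocks near-empty: skip
            mse = float(((a - b) ** 2).mean())
            if mse < thresh:
                continue                     # negligible change: skip
            mses.append(mse)
    # Mean MSE over retained blocks; 0 if no block shows meaningful change.
    return float(np.mean(mses)) if mses else 0.0
```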
2.2 Interestingness Prediction

For the image subtask, we use five features, namely colorfulness, contrast, complexity (mean and standard deviation) and attention. With these features, we learn our model of interestingness using linear regression. For the video subtask, along with these five features, we also add the audio feature vector. As there are more features in this case, we use a multilayer perceptron. We used the mean image provided in the image subtask for the computation of the video feature vector, as computation over all frames of the video was not feasible given the time constraints.

3 EXPERIMENTAL RESULTS

3.1 Evaluation

For the image subtask, using the five features with linear regression as the learning algorithm, the MAP@10 is 0.0406 on the test set. For the video subtask, with these features plus the audio feature vector and a multilayer perceptron as the learning algorithm, the MAP@10 is 0.0636 on the test set.

3.2 Analysis

For the image subtask, the maximum value of average precision@10 is obtained for those videos in which the top 10 images that are interesting in the common view are more colorful and have high variation and contrast. These also include images that are eye-catching because of the visual attention feature, for example images showing many people gathered, as in a rebellion or a meeting, or people wearing clothes in a vibrant setting, say at a party.

Similarly, for the video subtask, the maximum value of average precision@10 is obtained for those videos in which the top 10 clips that are interesting in the common view are more colorful, have high complexity and capture attention. A clip with loud audio also seems to attract more viewers: a clip showing a blast or people screaming, for example, is of more significance than a silent clip.

On the contrary, the lowest average precision@10 occurs for those videos in which the most interesting images are less colorful and have very little variation. One example of such a scene is a dark wall painted with strange symbols; these symbols may have a back story in the movie and hence be interesting in the common view. Another is a scene with explicit content, which is usually shown in the dark with very little variation and is arousing for humans. In such cases, the model tends to predict the following kinds of images as more interesting: a crowded place of no greater significance, complex buildings and roads of no greater importance, or simply a lighted empty room.

For the video subtask, the lowest average precision@10 arises because in these scenes the audio is also negligible, be it a moment of suspicion or any of the examples mentioned for the low values in the image subtask; the model would instead predict as interesting those clips which have louder audio along with the other features, as in the image subtask.

4 CONCLUSION

The interestingness of a scene is a subjective aspect and one that involves complex cognitive processes. However, certain features such as contrast, colorfulness and novelty of the scene can be assumed to play a part in the way humans quantify interestingness, irrespective of the type of scene. Therefore, in this work, we extracted and used such audio and visual features to develop a model for predicting interestingness. Such an approach is, of course, an initial step towards building a more comprehensive model. The novelty feature proposed in this paper was not used for the current task due to time constraints and can be exploited in future work.

ACKNOWLEDGMENTS

We thank Karan Thakkar for his fruitful help.

REFERENCES

[1] Demarty et al. 2017. MediaEval 2017 Predicting Media Interestingness Task. (2017).
[2] Helmut Grabner, Fabian Nater, Michel Druey, and Luc Van Gool. 2013. Visual interestingness in image sequences. In Proceedings of the 21st ACM International Conference on Multimedia. ACM, 1017-1026.
[3] Michael Gygli, Helmut Grabner, Hayko Riemenschneider, Fabian Nater, and Luc Van Gool. 2013. The interestingness of images. In Proceedings of the IEEE International Conference on Computer Vision. 1633-1640.
[4] Jonathan Harel, C. Koch, and P. Perona. 2006. A saliency implementation in MATLAB. URL: http://www.klab.caltech.edu/~harel/share/gbvs.php (2006).
[5] Kresimir Matkovic, László Neumann, Attila Neumann, Thomas Psik, and Werner Purgathofer. 2005. Global Contrast Factor - a New Approach to Image Contrast. Computational Aesthetics 2005 (2005), 159-168.
[6] Karen Panetta, Chen Gao, and Sos Agaian. 2013. No reference color image contrast and quality measures. IEEE Transactions on Consumer Electronics 59, 3 (2013), 643-651.
[7] Honghai Yu and Stefan Winkler. 2013. Image complexity and spatial information. In Quality of Multimedia Experience (QoMEX), 2013 Fifth International Workshop on. IEEE, 12-17.