Predicting Video Saliency Using Crowdsourced Mouse-Tracking Data

V.A. Lyudvichenko, D.S. Vatolin
vlyudvichenko@graphics.cs.msu.ru | dmitriy@graphics.cs.msu.ru
Lomonosov Moscow State University, Moscow, Russia

This paper presents a new way of obtaining high-quality saliency maps for video using a cheaper alternative to eye-tracking data. We designed a mouse-contingent video viewing system that simulates a viewer's peripheral vision based on the position of the mouse cursor. The system enables the use of mouse-tracking data recorded with an ordinary computer mouse as an alternative to real gaze fixations recorded by a more expensive eye tracker. We also developed a crowdsourcing system that enables collection of such mouse-tracking data at large scale. Analysis of the collected mouse-tracking data showed that it can serve as an approximation of eye-tracking data. Moreover, to increase the efficiency of the collected mouse-tracking data, we propose a novel deep neural network algorithm that improves the quality of mouse-tracking saliency maps.

Keywords: saliency, deep learning, visual attention, crowdsourcing, eye tracking, mouse tracking.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

When watching videos, humans distribute their attention unevenly: some objects in a video attract more attention than others. This distribution can be represented by per-frame saliency maps that define the importance of each frame region for viewers. The use of saliency can improve the quality of many video processing applications, such as compression [4], retargeting [2], and others.

Therefore, many research efforts have been made to develop algorithms that predict the saliency of images and videos [2]. However, the quality of even the most advanced deep learning algorithms is insufficient for some video applications [1][11]. For example, deep video saliency algorithms only slightly outperform the eye-tracking data of a single observer [11], whereas at least 16 observers are required to obtain ground-truth saliency [12].

Another option for obtaining high-quality saliency maps is to generate them from eye fixations of real humans using eye tracking. Arbitrarily high quality can be achieved by adding eye-tracking data from more observers. However, collecting the data is costly and laborious because eye trackers are expensive devices that are usually available only in specialized laboratories. Therefore, the scale and speed of the data collection process is limited.

Eye-tracking data is not the only way to estimate human visual attention. Recent works [5][9] offered alternative methodologies that use mouse clicks or mouse-movement data to approximate eye fixations on static images. To collect such data, a participant is shown an image on a screen. Initially the image is blurred, but the participant can click on any area of the image to see the original, sharp image in a small circular region around the mouse cursor. This motivates observers to click on areas of the image that are interesting to them, so the coordinates of mouse clicks can approximate real eye fixations.

Of course, such cursor-tracking data from a single observer approximates visual attention less effectively than eye-tracking data. But in general, quality comparable with eye tracking can be achieved by adding data recorded from more observers. The main advantage of such cursor-based approaches is that they significantly simplify the process of getting high-quality saliency maps: to collect the data, only a consumer computer with a mouse is needed. Thanks to crowdsourcing web platforms like Amazon Mechanical Turk, the data can be collected remotely and at large scale, which drastically speeds up the collection process and increases the diversity of participants.

In this work, we propose a cursor-based method for approximating saliency in videos and a crowdsourcing system for collecting such data. To the best of our knowledge, this is the first attempt to construct saliency maps for video using mouse-tracking data. We show participants a video played in real time in the web browser in a special video player that simulates the peripheral vision of the human visual system. The player unevenly blurs the video according to the current mouse cursor position: the closer a pixel is to the cursor, the less blur is applied (Fig. 1). While watching the video, a participant can freely move the cursor to see interesting objects without blurring. Using this system, we collected mouse-tracking data from participants hired on a crowdsourcing platform. Our analysis of the collected data showed that it can approximate eye-tracking saliency; in particular, saliency maps generated from the mouse-tracking data of two observers have the same quality as maps generated from the eye-tracking data of a single observer.

However, cursor-based approaches, like eye tracking, become less efficient in terms of added quality per observer as the number of observers grows. The contribution of each additional observer to the overall quality decreases rapidly because the dependence between the number of observers and the quality is logarithmic in nature [7]. Thereby, each additional observer becomes more and more expensive in terms of cost per added quality.

To tackle this problem, the semiautomatic paradigm for predicting saliency was proposed in [4]. Unlike conventional saliency models, semiautomatic approaches take eye-tracking saliency maps as an additional input and postprocess them, which yields better saliency maps from less data.

We generalized the semiautomatic paradigm to mouse-tracking data and propose a new deep neural network algorithm working within this paradigm. The algorithm is based on the SAM-ResNet [3] architecture, to which we made two modifications. Since SAM-ResNet was designed to predict saliency in images, we first added an LSTM layer and adapted SAM's attention module to exploit temporal cues of videos. Then we added a new external prior that integrates mouse-tracking saliency maps into the network. We show that both modifications, applied separately and jointly, improve the quality. In particular, we demonstrate that the algorithm can take mouse-tracking saliency maps whose quality is comparable with eye tracking from three observers and improve them to the quality of eight observers.
2. Related work

The paper contributes to two topics: cursor-based alternatives to eye tracking and semiautomatic saliency modeling. Hereafter we provide a brief overview of both.

Cursor-based alternatives to eye tracking. There have been many efforts to use mouse tracking as a cheap alternative to eye tracking. However, most of them focused on webpage analysis [15]. Therefore, we provide an overview of the most notable universal approaches that work with natural images.

Huang et al. [5] designed a mouse-contingent paradigm that allows the use of a mouse instead of an eye tracker to record how humans view static images. They show participants an image for five seconds. The shown image is adaptively blurred to simulate peripheral vision, as though the participant's gaze were focused on the mouse cursor. Participants can freely move the mouse cursor; cursor coordinates are recorded, clustered, and filtered to remove outliers. The authors showed that such cursor-based fixations have high similarity with eye-tracking fixations. Using the AMT crowdsourcing platform, they estimated the saliency of 10,000 images, which were published as the SALICON dataset.

BubbleView [9] has a similar methodology, but it does not use adaptive blurring and reveals an unblurred area of the image only when a participant clicks on it.

Sidorov et al. [13] addressed the problem of temporal saliency of video, i.e., how important a whole frame is for viewers. To estimate the temporal importance, they show participants a blurred video and allow them to turn off the blurring under the cursor while the mouse button is held down. Participants have a limited amount of time during which they can see unblurred frames, so they press the button only on interesting frames.

To the best of our knowledge, our method is the first attempt to estimate the spatial saliency of video using mouse-tracking data.

Semiautomatic saliency modeling. Lyudvichenko et al. [10] proposed a semiautomatic visual-attention algorithm for video. The algorithm takes eye-tracking saliency maps as an additional input and applies postprocessing transformations to them, yielding saliency maps of better quality. The postprocessing is done in three steps: first, fixations from neighboring frames are propagated to the current frame according to motion vectors; then brightness correction is applied; finally, a center-prior image is added to the saliency maps, maximizing the similarity between the result and the ground truth.
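To make the three steps more concrete, the following rough sketch illustrates the idea; it is not the implementation from [10]. The function names are ours, dense optical flow stands in for codec motion vectors, brightness correction is reduced to a plain renormalization, and the center-prior weight is a constant rather than a value fitted against ground truth.

```python
# Simplified, illustrative sketch of the three-step postprocessing described in [10].
import numpy as np

def center_prior(h, w, sigma_rel=0.3):
    # A fixed Gaussian centered in the frame, used as the center-prior image.
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - (h - 1) / 2) ** 2 + (xs - (w - 1) / 2) ** 2)
                  / (2 * (sigma_rel * w) ** 2))

def propagate_fixations(fixations, flow):
    """Step 1: move fixation points (x, y) to the current frame along motion vectors."""
    h, w = flow.shape[:2]
    moved = []
    for x, y in fixations:
        xi = min(max(int(round(x)), 0), w - 1)
        yi = min(max(int(round(y)), 0), h - 1)
        dx, dy = flow[yi, xi]  # flow: H x W x 2 array of per-pixel (dx, dy)
        moved.append((x + dx, y + dy))
    return moved

def postprocess(saliency, center_weight=0.2):
    """Steps 2-3: brightness correction, then blending with a center-prior image."""
    h, w = saliency.shape
    corrected = saliency / (saliency.max() + 1e-8)
    return (1 - center_weight) * corrected + center_weight * center_prior(h, w)
```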
Fig. 1. An example of a tutorial page and the mouse-contingent video player used in our system. The video around the cursor is sharp.

3. Cursor-based saliency for video

We propose a methodology for high-quality visual-attention estimation based on mouse-tracking data, along with a system that collects such data using crowdsourcing platforms. We show a participant the video in a special video player, in real time and in full-screen mode. The player simulates the peripheral vision of the human visual system by blurring the video as though the participant's gaze were focused on the mouse cursor. The human retina consists of receptor cells that are unevenly distributed, with a peak density at the center of the field of view. The central, foveal area is seen most clearly, whereas peripheral areas appear blurrier. We simulate this property by adaptively blurring the video according to the position of the mouse cursor. A participant can freely move the cursor, simulating shifts of the gaze.

To enable real-time rendering of the adaptively blurred frames, we use a simple Gaussian pyramid with two layers $\mathbf{L}^0$ and $\mathbf{L}^1$, where $\mathbf{L}^0$ is the original frame and $\mathbf{L}^1$ is the frame blurred with $\sigma_1$. The displayed image is constructed as $\mathbf{I}_p = \mathbf{W}_p \mathbf{L}^0_p + (1 - \mathbf{W}_p)\,\mathbf{L}^1_p$, where $p$ denotes pixel coordinates and $\mathbf{W}_p$ is a blending coefficient dependent on the retina density at $p$: $\mathbf{W}_p = \exp\!\left(-\|p - g\|^2 / 2\sigma_w^2\right)$, where $g$ is the position of the mouse cursor and $\sigma_w$ is a parameter. Both parameters $\sigma_1$ and $\sigma_w$ represent the size of the foveal area and depend on the screen size and the distance between the participant and the screen. Since we record the data in uncontrolled conditions and cannot compute these parameters exactly, we chose $\sigma_1 = 0.02w$ and $\sigma_w = 0.2w$, where $w$ is the video width.
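As an illustration, the sketch below composes a displayed frame from the two pyramid layers with NumPy. It is our own transcription of the formulas above, not the actual front end (which runs on the HTML5 Canvas API); using $\sigma_1$ directly as the Gaussian blur radius of $\mathbf{L}^1$ is an assumption.

```python
# Illustrative sketch of the mouse-contingent blending: I_p = W_p * L0_p + (1 - W_p) * L1_p.
import numpy as np
from scipy.ndimage import gaussian_filter

def render_frame(frame, cursor_xy, sigma1_rel=0.02, sigma_w_rel=0.2):
    """frame: H x W x 3 float array in [0, 1]; cursor_xy: (x, y) mouse position in pixels."""
    h, w = frame.shape[:2]
    sigma1, sigma_w = sigma1_rel * w, sigma_w_rel * w
    # Two-layer "pyramid": the original frame L0 and a blurred copy L1.
    l0 = frame
    l1 = np.stack([gaussian_filter(frame[..., c], sigma1) for c in range(3)], axis=-1)
    # Blending weight W_p = exp(-||p - g||^2 / (2 * sigma_w^2)).
    ys, xs = np.mgrid[0:h, 0:w]
    gx, gy = cursor_xy
    weight = np.exp(-((xs - gx) ** 2 + (ys - gy) ** 2) / (2 * sigma_w ** 2))[..., None]
    return weight * l0 + (1 - weight) * l1
```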
The system consists of front-end and back-end parts. The back end allocates videos among participants, stores the recorded data, and communicates with a crowdsourcing platform. Before a participant watches the videos, the system shows three tutorial pages explaining how the video player works; Fig. 1 shows the first page. The front end implements the video player using the HTML5 Canvas API. It also checks that the participant's screen is at least 1024 pixels wide and that the browser can render the video at a minimum of 20 FPS. We excluded data from participants who did not pass these checks.

4. Semiautomatic deep neural network

To improve the saliency maps generated by using the cursor positions as eye fixations, we developed a new neural network algorithm. The algorithm is based on the SAM [3] architecture, which was originally designed to predict the saliency of static images. Though SAM is a static model, its retrained ResNet version can outperform the latest temporal-aware models such as ACL [14] and OM-CNN [6][11]. Also, the SAM architecture can be adapted to video relatively easily because its attentive module already uses an LSTM layer to iteratively update the attention.

We make two modifications to the original SAM-ResNet architecture: we adapt it for more effective video processing, and we add an external prior to integrate mouse-tracking saliency maps. The modified architecture is shown in Fig. 2.

Fig. 2. Overview of the proposed temporal semiautomatic model based on SAM-ResNet [3]. We introduce the external prior maps and concatenate them with the features of the input layer and three intermediate layers. To make the network temporal-aware, we introduce new spatiotemporal features and adapt the attentive ConvLSTM module so that it can pass its states to the following frames. The modifications we made are marked in red in the diagram.

Saliency models can benefit significantly from temporal video cues. Therefore, in addition to the 256 spatial features obtained from the 2048 final features of the ResNet subnetwork by a 1×1 convolution, we extract 256 temporal features. The temporal features are produced by an additional convolutional LSTM layer with 3×3 kernels that is fed with the final ResNet features. The spatial and temporal features are concatenated and passed to the Attentive ConvLSTM module. Also, we make the Attentive ConvLSTM module truly temporal-aware by passing its states from the last iteration on the previous frame to the first iteration on the following frame. This allowed reducing the number of per-frame iterations from 4 to 3 without quality loss.
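The following PyTorch sketch illustrates the spatial/temporal feature split described above. Only the channel sizes (2048 ResNet features mapped to 256 spatial plus 256 temporal features) come from the text; the ConvLSTM cell is a generic implementation, the class and variable names are ours, and the full Attentive ConvLSTM of SAM is not reproduced.

```python
# Minimal sketch of the spatial + temporal feature extraction in front of the attentive module.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Generic convolutional LSTM cell (assumed implementation, 3x3 kernels by default)."""
    def __init__(self, in_ch, hidden_ch, kernel=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch, kernel, padding=kernel // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c

class SpatioTemporalFeatures(nn.Module):
    def __init__(self):
        super().__init__()
        self.spatial = nn.Conv2d(2048, 256, kernel_size=1)   # 256 spatial features (1x1 conv)
        self.temporal = ConvLSTMCell(2048, 256, kernel=3)     # 256 temporal features

    def forward(self, resnet_feats, state=None):
        n, _, h, w = resnet_feats.shape
        if state is None:  # the state is carried across frames of the same video
            state = (resnet_feats.new_zeros(n, 256, h, w),
                     resnet_feats.new_zeros(n, 256, h, w))
        h_t, c_t = self.temporal(resnet_feats, state)
        feats = torch.cat([self.spatial(resnet_feats), h_t], dim=1)  # 512 channels total
        return feats, (h_t, c_t)
```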
Then we integrate the external prior maps in three places in the network. First, we add this prior to the existing Gaussian priors at the network head.

To learn more complex dependencies between the prior and the spatiotemporal features, we concatenate the downsampled prior with the output of the ResNet subnetwork. We also concatenate it with the three RGB channels of the source frames. Since we use a pretrained ResNet that expects three-channel input, we update the weights of the first convolutional layer by adding a fourth input channel initialized with zero weights.
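A short PyTorch sketch of the zero-initialized fourth input channel is given below. It uses torchvision's stock ResNet-50 purely for illustration, whereas the paper's backbone is the dilated ResNet inside SAM-ResNet; variable names and the final forward call are our own.

```python
# Sketch: extend a pretrained ResNet's first convolution from 3 to 4 input channels,
# zero-initializing the extra (prior) channel so the pretrained behavior is preserved.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT)
model.eval()
old_conv = model.conv1  # Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)

new_conv = nn.Conv2d(4, old_conv.out_channels, kernel_size=old_conv.kernel_size,
                     stride=old_conv.stride, padding=old_conv.padding, bias=False)
with torch.no_grad():
    new_conv.weight[:, :3] = old_conv.weight  # copy pretrained RGB weights
    new_conv.weight[:, 3:].zero_()            # zero weights for the new prior channel
model.conv1 = new_conv

# Input: RGB frame concatenated with the downsampled mouse-tracking prior map.
frame = torch.rand(1, 3, 224, 224)
prior = torch.rand(1, 1, 224, 224)
out = model(torch.cat([frame, prior], dim=1))
```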
5. Experiments

We used our cursor-based saliency system to collect mouse-movement data for 12 random videos from the Hollywood-2 video saliency dataset [12], each 20–30 seconds long. We hired participants on the Subjectify.us crowdsourcing platform, showed each of them 10 videos, and paid them $0.15 if they watched all the videos. In total, we collected data from 30 participants, resulting in 22–30 views per video.

Using the collected data, we estimated how well mouse- and eye-tracking fixations from different numbers of observers approximate ground-truth saliency maps (generated from eye-tracking fixations). Fig. 3 shows the results and illustrates that mouse tracking of two observers has the same quality as eye tracking of a single observer, so the data collected with the proposed system can approximate eye tracking. Note that when we estimated the eye-tracking performance of $N$ observers, we compared them with the remaining $M - N$ of the total $M$ observers; therefore, the eye-tracking curve stops increasing at $N = 8$ because the Hollywood-2 dataset has data from only 16 observers. In all our experiments we convert fixation points to saliency maps using the formula $\mathbf{SM}_p = \sum_{i=1}^{N} \mathcal{N}(p, f_i, \sigma)$, where $\mathbf{SM}_p$ is the resulting saliency map value at pixel $p$, $f_i$ is the position of the $i$-th of $N$ fixation points, and $\mathcal{N}$ is a Gaussian with $\sigma = 0.0625w$, where $w$ is the video width.
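For reference, the formula above can be transcribed directly into NumPy as follows; the function and parameter names are ours, and the final normalization to [0, 1] is an added convenience for comparing maps, not part of the formula.

```python
# Convert fixation points to a saliency map: SM_p = sum_i N(p, f_i, sigma), sigma = 0.0625 * w.
import numpy as np

def fixations_to_saliency(fixations, height, width, sigma_rel=0.0625):
    """fixations: iterable of (x, y) points in pixels; width is the video width w."""
    sigma = sigma_rel * width
    ys, xs = np.mgrid[0:height, 0:width]
    saliency = np.zeros((height, width), dtype=np.float64)
    for fx, fy in fixations:
        saliency += np.exp(-((xs - fx) ** 2 + (ys - fy) ** 2) / (2 * sigma ** 2))
    return saliency / (saliency.max() + 1e-12)  # normalize to [0, 1] (our addition)
```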
We also tested how the previous semiautomatic algorithm [11] works with mouse-tracking data from different numbers of observers. Fig. 3 illustrates that the algorithm visibly improves mouse-tracking saliency maps, making them comparable with eye tracking. In particular, it improves the mouse-tracking saliency maps of a single observer, making them better than the eye tracking of a single observer.

Then we tested four configurations of the proposed neural network architecture: two versions of the static variant and two versions of the temporal variant. The static variant processes frames independently, whereas the temporal one uses temporal cues. Each variant has a semiautomatic version that uses the external prior maps and an automatic version that does not use any external priors. All architectures were trained on the DHF1K [14] and SAVAM [4] datasets; the training set consisted of 297 videos with 86,440 frames, and the validation set contained 65 videos. The NSS term was excluded from the original SAM loss function, since optimizing the NSS metric worsens all other saliency metrics. All other optimization parameters are the same as those used in the original SAM-ResNet.

Fig. 3. Objective evaluation of four configurations of our neural network: two semiautomatic versions using the prior maps generated from mouse-tracking data of 10 observers and two automatic versions without the prior maps. The networks are compared with the mean result of N mouse- and eye-tracking observers as well as the SAVAM algorithm [10] using N mouse-tracking observers (MTO). Note that the number of observers is limited to half of the eye-tracking observers present in the Hollywood-2 dataset [12].

The static architecture variants were trained on every 25th frame of the videos. When training the temporal versions, we composed minibatches from 3 consecutive frames of 5 different videos to use as large a batch size as possible. We also disabled training of the batch normalization layers to avoid problems related to the small batch size.

Since the collected mouse-tracking data was not enough for training the semiautomatic architectures, we employed transfer learning and used eye-tracking saliency maps as the network's external prior. The prior maps were eye-tracking saliency maps of 3 observers, which have the same quality as mouse-tracking maps of 10 observers (according to Fig. 3).

Fig. 3 shows the performance of all four trained networks, where the external prior maps for the semiautomatic networks were generated from mouse-tracking data of 10 observers. The figure demonstrates that the temporal configurations significantly outperform the static ones: the added temporal cues improved the Similarity Score [8] of the original static SAM [3] version from 0.659 to 0.678, and of the semiautomatic version from 0.687 to 0.728.

The semiautomatic versions improve their prior maps and achieve better quality than the automatic versions. They also significantly outperform the semiautomatic algorithm proposed in [10]. It is worth noting that the best temporal semiautomatic configuration, which uses prior maps generated from mouse-tracking data of 10 observers, outperforms eye tracking of 8 observers. Since the prior maps have the same quality as 3 eye-tracking observers, the proposed semiautomatic algorithm effectively improves the saliency maps as though 5 more eye-tracking observers were added.

6. Conclusion

In this paper, we proposed a cheap way of getting high-quality saliency maps for video through the use of additional data. We developed a novel system that shows viewers videos in a mouse-contingent video player and collects mouse-tracking data approximating real eye fixations. We showed that mouse-tracking data can be used as an alternative to more expensive eye-tracking data. We also proposed a new deep semiautomatic algorithm that significantly improves mouse-tracking saliency maps and outperforms traditional automatic algorithms.

7. Acknowledgments

This work was partially supported by the Russian Foundation for Basic Research under Grant 19-01-00785 a.

8. References

[1] Borji, A. Saliency prediction in the deep learning era: An empirical investigation. CoRR abs/1810.03716 (2018).
[2] Borji, A., and Itti, L. State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1 (2013), 185–207.
[3] Cornia, M., Baraldi, L., Serra, G., and Cucchiara, R. Predicting human eye fixations via an LSTM-based saliency attentive model. IEEE Transactions on Image Processing 27, 10 (2018), 5142–5154.
[4] Gitman, Y., Erofeev, M., Vatolin, D., Andrey, B., and Alexey, F. Semiautomatic visual-attention modeling and its application to video compression. In International Conference on Image Processing (ICIP) (2014), pp. 1105–1109.
[5] Huang, X., Shen, C., Boix, X., and Zhao, Q. SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In International Conference on Computer Vision (2015), pp. 262–270.
[6] Jiang, L., Xu, M., and Wang, Z. Predicting video saliency with object-to-motion CNN and two-layer convolutional LSTM. CoRR abs/1709.06316 (2017).
[7] Judd, T., Durand, F., and Torralba, A. A benchmark of computational models of saliency to predict human fixations. Tech. rep., Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, 2012.
[8] Judd, T., Ehinger, K., Durand, F., and Torralba, A. Learning to predict where humans look. In International Conference on Computer Vision (ICCV) (2009), pp. 2106–2113.
[9] Kim, N. W., Bylinskii, Z., Borkin, M. A., Gajos, K. Z., Oliva, A., Durand, F., and Pfister, H. BubbleView: An interface for crowdsourcing image importance maps and tracking visual attention. ACM Trans. Comput.-Hum. Interact. 24, 5 (2017), 1–40.
[10] Lyudvichenko, V., Erofeev, M., Gitman, Y., and Vatolin, D. A semiautomatic saliency model and its application to video compression. In 13th IEEE International Conference on Intelligent Computer Communication and Processing (2017), pp. 403–410.
[11] Lyudvichenko, V., Erofeev, M., Ploshkin, A., and Vatolin, D. Improving video compression with deep visual-attention models. In International Conference on Intelligent Medicine and Image Processing (2019).
[12] Mathe, S., and Sminchisescu, C. Actions in the eye: Dynamic gaze datasets and learnt saliency models for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2015), 1408–1424.
[13] Sidorov, O., Pedersen, M., Kim, N. W., and Shekhar, S. Are all the frames equally important? CoRR abs/1905.07984 (2019).
[14] Wang, W., Shen, J., Guo, F., Cheng, M.-M., and Borji, A. Revisiting video saliency: A large-scale benchmark and a new model. IEEE Conference on Computer Vision and Pattern Recognition (2018).
[15] Xu, P., Sugano, Y., and Bulling, A. Spatiotemporal modeling and prediction of visual attention in graphical user interfaces. In CHI Conference on Human Factors in Computing Systems (2016), pp. 3299–3310.