Predicting Video Saliency Using Crowdsourced Mouse-Tracking Data

V.A. Lyudvichenko, D.S. Vatolin
vlyudvichenko@graphics.cs.msu.ru | dmitriy@graphics.cs.msu.ru
Lomonosov Moscow State University, Moscow, Russia

This paper presents a new way of obtaining high-quality saliency maps for video using a cheaper alternative to eye-tracking data. We designed a mouse-contingent video viewing system that simulates a viewer's peripheral vision based on the position of the mouse cursor. The system enables the use of mouse-tracking data recorded with an ordinary computer mouse as an alternative to real gaze fixations recorded by a more expensive eye tracker. We also developed a crowdsourcing system that enables collection of such mouse-tracking data at large scale. Analysis of the collected mouse-tracking data showed that it can serve as an approximation of eye-tracking data. Moreover, to increase the efficiency of the collected mouse-tracking data, we propose a novel deep neural network algorithm that improves the quality of mouse-tracking saliency maps.

Keywords: saliency, deep learning, visual attention, crowdsourcing, eye tracking, mouse tracking.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

When watching videos, humans distribute their attention unevenly: some objects in a video attract more attention than others. This distribution can be represented by per-frame saliency maps that define the importance of each frame region for viewers. The use of saliency can improve the quality of many video processing applications, such as compression [4], retargeting [2], and others.

Therefore, many research efforts have been made to develop algorithms that predict the saliency of images and videos [2]. However, the quality of even the most advanced deep learning algorithms is insufficient for some video applications [1][11]. For example, deep video saliency algorithms only slightly outperform the eye-tracking data of a single observer [11], whereas at least 16 observers are required to obtain ground-truth saliency [12].

Another option for obtaining high-quality saliency maps is to generate them from eye fixations of real humans using eye tracking. Arbitrarily high quality can be achieved by adding eye-tracking data from more observers. However, collecting the data is costly and laborious because eye trackers are expensive devices that are usually available only in specialized laboratories. Therefore, the scale and speed of the data collection process is limited.

Eye-tracking data is not the only way to estimate human visual attention. Recent works [5][9] offered alternative methodologies that use mouse clicks or mouse-movement data to approximate eye fixations on static images. To collect such data, a participant is shown an image on a screen. Initially the image is blurred, but the participant can click on any area of the image to see the original, sharp image in a small circular region around the mouse cursor. This motivates observers to click on areas of the image that are interesting to them, so the coordinates of mouse clicks can approximate real eye fixations.

Of course, such cursor-tracking data from a single observer approximates visual attention less effectively than eye-tracking data. But in general, quality comparable with eye tracking can be achieved by adding data recorded from more observers. The main advantage of such cursor-based approaches is that they significantly simplify the process of getting high-quality saliency maps: to collect the data, only a consumer computer with a mouse is needed. Thanks to crowdsourcing web platforms like Amazon Mechanical Turk, the data can be collected remotely and at large scale, which drastically speeds up the collection process and increases the diversity of participants.

In this work, we propose a cursor-based method for approximating saliency in videos and a crowdsourcing system for collecting such data. To the best of our knowledge, this is the first attempt to construct saliency maps for video using mouse-tracking data. We show participants a video played in real time in the web browser in a special video player that simulates the peripheral vision of the human visual system. The player unevenly blurs the video according to the current mouse cursor position: the closer a pixel is to the cursor, the less blur is applied (Fig. 1). While watching the video, a participant can freely move the cursor to see interesting objects without blurring. Using this system, we collected mouse-tracking data from participants hired on a crowdsourcing platform. Our analysis of the collected data showed that it can approximate eye-tracking saliency; in particular, saliency maps generated from the mouse-tracking data of two observers have the same quality as maps generated from the eye-tracking data of a single observer.

However, cursor-based approaches, like eye tracking, become less efficient in terms of added quality per observer as the number of observers grows. The contribution of each additional observer to the overall quality decreases rapidly because the dependence between the number of observers and the quality is logarithmic in nature [7]. Thereby, each additional observer becomes more and more expensive in terms of cost per added quality.

To tackle this problem, the semiautomatic paradigm for predicting saliency was proposed in [4]. Unlike conventional saliency models, semiautomatic approaches take eye-tracking saliency maps as an additional input and postprocess them, which yields better saliency maps from less data.

We generalized the semiautomatic paradigm to mouse-tracking data and propose a new deep neural network algorithm working within this paradigm. The algorithm is based on the SAM-ResNet [3] architecture, to which we made two modifications. Since SAM-ResNet was designed to predict saliency in images, we first added an LSTM layer and adapted SAM's attention module to exploit temporal cues of videos. Then we added a new external prior that integrates mouse-tracking saliency maps into the network. We show that both modifications, applied separately and jointly, improve the quality. In particular, we demonstrate that the algorithm can take mouse-tracking saliency maps whose quality is comparable with eye tracking from three observers and improve them to the quality of eight observers.
2. Related work

The paper contributes to two topics: cursor-based alternatives to eye tracking and semiautomatic saliency modeling. Hereafter we provide a brief overview of both.

Cursor-based alternatives to eye tracking. There have been many efforts to use mouse tracking as a cheap alternative to eye tracking. However, most of them focused on webpage analysis [15]. Therefore, we provide an overview of the most notable universal approaches that work with natural images.

Huang et al. [5] designed a mouse-contingent paradigm that allows the use of a mouse instead of an eye tracker to record how humans view static images. They show participants an image for five seconds. The shown image is adaptively blurred to simulate peripheral vision, as though the participant's gaze were focused on the mouse cursor. Participants can freely move the mouse cursor; cursor coordinates are recorded, clustered, and filtered to remove outliers. The authors showed that such cursor-based fixations have high similarity with eye-tracking fixations. Using the AMT crowdsourcing platform, they estimated the saliency of 10,000 images, which were published as the SALICON dataset.

BubbleView [9] has a similar methodology, but it does not use adaptive blurring and reveals an unblurred area of the image only when a participant clicks on it.

Sidorov et al. [13] addressed the problem of temporal saliency of video, i.e., how important a whole frame is for viewers. To estimate the temporal importance, they show participants a blurred video and allow them to turn off the blurring under the cursor while the mouse button is held down. Participants have a limited amount of time during which they can see unblurred frames, so they press the button only on interesting frames.

To the best of our knowledge, our method is the first attempt to estimate the spatial saliency of video using mouse-tracking data.

Semiautomatic saliency modeling. Lyudvichenko et al. [10] proposed a semiautomatic visual-attention algorithm for video. The algorithm takes eye-tracking saliency maps as an additional input and applies postprocessing transformations to them, yielding saliency maps of better quality. The postprocessing is done in three steps: first, fixations from neighboring frames are propagated to the current frame according to motion vectors; then brightness correction is applied; finally, a center-prior image is added to the saliency maps, maximizing the similarity between the result and the ground truth.
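To make the three steps more concrete, the following rough sketch illustrates the idea; it is not the implementation from [10]. The function names are ours, dense optical flow stands in for codec motion vectors, brightness correction is reduced to a plain renormalization, and the center-prior weight is a constant rather than a value fitted against ground truth.

```python
# Simplified, illustrative sketch of the three-step postprocessing described in [10].
import numpy as np

def center_prior(h, w, sigma_rel=0.3):
    # A fixed Gaussian centered in the frame, used as the center-prior image.
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - (h - 1) / 2) ** 2 + (xs - (w - 1) / 2) ** 2)
                  / (2 * (sigma_rel * w) ** 2))

def propagate_fixations(fixations, flow):
    """Step 1: move fixation points (x, y) to the current frame along motion vectors."""
    h, w = flow.shape[:2]
    moved = []
    for x, y in fixations:
        xi = min(max(int(round(x)), 0), w - 1)
        yi = min(max(int(round(y)), 0), h - 1)
        dx, dy = flow[yi, xi]  # flow: H x W x 2 array of per-pixel (dx, dy)
        moved.append((x + dx, y + dy))
    return moved

def postprocess(saliency, center_weight=0.2):
    """Steps 2-3: brightness correction, then blending with a center-prior image."""
    h, w = saliency.shape
    corrected = saliency / (saliency.max() + 1e-8)
    return (1 - center_weight) * corrected + center_weight * center_prior(h, w)
```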
Fig. 1. An example of a tutorial page and the mouse-contingent video player used in our system. The video around the cursor is sharp.

3. Cursor-based saliency for video

We propose a methodology for high-quality visual-attention estimation based on mouse-tracking data, along with a system that collects such data using crowdsourcing platforms. We show a participant the video in a special video player, in real time and in full-screen mode. The player simulates the peripheral vision of the human visual system by blurring the video as though the participant's gaze were focused on the mouse cursor. The human retina consists of receptor cells that are unevenly distributed, with a peak density at the center of the field of view. The central, foveal area is seen most clearly, whereas peripheral areas appear blurrier. We simulate this property by adaptively blurring the video according to the position of the mouse cursor. A participant can freely move the cursor, simulating shifts of the gaze.

To enable real-time rendering of the adaptively blurred frames, we use a simple Gaussian pyramid with two layers $\mathbf{L}^0$ and $\mathbf{L}^1$, where $\mathbf{L}^0$ is the original frame and $\mathbf{L}^1$ is the frame blurred with $\sigma_1$. The displayed image is constructed as $\mathbf{I}_p = \mathbf{W}_p \mathbf{L}^0_p + (1 - \mathbf{W}_p)\,\mathbf{L}^1_p$, where $p$ denotes pixel coordinates and $\mathbf{W}_p$ is a blending coefficient dependent on the retina density at $p$: $\mathbf{W}_p = \exp\!\left(-\|p - g\|^2 / 2\sigma_w^2\right)$, where $g$ is the position of the mouse cursor and $\sigma_w$ is a parameter. Both parameters $\sigma_1$ and $\sigma_w$ represent the size of the foveal area and depend on the screen size and the distance between the participant and the screen. Since we record the data in uncontrolled conditions and cannot compute these parameters exactly, we chose $\sigma_1 = 0.02w$ and $\sigma_w = 0.2w$, where $w$ is the video width.
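As an illustration, the sketch below composes a displayed frame from the two pyramid layers with NumPy. It is our own transcription of the formulas above, not the actual front end (which runs on the HTML5 Canvas API); using $\sigma_1$ directly as the Gaussian blur radius of $\mathbf{L}^1$ is an assumption.

```python
# Illustrative sketch of the mouse-contingent blending: I_p = W_p * L0_p + (1 - W_p) * L1_p.
import numpy as np
from scipy.ndimage import gaussian_filter

def render_frame(frame, cursor_xy, sigma1_rel=0.02, sigma_w_rel=0.2):
    """frame: H x W x 3 float array in [0, 1]; cursor_xy: (x, y) mouse position in pixels."""
    h, w = frame.shape[:2]
    sigma1, sigma_w = sigma1_rel * w, sigma_w_rel * w
    # Two-layer "pyramid": the original frame L0 and a blurred copy L1.
    l0 = frame
    l1 = np.stack([gaussian_filter(frame[..., c], sigma1) for c in range(3)], axis=-1)
    # Blending weight W_p = exp(-||p - g||^2 / (2 * sigma_w^2)).
    ys, xs = np.mgrid[0:h, 0:w]
    gx, gy = cursor_xy
    weight = np.exp(-((xs - gx) ** 2 + (ys - gy) ** 2) / (2 * sigma_w ** 2))[..., None]
    return weight * l0 + (1 - weight) * l1
```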
The system consists of front-end and back-end parts. The back end allocates videos among participants, stores the recorded data, and communicates with a crowdsourcing platform. Before a participant watches the videos, the system shows three tutorial pages explaining how the video player works; Fig. 1 shows the first page. The front end implements the video player using the HTML5 Canvas API. It also checks that the participant's screen is at least 1024 pixels wide and that the browser can render the video at a minimum of 20 FPS. We excluded data from participants who did not pass these checks.

4. Semiautomatic deep neural network

To improve the saliency maps generated by using the cursor positions as eye fixations, we developed a new neural network algorithm. The algorithm is based on the SAM [3] architecture, which was originally designed to predict the saliency of static images. Though SAM is a static model, its retrained ResNet version can outperform the latest temporal-aware models such as ACL [14] and OM-CNN [6][11]. Also, the SAM architecture can be adapted to video relatively easily because its attentive module already uses an LSTM layer to iteratively update the attention.

We make two modifications to the original SAM-ResNet architecture: we adapt it for more effective video processing, and we add an external prior to integrate mouse-tracking saliency maps. The modified architecture is shown in Fig. 2.

Fig. 2. Overview of the proposed temporal semiautomatic model based on SAM-ResNet [3]. We introduce the external prior maps and concatenate them with the features of the input layer and three intermediate layers. To make the network temporal-aware, we introduce new spatiotemporal features and adapt the attentive ConvLSTM module so that it can pass its states to the following frames. The modifications we made are marked in red in the diagram.

Saliency models can benefit significantly from temporal video cues. Therefore, in addition to the 256 spatial features obtained from the 2048 final features of the ResNet subnetwork by a 1×1 convolution, we extract 256 temporal features. The temporal features are produced by an additional convolutional LSTM layer with 3×3 kernels that is fed with the final ResNet features. The spatial and temporal features are concatenated and passed to the Attentive ConvLSTM module. Also, we make the Attentive ConvLSTM module truly temporal-aware by passing its states from the last iteration on the previous frame to the first iteration on the following frame. This allowed reducing the number of per-frame iterations from 4 to 3 without quality loss.
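The following PyTorch sketch illustrates the spatial/temporal feature split described above. Only the channel sizes (2048 ResNet features mapped to 256 spatial plus 256 temporal features) come from the text; the ConvLSTM cell is a generic implementation, the class and variable names are ours, and the full Attentive ConvLSTM of SAM is not reproduced.

```python
# Minimal sketch of the spatial + temporal feature extraction in front of the attentive module.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Generic convolutional LSTM cell (assumed implementation, 3x3 kernels by default)."""
    def __init__(self, in_ch, hidden_ch, kernel=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch, kernel, padding=kernel // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c

class SpatioTemporalFeatures(nn.Module):
    def __init__(self):
        super().__init__()
        self.spatial = nn.Conv2d(2048, 256, kernel_size=1)   # 256 spatial features (1x1 conv)
        self.temporal = ConvLSTMCell(2048, 256, kernel=3)     # 256 temporal features

    def forward(self, resnet_feats, state=None):
        n, _, h, w = resnet_feats.shape
        if state is None:  # the state is carried across frames of the same video
            state = (resnet_feats.new_zeros(n, 256, h, w),
                     resnet_feats.new_zeros(n, 256, h, w))
        h_t, c_t = self.temporal(resnet_feats, state)
        feats = torch.cat([self.spatial(resnet_feats), h_t], dim=1)  # 512 channels total
        return feats, (h_t, c_t)
```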
Then we integrate the external prior maps in three places in the network. First, we add this prior to the existing Gaussian priors at the network head.

To learn more complex dependencies between the prior and the spatiotemporal features, we concatenate the downsampled prior with the output of the ResNet subnetwork. We also concatenate it with the three RGB channels of the source frames. Since we use a pretrained ResNet that expects three-channel input, we update the weights of the first convolutional layer by adding a fourth input channel initialized with zero weights.
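A short PyTorch sketch of the zero-initialized fourth input channel is given below. It uses torchvision's stock ResNet-50 purely for illustration, whereas the paper's backbone is the dilated ResNet inside SAM-ResNet; variable names and the final forward call are our own.

```python
# Sketch: extend a pretrained ResNet's first convolution from 3 to 4 input channels,
# zero-initializing the extra (prior) channel so the pretrained behavior is preserved.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT)
model.eval()
old_conv = model.conv1  # Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)

new_conv = nn.Conv2d(4, old_conv.out_channels, kernel_size=old_conv.kernel_size,
                     stride=old_conv.stride, padding=old_conv.padding, bias=False)
with torch.no_grad():
    new_conv.weight[:, :3] = old_conv.weight  # copy pretrained RGB weights
    new_conv.weight[:, 3:].zero_()            # zero weights for the new prior channel
model.conv1 = new_conv

# Input: RGB frame concatenated with the downsampled mouse-tracking prior map.
frame = torch.rand(1, 3, 224, 224)
prior = torch.rand(1, 1, 224, 224)
out = model(torch.cat([frame, prior], dim=1))
```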
5. Experiments

We used our cursor-based saliency system to collect mouse-movement data for 12 random videos from the Hollywood-2 video saliency dataset [12], each 20–30 seconds long. We hired participants on the Subjectify.us crowdsourcing platform, showed each of them 10 videos, and paid them $0.15 if they watched all the videos. In total, we collected data from 30 participants, resulting in 22–30 views per video.

Using the collected data, we estimated how well mouse- and eye-tracking fixations from different numbers of observers approximate ground-truth saliency maps (generated from eye-tracking fixations). Fig. 3 shows the results and illustrates that mouse tracking of two observers has the same quality as eye tracking of a single observer, so the data collected with the proposed system can approximate eye tracking. Note that when we estimated the eye-tracking performance of $N$ observers, we compared them with the remaining $M - N$ of the total $M$ observers; therefore, the eye-tracking curve stops increasing at $N = 8$ because the Hollywood-2 dataset has data from only 16 observers. In all our experiments we convert fixation points to saliency maps using the formula $\mathbf{SM}_p = \sum_{i=1}^{N} \mathcal{N}(p, f_i, \sigma)$, where $\mathbf{SM}_p$ is the resulting saliency map value at pixel $p$, $f_i$ is the position of the $i$-th of $N$ fixation points, and $\mathcal{N}$ is a Gaussian with $\sigma = 0.0625w$, where $w$ is the video width.
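For reference, the formula above can be transcribed directly into NumPy as follows; the function and parameter names are ours, and the final normalization to [0, 1] is an added convenience for comparing maps, not part of the formula.

```python
# Convert fixation points to a saliency map: SM_p = sum_i N(p, f_i, sigma), sigma = 0.0625 * w.
import numpy as np

def fixations_to_saliency(fixations, height, width, sigma_rel=0.0625):
    """fixations: iterable of (x, y) points in pixels; width is the video width w."""
    sigma = sigma_rel * width
    ys, xs = np.mgrid[0:height, 0:width]
    saliency = np.zeros((height, width), dtype=np.float64)
    for fx, fy in fixations:
        saliency += np.exp(-((xs - fx) ** 2 + (ys - fy) ** 2) / (2 * sigma ** 2))
    return saliency / (saliency.max() + 1e-12)  # normalize to [0, 1] (our addition)
```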
We also tested how the previous semiautomatic algorithm [11] works with mouse-tracking data from different numbers of observers. Fig. 3 illustrates that the algorithm visibly improves mouse-tracking saliency maps, making them comparable with eye tracking. In particular, it improves the mouse-tracking saliency maps of a single observer, making them better than the eye tracking of a single observer.

Then we tested four configurations of the proposed neural network architecture: two versions of the static variant and two versions of the temporal variant. The static variant processes frames independently, whereas the temporal one uses temporal cues. Each variant has a semiautomatic version that uses the external prior maps and an automatic version that does not use any external priors. All architectures were trained on the DHF1K [14] and SAVAM [4] datasets; the training set consisted of 297 videos with 86,440 frames, and the validation set contained 65 videos. The NSS term was excluded from the original SAM loss function, since optimizing the NSS metric worsens all other saliency metrics. All other optimization parameters are the same as those used in the original SAM-ResNet.

Fig. 3. Objective evaluation of four configurations of our neural network: two semiautomatic versions using the prior maps generated from mouse-tracking data of 10 observers and two automatic versions without the prior maps. The networks are compared with the mean result of N mouse- and eye-tracking observers as well as the SAVAM algorithm [10] using N mouse-tracking observers (MTO). Note that the number of observers is limited to half of the eye-tracking observers present in the Hollywood-2 dataset [12].

The static architecture variants were trained on every 25th frame of the videos. When training the temporal versions, we composed minibatches from 3 consecutive frames of 5 different videos to use as large a batch size as possible. We also disabled training of the batch normalization layers to avoid problems related to the small batch size.

Since the collected mouse-tracking data was not enough for training the semiautomatic architectures, we employed transfer learning and used eye-tracking saliency maps as the network's external prior. The prior maps were eye-tracking saliency maps of 3 observers, which have the same quality as mouse-tracking maps of 10 observers (according to Fig. 3).

Fig. 3 shows the performance of all four trained networks, where the external prior maps for the semiautomatic networks were generated from mouse-tracking data of 10 observers. The figure demonstrates that the temporal configurations significantly outperform the static ones: the added temporal cues improved the Similarity Score [8] of the original static SAM [3] version from 0.659 to 0.678, and of the semiautomatic version from 0.687 to 0.728.

The semiautomatic versions improve their prior maps and achieve better quality than the automatic versions. They also significantly outperform the semiautomatic algorithm proposed in [10]. It is worth noting that the best temporal semiautomatic configuration, which uses prior maps generated from mouse-tracking data of 10 observers, outperforms eye tracking of 8 observers. Since the prior maps have the same quality as 3 eye-tracking observers, the proposed semiautomatic algorithm effectively improves the saliency maps as though 5 more eye-tracking observers were added.

6. Conclusion

In this paper, we proposed a cheap way of getting high-quality saliency maps for video through the use of additional data. We developed a novel system that shows viewers videos in a mouse-contingent video player and collects mouse-tracking data approximating real eye fixations. We showed that mouse-tracking data can be used as an alternative to more expensive eye-tracking data. We also proposed a new deep semiautomatic algorithm that significantly improves mouse-tracking saliency maps and outperforms traditional automatic algorithms.

7. Acknowledgments

This work was partially supported by the Russian Foundation for Basic Research under Grant 19-01-00785 a.

8. References

[1] Borji, A. Saliency prediction in the deep learning era: An empirical investigation. CoRR abs/1810.03716 (2018).
[2] Borji, A., and Itti, L. State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1 (2013), 185–207.
[3] Cornia, M., Baraldi, L., Serra, G., and Cucchiara, R. Predicting human eye fixations via an LSTM-based saliency attentive model. IEEE Transactions on Image Processing 27, 10 (2018), 5142–5154.
[4] Gitman, Y., Erofeev, M., Vatolin, D., Andrey, B., and Alexey, F. Semiautomatic visual-attention modeling and its application to video compression. In International Conference on Image Processing (ICIP) (2014), pp. 1105–1109.
[5] Huang, X., Shen, C., Boix, X., and Zhao, Q. SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In International Conference on Computer Vision (2015), pp. 262–270.
[6] Jiang, L., Xu, M., and Wang, Z. Predicting video saliency with object-to-motion CNN and two-layer convolutional LSTM. CoRR abs/1709.06316 (2017).
[7] Judd, T., Durand, F., and Torralba, A. A benchmark of computational models of saliency to predict human fixations. Tech. rep., Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, 2012.
[8] Judd, T., Ehinger, K., Durand, F., and Torralba, A. Learning to predict where humans look. In International Conference on Computer Vision (ICCV) (2009), pp. 2106–2113.
[9] Kim, N. W., Bylinskii, Z., Borkin, M. A., Gajos, K. Z., Oliva, A., Durand, F., and Pfister, H. BubbleView: An interface for crowdsourcing image importance maps and tracking visual attention. ACM Trans. Comput.-Hum. Interact. 24, 5 (2017), 1–40.
[10] Lyudvichenko, V., Erofeev, M., Gitman, Y., and Vatolin, D. A semiautomatic saliency model and its application to video compression. In 13th IEEE International Conference on Intelligent Computer Communication and Processing (2017), pp. 403–410.
[11] Lyudvichenko, V., Erofeev, M., Ploshkin, A., and Vatolin, D. Improving video compression with deep visual-attention models. In International Conference on Intelligent Medicine and Image Processing (2019).
[12] Mathe, S., and Sminchisescu, C. Actions in the eye: Dynamic gaze datasets and learnt saliency models for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2015), 1408–1424.
[13] Sidorov, O., Pedersen, M., Kim, N. W., and Shekhar, S. Are all the frames equally important? CoRR abs/1905.07984 (2019).
[14] Wang, W., Shen, J., Guo, F., Cheng, M.-M., and Borji, A. Revisiting video saliency: A large-scale benchmark and a new model. IEEE Conference on Computer Vision and Pattern Recognition (2018).
[15] Xu, P., Sugano, Y., and Bulling, A. Spatiotemporal modeling and prediction of visual attention in graphical user interfaces. In CHI Conference on Human Factors in Computing Systems (2016), pp. 3299–3310.