=Paper=
{{Paper
|id=Vol-2670/MediaEval_19_paper_18
|storemode=property
|title=Predicting Media Memorability Using Deep Features with Attention and Recurrent Network
|pdfUrl=https://ceur-ws.org/Vol-2670/MediaEval_19_paper_18.pdf
|volume=Vol-2670
|authors=Le-Vu Tran,Vinh-Loc Huynh,Minh-Triet Tran
|dblpUrl=https://dblp.org/rec/conf/mediaeval/TranHT19
}}
==Predicting Media Memorability Using Deep Features with Attention and Recurrent Network==
Predicting Media Memorability Using Deep Features with Attention and Recurrent Network

Le-Vu Tran, Vinh-Loc Huynh, Minh-Triet Tran
Faculty of Information Technology, University of Science, Vietnam National University-Ho Chi Minh City
tlvu@apcs.vn, hvloc15@apcs.vn, tmtriet@fit.hcmus.edu.vn

MediaEval'19, 27-29 October 2019, Sophia Antipolis, France
Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
For the Predicting Media Memorability Task at the MediaEval 2019 Challenge, our team proposes an approach that uses deep visual features with attention and a recurrent network to predict video memorability. For several frames in each video, attentive regions are marked using AMNet. Features are then extracted from these preprocessed frames and forwarded through an LSTM network to model the temporal structure of the video and predict its memorability score.

1 INTRODUCTION
The main objective of the Predicting Media Memorability task is to automatically predict a score that indicates how memorable a video will be [2]. Video memorability is affected by several factors such as semantics, color, and saliency. In this paper, we examine the sequential structure of videos with an LSTM and take advantage of deep convolutional neural networks to obtain image features as our main source of data for predicting video memorability. Our approach has three main stages: (i) determine which regions of multiple frames of a video are most remarkable, (ii) extract image features from those frames, and (iii) predict each video's memorability score.

In the first stage, we sample 8 frames from each video and feed each frame through AMNet [3] to determine which regions are remarkable. For each frame, 3 attention maps are generated to mark attentive regions, so each video grows from 8 frames to 8 x 4 = 32 frames (1 original frame plus 3 attention-masked frames).

In the second stage, the 32 frames are ordered as O1, M11, M12, M13, O2, M21, M22, M23, ..., where Oi is the i-th original frame and Mij is the j-th masked frame of the i-th original frame, and fed into a pre-trained Inception-v3 convolutional network [7] to extract their 2048-dimensional features. In the third stage, the feature vectors of a video are passed sequentially into a recurrent neural network followed by a dense layer; the memorability score is the output of that dense layer.

2 RELATED WORK
The task of predicting image memorability (IM) has made significant progress since the release of MIT's large-scale image memorability dataset and MemNet [4]. In 2018, Fajtl et al. [3] proposed a method combining deep learning, visual attention, and recurrent networks that achieved nearly human consistency in predicting memorability on this dataset. In [6], a deep learning approach even surpassed the human consistency level with ρ = 0.72.

In our work, we explore the effect of the sequential aspect of videos on memorability by applying an LSTM to visual features. To our knowledge, an LSTM-based approach to video memorability (VM) has only been tried in [1], where the results were not promising because of the small dataset used.

3 MEMORABILITY PREDICTING
Attention: Each frame of a video is fed through AMNet, which by default iteratively generates 3 attention maps linked to the image regions correlated with memorability. We multiply these heat maps with the original frame to completely remove the regions we do not want to keep. Figure 1 illustrates this stage. As a result, each frame of a video becomes a batch of 4 frames (1 original frame plus 3 masked frames), and this batch is the input for the next stage.

Figure 1: Original frame, its three attention maps (second row), and its masked frames (third row).
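To make the masking step concrete, the sketch below shows one way to build the 4-frame batch from a frame and its three AMNet heat maps. This is our own illustration rather than the authors' code: the function name is hypothetical, and we assume the heat maps have already been resized to the frame resolution and normalized to [0, 1].

```python
import numpy as np

def mask_frame_with_attention(frame, attention_maps):
    """Turn one RGB frame (H, W, 3) and its three attention heat maps
    (each H x W, values in [0, 1]) into the batch of 4 frames used as
    input to the feature-extraction stage."""
    batch = [frame.astype(np.float32)]
    for amap in attention_maps:
        # Multiplying by the heat map suppresses low-attention regions;
        # regions with near-zero attention are effectively removed.
        batch.append(frame.astype(np.float32) * amap[..., None])
    return np.stack(batch)  # shape: (4, H, W, 3)
```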
Feature extraction: To handle the temporal factor, instead of using C3D [8] we break each video into multiple frames and treat those frames as a batch representing the video. At first we extracted only 3 frames (the first, middle, and last) for processing. After several tests, we found that extracting more frames gives better results, but the correlation did not improve substantially beyond 8 frames, and we wanted a straightforward extraction process, so we settled on 8 frames. Each video in the dataset is 7 seconds long: we take the very first frame and then capture one more frame after each second, giving 8 original frames per video. Together with the attention stage described above, each video therefore yields a total of 32 frames. We then use the Inception-v3 convolutional neural network [7], publicly available and pre-trained on ImageNet [5], to extract the frames' features, as we want a compact network with reasonably high accuracy. We extract the 2048-dimensional feature vector produced by average pooling just before the final fully connected layer.
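The sampling and feature-extraction stage could look roughly like the following sketch using OpenCV and torchvision. The paper does not state which framework was used, so the specific APIs, the preprocessing constants, and the omission of the masked frames here are our assumptions.

```python
import cv2
import torch
from torchvision import transforms
from torchvision.models import inception_v3, Inception_V3_Weights

def sample_frames(video_path, num_frames=8):
    """Grab the first frame and then one frame per second
    (8 frames for the 7-second clips), as BGR numpy arrays."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    for sec in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_MSEC, sec * 1000.0)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

# Inception-v3 pre-trained on ImageNet, with the classifier replaced by an
# identity so the forward pass returns the 2048-dimensional pooled features.
model = inception_v3(weights=Inception_V3_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Identity()
model.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((299, 299)),  # Inception-v3 input resolution
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_features(frames):
    """Return a (len(frames), 2048) tensor of Inception-v3 features."""
    batch = torch.stack([preprocess(cv2.cvtColor(f, cv2.COLOR_BGR2RGB))
                         for f in frames])
    with torch.no_grad():
        return model(batch)
```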
Predicting memorability: We considered several approaches from image and video memorability. In our attempts at adapting IM to VM, we used only the middle frame of each video and trained two models with those frames as input. We implemented a simple model consisting of a CNN for feature extraction and 2 fully connected (FC) layers for computing the output score. We also retrained the model in [3] on those images to see whether it generalizes to the task's dataset.

Furthermore, we propose an LSTM model that predicts the VM score from the features extracted above (Figure 2). Each extracted feature vector of a frame is the input of one time step of the LSTM. At the last step, a dense layer takes the 1024-dimensional output vector of the LSTM and computes the memorability score of the video.

For the short-term task, three of our five submitted runs are results of the proposed method with three configurations (1024, 2048, and 4096 hidden units). The remaining two runs use the captioning mechanism from [9], which we use to generate attention heat maps in the same way as the AMNet mechanism described above, with two configurations (2048 and 4096 hidden units). For the long-term task, we repeat the same configurations but train on the long-term data instead of the short-term data.

Figure 2: The proposed method.
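A minimal PyTorch sketch of this predictor is given below. The 2048-dimensional inputs, 32 time steps, hidden size, and the 0.5 dropout rate follow the paper; the use of a single LSTM layer, the placement of dropout on the final hidden state, and the absence of an output activation are our assumptions.

```python
import torch
import torch.nn as nn

class MemorabilityLSTM(nn.Module):
    """Sequence model over per-frame Inception-v3 features: an LSTM
    followed by a dense layer that outputs one memorability score."""

    def __init__(self, feature_dim=2048, hidden_units=1024, dropout=0.5):
        super().__init__()
        self.lstm = nn.LSTM(input_size=feature_dim,
                            hidden_size=hidden_units,
                            batch_first=True)
        self.dropout = nn.Dropout(dropout)  # 0.5 gave the best results in the paper
        self.fc = nn.Linear(hidden_units, 1)

    def forward(self, features):
        # features: (batch, 32, 2048), one vector per (masked) frame
        _, (h_n, _) = self.lstm(features)
        last_hidden = self.dropout(h_n[-1])      # (batch, hidden_units)
        return self.fc(last_hidden).squeeze(-1)  # (batch,) memorability scores

# Example: score a batch of 4 videos, each represented by 32 frame features.
model = MemorabilityLSTM(hidden_units=1024)
scores = model(torch.randn(4, 32, 2048))
```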
4 RESULTS AND DISCUSSION
In this section, we evaluate our LSTM model on the task's dataset and present our quantitative results together with some insights learned from the dataset. Since we do not have the ground truth of the official test set, to compare our methods we divide the development set into 3 parts: 6,000 videos for training, 1,000 for validation, and 1,000 for testing. Table 1 shows the results of the different methods on our 1,000 test videos. With our approach, the same model with 1024 hidden units achieved the best result on both subtasks.

Table 1: Spearman's rank correlation ρ.

Task       | Model                          | 1,000 test videos | Official test set
Short-term | Region Attention (1024 units)  | 0.496             | 0.445
Short-term | Region Attention (2048 units)  | 0.481             | 0.434
Short-term | Region Attention (4096 units)  | 0.468             | 0.436
Short-term | Caption Attention (2048 units) | 0.431             | 0.414
Short-term | Caption Attention (4096 units) | 0.365             | 0.384
Long-term  | Region Attention (1024 units)  | 0.249             | 0.208
Long-term  | Region Attention (2048 units)  | 0.221             | 0.202
Long-term  | Region Attention (4096 units)  | 0.245             | 0.187
Long-term  | Caption Attention (2048 units) | 0.171             | 0.097
Long-term  | Caption Attention (4096 units) | 0.168             | 0.124

To prevent overfitting during training, we apply a dropout rate of 0.5 on the LSTM layer; this rate gave the best results among the three rates we tried (0.25, 0.5, and 0.75).

Discussion: According to the ground truth, short-term memorability in this dataset follows the common trend previously reported in [4]: videos showing natural scenes, landscapes, backgrounds, and exteriors tend to be less memorable, while videos with people, interiors, and human-made objects are more easily remembered.

By contrast, we think predicting long-term memorability on this dataset requires more in-depth research. For all of our methods, the results are consistently better when training and validating with short-term labels; long-term labels seem to confuse the model and lead to worse performance. One possible reason for this inconsistency is that the dataset contains multiple highly similar videos of the same objects with opposite scores.

Figure 3: Similar videos can cause confusion for a visual-based model in long-term memorability. Long-term scores: 0.727 (left), 0.273 (right).

As shown in Figure 3, the two videos are almost identical in visual features such as color, camera angle, and actor. Such videos might cause participants to make mistakes when deciding whether they have watched a clip before, so their long-term labels give opposite results.

5 CONCLUSION AND FUTURE WORK
In our approach, we focus on the temporal aspect of videos by feeding their frames into an LSTM recurrent network. We have not yet tried combining multiple types of features, so in future work we will explore using multiple aspects of a video and measure the effect on performance.

ACKNOWLEDGMENTS
Research is supported by Vingroup Innovation Foundation (VINIF) in project code VINIF.2019.DA19. We would like to thank AIOZ Pte Ltd for supporting our team with computing infrastructure.

REFERENCES
[1] Romain Cohendet, Karthik Yadati, Ngoc Q. K. Duong, and Claire-Hélène Demarty. 2018. Annotating, understanding, and predicting long-term video memorability. In Proceedings of the 2018 ACM International Conference on Multimedia Retrieval. ACM, 178-186.
[2] Mihai Gabriel Constantin, Bogdan Ionescu, Claire-Hélène Demarty, Ngoc Q. K. Duong, Xavier Alameda-Pineda, and Mats Sjöberg. 2019. The Predicting Media Memorability Task at MediaEval 2019. In Proc. of the MediaEval 2019 Workshop, Sophia Antipolis, France, Oct. 27-29, 2019.
[3] Jiri Fajtl, Vasileios Argyriou, Dorothy Monekosso, and Paolo Remagnino. 2018. AMNet: Memorability estimation with attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6363-6372.
[4] Aditya Khosla, Akhil S. Raju, Antonio Torralba, and Aude Oliva. 2015. Understanding and predicting image memorability at a large scale. In Proceedings of the IEEE International Conference on Computer Vision. 2390-2398.
[5] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, and others. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211-252.
[6] Hammad Squalli-Houssaini, Ngoc Q. K. Duong, Gwenaëlle Marquant, and Claire-Hélène Demarty. 2018. Deep learning for predicting image memorability. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2371-2375.
[7] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2818-2826.
[8] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489-4497.
[9] Viet-Khoa Vo-Ho, Quoc-An Luong, Duy-Tam Nguyen, Mai-Khiem Tran, and Minh-Triet Tran. 2018. Personal diary generation from wearable cameras with concept augmented image captioning and wide trail strategy. In Proceedings of the Ninth International Symposium on Information and Communication Technology. ACM, 367-374.