=Paper=
{{Paper
|id=Vol-2670/MediaEval_19_paper_39
|storemode=property
|title=Multimodal Deep Features Fusion for Video Memorability Prediction
|pdfUrl=https://ceur-ws.org/Vol-2670/MediaEval_19_paper_39.pdf
|volume=Vol-2670
|authors=Roberto Leyva,Faiyaz Doctor,Alba G. Seco de Herrera,Sohail Sahab
|dblpUrl=https://dblp.org/rec/conf/mediaeval/LeyvaDHS19
}}
==Multimodal Deep Features Fusion for Video Memorability Prediction==
Roberto Leyva 1,2, Faiyaz Doctor 1, Alba G. Seco de Herrera 1, Sohail Sahab 2
1 University of Essex, Colchester, UK
2 Hub Productions, London, UK
{r.leyva,fdocto,alba.garcia}@essex.ac.uk, sohail@hub.tv

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'19, 27-29 October 2019, Sophia Antipolis, France.

ABSTRACT
This paper describes a multimodal feature fusion approach for predicting short- and long-term video memorability, where the goal is to design a system that automatically predicts scores reflecting the probability of a video being remembered. The approach performs early fusion of text, image, and video features. Text features are extracted using a Convolutional Neural Network (CNN), image features using an FBResNet152 pre-trained on ImageNet, and video features using a 3DResNet152 pre-trained on Kinetics-400. We use Fisher Vectors to obtain a single fixed-length vector per video, which avoids the need for a variable-length representation when handling temporal information. The fusion approach demonstrates good predictive performance and outperforms the standard features in terms of rank correlation.

Figure 1: Video memorability prediction pipeline via three-stream media source information. We early fuse text, image and video features to create a memorability score.

1 INTRODUCTION
Remembering videos is a key aspect of advertising, entertainment, and recommendation systems [3]. We are more influenced by videos that remain fresh in our memory, and we subsequently share their contents with others. Creating memorable video content is crucial for generating consumer impact, engaging entertainment, and profitable marketing campaigns. Understanding and predicting memorability as a function of video features is therefore important for computational video analysis tasks. In this work, we propose a method for video memorability prediction [4], keeping in mind that the videos are not necessarily attractive or interesting. Thus, we explore which features provide better regression results. No assumptions are made on the task's structure, and we analyze text, image, and video features in combination to determine their ability to predict long-term and short-term memorability using different machine learning based regression techniques. Our findings show that long- and short-term memorability share the same feature structure, with fusion of features of different types giving better accuracy on the short-term task. These outcomes also leave room for future improvements.

The works that precede this study have addressed the memorability tasks mainly using the provided features or replacing them [2, 6, 7, 12, 25, 26]. The memorability task can be approached using single-source or multi-source feature information to train a regression model. Gupta et al. [7] propose using the image information source with highly regularized linear models to prevent over-fitting, applied to the provided features, Residual Network (ResNet) features, and Dense Network (DenseNet) features. Over-fitting is potentially a primary concern in the memorability task. They use the Least Absolute Shrinkage and Selection Operator (LASSO) [23], Support Vector Regression (SVR), and Elastic Network (ENet) in their experiments. Savii et al. [20] propose using only the video temporal information, passing Convolution 3D (C3D) [24] and Histogram of Motion Patterns (HMP) [1] features to a Deep Neural Network (DNN), where the final score is obtained using a DNN + k-Nearest Neighbour (k-NN) regressor. In similar work, Tran-Van et al. [25] propose a solution to capture the temporal information by combining Inception V3 (IV3) image features with a Long Short-Term Memory (LSTM) network to produce the memorability score.

2 APPROACH
Multi-source feature fusion usually gives improved results over isolated modeling of features, as shown in [6, 7, 12, 25, 26]. Models using image, text, and video features [2, 26] achieve better results when fusing them than when modelling them individually [22].
However, fusing multiple features from the same information source, e.g., the image source, can increase complexity while giving little improvement to the tasks' performance [6]. For instance, Joshi et al. [12] propose using the Memorability Network [13] along with 3D Hue-Saturation-Value (HSV) features [6], colorfulness [10], aesthetics [8], a saliency network [18], C3D [24], and Global Vectors (GloVe) text features [19]. This approach gives little gain over single-feature source selection. For this reason, we deem it appropriate to extract only one feature from each of the following information sources: text, image, and video. Secondly, modeling the spatio-temporal domain via recurrent networks may become computationally costly [25]. Because we are targeting large-scale video analysis, we consider a less complex approach. Thirdly, to generate the memorability score, we explore linear regularized methods and deep learning models. This consideration rests on the assumption that the latter techniques do not necessarily achieve better generalization, as mentioned in [7]. Finally, the performance of the provided features can be improved upon [17]. To this end, we use other feature representations, following authors [20, 26] who employed ConceptNet [21] and skip-thought vectors [15]. We thereby consider other deep learning approaches for feature extraction, giving particular importance to the spatio-temporal domain, as in [20, 25].

Our proposed method uses three primary feature modalities (text, image, and video) for predicting the memorability score; Figure 1 shows the pipeline in detail.

Text Features: We use the provided video captions as input text to a Text Convolutional Neural Network (TCNN). The text is vectorised via tokenization and a 100-dimension word embedding to feed the network, which is trained on the IMDB dataset for sentiment analysis [14]. We use this dataset because the network's high accuracy on this task gave us confidence that the model is adequately trained and can be trusted as a feature generator. We use the last Fully Connected (FC) layer, which concatenates the convolutional embeddings of the input text, as the feature generator. This process results in a 300-dimension feature vector, i.e., 3× the embedding size.
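As a rough illustration (not the exact architecture used), the sketch below builds a Kim-style text CNN whose penultimate layer concatenates three pooled convolution branches of 100 units each, giving the 300-dimension caption feature described above; the vocabulary size, caption length, and kernel sizes are assumed values.

<pre>
# Minimal sketch of the text stream (TCNN): three parallel 1-D convolution
# branches over a 100-dim word embedding, global-max-pooled and concatenated
# into a 300-dim penultimate FC layer that is later reused as the caption
# feature.  Vocabulary size, caption length and kernel sizes are assumptions.
import numpy as np
from tensorflow.keras import layers, Model

VOCAB, MAX_LEN, EMB = 20_000, 50, 100             # assumed sizes

tokens = layers.Input(shape=(MAX_LEN,), dtype="int32")
emb = layers.Embedding(VOCAB, EMB)(tokens)
branches = []
for k in (3, 4, 5):                               # assumed kernel sizes
    x = layers.Conv1D(EMB, k, activation="relu")(emb)
    branches.append(layers.GlobalMaxPooling1D()(x))
features = layers.Concatenate()(branches)         # 3 x 100 = 300 dims
sentiment = layers.Dense(1, activation="sigmoid")(features)

tcnn = Model(tokens, sentiment)
tcnn.compile(optimizer="adam", loss="binary_crossentropy")
# ... fit on the IMDB sentiment dataset, then reuse the 300-dim layer:
text_encoder = Model(tokens, features)
caption_vec = text_encoder.predict(np.zeros((1, MAX_LEN), dtype="int32"))  # (1, 300)
</pre>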
Image Features: We extract the middle frame of each video clip and apply FBResNet152 [11] pre-trained on ImageNet. To this end, we feed the model the middle frame and extract a 1000-dimension feature vector from the last FC layer. We also explored selecting other frames from the sequences without achieving better correlation values.

Video Features: To extract video features, we use 3DResNet152 [9] pre-trained on Kinetics-400. We feed the video sequence to the network to retrieve a 2048-dimension feature vector for every 16 frames. Although in this particular case we may have fixed-length video clips, in practice the number of frames is not fixed, and stacking the produced features may become very computationally complex. Inspired by the work of Girdhar et al. [5], who use Vector of Locally Aggregated Descriptors (VLAD) vectors for action recognition, we follow a similar approach using Fisher Vectors (FV) to address this problem and create a single feature vector for each video sequence. The method is to fit a Gaussian Mixture Model (GMM) to the collection of 16-frame features and project them into a high-dimensional space via soft assignment. As the resulting feature space has a considerably high dimensionality, we reduce it via Principal Component Analysis (PCA), following an FV-GMM-PCA scheme [16]. This last step provides a single feature vector for each video sequence capturing the motion information of the clips.
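As a rough illustration of this aggregation step, the sketch below pools each clip's per-block 3DResNet152 descriptors into a single Fisher Vector and reduces it with PCA. The choice of 64 GMM components (which, with mean and variance gradients over 2048-dimension descriptors, yields a vector of roughly 260k dimensions, consistent with Figure 1) and the normalisation details are assumptions rather than the authors' exact configuration.

<pre>
# Sketch of the FV-GMM-PCA aggregation: one 2048-dim 3DResNet152 feature per
# 16-frame block is pooled into a single Fisher Vector per clip, then reduced
# to 256 dims with PCA.  K = 64 components and the normalisations are assumed.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

D, K = 2048, 64                                    # descriptor size / assumed GMM size

def fisher_vector(desc, gmm):
    """Fisher Vector with mean and variance gradients for a diagonal GMM."""
    T = desc.shape[0]
    post = gmm.predict_proba(desc)                 # (T, K) soft assignments
    sigma = np.sqrt(gmm.covariances_)              # (K, D) diagonal std-devs
    parts = []
    for k in range(K):
        diff = (desc - gmm.means_[k]) / sigma[k]   # normalised residuals
        g_mu = (post[:, k, None] * diff).sum(0) / (T * np.sqrt(gmm.weights_[k]))
        g_sig = (post[:, k, None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * gmm.weights_[k]))
        parts += [g_mu, g_sig]
    fv = np.concatenate(parts)                     # 2*K*D ~ 262k dims
    fv = np.sign(fv) * np.sqrt(np.abs(fv))         # power normalisation
    return fv / (np.linalg.norm(fv) + 1e-12)       # L2 normalisation

def aggregate(clips):
    """clips: list of (num_blocks, 2048) arrays of 3DResNet152 block features."""
    gmm = GaussianMixture(n_components=K, covariance_type="diag").fit(np.vstack(clips))
    fvs = np.stack([fisher_vector(c, gmm) for c in clips])
    return PCA(n_components=256).fit_transform(fvs)   # one 256-dim vector per video
</pre>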
Feature Fusion: We combine the text, image, and video features via early fusion. Prior to this step, we reduce each feature's dimensionality using PCA with 256 components, aiming for a better feature representation. The vectors are then stacked into a 3 × 256 = 768-dimension vector and fed into the regression model, as Figure 1 illustrates. The last step is to perform the regression using a regularized method. To this end, we use LassoLarsCV [23], whose cross-validation automatically selects the best regression parameters for the final model.
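A minimal sketch of this fusion and regression stage is shown below, using scikit-learn's LassoLarsCV and random placeholder matrices in place of the three extracted feature streams; the split sizes and variable names are illustrative only.

<pre>
# Early fusion (PCA to 256 dims per modality, stacked to 768) followed by a
# cross-validated LassoLars regressor.  Random placeholders stand in for the
# TCNN, FBResNet152 and FV-aggregated video features.
import numpy as np
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoLarsCV

def fuse(text_f, image_f, video_f, n_components=256):
    """Early fusion: PCA per modality, then stacking into a 3 x 256 = 768-dim vector."""
    reduced = [PCA(n_components=n_components).fit_transform(f)
               for f in (text_f, image_f, video_f)]
    return np.hstack(reduced)

rng = np.random.default_rng(0)
n = 1000                                             # placeholder number of videos
X = fuse(rng.normal(size=(n, 300)),                  # stand-in for TCNN caption features
         rng.normal(size=(n, 1000)),                 # stand-in for FBResNet152 frame features
         rng.normal(size=(n, 2048)))                 # stand-in for video-stream features
y = rng.uniform(size=n)                              # memorability scores in [0, 1]

model = LassoLarsCV(cv=5).fit(X[:800], y[:800])      # development split
rho = spearmanr(model.predict(X[800:]), y[800:]).correlation
print(f"Spearman's rho on held-out videos: {rho:.3f}")
</pre>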
3 RESULTS AND ANALYSIS
The memorability dataset comprises 10,000 short, soundless videos split into 8,000 videos for the development set and 2,000 videos for the test set [4]. The videos are varied and contain different scene types; some pre-computed content descriptors are also provided. Table 1 shows that our approach performs better on short-term memorability (STM) than on long-term memorability (LTM). We experimentally found that the regression model has a significant impact on the correlation values. This selection requires further analysis in terms of features as well. Perhaps unsupervised models may reveal more about the nature of the tasks.

Table 1: Memorability task evaluation using Spearman's rank correlation for different models.

Task  Run                                Validation  Test
STM   TCNN/FBRN152/3DRN152/LassoLarsCV   0.5149      0.459
STM   TCNN/FBRN152/3DRN152/LassoCV       0.4987      0.463
STM   FBRN152/LassoLarsCV                0.4936      0.445
STM   TCNN/FBRN152/3DRN152/DNN           0.4837      0.436
STM   TCNN/DN201/3DRN152/LassoLarsCV     0.5185      0.467
LTM   TCNN/FBRN152/3DRN152/SVR           0.2394      0.203
LTM   TCNN/FBRN152/3DRN152/LassoCV       0.2321      0.185
LTM   TCNN/FBRN152/3DRN152/DNN           0.2104      0.159
LTM   FBRN152/SVR                        0.2491      0.189
LTM   DN152/SVR                          0.2612      0.196

4 DISCUSSION AND OUTLOOK
From Table 1, we can see that the best regression model is not the same for both tasks: LassoLarsCV achieves the best results for the STM task, while SVR does so for the LTM task. Although the best regression model differs, we achieve the best correlation results for the memorability tasks when fusing all three types of features. It is worth noticing that image-based features achieve the second-best results. Regarding the frame selection criterion, i.e., the middle frame, we observed no significant difference in Spearman's rank correlation when selecting other frames. This aspect may be linked to the short length of the videos; a quick inspection shows a strong visual relationship between the first and the last frame. Longer sequences may require more elaborate temporal analysis, so for practical purposes we prefer to incorporate features designed specifically for video. We also verified the effectiveness of PCA before the early fusion and for individual feature selection, observing an improvement of ca. 4-7% in Spearman's rank correlation; it is therefore good practice to project the features into a lower-dimensional space before feeding the regression model. The proposed method enables us to capture the memorability associated with videos using multimedia features. With this in mind, it is possible to create models for similar tasks in video content for other computer vision applications. The memorability task can then extrapolate multimedia analysis to other case studies, e.g., video summarization, where the scores can be treated as feature weights and where, naturally, the features are not necessarily visual.

ACKNOWLEDGMENTS
This study has been funded through an Innovate UK Knowledge Transfer Partnership between Hub Productions Limited and the School of Computer Science & Electronic Engineering, University of Essex, Partnership No: 11071.

REFERENCES
[1] Jurandy Almeida, Neucimar J Leite, and Ricardo da S Torres. 2011. Comparison of video sequences with histograms of motion patterns. In 2011 18th IEEE International Conference on Image Processing. IEEE, 3673–3676.
[2] Ritwick Chaudhry, Manoj Kilaru, and Sumit Shekhar. 2018. Show and Recall @ MediaEval 2018 ViMemNet: Predicting Video Memorability. Group 1 (2018), G1.
[3] Romain Cohendet, Claire-Hélène Demarty, Ngoc Q. K. Duong, Mats Sjöberg, Bogdan Ionescu, Thanh-Toan Do, and France Rennes. 2018. MediaEval 2018: Predicting Media Memorability Task. CoRR abs/1807.01052 (2018). arXiv:1807.01052 http://arxiv.org/abs/1807.01052
[4] Mihai Gabriel Constantin, Bogdan Ionescu, Claire-Hélène Demarty, Ngoc Q. K. Duong, Xavier Alameda-Pineda, and Mats Sjöberg. 2019. Predicting Media Memorability Task at MediaEval 2019. In Proc. of MediaEval 2019 Workshop, Sophia Antipolis, France, Oct. 27-29, 2019.
[5] Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, and Bryan Russell. 2017. ActionVLAD: Learning spatio-temporal aggregation for action classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 971–980.
[6] Ankit Goyal, Naveen Kumar, Tanaya Guha, and Shrikanth S Narayanan. 2016. A multimodal mixture-of-experts model for dynamic emotion prediction in movies. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2822–2826.
[7] Rohit Gupta and Kush Motwani. 2018. Linear Models for Video Memorability Prediction Using Visual and Semantic Features. In MediaEval.
[8] Michael Gygli, Helmut Grabner, Hayko Riemenschneider, Fabian Nater, and Luc Van Gool. 2013. The interestingness of images. In Proceedings of the IEEE International Conference on Computer Vision. 1633–1640.
[9] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. 2018. Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 6546–6555.
[10] David Hasler and Sabine E Suesstrunk. 2003. Measuring colorfulness in natural images. In Human Vision and Electronic Imaging VIII, Vol. 5007. International Society for Optics and Photonics, 87–95.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[12] Tanmayee Joshi, Sarath Sivaprasad, Savita Bhat, and Niranjan Pedanekar. 2018. Multimodal Approach to Predicting Media Memorability. In MediaEval.
[13] Aditya Khosla, Akhil S Raju, Antonio Torralba, and Aude Oliva. 2015. Understanding and predicting image memorability at a large scale. In Proceedings of the IEEE International Conference on Computer Vision. 2390–2398.
[14] Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014).
[15] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems. 3294–3302.
[16] R. Leyva, V. Sanchez, and C. Li. 2019. Compact and Low-Complexity Binary Feature Descriptor and Fisher Vectors for Video Analytics. IEEE Transactions on Image Processing 28, 12 (Dec 2019), 6169–6184. https://doi.org/10.1109/TIP.2019.2922826
[17] Yang Liu, Zhonglei Gu, and Tobey H Ko. 2018. Learning Memorability Preserving Subspace for Predicting Media Memorability. In MediaEval.
[18] Junting Pan, Elisa Sayrol, Xavier Giro-i Nieto, Kevin McGuinness, and Noel E O'Connor. 2016. Shallow and deep convolutional networks for saliency prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 598–606.
[19] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532–1543. http://www.aclweb.org/anthology/D14-1162
[20] Ricardo Manhães Savii, Samuel Felipe dos Santos, and Jurandy Almeida. 2018. GIBIS at MediaEval 2018: Predicting Media Memorability Task. In MediaEval.
[21] Robert Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence.
[22] Wensheng Sun and Xu Zhang. 2018. Video Memorability Prediction with Recurrent Neural Networks and Video Titles at the 2018 MediaEval Predicting Media Memorability Task. In MediaEval.
[23] Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 1 (1996), 267–288.
[24] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489–4497.
[25] Duy-Tue Tran-Van, Le-Vu Tran, and Minh-Triet Tran. 2018. Predicting Media Memorability Using Deep Features and Recurrent Network. In MediaEval.
[26] Shuai Wang, Weiying Wang, Shizhe Chen, and Qin Jin. 2018. RUC at MediaEval 2018: Visual and Textual Features Exploration for Predicting Media Memorability. In MediaEval.