Multimodality and Deep Learning when Predicting Media Interestingness
                                       Eloïse Berson, Claire-Hélène Demarty, Ngoc Q. K. Duong
                                                         Technicolor, France
                     eloise.berson@gmail.com,{claire-helene.demarty,quang-khanh-ngoc.duong}@technicolor.com

ABSTRACT

This paper summarizes the computational models that Technicolor proposes to predict the interestingness of images and videos within the MediaEval 2017 Predicting Media Interestingness Task. Our systems are based on deep learning architectures and exploit both semantic and multimodal features. Based on the obtained results, we discuss our findings and draw some scientific perspectives for the task.

1    INTRODUCTION

Understanding the interestingness of media content such as images and videos has recently gained significant attention from the research community, as it offers numerous practical applications, e.g., content selection or recommendation [1, 2, 5]. Following the success of the 2016 edition [4], the MediaEval 2017 Predicting Media Interestingness Task extends the benchmark to larger datasets, annotated with a greater human annotation effort. A complete description of the task can be found in [3].
   For both subtasks, Technicolor’s motivation was to build incrementally on last year’s systems [11], i.e., to re-use similar features and DNN architectures while adding some contextual information about the content. To this end, two new features were added to capture additional semantic information related to the content, following an idea similar to [8]. These new features (Section 2) were expected to bring contextual information about the content. In a second step (Section 3), and for the video subtask only, several embeddings of this semantic information at different network levels were experimented with, in order to investigate how this influenced the temporal modeling of this new information.

2    MULTIMODALITY AND CONTEXTUAL FEATURES

As in 2016, CNN features from the fc7 layer of the pre-trained CaffeNet model (image modality, both subtasks) and MFCCs concatenated with their first and second derivatives (audio modality, video subtask) were extracted following the protocol described in [11]. The dimensions of these features are 4096 and 180, respectively.
   To capture additional semantic information, Image-Captioning Based (ICB) features [7] were computed for each image or frame, depending on the subtask. These features correspond to the projection of an image into a visual-semantic embedding space [7], obtained from a model jointly trained on images and captions for automatic captioning. In this embedding space, where the semantic distances between projected image and caption features are minimized, the resulting representations are more likely to contain semantic information than the CNN features alone. The dimension of the ICB feature is 1024.
   To go further in this vein of adding semantic and contextual information, textual metadata was extracted directly from IMDb (http://www.imdb.com), exploiting the fact that the MediaEval 2017 dataset was built from Hollywood-like movie extracts. Except for 3 movies (for 2 of them, a short summary was built from descriptions found on the internet; for the last one, the description was left empty), IMDb information was available: each movie description and/or storyline was fed to the RAKE algorithm [10] for keyword extraction. Several keywords were thus extracted per movie, from which we derived a textual feature of dimension 300 by classically taking the Word2Vec [9] representations (pre-trained on the GoogleNews dataset) of this batch of keywords and averaging them.

3    DNN ARCHITECTURES

Global workflows for all submitted runs and for both subtasks are shown in Figure 1. As stated in the introduction, most components used to build the systems’ architectures for both subtasks are the same as in [11]. Thus, to cope with the imbalance of the dataset, some resampling of the data was applied during training. Several parameter configurations were investigated by splitting the development set into 80% for training and 20% for validation. A final retraining of the best model was then applied on the complete development set.
   For the image subtask, different concatenations of the features were investigated to understand the contribution of each modality and to assess the contribution of contextual information to the task. Each submitted run thus differs from the others by its input features and the adaptation of the layer sizes, while the DNN architecture remains the same: a single MLP layer, with rectified linear unit (ReLU) activation and a dropout of 0.5. All submitted runs are summarized in Figure 1a, with different colors depending on the feature concatenation; Run#1, which corresponds to the best 2016 system, serves as a baseline.
   For the video subtask, three levels of embedding for the W2V features were investigated (see Figure 1b), except for Run#1, which re-uses one of last year’s systems (Run#3 in [11]; see Figure 1 for the layers used, each of them with a ReLU activation function followed by a dropout of 0.5). Run#1’s architecture is kept for the other runs, with some adaptation of the multimodal block depending on the input feature sizes (one or two LSTM layers, with a residual block). In Run#1, our baseline, only the audio and video modalities are used. For the image channel, a first modality-specific learning step is implemented with an MLP layer followed by an LSTM layer. For the audio channel, a single LSTM layer is used. After merging, both channels serve as input to two LSTM layers with a residual part (ResNet [6]).
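For illustration, this Run#1 video baseline can be sketched as follows. This is a minimal sketch only, assuming TensorFlow/Keras, a fixed number of frames per segment, and an illustrative hidden size; the actual layer sizes, optimizer, and training details (resampling, 80/20 split, final retraining) are those of Figure 1b and [11].

    from tensorflow.keras import layers, Model

    T = 20            # frames sampled per video segment (illustrative assumption)
    DIM_CNN = 4096    # fc7 CaffeNet feature dimension (Section 2)
    DIM_AUDIO = 180   # MFCCs with first and second derivatives (Section 2)
    HID = 128         # hidden size: an assumption, the actual sizes are given in Figure 1b

    # Image channel: modality-specific MLP layer followed by an LSTM.
    img_in = layers.Input(shape=(T, DIM_CNN), name="cnn_fc7")
    img = layers.TimeDistributed(layers.Dense(HID, activation="relu"))(img_in)
    img = layers.Dropout(0.5)(img)
    img = layers.LSTM(HID, return_sequences=True)(img)

    # Audio channel: a single LSTM over the MFCC sequence.
    aud_in = layers.Input(shape=(T, DIM_AUDIO), name="mfcc")
    aud = layers.LSTM(HID, return_sequences=True)(aud_in)

    # After merging, two LSTM layers with a residual connection (ResNet-like [6]).
    merged = layers.Concatenate()([img, aud])
    x = layers.LSTM(2 * HID, return_sequences=True)(merged)
    y = layers.LSTM(2 * HID, return_sequences=True)(x)
    x = layers.Add()([x, y])                       # residual part

    # Frame-level softmax, then averaging over time (Time Domain Average).
    frame_scores = layers.TimeDistributed(layers.Dense(2, activation="softmax"))(x)
    out = layers.GlobalAveragePooling1D(name="time_domain_average")(frame_scores)

    model = Model([img_in, aud_in], out)
    model.compile(optimizer="adam", loss="categorical_crossentropy")

In the sketch, as in the non-swapped runs, the frame-level softmax precedes the time-domain average; Run#5 (see below) swaps these two steps so that the softmax decision is taken on the segment-level average.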




              Figure 1: Workflows for all run submissions: (a) image subtask; (b) video subtask. *: block swapping.
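As a complement to Figure 1a, the image-subtask classifier reduces to a single MLP layer with ReLU activation and a dropout of 0.5 on top of the concatenated features. The following is a minimal sketch, assuming TensorFlow/Keras, an illustrative hidden size, and a two-class softmax output; the exact output layer and layer sizes are those of Figure 1a.

    from tensorflow.keras import layers, Model

    DIM_CNN, DIM_ICB, DIM_W2V = 4096, 1024, 300    # feature sizes from Section 2
    HID = 256                                       # assumption: hidden size for illustration

    # Run-dependent input: a concatenation of the CNN fc7, ICB and/or W2V features.
    feat_in = layers.Input(shape=(DIM_CNN + DIM_ICB + DIM_W2V,), name="concatenated_features")

    # Single MLP layer with ReLU activation and a dropout of 0.5, as in Figure 1a.
    h = layers.Dense(HID, activation="relu")(feat_in)
    h = layers.Dropout(0.5)(h)

    # Two-class output (interesting / not interesting); the probability of the
    # "interesting" class is used to rank the images.
    out = layers.Dense(2, activation="softmax")(h)

    model = Model(feat_in, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy")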

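The W2V textual feature used in Runs #2 to #5 below is the 300-dimensional movie-level feature of Section 2. A minimal sketch of its computation is given here, assuming the rake-nltk package for RAKE [10] and gensim with the pre-trained GoogleNews vectors for Word2Vec [9]; how keyword phrases are tokenized before embedding is our assumption, and the exact preprocessing used for the submissions may differ.

    import numpy as np
    from rake_nltk import Rake                      # assumption: RAKE [10] via the rake-nltk package
    from gensim.models import KeyedVectors          # assumption: Word2Vec [9] via gensim

    # Pre-trained GoogleNews vectors (300 dimensions), as mentioned in Section 2.
    w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    def movie_textual_feature(description, dim=300):
        """Extract RAKE keywords from an IMDb description/storyline and
        average their Word2Vec representations into one 300-d feature."""
        rake = Rake()
        rake.extract_keywords_from_text(description)
        keywords = rake.get_ranked_phrases()        # ranked keyword phrases

        vectors = []
        for phrase in keywords:
            for word in phrase.split():
                if word in w2v:                     # skip out-of-vocabulary words
                    vectors.append(w2v[word])
        if not vectors:                             # e.g., the movie whose description was left empty
            return np.zeros(dim, dtype=np.float32)
        return np.mean(vectors, axis=0)

The zero-vector fallback mirrors the case of the movie for which no description was available.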
   In Run#2, the W2V features were simply merged with the result of the temporal modeling of the other modalities, whereas for Run#3 and Run#4 they were duplicated for each frame and merged into the workflow either in parallel to the audio and video channels (Run#4) or after a first merging of these two modalities (Run#3) (see Figure 1). For each run, some processing steps were added to realize the merge with the other modalities, using either additional LSTM-ResNet layers when temporal modeling was possible (Runs #3 and #4) or simple MLP layers otherwise (Run#2). These steps were followed by a simple concatenation of the features from the different modalities. Run#5 is similar to Run#4, except that the Time Domain Average and Softmax steps were swapped. The motivation for this last run was to test whether the location of the decision step (softmax) had an influence on the performance.

4    RESULTS AND DISCUSSION

            Image Subtask          Video Subtask
   Runs     MAP       MAP@10       MAP       MAP@10
   Run#1    0.2615    0.1028       0.1856    0.0589
   Run#2    0.2525    0.1054       0.1768    0.0465
   Run#3    0.2244    0.0693       0.1825    0.0563
   Run#4    0.2382    0.0875       0.1878    0.0641
   Run#5    0.2347    0.0861       0.1918    0.0609
   Table 1: Results on both subtasks (Official metric: MAP@10).

Results are summarized in Table 1. The first runs of both subtasks show slightly improved MAP values compared to last year’s results. As the systems remain the same for these runs over the two years, this tends to show that the increase in dataset size and/or the refinement of the annotations had an effect on the modeling performance. Unexpectedly, the MAP@10 values are very low, lower than during the validation process, when MAP@10 values were in the same range as, or slightly lower than, the MAP values for both subtasks. Another unexpected result is that the contextual features, either ICB or W2V, did not bring any improvement to the image subtask, although we had reached the opposite conclusion during validation on the development set, with MAP values of respectively 0.36 and 0.38 for those features (for comparison, we obtained 0.31 with the CNN features). This suggests that using more features might have led to over-fitting, probably because of the small size of the dataset during training. This over-fitting might also have been reinforced by the fact that, because of a lack of computation resources, cross-validation was done with one fold only. In the future, a cross-validation process with more folds might lead to a better system. However, once the test set is released, further analysis of the differences between the development and test sets should be done to better understand this observation.
   For the video subtask, as expected, the W2V features slightly improved MAP and MAP@10 when considered as a frame-based feature. Although they are simply repeated for each frame, i.e., each frame of a given video shares the same textual feature, the concatenation of this new information did bring some useful information for the video subtask. This difference between the two subtasks reinforces the difference between image and video interestingness that was already observed last year. Run#5 of the video subtask suggests that keeping the classification as the final step of the system maximizes the performance, which is understandable as it allows keeping continuous values as long as possible before switching to a binary classification. It also corresponds better to the annotation protocol, in which each video segment is annotated as a whole; the softmax prediction should therefore also be done for the whole segment and not for every single frame.
   As a conclusion, many of our findings from the validation step differ from those obtained on the test set. We definitely need to understand what differs between these two sets and is responsible for the differences in performance. For example, some new, significantly longer and thus more meaningful segments (243 out of 2435) were added to the test set only, representing a duration of 46 min out of a total duration of 87 min, i.e., more than half of the test set.
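For reference, the official MAP@10 metric discussed above can be approximated with the following minimal sketch, assuming binary ground-truth labels and a ranking of the images or segments within each movie; the official scores were computed with the task’s evaluation tools, whose exact normalization may differ.

    import numpy as np

    def average_precision_at_k(scores, labels, k=10):
        """AP@k for one movie: rank items by predicted interestingness score and
        average the precision values at the ranks where interesting items occur."""
        labels = np.asarray(labels)
        top = np.argsort(scores)[::-1][:k]          # indices of the k highest scores
        hits, precisions = 0, []
        for rank, idx in enumerate(top, start=1):
            if labels[idx]:
                hits += 1
                precisions.append(hits / rank)
        n_relevant = min(int(labels.sum()), k)      # common AP@k normalization
        return sum(precisions) / n_relevant if n_relevant else 0.0

    def map_at_k(per_movie, k=10):
        """MAP@k: mean AP@k over all movies; `per_movie` is a list of
        (scores, binary_labels) pairs, one pair per movie of the test set."""
        return float(np.mean([average_precision_at_k(s, l, k) for s, l in per_movie]))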


REFERENCES
 [1] Sharon Lynn Chu, Elena A Fedorovskaya, Francis KH Quek, and Jeffrey
     Snyder. 2013. The effect of familiarity on perceived interestingness of
     images. In Human Vision and Electronic Imaging.
 [2] Claire-Helène Demarty, Mats Sjöberg, Gabriel Constantin, Ngoc Q. K.
     Duong, Bogdan Ionescu, Thanh-Toan Do, and Hanli Wang. 2017. Pre-
     dicting Interestingness of Visual Content.
 [3] Claire-Helène Demarty, Mats Sjöberg, Bogdan Ionescu, Thanh-Toan
     Do, Michael Gygli, and Ngoc Q. K. Duong. 2017. MediaEval 2017
     Predicting Media Interestingness Task. MediaEval 2017 Workshop
     (September 2017).
 [4] Claire-Helène Demarty, Mats Sjöberg, Bogdan Ionescu, Thanh-Toan
     Do, Hanli Wang, Ngoc Q. K. Duong, and Frederic Lefebvre. 2016.
     MediaEval 2016 Predicting Media Interestingness Task. MediaEval
     2016 Workshop (October 2016).
 [5] Helmut Grabner, Fabian Nater, Michel Druey, and Luc Van Gool. 2013.
     Visual Interestingness in Image Sequences. In Proceedings of the 21st
     ACM International Conference on Multimedia (MM ’13). ACM, New
     York, NY, USA, 1017–1026. https://doi.org/10.1145/2502081.2502109
 [6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015.
     Deep Residual Learning for Image Recognition. arXiv preprint
     arXiv:1512.03385 (2015).
 [7] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. 2014. Uni-
     fying visual-semantic embeddings with multimodal neural language
     models. arXiv preprint arXiv:1411.2539 (2014).
 [8] Brian A. Plummer, Matthew Brown, and Svetlana Lazebnik. 2017.
     Enhancing Video Summarization via Vision-Language Embedding. In
     Proceedings of the IEEE Conference on Computer Vision and Pattern
     Recognition (CVPR 2017). IEEE.
 [9] Xin Rong. 2014. word2vec parameter learning explained. arXiv preprint
     arXiv:1411.2738 (2014).
[10] Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley. 2010. Au-
     tomatic keyword extraction from individual documents. Text Mining:
     Applications and Theory (2010), 1–20.
[11] Yuesong Shen, Claire-Hélène Demarty, and Ngoc Q. K. Duong. 2016.
     Technicolor@MediaEval 2016 Predicting Media Interestingness Task.
     In MediaEval.