             Frame-based Evaluation with Deep Features to Predict
                        Emotional Impact of Movies
Khanh-An C. Quan1, Vinh-Tiep Nguyen1, Minh-Triet Tran2
1 University of Information Technology, Vietnam National University - Ho Chi Minh City
                                        2 University of Science, Vietnam National University - Ho Chi Minh City

                                      15520006@gm.uit.edu.vn,tiepnv@uit.edu.vn,tmtriet@fit.hcmus.edu.vn

ABSTRACT
In this paper, we describe our approach for the Emotional Impact of Movies Task at the MediaEval 2018 Challenge. Specifically, we employ features extracted by ResNet-50 from image frames. Then, a fully connected neural network is used to learn the prediction models. Finally, we apply a sliding window technique to post-process the results. The experimental results show the effectiveness of our approach.
                                                                               Figure 1: Overview of Frame-based Prediction Models

1    INTRODUCTION
Analysing the emotional impact of a video clip on viewers can be used to enhance or control the psychological effects of media on people [2, 7], to boost user engagement with media content [6], or to generate personalized media content [5].
   The MediaEval 2018 Emotional Impact of Movies Task consists
of two subtasks. The first subtask is to predict the scores of induced valence and induced arousal for every second of a movie. The other is fear prediction, which we did not address. Both subtasks
are evaluated by Mean Squared Error and Pearson’s Correlation
Coefficient. The dataset used for both is the LIRIS-ACCEDE [1]
dataset. Full details of the challenge tasks and database can be
found in [3].
   There are various sources of information that can be exploited to predict the emotional impact of a movie clip. Although visual content is an essential source for inferring viewers' emotions, audio and text are also potential components for this task. Frame-based and sequence-based approaches can be applied to analyse video frames to evaluate emotional impact.
   In our method, we follow the frame-based approach to predict the emotional impact of a video. From the training dataset, we extract deep features from each frame and train two models to predict the valence and arousal properties of a video frame. Then we apply the two trained models to evaluate each frame in the test set independently. Finally, we employ a sliding window technique to smooth the final results.
2    APPROACH
In this section, we describe how we approach the valence-arousal prediction subtask. The proposed method consists of four stages: frame extraction, feature extraction, prediction models, and post-processing. Our system pipeline is shown in Figure 1.

Figure 2: Valence/Arousal prediction models

2.1    Frame extraction
Firstly, we extract one frame per second from all movies in the training and test sets. For frame extraction, we use the ffmpeg framework and the extraction command provided by the organizers.
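The exact command supplied by the task organizers is not reproduced in the paper; the following is a minimal sketch, assuming ffmpeg is invoked from Python and sampling one frame per second with the fps filter. The paths and output naming pattern are illustrative assumptions.

import subprocess
from pathlib import Path

def extract_frames(movie_path, out_dir):
    """Sample one frame per second from a movie with ffmpeg (illustrative)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # -vf fps=1 keeps exactly one frame for each second of video
    subprocess.run(
        ["ffmpeg", "-i", str(movie_path), "-vf", "fps=1",
         str(Path(out_dir) / "frame_%05d.jpg")],
        check=True,
    )

# Example: extract_frames("movie_01.mp4", "frames/movie_01")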

2.2    Features extraction
For feature extraction, we use the 50-layer Residual Network (ResNet-50) [4] pre-trained on ImageNet. ResNet-50 is used as a feature extractor, and a 2048-dimensional feature vector is extracted from each frame of the movies. In our experiments, we used the Keras ResNet-50 model pre-trained on the ImageNet dataset and computed the feature vector from the global average pooling applied to the output of the last convolutional layer.
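As an illustration, such an extractor can be built with the Keras ResNet-50 application: include_top=False with pooling='avg' yields the 2048-dimensional globally average-pooled feature per frame. The preprocessing and file handling below are our assumptions rather than details from the paper.

import numpy as np
from keras.applications.resnet50 import ResNet50, preprocess_input
from keras.preprocessing import image

# include_top=False drops the classification head; pooling='avg' applies
# global average pooling to the last convolutional output, giving a
# 2048-dimensional vector per frame.
extractor = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def frame_features(frame_path):
    """Return the 2048-dim ResNet-50 feature vector for one frame."""
    img = image.load_img(frame_path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(x)[0]  # shape: (2048,)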

2.3    Prediction models
We split the training set provided by the organizers into training and validation sets with a ratio of 80:20. An overview of the prediction models is shown in Figure 2.
   We employ a 2-layer fully connected neural network to learn the emotional models. The models take the 2048-dimensional feature vectors extracted by ResNet-50 as input. We experimented with 128, 256, and 512 nodes for the first and the second layer, and with 10, 15, and 20 training epochs. We use Root Mean Square Propagation (RMSProp) with a learning rate of 10⁻⁴. All prediction models are trained separately for valence and arousal.
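The paper does not include code for these models, so the following is a minimal Keras sketch under our reading of the description: a 2048-dimensional input, two fully connected layers, a single linear output trained with MSE loss, and RMSProp with a learning rate of 10⁻⁴. The ReLU activations and the fit call are assumptions.

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop

def build_regressor(n1=128, n2=512):
    """Two fully connected layers followed by a single regression output."""
    model = Sequential([
        Dense(n1, activation="relu", input_shape=(2048,)),
        Dense(n2, activation="relu"),
        Dense(1, activation="linear"),  # predicted valence or arousal score
    ])
    model.compile(optimizer=RMSprop(lr=1e-4), loss="mse")
    return model

# Separate models are trained for valence and arousal, e.g.:
# valence_model = build_regressor(128, 512)
# valence_model.fit(train_features, train_valence, epochs=20,
#                   validation_data=(val_features, val_valence))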

2.4    Post-processing
After obtaining the valence/arousal results, we apply an average sliding window (a moving average) to smooth out random noise. We tested window sizes of 3, 5, and 7.
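A minimal sketch of this smoothing step, assuming a centered moving average over the per-second predictions; the edge handling is our choice and is not specified in the paper.

import numpy as np

def smooth(predictions, window=7):
    """Centered moving average over a 1-D sequence of per-second scores."""
    pad = window // 2
    padded = np.pad(np.asarray(predictions, dtype=float), pad, mode="edge")
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode="valid")

# Example: smoothed_valence = smooth(valence_per_second, window=7)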

3     RESULTS AND ANALYSIS
In this section, we describe in detail the experimental specification, the five runs that we submitted for the valence-arousal subtask, and the results.

3.1    Experimental specification
The experiments were run on a Google Compute Engine instance with 2 vCPUs, 7.5 GB of RAM, and an Nvidia Tesla K80 GPU. On average, extracting the 93,406 frames of the training set took about 1 hour, extracting features with ResNet-50 took about 40 minutes, and training each model took about 3 minutes.
3.2    Submitted runs
We tested all trained models on the validation set. After obtaining the results of all models on the validation set, we sorted the models by mean squared error and selected the Top-1 to Top-4 models for submission. The details of each run are listed below.
   All runs take ResNet-50 features as input. For Runs 2 to 5, the results are post-processed with the sliding window technique with a window size of 7.
      • Run 1: For both valence and arousal, a 2-layer fully connected neural network with 128 nodes on the first layer and 512 nodes on the second layer, trained for 20 epochs.
      • Run 2: The same models as Run 1, but with the sliding window technique (window size 7) applied to smooth out random noise.
      • Run 3: For valence, a 2-layer fully connected neural network with 256 nodes on the first layer and 512 nodes on the second layer, trained for 10 epochs. For arousal, 512 nodes on the first layer and 512 nodes on the second layer, trained for 15 epochs.
      • Run 4: For valence, 256 nodes on the first layer and 512 nodes on the second layer, trained for 15 epochs. For arousal, 128 nodes on the first layer and 512 nodes on the second layer, trained for 10 epochs.
      • Run 5: For both valence and arousal, 512 nodes on the first layer and 512 nodes on the second layer, trained for 10 epochs.

3.3    Results and Analysis

           Table 1: Results of the valence-arousal subtask.

                  Valence               Arousal
    Runs       MSE        r          MSE        r
    Run 1    0.11936   0.10665    0.17448   0.05282
    Run 2    0.11504   0.14565    0.17055   0.07525
    Run 3    0.11943   0.14513    0.17443   0.06978
    Run 4    0.11731   0.14097    0.17901   0.01877
    Run 5    0.11526   0.14306    0.17282   0.09123

Figure 3: Examples of frame similarity between the training set and test set on valence

   As shown in Table 1, Run 2 obtains the best result for the valence-arousal subtask, although the differences between the runs are generally small. Comparing Run 1 with Run 2, applying the sliding window technique yields better results. As shown in Figure 3, there is a noticeable similarity between frames of the training set and the test set for valence.
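For reference, the two official metrics can be computed per movie as sketched below, assuming the per-second predictions and annotations are aligned 1-D arrays; the variable names are ours.

import numpy as np
from scipy.stats import pearsonr

def evaluate(predicted, ground_truth):
    """Return (MSE, Pearson r) between per-second predictions and annotations."""
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    mse = np.mean((predicted - ground_truth) ** 2)
    r, _ = pearsonr(predicted, ground_truth)
    return mse, r

# Example: mse_val, r_val = evaluate(smoothed_valence, annotated_valence)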

4     CONCLUSION
We propose a simple method to evaluate the emotional impact, i.e. the valence and arousal properties, of a video frame. We study several settings of the prediction modules with one to two fully connected layers and different numbers of nodes in each layer to select an appropriate model for each property. Experimental results demonstrate that although our method is simple, it achieves promising results for this task. This is an initial step towards developing better methods that utilize the temporal information of frame sequences as well as other media types, such as audio and text.

ACKNOWLEDGMENTS
We would like to express our appreciation to the Multimedia Communications Laboratory, University of Information Technology, VNU-HCM, Vietnam, and the Software Engineering Laboratory, University of Science, VNU-HCM, Vietnam.

REFERENCES
 [1] Yoann Baveye, Emmanuel Dellandréa, Christel Chamaret, and Liming Chen. 2015. LIRIS-
     ACCEDE: A Video Database for Affective Content Analysis. IEEE
     Transactions on Affective Computing 6, 1 (Jan.-March 2015), 43–55.


 [2] L. Canini, S. Benini, and R. Leonardi. 2013. Affective Recommendation
     of Movies Based on Selected Connotative Features. IEEE Transactions
     on Circuits and Systems for Video Technology 23, 4 (April 2013), 636–647.
     https://doi.org/10.1109/TCSVT.2012.2211935
 [3] Emmanuel Dellandréa, Martijn Huigsloot, Liming Chen, Yoann Bav-
     eye, Zhongzhe Xiao, and Mats Sjöberg. 2018. The MediaEval 2018
     Emotional Impact of Movies Task. In MediaEval 2018 Workshop.
 [4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep
     Residual Learning for Image Recognition. In CVPR. IEEE Computer
     Society, 770–778.
 [5] Rajiv Ratn Shah, Yi Yu, and Roger Zimmermann. 2014. ADVISOR:
     Personalized Video Soundtrack Recommendation by Late Fusion with
     Heuristic Rankings. In Proceedings of the 22nd ACM International
     Conference on Multimedia (MM ’14). ACM, New York, NY, USA, 607–
     616. https://doi.org/10.1145/2647868.2654919
 [6] K. Yadati, H. Katti, and M. Kankanhalli. 2014. CAVVA: Computational
     Affective Video-in-Video Advertising. IEEE Transactions on Multimedia
     16, 1 (Jan 2014), 15–23. https://doi.org/10.1109/TMM.2013.2282128
 [7] Sicheng Zhao, Hongxun Yao, Xiaoshuai Sun, Xiaolei Jiang, and Pengfei
     Xu. 2013. Flexible Presentation of Videos Based on Affective Content
     Analysis. In Advances in Multimedia Modeling, Shipeng Li, Abdul-
     motaleb El Saddik, Meng Wang, Tao Mei, Nicu Sebe, Shuicheng Yan,
     Richang Hong, and Cathal Gurrin (Eds.). Springer Berlin Heidelberg,
     Berlin, Heidelberg, 368–379.