=Paper= {{Paper |id=Vol-1984/Mediaeval_2017_paper_13 |storemode=property |title=MIC-TJU in MediaEval 2017 Emotional Impact of Movies Task |pdfUrl=https://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_13.pdf |volume=Vol-1984 |authors=Yun Yi,Hanli Wang,Jiangchuan Wei |dblpUrl=https://dblp.org/rec/conf/mediaeval/YiWW17 }} ==MIC-TJU in MediaEval 2017 Emotional Impact of Movies Task== https://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_13.pdf
MIC-TJU in MediaEval 2017 Emotional Impact of Movies Task

Yun Yi1,2, Hanli Wang2,*, Jiangchuan Wei2
1 Department of Mathematics and Computer Science, Gannan Normal University, Ganzhou 341000, China
2 Department of Computer Science and Technology, Tongji University, Shanghai 201804, China
ABSTRACT
To predict the emotional impact and fear of movies, we propose a framework which employs four audio-visual features. In particular, we utilize the features extracted by the methods of motion keypoint trajectory and convolutional neural networks to depict the visual information, and extract a global and a local audio feature to describe the audio cues. An early fusion strategy is employed to combine the vectors of these features. Then, linear support vector regression and support vector machines are used to learn the affective models. The experimental results show that the combination of these features achieves promising performance.
1 INTRODUCTION
The 2017 Emotional Impact of Movies Task is a challenging task that contains two subtasks (i.e., valence-arousal prediction and fear prediction). A brief introduction to this challenge is given in [3]. In this paper, we mainly introduce the system architecture and the algorithms used in our framework, and discuss the evaluation results.

2 FRAMEWORK
The key components of the proposed framework are shown in Fig. 1, and the highlights of our framework are introduced below.

[Fig. 1: The four audio-visual features are extracted from the input video and fed to SVM/SVR models to produce the predictions.]
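As a concrete illustration of this overall pipeline, the following is a minimal sketch, assuming scikit-learn's linear SVR and linear SVM as the learners; the array names are illustrative placeholders, and the exact solver and hyperparameters are not specified in this excerpt.

```python
import numpy as np
from sklearn.svm import LinearSVR, LinearSVC

# Early fusion: concatenate the per-segment vectors of the four features
# (EmoBase10, MFCC, MKT, ConvNets); each *_X array has one row per segment.
X_train = np.hstack([emobase10_X, mfcc_X, mkt_X, convnets_X])

# Valence-arousal prediction is a regression problem (one model per dimension),
# while fear prediction is treated as binary classification.
valence_model = LinearSVR().fit(X_train, valence_labels)
arousal_model = LinearSVR().fit(X_train, arousal_labels)
fear_model = LinearSVC().fit(X_train, fear_labels)

# X_test must be fused from the same four features in the same order.
valence_scores = valence_model.predict(X_test)
fear_flags = fear_model.predict(X_test)
```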
2.1 Feature Extraction
In this framework, we evaluate four features: the EmoBase10 feature [5], the Mel-Frequency Cepstral Coefficients (MFCC) feature [4], the Motion Keypoint Trajectory (MKT) feature [15], and the Convolutional Networks (ConvNets) feature [12, 14].

2.1.1 MFCC Feature. In affective content analysis, the audio modality is essential, and MFCC is a widely used local audio feature. The time window of MFCC is set to 32 ms, with a 50% overlap between two adjacent windows. To improve performance, we append the delta and double-delta coefficients of the 20-dimensional MFCC vector to the original vector, which yields a 60-dimensional MFCC vector. We apply Principal Component Analysis (PCA) to reduce the dimension of this local feature, and use the Fisher Vector (FV) model [10] to represent a whole audio file with a single signature vector. The number of clusters of the Gaussian Mixture Model (GMM) is set to 512, and the signed square root and L2 normalization are applied to the resulting vectors. In our experiments, we use the toolbox provided by [4] to compute the MFCC vectors.
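A minimal sketch of this audio pipeline is given below, assuming librosa for MFCC extraction and scikit-learn for PCA and the GMM. The Fisher Vector here keeps only the gradients with respect to the GMM means (a common simplification), the PCA dimension of 30 is an assumption not stated in this excerpt, and the file lists are placeholders; the authors' actual implementation is the toolbox of [4].

```python
import numpy as np
import librosa
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def mfcc_60d(wav_path, sr=16000):
    """20-dim MFCC with a 32 ms window and 50% overlap, stacked with
    its delta and double-delta coefficients (60 dims per frame)."""
    y, _ = librosa.load(wav_path, sr=sr)
    n_fft = int(0.032 * sr)                       # 32 ms analysis window
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                             n_fft=n_fft, hop_length=n_fft // 2)  # 50% overlap
    feat = np.vstack([m, librosa.feature.delta(m),
                      librosa.feature.delta(m, order=2)])
    return feat.T                                 # (frames, 60)

def fisher_vector(x, gmm):
    """FV from the gradients w.r.t. the GMM means only (simplified),
    followed by signed square root and L2 normalization."""
    q = gmm.predict_proba(x)                      # (frames, K) soft assignments
    diff = (x[:, None, :] - gmm.means_) / np.sqrt(gmm.covariances_)
    g = (q[:, :, None] * diff).sum(0) / (len(x) * np.sqrt(gmm.weights_)[:, None])
    fv = g.ravel()
    fv = np.sign(fv) * np.sqrt(np.abs(fv))        # signed square root
    return fv / (np.linalg.norm(fv) + 1e-12)      # L2 normalization

# 'train_wavs' and 'all_wavs' are placeholder lists of audio file paths.
frames = np.vstack([mfcc_60d(p) for p in train_wavs])
pca = PCA(n_components=30).fit(frames)            # reduced dimension is an assumption
gmm = GaussianMixture(n_components=512, covariance_type="diag",
                      random_state=0).fit(pca.transform(frames))
signatures = [fisher_vector(pca.transform(mfcc_60d(p)), gmm) for p in all_wavs]
```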
2.1.2 EmoBase10 Feature. To depict audio information, we extract the EmoBase10 feature [5, 11], which is a global, high-level audio feature. As suggested by [5, 11], the default parameters are utilized to extract the 1,582-dimensional vector of EmoBase10. This dimensionality results from: (1) 21 functionals applied to 34 Low-Level Descriptors (LLD) and their 34 corresponding delta coefficients, (2) 19 functionals applied to the 4 pitch-based LLD and their 4 delta coefficient contours, and (3) the number of pitch onsets and the total duration of the input [5, 11]; that is, 21 × 68 + 19 × 8 + 2 = 1,582.
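This feature set matches the "emobase2010" configuration distributed with the openSMILE toolkit [5], which produces exactly this 1,582-dimensional vector. A minimal sketch of invoking it from Python is shown below; the paths are placeholders, and the SMILExtract binary is assumed to be on the executable path.

```python
import subprocess

def extract_emobase10(wav_path, out_path="emobase10.arff"):
    """Extract the 1,582-dim EmoBase10 vector with openSMILE's stock
    emobase2010 configuration, using its default parameters."""
    subprocess.run(
        ["SMILExtract",
         "-C", "config/emobase2010.conf",  # config file shipped with openSMILE
         "-I", wav_path,                   # input audio file
         "-O", out_path],                  # appends one feature row (ARFF format)
        check=True)
    return out_path
```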
Then, the