=Paper= {{Paper |id=Vol-1739/MediaEval_2016_paper_15 |storemode=property |title=Technicolor@MediaEval 2016 Predicting Media Interestingness Task |pdfUrl=https://ceur-ws.org/Vol-1739/MediaEval_2016_paper_15.pdf |volume=Vol-1739 |dblpUrl=https://dblp.org/rec/conf/mediaeval/ShenDD16 }} ==Technicolor@MediaEval 2016 Predicting Media Interestingness Task== https://ceur-ws.org/Vol-1739/MediaEval_2016_paper_15.pdf
Technicolor@MediaEval 2016 Predicting Media Interestingness Task

Yuesong Shen (1,2), Claire-Hélène Demarty (2), Ngoc Q. K. Duong (2)
(1) École polytechnique, France
(2) Technicolor, France

ABSTRACT
This paper presents the work done at Technicolor for the MediaEval 2016 Predicting Media Interestingness Task, which aims at predicting the interestingness of individual images and video segments extracted from Hollywood movies. We participated in both the image and video subtasks.

1. INTRODUCTION
The MediaEval 2016 Predicting Media Interestingness Task aims at predicting the level of interestingness of multimedia content, i.e., frames and/or video excerpts from Hollywood-type movies. The task is divided into two subtasks depending on the type of content, i.e., images or video segments. A complete description of the task can be found in [2].

For the image subtask, Technicolor's contribution is twofold: a Support Vector Machine (SVM)-based system is compared with several deep neural network (DNN) structures: multi-layer perceptrons (MLP), residual networks (ResNet) and highway networks. For the video subtask, we compare different DNN-based systems, including both the existing Long Short-Term Memory (LSTM) with ResNet blocks and a proposed architecture named Circular State-Passing Recurrent Neural Network (CSP-RNN).

The paper is divided into two main parts corresponding to the two subtasks. Before this, Section 2 describes the features used. In each subtask's section, we present the systems built, then give some details on the derived runs. The results for the two subtasks are discussed in Section 5.

2. FEATURES
For both subtasks, input features for the visual modality are the CNN features extracted from the fc7 layer of the pre-trained CaffeNet model [5]. They were computed for each image of the dataset, after the image was resized so that its smaller dimension is 227 and center-cropped so that the aspect ratio is preserved. The mean image that comes with the Caffe model was also subtracted for normalization. The final feature dimension is 4096 per image, or per video frame for the video subtask.

For the audio modality, 60 Mel Frequency Cepstral Coefficients (MFCCs) concatenated with their first and second derivatives were extracted from windows of size 80 ms centered around each frame, resulting in an audio feature vector of size 180 per video frame. The mean value of the feature vector over the whole dataset is then subtracted for normalization.

3. IMAGE SUBTASK
For the image subtask, the philosophy was to experiment with several DNN structures and to compare them with an SVM-based baseline. For all system types and both subtasks (image and video), the best parameter configurations were chosen by splitting the development set (either MediaEval data or external data) into training (80%) and validation (20%) sets. As the MediaEval development dataset is not very large, we performed a final training on the whole development set when building the final models, except for the CSP-based runs, for which this final training was omitted.

SVM-based system. We tested SVMs with different kernels (linear, RBF and polynomial) and with different parameter settings on the development dataset. We observed that the validation accuracy varies from one run to another (even for the same parameter setting) as the training samples change due to the random partition into the training and validation sets. This may suggest that the dataset is not large enough. Also, because of the class imbalance, the validation accuracy used for training makes it difficult to choose the best SVM configuration during grid search, which targets the optimization of the official MAP metric.

MLP-based system. Several variations of network structures were tested, with different numbers of layers, layer sizes, activation functions and topologies, among which simple MLP, residual [3] and highway [7] networks. These structures were first trained on a balanced dataset of 200,000 images, extracted using the Flickr interestingness API [1]. As this API uses some social metadata associated with the content, it may lead to a social definition of interestingness, instead of a more content-driven interestingness, which may bias the system performance on the official dataset. The best performance in terms of accuracy on the Flickr dataset was obtained by a simple structure: a first dense layer of size 1000 with rectified linear unit (ReLU) activation and a dropout of 0.5, followed by a final softmax layer. This structure was then re-trained on the MediaEval image development dataset, but with the addition of some resampling or upsampling steps on the training data, to cope with the imbalance of the two classes in the official dataset. During resampling, a training sample is selected randomly from one class or the other depending on a probability fixed beforehand. Upsampling consists of putting multiple occurrences of each interesting sample into the list of training data, re-
(0.3 to 0.6 for the interesting class) or upsampling proportions (5 to 13 times more interesting samples) were tested.

Run #1: SVM-based. The SVM from the Python Scikit-learn package (sklearn.svm.SVC) with RBF kernel, gamma = 1/n_features, C = 0.1 and default settings for the other parameters is used. An upsampling strategy that enlarges the set of interesting samples by a factor of 11 is used.

Run #2: MLP-based. A simple structure with 2 layers of sizes (1000, 2) (cf. Section 3) was selected for its performance on the Flickr dataset. The best performance with this structure was obtained with a learning rate of 0.1, a decay rate of 0.1, the ReLU activation function, the Adadelta optimization method and a maximum of 10 epochs. Resampling with a probability of 0.6 for the interesting class gives the best MAP value on the MediaEval development set.

4. VIDEO SUBTASK
Different DNN structures capable of handling the temporal evolution of the data were tested, with variations in size and depth. We also investigated the performance of the different modalities separately vs. in a multimodal approach.

Systems based on existing structures
Different simple RNN and LSTM [4] structures were tested by varying their number and size of layers, as they are well known to be able to handle the temporal aspect of the data. We also experimented with the idea of ResNet (recently proposed for CNN) in our implementation with RNN and LSTM. Monomodal systems (audio-only, visual-only) were also compared to multimodal (audio+visual) ones. For the latter, a mid-level multimodal fusion was implemented, i.e., each modality was first processed independently through one or more layers before merging and processing by some additional layers, as illustrated in Figure 1. The best structures and sets of parameters were chosen while training on the Flickr part of the Jiang dataset [6]. Contrary to the MediaEval dataset, this dataset contains 1200 longer videos, equally balanced between the two classes. Once the structures and parameters were chosen, some upsampling/resampling process was applied while re-training on the MediaEval dataset.

Run #3: LSTM-ResNet-based. The best structure obtained while validating on the Jiang dataset corresponds to Figure 1, with a multimodal processing part composed of a residual block built upon 2 LSTM layers of size 436. After re-training on the MediaEval dataset, upsampling with a factor of 9 was applied to the input samples.

Figure 1: Proposed multimodal architecture.

Systems based on the proposed CSP
Figure 2 illustrates the philosophy of this new structure, which can be seen as a generalization of the traditional RNN, in which at a time instance t, N samples of the input sequence go through N recurrent nodes arranged in a circular structure (allowing both the past and the future to be taken into account over a temporal window of size N) to produce N internal outputs. These outputs are then combined to form a final output at t. This architecture targets a better modeling of temporal events in videos by considering several successive temporal samples and states to produce an output at each time instance (while an RNN takes one input sample and one state to produce an output at each time instance). The CSP-RNN was trained directly on the MediaEval dataset, and for three fixed configurations only: audio, video and multimodal.

Figure 2: Proposed CSP-RNN structure.

Run #4: video and CSP-based. For the audio and video configurations, the features were used directly as input to the CSP network, with a dimension reduction from 4096 to 256 for the video, by means of a simple dense layer. No upsampling/resampling was applied. During validation on the MediaEval dataset, the audio-only CSP configuration gave lower performance than the video and multimodal configurations. We therefore kept the video system for Run #4.

Run #5: multimodal and CSP-based. For the multimodal configuration, the same framework proposed in Figure 1 was kept, with the multimodal processing part replaced by a CSP of size 5. No upsampling/resampling was tested.

5. RESULTS AND DISCUSSION
The obtained results are reported in terms of MAP in Table 1, with baseline values computed from a random ranking of the test data. At least on the image subtask, our systems perform significantly better than the baseline. For the video subtask, MAP values are lower and we may wonder whether these performances come in part from the difficulty of the task itself or from the dataset, which contains a significant number of very short shots that were certainly difficult to

Runs - image       MAP       Runs - video       MAP
Run #1             0.2336    Run #3             0.1465
Run #2             0.2315    Run #4             0.1365
                             Run #5             0.1618
Random baseline    0.1656    Random baseline    0.1496

Table 1: Results in terms of MAP values.

For the image subtask, we observed that simple SVM systems perform similarly (development set) and even slightly better (test set) than more sophisticated DNNs, leading to the conclusion that the size of the dataset was probably not large enough for DNN training. This is also supported by our test on the external Flickr dataset containing 200,000 images, for which the DNN reached more than 80% accuracy.

For the video subtask, several conclusions may be drawn. First, multimodality seems to bring benefit to the task (this was confirmed by some additional but not submitted runs). Second, the new CSP structure seems to be able to capture the temporal evolution of the videos better than classic RNN and more sophisticated LSTM-ResNet structures, and this independently of the monomodal branches, which were the same in both cases. These very first results support the need for further testing of this new structure in the future.
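
As an illustration of the MAP metric and the random-ranking baseline referred to in Section 5, the following minimal sketch computes average precision for a ranked list of binary labels and estimates the MAP of a random ranking by shuffling. The function names, the grouping of items into one list per movie, and the number of shuffling trials are assumptions made for this example; it is not the official evaluation tool of the task.

```python
import random


def average_precision(ranked_labels):
    """AP of one ranked list; ranked_labels[i] is 1 if the item at rank i+1 is interesting."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_lists):
    """Mean of the per-list APs (one list per movie, by assumption)."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


def random_baseline_map(label_lists, trials=1000, seed=0):
    """Estimate the MAP of a random ranking by repeatedly shuffling the labels."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        shuffled = []
        for labels in label_lists:
            labels = list(labels)  # copy so the caller's data is untouched
            rng.shuffle(labels)
            shuffled.append(labels)
        total += mean_average_precision(shuffled)
    return total / trials
```

With imbalanced labels, the random-ranking MAP stays well above zero, which is why a baseline of this kind is a useful reference point for the values in Table 1.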

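
The resampling and upsampling strategies described in Section 3 for handling class imbalance can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the submitted runs: the function names are hypothetical, label 1 denotes the interesting class, and an upsampling factor is interpreted here as the number of occurrences of each interesting sample in the training list.

```python
import random


def upsample(samples, labels, factor):
    """Put `factor` occurrences of each interesting (label 1) sample in the training list."""
    out_samples, out_labels = [], []
    for x, y in zip(samples, labels):
        copies = factor if y == 1 else 1
        out_samples.extend([x] * copies)
        out_labels.extend([y] * copies)
    return out_samples, out_labels


def resample(samples, labels, p_interesting, n_draws, seed=0):
    """Draw n_draws training samples, choosing the interesting class with a fixed probability."""
    rng = random.Random(seed)
    pos = [x for x, y in zip(samples, labels) if y == 1]
    neg = [x for x, y in zip(samples, labels) if y == 0]
    out_samples, out_labels = [], []
    for _ in range(n_draws):
        if rng.random() < p_interesting:
            out_samples.append(rng.choice(pos))  # pick from the interesting class
            out_labels.append(1)
        else:
            out_samples.append(rng.choice(neg))  # pick from the other class
            out_labels.append(0)
    return out_samples, out_labels
```

For example, `upsample(x, y, 11)` mirrors the factor of 11 reported for Run #1, and `resample(x, y, 0.6, n)` mirrors the probability of 0.6 reported for Run #2, under the interpretation stated above.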