=Paper=
{{Paper
|id=None
|storemode=property
|title=Audio-Visual content description for video genre classification in the context of social media
|pdfUrl=https://ceur-ws.org/Vol-807/Ionescu_RAF_Genre_me11wn.pdf
|volume=Vol-807
|dblpUrl=https://dblp.org/rec/conf/mediaeval/IonescuSVL11
}}
==Audio-Visual content description for video genre classification in the context of social media==
Audio-Visual Content Description for Video Genre Classification in the Context of Social Media

Bogdan Ionescu (1,3), Klaus Seyerlehner (2), Constantin Vertan (1), Patrick Lambert (3)

(1) LAPI - University Politehnica of Bucharest, 061071 Bucharest, Romania, {bionescu,cvertan}@alpha.imag.pub.ro
(2) DCP - Johannes Kepler University, A-4040 Linz, Austria, klaus.seyerlehner@gmail.com
(3) LISTIC - Polytech Annecy-Chambery, B.P. 80439, 74944 France, patrick.lambert@univ-savoie.fr

Copyright is held by the author/owner(s). MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy.

ABSTRACT

In this paper we address automatic video genre classification with descriptors extracted from both the audio (block-based features) and the visual (color- and temporal-based) modalities. Tests performed on 26 genres from the blip.tv media platform demonstrate the potential of these descriptors for this task.

Categories and Subject Descriptors

I.2.10 [Artificial Intelligence]: Vision and Scene Understanding - audio, color and action descriptors; I.5.3 [Pattern Recognition]: Clustering - video genre.

Keywords

Block-based audio features, color perception, action content, video genre classification.

1. INTRODUCTION

In this paper we address the issue of automatic video genre classification in the context of social media platforms, as part of the MediaEval 2011 Benchmarking Initiative for Multimedia Evaluation (see http://www.multimediaeval.org/). The challenge is to provide solutions for distinguishing between up to 26 common genres, such as "art", "autos", "business", "comedy", "food and drink", "gaming", and so on [2]. Validation is carried out on video footage from the blip.tv media platform (see http://blip.tv/).

We approach this task globally, from the classification point of view, and focus on the feature extraction step (for a state-of-the-art survey of the literature see [1]). In our approach, we extract information from both the audio and the visual modalities. While these sources of information have already been exploited for genre classification, the novelty of our approach lies in the content descriptors we use.

2. VIDEO CONTENT DESCRIPTION

Audio descriptors. Most common video genres tend to have very specific audio signatures, e.g. music clips contain music, sports footage contains the specific crowd noise, etc. To address this specificity, we propose audio descriptors related to rhythm, timbre, onset strength, noisiness and vocal aspects. The proposed audio features are block-level based, which compared to classic approaches have the advantage of capturing local temporal information by analyzing sequences of consecutive frames in a time-frequency representation. Audio information is described with parameters such as: spectral pattern (characterizes the soundtrack's timbre), delta spectral pattern (captures the strength of onsets), variance delta spectral pattern (captures the variation of the onset strength over time), logarithmic fluctuation pattern (captures the rhythmic aspects), spectral contrast pattern (estimates "tone-ness") and correlation pattern (captures the temporal relation of loudness changes over different frequency bands). For more information see [3].

Temporal descriptors. The genre specificity is also reflected at the temporal level, e.g. music clips tend to have a high visual tempo, documentaries have a reduced action content, etc. To address these aspects we detect sharp transitions (cuts) and two of the most frequent gradual transitions (fades and dissolves). Based on this information, we assess rhythm as the movie's average shot change speed computed over 5 s time windows (which provides information about the movie's changing tempo), and action in terms of a high action ratio (e.g. fast changes, fast motion, visual effects, etc.) and a low action ratio (the occurrence of static scenes). The action level is determined based on user ground truth [4].

Color descriptors. Finally, many genres have specific color palettes, e.g. sports tend to have predominant hues, indoor scenes have different lighting conditions than outdoor scenes, etc. We assess color perception by projecting colors onto a color naming system (associating names with colors allows everyone to create a mental image of a given color or color mixture). We compute a global weighted color histogram (the movie's color distribution), an elementary color histogram (the distribution of basic hues), light/dark, saturated/weakly-saturated and warm/cold color ratios, color variation (the amount of different colors in the movie), color diversity (the amount of different hues) and adjacency/complementarity color ratios. For more information on the visual descriptors see [4].

3. EXPERIMENTAL RESULTS

Results on development data. A first validation was performed on the provided development data set (247 sequences), which was eventually extended to 648 sequences in order to provide a consistent training data set for classification (source: blip.tv; these sequences are different from the ones proposed for the official runs).
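To make the rhythm descriptor from Section 2 concrete, the following is a minimal sketch (not the authors' code) of computing shot-change counts over 5 s windows from a list of detected shot boundaries; the frame rate, helper names and example cut positions are illustrative assumptions only:

```python
# Illustrative sketch: rhythm as shot-change speed over 5-second windows.
# Assumes shot boundaries (cuts/fades/dissolves) were already detected;
# fps, window length and the example values below are hypothetical.

def rhythm_profile(boundary_frames, total_frames, fps=25.0, window_s=5.0):
    """Return the number of shot changes in each 5 s window."""
    window = int(window_s * fps)               # window length in frames
    n_windows = max(1, total_frames // window)
    counts = [0] * n_windows
    for b in boundary_frames:
        idx = min(b // window, n_windows - 1)  # clamp trailing boundaries
        counts[idx] += 1
    return counts

def average_rhythm(counts):
    """Movie-level rhythm: mean shot-change count over all windows."""
    return sum(counts) / len(counts)

# Example: a 1-minute clip at 25 fps with cuts detected at these frames.
cuts = [40, 130, 260, 410, 700, 720, 745, 1100, 1350]
profile = rhythm_profile(cuts, total_frames=1500)
print(average_rhythm(profile))  # → 0.75
```

A high average (many cuts per window) would point toward fast-paced genres such as music clips, while a low one suggests static content such as documentaries, in line with the observations above.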
We observed that, in the case of some of the proposed genres, the genre-specific content is captured mainly by the textual information. Therefore, our tests focused mainly on genres with specific audio-visual contents, like "art", "food", "cars", "sports", etc. (for which we provided a representative number of examples).

Tests were performed using a cross-validation approach. We use p% of the existing sequences for training (randomly selected and uniformly distributed with respect to genre) and the remainder for testing. Experiments were repeated for different combinations of training and test sets (e.g. 1000 repetitions). Figure 1 presents the average Fscore = 2 * P * R / (P + R) (where P and R are the average precision and recall, respectively, over all repetitions) for p = 50%, the audio-color-action descriptor set (i.e. the descriptor set which provided the most accurate results) and various classification approaches (see Weka at http://www.cs.waikato.ac.nz/ml/weka/). The numbers in brackets represent the number of test sequences used for each genre.

Figure 1: Average F-score achieved using all audio-visual descriptors, and genre classification "performance" for the best run, i.e. SVM with a linear kernel (graph on top). (The compared classifiers include BayesNet, DecisionTable, FT, HyperPipes, J48, NNge, NaiveBayes, RBFNetwork, RandomForest, RandomTree, Ridor, SVM with linear kernel, VFI and kNN.)

From a global point of view, the best results are obtained with SVM and a linear kernel, followed by k-NN (k = 3) and FT (Functional Trees). At genre level, the best accuracy is obtained for genres with particular audio-visual signatures. The graph on top of Figure 1 presents a measure of the individual genre classification "performance", computed as the Fscore times the number of test sequences used; an Fscore obtained for a greater number of sequences is more representative than one obtained for only a few (values are normalized to 1 for visualization purposes). The proposed descriptors provided good discriminative power for genres like (the number in brackets is the Fscore): "food and drink" (0.757), "web development and sites" (0.697), "travel" (0.633) and "politics" (0.552), while at the bottom end are genres whose contents are less reflected in the audio-visual information, e.g. "citizen journalism", "business", "comedy" (see Figure 1).

Results on test data. For the final official runs, classification was performed on 1727 sequences, with training performed on the previous data set (648 sequences). The overall results obtained in terms of MAP (Mean Average Precision) are less accurate than the previous ones: 0.077 for k-NN on audio-color-action, 0.027 for RandomForest on audio-color-action, 0.121 for SVM linear on audio-color-action (best run), 0.103 for SVM linear on audio and 0.038 for SVM linear on color-action. This is mainly due to the limited training data set compared to the diversity of the test sequences, and to the inclusion of the genres for which we obtain 0 precision (i.e. for which audio-visual information is not discriminant, see Figure 1). Because MAP provides only an overall average precision over all genres, we are unable to conclude which genres are better suited to be retrieved with audio-visual information and which are not.

4. CONCLUSIONS AND FUTURE WORK

The proposed descriptors performed well for some of the genres; however, to improve the classification performance a more consistent training database is required. Also, our approach is more suitable for classifying genre patterns from a global point of view, like episodes of a series, being unable to detect genre-related content within a sequence. Future tests will consist of performing cross-validation on all 2375 sequences (development + test sets).

5. ACKNOWLEDGMENTS

Part of this work has been supported under the Financial Agreement EXCEL POSDRU/89/1.5/S/62557.

6. REFERENCES

[1] D. Brezeale, D.J. Cook, "Automatic Video Classification: A Survey of the Literature," IEEE Trans. on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(3), pp. 416-430, 2008.
[2] M. Larson, M. Eskevich, R. Ordelman, C. Kofler, S. Schmiedeke, G.J.F. Jones, "Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task," MediaEval 2011 Workshop, Pisa, Italy, 2011.
[3] K. Seyerlehner, M. Schedl, T. Pohle, P. Knees, "Using Block-Level Features for Genre Classification, Tag Classification and Music Similarity Estimation," MIREX-10, Utrecht, Netherlands, 2010.
[4] B. Ionescu, C. Rasche, C. Vertan, P. Lambert, "A Contour-Color-Action Approach to Automatic Classification of Several Common Video Genres," AMR (LNCS 6817), Linz, Austria, 2010.
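As a supplementary illustration, the evaluation protocol of Section 3 (repeated random p% training splits, with the average Fscore = 2 * P * R / (P + R)) can be sketched as follows. This is a simplified stand-in, not the paper's pipeline: the splits here are plain random rather than stratified per genre, and the classifier callback and toy labels are placeholders:

```python
import random

def f_score(precision, recall):
    """Fscore = 2*P*R / (P + R), as used in Section 3."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def repeated_split_eval(samples, labels, train_and_predict,
                        p=0.5, repetitions=10, seed=0):
    """Average F-score over repeated random train/test splits.

    `train_and_predict(train, test)` is a placeholder for any classifier:
    it gets (feature, label) training pairs and a list of test features,
    and returns predicted labels for the test features.
    Note: the paper stratifies splits per genre; this sketch does not.
    """
    rng = random.Random(seed)
    data = list(zip(samples, labels))
    n_train = int(p * len(data))
    scores = []
    for _ in range(repetitions):
        rng.shuffle(data)
        train, test = data[:n_train], data[n_train:]
        preds = train_and_predict(train, [x for x, _ in test])
        truth = [y for _, y in test]
        correct = sum(1 for a, b in zip(preds, truth) if a == b)
        # Micro-averaged P and R coincide with accuracy in the
        # single-label, every-item-predicted case sketched here.
        precision = recall = correct / len(truth)
        scores.append(f_score(precision, recall))
    return sum(scores) / len(scores)

# Toy usage with a degenerate constant classifier (illustration only).
const = lambda train, test: ["music"] * len(test)
print(repeated_split_eval(list(range(8)), ["music"] * 8, const,
                          p=0.5, repetitions=3))  # → 1.0
```

With per-genre averaging of P and R over many repetitions, as in the paper, the same F-score formula yields the per-genre values reported in Section 3 (e.g. 0.757 for "food and drink").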