KIT at MediaEval 2011 - Content-based genre classification on web-videos

Tomas Semela and Hazım Kemal Ekenel
Institute for Anthropomatics
Karlsruhe Institute of Technology (KIT)
76131 Karlsruhe, Germany
tomas.semela@student.kit.edu, ekenel@kit.edu

ABSTRACT
In this paper, we run our content-based video genre classification system on the MediaEval evaluation corpus. Our system is based on several low-level audio-visual cues, as well as cognitive and structural information. The purpose of this evaluation is to assess our content-based system's performance on the diversified content of the blip.tv web-video corpus, which is described in detail in [5].

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing

Keywords
Genre classification, content-based features

Copyright is held by the author/owner(s).
MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy
1. MOTIVATION
Automatic genre classification is an important task in multimedia indexing. Several studies have been conducted on this topic. A comprehensive overview of these studies on TV genre classification can be found in [6]. Recently, there has also been an increasing interest in web video genre classification [9]¹. In this study, we evaluated our content-based system, which is based on low-level audio-visual features, on the MediaEval corpus. The utilized features correspond to low-level color and texture cues, as well as shot boundary and face detection outputs. We used these features before for detecting high-level features in videos [2] and successfully classified various TV content into genres [3]. In the following sections we give a brief overview of our system; for details, please refer to [3].

¹ Also as part of the ACM Multimedia Grand Challenge.

2. CONTENT-BASED FEATURES
2.1 Cognitive and structural features
Cognitive and structural features are proposed in [6]. The cognitive features are derived using a face detector and comprise the average number of faces per frame, the distribution of the number of faces per frame, and the distribution of the locations of the faces in the frame. The structural feature is derived using a shot boundary detector and contains the average shot duration and the distribution of shot lengths. The cognitive and the visual features presented below are extracted from 5 linearly distributed frames per shot.
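As an illustration of these statistics, the following sketch computes face-count and shot-length features in Python. The paper does not name its face detector or its exact histogram binning; OpenCV's Haar cascade and the bin edges below are assumptions made for this example.

```python
import cv2
import numpy as np

# Hypothetical stand-in: the paper does not specify the face detector,
# so OpenCV's Haar cascade is used here for illustration.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def cognitive_features(frames):
    """Average face count per frame plus a normalized histogram of
    face counts over the sampled frames (5 per shot in the paper)."""
    counts = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                              minNeighbors=5)
        counts.append(len(faces))
    hist, _ = np.histogram(counts, bins=[0, 1, 2, 3, 4, np.inf])  # 0..4+ faces
    return np.concatenate(([np.mean(counts)], hist / len(counts)))

def structural_features(shot_boundaries_sec):
    """Average shot duration and a coarse shot-length distribution,
    computed from shot boundary timestamps (in seconds)."""
    lengths = np.diff(shot_boundaries_sec)
    hist, _ = np.histogram(lengths, bins=[0, 1, 2, 5, 10, np.inf])
    return np.concatenate(([lengths.mean()], hist / len(lengths)))
```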
2.2 Aural Features
To benefit from the audio information of each clip, we compute four features from the audio signal. All features are extracted from mono-channel audio with a 16 kHz sample rate and a 256 kbit/s bit rate. The features include MFCC, Zero Crossing Rate and Signal Energy, and are utilized with different representations.
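A minimal sketch of how such per-clip audio descriptors could be computed follows. librosa is assumed here for loading and MFCC extraction (the paper's toolchain is unspecified), and the 25 ms/10 ms framing and the mean/variance pooling are illustrative choices rather than details from the paper.

```python
import numpy as np
import librosa  # assumed toolchain; the paper does not name its audio library

def aural_features(path, frame_len=400, hop=160):
    # 16 kHz mono as stated in the paper; 25 ms frames with a 10 ms hop
    # (400/160 samples) are an illustrative choice
    y, sr = librosa.load(path, sr=16000, mono=True)
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop)
    # Zero Crossing Rate: fraction of sign changes within each frame
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=0)) > 0, axis=0)
    # Short-time Signal Energy per frame
    energy = np.sum(frames ** 2, axis=0)
    # 13 MFCCs per frame (librosa's default framing)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # One possible fixed-length clip representation: means and variances
    stats = []
    for f in (zcr, energy, mfcc):
        stats.append(np.atleast_1d(f.mean(axis=-1)))
        stats.append(np.atleast_1d(f.var(axis=-1)))
    return np.concatenate(stats)
```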
2.3 Low-level Visual Features
We used six different low-level visual features which represent color and texture information in the video.

2.3.1 Color descriptors
Histogram: We use the HSV color space and build a histogram with 162 bins [8].
Color moments: We use a grid size of 5×5. The first three order color moments are calculated in each local block of the image, and the Lab color space is used [7].
Autocorrelogram: The autocorrelogram captures the spatial correlation between identical colors. 64 quantized color bins and five distances are used [4].
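The two grid/histogram descriptors can be sketched as follows. The 18×3×3 split of the 162 HSV bins is an assumption consistent with common practice (the paper only states the total bin count), and OpenCV plus scipy's skewness are stand-ins for the unspecified implementation.

```python
import cv2
import numpy as np
from scipy.stats import skew

def hsv_histogram(img_bgr):
    """162-bin HSV histogram; the 18x3x3 bin split is an assumption."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, [18, 3, 3],
                        [0, 180, 0, 256, 0, 256]).flatten()
    return hist / hist.sum()

def color_moments(img_bgr, grid=5):
    """First three color moments (mean, std, skewness) per Lab channel,
    computed on a 5x5 grid of local blocks."""
    lab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    h, w = lab.shape[:2]
    feats = []
    for i in range(grid):
        for j in range(grid):
            block = lab[i * h // grid:(i + 1) * h // grid,
                        j * w // grid:(j + 1) * w // grid]
            pix = block.reshape(-1, 3)
            feats.extend(np.r_[pix.mean(0), pix.std(0), skew(pix, axis=0)])
    return np.asarray(feats)
```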
2.3.2 Texture descriptors
Co-occurrence texture: As proposed in [1], five types of features are extracted from the gray level co-occurrence matrix (GLCM): Entropy, Energy, Contrast, Correlation and Local homogeneity.
Wavelet texture grid: We calculate the variances of the high-frequency sub-bands of the wavelet transform of each grid region. We performed a 4-level analysis on a grid that has 4 × 4 = 16 regions. The Haar wavelet is employed, as in [1].
Edge histogram: For the edge histogram, 5 filters as proposed in the MPEG-7 standard are used to extract the kind of edge in each region of 2 × 2 pixels. Then, those small regions are grouped into a certain number of areas (4 rows × 4 columns in our case) and the number of edges matched by each filter (vertical, horizontal, diagonal 45°, diagonal 135° and non-directional) is counted in the region's histogram.
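The first two texture descriptors might look as follows in Python. The GLCM distance/angle settings and the 32-level quantization are assumptions, as is the use of scikit-image and PyWavelets; the paper does not state these details.

```python
import numpy as np
import pywt  # PyWavelets, assumed here for the Haar decomposition
from skimage.feature import graycomatrix, graycoprops

def cooccurrence_features(gray_img, levels=32):
    """Energy, contrast, correlation, local homogeneity and entropy of
    the GLCM; quantization, distances and angles are illustrative."""
    q = (gray_img // (256 // levels)).astype(np.uint8)
    glcm = graycomatrix(q, distances=[1], angles=[0, np.pi / 2],
                        levels=levels, symmetric=True, normed=True)
    feats = [graycoprops(glcm, p).mean()
             for p in ("energy", "contrast", "correlation", "homogeneity")]
    # graycoprops has no entropy, so compute it from the normalized GLCM
    entropy = -np.mean(np.sum(glcm * np.log2(glcm + 1e-12), axis=(0, 1)))
    return np.array(feats + [entropy])

def wavelet_texture_grid(gray_img, grid=4, level=4):
    """Variances of the high-frequency Haar sub-bands per 4x4 grid region."""
    h, w = gray_img.shape
    feats = []
    for i in range(grid):
        for j in range(grid):
            region = gray_img[i * h // grid:(i + 1) * h // grid,
                              j * w // grid:(j + 1) * w // grid]
            coeffs = pywt.wavedec2(region, "haar", level=level)
            for detail in coeffs[1:]:  # (cH, cV, cD) sub-bands per level
                feats.extend(np.var(band) for band in detail)
    return np.asarray(feats)
```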
3. CLASSIFICATION
Classification is performed using multiple SVM classifiers. As can be seen in Fig. 1, content-based features are extracted from each video and are used as input for separate SVMs, one for each genre and feature. The classification output of each SVM is summed up over all features for each genre, and a genre is picked via majority voting.

[Figure 1: System Overview. Audio features (F0, MFCC, SP, ZCR) and video features (AC, CM, CoOc, HSV, Edge, Wavelet, Cog, Struct) are extracted from the input video and fed to per-genre, per-feature SVM models (e.g., MFCC and wavelet models for the genres "Art" and "Web"); the model outputs are summed per genre and combined by majority voting.]
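A minimal sketch of this fusion scheme follows, assuming one binary one-vs-rest SVM per (genre, feature) pair built with scikit-learn; the RBF kernel and its parameters are placeholders, since the paper does not specify the SVM configuration.

```python
import numpy as np
from sklearn.svm import SVC

class GenreFusion:
    def __init__(self, genres, feature_names):
        self.genres = list(genres)
        # one binary SVM per (genre, feature); the kernel is a placeholder
        self.models = {(g, f): SVC(kernel="rbf", probability=True)
                       for g in genres for f in feature_names}

    def fit(self, feats, labels):
        # feats: dict mapping feature name -> (n_videos, dim) array
        for (g, f), svm in self.models.items():
            svm.fit(feats[f], [1 if y == g else 0 for y in labels])

    def predict(self, feats):
        n = next(iter(feats.values())).shape[0]
        scores = np.zeros((n, len(self.genres)))
        for j, g in enumerate(self.genres):
            for f in feats:  # sum per-feature SVM outputs for each genre
                scores[:, j] += self.models[(g, f)].predict_proba(feats[f])[:, 1]
        return [self.genres[i] for i in scores.argmax(axis=1)]
```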

4. EVALUATION AND DISCUSSION
The evaluation of this year's MediaEval genre tagging task was performed on 1727 clips from blip.tv, distributed unevenly over 26 categories including a default category. Single-label classification is performed and mean average precision (MAP) is used as the official performance measure. Training of the SVMs was conducted on approximately 100 videos for each genre, except for autos and vehicles, where only 14 clips were available. These training videos are from a larger additional set of blip.tv videos. However, since we had limited time, it was not possible to process all of these videos. Therefore, we limited the number of training videos per genre to 100, which were randomly selected for each genre. Because our system works as a single-label classification system, we also computed simple classification accuracy and calculated a second MAP score using a similarity value of 1 instead of the very low probability output of our system.
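For reference, average precision over a confidence-ranked result list can be computed as below; MAP is then the mean of the per-genre average precisions. This is the standard definition of the measure, not code from the paper.

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP for one genre: ranked_relevance holds 1/0 relevance of the
    returned videos, ordered by decreasing classifier confidence."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

# MAP over all genres, e.g.:
# map_score = np.mean([average_precision(r) for r in per_genre_rankings])
```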
All in all, 5 runs were evaluated using these three evaluation measures. In our case, a combination of all feature sets (run1) and each feature category, i.e., visual (run2), aural (run3), cognitive (run4) and structural (run5), are evaluated independently. The results are presented in Table 1. The least contribution comes from the cognitive features, while the visual features (run2) contribute the most to the overall performance, outperforming the other runs in the MAP performance measures and achieving almost the same classification accuracy as all feature sets together. Of the six available visual features, color moments and wavelet texture show the best classification results with 20% and 23%, respectively.
The best results (above 50%) were achieved in the web development (66.6%), mainstream media (68.9%), food and drink (61.1%), movies and television (58.5%) and literature (89.6%) categories. The worst results (under 10%) were obtained for documentary (4.5%), educational (3.2%), health (9.5%), travel (7.1%) and videoblogging (0%).

               run1     run2     run3     run4     run5
MAP            0.0023   0.0035   0.001    0.001    0.003
2nd MAP        0.0038   0.006    0.001    0.0012   0.0028
Accuracy (%)   28.2     27.5     13.9     1.3      5.4

               Table 1: Evaluation Results

Our experiments show that a content-based system which is able to achieve nearly perfect accuracy on TV datasets (95% and 99%, see [3]) and also very high performance on a YouTube dataset (92.4%) is not able to achieve high performance on the blip.tv corpus. The main reason for this might be the increased number of genres to be classified and the high intra-class diversity, which make the genres difficult to separate from each other using content-based cues.
More interestingly, the low-level visual and aural features show more promising results than the selected higher-level cognitive and structural cues. Either the variety and the overall resemblance of the videos cannot be covered with these features, or more promising high-level features have to be found by analyzing the properties of the web-videos.
Because of the limits of content-based systems in this area, the usage of metadata and other sources such as ASR engines is desirable in order to attain a robust genre classification system.

Acknowledgments
This study is funded by OSEO, French State agency for innovation, as part of the Quaero Programme.

5. REFERENCES
[1] M. Campbell, E. Haubold, S. Ebadollahi, D. Joshi, M. R. Naphade, A. P. Natsev, J. Seidl, J. R. Smith, K. Scheinberg, and L. Xie. IBM Research TRECVID-2006 video retrieval system. In Proc. of NIST TRECVID Workshop, 2006.
[2] H. K. Ekenel, H. Gao, and R. Stiefelhagen. Universität Karlsruhe (TH) at TRECVID 2008. In NIST TRECVID Workshop, Gaithersburg, USA, Nov. 2008.
[3] H. K. Ekenel, T. Semela, and R. Stiefelhagen. Content-based video genre classification using multiple cues. In Proceedings of the 3rd International Workshop on Automated Information Extraction in Media Production, AIEMPro'10, pages 21–26, 2010.
[4] J. Huang, S. R. Kumar, M. Mitra, W.-J. Zhu, and R. Zabih. Image indexing using color correlograms. In Computer Vision and Pattern Recognition (CVPR), pages 762–768, 1997.
[5] M. Larson, M. Eskevich, R. Ordelman, C. Kofler, S. Schmiedeke, and G. J. F. Jones. Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task. In MediaEval 2011 Workshop, Pisa, Italy, September 1-2, 2011.
[6] M. Montagnuolo and A. Messina. Parallel neural networks for multimodal video genre classification. Multimedia Tools and Applications, 41:125–159, January 2009.
[7] M. A. Stricker and M. Orengo. Similarity of color images. In Storage and Retrieval for Image and Video Databases (SPIE), pages 381–392, 1995.
[8] M. J. Swain and D. H. Ballard. Color indexing. International Journal of Computer Vision, 7:11–32, 1991.
[9] Z. Wang, M. Zhao, Y. Song, S. Kumar, and B. Li. YouTubeCat: Learning to categorize wild web videos. In Computer Vision and Pattern Recognition (CVPR), pages 879–886, June 2010.