     Brno University of Technology at MediaEval 2011 Genre
                          Tagging Task

                                     Michal Hradiš, Ivo Řezníček, Kamil Behúň
                          Graph@FIT, Brno University of Technology, Bozetechova 2, Brno, CZ
                         {ihradis, ireznice}@fit.vutbr.cz, xbehun03@stud.fit.vutbr.cz


ABSTRACT
This paper briefly describes our approach to the video genre tagging task, which was a part of MediaEval 2011. We focused mainly on visual and audio information, and we exploited metadata and automatic speech transcripts only in a very basic way. Our approach relied on classification and on classifier fusion to combine the different sources of information. We did not use any additional training data beyond the very small exemplary set provided by MediaEval (only 246 videos). The best performance was achieved by metadata alone; combining it with the other sources of information did not improve the results in the submitted runs. An improvement was achieved later by choosing more suitable fusion weights. Excluding the metadata, audio and video gave better results than speech transcripts. Using classifiers for 345 semantic classes from the TRECVID 2011 semantic indexing (SIN) task to project the data worked better than classifying directly from video and audio features.
1.   INTRODUCTION
   Our approach was mainly motivated by the question of how the video classification approaches which we employ to solve the TRECVID SIN task [2] behave in a different context. The genre tagging task [3] is similar to SIN, except that the classes are of a different kind, videos belong as a whole to a single class and, most importantly, the provided training set is in this case more than an order of magnitude smaller.
   We attempted to exploit most of the available modalities: video, audio, automatic speech recognition [1] (ASR) and user-supplied metadata. We did not use social network information. The image features extracted from video were standard Bag of visual Words (BOW) representations commonly used for image classification [5]. Spectrograms from audio were processed in the same way as image data. BOW representations were constructed from metadata and ASR as well.
2.   METHOD
   The BOW representation of video frames was constructed in a standard way [5], starting with local patch sampling, followed by descriptor computation [4] and a codebook transform. We used the Harris-Laplace detector (HARLAP) and dense sampling with a position step of 8 pixels and a patch radius of 8 pixels (DENSE8), respectively 16 pixels (DENSE16). The extracted patches were represented by SIFT and color SIFT descriptors (SIFT, CSIFT). Codebooks were created for each representation by k-means with Euclidean distance and exact nearest neighbor search. The size of all codebooks was 4096. Local features were translated to BOW using codebook uncertainty [6] with a Gaussian kernel whose standard deviation was set to the average distance between closest neighboring codewords. The BOW vectors were normalized to L1 unit length for classification.
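   The soft-assignment step can be sketched as follows. This is only an illustrative reimplementation of codebook uncertainty [6] with the Gaussian bandwidth chosen as described above, not the authors' code; the SIFT/CSIFT descriptors and the k-means codebook are assumed to come from external tools, and the function names are ours.

import numpy as np

def codebook_sigma(codebook):
    """Average distance between each codeword and its nearest neighbour,
    used as the Gaussian kernel bandwidth described in the text."""
    sq = (codebook ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * codebook @ codebook.T
    np.fill_diagonal(d2, np.inf)
    return np.sqrt(np.maximum(d2.min(axis=1), 0.0)).mean()

def soft_assign_bow(descriptors, codebook, sigma):
    """Soft-assignment ('codebook uncertainty') BOW with a Gaussian kernel,
    L1-normalised as in the paper. descriptors: (N, D), codebook: (K, D)."""
    d2 = ((descriptors ** 2).sum(1)[:, None]
          + (codebook ** 2).sum(1)[None, :]
          - 2.0 * descriptors @ codebook.T)               # (N, K) squared distances
    w = np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))  # Gaussian kernel weights
    w /= w.sum(axis=1, keepdims=True) + 1e-12               # spread each patch over codewords
    bow = w.sum(axis=0)                                     # accumulate over all patches
    return bow / (bow.sum() + 1e-12)                        # L1 unit length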
   The BOW representation from audio was extracted in almost the same way as from video. In this case, the one-dimensional audio signal was converted to a mel-frequency spectrogram, which is a 2D representation and can be treated as an image. For spectrograms, only DENSE8 and DENSE16 sampling was used, because spectrograms do not contain distinct interest regions which Harris-Laplace could detect. Only the SIFT descriptor was used, as spectrograms do not contain any color information.
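   As a rough sketch of this preprocessing, the following converts audio to a log mel-spectrogram and cuts it into fixed-length 2D windows that can then be fed to the image pipeline. It assumes the librosa library; the number of mel bands and the hop length are our own choices and are not taken from the paper.

import librosa

def audio_to_spectrogram_images(path, window_s=10.0, n_mels=64, hop_length=512):
    """Convert audio to a log mel-spectrogram and cut it into fixed-length
    2D 'images'; the paper uses 10-second windows with overlap when needed."""
    y, sr = librosa.load(path, sr=None, mono=True)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop_length)
    S = librosa.power_to_db(S)                          # log scale behaves more like an image
    frames_per_window = int(window_s * sr / hop_length)
    images, start = [], 0
    while start < S.shape[1]:
        end = min(start + frames_per_window, S.shape[1])
        if end - start < frames_per_window:
            # allow overlap at the tail so the last window is still full length
            start = max(0, S.shape[1] - frames_per_window)
            end = S.shape[1]
        images.append(S[:, start:end])
        if end == S.shape[1]:
            break
        start = end
    return images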
   As the provided training set is extremely small, we decided to expand it by treating each video frame and each short spectrogram as an individual sample labeled with the genre of the original video, and to merge these partial decisions later. 100 equidistant samples were extracted from each video (training and testing). The length of the spectrograms was set to 10 seconds, and an overlap was allowed when needed. A linear SVM was used to learn a separate one-against-all classifier for each genre. The meta-parameter C was set by cross-validation, which ensured that samples from a single video never appeared in the training and testing set at the same time. Considering the small number of original videos, we set the same C for all classifiers of a particular representation. The final response of a genre classifier for a video was computed as the number of samples from that video on which that classifier gave the highest response compared to the other genre classifiers.
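   The sample-to-video aggregation can be illustrated as below. This is our own sketch of the counting rule described above; the array shapes are assumptions.

import numpy as np

def video_genre_scores(sample_scores, video_ids):
    """Aggregate per-sample classifier responses into per-video genre scores.
    sample_scores: (n_samples, n_genres) SVM decision values;
    video_ids: length-n_samples array mapping each sample to its video.
    The score of a genre for a video is the number of that video's samples
    on which the genre's classifier gave the highest response."""
    sample_scores = np.asarray(sample_scores)
    video_ids = np.asarray(video_ids)
    n_genres = sample_scores.shape[1]
    winners = sample_scores.argmax(axis=1)      # winning genre per sample
    videos = np.unique(video_ids)
    scores = np.zeros((len(videos), n_genres))
    for i, v in enumerate(videos):
        scores[i] = np.bincount(winners[video_ids == v], minlength=n_genres)
    return videos, scores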
   For the TRECVID 2011 SIN task, we created classifiers for 345 semantic classes. These classifiers were created in almost the same way as the classifiers described above. We applied these 345 classifiers to the image and audio samples and created feature representations of the videos by computing histograms of their responses (8 bins per semantic class). The response histograms were then used to train genre classifiers as before, with the difference that the training set consisted of only the 246 videos and that the classifier responses were used directly as results. These classifiers are further denoted as TV11.
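   A minimal sketch of turning the per-sample concept responses into the histogram features is given below; the bin range is an assumption, since the paper only states that 8 bins per semantic class were used.

import numpy as np

def response_histograms(sample_responses, n_bins=8, value_range=(-1.0, 1.0)):
    """Turn per-sample responses of semantic-concept classifiers into a fixed-length
    video descriptor: one n_bins histogram per concept, concatenated.
    sample_responses: (n_samples, n_concepts); value_range is an assumed bin range."""
    sample_responses = np.asarray(sample_responses)
    n_concepts = sample_responses.shape[1]
    feats = np.empty(n_concepts * n_bins)
    for c in range(n_concepts):
        hist, _ = np.histogram(sample_responses[:, c], bins=n_bins, range=value_range)
        feats[c * n_bins:(c + 1) * n_bins] = hist / max(len(sample_responses), 1)
    return feats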
   For metadata and ASR we computed BOW representations by removing XML elements and non-alphabetic characters, and by splitting words where a lower-case character was followed by an upper-case character. Classifiers for metadata and ASR were created in the same way as for the TV11 representation.
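   A possible reading of this text preprocessing is sketched below; the regular expressions and the fixed-vocabulary BOW helper are our own illustration, not the authors' code.

import re
from collections import Counter

def tokenize(text):
    """Tokenisation following the description above: drop XML elements and
    non-alphabetic characters, and split words at lower-/upper-case boundaries."""
    text = re.sub(r"<[^>]*>", " ", text)                 # remove XML elements
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)     # camelCase -> camel Case
    text = re.sub(r"[^A-Za-z]+", " ", text)              # keep only letters
    return text.lower().split()

def text_bow(text, vocabulary):
    """L1-normalised bag-of-words over a fixed vocabulary (a dict word -> index)."""
    counts = Counter(w for w in tokenize(text) if w in vocabulary)
    total = sum(counts.values()) or 1
    return {vocabulary[w]: c / total for w, c in counts.items()}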
   We assumed that the number of available training samples was too small for accurate and reliable fusion. For this reason we decided to make an educated guess based on previous experience and on results on the training set, and to combine the classifiers by a weighted average with the weights set by hand. The responses of the classifiers based on all audio and video features were averaged into C-AV, the TV11 classifiers into C-TV11, and ASR and metadata into C-TEXT. These averages were normalized by the biggest standard deviation of the individual class responses and were combined by a weighted average.
   RUN1 used only ASR, as required. RUN3 combined C-AV, C-TV11 and C-TEXT with the weight of C-TEXT increased to 2.5. RUN4 combined C-AV and C-TEXT, which had weight 1.25. RUN5 combined C-TV11 and C-TEXT, which had weight 1.25.
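   The late fusion itself reduces to a few lines. The sketch below is our own illustration of the weighted averaging with the normalization described above (e.g. RUN3 would correspond to weights 1, 1 and 2.5 for C-AV, C-TV11 and C-TEXT); it is not the authors' implementation.

import numpy as np

def fuse(score_blocks, weights):
    """Late fusion by weighted averaging of classifier outputs.
    score_blocks: list of (n_videos, n_genres) arrays (e.g. C-AV, C-TV11, C-TEXT);
    each block is first scaled by its largest per-genre standard deviation,
    as described in the text, then the blocks are averaged with the given weights."""
    fused = np.zeros_like(np.asarray(score_blocks[0], dtype=float))
    for block, w in zip(score_blocks, weights):
        block = np.asarray(block, dtype=float)
        block = block / (block.std(axis=0).max() + 1e-12)   # normalise by biggest class std
        fused += w * block
    return fused / sum(weights)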
3.   RESULTS
   The results of the official runs are shown in Table 1. Following the MediaEval methodology, we additionally evaluated all the separate parts which were combined for the official runs, as well as some other combinations. These unofficial results are shown in Table 2.

          Run      MAP
          RUN1     0.165
          RUN3     0.346
          RUN4     0.322
          RUN5     0.360

   Table 1: Mean average precision on the test set achieved by the runs submitted to MediaEval 2011.

   Of the individual types of features, the best results were achieved by metadata. Metadata gives better results than all the official runs, where adding other features decreased the performance. TV11 classifiers provide significantly better results than classifiers trained directly on image features. The same is true for their combinations, where C-TV11 gives 0.275 MAP and C-AV only 0.226 MAP. The question remains whether this is because the TRECVID classifiers bring additional knowledge or due to the differences in the training of the two sets of classifiers. Interestingly, the audio features provide good results comparable to the visual features in TV11, and are much better than the image features when learning directly from the features. The worse results in the case of TV11 could be explained by the lower performance of the original audio classifiers on TRECVID data (almost two times worse than the image features).

          Features                        Direct    TV11
          DENSE16 CSIFT                    0.126   0.194
          DENSE16 SIFT                     0.100   0.178
          DENSE8 CSIFT                     0.116   0.201
          DENSE8 SIFT                      0.078   0.187
          HARLAP CSIFT                     0.145   0.178
          HARLAP SIFT                      0.133   0.174
          SPECTRUM DENSE16 SIFT            0.195   0.167
          SPECTRUM DENSE8 SIFT             0.158   0.188
          COMBINED (C-AV, C-TV11)          0.226   0.275
          ASR                              0.165
          METADATA                         0.405
          C-TEXT                           0.300
          ALL WITHOUT METADATA             0.300
          ALL WITHOUT ASR                  0.448
          RANDOM                           0.046

   Table 2: Unofficial results on the testing set. Mean average precision is reported for classifiers trained directly on the features (Direct) and on the TV11 representation (TV11).

   Further, we experimented with additional combinations of features. We combined all classifiers, and all classifiers excluding METADATA, with weights which better reflect the performance of the classifiers. These results are denoted as ALL and ALL WITHOUT METADATA, respectively, in Table 2. The weights were 1× ASR, 1× C-AV, 4× C-TV11 and 8× METADATA. The combination ALL provides the overall best result of 0.448 MAP and significantly improves over the metadata alone. ALL WITHOUT METADATA reaches 0.3 MAP, improving over all its components.
4.   CONCLUSION
   The achieved results are surprisingly good considering the small size of the training set used. The question is how the results would compare to other methods on this dataset, especially to those which use external sources of knowledge and which focus more on the metadata, as it was shown to be the most important source of information. Additionally, it is not certain how the presented methods would perform on a more diverse dataset.
   Although the metadata is definitely the most important source of information for genre recognition, the audio and video content features improved the results when appropriately combined. A larger training set would be needed to perform proper classifier fusion, which could further increase the benefit of the content-based features.

Acknowledgements
This work has been supported by the EU FP7 project TA2: Together Anywhere, Together Anytime, ICT-2007-214793, grant no. 214793.

5.   REFERENCES
[1] Jean-Luc Gauvain, Lori Lamel, and Gilles Adda. The LIMSI broadcast news transcription system. Speech Communication, 37(1-2):89–108, 2002.
[2] Michal Hradis et al. Brno University of Technology at TRECVID 2010. In TRECVID 2010: Participant Notebook Papers and Slides, page 11. National Institute of Standards and Technology, 2010.
[3] Martha Larson et al. Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task. In MediaEval 2011 Workshop, Pisa, Italy, September 1-2, 2011.
[4] Krystian Mikolajczyk and Cordelia Schmid. A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell., 27(10):1615–1630, 2005.
[5] Cees G. M. Snoek et al. The MediaMill TRECVID 2010 semantic video search engine. In TRECVID 2010: Participant Notebook Papers and Slides, 2010.
[6] J. C. van Gemert et al. Visual word ambiguity. IEEE Trans. Pattern Anal. Mach. Intell., 32(7):1271–1283, 2010.

Copyright is held by the author/owner(s).
MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy