Using Local Region Semantics for Concept
                       Detection in Video
                        Evaggelos Spyrou, George Koumoulos, Yannis Avrithis and Stefanos Kollias


   Abstract— This paper presents a framework for the detection
of semantic features in video sequences. Low-level feature extrac-
tion is performed on the keyframes of the shots and a “feature
vector” including color and texture features is formed. A region
“thesaurus” that contains all the high-level features is constructed
using a subtractive clustering method.Then, a “model vector”
that contains the distances from each region type is formed and a
SVM detector is trained for each semantic concept. Experiments
were performed using TRECVID 2005 development data.
  Index Terms— semantic analysis, thesaurus, SVM, TRECVID

                                                                             Fig. 1.   Presented Framework


                         I. I NTRODUCTION
                                                                                          II. L OW-L EVEL F EATURE E XTRACTION

H      IGH-level concept detection in video documents remains
       still an unsolved problem. One aspect of this is the
extraction of the low-level features of a video sequence and the
                                                                                Since a set of dominant colors in an image or a region of
                                                                             interest has the ability to efficiently capture its color properties,
                                                                             an approach based on the MPEG-7 Dominant Color Descriptor
other is the method is used for assigning low-level descriptions             [6] was selected. The K-means clustering method is applied
to high-level concepts, a problem commonly referred to as the                on the RGB values of a given keyframe. As opposed to the
“Semantic Gap”. Many approaches have been proposed, all                      MPEG-7 Dominant Color descriptor, where the number of the
sharing the target of bridging the semantic gap, thus extracting             extracted representative colors varies allowing a maximum of
high-level concepts from multimedia documents.                               eight colors that can be extracted, a fixed number of colors
   In [5], a prototype multimedia analysis and retrieval system              is each time preselected in our approach.the MPEG-7 Homo-
is presented, that uses multi-modal machine learning tech-                   geneous Texture Descriptor (HTD) [6] was used to capture
niques in order to model semantic concepts in video. A region-               texture properties of each region. The energy deviations of the
based approach in content retrieval that uses Latent Semantic                descriptors were discarded, in order to simplify the description,
Analysis (LSI) is presented in [9]. The extraction of low-level              preventing biasing towards the texture features.
concepts is performed after the image is clustered by a mean                    All the low-level visual descriptions of a keyframe are
shift algorithm thus features are selected locally in [8]. In [11],          normalized to avoid scale effects and merged into a unique
a region-based approach using MPEG-7 visual features and                     vector. This vector will be referred to as feature vector.
knowledge in the form of an ontology is presented. Moreover,
in the context of TV news bulletins, a hybrid thesaurus
approach is presented in [7], a lexicon-driven approach for                             III. R EGION T HESAURUS C ONSTRUCTION
an interactive video retrieval system is presented in [2] and                   Given the entire set of the keyframes extracted from a video,
a lexicon design for semantic indexing in media databases is                 it is obvious that those with similar semantic features should
also presented in [1].                                                       have similar low-level descriptions. To exploit this, clustering
   In this work, the problem of concept detection in video                   is performed on all the descriptions of the training set. Since
is approached in the following way: Low-level features are                   we cannot have a priori knowledge for the exact number of
extracted from keyframes, each representing a shot. A model                  the required classes, Subtractive Clustering [3] is the applied
vector is formed by associating these descriptions with the                  method on the low-level description set, since it determines
words of a thesaurus. Then a SVM classifier is used to detect                the number of the clusters. Each cluster may or may not
the semantic concepts.The presented framework is depicted in                 represent a high-level feature and each high-level feature may
figure 1.                                                                    be represented by one or more clusters. For example, the
                                                                             concept desert can have many instances differing in i.e. the
   E.Spyrou, G.Koumoulos, Y.Avrithis and S.Kollias are with Image, Video     color of the sand. Moreover, in a cluster that may contain
and Multimedia Systems Laboratory, School of Electrical and Computer         instances from the semantic entity i.e. sea, these instances
Engineering, National Technical University of Athens, 9 Iroon Polytechniou
Str., 157 80 Athens, Greece.(e-mail:espyrou@image.ece.ntua.gr)               could be mixed up with parts from another concept i.e. sky, if
                                                                             present in an image.
     Concept     35 Region Types     62 Region Types   125 Region Types         Concept     2 DC + HT      3 DC + HT    4 DC + HT     5 DC + HT
      Desert         82.5%               77.5%              70.1%                Desert       77.5%          80.5%        82.5%         79.0%
    Vegetation       80.5%               71.3%              67.2%              Vegetation     70.5%          77.5%        80.5%         81.2%
    Mountain         83.6%               77.7%              67.0%              Mountain       70.3%          82.0%        83.6%         78.6%
      Road           72.0%               67.0%              65.9%                Road         68.0%          70.0%        72.0%         70.0%
       Sky           80.1%               77.4%              70.0%                 Sky         77.5%          80.1%        80.1%         79.0%
      Snow           70.5 %              62.1%              55.2%                Snow         57.2%          62.0%        70.5%         72.2%

                                   TABLE I                                                                  TABLE II
C LASSIFICATION RATE USING BOTH VISUAL DESCRIPTORS FOR VARIOUS             C LASSIFICATION RATE USING BOTH VISUAL DESCRIPTORS FOR VARIOUS
                   NUMBERS OF THE REGION TYPES                                  NUMBERS OF THE DOMINANT COLORS , THESAURUS SIZE = 35


                                                                                              Concept       DC      HT      DC+HT
   A thesaurus combines a list of every term in a given domain                                 Desert      80.2%   77.2%    82.5%
                                                                                             Vegetation    72.5%   75.0%    80.5%
of knowledge and a set of related terms for each term in                                     Mountain      72.1%   77.5%    83.6%
the list. In our approach, the constructed “Region Thesaurus”                                  Road        71.5%   70.2%    72.0%
                                                                                                Sky        85.0%   70.1%    80.1%
contains all the “Region Types” that are encountered in the                                    Snow        75.0%   60.1%    70.5%
training set. These region types are the centroids of the clusters                                    TABLE III
and all the other members of the cluster are their synonyms.               C LASSIFICATION RATE USING ONLY COLOR , ONLY TEXTURE AND BOTH
The use of the thesaurus is to facilitate the association of the                       VISUAL DESCRIPTORS , THESAURUS SIZE = 35
low-level features of the image with the corresponding high-
level concepts. Since the number of the region types can be
very large, the dimensionality of the model vector may become
very high. To avoid this, principal component analysis (PCA)                                       ACKNOWLEDGMENT
is applied in order to reduce its dimensionality, thus facilitating
                                                                            The work presented in this paper was partially supported
the performance of the feature detectors.
                                                                          by the European Commission under contracts FP6-027026 K-
                                                                          Space and FP6-027685 MESH. Evaggelos Spyrou is funded
       IV. M ODEL V ECTOR K EYFRAME D ESCRIPTION
                                                                          by the Greek Secretariat of Research and Technology (PENED
   After the construction of the region thesaurus, a “model               Ontomedia 03 ED 475)
vector” is formed for each keyframe. Its dimensionality is
equal to the number of concepts constituting the thesaurus.                                               R EFERENCES
The distance of a region to a region type is calculated as a
                                                                           [1] M. N. A. Natsev and J. Smith, “Lexicon design for semantic indexing
linear combination of the dominant color and homogeneous                       in media databases,” in International Conference on Communication
texture distances respectively, as in [4].                                     Technologies and Programming, 2003.
   Having calculated the distance of each region of the image              [2] D. C. K. Cees G.M. Snoek, Marcel Worring and A. W. Smeulders,
                                                                               “Learned lexicon-driven interactive video retrieval,” 2006.
to all the region types of the constructed thesaurus, the model            [3] S. Chiu, Extracting Fuzzy Rules from Data for Function Approximation
vector that semantically describes the visual content of the                   and Pattern Classification. John Wiley and Sons, 1997.
image is formed by keeping the smaller distance for each high-             [4] E.Spyrou, H.LeBorgne, T.Mailis, E.Cooke, Y.Avrithis, and N.O’Connor,
                                                                               “Fusing mpeg-7 visual descriptors for image classification,” in Interna-
level concept.                                                                 tional Conference on Artificial Neural Networks (ICANN), 2005.
   For each semantic concept , a support vector machine [10] is            [5] IBM, “Marvel: Multimedia analysis and retrieval system.” [Online].
trained. Its input is the model vector and its output determines               Available: http://mp7.watson.ibm.com/
                                                                           [6] B. Manjunath, J. Ohm, V. Vasudevan, and A. Yamada, “Color and texture
whether the concept exists or not within the keyframe.                         descriptors,” IEEE trans. on Circuits and Systems for Video Technology,
                                                                               vol. 11, no. 6, pp. 703–715, 2001.
                  V. E XPERIMENTAL R ESULTS                                [7] V. G. N. Boujemaa, F. Fleuret and H. Sahbi, “Visual content extraction
                                                                               for automatic semantic annotation of video news,” in IS&T/SPIE Confer-
   For the evaluation of the presented framework,part of the                   ence on Storage and Retrieval Methods and Applications for Multimedia,
development data of TRECVID 2005 was used. This set                            part of Electronic Imaging symposium, January 2004.
consists of approximately 65000 keyframes, captured from                   [8] B. Saux and G.Amato, “Image classifiers for scene analysis,” in Inter-
                                                                               national Conference on Computer Vision and Graphics, 2004.
TV news bulletins. The high-level features for which feature               [9] F. Souvannavong, B. Mérialdo, and B. Huet, “Region-based video
detectors were implemented are: desert, vegetation, mountain,                  content indexing and retrieval,” in CBMI 2005, Fourth International
road, sky and snow. Experiments were performed on the size                     Workshop on Content-Based Multimedia Indexing, June 21-23, 2005,
                                                                               Riga, Latvia, Jun 2005.
of the region thesaurus, the number of dominant colors and the            [10] V. Vapnik, Statistical Learning Theory. John Wiley and Sons, 1998.
presence or not of both visual descriptors. Results are shown             [11] N. Voisine, S. Dasiopoulou, V. Mezaris, E. Spyrou, T. Athanasiadis,
in tables I, II and III.                                                       I. Kompatsiaris, Y. Avrithis, and M. G. Strintzis, “Knowledge-assisted
                                                                               video analysis using a genetic algorithm,” in 6th International Workshop
                                                                               on Image Analysis for Multimedia Interactive Services (WIAMIS 2005),
                     VI. C ONCLUSION                                           April 13-15, 2005.
   The experimental results indicate that the selected concepts
can be detected when a keyframe is represented by a model
vector with the use of a visual thesaurus. Moreover, future
plans include integration of the presented framework to the
one of [11] and fusion of their results.