Using Local Region Semantics for Concept Detection in Video

Evaggelos Spyrou, George Koumoulos, Yannis Avrithis and Stefanos Kollias

Abstract— This paper presents a framework for the detection of semantic features in video sequences. Low-level feature extraction is performed on the keyframes of the shots and a "feature vector" including color and texture features is formed. A region "thesaurus" that contains all the high-level features is constructed using a subtractive clustering method. Then, a "model vector" that contains the distances from each region type is formed and an SVM detector is trained for each semantic concept. Experiments were performed using TRECVID 2005 development data.

Index Terms— semantic analysis, thesaurus, SVM, TRECVID

Fig. 1. Presented Framework

I. INTRODUCTION

High-level concept detection in video documents still remains an unsolved problem. One aspect of it is the extraction of the low-level features of a video sequence; the other is the method used for assigning low-level descriptions to high-level concepts, a problem commonly referred to as the "Semantic Gap". Many approaches have been proposed, all sharing the goal of bridging the semantic gap, thus extracting high-level concepts from multimedia documents.
eight colors that can be extracted, a fixed number of colors In [5], a prototype multimedia analysis and retrieval system is each time preselected in our approach.the MPEG-7 Homo- is presented, that uses multi-modal machine learning tech- geneous Texture Descriptor (HTD) [6] was used to capture niques in order to model semantic concepts in video. A region- texture properties of each region. The energy deviations of the based approach in content retrieval that uses Latent Semantic descriptors were discarded, in order to simplify the description, Analysis (LSI) is presented in [9]. The extraction of low-level preventing biasing towards the texture features. concepts is performed after the image is clustered by a mean All the low-level visual descriptions of a keyframe are shift algorithm thus features are selected locally in [8]. In [11], normalized to avoid scale effects and merged into a unique a region-based approach using MPEG-7 visual features and vector. This vector will be referred to as feature vector. knowledge in the form of an ontology is presented. Moreover, in the context of TV news bulletins, a hybrid thesaurus approach is presented in [7], a lexicon-driven approach for III. R EGION T HESAURUS C ONSTRUCTION an interactive video retrieval system is presented in [2] and Given the entire set of the keyframes extracted from a video, a lexicon design for semantic indexing in media databases is it is obvious that those with similar semantic features should also presented in [1]. have similar low-level descriptions. To exploit this, clustering In this work, the problem of concept detection in video is performed on all the descriptions of the training set. Since is approached in the following way: Low-level features are we cannot have a priori knowledge for the exact number of extracted from keyframes, each representing a shot. 
III. REGION THESAURUS CONSTRUCTION

Given the entire set of keyframes extracted from a video, those with similar semantic features should evidently have similar low-level descriptions. To exploit this, clustering is performed on all the descriptions of the training set. Since the exact number of required classes cannot be known a priori, Subtractive Clustering [3] is applied to the low-level description set, as it determines the number of clusters by itself. Each cluster may or may not represent a high-level feature, and each high-level feature may be represented by one or more clusters. For example, the concept desert can have many instances differing in, e.g., the color of the sand. Moreover, in a cluster that contains instances of the semantic entity sea, these instances could be mixed up with parts of another concept, e.g. sky, if present in an image.

E. Spyrou, G. Koumoulos, Y. Avrithis and S. Kollias are with the Image, Video and Multimedia Systems Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens, 9 Iroon Polytechniou Str., 157 80 Athens, Greece (e-mail: espyrou@image.ece.ntua.gr).

TABLE I
CLASSIFICATION RATE USING BOTH VISUAL DESCRIPTORS FOR VARIOUS NUMBERS OF REGION TYPES

Concept      35 Region Types   62 Region Types   125 Region Types
Desert       82.5%             77.5%             70.1%
Vegetation   80.5%             71.3%             67.2%
Mountain     83.6%             77.7%             67.0%
Road         72.0%             67.0%             65.9%
Sky          80.1%             77.4%             70.0%
Snow         70.5%             62.1%             55.2%

TABLE II
CLASSIFICATION RATE USING BOTH VISUAL DESCRIPTORS FOR VARIOUS NUMBERS OF DOMINANT COLORS, THESAURUS SIZE = 35

Concept      2 DC + HT   3 DC + HT   4 DC + HT   5 DC + HT
Desert       77.5%       80.5%       82.5%       79.0%
Vegetation   70.5%       77.5%       80.5%       81.2%
Mountain     70.3%       82.0%       83.6%       78.6%
Road         68.0%       70.0%       72.0%       70.0%
Sky          77.5%       80.1%       80.1%       79.0%
Snow         57.2%       62.0%       70.5%       72.2%

A thesaurus combines a list of every term in a given domain of knowledge with a set of related terms for each term in the list.
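The clustering step can be sketched with a compact implementation of Chiu's subtractive clustering [3]; the neighbourhood radius ra and stopping threshold eps below are illustrative defaults, not values taken from the paper:

```python
from math import exp, dist

def subtractive_clustering(points, ra=0.5, eps=0.15):
    """Chiu's subtractive clustering: the number of clusters is not
    fixed in advance but emerges from the density of the data.

    ra sets the neighbourhood radius; rb = 1.5 * ra penalises
    choosing centers close to already-selected ones."""
    alpha, beta = 4 / ra ** 2, 4 / (1.5 * ra) ** 2
    # potential of each point = density of its neighbourhood
    pot = [sum(exp(-alpha * dist(p, q) ** 2) for q in points) for p in points]
    centers, first = [], max(pot)
    while True:
        k = max(range(len(points)), key=pot.__getitem__)
        if pot[k] < eps * first:
            break  # remaining potential too low: stop
        centers.append(points[k])
        # revise potentials: suppress the neighbourhood of the new center
        pot = [p - pot[k] * exp(-beta * dist(points[i], points[k]) ** 2)
               for i, p in enumerate(pot)]
    return centers
```

Because the centers are data-driven, the size of the resulting thesaurus depends on the chosen radius rather than on a preset cluster count.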
In our approach, the constructed "Region Thesaurus" contains all the "Region Types" encountered in the training set. These region types are the centroids of the clusters, and all the other members of each cluster are their synonyms. The role of the thesaurus is to facilitate the association of the low-level features of an image with the corresponding high-level concepts. Since the number of region types can be very large, the dimensionality of the model vector may become very high. To avoid this, principal component analysis (PCA) is applied to reduce its dimensionality, thus facilitating the performance of the feature detectors.

TABLE III
CLASSIFICATION RATE USING ONLY COLOR, ONLY TEXTURE AND BOTH VISUAL DESCRIPTORS, THESAURUS SIZE = 35

Concept      DC      HT      DC+HT
Desert       80.2%   77.2%   82.5%
Vegetation   72.5%   75.0%   80.5%
Mountain     72.1%   77.5%   83.6%
Road         71.5%   70.2%   72.0%
Sky          85.0%   70.1%   80.1%
Snow         75.0%   60.1%   70.5%

IV. MODEL VECTOR KEYFRAME DESCRIPTION

After the construction of the region thesaurus, a "model vector" is formed for each keyframe. Its dimensionality is equal to the number of region types constituting the thesaurus. The distance of a region to a region type is calculated as a linear combination of the dominant color and homogeneous texture distances, as in [4]. Having calculated the distance of each region of the image to all the region types of the constructed thesaurus, the model vector that semantically describes the visual content of the image is formed by keeping the smallest distance for each region type.
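A minimal sketch of this computation, assuming each region and region type carries a dominant-color vector and a texture vector; the equal weights are an assumption on our part, since the text only states that a linear combination as in [4] is used:

```python
from math import dist

def region_distance(region, region_type, w_dc=0.5, w_ht=0.5):
    """Distance between a region and a region type as a linear
    combination of dominant-color and homogeneous-texture distances;
    the weights w_dc and w_ht are illustrative, not from the paper."""
    return (w_dc * dist(region["dc"], region_type["dc"])
            + w_ht * dist(region["ht"], region_type["ht"]))

def model_vector(regions, thesaurus):
    """One entry per region type: the smallest distance from any
    region of the keyframe to that region type."""
    return [min(region_distance(r, t) for r in regions) for t in thesaurus]
```

Each entry is small exactly when some region of the keyframe closely matches the corresponding region type, which is what the per-concept detectors then exploit.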
For each semantic concept, a support vector machine [10] is trained. Its input is the model vector and its output determines whether the concept is present in the keyframe or not.

V. EXPERIMENTAL RESULTS

For the evaluation of the presented framework, part of the development data of TRECVID 2005 was used. This set consists of approximately 65,000 keyframes captured from TV news bulletins. The high-level features for which detectors were implemented are: desert, vegetation, mountain, road, sky and snow. Experiments were performed varying the size of the region thesaurus, the number of dominant colors, and the use of one or both visual descriptors. Results are shown in Tables I, II and III.
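The per-concept detection step of Section IV can be sketched as follows. The paper trains one SVM [10] per concept on the model vectors; as a self-contained stand-in we use a simplified linear SVM trained with Pegasos-style sub-gradient steps (a real system would use a full SVM implementation, and the hyperparameters here are illustrative):

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Pegasos-style sub-gradient training of a linear SVM.
    X: list of model vectors; y: +1 / -1 labels for one concept."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(X)), len(X)):  # shuffled pass
            t += 1
            eta = 1.0 / (lam * t)  # decreasing step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [wj * (1 - eta * lam) for wj in w]  # regularization shrink
            if margin < 1:  # hinge loss active: move towards x_i
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def detect(w, x):
    """The concept is declared present iff the SVM score is positive."""
    return sum(wj * xj for wj, xj in zip(w, x)) > 0
```

One such detector would be trained per concept (desert, vegetation, mountain, road, sky, snow), with positive labels on keyframes annotated as containing the concept.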
VI. CONCLUSION

The experimental results indicate that the selected concepts can be detected when a keyframe is represented by a model vector with the use of a visual thesaurus. Moreover, future plans include integrating the presented framework with that of [11] and fusing their results.

ACKNOWLEDGMENT

The work presented in this paper was partially supported by the European Commission under contracts FP6-027026 K-Space and FP6-027685 MESH. Evaggelos Spyrou is funded by the Greek Secretariat of Research and Technology (PENED Ontomedia 03 ED 475).

REFERENCES

[1] M. N. A. Natsev and J. Smith, "Lexicon design for semantic indexing in media databases," in International Conference on Communication Technologies and Programming, 2003.
[2] C. G. M. Snoek, M. Worring, D. C. Koelma and A. W. M. Smeulders, "Learned lexicon-driven interactive video retrieval," 2006.
[3] S. Chiu, Extracting Fuzzy Rules from Data for Function Approximation and Pattern Classification. John Wiley and Sons, 1997.
[4] E. Spyrou, H. Le Borgne, T. Mailis, E. Cooke, Y. Avrithis and N. O'Connor, "Fusing MPEG-7 visual descriptors for image classification," in International Conference on Artificial Neural Networks (ICANN), 2005.
[5] IBM, "Marvel: Multimedia analysis and retrieval system." [Online]. Available: http://mp7.watson.ibm.com/
[6] B. S. Manjunath, J.-R. Ohm, V. Vasudevan and A. Yamada, "Color and texture descriptors," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 6, pp. 703–715, 2001.
[7] N. Boujemaa, F. Fleuret, V. Gouet and H. Sahbi, "Visual content extraction for automatic semantic annotation of video news," in IS&T/SPIE Conference on Storage and Retrieval Methods and Applications for Multimedia, part of the Electronic Imaging symposium, January 2004.
[8] B. Le Saux and G. Amato, "Image classifiers for scene analysis," in International Conference on Computer Vision and Graphics, 2004.
[9] F. Souvannavong, B. Mérialdo and B. Huet, "Region-based video content indexing and retrieval," in Fourth International Workshop on Content-Based Multimedia Indexing (CBMI 2005), Riga, Latvia, June 2005.
[10] V. Vapnik, Statistical Learning Theory. John Wiley and Sons, 1998.
[11] N. Voisine, S. Dasiopoulou, V. Mezaris, E. Spyrou, T. Athanasiadis, I. Kompatsiaris, Y. Avrithis and M. G. Strintzis, "Knowledge-assisted video analysis using a genetic algorithm," in 6th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS 2005), April 2005.