
Enabling Interoperability between Multimedia Resources: An Ontology Matching Perspective

Nicolas James

Konstantin Todorov

Celine Hudelot

{nicolas.james, konstantin.todorov, celine.hudelot}@ecp.fr

MAS Laboratory, Ecole Centrale Paris, F-92 295 Châtenay-Malabry, France

The semantic annotation of images can benefit from representations of useful concepts and the links between them as ontologies. Recently, several multimedia ontologies have been proposed in the literature as suitable knowledge models to bridge the well-known semantic gap between low-level features of image content and its high-level conceptual meaning. Nevertheless, these multimedia ontologies are often dedicated to (or initially built for) particular needs or a particular application. We argue that ontology matching, defined as the process of relating different heterogeneous models, is a suitable approach to solving interoperability issues in semantic image annotation and retrieval. We propose a generic instance-based ontology matching approach, applied to an important semantic image retrieval issue: the bridging of the semantic gap by matching a multimedia ontology against a common-sense knowledge resource.

1 Introduction

The fast growth of shared digital image and video collections, together with the intensive use of visual information for decision making in many domains (medicine, geosciences, etc.), requires new effective methods for search and retrieval in these collections. To enable and improve the communication and the interface between humans and computers, it is necessary to understand the semantic content of images and to build linguistic descriptions of their content automatically. Following decades of research on Content-Based Image Retrieval (CBIR), automatic image annotation is nowadays an active research topic that aims at bridging the semantic and the perceptual levels of abstraction, a problem known as the semantic gap [11]. In most image annotation approaches, the computed linguistic description is related only to perceptual manifestations of semantics. Nevertheless, as explained in [5], the image semantics cannot be considered as being included explicitly in the image itself. It rather depends on prior knowledge and on the context of use of the visual information. Consequently, explicit semantics, represented by ontologies, has recently been used intensively in the field of image retrieval.

With the growing application of ontology-based solutions in the multimedia domain, many interoperability issues have arisen: (a) at the semantic level, between different representations of the same domain knowledge; (b) at the visual level, between different multimedia ontologies; (c) between the visual level and the semantic level, i.e. the semantic gap problem. Ontology matching, which we define as the process of relating heterogeneous knowledge models, is widely used for semantic web applications but rarely in the context of image sharing and retrieval; it can be used to solve these kinds of interoperability issues. This paper proposes a generic approach to filling the semantic gap by matching an ontology at the semantic level (WordNet¹ associated with the image database LabelMe [10]) with an ontology at the visual level (LSCOM [12]).

The next section is a short review of existing multimedia ontologies and related approaches. Section 3 describes the ontology matching framework that forms the methodological background of our approach, which is presented in Section 4. Results of our preliminary experiments are discussed in Section 5; Section 6 concludes.

2 Related Work

In the past few years, concept-based multimedia retrieval has been a very active research field, with a major effort devoted to the automatic detection of semantic concepts from low-level features with machine learning approaches. Despite these efforts, the semantic gap problem is still an issue for the semantic understanding of multimedia documents. Recently, many knowledge models have been proposed to improve multimedia retrieval and interpretation by explicitly modeling the different relationships between semantic concepts. Indeed, many generic large-scale multimedia ontologies or multimedia concept lexicons, together with image collections, have been proposed to improve multimedia search and retrieval by providing an effective representation and interpretation of multimedia concepts [13, 12, 1]. We propose to classify these ontologies in four major groups: (1) semantic web multimedia ontologies, often based on MPEG-7, reviewed in [1]; (2) visual concept hierarchies (or networks) inferred from inter-concept visual similarity contexts (among which VCNet, based on Flickr Distance [15], and the Topic Network of Fan [3]); (3) specific multimedia lexicons, often composed of a hierarchy of semantic concepts with associated visual concept detectors, used to describe and automatically detect the semantic concepts of multimedia documents (LSCOM [12], multimedia thesauri [13]); and (4) generic ontologies based on existing semantic concept hierarchies such as WordNet, populated with annotated images or multimedia documents (ImageNet [2], LabelMe [10]). These ontologies have proved very useful, mainly in the context of semantic concept detection and automatic multimedia annotation, but many problems remain unsolved, among which enabling the interoperability between visual concepts and high-level concepts.
Although there exist attempts to solve these problems by manual concept mappings [13], little effort has been directed towards performing them automatically. Moreover, these ontologies are often dedicated to (or built for) particular needs or a particular application and are complementary knowledge sources. While studies have analyzed the different inter-concept similarities in different multimedia ontologies [8], to the best of our knowledge no study proposes a cross analysis and a joint use of these different and complementary ontologies.

1 http://wordnet.princeton.edu/

This paper proposes to situate these problems in an Ontology Matching (OM) framework. The OM approach presented in the next section is much in line with the tradition of extensional matching. This comprises a set of techniques which base the similarity of concepts on characteristics of the instances that these concepts contain [6].

3 An Ontology Matching Approach

An ontology is based on a set of concepts and relations defined on these concepts, which together describe the knowledge in a given domain of interest. Because different communities, independently of one another, tend to conceptualize the same domain of interest differently, a growing number of heterogeneous ontologies describing similar or overlapping parts of the world are created. An OM procedure aims at reducing this heterogeneity by linking the corresponding elements of two ontologies in an automatic or semi-automatic manner.

Formally, a populated ontology will be defined by $O = \{C, \mathit{is\_a}, R, I, g\}$, where $C$ is a set whose elements are called concepts, $\mathit{is\_a}$ is a partial order on $C$, $R$ is a set of other (binary) relations holding between the concepts from the set $C$, $I$ is a set whose elements are called instances, and $g : C \to 2^I$ is an injection from the set of concepts to the set of subsets of $I$.

We note that the sets C and I are necessarily non-empty, in contrast to R. Thus, the definition above describes an ontology which, although not limited to subsumption relations, necessarily contains a hierarchical backbone, defined by the partial order. The set I may contain text documents, images or other (real-world data) entities. By assumption, every instance can be represented as an n-dimensional real-valued vector, defined by n input variables of some kind which are the same for all instances in I.
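The definition above can be mirrored in a small data structure. The following is only a minimal sketch under our own assumptions: the class name `PopulatedOntology`, its field names and the use of string identifiers are hypothetical and do not come from the paper. It checks the two constraints stated in the text (C and I are non-empty, and all instances share the same n variables), but it does not enforce that g is an injection.

```python
from dataclasses import dataclass, field


@dataclass
class PopulatedOntology:
    """Sketch of O = {C, is_a, R, I, g}; all names here are illustrative."""

    concepts: set[str]                    # C
    is_a: set[tuple[str, str]]            # partial order as (child, parent) pairs
    instances: dict[str, list[float]]     # I: instance id -> n-dimensional vector
    extension: dict[str, set[str]]        # g: concept -> ids of its instances
    relations: dict[str, set[tuple[str, str]]] = field(default_factory=dict)  # R, may be empty

    def __post_init__(self) -> None:
        # C and I must be non-empty, in contrast to R.
        if not self.concepts or not self.instances:
            raise ValueError("C and I must be non-empty")
        # Every instance is described by the same n input variables.
        if len({len(v) for v in self.instances.values()}) != 1:
            raise ValueError("all instances must share the same n variables")
        # g maps known concepts to subsets of I.
        for concept, ids in self.extension.items():
            if concept not in self.concepts or not ids.issubset(self.instances):
                raise ValueError(f"invalid extension for concept {concept!r}")

    def g(self, concept: str) -> set[str]:
        """Instances assigned to a concept (empty set if none are recorded)."""
        return self.extension.get(concept, set())
```

An ontology without additional relations, as is the case for LSCOM, simply leaves `relations` at its empty default.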

In the context of semantic image annotation, WordNet together with the LabelMe database [10] and LSCOM [12] together with the TRECVID 2005 database are two examples of such populated ontologies. Concepts are the nodes of the WordNet hierarchy in ImageNet or the LSCOM categories, while instances are the images in the associated databases, which are labeled by these concepts. It is important to note that the set R is empty for the LSCOM ontology. In the case of WordNet, R contains several useful relations like is_a_member_of, is_a_part_of, opposes, etc.

The outcome of an OM procedure is often a set of cross-ontology concept alignments, derived from a measure of concept similarity. The measures used in the current study are based on variable selection, and we describe them in more detail below.

Variable selection techniques (reviewed in [4]) serve to rank the input variables of a given problem (e.g. classification) by their importance for the output (the class affiliation of an instance), according to certain evaluation criteria. A real-valued score which accounts for this importance is attached to every variable. In our case, this can help uncover latent input-output dependencies. Assuming that instances are represented as real-valued vectors, the computed scores indicate which of the vector dimensions are most important for the separation of the instances (within a single ontology) into those that belong to a given concept and those that do not, and thus best characterize this concept.

We define a binary classification training set $S^O_c$ for each concept $c$ from an ontology $O$ by taking $I$, the entire set of instances assigned to $O$, and labeling all instances from the set $g(c)$ as positive and all the rest ($I \setminus g(c)$) as negative. With the help of a variable selection procedure performed on $S^O_c$, we obtain a representation of the concept $c$ as a list

$$L(c) = (s^c_1, s^c_2, \ldots, s^c_n), \tag{1}$$

where $s^c_i$ is the score associated to the $i$th variable. To compute a score per variable and per concept, we apply the Support Vector Machine (SVM)-based variable selection technique introduced in [14]. A series of SVMs is learned on the training set $S^O_c$ by successively removing one variable at a time. The ability of each variable to discriminate $c$ from the other concepts in $O$ is evaluated by measuring the sensitivity of the VC-dimension, an important SVM parameter, with respect to the variable in question.
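To make the construction of the training set and of the list L(c) concrete, here is a minimal sketch. The labeling step follows the text (instances in g(c) are positive, all others negative). The scoring step is deliberately NOT the SVM/VC-dimension technique of [14]: as a simple stand-in we assume a Fisher-style criterion (squared gap between class means divided by the summed class variances). The function names and the small constant `eps` are our own, and both classes are assumed to be non-empty.

```python
from statistics import mean, pvariance


def training_set(instances, positives):
    """Build S_c: every instance of I, labelled +1 if it is in g(c), else -1."""
    ids = sorted(instances)
    X = [instances[i] for i in ids]
    y = [1 if i in positives else -1 for i in ids]
    return X, y


def variable_scores(X, y, eps=1e-12):
    """Return L(c) = (s_1, ..., s_n), one score per input variable.

    Stand-in criterion (an assumption, not the method of [14]):
    (mean_pos - mean_neg)^2 / (var_pos + var_neg).
    """
    scores = []
    for j in range(len(X[0])):
        pos = [x[j] for x, label in zip(X, y) if label == 1]
        neg = [x[j] for x, label in zip(X, y) if label == -1]
        gap = (mean(pos) - mean(neg)) ** 2
        spread = pvariance(pos) + pvariance(neg)
        scores.append(gap / (spread + eps))  # eps avoids division by zero
    return scores
```

Any other variable selection procedure that yields one real-valued score per variable could be plugged in at the same place; the rest of the matching pipeline only consumes the resulting score lists.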

By following the described procedure, given two source ontologies $O_1$ and $O_2$, a representation like the one in (1) is made available for every concept of each of these ontologies. The similarity of two concepts, $A \in O_1$ and $B \in O_2$, is then assessed in terms of their corresponding representations $L(A)$ and $L(B)$. Several choices of a similarity measure based on these representations are proposed and compared in [14]. In the experimental work contained in this paper, we have used Pearson's, Spearman's and Kendall's measures of correlation calculated on the variable scores or ranks (integers corresponding to the scores), given by

$$\mathrm{sim}_{Pearson} = \frac{\sum_{i=1}^{n} (s^A_i - s^A_{mean})(s^B_i - s^B_{mean})}{\sqrt{\sum_{i=1}^{n} (s^A_i - s^A_{mean})^2}\,\sqrt{\sum_{i=1}^{n} (s^B_i - s^B_{mean})^2}}, \tag{2}$$

$$\mathrm{sim}_{Spearman} = 1 - \frac{6 \sum_{i} d_i^2}{n(n^2 - 1)}, \qquad \mathrm{sim}_{Kendall} = \frac{n_c - n_d}{\frac{1}{2} n(n - 1)}. \tag{3}$$

In the formulae above, $s^A_{mean}$ and $s^B_{mean}$ are the means of the scores over all input variables, $d_i$ is the difference of the ranks calculated for the $i$th variable w.r.t. the two concepts, and $n_c$ and $n_d$ are the numbers of concordant and discordant pairs among the lists of scores $L(A)$ and $L(B)$.
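The three correlation measures can be computed directly from two score lists of equal length. The sketch below is a plain-Python illustration of the formulae for Pearson's, Spearman's and Kendall's coefficients as defined above; the function names are ours, ranks are assigned with rank 1 for the highest score, and ties between scores are assumed not to occur (the simple Spearman formula is only exact in that case).

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two score lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = sqrt(sum((x - ma) ** 2 for x in a)) * sqrt(sum((y - mb) ** 2 for y in b))
    return num / den


def ranks(scores):
    """Integer ranks (1 = highest score); assumes there are no ties."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(a, b):
    """Spearman correlation from squared rank differences d_i^2."""
    n = len(a)
    d2 = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall(a, b):
    """Kendall correlation from concordant (n_c) and discordant (n_d) pairs."""
    n = len(a)
    nc = nd = 0
    for i, j in combinations(range(n), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        nc += sign > 0
        nd += sign < 0
    return (nc - nd) / (0.5 * n * (n - 1))
```

For instance, with the toy lists `[0.9, 0.1, 0.5, 0.3]` and `[0.8, 0.2, 0.3, 0.6]` the ranks differ only in the last two positions, so `spearman` gives 0.8 and `kendall` gives 4/6.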

4 Filling the Semantic Gap with Mapped Concepts

As noted in the introduction, many challenging issues in the field of image retrieval stem from the semantic gap problem. Two examples are the construction of robust high-level concept detectors and the creation of user-oriented annotations with high-level semantics. In this section, we attempt to fill the semantic gap by matching two complementary resources: a visual and a semantic thesaurus. Contrary to [13], our approach is automatic, generic (ontology independent) and makes use of the visual knowledge shared by the source ontologies.

On the one hand, we chose LSCOM [12], an ontology dedicated to multimedia annotation. It was initially built in the framework of TRECVID² according to the criteria of concept usefulness, concept observability and feasibility of automatic concept detection. LSCOM is populated by the development set of TRECVID 2005 videos (news broadcasts). On the other hand, we used WordNet [9] populated with the LabelMe dataset [10]. Many interoperability issues can be addressed for these two ontologies, among which semantic interoperability and semantic gap interoperability. Aligning these resources serves two purposes. First, it allows the semantic enrichment of concepts belonging to a multimedia ontology with high-level linguistic concepts from a general, common-sense knowledge base. Second, it allows the quality of the baseline concept detectors to be evaluated by studying the link between concepts whose semantics is related to their perceptual manifestations and concepts whose semantics is related to common sense.

In our setting, the instances that extensionally define a concept are images whose annotations contain the name associated with this concept. An image is represented as a vector of descriptors. We use a codebook built on a bag-of-features model and histograms of codewords, which is currently the best-performing approach in the state of the art [7]. The variables which describe the instances are thus the bins of these histograms. The generic variable selection approach described in Section 3 is applied directly to our data. As a result, we obtain a concept representation like the one introduced in eq. (1) for every concept of our two source ontologies. As stated above (Section 3), there exist several plausible choices of a measure of similarity for two concepts represented in this manner. In our experiments, we have tested the three measures of correlation given in (2) and (3). Regardless of the particular choice, the similarity is always based on visual criteria, since the underlying concept representations are obtained by using visual characteristics of the instances (in the particular case of LSCOM and WordNet, these are the sets of images of either TRECVID or LabelMe).
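As an illustration of how an image becomes a vector whose variables are histogram bins, the following sketch assigns each local descriptor to its nearest codeword and counts the assignments. It assumes that a codebook is already available, that descriptors and codewords are plain lists of floats, that distance is Euclidean, and that the histogram is normalized to sum to one; none of these details are specified in the text, and the function names are hypothetical.

```python
def nearest_codeword(descriptor, codebook):
    """Index of the codeword closest to the descriptor (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((d - c) ** 2 for d, c in zip(descriptor, codebook[k])),
    )


def codeword_histogram(descriptors, codebook):
    """Image representation: bin k holds the share of descriptors assigned to codeword k."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_codeword(descriptor, codebook)] += 1
    total = len(descriptors)
    if total == 0:
        return [0.0] * len(codebook)  # image without descriptors
    return [count / total for count in counts]
```

The resulting list has one entry per codeword, so its length plays the role of n, the number of input variables shared by all instances.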

Aligning LSCOM to WordNet makes it possible to infer knowledge about the LSCOM concepts (dedicated to multimedia document annotation) with regard to the concepts of WordNet. The alignment could be used to build a linguistic description of the concepts of LSCOM or, in other words, to answer the question "What is an LSCOM concept in WordNet?" automatically. This improves the retrieval process in several ways: (1) through query expansion and reformulation, i.e. retrieving documents annotated with concepts from an ontology O1 using a query composed of concepts of an ontology O2; (2) through a better description of the documents in the indexing process. Note, however, that this relation is not symmetric: alignments in the other direction are prone to be of little help, since WordNet concepts are rather atomic (such as "car") compared to the more complex LSCOM concepts (e.g. "Natural Disaster Scene").

2 http://www-nlpir.nist.gov/projects/tv2005/
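The use of an alignment for query reformulation can be sketched as follows. This is only an illustration under our own assumptions: each concept of the first ontology is aligned with its single most similar concept of the second one if the similarity is positive, and a document is retrieved when one of its annotations is aligned with a queried concept. The function names, the threshold and the placeholder concept names and values in the usage note are hypothetical and are not results from the paper.

```python
def best_alignments(similarity, threshold=0.0):
    """Align each O1 concept with its most similar O2 concept, if above the threshold.

    `similarity` maps an O1 concept to a dict {O2 concept: similarity value}.
    """
    alignment = {}
    for source, row in similarity.items():
        target, score = max(row.items(), key=lambda item: item[1])
        if score > threshold:
            alignment[source] = target
    return alignment


def retrieve(query, alignment, annotations):
    """Documents annotated with O1 concepts, retrieved with a query of O2 concepts."""
    wanted = {source for source, target in alignment.items() if target in query}
    return sorted(doc for doc, concepts in annotations.items() if wanted & set(concepts))
```

With invented values such as `{"O1:scene_a": {"O2:house": 0.4, "O2:tv": -0.1}, "O1:scene_b": {"O2:house": -0.2, "O2:tv": -0.3}}`, only `O1:scene_a` is aligned (with `O2:house`), so a query for `O2:house` returns the documents annotated with `O1:scene_a`.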

5 Experimental Results

We use a part of the LSCOM ontology, LSCOM Annotation v1.0³, which is a subset of 449 concepts from the initial LSCOM ontology and is used for annotating 61,517 images from the TRECVID 2005 development set. Since this set contains images from broadcast news videos, the chosen LSCOM subpart is particularly suited to annotating this kind of content and therefore contains abstract and specific concepts (e.g. 196 Science Technology, 330 Interview On Location). In contrast, our sub-ontology defined from WordNet populated with LabelMe (3676 concepts) is very general, given the nature of LabelMe, which is composed of photographs of daily life.

To provide a preliminary evaluation of the suggested approach, we chose three concepts from the LSCOM ontology and five concepts from the WordNet ontology. The concepts were selected according to several criteria: (1) the number of associated instances; (2) the absence of semantic ambiguity for every selected concept in our dataset; (3) for WordNet only, a high confidence (arbitrarily decided) in the discrimination of the concept using only perceptual information.

3 http://www.ee.columbia.edu/ln/dvmm/lscom/

To construct image features, we use a bag-of-features model with a visual codebook, built classically using the well-known SIFT descriptor and a K-Means algorithm. The quantization of the extracted SIFT features was investigated in two ways: (1) over all the instances associated with the selected concepts (LSCOM and WordNet); (2) over the LabelMe images only, with quantization per concept. The two experiments gave very similar results; the results of the experiment based on the first codebook are summarized in Table 1.
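The codebook construction step can be illustrated with a plain implementation of Lloyd's K-Means algorithm, where the resulting centroids serve as visual codewords. This is a toy sketch, not the setup used in the experiments: the descriptors are assumed to be already extracted and given as lists of floats, and the initialization by random sampling, the fixed number of iterations and the `seed` parameter are our own choices.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_codebook(descriptors, k, iterations=20, seed=0):
    """Lloyd's algorithm: return k centroids to be used as visual codewords."""
    rng = random.Random(seed)
    centroids = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(iterations):
        # Assignment step: attach every descriptor to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            nearest = min(range(k), key=lambda j: squared_distance(d, centroids[j]))
            clusters[nearest].append(d)
        # Update step: move each centroid to the mean of its members.
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids
```

In practice, SIFT descriptors have 128 dimensions and codebooks are much larger, so an optimized library implementation would normally replace this loop.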

The values in the first three matrices are correlations, where positive values indicate high similarity and non-positive values indicate low similarity. As we can see, the concept WordNet:TV is weakly correlated with the chosen LSCOM concepts, and the concept WordNet:House is highly correlated with LSCOM:Natural Disasters and LSCOM:Single Family Homes but not with LSCOM:US Flags. This is consistent with the TRECVID 2005 data, considering that the images annotated with LSCOM:US Flags are mostly images from speeches of politicians during presidential elections. An example of an LSCOM image annotation that could be extended to WordNet with the help of the concept mapping is given in Fig. 1.

6 Conclusion

The paper proposes an ontology matching technique to solve interoperability issues in the area of semantic image annotation and retrieval. In particular, we have addressed the problem of bridging the semantic gap with the help of a generic instance-based ontology matching approach which aims at automatically producing concept-based annotations enriched with a lexical description of the concepts. In preliminary experiments, we have tested a concept similarity measure on two small sets of concepts taken from the LSCOM ontology and WordNet/LabelMe. Our results are in good agreement with the nature of the instances associated with the selected LSCOM concepts. However, the efficiency of the approach has to be tested on larger sets of concepts (currently in progress). A large-scale application would also allow us to benefit from all the semantic relations in WordNet, such as hypernymy, meronymy and antonymy. In the future, we plan to investigate the qualities of our automatic approach in terms of retrieval efficiency as compared to approaches that rely solely on manual mappings.

Acknowledgments. This work is funded by the French National Research Agency (ANR) through the COSINUS program (project COLLAVIZ ANR-08-COSI-003) and by the Île-de-France region through the SEBASTIAN2 project (Cap Digital cluster).

References

1. Dasiopoulou, Tzouvaras, I. Kompatsiaris, and M.G. Strintzis. Enquiring MPEG-7 based multimedia ontologies. Multimedia Tools and Applications, pages 1–40, 2010.
2. Deng, Dong, Socher, L.J. Li, Li, and Fei-Fei. ImageNet: a large-scale hierarchical image database. In CVPR, pages 710–719, 2009.
3. Fan, Luo, Shen, and Yang. Integrating visual and semantic contexts for topic network generation and word sense disambiguation. ACM CIVR'09, pages 1–8, 2009.
4. Guyon and Elisseeff. An introduction to variable and feature selection. JMLR, 3(1):1157–1182, 2003.
5. Hudelot, Maillot, and Thonnat. Symbol grounding for semantic image interpretation: from image data to semantics. In SKCV-Workshop, ICCV, 2005.
6. Isaac, L. van der Meij, S. Schlobach, and Wang. An empirical study of instance-based ontology matching. The Semantic Web, pages 253–266, 2008.
7. Y.G. Jiang, Yang, C.W. Ngo, and A.G. Hauptmann. Representations of keypoint-based semantic concept detection: A comprehensive study. IEEE Trans. on Multimedia, in press, 2010.
8. Koskela and Smeaton. An empirical study of inter-concept similarities in multimedia ontologies. In CIVR'07, pages 464–471. ACM, 2007.
9. G.A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
10. B.C. Russell, Torralba, K.P. Murphy, and W.T. Freeman. LabelMe: a database and web-based tool for image annotation. IJCV, 77(1):157–173, 2008.
11. A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the end of the early years. IEEE Trans. Patt. An. Mach. Intell., pages 1349–1380, 2000.
12. J.R. Smith and S.F. Chang. Large-scale concept ontology for multimedia. IEEE Multimedia, 13(3):86–91, 2006.
13. C.G.M. Snoek, B. Huurnink, L. Hollink, M. De Rijke, G. Schreiber, and M. Worring. Adding semantics to detectors for video retrieval. IEEE Trans. on Mult., 9(5):975–986, 2007.
14. Todorov, Geibel, and K.-U. Kühnberger. Extensional ontology matching with variable selection for support vector machines. In CISIS, pages 962–968. IEEE Computer Society Press, 2010.
15. Lei, Xian-Sheng Hua, Nenghai Yu, Wei-Ying Ma, and Shipeng Li. Flickr distance. In MM'08, pages 31–40. ACM, 2008.