
Semi-Automatic Multimedia Metadata Integration

Samir Amir

samir.amir@li

Ioan Marius Bilasco

marius.bilasco@li

Taner Danisman

taner.danisman@li

Thierry Urruty

thierry.urruty@li

Chabane Djeraba

chabane.djeraba@li

LIFL UMR CNRS 8022, University of Lille1, Telecom-Lille1, Villeneuve d'Ascq, France

The recent growth of multimedia in our lives requires extensive use of metadata for multimedia management. Consequently, many heterogeneous metadata standards have appeared, and several integration techniques have been proposed to deal with this heterogeneity. These integrations are performed manually, which is costly and time-consuming. This paper presents a new system for the semi-automatic integration of metadata, which exploits several types of information available in metadata schemas.

1. INTRODUCTION

Multimedia resources play an increasingly pervasive role in our lives, so there is a growing need to manage them. This need has led to the appearance of several metadata standards [3], which are heterogeneous because they were created by independent communities. To resolve this heterogeneity, several solutions have been proposed to integrate heterogeneous metadata. However, these solutions are carried out by human experts, which is costly and time-consuming. Moreover, the integration must be updated every time a new standard appears. An intelligent metadata integration solution is therefore needed to address the interoperability problem by providing an automatic system for mapping between metadata. To do so, tools and mechanisms must resolve the semantic and structural heterogeneity and align terms between metadata, a task in which schema matching plays a central role. Among the schema matching approaches that have been tried, we highlight the success of the work in [5] and [6]. However, both consider only one type of context, which makes them not efficient when the schemas to be matched have high structural heterogeneity. In this paper, a new schema matching-based approach for XML metadata integration is proposed. In particular, we propose a new matching technique that exploits semantic and structural information in a manner that increases matching accuracy.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

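[Figure 1: overview of the proposed matching system. Labels recovered from the diagram: Existing standards, Mediated Schema, Schema Parsing, Node Names, Comments and Tokens, Tokenization, Linguistic Filtering, Name Similarity Calculation, Comment Similarity Calculation, Linguistic Similarity, Ancestor Context Similarity, Leaf Context Similarity, (n:m) mappings, Pre-Processing.]

2. THE PROPOSED APPROACH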

In this section, we describe the different steps of the proposed matching system, as shown in Figure 1: pre-processing, linguistic similarity computation, and structural similarity computation.

2.1 Pre-Processing

After modeling each XML Schema as a directed labeled graph, we parse all entities involved in the matching process (elements, attributes, and the comments attached to them). These entities are then filtered and normalized using tokenization, lemmatization, and a stopword list.
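The normalization step can be illustrated with a minimal sketch. The stopword list, the lemma lookup table, and the camelCase splitting rule below are assumptions made for illustration only; the paper does not specify which resources or splitting rules the system uses.

```python
import re

# Hypothetical stopword list and lemma table; a real system would plug in
# a full stopword list and a proper lemmatizer.
STOPWORDS = {"the", "of", "a", "an", "and", "for", "in"}
LEMMAS = {"creators": "creator", "titles": "title", "dates": "date"}


def tokenize(label: str) -> list[str]:
    """Split a node name or comment on separators and camelCase boundaries."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", label)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]


def normalize(label: str) -> list[str]:
    """Tokenize, lemmatize via the lookup table, and drop stopwords."""
    tokens = (LEMMAS.get(t, t) for t in tokenize(label))
    return [t for t in tokens if t not in STOPWORDS]
```

For instance, `normalize("dateOfCreation")` yields the tokens `date` and `creation`, which are the units compared in the later similarity steps.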

2.2 Linguistic Similarity Computation

This phase computes the linguistic similarity between every pair of XML Schema nodes using their names and comments.

2.2.1 Names Matching

We calculate the similarity between all node pairs in the two schemas. We first make the tokens explicit by using WordNet: each node $n_i$, represented by a set of tokens $M_i$, obtains a set of synonyms (synset) for each token $m_j \in M_i$. $M'_i$ is the final result, which groups all the synsets returned by the explicitation of $M_i$:

$$M'_i = M_i \cup \{\, m_k \mid \exists\, m_j \in M_i \wedge m_k \in synset(m_j) \,\} \qquad (1)$$
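The token expansion can be sketched as follows. The `SYNSETS` table is a toy stand-in for WordNet with hypothetical entries; an actual system would query WordNet for each token instead.

```python
# Toy synonym table standing in for WordNet (hypothetical data).
SYNSETS = {
    "author": {"writer", "creator"},
    "title": {"name", "heading"},
}


def expand(tokens: set[str]) -> set[str]:
    """Return the token set enlarged with the synonyms of each of its tokens."""
    expanded = set(tokens)
    for token in tokens:
        # Tokens without an entry contribute no synonyms.
        expanded |= SYNSETS.get(token, set())
    return expanded
```

Tokens that have no synonym entry, such as an identifier fragment, simply remain in the expanded set unchanged.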

We compute the similarity $S_{name}$ between all node pairs. For each node pair $(n_1, n_2)$, we calculate $S_{name}$ by applying the Jaro-Winkler metric (JW) [1] between each token $m_i \in M'_1$ and all tokens $m_j \in M_2$ (and vice versa). We take the maximum score (MJW) for each token $m_i$. Finally, the average of the best similarities is calculated:

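$$S_{name}(n_1, n_2) = \frac{\sum_{m_i \in M'_1} MJW(m_i, M_2) + \sum_{m_j \in M'_2} MJW(m_j, M_1)}{|M'_1| + |M'_2|} \qquad (2)$$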
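A minimal sketch of the name similarity is given below. It implements the standard Jaro-Winkler metric in plain Python, takes the best score per token, and averages the best scores in both directions. The normalization by the total number of expanded tokens and the function names (`best_match`, `s_name`) are assumptions made for illustration.

```python
def jaro(s: str, t: str) -> float:
    """Standard Jaro similarity between two strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by the common prefix (at most 4 characters)."""
    sim = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


def best_match(token: str, others: set[str]) -> float:
    """MJW: highest Jaro-Winkler score of `token` against a token set."""
    return max((jaro_winkler(token, o) for o in others), default=0.0)


def s_name(m1: set[str], m1_exp: set[str],
           m2: set[str], m2_exp: set[str]) -> float:
    """Average of the best scores, computed in both directions."""
    total = (sum(best_match(t, m2) for t in m1_exp)
             + sum(best_match(t, m1) for t in m2_exp))
    count = len(m1_exp) + len(m2_exp)
    return total / count if count else 0.0
```

Here `m1` and `m2` are the original token sets of the two nodes and `m1_exp`, `m2_exp` their synonym-expanded versions, so a synonym on one side can match an original token on the other.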

We apply tf-idf to compute the similarity between comments. All comments in the two schemas are treated as documents, and each node is represented by a vector whose coordinates are tf-idf weights. The similarity between two nodes is then measured between the vectors corresponding to their comments. Let $v = (w_1, w_2, \ldots, w_P)$ be the vector representing a node $n$, where $U$ is the set of distinct words in all comments of the two schemas and $P = |U|$. The $i$-th element $w_i$ of $v$ is calculated as follows:

$$w_i = tf_i \times idf_i, \qquad idf_i = \log_2 \frac{N}{b_i} \qquad (3)$$

where $tf_i$ is the term frequency, i.e. the number of times the $i$-th word of $U$ appears in the comment corresponding to $n$; $idf_i$ is the log-scaled inverse of the fraction of comments that contain the $i$-th word; $N$ is the total number of comments in both schemas; and $b_i$ is the number of comments that contain the $i$-th word at least once. The similarity $S_{comment}$ is the cosine of the angle between the two vectors:

$$S_{comment}(v_i, v_j) = \frac{\sum_{k=1}^{P} w_{ik}\, w_{jk}}{\sqrt{\sum_{k=1}^{P} (w_{ik})^2 \; \sum_{k=1}^{P} (w_{jk})^2}} \qquad (4)$$
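The comment similarity can be sketched in a few lines of plain Python. The function names and the toy comments in the usage note are assumptions; the weighting follows the tf-idf definition given in the text, with each comment already tokenized.

```python
import math
from collections import Counter


def tfidf_vectors(comments: list[list[str]]) -> list[list[float]]:
    """Build one tf-idf vector per tokenized comment over the shared vocabulary."""
    vocab = sorted({word for comment in comments for word in comment})
    n = len(comments)
    # b_i: number of comments containing each word at least once.
    doc_freq = Counter(word for comment in comments for word in set(comment))
    vectors = []
    for comment in comments:
        tf = Counter(comment)
        vectors.append([tf[w] * math.log2(n / doc_freq[w]) for w in vocab])
    return vectors


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two weight vectors (0.0 if either is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

With the comments `["title", "of", "work"]`, `["title", "of", "resource"]` and `["creation", "date"]`, the first two vectors share weighted terms and obtain a positive score, while the first and third share no word and score zero. Note that a word occurring in every comment receives a weight of zero under this definition.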

The result of the above steps is a linguistic similarity matrix $lSim$:

$$lSim(n_i, n_j) = \mu_1 \times S_{name}(n_i, n_j) + \mu_2 \times S_{comment}(n_i, n_j) \qquad (5)$$

where $\mu_1 + \mu_2 = 1$ and $(\mu_1, \mu_2) \geq 0$.

2.3 Structural Similarity Computation

Linguistic similarity computation may produce several false positive candidates. To eliminate them, the structural similarity is computed by considering three kinds of node contexts [4]: ancestor context, immediate descendants context, and leaf context.

2.3.1 Ancestor Context

The ancestor context of a node $n_i$ is defined as the path $p_i$ extending from the root node of the schema to $n_i$. The ancestor context similarity $ancSim$ between $(n_i, n_j)$ is based on the resemblance between their paths $(p_i, p_j)$, which is measured with three scores established in [2]:

$$ancSim(n_i, n_j) = lSim(n_i, n_j) \times \big(\delta\, LCS_n(p_i, p_j) - \theta\, GAP(p_i, p_j) - \lambda\, LD(p_i, p_j)\big) \qquad (6)$$

where $\delta + \theta + \lambda = 1$ and $(\delta, \theta, \lambda) \geq 0$.
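The path resemblance can be sketched as follows. The paper does not spell out the three scores, so the definitions used here are assumptions chosen for illustration: the longest common subsequence normalized by the longer path, the fraction of skipped nodes between matched positions, and the normalized length difference. The last two are treated as penalties, and the weight values are hypothetical.

```python
def lcs_alignment(p: list[str], q: list[str]) -> list[tuple[int, int]]:
    """Return index pairs (i, j) of one longest common subsequence of p and q."""
    n, m = len(p), len(q)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if p[i] == q[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if p[i] == q[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def path_scores(p: list[str], q: list[str]) -> tuple[float, float, float]:
    """Return (LCSn, GAP, LD) under the illustrative definitions above."""
    pairs = lcs_alignment(p, q)
    k = len(pairs)
    longest = max(len(p), len(q), 1)
    lcs_n = k / longest
    # Nodes of q skipped between consecutive matched positions.
    skipped = sum(b[1] - a[1] - 1 for a, b in zip(pairs, pairs[1:]))
    gap = skipped / (skipped + k) if k else 0.0
    ld = abs(len(p) - len(q)) / longest
    return lcs_n, gap, ld


def anc_sim(l_sim: float, p: list[str], q: list[str],
            weights: tuple[float, float, float] = (0.6, 0.2, 0.2)) -> float:
    """Linguistic similarity scaled by the weighted path resemblance."""
    w_lcs, w_gap, w_ld = weights  # hypothetical values summing to 1
    lcs_n, gap, ld = path_scores(p, q)
    return l_sim * (w_lcs * lcs_n - w_gap * gap - w_ld * ld)
```

For the paths `movie/credits/director` and `film/credits/crew/director`, two nodes are shared in order, one node is skipped between them, and the paths differ in length by one, so the resemblance is reduced accordingly.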

2.3.2 Immediate Descendants Context

To obtain the immediate descendants context similarity $immSim(n_i, n_j)$, we compare the immediate descendants context sets of the two nodes. This is done by using the linguistic similarity $lSim$ between each pair of children in the two sets. We select the matching pairs with maximum similarity values. Finally, the average of the best similarity values is taken.

2.3.3 Leaf Context

The leaf context of a node $n_i$ is the set of leaf nodes of the subtrees rooted at $n_i$. If $l_i \in leaves(n_i)$ is a leaf node, then the context of $l_i$ is given by the path $p_i$ from $n_i$ to $l_i$:

$$leafSim(l_i, l_j) = lSim(l_i, l_j) \times \big(\delta\, LCS_n(p_i, p_j) - \theta\, GAP(p_i, p_j) - \lambda\, LD(p_i, p_j)\big) \qquad (7)$$

where the weights $\delta$, $\theta$ and $\lambda$ are non-negative and sum to 1, as in Equation (6). To obtain the leaf context similarity between two nodes $n_i$ and $n_j$, we compute $leafSim$ between each pair of leaves $l_i \in leaves(n_i)$ and $l_j \in leaves(n_j)$. We then select the matching pairs with the maximum similarity values and take the average of these best values.
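The "best pairs, then average" step used for both the immediate descendants context and the leaf context can be sketched as below. Reading it as a per-element best match taken in both directions is an assumption, as is the helper name `best_match_average`; the pairwise similarity function is passed in, so the same helper serves for children (with the linguistic similarity) and for leaves (with the leaf similarity).

```python
from typing import Callable, Sequence


def best_match_average(set_a: Sequence[str], set_b: Sequence[str],
                       sim: Callable[[str, str], float]) -> float:
    """Average, over both sets, of each element's best similarity in the other set."""
    if not set_a or not set_b:
        return 0.0
    best_a = [max(sim(a, b) for b in set_b) for a in set_a]
    best_b = [max(sim(a, b) for a in set_a) for b in set_b]
    return (sum(best_a) + sum(best_b)) / (len(set_a) + len(set_b))
```

With an exact-match similarity, the sets `["title", "year"]` and `["title", "genre"]` share one of two elements on each side, giving a score of 0.5.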

2.4 Node Similarity

The node similarity $nodeSim$ is obtained by combining the three context scores:

$$nodeSim(n_i, n_j) = \alpha \times ancSim(n_i, n_j) + \beta \times immSim(n_i, n_j) + \gamma \times leafSim(n_i, n_j) \qquad (8)$$

where $\alpha + \beta + \gamma = 1$ and $(\alpha, \beta, \gamma) \geq 0$. Once the structural similarity computation is complete, the system returns, for each source node $n_i$, the $k$ candidate nodes that have the highest $nodeSim$ values above a given threshold (e.g., 0.7).
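The final combination and candidate selection can be sketched as follows. The weight values, the default `k`, and the candidate names in the usage note are hypothetical; only the weighted sum, the threshold of 0.7, and the top-k selection come from the text.

```python
def node_sim(anc: float, imm: float, leaf: float,
             weights: tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Weighted combination of the three context similarities."""
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    w_anc, w_imm, w_leaf = weights
    return w_anc * anc + w_imm * imm + w_leaf * leaf


def top_k(scores: dict[str, float], k: int = 3,
          threshold: float = 0.7) -> list[tuple[str, float]]:
    """Keep candidates strictly above the threshold, best first, at most k."""
    kept = [(node, s) for node, s in scores.items() if s > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)[:k]
```

Given candidate scores such as `{"dc:title": 0.91, "mpeg7:Title": 0.84, "exif:Artist": 0.35}` for one source node, only the first two exceed the threshold and are returned in decreasing order of similarity.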

3. CONCLUSION

Because of the number of existing metadata standards and their heterogeneity, there is great interest in developing an automatic integration solution. Such a solution makes the integration process faster and less expensive. We proposed a new XML Schema matching technique to automate the integration of multimedia metadata. Its core is a linguistic and structural similarity measure that links metadata encoded in different formats. In our ongoing work, we plan to enhance the proposed matching system by making better use of structural information, using the adjacency of nodes to detect further mappings.

[1] M. Bilenko, R. J. Mooney, W. W. Cohen, P. D. Ravikumar, and S. E. Fienberg. Adaptive name matching in information integration. IEEE Intelligent Systems, 18(5):16–23, 2003.

[2] D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass, and A. Soffer. Searching XML documents via XML fragments. In SIGIR, pages 151–158, 2003.

[3] M. Hausenblas. Multimedia vocabularies on the semantic web, July 2005.

[4] M.-L. Lee, L. H. Yang, W. Hsu, and X. Yang. XClust: clustering XML schemas for effective integration. In CIKM, pages 292–299, 2002.

[5] J. Madhavan, P. A. Bernstein, and E. Rahm. Generic schema matching with Cupid. In VLDB, pages 49–58, 2001.

[6] S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In ICDE, pages 117–128, 2002.