Semi-Automatic Multimedia Metadata Integration

Samir Amir, Ioan Marius Bilasco, Taner Danisman, Thierry Urruty and Chabane Djeraba
LIFL UMR CNRS 8022, University of Lille1, Telecom-Lille1
Villeneuve d'Ascq, France
{samir.amir, marius.bilasco, taner.danisman, thierry.urruty, chabane.djeraba}@lifl.fr

ABSTRACT
The growing presence of multimedia in our lives requires an extensive use of metadata for multimedia management. Consequently, many heterogeneous metadata standards have appeared, and several integration techniques have been proposed to deal with this challenge. These integrations are carried out manually, which is costly and time-consuming. This paper presents a new system for the semi-automatic integration of metadata that exploits several types of information available in metadata schemas.

1. INTRODUCTION
Multimedia resources play an increasingly pervasive role in our lives, so there is a growing need to manage such resources. This need is at the origin of several metadata standards [3], which are heterogeneous because they have been created by independent communities. To resolve this heterogeneity, several solutions have been proposed to integrate heterogeneous metadata. However, these solutions rely on human experts, which is costly and time-consuming; moreover, the integration process must be updated every time a new standard appears. In this context, an intelligent metadata integration solution is needed to address the interoperability problem by providing an automatic system for mapping between metadata. To do so, tools and mechanisms must resolve the semantic and structural heterogeneity and align terms between metadata, a task in which schema matching plays a central role. Among the schema matching approaches that have been evaluated, we can highlight the success of the work done in [5, 6]. However, both of them consider only one type of context, which makes them inefficient when the schemas to be matched exhibit high structural heterogeneity. In this paper, a new schema-matching-based approach for XML metadata integration is proposed. In particular, we propose a new matching technique that exploits semantic and structural information in a manner that increases the matching accuracy.

[Figure 1: Matching process phases. Existing standards and a mediated schema go through pre-processing (schema parsing, tokenization of comments and node names, linguistic filtering), linguistic similarity computation (name and comment similarity calculation, supported by WordNet and user-defined synonyms) and structural similarity computation (ancestor context, immediate descendants and leaf context similarity), producing (n:m) mappings.]

2. THE PROPOSED APPROACH
In this section, we describe the different steps of the proposed matching system, as shown in Figure 1: pre-processing, linguistic similarity computation and structural similarity computation.

2.1 Pre-Processing
After modeling the XML Schema as a directed labeled graph, we start by parsing all entities involved in the matching process (elements, attributes and the comments corresponding to these entities). These entities are then filtered and normalized using tokenization, lemmatisation and a stopword list.
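To make this step concrete, the following Python sketch illustrates one possible pre-processing implementation. It is only a minimal sketch under our own assumptions: the schema traversal, the camelCase tokenization rule and the stopword list are illustrative choices and are not taken from the paper.

    # Minimal pre-processing sketch (illustrative, not the authors' code).
    # It parses an XML Schema, collects element/attribute names and their
    # xs:annotation/xs:documentation comments, then tokenizes and filters them.
    import re
    import xml.etree.ElementTree as ET

    XS = "{http://www.w3.org/2001/XMLSchema}"
    STOPWORDS = {"the", "a", "an", "of", "and", "or", "type", "id"}  # assumed list

    def tokenize(label):
        """Split camelCase / delimiter-separated labels into lowercase tokens."""
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", label or "")
        return [p.lower() for p in parts if p.lower() not in STOPWORDS]

    def parse_schema(path):
        """Return {node_name: {'tokens': [...], 'comment_tokens': [...]}}."""
        nodes = {}
        tree = ET.parse(path)
        for elem in tree.iter():
            if elem.tag not in (XS + "element", XS + "attribute"):
                continue
            name = elem.get("name")
            if not name:
                continue
            doc = elem.find(f"./{XS}annotation/{XS}documentation")
            comment = (doc.text or "") if doc is not None else ""
            nodes[name] = {"tokens": tokenize(name),
                           "comment_tokens": tokenize(comment)}
            # Lemmatisation (e.g. with NLTK's WordNetLemmatizer) would be applied
            # to both token lists here; omitted to keep the sketch dependency-free.
        return nodes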
2.2 Linguistic Similarity Computation
This phase is concerned with the computation of the linguistic similarity between every pair of XML Schema nodes, using their names and comments.

2.2.1 Names Matching
We calculate the similarity distance between all node pairs of the two schemas. We first start with the explicitation of tokens using WordNet. For each node $n_i$, represented by a set of tokens $M_i$, a synonym set $synset$ is retrieved for each token $m_i$. $M_i'$ is the final result, which regroups $M_i$ and all the synsets returned by its explicitation:

\[ M_i' = M_i \cup \{ m_k \mid \exists m_j \in M_i \wedge m_k \in synset(m_j) \} \quad (1) \]

We then compute the name similarity $S_{name}$ between all node pairs. For each node pair $(n_1, n_2)$, we calculate the Jaro-Winkler metric (JW) [1] between each token $m_i \in M_1'$ and all tokens $m_j \in M_2'$ (and vice versa), and we keep the maximum score ($MJW$) for each token $m_i$. Finally, the average of the best similarities is calculated:

\[ S_{name}(n_1, n_2) = \frac{\sum_{m_i \in M_1'} MJW(m_i, M_2') + \sum_{m_j \in M_2'} MJW(m_j, M_1')}{|M_1'| + |M_2'|} \quad (2) \]

2.2.2 Comments Matching
We apply tf/idf to calculate the similarity between comments. All the comments of the two schemas are considered as documents, and each node is represented by a vector whose coordinates are tf/idf weights. The similarity between two nodes is then the distance between the vectors corresponding to their comments. Let $v = (w_1, w_2, \ldots, w_P)$ be the vector representing a node $n$, where $P = |U|$ is the number of distinct words in all the comments of the two schemas. The $i$-th element $w_i$ of the vector $v$, which represents the node $n$ in a schema, is calculated as follows:

\[ w_i = tf_i \cdot idf_i, \qquad idf_i = \log_2 \frac{N}{b_i} \quad (3) \]

where $tf_i$ is the term frequency, i.e. the number of times the $i$-th word of $U$ appears in the comment corresponding to $n_i$; $idf_i$ is the inverse of the percentage of comments which contain the word $w_i$; $N$ is the number of comments of both schemas; and $b_i$ is the number of comments which contain the word $w_i$ at least once. The similarity $S_{comment}$ is the distance between the vectors:

\[ S_{comment}(v_i, v_j) = \frac{\sum_{k=1}^{P} w_{ik} w_{jk}}{\sqrt{\sum_{k=1}^{P} w_{ik}^2 \cdot \sum_{k=1}^{P} w_{jk}^2}} \quad (4) \]

The result of the above processes is a linguistic similarity matrix $lSim$:

\[ lSim(n_i, n_j) = \mu_1 \cdot S_{name}(n_i, n_j) + \mu_2 \cdot S_{comment}(n_i, n_j) \quad (5) \]

where $\mu_1 + \mu_2 = 1$ and $\mu_1, \mu_2 \geq 0$.

2.3 Structural Similarity Computation
The linguistic similarity computation may produce several false positive candidates. In order to eliminate them, the structural similarity is computed by considering three kinds of node contexts [4]: the ancestor context, the immediate descendants context and the leaf context.

2.3.1 Ancestor Context
The ancestor context of a node $n_i$ is defined as the path $p_i$ extending from the root node of the schema to $n_i$. The ancestor context similarity $ancSim$ between $(n_i, n_j)$ is based on a resemblance measure between their paths $(p_i, p_j)$, obtained by calculating three scores established in [2]:

\[ ancSim(n_i, n_j) = lSim(n_i, n_j) \cdot \big( \delta\, LCS_n(p_i, p_j) - \theta\, GAP(p_i, p_j) - \lambda\, LD(p_i, p_j) \big) \quad (6) \]

where $\delta + \theta + \lambda = 1$ and $\delta, \theta, \lambda \geq 0$.

2.3.2 Immediate Descendants Context
To obtain the immediate descendants context similarity $immSim(n_i, n_j)$, we compare the immediate descendants sets of the two nodes. This is done by using the linguistic similarity $lSim$ between each pair of children of the two sets. We select the matching pairs with the maximum similarity values, and the average of the best similarity values is taken.

2.3.3 Leaf Context
The leaf context of a node $n_i$ is the set of leaf nodes of the subtree rooted at $n_i$. If $l_i \in leaves(n_i)$ is a leaf node, then the context of $l_i$ is given by the path $p_i$ from $n_i$ to $l_i$:

\[ leafSim(l_i, l_j) = lSim(l_i, l_j) \cdot \big( \delta\, LCS_n(p_i, p_j) - \theta\, GAP(p_i, p_j) - \lambda\, LD(p_i, p_j) \big) \quad (7) \]

To obtain the leaf context similarity between two nodes $n_i$ and $n_j$, we compute $leafSim$ between each pair of leaves $l_i \in leaves(n_i)$ and $l_j \in leaves(n_j)$. We then select the matching pairs with the maximum similarity values, and the average of the best similarity values is taken.

2.4 Node Similarity
The node similarity $nodeSim$ is obtained by combining the three context scores:

\[ nodeSim(n_i, n_j) = \alpha \cdot ancSim(n_i, n_j) + \beta \cdot immSim(n_i, n_j) + \gamma \cdot leafSim(n_i, n_j) \quad (8) \]

where $\alpha + \beta + \gamma = 1$ and $\alpha, \beta, \gamma \geq 0$. Once the structural similarity computation is done, the system returns, for each source node $n_i$, the $k$ candidate nodes having the maximum values of $nodeSim$ above a given threshold (e.g. 0.7).
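As an illustration of the name-matching step of Section 2.2.1, the sketch below shows how Equations (1) and (2) could be implemented. It is a minimal sketch under stated assumptions: the WordNet synonym expansion follows the description above (using NLTK's WordNet interface), but for brevity the string comparison uses difflib's ratio as a stand-in where the paper uses the Jaro-Winkler metric [1]; a real implementation would plug in a Jaro-Winkler routine instead.

    # Illustrative sketch of the name-matching step (Equations 1 and 2).
    # difflib's ratio is a stand-in for the Jaro-Winkler metric of the paper.
    from difflib import SequenceMatcher
    from nltk.corpus import wordnet  # requires the NLTK WordNet corpus

    def expand(tokens):
        """Equation (1): M' = M plus the WordNet synonyms of every token."""
        expanded = set(tokens)
        for t in tokens:
            for syn in wordnet.synsets(t):
                expanded.update(l.lower() for l in syn.lemma_names())
        return expanded

    def string_sim(a, b):
        """Stand-in string similarity (replace with Jaro-Winkler [1])."""
        return SequenceMatcher(None, a, b).ratio()

    def s_name(tokens1, tokens2):
        """Equation (2): for each token of one expanded set, keep its best
        score against the other expanded set, then average all best scores."""
        m1, m2 = expand(tokens1), expand(tokens2)
        if not m1 or not m2:
            return 0.0
        best1 = sum(max(string_sim(a, b) for b in m2) for a in m1)
        best2 = sum(max(string_sim(b, a) for a in m1) for b in m2)
        return (best1 + best2) / (len(m1) + len(m2))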
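The sketch below illustrates one possible implementation of the combination of Equation (8) and of the candidate selection described in Section 2.4. The weight values, the value of k and the data structures are assumptions made for the example; only the 0.7 threshold comes from the text above.

    # Illustrative sketch of Equation (8) and of the top-k candidate selection.
    def node_sim(anc_sim, imm_sim, leaf_sim, alpha=0.4, beta=0.3, gamma=0.3):
        """Weighted combination of the three context scores
        (alpha + beta + gamma must equal 1)."""
        assert abs(alpha + beta + gamma - 1.0) < 1e-9
        return alpha * anc_sim + beta * imm_sim + gamma * leaf_sim

    def top_k_candidates(source, targets, context_scores, k=3, threshold=0.7):
        """Return up to k target nodes whose nodeSim with `source` reaches the
        threshold, ranked by decreasing similarity. `context_scores[(s, t)]`
        is assumed to hold the (ancSim, immSim, leafSim) triple for a pair."""
        ranked = []
        for target in targets:
            anc, imm, leaf = context_scores[(source, target)]
            score = node_sim(anc, imm, leaf)
            if score >= threshold:
                ranked.append((target, score))
        ranked.sort(key=lambda pair: pair[1], reverse=True)
        return ranked[:k]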
3. CONCLUSION
Due to the number of existing metadata standards and their heterogeneity, there is great interest in developing an automatic integration solution: such a solution makes the integration process faster and less expensive. We proposed a new XML Schema matching technique to automate the integration of multimedia metadata. In particular, we proposed a linguistic and structural similarity measure linking metadata encoded in different formats. In our ongoing work, we plan to enhance the proposed matching system through a better use of structural information, using the adjacency of nodes to detect further mappings.

4. REFERENCES
[1] M. Bilenko, R. J. Mooney, W. W. Cohen, P. D. Ravikumar, and S. E. Fienberg. Adaptive name matching in information integration. IEEE Intelligent Systems, 18(5):16–23, 2003.
[2] D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass, and A. Soffer. Searching XML documents via XML fragments. In SIGIR, pages 151–158, 2003.
[3] M. Hausenblas. Multimedia vocabularies on the Semantic Web, July 2005.
[4] M.-L. Lee, L. H. Yang, W. Hsu, and X. Yang. XClust: Clustering XML schemas for effective integration. In CIKM, pages 292–299, 2002.
[5] J. Madhavan, P. A. Bernstein, and E. Rahm. Generic schema matching with Cupid. In VLDB, pages 49–58, 2001.
[6] S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In ICDE, pages 117–128, 2002.