Semi-Automatic Multimedia Metadata Integration

Samir Amir, Ioan Marius Bilasco, Taner Danisman, Thierry Urruty and Chabane Djeraba
LIFL UMR CNRS 8022, University of Lille1, Telecom-Lille1
Villeneuve d'Ascq, France
{samir.amir, marius.bilasco, taner.danisman, thierry.urruty, chabane.djeraba}@lifl.fr

ABSTRACT
The growing presence of multimedia in our lives requires an extensive use of metadata for multimedia management. Consequently, many heterogeneous metadata standards have appeared, and several integration techniques have been proposed to deal with this challenge. These integrations are carried out manually, which is costly and time-consuming. This paper presents a new system for the semi-automatic integration of metadata that exploits several types of information available in metadata schemas.

1. INTRODUCTION
Multimedia resources play an increasingly pervasive role in our lives, so there is a growing need to manage such resources. This need is at the origin of several metadata standards [3], which are heterogeneous because they have been created by independent communities. To resolve this heterogeneity, several solutions have been proposed to integrate heterogeneous metadata. However, these solutions rely on human experts, which is costly and time-consuming; moreover, the integration process must be updated every time a new standard appears. In this context, an intelligent metadata integration solution is needed to address the interoperability problem by providing an automatic system for mapping between metadata. To do so, tools and mechanisms must resolve the semantic and structural heterogeneity and align terms between metadata, a task in which schema matching plays a central role. Among the schema matching approaches that have been evaluated, we can highlight the success of the work done in [5, 6]. However, both of them consider only one type of context, which makes them inefficient when the schemas to be matched exhibit high structural heterogeneity. In this paper, a new schema-matching-based approach for XML metadata integration is proposed. In particular, we propose a new matching technique that exploits semantic and structural information in a manner that increases the matching accuracy.

[Figure 1: Matching process phases. Existing standards and a mediated schema go through pre-processing (schema parsing, tokenization of comments and node names, linguistic filtering), linguistic similarity computation (name and comment similarity calculation, supported by WordNet and user-defined synonyms) and structural similarity computation (ancestor context, immediate descendants and leaf context similarity), producing (n:m) mappings.]

2. THE PROPOSED APPROACH
In this section, we describe the different steps of the proposed matching system, as shown in Figure 1: pre-processing, linguistic similarity computation and structural similarity computation.

2.1 Pre-Processing
After modeling the XML Schema as a directed labeled graph, we start by parsing all entities involved in the matching process (elements, attributes and the comments corresponding to these entities). These entities are then filtered and normalized using tokenization, lemmatisation and a stopword list.
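To make this step concrete, the following Python sketch illustrates one possible pre-processing implementation. It is only a minimal sketch under our own assumptions: the schema traversal, the camelCase tokenization rule and the stopword list are illustrative choices and are not taken from the paper.

    # Minimal pre-processing sketch (illustrative, not the authors' code).
    # It parses an XML Schema, collects element/attribute names and their
    # xs:annotation/xs:documentation comments, then tokenizes and filters them.
    import re
    import xml.etree.ElementTree as ET

    XS = "{http://www.w3.org/2001/XMLSchema}"
    STOPWORDS = {"the", "a", "an", "of", "and", "or", "type", "id"}  # assumed list

    def tokenize(label):
        """Split camelCase / delimiter-separated labels into lowercase tokens."""
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", label or "")
        return [p.lower() for p in parts if p.lower() not in STOPWORDS]

    def parse_schema(path):
        """Return {node_name: {'tokens': [...], 'comment_tokens': [...]}}."""
        nodes = {}
        tree = ET.parse(path)
        for elem in tree.iter():
            if elem.tag not in (XS + "element", XS + "attribute"):
                continue
            name = elem.get("name")
            if not name:
                continue
            doc = elem.find(f"./{XS}annotation/{XS}documentation")
            comment = (doc.text or "") if doc is not None else ""
            nodes[name] = {"tokens": tokenize(name),
                           "comment_tokens": tokenize(comment)}
            # Lemmatisation (e.g. with NLTK's WordNetLemmatizer) would be applied
            # to both token lists here; omitted to keep the sketch dependency-free.
        return nodes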
2.2 Linguistic Similarity Computation
This phase is concerned with the computation of the linguistic similarity between every pair of XML Schema nodes, using their names and comments.

2.2.1 Names Matching
We calculate the similarity distance between all node pairs of the two schemas. We first start with the explicitation of tokens using WordNet. For each node $n_i$, represented by a set of tokens $M_i$, a synonym set $synset$ is retrieved for each token $m_i$. $M_i'$ is the final result, which regroups $M_i$ and all the synsets returned by its explicitation:

\[ M_i' = M_i \cup \{ m_k \mid \exists m_j \in M_i \wedge m_k \in synset(m_j) \} \quad (1) \]

We then compute the name similarity $S_{name}$ between all node pairs. For each node pair $(n_1, n_2)$, we calculate the Jaro-Winkler metric (JW) [1] between each token $m_i \in M_1'$ and all tokens $m_j \in M_2'$ (and vice versa), and we keep the maximum score ($MJW$) for each token $m_i$. Finally, the average of the best similarities is calculated:

\[ S_{name}(n_1, n_2) = \frac{\sum_{m_i \in M_1'} MJW(m_i, M_2') + \sum_{m_j \in M_2'} MJW(m_j, M_1')}{|M_1'| + |M_2'|} \quad (2) \]

2.2.2 Comments Matching
We apply tf/idf to calculate the similarity between comments. All the comments of the two schemas are considered as documents, and each node is represented by a vector whose coordinates are tf/idf weights. The similarity between two nodes is then the distance between the vectors corresponding to their comments. Let $v = (w_1, w_2, \ldots, w_P)$ be the vector representing a node $n$, where $P = |U|$ is the number of distinct words in all the comments of the two schemas. The $i$-th element $w_i$ of the vector $v$, which represents the node $n$ in a schema, is calculated as follows:

\[ w_i = tf_i \cdot idf_i, \qquad idf_i = \log_2 \frac{N}{b_i} \quad (3) \]

where $tf_i$ is the term frequency, i.e. the number of times the $i$-th word of $U$ appears in the comment corresponding to $n_i$; $idf_i$ is the inverse of the percentage of comments which contain the word $w_i$; $N$ is the number of comments of both schemas; and $b_i$ is the number of comments which contain the word $w_i$ at least once. The similarity $S_{comment}$ is the distance between the vectors:

\[ S_{comment}(v_i, v_j) = \frac{\sum_{k=1}^{P} w_{ik} w_{jk}}{\sqrt{\sum_{k=1}^{P} w_{ik}^2 \cdot \sum_{k=1}^{P} w_{jk}^2}} \quad (4) \]

The result of the above processes is a linguistic similarity matrix $lSim$:

\[ lSim(n_i, n_j) = \mu_1 \cdot S_{name}(n_i, n_j) + \mu_2 \cdot S_{comment}(n_i, n_j) \quad (5) \]

where $\mu_1 + \mu_2 = 1$ and $\mu_1, \mu_2 \geq 0$.

2.3 Structural Similarity Computation
The linguistic similarity computation may produce several false positive candidates. In order to eliminate them, the structural similarity is computed by considering three kinds of node contexts [4]: the ancestor context, the immediate descendants context and the leaf context.

2.3.1 Ancestor Context
The ancestor context of a node $n_i$ is defined as the path $p_i$ extending from the root node of the schema to $n_i$. The ancestor context similarity $ancSim$ between $(n_i, n_j)$ is based on a resemblance measure between their paths $(p_i, p_j)$, obtained by calculating three scores established in [2]:

\[ ancSim(n_i, n_j) = lSim(n_i, n_j) \cdot \big( \delta\, LCS_n(p_i, p_j) - \theta\, GAP(p_i, p_j) - \lambda\, LD(p_i, p_j) \big) \quad (6) \]

where $\delta + \theta + \lambda = 1$ and $\delta, \theta, \lambda \geq 0$.

2.3.2 Immediate Descendants Context
To obtain the immediate descendants context similarity $immSim(n_i, n_j)$, we compare the immediate descendants sets of the two nodes. This is done by using the linguistic similarity $lSim$ between each pair of children of the two sets. We select the matching pairs with the maximum similarity values, and the average of the best similarity values is taken.

2.3.3 Leaf Context
The leaf context of a node $n_i$ is the set of leaf nodes of the subtree rooted at $n_i$. If $l_i \in leaves(n_i)$ is a leaf node, then the context of $l_i$ is given by the path $p_i$ from $n_i$ to $l_i$:

\[ leafSim(l_i, l_j) = lSim(l_i, l_j) \cdot \big( \delta\, LCS_n(p_i, p_j) - \theta\, GAP(p_i, p_j) - \lambda\, LD(p_i, p_j) \big) \quad (7) \]

To obtain the leaf context similarity between two nodes $n_i$ and $n_j$, we compute $leafSim$ between each pair of leaves $l_i \in leaves(n_i)$ and $l_j \in leaves(n_j)$. We then select the matching pairs with the maximum similarity values, and the average of the best similarity values is taken.

2.4 Node Similarity
The node similarity $nodeSim$ is obtained by combining the three context scores:

\[ nodeSim(n_i, n_j) = \alpha \cdot ancSim(n_i, n_j) + \beta \cdot immSim(n_i, n_j) + \gamma \cdot leafSim(n_i, n_j) \quad (8) \]

where $\alpha + \beta + \gamma = 1$ and $\alpha, \beta, \gamma \geq 0$. Once the structural similarity computation is done, the system returns, for each source node $n_i$, the $k$ candidate nodes having the maximum values of $nodeSim$ above a given threshold (e.g. 0.7).
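As an illustration of the name-matching step of Section 2.2.1, the sketch below shows how Equations (1) and (2) could be implemented. It is a minimal sketch under stated assumptions: the WordNet synonym expansion follows the description above (using NLTK's WordNet interface), but for brevity the string comparison uses difflib's ratio as a stand-in where the paper uses the Jaro-Winkler metric [1]; a real implementation would plug in a Jaro-Winkler routine instead.

    # Illustrative sketch of the name-matching step (Equations 1 and 2).
    # difflib's ratio is a stand-in for the Jaro-Winkler metric of the paper.
    from difflib import SequenceMatcher
    from nltk.corpus import wordnet  # requires the NLTK WordNet corpus

    def expand(tokens):
        """Equation (1): M' = M plus the WordNet synonyms of every token."""
        expanded = set(tokens)
        for t in tokens:
            for syn in wordnet.synsets(t):
                expanded.update(l.lower() for l in syn.lemma_names())
        return expanded

    def string_sim(a, b):
        """Stand-in string similarity (replace with Jaro-Winkler [1])."""
        return SequenceMatcher(None, a, b).ratio()

    def s_name(tokens1, tokens2):
        """Equation (2): for each token of one expanded set, keep its best
        score against the other expanded set, then average all best scores."""
        m1, m2 = expand(tokens1), expand(tokens2)
        if not m1 or not m2:
            return 0.0
        best1 = sum(max(string_sim(a, b) for b in m2) for a in m1)
        best2 = sum(max(string_sim(b, a) for a in m1) for b in m2)
        return (best1 + best2) / (len(m1) + len(m2))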
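The sketch below illustrates one possible implementation of the combination of Equation (8) and of the candidate selection described in Section 2.4. The weight values, the value of k and the data structures are assumptions made for the example; only the 0.7 threshold comes from the text above.

    # Illustrative sketch of Equation (8) and of the top-k candidate selection.
    def node_sim(anc_sim, imm_sim, leaf_sim, alpha=0.4, beta=0.3, gamma=0.3):
        """Weighted combination of the three context scores
        (alpha + beta + gamma must equal 1)."""
        assert abs(alpha + beta + gamma - 1.0) < 1e-9
        return alpha * anc_sim + beta * imm_sim + gamma * leaf_sim

    def top_k_candidates(source, targets, context_scores, k=3, threshold=0.7):
        """Return up to k target nodes whose nodeSim with `source` reaches the
        threshold, ranked by decreasing similarity. `context_scores[(s, t)]`
        is assumed to hold the (ancSim, immSim, leafSim) triple for a pair."""
        ranked = []
        for target in targets:
            anc, imm, leaf = context_scores[(source, target)]
            score = node_sim(anc, imm, leaf)
            if score >= threshold:
                ranked.append((target, score))
        ranked.sort(key=lambda pair: pair[1], reverse=True)
        return ranked[:k]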
3. CONCLUSION
Due to the number of existing metadata standards and their heterogeneity, there is great interest in developing an automatic integration solution: such a solution makes the integration process faster and less expensive. We proposed a new XML Schema matching technique to automate the integration of multimedia metadata. In particular, we proposed a linguistic and structural similarity measure linking metadata encoded in different formats. In our ongoing work, we plan to enhance the proposed matching system through a better use of structural information, using the adjacency of nodes to detect further mappings.

4. REFERENCES
[1] M. Bilenko, R. J. Mooney, W. W. Cohen, P. D. Ravikumar, and S. E. Fienberg. Adaptive name matching in information integration. IEEE Intelligent Systems, 18(5):16–23, 2003.
[2] D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass, and A. Soffer. Searching XML documents via XML fragments. In SIGIR, pages 151–158, 2003.
[3] M. Hausenblas. Multimedia vocabularies on the Semantic Web, July 2005.
[4] M.-L. Lee, L. H. Yang, W. Hsu, and X. Yang. XClust: Clustering XML schemas for effective integration. In CIKM, pages 292–299, 2002.
[5] J. Madhavan, P. A. Bernstein, and E. Rahm. Generic schema matching with Cupid. In VLDB, pages 49–58, 2001.
[6] S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In ICDE, pages 117–128, 2002.