<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semi-Automatic Multimedia Metadata Integration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Samir Amir</string-name>
          <email>samir.amir@li</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ioan Marius Bilasco</string-name>
          <email>marius.bilasco@li</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Taner Danisman</string-name>
          <email>taner.danisman@li</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thierry Urruty</string-name>
          <email>thierry.urruty@li</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chabane Djeraba</string-name>
          <email>chabane.djeraba@li</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff3">
          <label>3</label>
          <institution>LIFL UMR CNRS 8022, University of Lille1, Telecom-Lille1 Villeneuve d'Ascq</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The recent growth of multimedia in our lives requires extensive use of metadata for multimedia management. Consequently, many heterogeneous metadata standards have appeared. Several integration techniques have been proposed to deal with this heterogeneity, but the integration is performed manually, which is costly and time-consuming. This paper presents a new system for the semi-automatic integration of metadata, which exploits several types of information available in metadata schemas.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Multimedia resources play an increasingly pervasive role
in our lives, and there is a growing need to manage such
resources. This need has led to the appearance of several
metadata standards [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which are
heterogeneous because they were created by independent
communities. To resolve this heterogeneity problem,
several solutions have been proposed to integrate
heterogeneous metadata. However, these solutions are carried out
by human experts, which is costly and time-consuming.
Moreover, the integration must be updated every time a
new standard appears. In this context, an intelligent
metadata integration solution is needed to address the
interoperability problem by providing an automatic system for
mapping between metadata. To do so, tools and mechanisms
must resolve the semantic and structural heterogeneity
and align terms between metadata, a task in which schema
matching plays a central role. Among the schema matching
approaches that have been tried, we can highlight the
success of the work done in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. However, both of
them consider only one type of context, which makes them
inefficient when the schemas to be matched have
a high structural heterogeneity. In this paper, a new schema
matching-based approach for XML metadata integration is
proposed. In particular, we propose a new matching
technique that exploits semantic and structural
information in a manner that increases the matching accuracy.
      </p>
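      <p>[Figure 1: steps of the proposed matching system. Labels recoverable from the figure: Existing standards, Mediated Schema, Schema Parsing, Node Names, Comments and Tokens, Tokenization, Linguistic Filtering, User-Defined Synonyms, WordNet, Name Similarity Calculation, Comment Similarity Calculation, Linguistic Similarity, Ancestor Context Similarity, Leaf Context Similarity, Pre-Processing, (n:m) mappings.]</p>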
    </sec>
    <sec id="sec-2">
      <title>2. THE PROPOSED APPROACH</title>
      <p>In this section, we describe the different steps of the
proposed matching system, as shown in Figure 1: pre-processing,
linguistic similarity computation, and structural similarity computation.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Pre-Processing</title>
      <p>After modeling each XML Schema as a directed labeled graph,
we parse all entities involved in the matching
process (elements, attributes, and the comments attached to
these entities). These entities are then filtered and
normalized using tokenization, lemmatization, and a stop-word list.</p>
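      <p>The normalization step can be illustrated with a minimal Python sketch. It covers only tokenization and stop-word filtering; lemmatization is left out because it needs a linguistic resource. The stop-word list, the camelCase splitting rule, and the function names are assumptions made for this illustration and are not taken from the system described here.</p>
      <preformat preformat-type="code">
```python
import re

# Illustrative stop-word list (an assumption; the actual list is not given).
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "for", "in"}


def tokenize(label):
    """Split an element name, attribute name, or comment into lower-case tokens."""
    # Insert a space at camelCase boundaries, e.g. "dateOfCreation" -> "date Of Creation".
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", label)
    # Split on every run of characters that is not a letter or a digit.
    return [tok.lower() for tok in re.split(r"[^A-Za-z0-9]+", spaced) if tok]


def normalize(label, stop_words=STOP_WORDS):
    """Tokenize a label and drop the tokens found in the stop-word list."""
    return [tok for tok in tokenize(label) if tok not in stop_words]


print(normalize("dateOfCreation"))
print(normalize("The title of the work"))
```
      </preformat>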
    </sec>
    <sec id="sec-4">
      <title>2.2 Linguistic Similarity Computation</title>
      <p>This phase computes the linguistic similarity
between every pair of XML Schema nodes, using the
names of the nodes and their comments.</p>
      <sec id="sec-4-1">
        <title>2.2.1 Names Matching</title>
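        <p>We calculate the name similarity between all node pairs
in the two schemas. We first make the tokens explicit
by expanding them with WordNet. Each node n<sub>i</sub> is represented by a
set of tokens M<sub>i</sub>, and each token m<sub>j</sub> in M<sub>i</sub> has a set of synonyms
synset(m<sub>j</sub>). M'<sub>i</sub> is the final set, which groups M<sub>i</sub> with all the synsets
returned by this expansion:</p>
        <p>
          <disp-formula>
            <label>(1)</label>
            <tex-math>M'_i = M_i \cup \{\, m_k \mid \exists\, m_j \in M_i \wedge m_k \in \mathrm{synset}(m_j) \,\}</tex-math>
          </disp-formula>
        </p>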
        <p>
          We then compute the similarity S<sub>name</sub> between all node pairs.
For each node pair (n<sub>1</sub>, n<sub>2</sub>), we calculate S<sub>name</sub>
by applying the Jaro-Winkler metric (JW) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] between each token
m<sub>i</sub> ∈ M'<sub>1</sub> and all tokens m<sub>j</sub> ∈ M<sub>2</sub> (and vice versa). We keep
the maximum score (MJW) for each token m<sub>i</sub>. Finally, the
average of these best similarities is calculated:
        </p>
        <p>Sname(n1; n2) =</p>
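        <p>The averaging step described above can be sketched in Python. The sketch rests on several assumptions that the text does not fix: it reads "and vice versa" symmetrically (the best scores from both directions are pooled before averaging), it replaces WordNet with a small hypothetical synonym dictionary, and it uses a compact Jaro-Winkler implementation with the customary prefix scale of 0.1 and a prefix limit of four characters.</p>
        <preformat preformat-type="code">
```python
def jaro(s, t):
    """Jaro similarity between two strings, in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        # Look for an unused, equal character of t inside the match window.
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, scale=0.1):
    """Jaro similarity boosted by the length of the common prefix (at most 4)."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scale * (1.0 - base)


def expand(tokens, synonyms):
    """Token set plus the synonyms of every token (hypothetical WordNet stand-in)."""
    expanded = set(tokens)
    for tok in tokens:
        expanded.update(synonyms.get(tok, ()))
    return expanded


def best_scores(source, target):
    """Best Jaro-Winkler score (MJW) of each source token against the target tokens."""
    return [max(jaro_winkler(a, b) for b in target) for a in source]


def name_similarity(tokens_1, tokens_2, synonyms):
    """Average of the best scores, pooled over both directions (an assumption)."""
    if not tokens_1 or not tokens_2:
        return 0.0
    scores = (best_scores(expand(tokens_1, synonyms), tokens_2)
              + best_scores(expand(tokens_2, synonyms), tokens_1))
    return sum(scores) / len(scores)


synonyms = {"creator": ["author", "maker"]}  # illustrative entries only
print(name_similarity(["creator"], ["author"], synonyms))
```
        </preformat>
        <p>In this toy case the synonym expansion adds an exact match for "author", so the score is higher than the one obtained with an empty synonym dictionary.</p>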
        <p>We apply tf-idf weighting to calculate the similarity between
comments. All comments in the two schemas are
treated as documents, and each node is represented by a
vector whose coordinates are tf-idf weights. The similarity
between two nodes is then computed from the
vectors corresponding to their comments. Let
v = (w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>P</sub>) be the vector representing a node
n, where P = |U| and U is the set of distinct words in all comments
of the two schemas. The i-th element w<sub>i</sub> of v
is calculated as follows:
          <disp-formula>
            <label>(3)</label>
            <tex-math>w_i = tf_i \times idf_i, \qquad idf_i = \log_2 \frac{N}{b_i}</tex-math>
          </disp-formula>
where tf<sub>i</sub> is the term frequency, that is, the number
of times the i-th word of U appears in the comment
of n; idf<sub>i</sub> is the inverse document frequency, which grows as
the fraction of comments containing the i-th word shrinks; N is the total number
of comments in both schemas; and b<sub>i</sub> is the number of
comments that contain the i-th word at least once. The
similarity S<sub>comment</sub> is the cosine of the angle between the two vectors:</p>
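        <p>
          <disp-formula>
            <label>(4)</label>
            <tex-math>S_{comment}(v_i, v_j) = \frac{\sum_{k=1}^{P} w_{ik}\, w_{jk}}{\sqrt{\sum_{k=1}^{P} (w_{ik})^2 \; \sum_{k=1}^{P} (w_{jk})^2}}</tex-math>
          </disp-formula>
        </p>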
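        <p>The tf-idf weighting and the cosine measure can be tried out with the following Python sketch. It assumes that each comment has already been reduced to a list of tokens, and the three sample comments are invented for the illustration.</p>
        <preformat preformat-type="code">
```python
import math
from collections import Counter


def tfidf_vectors(comments):
    """Turn tokenized comments into tf-idf vectors over the shared vocabulary U."""
    vocabulary = sorted({word for comment in comments for word in comment})
    n_comments = len(comments)
    # b_i: number of comments that contain the i-th word at least once.
    doc_freq = Counter(word for comment in comments for word in set(comment))
    vectors = []
    for comment in comments:
        term_freq = Counter(comment)
        vectors.append([term_freq[word] * math.log2(n_comments / doc_freq[word])
                        for word in vocabulary])
    return vectors


def cosine(u, v):
    """Cosine similarity between two weight vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Invented sample comments, already tokenized and filtered.
comments = [["title", "work"], ["name", "work"], ["date", "creation"]]
v0, v1, v2 = tfidf_vectors(comments)
print(cosine(v0, v1))  # share the word "work", so the score is positive
print(cosine(v0, v2))  # no word in common, so the score is 0.0
```
        </preformat>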
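        <p>The result of the above steps is a linguistic similarity
matrix lSim:
          <disp-formula>
            <label>(5)</label>
            <tex-math>\mathit{lSim}(n_i, n_j) = \lambda_1\, S_{name}(n_i, n_j) + \lambda_2\, S_{comment}(n_i, n_j)</tex-math>
          </disp-formula>
where λ<sub>1</sub> + λ<sub>2</sub> = 1 and (λ<sub>1</sub>, λ<sub>2</sub>) ≥ 0.</p>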
      </sec>
    </sec>
    <sec id="sec-5">
      <title>2.3 Structural Similarity Computation</title>
      <p>
        Linguistic similarity computation may return several false
positive candidates. To eliminate these false
candidates, the structural similarity is computed by
considering three kinds of node context [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]: ancestor context,
immediate descendant context, and leaf context.
      </p>
      <sec id="sec-5-1">
        <title>2.3.1 Ancestor Context</title>
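        <p>
          The ancestor context of a node n<sub>i</sub> is defined as the path
p<sub>i</sub> extending from the root node of the schema to n<sub>i</sub>. The
ancestor context similarity ancSim between (n<sub>i</sub>, n<sub>j</sub>) is based
on the resemblance measure between their paths (p<sub>i</sub>, p<sub>j</sub>).
This is done by calculating three scores established in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]:
        </p>
        <p>
          <disp-formula>
            <label>(6)</label>
            <tex-math>\mathit{ancSim}(n_i, n_j) = \mathit{lSim}(n_i, n_j) \times \bigl( \delta\, LCS_n(p_i, p_j) - \eta\, GAP(p_i, p_j) - \theta\, LD(p_i, p_j) \bigr)</tex-math>
          </disp-formula>
        </p>
        <p>where δ + η + θ = 1 and (δ, η, θ) ≥ 0.</p>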
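        <p>Of the three path scores, the normalized longest common subsequence is the simplest to illustrate. The Python sketch below treats a path as a list of node names and divides the LCS length by the length of the longer path; this normalization, the function names, and the sample paths are assumptions made for the illustration, and the exact definitions of the three scores should be taken from the cited work.</p>
        <preformat preformat-type="code">
```python
def lcs_length(path_1, path_2):
    """Length of the longest common subsequence of two paths (lists of node names)."""
    previous = [0] * (len(path_2) + 1)
    for name_1 in path_1:
        current = [0]
        for j, name_2 in enumerate(path_2, start=1):
            if name_1 == name_2:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def lcs_normalized(path_1, path_2):
    """LCS length divided by the longer path length (assumed normalization)."""
    longest = max(len(path_1), len(path_2))
    return lcs_length(path_1, path_2) / longest if longest else 0.0


# Invented root-to-node paths from two schemas.
p_i = ["Description", "CreationInformation", "Creation", "Title"]
p_j = ["Description", "Creation", "Title"]
print(lcs_normalized(p_i, p_j))  # 3 shared names in order, longer path has 4
```
        </preformat>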
      </sec>
      <sec id="sec-5-2">
        <title>2.3.2 Immediate Descendants Context</title>
        <p>To obtain the immediate descendants context similarity
immSim(n<sub>i</sub>, n<sub>j</sub>), we compare the two sets of immediate
descendants of n<sub>i</sub> and n<sub>j</sub>. This is done by using the linguistic
similarity lSim between each pair of children in the two sets. We
select the matching pairs with the maximum similarity values.
Finally, the average of these best similarity values is taken.</p>
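        <p>The best-match averaging described above can be sketched as follows. The sketch assumes a one-directional reading (each child of the first node is paired with its most similar child of the second node), takes lSim as a caller-supplied function, and uses an invented similarity table in place of real lSim values.</p>
        <preformat preformat-type="code">
```python
def best_match_average(children_i, children_j, l_sim):
    """Average, over the children of n_i, of their best lSim score among the children of n_j."""
    if not children_i or not children_j:
        return 0.0
    best = [max(l_sim(a, b) for b in children_j) for a in children_i]
    return sum(best) / len(best)


# Invented lSim values for the illustration.
TOY_LSIM = {
    ("title", "name"): 0.8, ("title", "date"): 0.1,
    ("creator", "name"): 0.3, ("creator", "date"): 0.2,
}


def toy_l_sim(a, b):
    """Look up the invented similarity table; unknown pairs score 0.0."""
    return TOY_LSIM.get((a, b), 0.0)


imm_sim = best_match_average(["title", "creator"], ["name", "date"], toy_l_sim)
print(round(imm_sim, 2))  # best scores are 0.8 and 0.3
```
        </preformat>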
      </sec>
      <sec id="sec-5-3">
        <title>2.3.3 Leaf Context</title>
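        <p>The leaf context of a node n<sub>i</sub> is the set of leaf nodes of the
subtrees rooted at n<sub>i</sub>. If l<sub>i</sub> ∈ leaves(n<sub>i</sub>) is a leaf node, then
the context of l<sub>i</sub> is given by the path p<sub>i</sub> from n<sub>i</sub> to l<sub>i</sub>:</p>
        <p>
          <disp-formula>
            <label>(7)</label>
            <tex-math>\mathit{leafSim}(l_i, l_j) = \mathit{lSim}(l_i, l_j) \times \bigl( \delta\, LCS_n(p_i, p_j) - \eta\, GAP(p_i, p_j) - \theta\, LD(p_i, p_j) \bigr)</tex-math>
          </disp-formula>
        </p>
        <p>To obtain the leaf context similarity of two nodes n<sub>i</sub> and n<sub>j</sub>,
we compute the leaf similarity leafSim between each pair of leaves
l<sub>i</sub> ∈ leaves(n<sub>i</sub>) and l<sub>j</sub> ∈ leaves(n<sub>j</sub>) in the two leaf sets. We
then select the matching pairs with the maximum similarity
values and take the average of these best similarity values.</p>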
      </sec>
    </sec>
    <sec id="sec-6">
      <title>2.4 Node Similarity</title>
      <p>The node similarity nodeSim is obtained by combining
the three context scores:</p>
      <p>
        <disp-formula>
          <label>(8)</label>
          <tex-math>\mathit{nodeSim}(n_i, n_j) = \alpha\, \mathit{ancSim}(n_i, n_j) + \beta\, \mathit{immSim}(n_i, n_j) + \gamma\, \mathit{leafSim}(n_i, n_j)</tex-math>
        </disp-formula>
      </p>
      <p>where α + β + γ = 1 and (α, β, γ) ≥ 0. Once the structural
similarity computation is complete, the system returns, for each source
node n<sub>i</sub>, the k candidate nodes that have the highest values of
nodeSim and exceed a given threshold, e.g. 0.7.</p>
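      <p>The combination and the candidate selection can be sketched in Python as follows. The weight values, the value of k, the function names, and the sample scores are assumptions chosen for the illustration; only the 0.7 threshold comes from the example given in the text.</p>
      <preformat preformat-type="code">
```python
def node_similarity(anc_sim, imm_sim, leaf_sim, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of the three context scores; weights are non-negative and sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9 or any(0 > w for w in weights):
        raise ValueError("weights must be non-negative and sum to 1")
    w_anc, w_imm, w_leaf = weights
    return w_anc * anc_sim + w_imm * imm_sim + w_leaf * leaf_sim


def top_k_candidates(scores, k=3, threshold=0.7):
    """Keep the k best target nodes whose nodeSim exceeds the threshold.

    scores maps each target node name to its nodeSim value for one source node.
    """
    kept = [(target, s) for target, s in scores.items() if s > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:k]


# Invented nodeSim values for one source node.
scores = {"Title": 0.91, "Name": 0.74, "Date": 0.35, "Label": 0.72}
print(top_k_candidates(scores, k=2))
```
      </preformat>
      <p>With these sample values, "Date" is discarded by the threshold and "Label" by the limit k = 2.</p>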
    </sec>
    <sec id="sec-7">
      <title>3. CONCLUSION</title>
      <p>Because of the number of existing metadata standards and
their heterogeneity, there is great interest in
developing an automatic integration solution. Such a
solution makes the integration process faster and less
expensive. We proposed a new XML Schema matching
technique to automate the integration of multimedia metadata.
In essence, we proposed a linguistic and structural similarity
measure that links metadata encoded in different formats. In
our ongoing work, we plan to enhance the proposed
matching system through a better use of the structural information,
using the adjacency of nodes to detect other mappings.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>M.</given-names> <surname>Bilenko</surname></string-name>,
          <string-name><given-names>R. J.</given-names> <surname>Mooney</surname></string-name>,
          <string-name><given-names>W. W.</given-names> <surname>Cohen</surname></string-name>,
          <string-name><given-names>P. D.</given-names> <surname>Ravikumar</surname></string-name>, and
          <string-name><given-names>S. E.</given-names> <surname>Fienberg</surname></string-name>.
          <article-title>Adaptive name matching in information integration</article-title>.
          <source>IEEE Intelligent Systems</source>,
          <volume>18</volume>(<issue>5</issue>):<fpage>16</fpage>–<lpage>23</lpage>,
          <year>2003</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>D.</given-names> <surname>Carmel</surname></string-name>,
          <string-name><given-names>Y. S.</given-names> <surname>Maarek</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Mandelbrod</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Mass</surname></string-name>, and
          <string-name><given-names>A.</given-names> <surname>Soffer</surname></string-name>.
          <article-title>Searching XML documents via XML fragments</article-title>.
          In <source>SIGIR</source>, pages
          <fpage>151</fpage>–<lpage>158</lpage>,
          <year>2003</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hausenblas</surname>
          </string-name>
          .
          <article-title>Multimedia vocabularies on the semantic web</article-title>
          ,
          <year>July 2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>M.-L.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>L. H.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Hsu</surname></string-name>, and
          <string-name><given-names>X.</given-names> <surname>Yang</surname></string-name>.
          <article-title>XClust: clustering XML schemas for effective integration</article-title>.
          In <source>CIKM</source>, pages
          <fpage>292</fpage>–<lpage>299</lpage>,
          <year>2002</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>J.</given-names> <surname>Madhavan</surname></string-name>,
          <string-name><given-names>P. A.</given-names> <surname>Bernstein</surname></string-name>, and
          <string-name><given-names>E.</given-names> <surname>Rahm</surname></string-name>.
          <article-title>Generic schema matching with Cupid</article-title>.
          In <source>VLDB</source>, pages
          <fpage>49</fpage>–<lpage>58</lpage>,
          <year>2001</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>S.</given-names> <surname>Melnik</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Garcia-Molina</surname></string-name>, and
          <string-name><given-names>E.</given-names> <surname>Rahm</surname></string-name>.
          <article-title>Similarity flooding: A versatile graph matching algorithm and its application to schema matching</article-title>.
          In <source>ICDE</source>, pages
          <fpage>117</fpage>–<lpage>128</lpage>,
          <year>2002</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>