<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Context-based Approach for Complex Semantic Matching</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Youssef Bououlid Idrissi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julie Vachon</string-name>
          <email>vachon@iro.umontreal.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DIRO, University of Montreal</institution>
          ,
          <addr-line>Montreal, QC</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <fpage>25</fpage>
      <lpage>28</lpage>
      <abstract>
        <p>Semantic matching (also called semantic alignment, or simply matching) is a fundamental step in implementing data sharing applications. Most systems automating this task, however, limit themselves to finding simple (one-to-one) matches. In fact, complex (many-to-many) matching raises a far more difficult problem, as the search space of concept combinations is often very large. This article presents Indigo, a system that discovers complex matches in two steps. First, it semantically enriches data sources with complex concepts extracted from their development artifacts. It then proceeds to the alignment of the data sources thus enhanced.</p>
      </abstract>
      <kwd-group>
        <kwd>complex matching</kwd>
        <kwd>semantic matching</kwd>
        <kwd>context analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Semantic matching consists in finding semantic correspondences between
heterogeneous sources. When done manually, this task can prove very tedious
and error-prone [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. To date, many systems [
        <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
        ] have addressed the automation
of this stage. However, most solutions confine themselves to simple (one-to-one)
matching, although complex (many-to-many) matching is frequently required in
practice. Linking the concept address to the concatenation of the concepts street +
city is a typical example of a complex match. The scarcity of work addressing
complex matching can be explained by the greater difficulty of finding complex
matches than that of discovering simple ones. To cope with this challenging
problem, this article introduces the solution implemented by our matching system
Indigo (INteroperability and Data InteGratiOn). Indigo avoids searching such
large spaces of possible concept combinations. Instead, it implements an
innovative solution based on the exploration of the data sources’ informational
context, which can indeed contain very useful semantic hints about concept
combinations. The informational context of a data source is composed of all the
available textual and formal artifacts documenting, specifying or implementing
this data source. Indigo distinguishes two main sets of documents in the
informational context (cf. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for details). The first set, called the descriptive context, gathers
all the available data source specification and documentation files produced
during the different development stages. The
second set is called the operational context. It is composed of formal artifacts such
as programs, forms or XML files. In formal settings, significant combinations of
concepts are more easily located (e.g. they can be found in formulas, function
declarations, etc.). Indigo thus favors the exploration of the operational context
to identify relevant concept combinations that can form new complex concepts.
Complex concepts are added to data sources as new candidates for the matching
phase. As an experiment, Indigo was used for the semantic matching of two
database schemas taken from two open-source e-commerce applications, Java
Pet Store [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and eStore [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], provided with all their source code files.
      </p>
      <p>Indigo’s architecture. To handle both context analysis and semantic
matching, Indigo presents an architecture composed of two main modules: a
Context Analyzer and a Mapper module. The Context Analyzer takes the data
sources to be matched, along with their related contexts, and proceeds to their
enrichment before delivering them to the Mapper module for their effective
matching.</p>
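      <p>Assuming hypothetical interfaces, this two-module flow can be sketched as follows. The function names and data structures below are illustrative placeholders, not Indigo’s actual API; a trivial name-equality strategy stands in for the Mapper’s real aligners.</p>
      <preformat>
```python
# Illustrative two-stage pipeline; all names and data structures here are
# hypothetical placeholders, not Indigo's actual API.
def analyze_context(schema_concepts, context):
    """Context Analyzer: enrich a data source with concepts mined from its context."""
    extracted = context.get("descriptive", []) + context.get("operational", [])
    return set(schema_concepts) | set(extracted)

def map_sources(enriched_a, enriched_b):
    """Mapper: align two enriched data sources (trivial name-equality strategy)."""
    return sorted(enriched_a & enriched_b)

# The complex concept mined from the operational context is what lets the
# two sources match at all.
src_a = analyze_context(["address"], {"operational": ["concat(street, city)"]})
src_b = analyze_context(["street", "city"], {"operational": ["concat(street, city)"]})
print(map_sources(src_a, src_b))  # -> ['concat(street, city)']
```
      </preformat>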
      <sec id="sec-1-1">
        <title>Context Analyzer</title>
        <p>The Context Analyzer comprises two main modules, each one being specialized
in a specific type of concept extraction. 1) The Concept name collector explores
the descriptive context of a data source to find (simple) concept names related
to the ones found in the data source’s schema. 2) The Complex concept extractor
analyzes the operational context to extract complex concepts. The left side of
Figure 1 shows the current architecture of our Context Analyzer. Modules are
either basic or meta analyzers. The analysis performed is based on heuristic rules
in all cases.</p>
        <sec id="sec-1-1-2">
          <p>[Fig. 1. Indigo’s architecture. Left, the Context Analyzer: a head
meta-analyzer coordinates the Concept name collector and the Complex concept
extractor (with its Complex concept generator and concepts linker), above the
basic analyzers (Arithmetic parser and SQL, Program, Form, Schema, DTD and XSD
analyzers). Right, the Mapper: a Supervisor and a Content-based coordinator
above the Name-based, Statistic-based and Whirl-based aligners.]</p>
          <p>Basic analyzers. Basic analyzers (depicted by white boxes in Fig. 1) are
responsible for the effective mining of complex concepts. The extraction is
performed as dictated by heuristic rules of the following shape: ruleName ::
SP1 || SP2 || ... || SPn → extractionAction. The left-hand side is a disjunction
of syntactic patterns (noted SP) that basic analyzers try to match, by text
pattern matching, when parsing a document. An SP is a regular expression which
can contain the pattern variables name, type, exp1, exp2, ..., expn (e.g. for an
accessor method in a Java program: SPi = {public type getname * return exp1}).
When a basic analyzer recognizes one of the SPs appearing on the left-hand side
of a rule, the pattern variables are assigned values (by pattern matching) and
the corresponding right-hand side action of the rule is executed. This action
builds a complex concept &lt; name, type, concept combination &gt; using the
pattern variables’ values. Each basic analyzer applies its own set of heuristic
extraction rules over each of the artifacts it is assigned. Our current basic
analyzers deal with forms, programs, SQL queries, DTD and XSD schemas.</p>
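          <p>For illustration, the following sketch encodes one such heuristic rule as a regular expression over Java accessor methods and runs its extraction action. The pattern and names are hypothetical, in the spirit of the SPi example above rather than Indigo’s actual rule base.</p>
          <preformat>
```python
import re

# Hypothetical syntactic pattern (SP) for a Java accessor method; the pattern
# variables type, name and exp are bound by matching. Not Indigo's rule base.
ACCESSOR_SP = re.compile(
    r"public\s+(?P<type>\w+)\s+get(?P<name>\w+)\s*\([^)]*\)\s*\{"
    r"[^{}]*?return\s+(?P<exp>[^;]+);",
    re.DOTALL,
)

def extract_complex_concepts(java_source):
    """Extraction action: build <name, type, concept combination> triples."""
    concepts = []
    for m in ACCESSOR_SP.finditer(java_source):
        # The concept combination is taken from the identifiers occurring
        # in the returned expression.
        combination = re.findall(r"[A-Za-z_]\w*", m.group("exp"))
        concepts.append((m.group("name"), m.group("type"), combination))
    return concepts

source = 'public String getAddress() { return street + ", " + city; }'
print(extract_complex_concepts(source))
# -> [('Address', 'String', ['street', 'city'])]
```
          </preformat>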
          <p>Meta-analyzers. Each meta-analyzer is in charge of a set of artifacts
composing the informational context. Its role essentially consists in classifying these
artifacts and assigning each of them to a relevant child. To do so, it applies
heuristics like checking file name extensions or parsing file internal structures.
The meta-analyzer module at the head of the Context Analyzer is in charge of
the coordination of the Concept name collector and the Complex concept
extractor. It enhances data sources with the simple and complex concepts
respectively delivered by these two modules. For complex concepts, this
enrichment step requires not only the names of the enriching concepts but also
the values associated with them (assessed by querying the database using SQL
SELECT statements). As for the Complex concept extractor, it coordinates the
actions of the basic analyzers responsible for complex concept extraction. In
addition, it relies on an internal module, called the complex concept generator,
which validates discovered concept combinations and generates complex concepts
by replacing expressions (coming from source code) with the appropriate
concepts of the data source.</p>
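          <p>A meta-analyzer’s classification step can be sketched as a simple dispatch on file extensions. The table below is hypothetical; as noted above, Indigo’s heuristics also parse the internal structure of files.</p>
          <preformat>
```python
from pathlib import Path

# Hypothetical extension-based dispatch table; Indigo's meta-analyzers also
# apply heuristics that inspect the internal structure of files.
ANALYZERS = {
    ".sql": "SQL analyzer",
    ".java": "Program analyzer",
    ".html": "Form analyzer",
    ".dtd": "DTD analyzer",
    ".xsd": "XSD analyzer",
}

def classify(artifacts):
    """Assign each context artifact to the relevant basic analyzer."""
    assignment = {}
    for path in artifacts:
        analyzer = ANALYZERS.get(Path(path).suffix.lower())
        if analyzer is not None:
            assignment.setdefault(analyzer, []).append(path)
    return assignment

print(classify(["schema.sql", "Cart.java", "order_form.html"]))
```
          </preformat>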
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>Mapper module</title>
        <p>
          As shown on the right side of Figure 1, the current architecture of the Mapper
is composed of several hierarchically organized matching modules. Each aligner
supports a given matching strategy and is responsible for generating a mapping
between data sources. On top, each coordinator supervises a given set of aligners
and combines their returned results. The current implementation of the Mapper
comprises the following three aligners: (1) The Name-based aligner proposes
matches between concepts having similar names with regard to the Jaro-Winkler
lexical similarity metric. (2) The Whirl-based aligner uses an adapted version of
the WHIRL technique developed by Cohen and Hirsh [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] to match
concepts having similar instances. Finally, (3) the Statistic-based aligner
compares concepts’ content, represented in this case by a normalized vector
describing seven characteristics (e.g. minimum value, maximum value, variance).
        </p>
        <p>
Indigo’s Context Analyzer has been applied to match the data sources of the
eStore and PetStore applications. Two measures, respectively called significance
and relevance, were defined to evaluate its performance. The significance
measure indicates the percentage of extracted complex concepts presenting a
semantically sound combination of concepts (e.g. concat(first name, last name)).
The relevance measure computes the proportion of significant complex concepts
which effectively appear in the final mapping of the two data sources. Overall,
the Context Analyzer discovered 31 complex concepts, of which 87% were
significant. Of course, not all of them were relevant: manual examination of the
data sources revealed that eStore’s data source contained only two complex
concepts while PetStore’s contained none. Indigo nevertheless succeeded in
discovering both relevant complex concepts of eStore. Once enhanced with
complex concepts, the PetStore and eStore data sources were matched by the
Mapper module, which was able to discover all complex matches.
        </p>
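        <p>The Name-based aligner relies on the Jaro-Winkler lexical similarity metric mentioned above. A minimal self-contained implementation of this standard metric (our own sketch, not Indigo’s code) is shown below.</p>
        <preformat>
```python
def jaro(s, t):
    """Jaro similarity between two strings (standard formulation)."""
    if s == t:
        return 1.0
    ls, lt = len(s), len(t)
    if ls == 0 or lt == 0:
        return 0.0
    window = max(ls, lt) // 2 - 1
    s_hit, t_hit = [False] * ls, [False] * lt
    matches = 0
    for i, c in enumerate(s):
        lo, hi = max(0, i - window), min(lt, i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == c:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions among matched characters.
    k = transpositions = 0
    for i in range(ls):
        if s_hit[i]:
            while not t_hit[k]:
                k += 1
            if s[i] != t[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / ls + matches / lt
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s, t, p=0.1):
    """Boost the Jaro score for strings sharing a common prefix (up to 4 chars)."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

# Concepts with similar names are proposed as match candidates.
print(round(jaro_winkler("address", "addr"), 3))  # -> 0.914
```
        </preformat>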
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Conclusion</title>
      <p>We proposed Indigo, an innovative solution for the discovery of complex matches
between data sources. To avoid searching the unbounded space of possible concept
combinations, Indigo discovers complex concepts by searching the operational
context artifacts of data sources. Newly discovered complex concepts are added
to data sources as new candidates for complex matching. Indigo implements a
Context Analyzer and a Mapper module, both offering a flexible and extensible
hierarchical architecture. Specialized analyzers and aligners can be added to
allow the application of new mining and matching strategies. Extensibility and
adaptability are undoubtedly appreciable qualities of Indigo. Our first
experiments with Indigo showed the pertinence of this approach and suggest
promising future results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Rahm</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>P.A.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          :
          <article-title>A Survey of Approaches to Automatic Schema Matching</article-title>
          .
          <source>VLDB Journal</source>
          Vol.
          <volume>10</volume>
          , Issue
          <issue>4</issue>
          (
          <year>2001</year>
          )
          <fpage>334</fpage>
          -
          <lpage>350</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bououlid Idrissi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vachon</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Context Analysis for Semantic Mapping of Data Sources Using a Multi-Strategy Machine Learning Approach</article-title>
          .
          <source>In Proc. of the International Conf. on Enterprise Information Systems (ICEIS05)</source>
          , Miami, USA (
          <year>2005</year>
          )
          <fpage>445</fpage>
          -
          <lpage>448</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>W.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clifton</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Semantic Integration in Heterogeneous Databases Using Neural Networks</article-title>
          .
          <source>In Proc. of the 20th Conf. on Very Large Databases (VLDB)</source>
          (
          <year>1994</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Euzenat</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al:
          <article-title>State of the Art on Ontology Alignment. Part of a research project funded by the IST Program of the Commission of the European Communities</article-title>
          ,
          <source>project number IST-2004-507482. Knowledge Web Consortium</source>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <institution>Sun Microsystems</institution>
          (http://java.sun.com/developer/releases/petstore/)(
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>McUmber</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Developing Pet Store using RUP and XDE</article-title>
          .
          <source>Web Site</source>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirsh</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Joins that Generalize: Text Classification using Whirl</article-title>
          .
          <source>In Proc. of the Fourth Int. Conf. on Knowledge Discovery and Data Mining</source>
          (
          <year>1998</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>