<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Discovery of similar blocks from very large-scale ontologies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aicha Boubekeur</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abdellah Chouarfia</string-name>
          <email>chouarfia@univ-usto.dz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, University of Tiaret</institution>
          ,
          <country country="DZ">Algeria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Computer Science Department, University of UST-Oran</institution>
          ,
          <country country="DZ">Algeria</country>
        </aff>
      </contrib-group>
      <fpage>324</fpage>
      <lpage>329</lpage>
      <abstract>
<p>Large-scale ontology matching is a labour-intensive and time-consuming process. To alleviate the problem, many automated solutions have been proposed. To avoid the drawbacks of the existing solutions, this paper proposes to cut down the number of concept pairs for which a similarity measure must be computed during ontology matching. More importantly, the main contribution is to deal with subsets of concept pairs: on the one hand, if two concepts are highly similar, we leverage the concept hierarchy to skip subsequent matching between sub-concepts of one concept and super-concepts of the other concept. On the other hand, if two concepts are weakly similar, we leverage the locality phenomenon of matching to skip subsequent matching between one concept and the neighbours of the other concept.</p>
      </abstract>
      <kwd-group>
        <kwd>Large ontology</kwd>
        <kwd>neighbour</kwd>
        <kwd>similarity</kwd>
        <kwd>link</kwd>
        <kwd>search space</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>In recent years, many large ontologies have been created and maintained in areas
including machine translation, information retrieval, e-commerce, digital libraries,
medicine, and the life sciences. These ontologies have from thousands to millions of
concepts and properties, and some of them, such as Cyc1, WordNet2, SUMO3, the Gene
Ontology4 and UMLS5, contain billions of instances.</p>
      <p>
        It has been argued that the operations of constructing, matching,
reusing, maintaining, and reasoning on large ontologies would be greatly
simplified by splitting large ontologies into smaller modules which cover specific subjects
[
        <xref ref-type="bibr" rid="ref6 ref8">6, 8</xref>
        ]. Ontology modularization is the collective name of two approaches for
fragmenting ontologies into smaller, coherent components (modules), which are
themselves ontologies [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]:
      </p>
      <p>Ontology partitioning approaches. The ontology is partitioned into a number of
modules {M1, ···, Mn} such that the union of all the modules is semantically
equivalent to the original ontology, M1 ∪ M2 ∪ ... ∪ Mn = O; note that the Mi modules are not
necessarily disjoint. Thus, a function Partition(O) can be formally defined as follows:</p>
      <p>
        Partition(O) = {{M1, M2, ..., Mn} | M1 ∪ M2 ∪ ... ∪ Mn = O} (1)
The partitioning method reduces the search space and thus leads to better efficiency. The
space complexity of the matching process is also reduced. Four partition-based methods,
COMA++ [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Falcon-AO [
        <xref ref-type="bibr" rid="ref3 ref7">7, 3</xref>
        ], Taxomap [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and Anchor-Flood [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], will be discussed below.
      </p>
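      <p>As an illustration of equation (1), a minimal Python sketch can build possibly overlapping modules whose union recovers the original ontology. The sketch is hypothetical: the one-hop growth criterion and the function names are assumptions for illustration, not the algorithm of any of the cited systems.</p>
      <preformat>
```python
# Hypothetical sketch: partition an ontology (here just a set of concepts)
# into possibly overlapping modules whose union equals the original ontology,
# as required by equation (1).
def partition(ontology, seeds, neighbors):
    """ontology: set of concepts; seeds: one concept per module;
    neighbors: dict mapping a concept to its adjacent concepts."""
    modules = []
    for seed in seeds:
        # Grow a module from the seed by one hop of neighbors (toy criterion).
        module = {seed} | set(neighbors.get(seed, ()))
        modules.append(module)
    # Any concept left uncovered is added to the last module so that the
    # union of all modules is exactly the original ontology.
    covered = set().union(*modules) if modules else set()
    if modules:
        modules[-1] |= ontology - covered
    return modules

O = {"A", "B", "C", "D", "E"}
nbrs = {"A": ["B", "C"], "D": ["C", "E"]}
mods = partition(O, ["A", "D"], nbrs)
assert set().union(*mods) == O            # property (1): union recovers O
assert "C" in mods[0] and "C" in mods[1]  # modules may overlap
```
      </preformat>
      <p>The overlap of concept C between the two modules shows why the Mi need not be disjoint.</p>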
      <p>
        Module extraction techniques. Concepts that form a coherent fragment of an
ontology O are extracted to form a module M, such that it covers a given vocabulary
(based on an initial module signature) Sig(M), with Sig(M) ⊆ Sig(O) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In fact,
this task consists in reducing an ontology to the sub-part, the module, that covers a
particular sub-vocabulary of O, so that M ⊆ O [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Note that M is now an ontology itself.
A function Extract(O, Sig(M)) can be defined as follows:
      </p>
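      <p>A minimal sketch of such an extraction, under the assumption that the module is taken as the upward closure of the signature in the concept hierarchy (the parents map and the function name are hypothetical):</p>
      <preformat>
```python
# Hypothetical sketch: extract(O, sig) returns the module of O covering the
# signature sig, here taken as the upward closure (all super-concepts) of
# the signature in the concept hierarchy, so that M is a sub-part of O.
def extract(parents, signature):
    """parents: dict concept -> list of direct super-concepts."""
    module = set()
    stack = list(signature)
    while stack:
        c = stack.pop()
        if c not in module:
            module.add(c)
            stack.extend(parents.get(c, ()))
    return module

parents = {"Dog": ["Mammal"], "Cat": ["Mammal"], "Mammal": ["Animal"]}
m = extract(parents, {"Dog"})
assert m == {"Dog", "Mammal", "Animal"}  # the module covering Sig(M)
```
      </preformat>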
      <p>
        Extract(O, Sig(M)) = {M | M ⊆ O} (2)
There are numerous techniques [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] for module extraction, more than for
ontology partitioning, developed for different purposes. The main usage of these
approaches concerns partial reuse, when an application or a system needs only a part of an
ontology. Broadly speaking, modularization approaches aim to identify the minimal
set of necessary concepts and definitions for the different parts of the original ontology.
However, ontology partitioning approaches present several drawbacks. They cannot
control the size of the blocks, which may be too small or too large for matching [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6 ref9">3, 4, 5, 6, 9</xref>
        ].
They can also cause another problem: the partitioning can make the elements on
the boundaries of blocks lose some semantic information, which in turn affects the quality
of the final matching results. This paper proposes a generic solution to assess preliminary
1:n mappings between any two concepts from two given ontologies based on their
descriptive (semantic) information. On the one hand, if two concepts are highly similar,
we leverage the concept hierarchy to skip subsequent matching between sub-concepts of
one concept and super-concepts of the other concept. On the other hand, if two concepts
are weakly similar, we leverage the locality phenomenon of matching to skip subsequent
matching between one concept and the neighbors of the other concept.
      </p>
      <p>The paper is structured as follows: Section 2 discusses large-scale matching
techniques. Section 3 presents definitions and basic concepts used throughout the paper.
Section 4 describes our structure-based matching approach. Finally, Section 5
provides some concluding remarks.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        According to Shvaiko P and Euzenat J [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], one of the toughest challenges for a
matching system is handling large-scale schemas or ontologies. Large-scale ontologies
are a kind of ontology created to describe complex real-world domains. Various
large-scale matching techniques are categorized in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]:
      </p>
      <p>- The early pruning strategy reduces the search space for matching: one
matcher can prune entity pairs whose semantic correspondence value is very low, thus
reducing the search space for the subsequent matcher (the Quick Ontology Matching
algorithm (QOM), and the schema and ontology matching algorithm of Eric Peukert et al.).</p>
      <p>- The partition strategy is performed in such a way that each partition of the first
ontology is matched with only a small subset of the partitions of the second ontology. This
method reduces the search space and thus yields better efficiency (COMA++, Falcon-AO,
Taxomap and Anchor-Flood).</p>
      <p>- The parallel matching technique has two kinds, inter-matcher and intra-matcher
parallelization. Inter-matcher parallelization deals with the parallel execution of
independently executable matchers, while intra-matcher parallelization deals with the internal
decomposition of individual matchers or matcher parts into several match tasks that
can be executed in parallel (the ontology matching algorithm of Gross et al.).</p>
      <p>- Other matching tools: the RiMOM and ASMOV ontology matching tools, and the
AgreementMaker schema and ontology matching tool.</p>
    </sec>
    <sec id="sec-3">
      <title>Preliminaries</title>
      <p>The following definitions and basic concepts are used throughout the paper:
Definition 1 (schema graph): A schema graph (a directed acyclic graph) of an
ontology is given by (V, E, Labv), where V = {r, v2, ..., vn} is a finite set of nodes, each
uniquely identified by an object identifier (OID), and r is the schema graph
root node; E = {(vi, vj) | vi, vj ∈ V} is a finite set of edges; and Labv is a finite set of node
labels. These labels are strings describing the properties of the element and
attribute nodes, such as name and data type.</p>
      <p>Definition 2 (neighbor): The neighbors of a concept c are defined as
Neighbors(c) = Sub(c) ∪ Sup(c), with Sub(c) = {c' | c' is a sub-concept of c} and Sup(c) = {c' | c' is a
super-concept of c}.</p>
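      <p>Definitions 1 and 2 can be sketched as follows (a hypothetical Python sketch; the class and method names are assumptions for illustration):</p>
      <preformat>
```python
# Hypothetical sketch of Definitions 1 and 2: a schema graph (V, E, Labv)
# with labelled nodes, and Neighbors(c) = Sub(c) U Sup(c).
class SchemaGraph:
    def __init__(self, root):
        self.root = root
        self.labels = {}     # Labv: node OID -> label string
        self.edges = set()   # E: (parent, child) pairs

    def add_node(self, oid, label):
        self.labels[oid] = label

    def add_edge(self, parent, child):
        self.edges.add((parent, child))

    def sub(self, c):        # direct sub-concepts of c
        return {b for (a, b) in self.edges if a == c}

    def sup(self, c):        # direct super-concepts of c
        return {a for (a, b) in self.edges if b == c}

    def neighbors(self, c):  # Definition 2
        return self.sub(c) | self.sup(c)

g = SchemaGraph(root="Thing")
for oid, lab in [("Thing", "root"), ("Person", "a human"), ("Student", "a learner")]:
    g.add_node(oid, lab)
g.add_edge("Thing", "Person")
g.add_edge("Person", "Student")
assert g.neighbors("Person") == {"Thing", "Student"}
```
      </preformat>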
      <p>Definition 3 (strong links): Given two schema graphs G = (O1, E, Labv) and
G' = (O2, E', Labv') of ontologies O1 and O2, the similarity values between ai ∈ O1 and
the concepts b1, b2, …, bn in ontology O2 are Sim(ai) = {sim(ai, bj) ∈ G × G' | j = 1..n},
and the strong links of a node ai ∈ O1 are given by SN(ai) = {bj | sim(ai, bj) ≥ threshold}, where
threshold is a high value in [0..1].</p>
      <p>Definition 4 (low links): Given two schema graphs G = (O1, E, Labv) and G' = (O2,
E', Labv') of the ontologies, the similarity values between ai ∈ O1 and the concepts b1, b2, …,
bn in ontology O2 are Sim(ai) = {sim(ai, bj) ∈ G × G' | j = 1..n}, and the low links
of a node ai ∈ O1 are given by LN(ai) = {bj | sim(ai, bj) &lt; threshold}, where threshold is usually a
small value in [0..1].</p>
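      <p>A minimal sketch of Definitions 3 and 4 (hypothetical Python; the similarity table and the threshold values are illustrative assumptions):</p>
      <preformat>
```python
# Hypothetical sketch of Definitions 3 and 4: given a similarity function,
# SN(ai) collects the strong links (similarity at or above a high threshold)
# and LN(ai) collects the low links (similarity below a small threshold).
def strong_links(ai, O2, sim, threshold=0.8):
    return {bj for bj in O2 if sim(ai, bj) >= threshold}

def low_links(ai, O2, sim, threshold=0.2):
    # sim(ai, bj) below the (small) threshold
    return {bj for bj in O2 if threshold > sim(ai, bj)}

# Illustrative similarity table standing in for sim(ai, bj).
sims = {("car", "automobile"): 0.9, ("car", "tree"): 0.1,
        ("car", "vehicle"): 0.5}
sim = lambda a, b: sims.get((a, b), 0.0)
O2 = ["automobile", "tree", "vehicle"]
assert strong_links("car", O2, sim) == {"automobile"}
assert low_links("car", O2, sim) == {"tree"}
```
      </preformat>
      <p>Concepts in neither set ("vehicle" here) stay candidates for the normal matching process.</p>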
      <p>Through these two last definitions, the matching process can skip many
similarity computations and thus reduce the time complexity significantly.</p>
    </sec>
    <sec id="sec-4">
      <title>Structure connected Links</title>
      <sec id="sec-4-1">
        <title>Our structure-based matching approach is realized in five steps:</title>
        <p>Step 1. This phase is concerned with the representation of heterogeneous ontologies
as sequence representations. First, each ontology is parsed and represented internally
as a rooted ordered labeled graph, wherein each graph component (element and/or
attribute) is represented as a node, while edges are used to represent relationships
between components. Each node in the schema graph carries the associated element
properties.</p>
        <p>Step 2. Compute preliminary similarities between any two entities of the two given
ontologies based on their descriptive information, i.e. generate the set of concept pairs, or
links. This step utilizes both structural and linguistic information for the initial alignment and
then applies a similarity propagation strategy to produce more alignments if
necessary. Its main function is to match the heterogeneous ontologies.</p>
        <p>Step 3. The first issue is to extract two kinds of virtual sub-graphs for highly /
weakly similar concepts (links) across the ontologies. The second issue is to reduce the search
space (i.e. the space and time complexity of the matching process) in the face of
wide-scale semantic heterogeneity in matching: this phase specifies all the similarities to be
computed, and among these calculations several links can be skipped during the matching
process.</p>
        <p>Step 4. During the matching process, if credible alignments are computed, the
corresponding high-similarity links are isolated; such links predict the ignorable
similarity calculations in the remaining matching process. Likewise, if incredible
matching results are found, the corresponding negative-reduction links, according to
the locality of matching, are also constructed, and such links are likewise used to predict the
ignorable similarity calculations. The similarity measure between entities from the
two ontologies is computed by analyzing the literal and structural information of the
semantic subgraphs extracted in the previous step.</p>
        <p>Step 5. Repeat the last two steps for more alignments.</p>
        <p>To this end, this process aims at providing high-quality alignments between
concept pairs within a reasonable processing time, and it does not need to modularize or
partition the large ontologies.</p>
        <p>Therefore, considering structural information is a natural way to enhance
ontology mapping, as illustrated below: given two entities ai from O1 and bj from O2, we first
compute the similarities between the entities based on the similarities of words,
e.g. with the string-based and WordNet-based methods:</p>
        <p>
          String-based method. The similarity measure between words wi and wj is defined
as:
simStr(wi, wj) = comm(wi, wj) - diff(wi, wj) + winkler(wi, wj) (3)
where comm(wi, wj) stands for the commonality between wi and wj, diff(wi, wj) for
the difference between wi and wj, and winkler(wi, wj) for the improvement of the
result using the method introduced by Winkler in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
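        <p>The following sketch approximates equation (3) in Python. It is a hedged stand-in: comm is computed from difflib matching blocks and winkler is a simplified Winkler-style prefix bonus, not the exact measures defined in [7].</p>
        <preformat>
```python
# Simplified sketch of equation (3): simStr = comm - diff + winkler.
# comm, diff and winkler are approximations of the cited measures.
from difflib import SequenceMatcher

def comm(w1, w2):
    # Commonality: total length of matched blocks over total length.
    m = SequenceMatcher(None, w1, w2)
    matched = sum(b.size for b in m.get_matching_blocks())
    return 2.0 * matched / (len(w1) + len(w2))

def diff(w1, w2):
    # Difference: the unmatched proportion (simplified).
    return 1.0 - comm(w1, w2)

def winkler(w1, w2):
    # Winkler-style bonus for a shared prefix of up to 4 characters.
    p = 0
    for a, b in zip(w1[:4], w2[:4]):
        if a != b:
            break
        p += 1
    return 0.1 * p * diff(w1, w2)

def sim_str(w1, w2):
    return comm(w1, w2) - diff(w1, w2) + winkler(w1, w2)

assert sim_str("paper", "paper") == 1.0
assert sim_str("paper", "papers") > sim_str("paper", "tree")
```
        </preformat>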
        <p>
          WordNet-based method. We use an electronic lexicon, WordNet, to calculate
the similarity values between words. The similarity between two words wi and wj is
measured using the inverse of the sum of the lengths of the shortest paths [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]:
simWN(wi, wj) = 1 / (llength + rlength) (4)
where llength is the length of the shortest path from word node wi to its common hypernym with word
node wj, and rlength denotes the length of the shortest path from wj to its common hypernym with wi.
Instead of matching all concepts by traversing the taxonomies completely, the goal is to
find links between the ontologies; at this step, we only consider finding links in the
Cartesian product (×) of the two ontologies. These links, which are very important matching
concepts, are used to reduce the time complexity of matching without exploring other
commonalities between the neighbors of the corresponding links (initial links generation).
The algorithm proposed here generates a set of matching concepts as the initial links (see
Algorithm 1). The function Sim is an aggregated similarity function incorporating name
and structural similarities (step 2):
        </p>
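        <p>Equation (4) can be illustrated on a toy hypernym tree (hypothetical Python; a real system would query WordNet itself rather than a hard-coded tree):</p>
        <preformat>
```python
# Hypothetical sketch of equation (4): simWN = 1 / (llength + rlength),
# using a hard-coded toy hypernym tree in place of WordNet.
def path_to_root(word, hypernym):
    path = [word]
    while word in hypernym:
        word = hypernym[word]
        path.append(word)
    return path

def sim_wn(w1, w2, hypernym):
    p1 = path_to_root(w1, hypernym)
    p2 = path_to_root(w2, hypernym)
    common = set(p1).intersection(p2)
    if not common:
        return 0.0
    # llength + rlength: shortest combined path to a common hypernym.
    total = min(p1.index(c) + p2.index(c) for c in common)
    if total == 0:
        return 1.0  # identical words
    return 1.0 / total

hyp = {"dog": "canine", "canine": "mammal", "cat": "feline",
       "feline": "mammal", "mammal": "animal"}
assert sim_wn("dog", "cat", hyp) == 0.25  # 1 / (2 + 2) via "mammal"
```
        </preformat>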
        <p>The two algorithms below realize this step: Algorithm 1 generates the initial links,
and Algorithm 2 extracts the strong links from them.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Algorithm 1: initial links generation</title>
        <p>Input: two ontologies O1, O2
Output: Links (the initial neighbor set)
Links ← Ø
For each pair (ai, bj) ∈ O1 × O2 do
    Compute sim(ai, bj)
    If sim(ai, bj) &gt; 0 then Links ← Links ∪ {(ai, bj)}
End
Return Links</p>
      </sec>
      <sec id="sec-4-3">
        <title>Algorithm 2: strong-links extraction</title>
        <p>Input: ontology O1, ontology O2, Links (generated by Algorithm 1)
Output: the set of strong links SN(ai)
SN ← Ø
For each bj ∈ O2 do
    Compute sim(ai, bj)
    If sim(ai, bj) ≥ threshold then SN ← SN ∪ {bj}
End
Return SN(ai)</p>
        <sec id="sec-4-3-1">
          <title>Reducing the similarity calculations</title>
          <p>To obtain efficient results, two possible optimizations are provided:
─ If concept A matches concept B, we need not calculate the similarity between the
sub-concepts (/super-concepts) of A and the super-concepts (/sub-concepts) of B; thus
we can reduce the total number of similarity calculations.
─ If A does not match B, it is very likely that their neighbors do not match
each other either, which implies that we can skip many similarity calculations.</p>
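          <p>The two reduction rules above can be sketched as follows (hypothetical Python; the function name and the toy hierarchies are assumptions): pairs implied by a strong link or by a low link are collected into a skip set and never submitted to the similarity function.</p>
          <preformat>
```python
# Hypothetical sketch of the two reduction rules: for a strong link we skip
# sub(a) x sup(b) and sup(a) x sub(b); for a low link we skip the whole
# neighbor-by-neighbor product.
def build_skip_set(pair, kind, sub, sup):
    a, b = pair
    skip = set()
    if kind == "strong":
        # Rule 1: a matches b, so skip sub(a) x sup(b) and sup(a) x sub(b).
        skip.update((x, y) for x in sub(a) for y in sup(b))
        skip.update((x, y) for x in sup(a) for y in sub(b))
    else:
        # Rule 2: a does not match b, so skip neighbors(a) x neighbors(b).
        skip.update((x, y) for x in sub(a) | sup(a) for y in sub(b) | sup(b))
    return skip

# Toy hierarchies: sub/sup maps covering both tiny ontologies.
sub1 = {"Vehicle": {"Car"}}; sup1 = {"Vehicle": {"Thing"}}
sub2 = {"Automobile": {"Sedan"}}; sup2 = {"Automobile": {"Entity"}}
sub = lambda c: sub1.get(c, set()) or sub2.get(c, set())
sup = lambda c: sup1.get(c, set()) or sup2.get(c, set())

skip = build_skip_set(("Vehicle", "Automobile"), "strong", sub, sup)
assert ("Car", "Entity") in skip and ("Thing", "Sedan") in skip
```
        </preformat>
          <p>Every pair placed in the skip set is one similarity computation saved in the remaining matching process.</p>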
          <p>Obviously, the strong links and the low links need to be discovered dynamically
during matching, and then used to optimize the similarity calculations. For
SN(ai) = {b1, b2, …, bk}, the strong-links reduction set RSN(ai) is calculated by:
RSN(ai) = [sub(ai) × sup(lub(b1, …, bk))] ∪ [sup(ai) × sub(glb(b1, …, bk))]
where lub(b1, …, bk) and glb(b1, …, bk) are the least upper bound and the greatest lower
bound of (b1, …, bk). The total strong-links reduction set over the whole matching
process is RSN = ∪ RSN(ai), i = 1..n (see Algorithms 2 and 3).</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>Algorithm 3: neighbor generation for strong links</title>
          <p>Input: ontology O1, ontology O2, Strong-Links (generated by Algorithm 2)
Output: the total strong-links sets
Matchedset ← strong-Links(ai)
Generate the neighbors of ai: {sub(ai), sup(ai)}
For each bj ∈ SN do
    Generate the neighbors of bj: {sub(bj), sup(bj)}
End</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>First of all, the analysis of the existing matching systems shows that there is
always a trade-off between effectiveness and efficiency. The main goal of this paper is
to deal with wide-scale semantic heterogeneity in large-scale ontology matching. For
this purpose, we focus on reducing the complexity of the matching space in the face of
wide-scale semantic heterogeneity. To accomplish this, we propose to skip subsequent
matching between the sub-concepts of one concept and the super-concepts of the other
concept (via shortcuts) in the input ontologies. However, one may ask whether this solution
is well adapted to finding the most correct mappings between two concepts and to
discovering mappings offline across different ontologies. As future work, we aim at
answering these questions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Shvaiko</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Euzenat</surname>
            <given-names>J</given-names>
          </string-name>
          , “
          <article-title>Ten challenges for ontology matching</article-title>
          ”, In
          <source>Confederated International Conference on the Move to Meaningful Internet Systems</source>
          , pp.
          <fpage>1164</fpage>
          -
          <lpage>1182</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Rahm</surname>
            <given-names>E</given-names>
          </string-name>
          , “
          <article-title>Towards Large-Scale Schema and Ontology Matching</article-title>
          ”, In
          <source>Schema Matching and Mapping</source>
          , Bellahsene Z, Bonifati A, Rahm E, eds. New York: Springer Heidelberg, pp.
          <fpage>3</fpage>
          -
          <lpage>27</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>F.</given-names>
            <surname>Hamdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Safar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Reynaud</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Zargayouna</surname>
          </string-name>
          .
          <article-title>Alignment-based partitioning of large-scale ontologies</article-title>
          .
          <source>In Advances in Knowledge Discovery and Management</source>
          , volume
          <volume>292</volume>
          , pages
          <fpage>251</fpage>
          -
          <lpage>269</lpage>
          . Springer,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>J.</given-names>
            <surname>Seidenberg</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.L.</given-names>
            <surname>Rector</surname>
          </string-name>
          , “
          <article-title>Web ontology segmentation: Analysis, classification and use</article-title>
          ”, In Modular Ontologies,
          <string-name>
            <given-names>H.</given-names>
            <surname>Stuckenschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Parent</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Spaccapietra</surname>
          </string-name>
          , LNCS 5445, Springer,
          <year>2009</year>
          , pp.
          <fpage>211</fpage>
          -
          <lpage>243</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>P.</given-names>
            <surname>Doran</surname>
          </string-name>
          ,
          <article-title>Ontology modularization: principles and practice</article-title>
          .
          <source>Doctoral thesis</source>
          , University of Liverpool, October (
          <year>2009</year>
          ) .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>P.</given-names>
            <surname>Bouquet</surname>
          </string-name>
          , L. Serafini, S. Zanobini:
          <article-title>Semantic coordination: A new approach and an application</article-title>
          .
          <source>In Proceedings of the 2nd Int. Semantic Web Conf. (ISWC'03)</source>
          . (
          <year>2003</year>
          )
          <fpage>130</fpage>
          -
          <lpage>145</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>G.</given-names>
            <surname>Stoilos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.B.</given-names>
            <surname>Stamou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.D.</given-names>
            <surname>Kollias</surname>
          </string-name>
          :
          <article-title>A string metric for ontology alignment</article-title>
          .
          <source>In Proceedings of the 4th Int. Semantic Web Conference (ISWC'05)</source>
          . (
          <year>2005</year>
          )
          <fpage>624</fpage>
          -
          <lpage>637</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>I.</given-names>
            <surname>Palmisano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tamma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Payne</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Doran</surname>
          </string-name>
          , “
          <article-title>Task oriented evaluation of module extraction techniques”</article-title>
          , In ISWC
          , LNCS 5823, Springer,
          <year>2009</year>
          , pp.
          <fpage>130</fpage>
          -
          <lpage>145</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>M.</given-names>
            <surname>d'Aquin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schlicht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Stuckenschmidt</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sabou</surname>
          </string-name>
          , “
          <article-title>Criteria and evaluation for ontology modularization techniques”</article-title>
          , In Modular Ontologies,
          <string-name>
            <given-names>H.</given-names>
            <surname>Stuckenschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Parent</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Spaccapietra</surname>
          </string-name>
          , LNCS 5445, Springer,
          <year>2009</year>
          , pp.
          <fpage>67</fpage>
          -
          <lpage>89</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>