<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>FORPM: Boosting Users' E ect on Ontology Matching</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dunwei Wen</string-name>
          <email>dwwen@mail.csu.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaohu Fan</string-name>
          <email>fanxiaohu17@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fuhua Lin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computing and Information Systems, Athabasca University</institution>
          ,
          <addr-line>Athabasca, Alberta T9S 3A3</addr-line>
          ,
          <country>Canada fdunweiw</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Information Science and Engineering, Central South University</institution>
          ,
          <addr-line>Changsha, Hunan 410083</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we attempt to view the ontology matching task from an information gaining angle. In our opinions, the information used for matching mainly comes from the matching tools as well as the human experts. With this understanding, we believe that by making good use of user e orts, we can also accelerate the matching process. Hence we present a prototype system named FORPM. First, it ranks the entities of the ontology. Important entities are chosen as centroids to form fragments. Then, users can use those centroids' information to estimate the content of the fragments and initially match them. Finally, automatic matching is carried out among those matched fragments. Experiment results obtained so far show that with a few user e orts, our approach signi cantly improves the matching e ciency while the loss of accuracy is acceptable.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Motivation</title>
      <p>
        Ontology matching aims at nding semantic relationships between entities of di erent
ontologies for solving the interoperation problem. From the viewpoint of information
theory, we view a matching problem as a process of information gaining with
uncertainty ( ). Let ' denotes the information obtained by the matching tool (from ontology
itself as well as external source like WordNet), ! denotes the information provided by
users in the validation step (we hold the same kind of opinions with [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] that fully
automatic ontology matching is still impossible). To obtain matching results of high quality,
we believe that the following equation has to be satis ed:
! + '
( ):
(1)
      </p>
      <p>On one hand, to our best knowledge, recently numerous researches have been
focused on how to maximize ' and made great progresses. On the other hand, we believe
that the human users, especially the domain experts are capable of discovering
complex relationships, such as more general ( ), less general ( ) etc., between candidate
pairs. This extra information, however, is always ignored and human e ort is simply
used in the validation step to judge simple relations such as matching or not matching.
With these understandings, our work is aiming at making good use of the information
provided by users, thus accelerating the matching process.</p>
      <p>In this paper we propose: 1) an information theory based model for concept ranking
and centroid extraction; 2) a clustering algorithm for ontology partitioning. To test our
approch, we also introduce a prototype system called FORPM.</p>
    </sec>
    <sec id="sec-2">
      <title>FORPM (Framework for Ontology Ranking, Partitioning and</title>
    </sec>
    <sec id="sec-3">
      <title>Matching)</title>
      <p>FORPM is implemented with Java under JDK 5.0 and Eclipse 3.1.2. The system
architecture is shown below in Fig 1. First, two ontologies are input and then transformed
into DAG (Directed Acyclic Graph), where the is-a relations are transformed into arcs
and concepts are transformed into nodes. After the four main process steps in the dash
line, the result and a reference-mapping le are sent to the evaluation module, in which
the evaluation results are generated automatically and presented to the user.
Step 1: Entity Ranking Based on our observation, the amount of information
provided by a user equals the sum of the amount of information Ii provided in T times of
the user's validation.</p>
      <p>If we assume that the cost of the user's every validation be the same, then one
intuitive way to improve the matching e ciency is to maximize Ii wisely. Hence
fundamental to our ranking approach is the ability to measure how much information is conveyed
in a node thereby giving a sense of how much information the computer would gain
by being informed about a discovered matching pair.</p>
      <p>
        In information theory [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the amount of information contained in an event is
measured by the negative logarithm of the probability of occurrence of the event. Thus if
is an event that has possible outcome values x1; x2; :::; xn occurring with probabilities
pr1; pr2; :::; prn, the amount of information gained or uncertainty removed by knowing
that has the outcome xi is given by:
      </p>
      <p>I ( = xi) =
log (pri) :
(3)
If we have a node ni 2 Nodes, then the amount of information contained in ni is:
I ( = ai) =</p>
      <p>log ( = ai) :
I (ni) =</p>
      <p>X I(ai):
ai2arcsi</p>
      <p>Based on this we can build a model to measure the amount of information of a node
in an ontology graph by considering the concept as an event and is-a relations as its
outcomes. Assume that in the ontology graph G(Arcs; Nodes), where Arcs is the set of
all is-a relations and Nodes is the set of all concepts in ontology O. Then for any arc
ai 2 Arcs, its probability is given by</p>
      <p>Pr ( = ai) =</p>
      <p>1 !
jarcsij
:
Here jarcsij is the number of arcs connecting with Node ni. Thus the amount of
information contained in arc ai is:
We use the amount of information to rank nodes. The node contains the most amount
of information is de ned as an information center.</p>
      <p>De nition 1 (Information Center/Centroid Node). Let G(N, A) be an ontology graph,</p>
      <sec id="sec-3-1">
        <title>N be the node set, A be the arcs set, then node Ic 2 N is an information center if for any node ni 2 N:</title>
        <p>I (Ic)</p>
        <p>I (ni) :
Step 2: IFC (Information Flooding theory based Clustering)</p>
        <p>The goal of this step is to form fragments from centroids. We noticed that in the
isa hierarchy tree, semantic similarity between two concepts often decays as the distance
between them increases. In our work, we de ne an information ooding function to
measure how strong a source node could a ect a target node.</p>
      </sec>
      <sec id="sec-3-2">
        <title>De nition 2 (Information Flood). Let G (N, A) be an ontology graph, ni, n j 2N, we de ne the information ood from ni to n j as:</title>
        <p>In f oFlood ni; n j = F Disti j I (ni) :
(4)
(5)
(6)
(7)
(8)
(9)
Where I(ni) is the information contained in ni, F(Disti j) is a quadratic experiential
decay function to simulate the attenuation of similarity de ned as</p>
        <p>F(Disti j) =</p>
        <p>1
a</p>
        <p>Disti2j + b</p>
        <p>Disti j + c
:
and Disti j is the number of arcs between node n j and node n j in the is-a relation
hierarchy tree. In our experiment, we have a = 0:25; b = 0:5; c = 0.</p>
      </sec>
      <sec id="sec-3-3">
        <title>De nition 3 (Fragment). Let O be an Ontology, Gi(N, A) be a graph representing part</title>
        <p>of O, where N is the node set, A is the arc set. Let dList be the set of all centroid nodes
in O. If for any d j 2 dList and all nk 2 N, we have di 2 dList which satis es</p>
      </sec>
      <sec id="sec-3-4">
        <title>In f oFlood (di; nk)</title>
      </sec>
      <sec id="sec-3-5">
        <title>Max In f oFlood d j; nk :</title>
        <p>(10)</p>
      </sec>
      <sec id="sec-3-6">
        <title>Then we say Gi is a fragment f(di,mi) with mi as its size and di as its centroid node.</title>
        <p>We brie y describe the partitioning algorithm in Table 1. The algorithm receives two
parameters, Max, the upper bound of the number of the fragment, and Min, the lower
bound of the size of the fragment. We set a max iteration number to ensure the stop of
the algorithm. Also a merge algorithm is implemented to deal with the fragments whose
size is below the lower bound.</p>
        <p>Step 3: Manual Matching.</p>
        <p>
          In this step, users use those centroid nodes to estimate the content of the fragments
and then match them manually. A centroid node may have more than one counterparts
with relations such as equivalence (=), more general( ), less general( ), mismatch (?)
and overlapping (\). Two centroids are considered semantically matched [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] if the
relation between them is not mismatch.
        </p>
        <p>Step 4: Automatic Matching.</p>
        <p>
          Two fragments are viewed as matched if their centroids are semantically matched,
the remaining matching work between two matched fragments is the same as a normal
task but of smaller sizes. Various approaches could be adopted here to nish the task.
In FORPM, we employ the same string-based tech in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] for demonstration.
3
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation and Discussion</title>
      <p>
        In our experiment, we adopted a dataset from [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The Russia1a contains 151 concepts
while Russia1b contains 162 concepts, with 64 human con rmed mappings (concepts
only). We used F-Measure [cf.4] and Cost (see blow) as quility metrics.
      </p>
      <p>Cost =</p>
      <p>#Compare T imes
#Found Marched Pairs
:
(11)</p>
      <p>In FORPM, users can tune the system by changing the value of Max and Min in
step2. We can see in Fig. 2 that the more fragments there are, the more likely users
are to make right judgememts which lower the cost. Meanwhile more human work is
required (The user has to do Max*Max times validations at most). According to our
experience, it seems that the program performs best when the parameter Max is set
between 5% 10% of the total number of the concepts.
1
0.9
0.8
0.7</p>
      <p>e
0.6 r</p>
      <p>u
0.5 sea
0.4</p>
      <p>M
0.3 F
0.2
0.1
0</p>
      <p>Max</p>
    </sec>
    <sec id="sec-5">
      <title>Concluding Remarks</title>
      <p>In this paper, we have proposed an information gaining theory based framework for
ontology matching. We have shown that with a few user e orts, our approach is e ective
in reducing the matching complexity.</p>
      <p>
        Our work is inspired by data mining technology. We gain our idea of information
model from [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Our tool refers to [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]'s work in implementation, while [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] propose
an automatic block based matching approach. Both [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and our tool employ a ranking
step to label the blocks. However, [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]'s rank step is after the partitioning step, while in
FORPM, ranking step is rstly carried out since we employ an extraction-like clustering
algorithm.
5
      </p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>We thank National Science and Engineering Research Council of Canada, Academic
Research Council of Athabasca University for their nancial support for the research,
and thank anonymous reviewers for their detailed and constructive comments.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>N.F.</given-names>
            <surname>Noy</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.A.</given-names>
            <surname>Musen</surname>
          </string-name>
          <article-title>: The PROMPT suite: interactive tools for ontology merging and mapping</article-title>
          .
          <source>International Journal of Human-Computer Studies</source>
          .
          <year>2003</year>
          ,
          <volume>59</volume>
          (
          <issue>6</issue>
          ):
          <fpage>983</fpage>
          -
          <lpage>1024</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>C.E.</surname>
          </string-name>
          <article-title>SHANNON: A Mathematical Theory of Communication</article-title>
          .
          <source>The Bell System Technical Journal</source>
          .Vol.
          <volume>27</volume>
          , pp.
          <fpage>379</fpage>
          -
          <lpage>423</lpage>
          ,
          <fpage>623</fpage>
          -
          <lpage>656</lpage>
          ,July, October,
          <year>1948</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Avesani</surname>
          </string-name>
          , Fausto Giunchiglia, and
          <article-title>Mikalai Yatskevich: A Large Scale Taxonomy Mapping Evaluation</article-title>
          .
          <source>ISWC 2005.LNCS 3729</source>
          , pp.
          <fpage>67</fpage>
          -
          <lpage>81</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Marc</given-names>
            <surname>Ehrig</surname>
          </string-name>
          and
          <article-title>Ste en Staab: QOM- Quick Ontology Mapping</article-title>
          .
          <source>The Semantic Web Proceedings ISWC'04</source>
          .pp.
          <fpage>289</fpage>
          -
          <lpage>303</lpage>
          ,
          <year>2004</year>
          . Springer-Verlag LNCS 3298.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Kemafor</given-names>
            <surname>Anyanwu Angela Maduko</surname>
          </string-name>
          and Amit Sheth :
          <source>SemRank: Ranking Complex Relationship Search Results on the Semantic Web In: WWW 2005 May 10-14</source>
          ,
          <year>2005</year>
          , Chiba, Japan.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Heiner</given-names>
            <surname>Stuckenschmidt</surname>
          </string-name>
          and Michel Klein:
          <article-title>Structure-Based Partitioning of Large Concept Hierarchies</article-title>
          .
          <source>In: The SemanticWeb Proceedings ISWC'04</source>
          pp.
          <fpage>289</fpage>
          -
          <lpage>303</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>W.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          <source>Qu: Partition-Based Block Matching of Large Class Hierarchies In Proceedings of ASWC</source>
          <year>2006</year>
          , Beijing, China,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>