<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ontology Mapping via Structural and Instance-Based Similarity Measures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Konstantin Todorov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Geibel</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IKW, University of Osnabrück</institution>
          ,
          <addr-line>Albrechtstr. 28, 49076 Osnabrück</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TU Berlin, Fakultät IV</institution>
          ,
          <addr-line>Franklinstr. 28/29, 10587 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The paper presents an overview of a novel procedure for mapping hierarchical ontologies populated with properly classified text documents. The procedure combines structural and instance-based approaches in order to reduce terminological and conceptual ontology heterogeneity. It yields important granularity and instantiation judgments about the inputs and is intended to be applied to mapping web directories.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Each document is represented as a TF/IDF vector [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We assume that O1 and O2
share a significant extensional overlap and all the documents are in the same
natural language.
      </p>
      <p>
        An entity which plays a key role in our approach is the intersection of DO1 and
DO2. However, when the elements of the sets are vectors of text documents, it is very
likely that the sets contain documents which are similar, but not identical, and
therefore not part of the intersection. In order to make use of such documents, we
introduce the notion of relaxed intersection (RI), which integrates both identical
and similar documents from the two sets, as opposed to the standard strict set
intersection (Figure 1(b)): RI(DO1, DO2) = {(diO1, djO2) | dist(diO1, djO2) ≤ cd, diO1 ∈
DO1, djO2 ∈ DO2}, where dist is a properly chosen distance measure on the set of
TF/IDF documents [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and cd is a similarity parameter to be empirically set. In
the sequel, by document set intersection we will mean their relaxed intersection.
      </p>
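A minimal sketch of the relaxed intersection on dense TF/IDF vectors follows. The cosine distance and the threshold value are illustrative choices of ours, since the text leaves dist and cd open to be chosen and set empirically:

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length TF/IDF vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 1.0
    return 1.0 - dot / (nu * nv)

def relaxed_intersection(docs1, docs2, cd=0.3, dist=cosine_distance):
    """RI(D_O1, D_O2): pairs of documents from the two sets whose
    distance is at most cd (identical documents have distance 0)."""
    return [(d1, d2) for d1 in docs1 for d2 in docs2 if dist(d1, d2) <= cd]
```

The strict intersection is recovered as the special case cd = 0 with an exact distance measure.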
    </sec>
    <sec id="sec-2">
      <title>Structural and Instance-based Mapping Strategies</title>
      <p>
In the following, we will describe the structural approach which forms the first
part of our mapping strategy. A hierarchical ontology as described in definition
1 is directly translated into a directed rooted tree G(V, E). Since only hyponymic
relations are allowed at this stage, we will assume that the ontology graphs are
unlabeled. Bunke et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] introduced a graph distance, which accounts for the
structural closeness of two taxonomies, represented as non-empty graphs G1 and
G2:
d(G1, G2) = 1 − |mcs(G1, G2)| / max(|G1|, |G2|).
      </p>
      <p>
        The abbreviation mcs stands for maximal common subgraph of G1 and G2
defined as a maximal graph isomorphic to sub-graphs of both G1 and G2 and |G|
denotes the number of vertices in a graph G. The problem of finding an mcs is
solved in polynomial time for trees. Various algorithms are discussed in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
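Bunke's distance can be sketched as follows; for simplicity we assume here that the common subtree is anchored at both roots and match children by brute force, which is only viable for tiny trees (the polynomial-time tree algorithms the text refers to are discussed in [11]):

```python
from itertools import permutations

def tree_size(t):
    """Number of vertices of a rooted tree given as a nested list of children."""
    return 1 + sum(tree_size(c) for c in t)

def rooted_mcs_size(t1, t2):
    """Size of a maximal common subtree of two rooted, unlabeled trees,
    under the simplifying assumption that the common subtree is anchored
    at both roots.  Children are matched by brute force, so this sketch
    is only meant for tiny illustrative trees."""
    small, large = (t1, t2) if len(t1) <= len(t2) else (t2, t1)
    best = 0
    for perm in permutations(range(len(large)), len(small)):
        best = max(best, sum(rooted_mcs_size(small[i], large[j])
                             for i, j in enumerate(perm)))
    return 1 + best

def graph_distance(t1, t2):
    """Bunke's distance: d(G1, G2) = 1 - |mcs(G1, G2)| / max(|G1|, |G2|)."""
    return 1.0 - rooted_mcs_size(t1, t2) / max(tree_size(t1), tree_size(t2))
```

For example, a root with two leaves versus a root with a leaf and a two-node chain share a common subtree of size 3, giving a distance of 1 − 3/4 = 0.25.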
      <p>
        In addition to the structural approach, we employ two extensional
methods for deriving concept similarity assertions. Although independent of
one another, they can be combined to yield an improved similarity learner. In
both approaches, we make use of Support Vector Machines (SVMs) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
operating on the sets of TF/IDF documents assigned to the input ontologies. SVMs are
machine learning classifiers which can be trained on a given data set and learn to
discriminate between positive and negative instances. For two concepts A ∈ CO1
and B ∈ CO2 we define the data sets SO1 = {(diO1 , yiA)} and SO2 = {(djO2 , yjB)},
where diO1, djO2 ∈ Rd, i = 1, ..., nO1, j = 1, ..., nO2, with d the dimension of the
TF/IDF vectors. yA and yB are labels taking the value +1 when the corresponding
document is assigned to A or B, respectively, and −1 otherwise. The labels thus
separate the documents in each ontology into those that belong to the given concept A
or B, respectively (positive instances) and those that do not (negative instances).
      </p>
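A toy illustration of how such a labeled data set can be built and a classifier trained; scikit-learn and the document texts below are our own assumptions for the sketch, not components named by the text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy document set for ontology O1; documents under concept A get label +1,
# all others -1 (the data set S_O1 = {(d_i, y_i^A)} from the text).
docs_O1 = [
    "jazz saxophone improvisation concert",
    "blues guitar concert recording",
    "stock market trading shares",
    "interest rates central bank policy",
]
labels_A = [1, 1, -1, -1]  # first two documents belong to concept A ("music")

vectorizer = TfidfVectorizer()          # TF/IDF representation [7]
X_O1 = vectorizer.fit_transform(docs_O1)

svm_A = LinearSVC()                     # linear SVM classifier [3]
svm_A.fit(X_O1, labels_A)

# The trained classifier can now be applied to documents of the other
# ontology, transformed into the same TF/IDF space:
X_new = vectorizer.transform(["piano concert recording"])
decision = svm_A.predict(X_new)         # membership in A: +1 or -1
```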
      <p>
        One convenient way of making use of extensional information is to model
concepts as "bags" of instances and to measure their similarity on set-theoretic
grounds, considering A and B very similar when A ∩ B ≈ A. A standard
instance-based similarity measure is the Jaccard coefficient [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], defined as Jacc(A, B) = P(A ∩ B) / P(A ∪ B), where P(X) is the probability of a random instance being an element of
X. Note that P(A ∩ B) = P(A, B) and P(A ∪ B) = P(A, B) + P(Ā, B) + P(A, B̄),
where P(A, B) denotes the joint probability of A and B. Each of the
three joint probabilities is estimated by the fraction of documents that belong
to the respective sets; for instance, P(A, B) = (|A ∩O1 B| + |A ∩O2 B|) / (|DO1| + |DO2|),
where ∩O1 denotes the intersection restricted to
documents belonging to O1 only. By training an SVM classifier on the data set
SO1 and applying it on the document set DO2 we come up with an estimation
of the quantity |A ∩O2 B|. Repeating the procedure with the roles of
O1 and O2 reversed yields |A ∩O1 B|. The same algorithm is applied for the other joint
probabilities until we have approximations of all of them, as described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
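Once membership decisions are available for the pooled document set, the estimate can be assembled as below; the boolean membership vectors stand in for the SVM decisions, a simplified sketch of the estimation scheme of [4]:

```python
def jaccard_estimate(in_A, in_B):
    """Estimate Jacc(A, B) = P(A ∩ B) / P(A ∪ B) from membership
    vectors over the pooled document set D_O1 ∪ D_O2.  in_A[k] is True
    when document k is (classified as) an instance of A; likewise in_B."""
    n_both = sum(a and b for a, b in zip(in_A, in_B))
    n_either = sum(a or b for a, b in zip(in_A, in_B))
    return n_both / n_either if n_either else 0.0
```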
      <p>
        The second extensional indicator for semantic closeness we propose is based
on a variable selection procedure for SVMs. Variable selection in
descriptive statistics is concerned with identifying the input variables that most strongly
affect the response. For a given data set of the type SO1, it indicates which of
the TF/IDF vector dimensions are most important for the separation of the
documents into those that belong to a given concept and those that do not. Our
variable selection criterion is the sensitivity of the VC dimension [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], an
indicator of the classifying capacity of a set of classifiers (e.g., the set of hyperplanes
in a multidimensional space). Our initial experiments have shown variations in
the estimation of that parameter according to the presence or absence of a given
variable (vector dimension) in the data set. The procedure yields, for each
ontology, an ordered list of variables, at the top of which are found those
most important for the class separation. If the orders of the variables in both
sets are similar, or if a significant number of most pertinent variables from both
sets coincide, then the two concepts A and B are identified as similar.
      </p>
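The comparison of the two ranked variable lists can be sketched as follows. Note that the importance score below (the absolute weight of a linear classifier) is a stand-in of ours, not the VC-dimension sensitivity criterion itself:

```python
def rank_variables(weights):
    """Order variable indices by decreasing importance.  The absolute
    weight of a linear classifier serves here as a stand-in importance
    score; the actual criterion in the text is the sensitivity of the
    VC-dimension estimate to the presence or absence of the variable."""
    return sorted(range(len(weights)), key=lambda j: -abs(weights[j]))

def top_k_agreement(ranking1, ranking2, k):
    """Fraction of the k most pertinent variables shared by both lists;
    a high value flags the two concepts as similar."""
    return len(set(ranking1[:k]) & set(ranking2[:k])) / k
```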
    </sec>
    <sec id="sec-3">
      <title>A Procedure for Ontology Mapping</title>
      <p>
The structural and extensional approaches described so far are our instruments
used to build a combined procedure for ontology mapping. Another important
criterion for concept similarity is the presence of similar concept names in both
ontologies. Linguistic analysis approaches to this problem, relying on names and
textual description of ontology elements, are used in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Even though
not explicitly discussed in this section, we keep in mind that this name-based
criterion is to be checked at any step before measuring the instance-based
similarity of a pair of concepts and is to become an integral part of the structural
similarity approach.
      </p>
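As a placeholder for this name-based criterion, a crude label comparison could look like the following; the difflib ratio and threshold are our own stand-ins for the more elaborate linguistic analysis of [2] and [9]:

```python
from difflib import SequenceMatcher

def name_similar(name1, name2, threshold=0.8):
    """Crude name-based criterion: normalized similarity of the two
    concept labels.  A real implementation would rely on linguistic
    analysis of names and textual descriptions, cf. [2, 9]."""
    ratio = SequenceMatcher(None, name1.lower(), name2.lower()).ratio()
    return ratio >= threshold
```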
      <p>Let us take as input again the ontologies O1 and O2 together with their
corresponding document sets DO1 and DO2. In the following, we describe our
method for combining the mapping approaches described earlier.</p>
      <p>Case 1: |DO1 | ≈ |DO2 |</p>
      <p>The first big case considers ontologies which contain a similar number of
documents. The ratio rΔ = |DO1 ∩ DO2| / |DO1 ∪ DO2| is an indicator of the size of the intersection
of both sets relative to the sets' size. There are two further possibilities:
- Case 1.1. rΔ &gt; crΔ, where crΔ ∈ (0, 1) is a parameter to be fixed. In this
case we have two different ontologies over (almost) the same document sets. It is
very likely that they share a conceptual similarity. We proceed to checking the
graph distance between them.</p>
      <p>- Case 1.1.a. d(GO1 , GO2 ) ≈ 0. The taxonomies have similar structures,
describe the same domain and have the same extensions. It remains to establish
the precise concept-to-concept mappings, which is done with the help of the instance-based
similarity check.</p>
      <p>- Case 1.1.b. d(GO1 , GO2 ) ≈ 1. The maximal common subgraph of both
ontologies is quite small, i.e., one of the taxonomies contains a significantly lower
number of nodes than the other (Figure 2(a)). Let us assume that
|CO1 | &lt; |CO2 |. Since both ontologies are "built" on approximately the same
sets of documents, this means that O2 is more specific than O1, and contains
more concept nodes. O1 can be directly injected into O2. The concept-to-concept
correspondences indicating the exact injection pattern are provided by the
instance-based concept similarity applied on the set of the nodes of the mcs of both
taxonomies.</p>
      <p>- Case 1.2. rΔ ≤ crΔ. The ontologies are unlikely to be similar since their
extensions share very little (or no) overlap.
Case 2: |DO1 | &lt; |DO2 |</p>
      <p>In the second big case, the set DO1 contains fewer documents than the set DO2
(a conventional choice). One can distinguish between three further sub-cases: Case
2.1. - the two sets do not intersect (rΔ = 0); Case 2.2. - the two sets intersect, but
do not fully overlap; and Case 2.3. - the smaller set is a subset of the bigger one.
Case 2.1. is in conflict with a major assumption introduced in the beginning and
therefore does not provide mapping candidates. Case 2.2. reduces to either
Case 2.1. or Case 2.3., depending on the size of the intersection DO1 ∩ DO2
relative to |DO1 |. We will study Case 2.3. in detail and proceed to measure the
structural similarity between the inputs.</p>
      <p>- Case 2.3.a. d(GO1 , GO2 ) ≈ 0. O1 is structurally very similar to O2. Hence,
it is just as specific as O2, but less populated with documents. This indicates
that O1 can be replaced entirely by O2.</p>
      <p>- Case 2.3.b. d(GO1 , GO2 ) ≈ 1. There are two different scenarios, depending
on which of the two input ontologies contains more nodes.</p>
      <p>1) |CO1 | &lt; |CO2 |, i.e. there are fewer concepts in O1 than in O2. This is the
case when O1 is a sub-taxonomy of O2 and can be entirely injected into it, as
described in Case 1.1.</p>
      <p>2) |CO2 | &lt; |CO1 |. O1 is a more granular hierarchy, but less populated than
O2 (Figure 2(b)). We will take instances from O2 and assign them to O1, by first
aligning the nodes of both ontologies with the help of the instance-based mapping
procedure.</p>
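The case analysis above can be condensed into the following control flow; all thresholds are illustrative placeholders rather than values fixed by the text, and the strict intersection ratio stands in for the relaxed one:

```python
def intersection_ratio(D1, D2):
    """Stand-in for r_Δ using the strict set intersection; the procedure
    actually computes it on the relaxed intersection of the document sets."""
    union = D1 | D2
    return len(D1 & D2) / len(union) if union else 0.0

def mapping_strategy(D1, D2, graph_dist, n_concepts1, n_concepts2, cr=0.5):
    """Sketch of the case analysis.  graph_dist is the precomputed
    distance d(G_O1, G_O2); thresholds are illustrative placeholders."""
    similar_sizes = min(len(D1), len(D2)) / max(len(D1), len(D2)) > 0.8
    r = intersection_ratio(D1, D2)
    if similar_sizes:                                    # Case 1: |D_O1| ≈ |D_O2|
        if r <= cr:                                      # Case 1.2: little overlap
            return "no mapping candidates"
        if graph_dist < 0.2:                             # Case 1.1.a: similar structures
            return "concept-to-concept mapping via instance similarity"
        # Case 1.1.b: inject the less specific ontology into the more specific one
        return "inject O1 into O2" if n_concepts1 < n_concepts2 else "inject O2 into O1"
    # Case 2: |D_O1| < |D_O2| by convention
    if r == 0.0:                                         # Case 2.1: no overlap
        return "no mapping candidates"
    if graph_dist < 0.2:                                 # Case 2.3.a: same structure
        return "replace O1 by O2"
    # Case 2.3.b: depends on which ontology has more concept nodes
    if n_concepts1 < n_concepts2:                        # sub-case 1)
        return "inject O1 into O2"
    return "transfer instances from O2 to O1"            # sub-case 2)
```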
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>H.</given-names>
            <surname>Bunke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shearer</surname>
          </string-name>
          .
          <article-title>A graph distance metric based on the maximal common subgraph, Pattern Recogn</article-title>
          .
          <source>Lett.</source>
          , volume
          <volume>19</volume>
          , number
          <issue>3-4</issue>
          ,
          <fpage>255</fpage>
          -
          <lpage>259</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hotho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Staab</surname>
          </string-name>
          .
          <article-title>Learning Concept Hierarchies from Text Corpora Using Formal Concept Analysis</article-title>
          ,
          <source>JAIR</source>
          Volume
          <volume>24</volume>
          ,
          <fpage>305</fpage>
          -
          <lpage>339</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>N.</given-names>
            <surname>Cristianini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shawe-Taylor</surname>
          </string-name>
          .
          <article-title>An Introduction to Support Vector Machines and other kernel-based learning methods</article-title>
          , Cambridge University Press,
          <source>ISBN 0-521-78019-5</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Madhavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Domingos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Halevy</surname>
          </string-name>
          .
          <article-title>Learning to map between ontologies on the semantic web</article-title>
          ,
          <source>WWW '02: Proceedings of the 11th international conference on World Wide Web</source>
          ,
          <fpage>662</fpage>
          -
          <lpage>673</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>J.</given-names>
            <surname>Euzenat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shvaiko</surname>
          </string-name>
          . Ontology Matching, Springer-Verlag New York, Inc.,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>A.</given-names>
            <surname>Isaac</surname>
          </string-name>
          , L. van der Meij, S. Schlobach, S. Wang.
          <article-title>An empirical study of instance-based ontology matching</article-title>
          .
          <source>In Proceedings of the 6th International Semantic Web Conference</source>
          , Busan, Korea,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          .
          <article-title>Text categorization with support vector machines: learning with many relevant features</article-title>
          .
          <source>Proceedings of ECML-98, 10th European Conference on Machine Learning, Number</source>
          <volume>1398</volume>
          ,
          <fpage>137</fpage>
          -
          <lpage>142</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>R.</given-names>
            <surname>Korfhage</surname>
          </string-name>
          .
          <source>Information Storage and Retrieval</source>
          , Section 5.7, Document Similarity,
          <fpage>125</fpage>
          -
          <lpage>133</lpage>
          . Wiley and Sons,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>G.</given-names>
            <surname>Stumme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maedche.</surname>
          </string-name>
          FCA-MERGE:
          <article-title>Bottom-Up Merging of Ontologies</article-title>
          , IJCAI,
          <fpage>225</fpage>
          -
          <lpage>234</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>K.</given-names>
            <surname>Todorov</surname>
          </string-name>
          .
          <article-title>Combining Structural and Instance-Based Ontology Similarities for Mapping Web Directories</article-title>
          ,
          ICIW
          ,
          <source>Third International Conference on Internet and Web Applications and Services</source>
          , Athens,
          <fpage>596</fpage>
          -
          <lpage>601</lpage>
          , IEEE,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. G. Valiente.
          <source>Algorithms on Trees and Graphs</source>
          , Springer-Verlag,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>