<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CODI: Combinatorial Optimization for Data Integration - Results for OAEI 2010</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Noessner</string-name>
          <email>jan@informatik.uni-mannheim.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mathias Niepert</string-name>
          <email>mathias@informatik.uni-mannheim.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>KR &amp; KM Research Group, University of Mannheim</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The problem of linking entities in heterogeneous and decentralized data repositories is the driving force behind the data and knowledge integration effort. In this paper, we describe our probabilistic-logical alignment system CODI (Combinatorial Optimization for Data Integration). The system provides a declarative framework for the alignment of individuals, concepts, and properties of two heterogeneous ontologies. CODI leverages both logical schema information and lexical similarity measures with a well-defined semantics for A-Box and T-Box matching. The alignments are computed by solving corresponding combinatorial optimization problems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>Presentation of the system</title>
      <sec id="sec-2-1">
        <title>State, purpose, general statement</title>
        <p>
          CODI (Combinatorial Optimization for Data Integration) leverages terminological
structure for ontology matching. The current implementation produces mappings between
concepts, properties, and individuals including mappings between object and data type
properties. The system combines lexical similarity measures with schema information
to reduce or completely avoid incoherence and inconsistency during the alignment
process. The system is based on the syntax and semantics of Markov logic [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and
transforms the alignment problem into a maximum-a-posteriori (MAP) optimization problem.
        </p>
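        <p>To make the optimization concrete, the MAP selection over weighted correspondences can be sketched as a brute-force search. This is an illustration of the problem being solved, with hypothetical concept names and weights; it is not CODI's actual inference procedure, which relies on dedicated MLN solvers:</p>
```python
from itertools import product

def map_alignment(candidates, score, hard_ok):
    # Enumerate all subsets of candidate correspondences and keep the
    # subset with maximal summed soft weight among those satisfying
    # every hard constraint (checked by hard_ok).
    best, best_score = [], float("-inf")
    for bits in product([False, True], repeat=len(candidates)):
        chosen = [c for c, b in zip(candidates, bits) if b]
        if not hard_ok(chosen):
            continue
        s = sum(score[c] for c in chosen)
        if s > best_score:
            best, best_score = chosen, s
    return best, best_score

# Hypothetical candidates: "Person" may map to "Human" or "Agent",
# but a functional alignment forbids mapping it twice.
cands = [("Person", "Human"), ("Person", "Agent")]
weights = {cands[0]: 0.9, cands[1]: 0.6}
functional = lambda chosen: len(set(c[0] for c in chosen)) == len(chosen)
sel, val = map_alignment(cands, weights, functional)
# sel keeps only the higher-weighted correspondence ("Person", "Human")
```
        <p>The exponential enumeration above is only viable for toy inputs; the point is the objective, which the later sections instantiate with concrete hard and soft formulae.</p>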
      </sec>
      <sec id="sec-2-2">
        <title>Specific techniques used</title>
        <p>
          Markov logic combines first-order logic and undirected probabilistic graphical
models [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. A Markov logic network (MLN) is a set of first-order formulae with weights.
Intuitively, the more evidence there is that a formula is true, the higher the weight of
this formula. It has been proposed as a possible approach to several problems
occurring in the context of the semantic web [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. We have shown that Markov logic provides
a suitable framework for ontology matching as it captures both hard logical axioms
and soft uncertain statements about potential correspondences between entities. The
probabilistic-logical framework we propose for ontology matching essentially adapts
the syntax and semantics of Markov logic. However, we always type predicates and
we require a strict distinction between hard and soft formulae as well as hidden and
observable predicates. Given a set of constants (the classes and object properties of
the ontologies) and formulae (the axioms holding between the objects and classes), a
Markov logic network defines a probability distribution over possible alignments. We
refer the reader to [
          <xref ref-type="bibr" rid="ref8 ref9">9, 8</xref>
          ] for an in-depth discussion of the approach and some
computational challenges. For generating the Markov logic networks we used the approach
described in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>T-Box Matching Formalization Given two ontologies O1 and O2 and an initial
a-priori similarity measure σ, we apply the following formalization. First, we introduce
observable predicates O to model the structure of O1 and O2 with respect to both
concepts and properties. For the sake of simplicity, we use uppercase letters D, E, R to
refer to individual concepts and properties in the ontologies and lowercase letters d, e, r
to refer to the corresponding constants in C. In particular, we add ground atoms of
observable predicates to F h for i ∈ {1, 2} according to the following rules1:</p>
        <p>Oi |= D ⊑ E ↦ subi(d, e)
Oi |= D ⊑ ¬E ↦ disi(d, e)
Oi |= ∃R.⊤ ⊑ D ↦ subid(r, d)
Oi |= ∃R.⊤ ⊒ D ↦ supid(r, d)
Oi |= ∃R.⊤ ⊑ ¬D ↦ disid(r, d)</p>
        <p>
The ground atoms of observable predicates are added to the set of hard constraints F h,
forcing them to hold in computed alignments. The hidden predicates mc and mp, on the
other hand, model the sought-after concept and property correspondences, respectively.
Given the state of the observable predicates, we are interested in determining the state
of the hidden predicates that maximize the a-posteriori probability of the corresponding
possible world. The ground atoms of these hidden predicates are assigned the weights
specified by the a-priori similarity σ. The higher this value for a correspondence the
more likely the correspondence is correct a-priori. Hence, the following ground
formulae are added to F s:
(mc(c, d), σ(C, D)) if C and D are concepts
(mp(p, r), σ(P, R)) if P and R are properties
Notice that the distinction between mc and mp is required since we use typed predicates
and distinguish between the concept and property type.</p>
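        <p>The translation of entailed axioms into ground atoms can be sketched as follows. This is a minimal illustration with hypothetical concept names in which entailment is approximated by the transitive closure of explicit subsumptions; CODI obtains the entailments from a complete OWL reasoner:</p>
```python
def ground_atoms(subclass_pairs, disjoint_pairs, i):
    # Approximate O_i |= D subsumed-by E via the transitive closure of
    # the explicitly stated subsumption pairs, then emit one ground
    # atom of the observable predicates sub_i / dis_i per entailment.
    closure = set(subclass_pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    atoms = [f"sub{i}({d}, {e})" for (d, e) in sorted(closure)]
    atoms += [f"dis{i}({d}, {e})" for (d, e) in sorted(set(disjoint_pairs))]
    return atoms

# Hypothetical axioms of O_1: reviewer below person, person below agent,
# and person disjoint with document.
atoms = ground_atoms({("reviewer", "person"), ("person", "agent")},
                     {("person", "document")}, i=1)
# the closure adds sub1(reviewer, agent) beyond the stated subsumptions
```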
        <p>
          Cardinality Constraints A method often applied in real-world scenarios is the
selection of a functional one-to-one alignment [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Within the ML framework, we can
include a set of hard cardinality constraints, restricting the alignment to be functional
and one-to-one. In the following we write x, y, z to refer to variables ranging over the
appropriately typed constants and omit the universal quantifiers.
        </p>
        <p>mc(x, y) ∧ mc(x, z) ⇒ y = z
mc(x, y) ∧ mc(z, y) ⇒ x = z</p>
        <p>Analogously, the same formulae can be included with the hidden predicate mp, restricting
the property alignment to be one-to-one and functional.</p>
        <p>1 Due to space considerations the list is incomplete. For instance, predicates modeling range
restrictions are not included.</p>
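        <p>The effect of the two cardinality formulae can be illustrated with a simple greedy extraction over hypothetical weighted candidates. Note that this greedy heuristic is only a sketch of what the constraints enforce; in CODI the formulae are hard constraints inside the optimization problem rather than a post-processing step:</p>
```python
def one_to_one(candidates):
    # Greedily pick correspondences by descending weight, never reusing
    # a source or a target entity; this yields a functional, one-to-one
    # alignment by construction.
    used_x, used_y, alignment = set(), set(), []
    for x, y, w in sorted(candidates, key=lambda t: -t[2]):
        if x not in used_x and y not in used_y:
            alignment.append((x, y))
            used_x.add(x)
            used_y.add(y)
    return alignment

# Hypothetical weighted candidate correspondences:
pairs = [("Paper", "Article", 0.9), ("Paper", "Document", 0.7),
         ("Author", "Document", 0.4)]
result = one_to_one(pairs)
# each source and each target entity occurs at most once in the result
```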
        <p>Coherence Constraints Incoherence occurs when axioms in ontologies lead to
logical contradictions. Clearly, it is desirable to avoid incoherence during the alignment
process. All existing approaches to alignment repair remove correspondences after the
computation of the alignment. Within the ML framework, we can incorporate
incoherence-reducing constraints during the alignment process for the first time. This is
accomplished by adding formulae of the following type to F h.</p>
        <p>
          dis1(x, x′) ∧ sub2(y, y′) ⇒ ¬(mc(x, y) ∧ mc(x′, y′))
dis1d(x, x′) ∧ sub2d(y, y′) ⇒ ¬(mp(x, y) ∧ mc(x′, y′))
        </p>
        <p>
Stability Constraints Several approaches to schema and ontology matching
propagate alignment evidence derived from structural relationships between concepts and
properties. These methods leverage the fact that existing evidence for the equivalence
of concepts C and D also makes it more likely that, for example, child concepts of C
and child concepts of D are equivalent. One such approach to evidence propagation is
similarity flooding [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. As a reciprocal idea, the general notion of stability was
introduced, expressing that an alignment should not introduce new structural knowledge [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
The soft formula below, for instance, decreases the probability of alignments that map
concepts X to Y and X ′ to Y ′ if X ′ subsumes X but Y ′ does not subsume Y .
(sub1(x, x′) ∧ ¬sub2(y, y′) ⇒ mc(x, y) ∧ mc(x′, y′), w1)
(sub1d(x, x′) ∧ ¬sub2d(y, y′) ⇒ mp(x, y) ∧ mc(x′, y′), w2)
Here, w1 and w2 are negative real-valued weights, rendering alignments that satisfy the
formulae possible but less likely.
        </p>
        <p>The presented list of cardinality, coherence, and stability constraints could be
extended by additional soft and hard formulae. Other constraints could, for example,
model known correct correspondences or generalize the one-to-one alignment to
m-to-n alignments.</p>
        <p>
          A-Box Matching The current instance matching configuration of CODI leverages
terminological structure and combines it with lexical similarity measures. The approach
is presented in more detail in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. It uses one T-Box T but two different A-Boxes
A1 ∈ O1 and A2 ∈ O2. In cases with two different T-Boxes, the T-Box matching
approach is applied as a preprocessing step that merges the two aligned T-Boxes before
our instance matching algorithm is run. CODI offers complete conflict elimination, meaning
that the resulting alignment is always coherent for OWL DL ontologies. This
component is based on the work of Meilicke et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. CODI enforces the instance alignment
to be consistent. To this end, we need to introduce observable predicates O to model
conflicts, that is, a positive assertion of one instance in one ontology and a negative
assertion of the same instance in the other ontology. This is done for both property and
concept assertions.
        </p>
        <p>Analogous to the concept and property alignment before, we introduce the hidden
predicate mi representing instance correspondences. Let C be a concept and P be a
property of T-Box T . Further, let A ∈ A1 and B ∈ A2 be individuals in the respective
A-Boxes. Then, using a reasoner, ground atoms are added to the set of hard constraints
F h according to the following rules:</p>
        <p>T ∪ A1 |= C(A) ∧ T ∪ A2 |= ¬C(B) ↦ ¬mi(a, b)
T ∪ A1 |= ¬C(A) ∧ T ∪ A2 |= C(B) ↦ ¬mi(a, b)
T ∪ A1 |= P(A, A′) ∧ T ∪ A2 |= ¬P(B, B′) ↦ ¬mi(a, b) ∨ ¬mi(a′, b′)
T ∪ A1 |= ¬P(A, A′) ∧ T ∪ A2 |= P(B, B′) ↦ ¬mi(a, b) ∨ ¬mi(a′, b′)</p>
        <p>
In addition to these formulae we included cardinality constraints analogous to those
used in the concept and property matching of Section 1.2. In the instance matching
formulation, the a-priori similarity σc and σp measures the normalized overlap of concept
and property assertions, respectively. For more details on these measures, we refer the
reader to [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The following formulae are added to the set of soft formulae F s:
(mi(a, b), σc(A, B)) if A and B are instances
(mi(a, b) ∧ mi(c, d), σp(A, B, C, D)) if A, B, C, and D are instances
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Adaptations made for the evaluation</title>
        <p>
The strength of the system is its modularity, which allows the incorporation of different
similarity measures. The system can be optimized in two major ways: (a) the inclusion of novel
formulae enforcing logical consistency and (b) the inclusion of additional similarity
measures. There is room for improvement since we used a very simple lexical
similarity measure based on the Levenshtein distance [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] for our experiments. It is possible to
apply different aggregation functions like average or maximum and to include specific
properties of an ontology like URIs, labels, and comments.
        </p>
        <p>In all OAEI test cases Algorithm 1 was used for computing the a-priori similarity
σ(entity1, entity2). In the case of concept and property alignments, the a-priori
similarity is computed by taking the maximal similarity between the URIs, labels and OBO
to OWL constructs. In the case of instance matching, the algorithm goes through all data
properties and takes the average of the similarity scores.</p>
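        <p>The concept and property case can be sketched as follows: a normalized Levenshtein similarity, maximized over all label pairs and cut off at the threshold of 0.85 used in our experiments. The label extraction itself (URIs, labels, OBO to OWL constructs) is simplified here to plain strings:</p>
```python
def levenshtein(s, t):
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def apriori_similarity(labels1, labels2, threshold=0.85):
    # Maximum normalized Levenshtein similarity over all label pairs,
    # returning 0 when the best value falls below the threshold.
    best = 0.0
    for l1 in labels1:
        for l2 in labels2:
            m = max(len(l1), len(l2))
            sim = 1.0 - levenshtein(l1, l2) / m if m else 1.0
            best = max(best, sim)
    return best if best >= threshold else 0.0

# "Conference" vs "conference": one substitution over ten characters
# gives similarity 0.9, which passes the threshold
s = apriori_similarity(["Conference"], ["conference"])
```
        <p>For instance matching, the same per-pair similarity would be averaged over the shared data properties instead of maximized.</p>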
      </sec>
      <sec id="sec-2-4">
        <title>Link to the System and Parameters File</title>
        <p>CODI can be downloaded from http://codi-matcher.googlecode.com.</p>
      </sec>
      <sec id="sec-2-5">
        <title>Link to the Set of Provided Alignments</title>
        <p>The alignments for the Benchmark and Conference tracks have been generated with the
SEALS platform. For Anatomy, IIMB, and Restaurant, the alignments can be found
at http://code.google.com/p/codi-matcher/downloads/list</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>In the following section, we present the results of the CODI system for the individual
OAEI tracks. Due to space considerations, we do not explain the different benchmarks
in more detail.</p>
      <p>[Algorithm 1: σ(entity1, entity2) — pseudocode not preserved in this version]</p>
      <p>Benchmark Track While our system’s strength is its modularity and adaptability to
different ontologies, we used the exact same setting for all ontology matching tracks.
Hence, the performance on the benchmark track is rather poor. This is primarily due
to the high threshold of 0.85 for the Levenshtein similarity measure that we applied in
each of the ontology matching tracks. The results are shown in Table 1.
Conference Track On the real-world conference dataset CODI achieves very good
results since it employs logical reasoning to avoid incoherences. The execution time is
between 2 and 4 minutes per test case2. Table 2 summarizes the overall results.
Anatomy Track The results on the anatomy track are also convincing. The results
shown in Table 3 are on par with the 2009 results of state-of-the-art matching
applications. The F1 scores are between 0.73 and 0.79 for all subtasks, even for the two tasks
Focus on Precision and Focus on Recall. Thus, our algorithm achieves satisfactory
precision and recall values without sacrificing the F1 score. For the last task, where a
partial reference alignment was given, we could gain almost 5 % on the F1 score. This
is because incorporating a partial reference alignment in our system is straightforward.
The reference alignment becomes a direct part of the optimization problem, enforcing
good correspondences while ruling out contradicting ones. However, since our
algorithm uses logical reasoning and has to solve an NP-hard optimization problem, the
execution times are quite high3.
2 All experiments were executed on a desktop PC with 2 GB RAM and an Intel Core2 Duo
2.4 GHz processor.</p>
      <p>3 This forced us to submit the solutions without the SEALS platform because of a timeout
after 45 minutes.</p>
      <p>IIMB Track The instance matching benchmark IIMB consists of 80 transformations
divided in four transformation categories containing 20 transformations each. We
applied the full A-Box matching functionality described above with a threshold on the
a-priori similarity of 0.1. The average execution time on the IIMB small (large) dataset
is 2.6 (35.1) minutes. Table 4 summarizes the different results of the CODI system.
The values without brackets are the results for the small IIMB dataset and the values in
brackets for the large one.</p>
      <p>PR Track For this track, consisting of small files about persons and restaurants, we
used a simple one-to-one alignment based only on lexical similarity scores, since no
significant structural information is available. Thus, the runtime was very short, less than 5
seconds per test case. The results of the CODI system are depicted in Table 5.</p>
    </sec>
    <sec id="sec-4">
      <title>General comments</title>
      <sec id="sec-4-1">
        <title>Discussions on the way to improve the proposed system</title>
        <p>CODI is a very young system and does not yet provide a user interface. Hence,
improvements in usability by designing a suitable user interface will be one of the next
steps. Regarding the quality of the alignments, more sophisticated lexical similarity
measures will be tested and integrated. We are also working on novel algorithms that solve the
optimization problems more efficiently.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Comments on the OAEI 2010 procedure</title>
        <p>The SEALS evaluation campaign is very beneficial since it is the first time that the
matchers must implement a standardized interface, which could possibly be used by everyone.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Comments on the OAEI 2010 measures</title>
        <p>
          We encourage the organizers to use semantic precision and recall measures as described
in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>CODI performs concept, property, and instance alignments. It combines logical and
structural information with a-priori similarity measures in a well-defined way by using
the syntax and semantics of Markov logic. The system therefore not only aligns the
entities with the highest lexical similarity but also enforces the coherence and consistency
of the resulting alignment.</p>
      <p>The overall results of the young system are very promising, especially
considering that there are many optimization possibilities with respect to the lexical
similarity measures that have not yet been investigated. The strength of the CODI
system is the combination of lexical and structural information and the declarative nature
that allows easy experimentation. We will continue the development of the CODI
system and hope that our approach inspires other researchers to leverage terminological
structure for ontology matching.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>I.</given-names>
            <surname>Cruz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Palandri Antonelli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Stroe</surname>
          </string-name>
          .
          <article-title>Efficient selection of mappings and automatic quality-driven combination of matching methods</article-title>
          .
          <source>In Proceedings of the ISWC 2009 Workshop on Ontology Matching</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>P.</given-names>
            <surname>Domingos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lowd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Richardson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Singla</surname>
          </string-name>
          .
          <article-title>Just add weights: Markov logic for the semantic web</article-title>
          .
          <source>In Proceedings of the Workshop on Uncertain Reasoning for the Semantic Web</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>25</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>D.</given-names>
            <surname>Fleischhacker</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Stuckenschmidt</surname>
          </string-name>
          .
          <article-title>A Practical Implementation of Semantic Precision and Recall</article-title>
          . In 2010 International Conference on Complex,
          <source>Intelligent and Software Intensive Systems</source>
          , pages
          <fpage>986</fpage>
          -
          <lpage>991</lpage>
          . IEEE,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>V. I.</given-names>
            <surname>Levenshtein</surname>
          </string-name>
          .
          <article-title>Binary codes capable of correcting deletions, insertions, and reversals</article-title>
          .
          <source>Doklady Akademii Nauk SSSR</source>
          , pages
          <fpage>845</fpage>
          -
          <lpage>848</lpage>
          ,
          <year>1965</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>C.</given-names>
            <surname>Meilicke</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Stuckenschmidt</surname>
          </string-name>
          .
          <article-title>Analyzing mapping extraction approaches</article-title>
          .
          <source>In Proceedings of the Workshop on Ontology Matching</source>
          , Busan, Korea,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>C.</given-names>
            <surname>Meilicke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tamilin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Stuckenschmidt</surname>
          </string-name>
          .
          <article-title>Repairing ontology mappings</article-title>
          .
          <source>In Proceedings of the Conference on Artificial Intelligence</source>
          , pages
          <fpage>1408</fpage>
          -
          <lpage>1413</lpage>
          , Vancouver, Canada,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>S.</given-names>
            <surname>Melnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          <article-title>Similarity flooding: A versatile graph matching algorithm and its application to schema matching</article-title>
          .
          <source>In Proceedings of ICDE</source>
          , pages
          <fpage>117</fpage>
          -
          <lpage>128</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>M.</given-names>
            <surname>Niepert</surname>
          </string-name>
          .
          <article-title>A Delayed Column Generation Strategy for Exact k-Bounded MAP Inference in Markov Logic Networks</article-title>
          .
          <source>In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>M.</given-names>
            <surname>Niepert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Meilicke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Stuckenschmidt</surname>
          </string-name>
          .
          <article-title>A Probabilistic-Logical Framework for Ontology Matching</article-title>
          .
          <source>In Proceedings of the 24th AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>J.</given-names>
            <surname>Noessner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Niepert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Meilicke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Stuckenschmidt</surname>
          </string-name>
          .
          <article-title>Leveraging Terminological Structure for Object Reconciliation</article-title>
          .
          <source>The Semantic Web: Research and Applications</source>
          , pages
          <fpage>334</fpage>
          -
          <lpage>348</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>M.</given-names>
            <surname>Richardson</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Domingos</surname>
          </string-name>
          .
          <article-title>Markov logic networks</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>62</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>107</fpage>
          -
          <lpage>136</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          .
          <article-title>Improving the accuracy and efficiency of map inference for markov logic</article-title>
          .
          <source>In Proceedings of the Conference on Uncertainty in Artificial Intelligence</source>
          , pages
          <fpage>468</fpage>
          -
          <lpage>475</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>