<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ontological Quality Control in Large-scale, Applied Ontology Matching</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Catherine Legg</string-name>
          <email>clegg@waikato.ac.nz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samuel Sarjant</string-name>
          <email>sarjant@waikato.ac.nz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The University of Waikato</institution>
          ,
          <country country="NZ">New Zealand</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>To date, large-scale applied ontology mapping has relied greatly on label matching and other relatively simple syntactic features. In search of more holistic and accurate alignment, we offer a suite of partially overlapping ontology mapping heuristics which allows us to hypothesise matches and test them against the knowledge in our source ontology (OpenCyc). We thereby automatically align our source ontology with 55K concepts from Wikipedia with 93% accuracy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        We have developed a method of specifically ontological quality control in
ontology mapping which combines a suite of partially overlapping mapping
heuristics with common-sense knowledge in OpenCyc. Our approach differs from
previous, largely label-matching approaches
        <xref ref-type="bibr" rid="ref2 ref5">(Suchanek et al., 2008; Ponzetto and
Navigli, 2009)</xref>
        in its use of knowledge, and from previous knowledge-based
approaches
        <xref ref-type="bibr" rid="ref3 ref4">(Shvaiko and Euzenat, 2005; Sabou et al., 2006)</xref>
        in treating potential
matches as hypotheses and testing them more iteratively and open-endedly than
has previously been done.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Iterative Mapping Process</title>
      <p>Concept to Wikipedia article mapping is governed by a priority queue which
iteratively evaluates potential mappings ordered via continuously updated
weightings. The process begins with concept-to-article mappings (Table 1), then
verifies these using article-to-concept heuristics. The weight of each potential
mapping is equal to the product of weights produced by the two sets of heuristics.
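</p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr>
              <th>Direction</th>
              <th>Heuristic</th>
              <th>Example</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Concept → Article</td>
              <td>TITLE MATCHING</td>
              <td>Batman-TheComicStrip → {Batman (comic strip):1.0}</td>
            </tr>
            <tr>
              <td>Concept → Article</td>
              <td>SYNONYM MATCHING</td>
              <td>ComputerWorm → {Worm:1.0, Computer worm:0.39, ... (+5 more)}</td>
            </tr>
            <tr>
              <td>Concept → Article</td>
              <td>CONTEXT-RELATED SYNONYM MATCHING</td>
              <td>ComputerWorm → {Computer worm:1.0, Worm:0.59, ... (+4 more)}</td>
            </tr>
            <tr>
              <td>Article → Concept</td>
              <td>TITLE MATCHING</td>
              <td />
            </tr>
            <tr>
              <td>Article → Concept</td>
              <td>LABEL MATCHING</td>
              <td />
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>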
representing appropriate classes. The mapping weight is multiplied by the
proportion of assertions not rejected using OpenCyc’s disjointness knowledge.</p>
      <p>Example 1: “Bill Laswell is an American [[bassist]], [[record
producer|producer]] and [[record label]] owner.” Only three of the four
assertions in this sentence are kept: BillLaswell is a UnitedStatesPerson,
BassGuitarist, and Producer. BillLaswell cannot be a RecordCompany
because OpenCyc knows a person cannot be a company.</p>
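      <p>The disjointness test in Example 1 can be sketched as follows. This is a minimal illustration only: the toy taxonomy, the single disjointness pair, and the function names (ancestors, are_disjoint, filter_assertions) are assumptions made for the example and are not taken from OpenCyc or from our implementation. The sketch also assumes that assertions are checked in sentence order, each against those already kept.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical toy knowledge; OpenCyc's real disjointness knowledge is far larger.
DISJOINT = {frozenset({"Person", "Organization"})}

# Hypothetical toy taxonomy: each class lists its direct superclasses.
SUPERCLASSES = {
    "UnitedStatesPerson": ["Person"],
    "BassGuitarist": ["Person"],
    "Producer": ["Person"],
    "RecordCompany": ["Organization"],
}


def ancestors(cls):
    """Return cls together with all of its superclasses in the toy taxonomy."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUPERCLASSES.get(current, []))
    return seen


def are_disjoint(a, b):
    """True if any ancestor of a is declared disjoint with any ancestor of b."""
    return any(
        frozenset({x, y}) in DISJOINT for x in ancestors(a) for y in ancestors(b)
    )


def filter_assertions(candidate_classes):
    """Keep each class, in order, unless it is disjoint with one already kept.

    Returns the kept classes and the proportion of assertions not rejected.
    """
    kept = []
    for cls in candidate_classes:
        if not any(are_disjoint(cls, k) for k in kept):
            kept.append(cls)
    return kept, len(kept) / len(candidate_classes)


# The four assertions extracted from the Bill Laswell sentence.
kept, proportion = filter_assertions(
    ["UnitedStatesPerson", "BassGuitarist", "Producer", "RecordCompany"]
)
adjusted_weight = 1.0 * proportion  # a starting weight of 1.0 is assumed here
```
]]></preformat>
      <p>With these toy inputs, RecordCompany is rejected and the proportion of surviving assertions is 3/4, so a starting weight of 1.0 would be scaled to 0.75.</p>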
      <p>Example 2: The concept Basketball-Ball initially yields the candidate
articles {Basketball:1.0, Basketball (ball):0.95, College basketball:0.02}. The second
candidate is the correct one, as the first refers to the team sport. The algorithm
attempts to map its first choice, the article Basketball, back to the concept
Basketball-Ball. This succeeds, but also creates a new potential reverse mapping
from the article Basketball to the concept Basketball. Consistency checking now tests
“Basketball-Ball is a TeamSport”, which fails, removing this potential mapping. The
next-highest reverse mapping, article Basketball → concept Basketball, is found to be
consistent, so that mapping is recorded. The process then backtracks to the
second-best option from the original list, Basketball (ball):0.95, which also
reverse-maps successfully and is consistent, creating a new (correct) mapping. It is
worth emphasising how similar the two ‘basketball concepts’ are by standard
semantic relatedness measures, and thus how subtle a distinction our method is able to draw.</p>
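      <p>The queue-driven process walked through in Example 2 can be sketched as below. This is a simplified illustration under stated assumptions, not our implementation: weights are computed once rather than continuously updated, mappings are assumed to be one-to-one, and the function run_mapping, the is_consistent callback, and all reverse (article-to-concept) weights are hypothetical values chosen so that the trace follows the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import heapq


def run_mapping(concept_to_article, article_to_concept, is_consistent):
    """Test hypothesised mappings in descending order of combined weight."""
    queue = []
    for concept, candidates in concept_to_article.items():
        for article, forward in candidates.items():
            reverse = article_to_concept.get(article, {}).get(concept, 0.0)
            weight = forward * reverse  # product of the two heuristic weights
            if weight > 0:
                # heapq is a min-heap, so negate the weight to pop the best first.
                heapq.heappush(queue, (-weight, concept, article))

    mappings, used_articles, trace = {}, set(), []
    while queue:
        _, concept, article = heapq.heappop(queue)
        if concept in mappings or article in used_articles:
            continue  # one-to-one mapping is an assumption of this sketch
        if is_consistent(concept, article):
            mappings[concept] = article
            used_articles.add(article)
            trace.append(("accepted", concept, article))
        else:
            # Rejection is the "backtracking" step: the next-best
            # hypothesis simply comes off the queue.
            trace.append(("rejected", concept, article))
    return mappings, trace


# Forward weights for Basketball-Ball are from Example 2; the rest are hypothetical.
concept_to_article = {
    "Basketball-Ball": {
        "Basketball": 1.0,
        "Basketball (ball)": 0.95,
        "College basketball": 0.02,
    },
    "Basketball": {"Basketball": 1.0},
}
article_to_concept = {
    "Basketball": {"Basketball-Ball": 1.0, "Basketball": 0.9},
    "Basketball (ball)": {"Basketball-Ball": 0.9},
}


def is_consistent(concept, article):
    """Hypothetical stand-in for the OpenCyc check: a ball is not a TeamSport."""
    team_sport_articles = {"Basketball"}
    ball_concepts = {"Basketball-Ball"}
    return not (concept in ball_concepts and article in team_sport_articles)


mappings, trace = run_mapping(concept_to_article, article_to_concept, is_consistent)
```
]]></preformat>
      <p>With these inputs the first hypothesis (Basketball-Ball to the article Basketball) is rejected, the article Basketball is then mapped to the concept Basketball, and finally Basketball-Ball is mapped to Basketball (ball).</p>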
    </sec>
    <sec id="sec-3">
      <title>3. Results and Conclusions</title>
      <p>The algorithm identified 54,987 mappings of OpenCyc concepts to Wikipedia
articles. In a manual analysis of a random sample of 300 mappings, 266 were judged
‘True’ (88.5%), 21 ‘False’ (7%) and 13 (4.3%) were assigned ‘B’ for ‘Broader
term’ (the mapping was largely correct but one side generalised the other). Thus 93%
of our mappings were either ‘True’ or highly related. Although YAGO reports 95%
accuracy, what is rated there is not the mapping joins between WordNet and
Wikipedia, but the truth of assertions in infoboxes. Although our efforts so far lack
the scale of projects such as YAGO, we suggest they have a role to play in
long-term development towards maximum accuracy in this field. We offer our results at:
http://bit.ly/10MlLjl.</p>
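      <p>The 93% figure combines the ‘True’ and ‘Broader term’ judgements. A short sketch of that arithmetic, using only the counts reported above (the variable names are illustrative):</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Judgement counts from the manually analysed sample.
counts = {"True": 266, "False": 21, "Broader term": 13}

sample_size = sum(counts.values())
acceptable = counts["True"] + counts["Broader term"]
accuracy = acceptable / sample_size

print(f"{acceptable}/{sample_size} = {accuracy:.0%}")  # prints "279/300 = 93%"
```
]]></preformat>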
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Euzenat</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Shvaiko</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <source>Ontology Matching</source>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Ponzetto</surname>
            ,
            <given-names>S.P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Navigli</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Large-Scale Taxonomy Mapping for Restructuring and Integrating Wikipedia</article-title>
          ,
          <source>IJCAI</source>
          <year>2009</year>
          , Pasadena, California, pp.
          <fpage>2083</fpage>
          -
          <lpage>2088</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Sabou</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>D'Aquin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motta</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>Using the Semantic Web as Background Knowledge for Ontology Mapping</article-title>
          , OM-2006, Athens, GA, USA.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Shvaiko</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Euzenat</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>A Survey of Schema-based Matching Approaches</article-title>
          .
          <source>Journal on Data Semantics</source>
          <volume>4</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Suchanek</surname>
            ,
            <given-names>F. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kasneci</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Yago: A Large Ontology from Wikipedia and WordNet</article-title>
          .
          <source>Elsevier Journal of Web Semantics</source>
          <volume>6</volume>
          (
          <issue>3</issue>
          ),
          <fpage>203</fpage>
          -
          <lpage>217</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>