<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic mapping of Wikipedia categories into OpenCyc types</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aleksander Smywiński-Pohl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Krzysztof Wróbel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AGH University of Science and Technology, Faculty of Computer Science, Electronics and Telecommunications</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Jagiellonian University, Faculty of Management and Social Communication</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The research presented in this article aims to map English Wikipedia categories to OpenCyc types. The mapping algorithm is heuristic and takes into account structural similarities between the categories and the corresponding types. Depending on the evaluation scheme, the achieved mapping precision ranges from 82% to 92% and recall from 67% to 76%. The results of the algorithm and its code are available at http://cycloped.io.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Approach</title>
      <p>
        The aim of this research is the automatic mapping of Wikipedia categories into
OpenCyc [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] types. Although the Wikipedia category system is hierarchical in
nature, it is more like a thesaurus than a classification scheme [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], since it lacks a
clearly defined hierarchical structure [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. By mapping the categories into OpenCyc
types we will be able to leverage the well-defined structure of that ontology in
Wikipedia-related information extraction tasks.
      </p>
      <p>
        The automatic mapping of categories is divided into three stages. In the first
stage the categories are pre-processed in order to filter out the uninteresting
categories. In the second stage a set of candidate mappings is generated for each
category, and in the last stage disambiguation is performed by comparing
the context of the category with the contexts of the candidate types. As such,
the approach is similar to the method employed in YAGO for mapping categories into
WordNet synsets [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
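      <p>The three stages can be sketched roughly as follows. This is only an illustrative outline under stated assumptions, not the implementation published at http://cycloped.io: the administrative markers, the word-based label lookup, the type names and the function names are all hypothetical, and the stage 3 comparison is passed in as a plain scoring function.</p>
      <preformat>
```python
from typing import Callable, Dict, List, Optional

# Hypothetical markers of administrative categories; the paper does not list its filters.
ADMIN_MARKERS = ("stubs", "articles", "templates", "wikipedia")


def is_interesting(category: str) -> bool:
    """Stage 1: drop categories that look administrative (assumed heuristic)."""
    name = category.lower()
    return not any(marker in name for marker in ADMIN_MARKERS)


def candidate_types(category: str, label_index: Dict[str, List[str]]) -> List[str]:
    """Stage 2: collect candidate types by looking up each word of the category name.

    label_index is a hypothetical map from a lower-cased label to type names.
    """
    candidates: List[str] = []
    for word in category.lower().split():
        for type_name in label_index.get(word, []):
            if type_name not in candidates:
                candidates.append(type_name)
    return candidates


def map_category(
    category: str,
    label_index: Dict[str, List[str]],
    score: Callable[[str, str], float],
) -> Optional[str]:
    """Run all three stages; score(category, type) stands in for the stage 3 comparison."""
    if not is_interesting(category):
        return None
    candidates = candidate_types(category, label_index)
    if not candidates:
        return None
    # Stage 3: keep the candidate whose context agrees best with the category.
    return max(candidates, key=lambda type_name: score(category, type_name))


# Toy data: the type names below are made up for illustration.
toy_index = {"banks": ["BankOrganization", "RiverBank"]}
toy_score = lambda category, type_name: 1.0 if type_name == "BankOrganization" else 0.0
print(map_category("Banks of Poland", toy_index, toy_score))
print(map_category("Poland stubs", toy_index, toy_score))
```
      </preformat>
      <p>In this toy run the first category is mapped to the made-up type BankOrganization, while the second is filtered out in stage 1 and receives no mapping.</p>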
      <p>The disambiguation is based on structural similarities between the OpenCyc
ontology and the Wikipedia category system treated as a taxonomy. The primary
means of structuring Wikipedia is the inclusion relation, which holds between
categories and articles as well as between categories themselves. In the first case, if the
article represents an entity, its inclusion in a category may be approximated
by the instantiation relation, while in the second case the inclusion of one category in another may
be approximated by the specialization relation. Instantiation and specialization are
strictly defined in OpenCyc and are the primary means of structuring its
contents. Checking whether the inclusion of articles and categories in the category being
mapped has corresponding instantiation and specialization assertions
in OpenCyc provides evidence for the validity of a given candidate mapping.</p>
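      <p>A minimal sketch of this evidence check is given below. It assumes that member articles have already been linked to ontology entities and that subcategories have already been mapped to types; the unweighted count, the function names and the toy assertions are illustrative assumptions and are not taken from the published code.</p>
      <preformat>
```python
from typing import Dict, List, Optional, Set


def evidence_score(
    candidate: str,
    member_entities: Set[str],
    subcategory_types: Set[str],
    instances_of: Dict[str, Set[str]],
    specializations_of: Dict[str, Set[str]],
) -> int:
    """Count inclusions in the category that are mirrored by ontology assertions.

    instances_of[t] holds entities asserted to be instances of type t.
    specializations_of[t] holds types asserted to be specializations of t.
    Both maps are toy stand-ins for queries against the ontology.
    """
    instance_hits = member_entities.intersection(instances_of.get(candidate, set()))
    subtype_hits = subcategory_types.intersection(specializations_of.get(candidate, set()))
    return len(instance_hits) + len(subtype_hits)


def best_candidate(
    candidates: List[str],
    member_entities: Set[str],
    subcategory_types: Set[str],
    instances_of: Dict[str, Set[str]],
    specializations_of: Dict[str, Set[str]],
) -> Optional[str]:
    """Return the candidate with the most supporting assertions, or None without support."""
    best, best_score = None, 0
    for candidate in candidates:
        score = evidence_score(
            candidate, member_entities, subcategory_types, instances_of, specializations_of
        )
        if score > best_score:
            best, best_score = candidate, score
    return best


# Toy data: entity and type names are made up for illustration.
toy_instances = {"BankOrganization": {"PKO", "Pekao"}, "RiverBank": set()}
toy_specializations = {"BankOrganization": {"CentralBank"}}
print(best_candidate(
    ["BankOrganization", "RiverBank"],
    {"PKO", "Pekao"},
    {"CentralBank"},
    toy_instances,
    toy_specializations,
))
```
      </preformat>
      <p>Returning no mapping when no assertion supports any candidate is a design choice of this sketch: it trades recall for precision.</p>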
    </sec>
    <sec id="sec-2">
      <title>Results</title>
      <p>Out of 616 thousand categories with plural noun heads, we were able to assign
a corresponding type to 484 thousand categories (78.6%). We manually
validated 600 mappings in order to assess the quality of the category mapping
algorithm. We assumed that there is at most one valid OpenCyc type for each
Wikipedia category. We did not assign any type if the category was
ambiguous or should have been filtered out as administrative. When the algorithm assigned
a type to such a category, the assignment was treated as a false positive. For the other
categories we either accepted the mapping provided by the algorithm or
manually assigned the correct mapping when the algorithm’s decision
was invalid.</p>
      <p>We measured the performance of the algorithm using the standard information
retrieval measures of precision and recall, employing two evaluation scenarios.
In the first scenario, strict equivalence between the results obtained by the algorithm
and the reference mapping was required. In the second, we extended
the set of true positives by including results that were either specializations or
generalizations of the terms defined in the reference set. In the first scenario we
obtained 82.5% precision, 67.5% recall and 74.2% F1, and in the second we
obtained 92.9% precision, 76.1% recall and 83.6% F1.</p>
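      <p>The two evaluation scenarios can be illustrated with a small sketch. It assumes that both mappings are dictionaries from category to type, with None meaning that no type is assigned, and that the pairs accepted in the relaxed scenario are supplied explicitly; the function name and the toy data are hypothetical.</p>
      <preformat>
```python
from typing import AbstractSet, Dict, Optional, Tuple


def evaluate(
    predicted: Dict[str, Optional[str]],
    reference: Dict[str, Optional[str]],
    related: AbstractSet[Tuple[str, str]] = frozenset(),
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for category-to-type mappings.

    related holds (predicted type, reference type) pairs that count as correct
    in the relaxed scenario, i.e. generalizations or specializations.
    """
    true_pos = 0
    for category, ref_type in reference.items():
        pred_type = predicted.get(category)
        if pred_type is None or ref_type is None:
            continue
        if pred_type == ref_type or (pred_type, ref_type) in related:
            true_pos += 1
    # Every assigned type counts towards precision, so a type assigned to a
    # category without a reference type is a false positive.
    n_predicted = sum(1 for t in predicted.values() if t is not None)
    n_reference = sum(1 for t in reference.values() if t is not None)
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_reference if n_reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_pos else 0.0
    return precision, recall, f1


# Toy data: category and type names are made up for illustration.
reference = {"A": "T1", "B": "T2", "C": None, "D": "T3"}
predicted = {"A": "T1", "B": "T2sub", "C": "T9", "D": None}
strict = evaluate(predicted, reference)
relaxed = evaluate(predicted, reference, {("T2sub", "T2")})
print(strict, relaxed)
```
      </preformat>
      <p>F1 is the harmonic mean 2PR/(P+R). For the strict scenario reported above, 2 × 82.5 × 67.5 / (82.5 + 67.5) gives about 74.2%.</p>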
      <p>
        The results of the algorithm and the source code are available at
http://cycloped.io. We plan to extend the mapping and classification to other natural
languages, as well as to extend the OpenCyc taxonomy automatically. Although the
results of the automatic mapping are worse than the manually established
correspondences from our past efforts [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the achieved coverage is much better. Moreover,
the algorithm allows new mappings to be provided as Wikipedia grows, making
it very useful for converting Wikipedia into a computable knowledge base.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Lenat</surname>
            ,
            <given-names>D.B.</given-names>
          </string-name>
          :
          <article-title>CYC: A large-scale investment in knowledge infrastructure</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>38</volume>
          (
          <issue>11</issue>
          ),
          <fpage>33</fpage>
          -
          <lpage>38</lpage>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Pohl</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Classifying the Wikipedia Articles into the OpenCyc Taxonomy</article-title>
          . In:
          <string-name>
            <surname>Rizzo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mendes</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Charton</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalyanpur</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (eds.)
          <source>Proceedings of the Web of Linked Entities Workshop in conjunction with the 11th International Semantic Web Conference</source>
          . pp.
          <fpage>5</fpage>
          -
          <lpage>16</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Suchanek</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kasneci</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>YAGO: a core of semantic knowledge</article-title>
          . In:
          <string-name>
            <surname>Williamson</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zurko</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patel-Schneider</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shenoy</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (eds.)
          <source>Proceedings of the 16th international conference on World Wide Web</source>
          . pp.
          <fpage>697</fpage>
          -
          <lpage>706</lpage>
          . ACM (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Suchecki</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salah</surname>
            ,
            <given-names>A.A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scharnhorst</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Evolution of Wikipedia's Category Structure</article-title>
          .
          <source>Advances in Complex Systems</source>
          <volume>15</volume>
          (
          <issue>supp01</issue>
          ) (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Voss</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Collaborative thesaurus tagging the Wikipedia way</article-title>
          .
          <source>arXiv preprint cs/0604036</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>