<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Wikipedia Category Ontology: A Framework for Utilization of the Wikipedia Category Structure by Knowledge Engineers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Masaharu Yoshioka</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Takanori Nakagawa</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Advanced Intelligence Project, RIKEN</institution>
          ,
          <addr-line>Nihonbashi 1-chome Mitsui Building, 15th floor, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Information Science and Technology, Hokkaido University</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Global Station for Big Data and Cybersecurity, Global Institution for Collaborative Research and Education, Hokkaido University</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Graduate School of Information Science and Technology, Hokkaido University</institution>
          ,
          <addr-line>N-14 W-9, Kita-ku, Sapporo 060-0814</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Wikipedia categories are intended to group together pages on similar topics and are organized in a hierarchical structure. Because the editorial policy for Wikipedia categories differs from the policies used by knowledge engineers, various types of relationship exist in its category structure. In this paper, we propose a novel framework called “Wikipedia category ontology” (WCO) that aims to act as a basis for interpreting the Wikipedia category structure and is based on a classification of category and relationship types. WCO enables particular Wikipedia category substructures to be extracted, including class-subclass hierarchies, class-instance relationships, and sets of diffused categories for a particular category. WCO is available online in the form of linked open data at http://wcontology.org/.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Wikipedia is a free online encyclopedia that anyone can edit and that contains
a huge number of articles. A characteristic of Wikipedia is that its articles are
organized in a semistructured format, and researchers have aimed to extract
structural information that is suited to the construction of knowledge resources.
Examples include DBpedia [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and YAGO2 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        However, information about the Wikipedia category structure has been
underused. One approach aimed to extract particular types of information using
pattern-based rules [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. Another example is YAGO2, which uses leaf category
information to estimate the class of each page. These approaches do not consider
Wikipedia editorial policy; thus, they exploit the information in the
Wikipedia category structure only in an unsystematic way.
      </p>
      <p>
        We have been working on a project that aims to analyze Wikipedia category
structures based on the definitions of Wikipedia categories in Wikipedia itself [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
In this research, we classify Wikipedia categories as set (representing classes
such as “Cities”), topic (representing instances such as “Japan”), or
set-and-topic (a combination of set and topic such as “Cities in Japan”).
      </p>
      <p>In this paper, we propose a “Wikipedia category ontology” (WCO) based on
our analysis of Wikipedia category types and our exhaustive analysis of Japanese
Wikipedia categories. This ontology aims to act as a basis for interpreting the
Wikipedia category structure when reorganizing (extracting) the Wikipedia
category structure for a particular purpose. In addition, because one of the
reorganization results represents class hierarchy information, WCO can also be used as a
class-hierarchy component of the Wikipedia ontology. This ontology is available
online in the form of linked open data (LOD) at http://wcontology.org/.</p>
      <p>The main contributions of this paper can be summarized as: (1) analyzing
the Japanese Wikipedia category structure to understand its characteristics, (2)
providing a basic vocabulary for representing the Wikipedia category structure,
and (3) providing LOD material that enables the extraction of useful information
from the Japanese Wikipedia category structure.
</p>
      <p>2 WCO</p>
      <p>2.1 Editorial Policy for Wikipedia Categories</p>
      <p>
Because Wikipedia editors edit the category structure based on the editorial
policy described in Wikipedia itself, it is important to understand that policy.
In Wikipedia, the category structure is organized as overlapping “trees” using
subcategory relationships<sup>6</sup>. There are two main types of category. The topic
category refers to an entity (e.g., “Japan”) and the set category refers to a
class (e.g., “Cities”). Sometimes, for convenience, the two types are combined
to create a set-and-topic category (e.g., “Cities in Japan”)<sup>7</sup>. Figure 1 shows
examples of category names, in which brown and green indicate the names of
set and topic categories, respectively. Names with both colors are set-and-topic
categories.</p>
      <p>Another important editing policy relates to the size of the categories. In
Wikipedia, a large category will often be broken down (“diffused”) into smaller,
more-specific subcategories. For example, “Rivers of Europe” is broken down by
country using the subcategory “Rivers of Europe by country” and its subcategories
such as “Rivers of Albania” and “Rivers of Austria.” In most cases, such large
categories are divided into subcategories by applying constraints that select the
subset of pages satisfying them (e.g., instances of “country” (topic categories
such as “Albania” and “Austria”) for “river,” or instances of “artist” (topic categories
such as “The Beatles”) for “song”). The resulting categories are typical examples of
set-and-topic categories.</p>
      <p>As a result, listing the pages that belong to such a large category requires
traversing subcategory links to find the appropriate diffused categories. This is
different from collecting all descendant categories. For example, “Rivers of
Austria” has subcategories such as “Danube” and “Drava,” which in turn have
subcategories such as “Populated places on the Danube” and “Bridges over the Drava,”
whose pages do not belong in the “Rivers of Europe” category.</p>
      <p>6 https://en.wikipedia.org/wiki/Help:Categories</p>
      <p>7 https://en.wikipedia.org/wiki/Wikipedia:Categorization</p>
    </sec>
    <sec id="sec-2">
      <title>2.2 Analysis of the Japanese Wikipedia Category Structure</title>
      <p>
        Most previous work [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] aimed to extract category information using
patterns without considering other issues. Such pattern-only approaches give an
incomplete picture of the Wikipedia category structure as a whole; therefore,
we conducted an exhaustive manual analysis of the categories in the Japanese
Wikipedia as an extension of our previous work [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>We created a database dump of the Japanese Wikipedia on October 20, 2017,
which comprised 212,346 categories. Because some categories referred only to
Wikipedia maintenance issues, we excluded these categories, which left 183,600
categories (and 451,074 parent–child category relationships between them) for
our exhaustive analysis.</p>
      <p>One of our coauthors then exhaustively checked and classified all of these
Wikipedia categories, initially in terms of the three category types: set, topic, and
set-and-topic. However, because there were categories based on a combination of
topics (e.g., “1990s in Japan”), we added this as a fourth type of category and
classified the categories into the following four types (numbers in parentheses
indicate the number of categories for that type).</p>
      <p>Set category (10,748) indicates a class (usually in the plural).</p>
      <p>Topic category (44,525) indicates a topic (usually sharing its name with a Wikipedia article on that topic).</p>
      <p>Constrained set category (117,994) is a diffused version of a set category, with constraints.</p>
      <p>Constrained topic category (10,333) is a diffused version of a topic category, with constraints.</p>
      <p>The most numerous type is the constrained set category, and each category of
this type results from diffusing an ancestor set category. Therefore, it is necessary to
provide a framework for analyzing such diffusion-related categories if the Wikipedia
category structure is to be properly understood.</p>
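      <p>As a quick consistency check on the figures above, the following minimal Python sketch sums the four per-type counts and computes each type's share of the analyzed categories. The dictionary keys and the function name category_shares are illustrative choices introduced here, not part of WCO.</p>
      <preformat>
```python
# Category counts per type, as reported in the text for the
# Japanese Wikipedia dump of October 20, 2017.
CATEGORY_COUNTS = {
    "set": 10_748,
    "topic": 44_525,
    "constrained set": 117_994,
    "constrained topic": 10_333,
}


def category_shares(counts):
    """Return each type's share of the total as a fraction between 0 and 1."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total = sum(CATEGORY_COUNTS.values())
shares = category_shares(CATEGORY_COUNTS)
largest = max(CATEGORY_COUNTS, key=CATEGORY_COUNTS.get)

print("total categories:", total)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print("most numerous type:", largest)
```
      </preformat>
      <p>The four counts add up to the 183,600 categories that remained after the maintenance categories were excluded, and constrained set categories account for roughly 64% of them.</p>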
      <p>It is also necessary to classify subcategory relationships among the Wikipedia
categories. One important issue for this classification is the role of
transitivity in the relationships.</p>
      <p>Fig. 1 shows the types and examples of category relationships. The red lines
indicate transitive relationships. In this example, “Cities” is not interpreted as an
ancestor category of “People from Mitaka, Tokyo,” because the path between them is
not made up solely of transitive relationships. “Geographically part of” and
“Age” are special cases of a “Specified constraint” when used as constraints.
“Narrower” and “Narrower transitive” are used for intransitive and transitive
category relationships, respectively, that are difficult to categorize using other
types.</p>
      <p>[Fig. 1. Types and examples of category relationships. Labels recovered from the figure: relationship types “Subclass of,” “Geographically part of,” “Age,” and “Instance of”; category names “Cities in Japan,” “Port cities and towns in Japan,” “Cities in Tokyo,” “1990s in Japan,” and “People from Mitaka, Tokyo.”]</p>
      <p>3 WCO Resources for the Japanese Wikipedia</p>
      <p>Based on our analysis of the categories in the Japanese Wikipedia corpus, it is
necessary to reorganize the Wikipedia category structure for those knowledge
engineers who would like to extract knowledge by making use of the Wikipedia
structure. To address this problem, we propose WCO, which provides a
reorganization of the Wikipedia category structure by redefining the types of categories
and the relationships between them.</p>
      <p>
        Fig. 2 gives definitions of the WCO core vocabulary, expressed in the Resource
Description Framework (RDF) for linked open data (LOD) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <preformat># Fig. 2: WCO core vocabulary (Turtle)
@prefix wcoc: &lt;http://wcontology.org/core#&gt; .
@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .

# Category types
wcoc:setCategory rdfs:subClassOf wcoc:category .
wcoc:topicCategory rdfs:subClassOf wcoc:category .
wcoc:constrainedSetCategory rdfs:subClassOf wcoc:setCategory .
wcoc:constrainedTopicCategory rdfs:subClassOf wcoc:topicCategory .

# Intransitive relationships
wcoc:narrower rdfs:subPropertyOf skos:narrower .
wcoc:instanceOf rdfs:subPropertyOf wcoc:narrower .
wcoc:usedForConstraint rdfs:subPropertyOf wcoc:narrower .

# Transitive relationships
wcoc:narrowerTransitive rdfs:subPropertyOf wcoc:narrower .
wcoc:narrowerTransitive rdfs:subPropertyOf skos:narrowerTransitive .
wcoc:subclassOf rdfs:subPropertyOf wcoc:narrowerTransitive .
wcoc:age rdfs:subPropertyOf wcoc:narrowerTransitive .
wcoc:geography rdfs:subPropertyOf wcoc:narrowerTransitive .
wcoc:addConstraint rdfs:subPropertyOf wcoc:narrowerTransitive .
wcoc:specifiedConstraint rdfs:subPropertyOf wcoc:narrowerTransitive .</preformat>
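      <p>To illustrate how the transitive relationship types can be read off the core vocabulary, the sketch below copies the direct rdfs:subPropertyOf links among the wcoc: properties into a Python dictionary and collects every property that is wcoc:narrowerTransitive or one of its sub-properties. The dictionary SUB_PROPERTY_OF and the helper is_sub_property are hypothetical names introduced for illustration only; the wcoc: prefix is omitted for brevity, and the links to the skos: properties are left out.</p>
      <preformat>
```python
# Direct rdfs:subPropertyOf links among the wcoc: properties,
# copied from the core vocabulary (prefix omitted).
SUB_PROPERTY_OF = {
    "narrower": [],
    "instanceOf": ["narrower"],
    "usedForConstraint": ["narrower"],
    "narrowerTransitive": ["narrower"],
    "subclassOf": ["narrowerTransitive"],
    "age": ["narrowerTransitive"],
    "geography": ["narrowerTransitive"],
    "addConstraint": ["narrowerTransitive"],
    "specifiedConstraint": ["narrowerTransitive"],
}


def is_sub_property(prop, ancestor):
    """True if prop equals ancestor or reaches it via subPropertyOf links."""
    if prop == ancestor:
        return True
    parents = SUB_PROPERTY_OF.get(prop, [])
    return any(is_sub_property(parent, ancestor) for parent in parents)


# Properties that can be followed transitively when traversing categories.
TRANSITIVE = sorted(
    p for p in SUB_PROPERTY_OF if is_sub_property(p, "narrowerTransitive")
)
print(TRANSITIVE)
```
      </preformat>
      <p>Under this reading, wcoc:instanceOf and wcoc:usedForConstraint remain intransitive, so a traversal that follows only the collected properties stops at those links.</p>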
      <p>Based on the WCO core vocabulary and the results of our analysis, we
construct resources for representing the Japanese Wikipedia category structure.
These resources are connected to the English Wikipedia category structure via
the language links in Wikipedia and DBpedia, using owl:sameAs (the prefix owl
denotes &lt;http://www.w3.org/2002/07/owl#&gt;). These resources are
accessible in LOD form (http://wcontology.org/) through a SPARQL endpoint running on
Virtuoso Open Source Edition 7.2.6<sup>8</sup>.</p>
      <p>8 https://github.com/openlink/virtuoso-opensource</p>
    </sec>
    <sec id="sec-3">
      <p>These resources can be queried through the SPARQL endpoint
http://wcontology.org/sparql. Several example queries are shown on the
WCO home page, http://wcontology.org/. The example queries are given
in both English and Japanese; the original queries were written for
the Japanese version, and the English-version queries were constructed
by following owl:sameAs links to English-language Wikipedia categories. There are
fewer English-language queries than Japanese queries because some
categories have no language link to the English-language Wikipedia.</p>
      <p>The following are examples of queries using WCO.</p>
      <p>– Collection of Diffused Categories. By selecting transitive relationships, we can select the set of categories that
have been diffused from the target category.</p>
      <p>– Collection of Subclasses (setCategory) of a Given Category. By checking the transitive subcategories, we can extract the subclasses
(setCategory) of a given category.</p>
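      <p>The idea behind the first example can be sketched in Python on a toy graph. This is not one of the SPARQL queries from the WCO home page: the edge list and the relationship type assigned to each edge are illustrative assumptions based on the “Rivers of Europe” example, and the function name diffused_categories is hypothetical. The sketch follows only relationship types that are sub-properties of wcoc:narrowerTransitive.</p>
      <preformat>
```python
from collections import deque

# Relationship types treated as transitive (wcoc: prefix omitted).
TRANSITIVE_TYPES = {
    "narrowerTransitive", "subclassOf", "age",
    "geography", "addConstraint", "specifiedConstraint",
}

# Toy graph of (parent, relationship type, child) triples.
# The types on these edges are assumptions made for illustration,
# not values taken from the WCO data.
EDGES = [
    ("Rivers of Europe", "addConstraint", "Rivers of Europe by country"),
    ("Rivers of Europe by country", "geography", "Rivers of Austria"),
    ("Rivers of Europe by country", "geography", "Rivers of Albania"),
    ("Rivers of Austria", "instanceOf", "Danube"),
    ("Danube", "usedForConstraint", "Populated places on the Danube"),
]


def diffused_categories(root, edges, transitive_types=TRANSITIVE_TYPES):
    """Collect categories reachable from root through transitive edges only."""
    children = {}
    for parent, rel, child in edges:
        if rel in transitive_types:
            children.setdefault(parent, []).append(child)

    seen = set()
    queue = deque([root])
    while queue:  # breadth-first traversal
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


result = diffused_categories("Rivers of Europe", EDGES)
print(sorted(result))
```
      </preformat>
      <p>Because the traversal ignores intransitive links, “Danube” and “Populated places on the Danube” are not collected, which matches the distinction drawn earlier between diffused categories and all descendant categories.</p>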
      <sec id="sec-3-1">
        <title>4 Conclusion</title>
        <p>In this paper, we have proposed WCO, a framework that aims to act as a basis
for interpreting the Wikipedia category structure by enabling a classification of
its category types and the types of relationship between them. Resources and
examples are available as LOD at http://wcontology.org/.</p>
        <p>In future work, we plan to use the WCO resources for the Japanese
Wikipedia as training data for constructing equivalent resources for other languages.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Acknowledgment</title>
        <p>This work was partially supported by JSPS KAKENHI Grant Number 18H03338.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kobilarov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Becker</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellmann</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>DBpedia - a crystallization point for the web of data</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>7</volume>
          (
          <year>2009</year>
          )
          <fpage>154</fpage>
          -
          <lpage>165</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Hoffart</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suchanek</surname>
            ,
            <given-names>F.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berberich</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>194</volume>
          (
          <year>2013</year>
          )
          <fpage>28</fpage>
          -
          <lpage>61</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Tamagawa</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sakurai</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tejima</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morita</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Izumi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yamaguchi</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Learning a large scale of ontology from Japanese Wikipedia</article-title>
          .
          <source>Journal of Japanese Society of Artificial Intelligence</source>
          <volume>25</volume>
          (
          <year>2010</year>
          )
          <fpage>623</fpage>
          -
          <lpage>636</lpage>
          (in Japanese).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Heist</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Uncovering the semantics of Wikipedia categories</article-title>
          . In: Ghidini, C., Hartig, O., Maleshkova, M., Svátek, V., Cruz, I., Hogan, A., Song, J., Lefrançois, M., Gandon, F. (eds.)
          <source>The Semantic Web - ISWC 2019</source>
          , Cham, Springer International Publishing (
          <year>2019</year>
          )
          <fpage>219</fpage>
          -
          <lpage>236</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Yoshioka</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>Analysis of Japanese Wikipedia category for constructing Wikipedia ontology and semantic similarity measure</article-title>
          .
          <source>In: Information Retrieval Technology: 10th Asia Information Retrieval Societies Conference, AIRS 2014, Kuching, Malaysia, December 3-5, 2014, Proceedings</source>
          . Springer-Verlag GmbH (
          <year>2014</year>
          )
          <fpage>470</fpage>
          -
          <lpage>481</lpage>
          , LNCS 8870.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Linked data - the story so far</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems</source>
          <volume>5</volume>
          (
          <year>2009</year>
          )
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>