<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards the Natural Ontology of Wikipedia</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Giovanni Nuzzolese</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aldo Gangemi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valentina Presutti</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Ciancarini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Computer Science and Engineering, University of Bologna</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LIPN, University Paris 13</institution>
          ,
          <addr-line>Sorbonne Cité, UMR CNRS</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>STLab-ISTC, National Research Council</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we present preliminary results on the extraction of ORA, the Natural Ontology of Wikipedia. ORA is obtained through an automatic process that analyses the natural language definitions of DBpedia entities provided by their Wikipedia pages. Hence, this ontology reflects the richness of the terms used and agreed upon by the crowds, and it can be updated periodically as Wikipedia evolves.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Related work. The DBpedia project [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and YAGO [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] are the most relevant approaches to
generating an ontology from semi-structured information in Wikipedia.
DBpedia provides an ontology extracted from Wikipedia infoboxes, based on
hand-generated mappings of infoboxes to the DBpedia ontology (DBPO,
http://dbpedia.org/ontology). DBPO counts 359 concepts (version 3.8), but only
2.3M of the more than 4M entities are classified with respect to this ontology.
YAGO types are extracted from Wikipedia categories and aligned to a subset of
WordNet. The YAGO ontology is larger than DBPO and counts 290K concepts. YAGO
also has a larger, although still incomplete, coverage of DBpedia entities
(2.7M typed entities). ORA (the Italian translation of NOW; SPARQL endpoint:
http://isotta.cs.unibo.it:8080/sparql, select the graph now) introduces a
third dimension, the terminology of the crowds; furthermore, it provides a
larger coverage (currently 3.0M typed entities). Recently, the Schema.org
initiative (http://schema.org) has provided alignments to DBPO. However, this
effort does not add value with respect to the intensional and extensional
coverage issues. Other work relevant to our method includes Ontology Learning and
Population (OL&amp;P) techniques [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Typically, OL&amp;P is implemented on top of
machine learning methods, hence it requires large corpora, sometimes manually
annotated, in order to induce a set of probabilistic rules. Such rules are defined
through a training phase that can take a long time. The method used for ORA
and implemented by Tìpalo [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] differs from existing approaches in that it is mainly
rule-based, hence it does not require a training phase and is faster than the
other approaches.
      </p>
      <p>
        Automatic extraction of an ontology for Wikipedia: materials and methods.
Tìpalo is implemented as a pipeline of components and data sources. Each
component in the pipeline implements a step of the computation: (i) extraction of
an entity's natural language definition from its Wikipedia abstract; (ii) natural
language deep parsing (provided by FRED [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]), whose output is an RDF/OWL
representation of the entity definition; (iii) selection of candidate types (based
on graph-pattern heuristics applied to the FRED output); (iv) word-sense
disambiguation of candidate types; and (v) type alignment to OntoWordNet [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
WordNet supersenses, and a subset of DOLCE+DnS Ultralite. We refer to
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for details about the design and implementation of Tìpalo. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] we
evaluated Tìpalo by extracting the types for a sample of 627 resources, while in
this work we extract the ontology of Wikipedia by running Tìpalo on
3,769,926 DBpedia entities taken from the dbpedia_long_abstracts_en dataset
of DBpedia, which includes only entities that have a Wikipedia abstract: this is the
main constraint for applying our method.
      </p>
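      <p>The five steps listed above can be pictured as a chain of functions that each enrich a shared record for one entity. The following is only a minimal sketch under stated assumptions, not the Tìpalo implementation: the function names, the dictionary-based record, and the "first sentence of the abstract" rule for step (i) are all hypothetical, and steps (ii) to (v) are placeholders that merely record that they ran.</p>
      <preformat>
```python
from functools import reduce


def extract_definition(state):
    """Step (i), simplified assumption: the first sentence is the definition."""
    abstract = state["abstract"].strip()
    end = abstract.find(". ")
    definition = abstract if end == -1 else abstract[: end + 1]
    return {**state, "definition": definition}


def placeholder(step_name):
    """Build a stand-in for a step this sketch does not implement."""
    def step(state):
        # A real component would add its output here; we only log that it ran.
        return {**state, "trace": state.get("trace", []) + [step_name]}
    return step


# Hypothetical step names mirroring (i)-(v) in the text.
PIPELINE = [
    extract_definition,                        # (i)
    placeholder("deep_parsing"),               # (ii) FRED in the paper
    placeholder("candidate_type_selection"),   # (iii)
    placeholder("word_sense_disambiguation"),  # (iv)
    placeholder("type_alignment"),             # (v)
]


def run_pipeline(entity_uri, abstract, steps=PIPELINE):
    """Thread one entity record through every step, in order."""
    initial = {"entity": entity_uri, "abstract": abstract}
    return reduce(lambda state, step: step(state), steps, initial)


result = run_pipeline(
    "dbpedia:Book_of_Revelation",
    "The Book of Revelation is the last canonical book of the New Testament "
    "in the Christian Bible. This second sentence is made up for the demo.",
)
print(result["definition"])
print(result["trace"])
```
</preformat>
      <p>Keeping each step as a function from record to record means a placeholder can later be swapped for a real component without touching the others.</p>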
      <p>The Natural Ontology of Wikipedia (ORA): results and discussion.
The process described above was run on a Mac Pro with a quad-core Intel Xeon
at 2.8 GHz and 10 GB of RAM, and took 15 days (a time that can easily be reduced by
parallelizing the activity on a cluster of machines with similar or more powerful
characteristics). The process resulted in 3,023,890 typed entities and associated
taxonomies of types. Most of the missing results are due to the lack of matching
Tìpalo heuristics, which means that improving Tìpalo will improve
coverage (this is part of our current work). The resulting ontology includes 585,474
distinct classes organized in a taxonomy with 396,375 rdfs:subClassOf
axioms; 25,480 of these classes are aligned through owl:equivalentClass axioms
to 20,662 OntoWordNet synsets by means of a word-sense disambiguation
process. The difference between the number of disambiguated classes (25,480) and
the number of identified synsets (20,662) means that there are at least 4,818
synonym classes in the ontology. We expect the number of actual synonyms to
be greater. Hence, we plan to investigate sense-similarity-based
metrics in order to reduce the number of distinct classes in the ontology by
merging synonyms, or at least by providing explicit similarity relations with confidence
scores between classes.</p>
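      <p>The figures in this section can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are ours, and the coverage ratio is computed against the 3,769,926 entities given as input to the process, which is our reading of the reported numbers rather than a figure stated in the text.</p>
      <preformat>
```python
# Figures reported in the text.
ENTITIES_PROCESSED = 3_769_926  # DBpedia entities with a Wikipedia abstract
ENTITIES_TYPED = 3_023_890      # entities that received a type
ALIGNED_CLASSES = 25_480        # classes with an owl:equivalentClass axiom
DISTINCT_SYNSETS = 20_662       # OntoWordNet synsets those classes point to


def coverage(typed, total):
    """Fraction of processed entities that ended up typed."""
    return typed / total


def min_synonym_classes(aligned_classes, distinct_synsets):
    """Lower bound on redundant classes.

    One class per synset would suffice, so every aligned class beyond the
    number of distinct synsets shares its synset with some other class.
    """
    return aligned_classes - distinct_synsets


untyped = ENTITIES_PROCESSED - ENTITIES_TYPED
print(f"coverage of processed entities: {coverage(ENTITIES_TYPED, ENTITIES_PROCESSED):.1%}")
print("entities left untyped:", untyped)
print("synonym classes (lower bound):", min_synonym_classes(ALIGNED_CLASSES, DISTINCT_SYNSETS))
```
</preformat>
      <p>The subtraction gives only a lower bound because two classes aligned to different synsets may still be synonyms in practice, which is why the actual number is expected to be greater.</p>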
      <p>To prevent the polysemy that would arise from merging classes that share a name
but are aligned to different synsets, we adopted a uniqueness criterion
for generating the URIs of these classes. For example, consider
the entity dbpedia:The_Marriage_of_Heaven_and_Hell, whose definition is "The Marriage of
Heaven and Hell is one of William Blake's books." For this entity Tìpalo
generates the following RDF:</p>
      <preformat>dbpedia:The_Marriage_of_Heaven_and_Hell
    a fred:Book .
fred:Book
    owl:equivalentClass wn30-instance:synset-book-noun-2 .</preformat>
      <p>Similarly, consider the entity dbpedia:Book_of_Revelation, whose definition is
"The Book of Revelation is the last canonical book of the New Testament in the
Christian Bible." For this entity Tìpalo generates the following RDF:</p>
      <preformat>dbpedia:Book_of_Revelation
    a fred:CanonicalBook .
fred:CanonicalBook
    rdfs:subClassOf fred:Book .
fred:Book
    owl:equivalentClass wn30-instance:synset-book-noun-10 .</preformat>
      <p>The two fred:Book classes refer to two distinct concepts. Hence, they cannot
be merged during the generation of the ontology. We solve this by appending
the ID of the closest synset in the taxonomy to the URI of each newly generated
class: this approach prevents polysemy and, at the same time, makes synonymy
identifiable. Finally, all the classes aligned to OntoWordNet have also been
aligned to WordNet supersenses and to a subset of DOLCE+DnS Ultra Lite classes
by means of rdfs:subClassOf axioms. The following example shows a sample of
the ontology derived by typing the two entities used as examples
above:</p>
      <preformat>dbpedia:The_Marriage_of_Heaven_and_Hell
    a fred:Book_102870092 .
dbpedia:Book_of_Revelation
    a fred:CanonicalBook_106394865 .
fred:CanonicalBook_106394865
    rdfs:subClassOf fred:Book_106394865 ;
    rdfs:label "Canonical Book"@en-US .
fred:Book_102870092
    owl:equivalentClass wn30-instance:synset-book-noun-2 ;
    rdfs:label "Book"@en-US .
fred:Book_106394865
    owl:equivalentClass wn30-instance:synset-book-noun-10 ;
    rdfs:subClassOf wn30-instance:supersense-noun_communication ,
        d0:InformationEntity ;
    rdfs:label "Book"@en-US .</preformat>
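      <p>The URI uniqueness criterion described above can be sketched as a walk up the taxonomy followed by a suffixing step. This is a hedged illustration, not the authors' code: the function names are hypothetical, each class is assumed to have at most one superclass within an entity's graph, and classes with no aligned ancestor are left unchanged, a case the text does not discuss. The numeric synset IDs are the ones shown in the sample above.</p>
      <preformat>
```python
def closest_synset(cls, superclass_of, synset_of):
    """Walk up rdfs:subClassOf links until a class aligned to a synset is found."""
    seen = set()
    while cls is not None and cls not in seen:
        if cls in synset_of:
            return synset_of[cls]
        seen.add(cls)  # guards against accidental cycles in the input
        cls = superclass_of.get(cls)
    return None


def mint_unique_uris(classes, superclass_of, synset_of):
    """Map each class name to a URI suffixed with its closest synset ID."""
    minted = {}
    for cls in classes:
        synset_id = closest_synset(cls, superclass_of, synset_of)
        # Assumption: a class with no aligned ancestor keeps its original name.
        minted[cls] = cls if synset_id is None else f"{cls}_{synset_id}"
    return minted


# One call per entity graph, mirroring the two examples in the text.
blake = mint_unique_uris(
    ["fred:Book"],
    {},
    {"fred:Book": "102870092"},
)
revelation = mint_unique_uris(
    ["fred:CanonicalBook", "fred:Book"],
    {"fred:CanonicalBook": "fred:Book"},
    {"fred:Book": "106394865"},
)
print(blake)
print(revelation)
```
</preformat>
      <p>Because the suffix is derived from the synset rather than from the entity, two classes minted from different entities receive the same URI exactly when they share a name and a closest synset, which is what allows them to be merged safely.</p>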
      <p>
        Conclusion. The main result of this work is the Natural Ontology of Wikipedia
(ORA): an ontology that reflects the richness of the terms used and agreed upon by the
crowds for defining entities in Wikipedia. All produced datasets are available for
download. We claim that this ontology is an important resource that
can be used as an alternative or a complement to YAGO and DBPO, and that it
can enable more accurate usage of DBpedia in Semantic Web applications
such as mash-up tools, recommendation systems, and exploratory search tools
(see for example Aemoo [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). We are currently refining ORA and
aligning it to DBPO and YAGO. The datasets are hosted at
http://stlab.istc.cnr.it/stlab/ORA
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          .
          <source>Ontology Learning and Population from Text: Algorithms, Evaluation and Applications</source>
          . Springer,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Gangemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Velardi</surname>
          </string-name>
          .
          <article-title>The OntoWordNet Project: extension and axiomatization of conceptual relations in WordNet</article-title>
          . In WordNet, Meersman, pages
          <fpage>3</fpage>
          –
          <lpage>7</lpage>
          . Springer,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>A.</given-names>
            <surname>Gangemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Nuzzolese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Presutti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Draicchio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Musetti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Ciancarini</surname>
          </string-name>
          .
          <article-title>Automatic Typing of DBpedia Entities</article-title>
          . In
          <source>International Semantic Web Conference (1)</source>
          , volume
          <volume>7649</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>65</fpage>
          –
          <lpage>81</lpage>
          . Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kobilarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          .
          <article-title>DBpedia - A Crystallization Point for the Web of Data</article-title>
          .
          <source>Journal of Web Semantics</source>
          ,
          <volume>7</volume>
          (
          <issue>3</issue>
          ):
          <fpage>154</fpage>
          –
          <lpage>165</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Nuzzolese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Presutti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gangemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Musetti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Ciancarini</surname>
          </string-name>
          .
          <article-title>Aemoo: Exploring knowledge on the web</article-title>
          . In
          <source>Proceedings of the 5th Annual ACM Web Science Conference</source>
          , pages
          <fpage>272</fpage>
          –
          <lpage>275</lpage>
          . ACM,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>V.</given-names>
            <surname>Presutti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Draicchio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Gangemi</surname>
          </string-name>
          .
          <article-title>Knowledge extraction based on discourse representation theory and linguistic frames</article-title>
          . In
          <source>Knowledge Engineering and Knowledge Management (EKAW 2012)</source>
          , pages
          <fpage>114</fpage>
          –
          <lpage>129</lpage>
          . Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kasneci</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>Yago: A Core of Semantic Knowledge</article-title>
          . In
          <source>16th International World Wide Web Conference (WWW 2007)</source>
          , pages
          <fpage>697</fpage>
          –
          <lpage>706</lpage>
          , New York, NY, USA,
          <year>2007</year>
          . ACM Press.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>