
Towards the Natural Ontology of Wikipedia

Andrea Giovanni Nuzzolese (0, 2)
Aldo Gangemi (1, 2)
Valentina Presutti
Paolo Ciancarini (0, 2)

0 Dept. of Computer Science and Engineering, University of Bologna, Italy
1 LIPN, University Paris 13, Sorbonne Cité, UMR CNRS, France
2 STLab-ISTC, National Research Council, Rome, Italy

In this paper we present preliminary results on the extraction of ORA: the Natural Ontology of Wikipedia. ORA (footnote 4) is obtained through an automatic process that analyses the natural language definitions of DBpedia entities provided by their Wikipedia pages. Hence, this ontology reflects the richness of the terms used and agreed upon by the crowds, and can be updated periodically as Wikipedia evolves.

2 Related work

The DBpedia project [4] and YAGO [7] are the most relevant approaches to generating an ontology from semi-structured information in Wikipedia. DBpedia provides an ontology extracted from Wikipedia infoboxes, based on hand-generated mappings of infoboxes to the DBpedia ontology (DBPO). DBPO counts 359 concepts (version 3.8), but only 2.3M of more than 4M entities are classified with respect to this ontology. YAGO types are extracted from Wikipedia categories and aligned to a subset of WordNet. The YAGO ontology is larger than DBPO and counts 290K concepts. YAGO also has a larger, although still incomplete, coverage of DBpedia entities (2.7M typed entities). ORA introduces a third dimension: the terminology of the crowds. Furthermore, it provides a larger coverage (currently 3.0M typed entities). Recently, the Schema.org (footnote 7) initiative has provided alignments to DBPO. However, this effort does not add value with respect to the intensional and extensional coverage issues.

Other relevant work related to our method includes Ontology Learning and Population (OL&P) techniques [1]. Typically, OL&P is implemented on top of machine learning methods; hence it requires large corpora, sometimes manually annotated, in order to induce a set of probabilistic rules. Such rules are defined through a training phase that can take a long time. The method used for ORA and implemented by Tìpalo [3] differs from existing approaches in that it is mainly rule-based; hence it does not require a training phase and it is faster than the other approaches.

Footnote 4: ORA is the Italian translation of NOW.
Footnote 5: http://dbpedia.org/ontology
Footnote 6: http://isotta.cs.unibo.it:8080/sparql (select the graph now)

3 Automatic extraction of an ontology for Wikipedia: materials and methods

Tìpalo is implemented as a pipeline of components and data sources. Each component in the pipeline implements a step of the computation: (i) extraction of an entity's natural language definition from its Wikipedia abstract; (ii) natural language deep parsing (provided by FRED [6]), whose output is an RDF/OWL representation of the entity definition; (iii) selection of candidate types (based on graph-pattern-based heuristics applied to the FRED output); (iv) word-sense disambiguation of candidate types; and (v) type alignment to OntoWordNet [2], to WordNet supersenses and to a subset of DOLCE+DnS Ultra Lite. We refer to [3] for details about the design and the implementation of Tìpalo. In [3] we evaluated Tìpalo by extracting the types for a sample of 627 resources, while in this work we want to extract the ontology of Wikipedia by running Tìpalo on 3,769,926 DBpedia entities taken from the dbpedia_long_abstracts_en dataset of DBpedia, which includes only entities that have a Wikipedia abstract; this is the main constraint for applying our method.
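The five steps can be pictured as a chain of small functions. The following Python sketch is only an illustration under strong assumptions and is not Tìpalo's code: all function names are hypothetical, a regular expression stands in for FRED's deep parsing and the graph-pattern heuristics of steps (ii) and (iii), and the two lookup tables are toy stand-ins for word-sense disambiguation and alignment in steps (iv) and (v), with values copied from the Book of Revelation example in this paper.

```python
from __future__ import annotations

import re

# Toy stand-ins for steps (iv) and (v); the values come from the paper's example.
TOY_SENSES = {"book": "synset-book-noun-10"}
TOY_SUPERSENSES = {"synset-book-noun-10": "supersense-noun_communication"}

# Crude replacement for steps (ii)-(iii): the noun phrase after "is a/an/the".
DEFINITION_PATTERN = re.compile(
    r"\bis (?:a|an|the) (?:(?:first|last) )?([a-z ]+?) (?:of|in|that|which)\b"
)


def extract_definition(abstract: str) -> str:
    """Step (i): use the first sentence of the abstract as the definition."""
    return re.split(r"(?<=[.!?])\s+", abstract.strip(), maxsplit=1)[0]


def candidate_types(definition: str) -> list[str]:
    """Steps (ii)-(iii): candidate type phrases, most specific first."""
    match = DEFINITION_PATTERN.search(definition)
    if match is None:
        return []  # no heuristic matched, so the entity stays untyped
    words = match.group(1).split()
    return [" ".join(words[i:]) for i in range(len(words))]


def class_name(phrase: str) -> str:
    """Turn 'canonical book' into 'fred:CanonicalBook'."""
    return "fred:" + "".join(word.capitalize() for word in phrase.split())


def type_entity(entity: str, abstract: str) -> list[tuple[str, str, str]]:
    """Run all five toy steps and return (subject, predicate, object) triples."""
    types = candidate_types(extract_definition(abstract))
    if not types:
        return []
    triples = [(entity, "rdf:type", class_name(types[0]))]
    for narrower, broader in zip(types, types[1:]):
        triples.append((class_name(narrower), "rdfs:subClassOf", class_name(broader)))
    head = types[-1]
    synset = TOY_SENSES.get(head)  # step (iv), toy lookup
    if synset is not None:
        triples.append((class_name(head), "owl:equivalentClass", "wn30-instance:" + synset))
        supersense = TOY_SUPERSENSES.get(synset)  # step (v), toy lookup
        if supersense is not None:
            triples.append((class_name(head), "rdfs:subClassOf", "wn30-instance:" + supersense))
    return triples


abstract = (
    "The Book of Revelation is the last canonical book of the New Testament "
    "in the Christian Bible. (Rest of the abstract omitted.)"
)
for triple in type_entity("dbpedia:Book_of_Revelation", abstract):
    print(*triple, ".")
```

For this abstract the sketch types the entity as fred:CanonicalBook, makes that class a subclass of fred:Book, and aligns fred:Book to a synset and a supersense. A definition that the single toy pattern does not cover, such as "The Marriage of Heaven and Hell is one of William Blake's books.", yields no triples here; that is a limitation of the sketch, not a statement about Tìpalo's heuristics.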

4 The Natural Ontology of Wikipedia (ORA): results and discussion

The process described above was run on a Mac Pro (Quad-Core Intel Xeon, 2.8 GHz, 10 GB RAM) and took 15 days; this time can easily be reduced by parallelizing the activity on a cluster of machines with similar or more powerful characteristics. The process resulted in 3,023,890 typed entities and associated taxonomies of types. Most of the missing results are due to the lack of matching Tìpalo heuristics, which means that by improving Tìpalo we will improve coverage (this is part of our current work). The resulting ontology includes 585,474 distinct classes organized in a taxonomy with 396,375 rdfs:subClassOf axioms; 25,480 of these classes are aligned through owl:equivalentClass axioms to 20,662 OntoWordNet synsets by means of a word-sense disambiguation process. The difference between the number of disambiguated classes (25,480) and the number of identified synsets (20,662) means that there are at least 4,818 synonym classes in the ontology. We expect the number of actual synonyms to be greater. Hence, we are planning to investigate sense-similarity-based metrics in order to reduce the number of distinct classes in the ontology, either by merging synonyms or at least by providing explicit similarity relations with confidence scores between classes.

Footnote 7: http://schema.org
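The figures above can be checked with a few lines of arithmetic. The Python sketch below recomputes the coverage and the lower bound on synonym classes from the reported numbers, and then shows on a small toy alignment why "classes minus synsets" counts the classes that could be merged. The toy alignment, its class names and the labels synset-A and synset-B are invented for illustration and are not taken from ORA.

```python
from collections import defaultdict

# Figures reported in the text.
ENTITIES_PROCESSED = 3_769_926
ENTITIES_TYPED = 3_023_890
ALIGNED_CLASSES = 25_480
DISTINCT_SYNSETS = 20_662

coverage = ENTITIES_TYPED / ENTITIES_PROCESSED
untyped = ENTITIES_PROCESSED - ENTITIES_TYPED
redundant_classes = ALIGNED_CLASSES - DISTINCT_SYNSETS

print(f"coverage: {coverage:.1%}")
print(f"untyped entities: {untyped:,}")
print(f"lower bound on synonym classes: {redundant_classes:,}")


def synonym_groups(alignment):
    """Group classes by the synset they are equivalent to; keep groups of 2+."""
    groups = defaultdict(list)
    for cls, synset in alignment.items():
        groups[synset].append(cls)
    return {s: sorted(c) for s, c in groups.items() if len(c) > 1}


# Invented toy data: three classes share one synset, one class stands alone.
toy_alignment = {
    "fred:Film": "synset-A",
    "fred:Movie": "synset-A",
    "fred:Picture": "synset-A",
    "fred:Book": "synset-B",
}
toy_groups = synonym_groups(toy_alignment)
# Merging each group into one class removes len(group) - 1 classes.
toy_mergeable = sum(len(classes) - 1 for classes in toy_groups.values())
print(toy_groups, toy_mergeable)
```

In the toy data four classes map to two synsets, so two classes are mergeable, which equals the number of classes minus the number of synsets. The same subtraction on the reported figures gives 4,818. It is only a lower bound because classes that were never aligned to a synset may also be synonyms.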

To prevent the polysemy that would arise from merging classes that share a name but are aligned to different synsets, we adopted a uniqueness criterion for generating the URIs of these classes. For example, consider the entity dbpedia:The_Marriage_of_Heaven_and_Hell (footnote 8). For this entity Tìpalo generates the following RDF:

    dbpedia:The_Marriage_of_Heaven_and_Hell
        a fred:Book .
    fred:Book
        owl:equivalentClass wn30-instance:synset-book-noun-2 .

Similarly, for the entity dbpedia:Book_of_Revelation (footnote 9) Tìpalo generates the following RDF:

    dbpedia:Book_of_Revelation
        a fred:CanonicalBook .
    fred:CanonicalBook
        rdfs:subClassOf fred:Book .
    fred:Book
        owl:equivalentClass wn30-instance:synset-book-noun-10 .

Footnote 8: The definition of dbpedia:The_Marriage_of_Heaven_and_Hell is: "The Marriage of Heaven and Hell is one of William Blake's books."
Footnote 9: The definition of dbpedia:Book_of_Revelation is: "The Book of Revelation is the last canonical book of the New Testament in the Christian Bible."

The two fred:Book classes refer to two distinct concepts. Hence, they cannot be merged during the generation of the ontology. We solve this by appending the ID of the closest synset in the taxonomy to the URI of each newly generated class: this approach prevents polysemy and, at the same time, makes it possible to identify synonymy. Finally, all the classes aligned to OntoWordNet have also been aligned to WordNet supersenses and to a subset of DOLCE+DnS Ultra Lite classes by means of rdfs:subClassOf axioms. The following example shows a sample of the ontology derived by typing the two entities used as examples above:

    dbpedia:The_Marriage_of_Heaven_and_Hell
        a fred:Book_102870092 .
    dbpedia:Book_of_Revelation
        a fred:CanonicalBook_106394865 .
    fred:CanonicalBook_106394865
        rdfs:subClassOf fred:Book_106394865 ;
        rdfs:label "Canonical Book"@en-US .
    fred:Book_102870092
        owl:equivalentClass wn30-instance:synset-book-noun-2 ;
        rdfs:label "Book"@en-US .
    fred:Book_106394865
        owl:equivalentClass wn30-instance:synset-book-noun-10 ;
        rdfs:subClassOf wn30-instance:supersense-noun_communication ,
                        d0:InformationEntity ;
        rdfs:label "Book"@en-US .
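The URI criterion can be sketched in a few lines of Python. This is a minimal illustration, not the actual implementation: the function names are hypothetical, each entity's local taxonomy is given as a plain child-to-parent dictionary, and leaving a class without a suffix when no aligned ancestor exists is an assumption of this sketch, since the text does not describe that case. The synset IDs are the ones that appear in the example above.

```python
from __future__ import annotations


def closest_synset_id(cls: str, parent_of: dict, synset_id_of: dict) -> str | None:
    """Walk up the local taxonomy from cls to the first class aligned to a synset."""
    seen = set()
    current = cls
    while current is not None and current not in seen:
        if current in synset_id_of:
            return synset_id_of[current]
        seen.add(current)  # guards against accidental cycles
        current = parent_of.get(current)
    return None


def mint_uris(classes: list, parent_of: dict, synset_id_of: dict) -> dict:
    """Map each local class name to a URI suffixed with its closest synset ID."""
    uris = {}
    for cls in classes:
        synset_id = closest_synset_id(cls, parent_of, synset_id_of)
        # Assumption of this sketch: no aligned ancestor means no suffix.
        uris[cls] = f"fred:{cls}" if synset_id is None else f"fred:{cls}_{synset_id}"
    return uris


# Local graph for dbpedia:The_Marriage_of_Heaven_and_Hell.
blake = mint_uris(["Book"], {}, {"Book": "102870092"})

# Local graph for dbpedia:Book_of_Revelation.
revelation = mint_uris(
    ["CanonicalBook", "Book"],
    {"CanonicalBook": "Book"},
    {"Book": "106394865"},
)

print(blake)
print(revelation)
```

The two local Book classes receive different URIs because their synset IDs differ, so they are never merged, while any two classes with the same name and the same closest synset would collapse onto one URI. CanonicalBook has no alignment of its own and therefore inherits the ID of its parent Book, as in the sample ontology.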

Conclusion. The main result of this work is the Natural Ontology of Wikipedia (ORA): an ontology that reflects the richness of the terms used and agreed upon by the crowds for defining entities in Wikipedia. All produced datasets are available for download (footnote 10). We claim that this ontology provides an important resource that can be used as an alternative or complement to YAGO and DBPO, and that it can enable more accurate usage of DBpedia in Semantic Web based applications such as mash-up tools, recommendation systems, and exploratory search tools (see for example Aemoo [5]). Currently, we are working on refining ORA and aligning it to DBPO and YAGO.

Footnote 10: http://stlab.istc.cnr.it/stlab/ORA

References

[1] P. Cimiano. Ontology Learning and Population from Text: Algorithms, Evaluation and Applications. Springer, 2006.

[2] A. Gangemi, R. Navigli, and P. Velardi. The OntoWordNet Project: extension and axiomatization of conceptual relations in WordNet. In Meersman, pages 3-7. Springer, 2003.

[3] A. Gangemi, A. G. Nuzzolese, V. Presutti, F. Draicchio, A. Musetti, and P. Ciancarini. Automatic Typing of DBpedia Entities. In International Semantic Web Conference (1), volume 7649 of Lecture Notes in Computer Science, pages 65-81. Springer, 2012.

[4] J. Lehmann, C. Bizer, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - A Crystallization Point for the Web of Data. Journal of Web Semantics, 7(3):154-165, 2009.

[5] A. G. Nuzzolese, V. Presutti, A. Gangemi, A. Musetti, and P. Ciancarini. Aemoo: Exploring knowledge on the web. In Proceedings of the 5th Annual ACM Web Science Conference, pages 272-275. ACM, 2013.

[6] V. Presutti, F. Draicchio, and A. Gangemi. Knowledge extraction based on discourse representation theory and linguistic frames. In Knowledge Engineering and Knowledge Management (EKAW 2012), pages 114-129. Springer, 2012.

[7] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: A Core of Semantic Knowledge. In 16th international World Wide Web conference (WWW 2007), pages 697-706, New York, NY, USA, 2007. ACM Press.