=Paper= {{Paper |id=None |storemode=property |title=Type inference through the analysis of Wikipedia links |pdfUrl=https://ceur-ws.org/Vol-937/ldow2012-paper-13.pdf |volume=Vol-937 |dblpUrl=https://dblp.org/rec/conf/www/NuzzoleseGPC12 }} ==Type inference through the analysis of Wikipedia links== https://ceur-ws.org/Vol-937/ldow2012-paper-13.pdf
Type inference through the analysis of Wikipedia links

Andrea Giovanni Nuzzolese (ISTC-CNR, STLab; CS Dept., University of Bologna, Italy), nuzzoles@cs.unibo.it
Aldo Gangemi (ISTC-CNR, Semantic Technology Lab, Rome, Italy), aldo.gangemi@cnr.it
Valentina Presutti (ISTC-CNR, Semantic Technology Lab, Rome, Italy), valentina.presutti@cnr.it
Paolo Ciancarini (ISTC-CNR, STLab; CS Dept., University of Bologna, Italy), ciancarini@cs.unibo.it

Copyright is held by the author/owner(s).
LDOW2012, April 16, 2012, Lyon, France.

ABSTRACT
DBpedia contains millions of untyped entities, whether we consider the native DBpedia ontology or Yago plus WordNet. Is it possible to automatically classify those entities? Based on previous work on wikilink invariances, we wondered whether wikilinks convey knowledge rich enough for their classification. In this paper we give three contributions. Concerning the DBpedia link structure, we describe some measurements and point out both problems (e.g. the bias that could be induced by the incomplete ontological coverage of the DBpedia ontology) and potentials existing in the current type coverage. Concerning classification, we present two techniques that exploit wikilinks, one based on induction with machine learning techniques, and the other on abduction. Finally, we discuss the limited results of classification, which confirmed the concerns raised by the general figures from our measurements. We also suggest some new possible directions that entity classification could take.

1.    INTRODUCTION
DBpedia is the largest RDF data set in Linked Data, extracted from the largest existing multi-domain knowledge source edited by the crowds, i.e. Wikipedia. Part of the DBpedia entities are explicitly typed with classes of the DBpedia Ontology (DBPO). This huge source of semantic data provides a powerful knowledge base that can be exploited as background knowledge for developing a new generation of knowledge extraction and interaction tools. Nevertheless, most DBpedia entities are still untyped, which prevents DBpedia from being exploited at its full potential. Based on previous work on wikilink invariances [13], we wondered whether wikilinks convey knowledge rich enough for the automatic classification of those entities. In this paper, we discuss both the problems and the potentials deriving from the current type coverage of DBpedia entities. Additionally, we present two automatic classification techniques that exploit wikilinks, one based on induction from machine learning, and the other based on abduction. Both methods showed limited results in terms of performance (i.e., precision and recall), which can be explained by analyzing the general figures from our measurements on the DBpedia link structure. More specifically: (i) the mapping procedure between Wikipedia infobox templates and DBpedia ontology classes is conducted manually, as described in [9], resulting in a lack of ontological coverage of DBPO, i.e. DBPO classes are insufficient for typing all DBpedia entities, which in turn results in a lack of classification knowledge that automatic techniques can use for identifying type candidates for untyped entities; (ii) the granularity of the assigned types is not homogeneous, e.g. some entities are typed with very general classes such as Person, while other similar entities have more specialized types such as Musician; (iii) most of the entities are untyped, hence making it hard to build a proper training set for inductive learning techniques; (iv) the distribution of link types ingoing to, and outgoing from, DBpedia entities varies between typed and untyped entities, which impacts the ability of our abductive method to properly classify untyped entities.
After describing the resources that we have used for our research and discussing the limitations and potentials emerging from our measurements on the general figures of the DBpedia link structure (Section 2), we present two methods for the automatic classification of DBpedia entities and the results we have obtained (Section 3). In Section 4 we discuss related work, and in Section 5 we conclude by discussing possible new approaches to DBpedia entity classification, based on more solid grounds.

2.    MATERIALS
DBpedia datasets describe about 18 million resources (3 million with an abstract and a label, less than 2 million typed), and include more than 107 million wikilinks.
The classification attempted in this paper is based on the assumption that the wikilink relations in Wikipedia convey rich knowledge that can be used to classify the untyped entities referenced in those pages. In practice, given a certain entity
described in a Wikipedia page, our classification is grounded in the analysis of the wikilinks incoming to and outgoing from that page.
For this analysis we used DBpedia [9], the RDF [10] Linked Data [1] dataset that contains structured information extracted from Wikipedia.
The types used to classify DBpedia resources are the classes of the DBpedia ontology (DBPO, http://wiki.dbpedia.org/Ontology). DBPO is represented in OWL [11] and covers 272 classes.
The rdf:type statements from DBpedia resources to DBPO classes (available in the dataset dbpedia_instance_types_en) are the result of hand-generated mappings of Wikipedia infoboxes to DBPO, and have been generated for 1,668,503 DBpedia resources (this figure holds for the typed resources in the English version of DBpedia 3.6).
Those statements cover only a subset of the 15,944,381 DBpedia resources available in the dbpedia_page_links_en dataset. Hence, only 15.52% of the resources in the DBpedia wikilink data set are typed with DBPO classes. In the work described in this paper we investigate how to assign a DBPO type to the remaining 84.48% untyped DBpedia resources (excluding some entities that are not relevant: images, categories, and disambiguation pages).
Table 1 shows further details about how resources are organized in DBpedia. It emerges that the number of wikilinks (107,892,317) is much bigger than the number of rdf:type triples (6,173,940). When we take into account only DBPO types, we can observe that the number of wikilink axioms having both the subject and the object typed with DBPO types is limited to 16,745,830.

Table 1: Analysis of DBpedia resources with respect to some relevant perspectives for our work.
  Perspective                                                    # of resources/axioms
  Number of wikilink axioms                                      107,892,317
  Wikilink axioms having both the subject and the
  object typed with DBPO types                                   16,745,830
  Resources used in wikilink relations                           15,944,381
  rdf:type triples                                               6,173,940
  Resources having a DBPO type                                   1,668,503
  Resources typed with a DBPO type and used as
  subject of wikilinks                                           1,518,697

Furthermore, we analyzed the distribution of wikilinks across typed/untyped resources. The average percentage of fully DBPO-typed wikilinks (i.e., wikilink axioms having both the subject and the object typed with a DBPO class) per resource is 66% with respect to the total number of wikilink axioms per resource. Instead, the average percentage of wikilink axioms outgoing from untyped to typed resources is 37% per resource with respect to the total number of wikilink axioms per resource. This means that the ratio of typed to total outgoing links is about 2/3 for typed resources vs. 1/3 for untyped ones.
The reason for this imbalance can be hypothesized to be the high frequency of homotypes, i.e. wikilinks that have the same type on both the subject and the object of the triple. If this hypothesis is confirmed, untyped resources should have a high ratio of untyped outgoing links. As a matter of fact, homotypes are actually very frequent, usually the most frequent (or in the top 3) wikilink types, an observation made in the research reported in [13]. Therefore, the distribution of wikilinks for typed and untyped resources is unbalanced.
This means that if we use wikilinks and types as the features for training or designing a good inductive model based on the corpus of typed resources, a bias is created on the appropriateness of such a model for classifying untyped resources. However, we wanted to check anyway: (i) what precision/recall can be obtained when using wikilink structures as features for creating a type induction model from typed DBpedia resources, even considering the bias constituted by the 34% untyped wikilinks; and (ii) how much the larger bias (63%), constituted by untyped wikilinks on untyped resources, would affect the precision/recall established on typed resources.
A part of our wikilink analysis for DBpedia entity classification made use of 187 Knowledge Patterns that have been extracted from DBpedia wikilink datasets, called Encyclopedic Knowledge Patterns (EKPs) [13] (available as a set of OWL ontologies at http://ontologydesignpatterns.org/ekp/owl/). EKPs allow fetching the most relevant entity types, which provide an effective and intuitive description of entities of a certain type. As discussed in the next section, EKPs provide background knowledge to our method of abductive classification.

3.    METHODS AND RESULTS
The classification of DBpedia entities relies on an ontology mapping task which defines how Wikipedia infobox templates are mapped to classes of the DBpedia ontology. These mappings are manually specified using the DBpedia Mapping Language (http://mappings.dbpedia.org/index.php/Main_Page). The mapping language makes use of MediaWiki templates that allow mapping infoboxes to DBpedia ontology classes. The mappings cover only a small subset of all Wikipedia infoboxes. As a result, so far, only a small subset of all DBpedia entities (1,668,503 of 15,944,381) is typed with a class of the DBpedia ontology. Probably the effort of manually writing mappings for classifying DBpedia entities with respect to the DBpedia ontology is too expensive, and the granularity and appropriateness of the obtained typings are not exhaustive. As an example, dbpedia:Walt_Disney (the prefixes dbpo and dbpedia stand for http://dbpedia.org/ontology/ and http://dbpedia.org/resource/ respectively) is typed as dbpo:Person, which is doubtlessly correct, but also trivial and less appropriate than the existing dbpo:ComicsCreator type.
Our work is based on the intuition that wikilink relations in DBpedia, i.e. instances of the dbpo:wikiPageWikiLink property, convey rich knowledge that can be used for classifying DBpedia entities, but at the same time it is very difficult to find a good training set by using only typed wikilinks. A first reason is that typed resources having wikilinks are only 15.52% of the total resources used in the wikilink data set. A second one derives from the fact that untyped resources, i.e., the resources we are interested in typing, have
only one third of their total wikilinks typed. In order to test this intuition, we have investigated and compared type induction methods over DBpedia entities based on two inference types:

   • Inductive classification: works by moving from specific observations to broader generalizations and theories, and can be informally defined as a "bottom-up" approach;

   • Abductive classification: works from more general rules (assumed from previous cases) in order to infer presumably specific facts. In this case we have used EKPs and homotypes as background knowledge.

Considering that DBPO has 272 classes, and that automatic classification over 272 classes is very difficult, we have focused mainly on classifying entities with respect to the DBPO top level. The granularity of the classification is handled in this case by adopting a hierarchical-iterative type induction derived from the class hierarchy in DBPO. This means that the classification starts from the top level of the DBpedia ontology, composed of 27 classes, and then iteratively tries to classify an entity with one of the sub-classes of the class selected in the previous iteration.

3.1    Inductive type inference
Hybridization of Machine Learning (ML) techniques with the Semantic Web is quite effective [5], therefore we started our investigation by trying a ML-based approach to the classification of DBpedia entities. We have used the k-Nearest Neighbor (NN) algorithm for classifying DBpedia entities based on the closest training examples in the labeled feature space, assigning the most voted class among the training examples.
We have designed two inductive classification experiments based on the NN algorithm: (i) a baseline experiment configured to classify DBpedia individuals based on 272 features, i.e., all the DBPO classes; (ii) an experiment based on the top-level classes of the DBPO taxonomy, i.e., 27 classes, aimed at simplifying the classification with fewer features to investigate. In both cases the training sample has been built with the same approach, described as follows. (i) 20% of the individuals of each class has been used for populating the training sample. Table 2 shows how many individuals have been chosen with respect to the classes Place, Person, Work, Organisation and Activity.

Table 2: Number of individuals chosen for generating the training examples for the classes Place, Person, Work, Organisation and Activity.
  Class          # of individuals   # of trained individuals
  Place          525,786            105,157
  Person         416,079            83,215
  Work           262,662            52,532
  Organisation   169,338            33,867
  Activity       1,380              276

(ii) The NN algorithm has been trained on a labeled feature space model in which the individuals of the training sample and the classes of DBPO were the rows and the feature columns of the model, respectively. For each individual we labeled the corresponding row with its known DBPO class, and we then analyzed its graph of wikilinks in order to fill the matrix resulting from the vector space model. This was done by marking each intersection cell between a row corresponding to an individual and a feature corresponding to a DBPO class with either 0 or 1, with the following criteria:

   • 0 means that no wikilink exists between the selected individual and any other individual typed with the corresponding DBPO class used as feature;

   • 1 means that at least one wikilink exists between the selected individual and some other individual typed with the corresponding DBPO class used as feature.

As an example, we may want to build the feature model of the wikilinks of the entity dbpedia:Steve_Jobs with respect to the classes dbpo:Mammal, dbpo:Scientist, dbpo:Company, dbpo:Drug, dbpo:City, dbpo:Magazine. We fetch all the types related to the outgoing wikilinks from dbpedia:Steve_Jobs with a simple SPARQL query like the following:

PREFIX dbpedia: <http://dbpedia.org/resource/>
SELECT ?link ?type WHERE {
    GRAPH <...> {   # wikilink graph (IRI omitted)
        dbpedia:Steve_Jobs ?prop ?link
    } .
    GRAPH <...> {   # type graph (IRI omitted)
        ?link a ?type
    }
}

Supposing that all the retrieved wikilinks (actually, those shown are only a subset) are the ones shown in Table 3, the resulting row in the feature space model for dbpedia:Steve_Jobs will be the following:

Table 3: Some of the wikilinks outgoing from dbpedia:Steve_Jobs and their associated type.
  Link                             Class
  ...                              ...
  dbpedia:Apple_Inc.               dbpo:Company
  dbpedia:NeXT                     dbpo:Company
  dbpedia:Cupertino,_California    dbpo:City
  dbpedia:Forbes                   dbpo:Magazine
  ...                              ...

               Mammal   Scientist   Company   Drug   City   Magazine   Label
  Steve_Jobs   0        0           1         0      1      1          Person

After the training, for each class the classification function generates what is called a mean, i.e., the reference value used to test similarities and then to classify untyped entities. The classification has been performed by estimating the Euclidean distance of the features of unlabeled individuals with respect to each mean. The lowest Euclidean distance values have been chosen to classify individuals.
The performance of this approach has been evaluated on the
remaining 80% untyped individuals. We remind the reader that such an evaluation considers the "best case" of resources with a known type, in which the percentage of typed wikilinks is high (66%). The precision of the baseline experiment has been 31.65%. The precision of the classification of individuals with respect to the DBPO top level has been 40.27%.

3.2    Abductive type inference
Abduction was introduced by Peirce [7, 14], and refers to a process that forms an explanatory hypothesis about a precondition P by reasoning on an observed consequence C. Compared to pure deduction, induction and abduction have a lower strength, but are practical when the set of assumptions is not complete with respect to the observed world. Induction cannot be made fully conclusive, since the inference from a set of cases can only be made certain in a closed world (which is not the case with the Web). Abduction, in its turn, is formally equivalent to the logical fallacy of affirming the consequent, or Post hoc ergo propter hoc, because there are multiple possible explanations for C. In other words, abducing P from C involves determining that P is sufficient (or nearly sufficient), but not necessary, for C.
We have used an abductive approach to infer the type of DBpedia entities with two classification methods:

   • EKP-based. We assumed Encyclopedic Knowledge Patterns (EKPs) extracted from wikilink relations in Wikipedia [13] as our background knowledge, and as the abductive consequence C, from which we infer the type of entities. In this context, entity types are our precondition P. We want to infer types by analysing the similarity between rules derived from EKPs and the configurations of wikilink relations obtained from untyped entities in DBpedia.

   • Homotype-based. We define homotypes as wikilinks that have the same type on both the subject and the object of the triple. Since homotypes are usually the most frequent (or in the top 3) wikilink types [13], we want to detect emerging homotypes for untyped resources by summing the numbers of outgoing and incoming wikilinks.

EKP-based type abduction. EKPs are defined as sets of type paths that occur most often, above a certain threshold t [13]. In this context, a path relates the types of two entities having a wikilink. The frequency of the paths composing an EKP is measured by the path popularity. The path popularity is defined as the ratio of how many distinct resources of a certain type participate as subject in a path to the total number of resources of that type. We use the path popularity values of each EKP in order to estimate the similarity between the wikilinks of an entity and an EKP. For doing that, we build a labeled i x j feature space model composed of:

   • i rows, each one labeled by a different top-level class of the DBPO taxonomy. This means 0 ≤ i ≤ 27;

   • j feature columns, one for each class in DBPO. This means 0 ≤ j ≤ 272;

   • cell values that contain the path popularity value occurring between the type in the row and the type in the column.

The following is an example of the feature space model built for the class labels Person, Place and Organisation over the features Event, Work, Organisation, Person, Activity, Place:

                 Event   Work   Organisation   Person   Activity   Place
  ...            ...     ...    ...            ...      ...        ...
  Person         4.45    8.4    22.2           18.29    6.8        27.5
  Organisation   3.21    7.78   24.13          13.23    2.5        31.93
  Place          3.14    1.86   8.71           8.74     1.1        61.15
  ...            ...     ...    ...            ...      ...        ...

In the abductive approach, differently from the inductive one, the labeled feature space model is not used for training a classification function. In fact, it already provides general rules, based on path popularity values over the wikilink domain, that are used to infer types.
Instead, the entities to be classified are represented in the same way as in the inductive approach. Therefore, they are vectors of length F, where F is the number of features used in the model. The values in the vectors consist of either 0 or 1, in which, again:

   • 0 means that no wikilink exists between the selected individual and any other individual typed with the corresponding DBPO class used as feature;

   • 1 means that at least one wikilink exists between the selected individual and some other individual typed with the corresponding DBPO class used as feature.

Individuals are classified by finding similarities to EKPs, applying a similarity metric based on the analysis of path popularity values. This suggests which EKP is closest to the configuration of wikilinks that the individual presents. Given a labeled feature space model M and a vector I defined as described above, the similarity function is defined as follows:

    S(i) = ( Σ_{j=0}^{F} M_{i,j} · I_j ) / ( Σ_{j=0}^{F} M_{i,j} )

where

   • F is the number of DBPO classes used as features in the model, with 0 ≤ j ≤ F;

   • M_{i,j} is the path popularity value between the i-th class used as label and the j-th class used as feature;

   • I_j is the 0 or 1 value that encodes whether the entity has a wikilink relation with an entity typed with the j-th class used as feature;

   • 0 ≤ i ≤ L, where L is the number of available labels in the model.
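As a minimal sketch, the similarity function above can be implemented directly over the path popularity rows shown in the example table (Person, Organisation, Place); the `classify` helper, which takes the label maximizing S(i), is our illustrative addition, not code from the paper:

```python
# Minimal sketch of the EKP-based similarity S(i) defined above.
# M holds the path popularity rows from the example table; the
# `classify` helper (argmax over labels) is an illustrative addition.
FEATURES = ["Event", "Work", "Organisation", "Person", "Activity", "Place"]

M = {
    "Person":       [4.45, 8.40, 22.20, 18.29, 6.80, 27.50],
    "Organisation": [3.21, 7.78, 24.13, 13.23, 2.50, 31.93],
    "Place":        [3.14, 1.86,  8.71,  8.74, 1.10, 61.15],
}

def similarity(row, I):
    """S(i) = sum_j(M[i][j] * I[j]) / sum_j(M[i][j])."""
    return sum(m * x for m, x in zip(row, I)) / sum(row)

def classify(I):
    """Return the label maximizing S(i), together with its score."""
    scores = {label: similarity(row, I) for label, row in M.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# An entity linking only to a Work-typed and a Person-typed entity:
Y = [0, 1, 0, 1, 0, 0]
label, score = classify(Y)
print(label, round(score, 2))  # → Person 0.3
```

With this vector, S(Person) = (8.4 + 18.29)/87.64 ≈ 0.30 wins over Organisation (≈ 0.25) and Place (≈ 0.13), matching the paper's worked example for entity Y.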
[Figure 1: Results of the experiments based on abductive type inference. (a) Results of the abductive type inference experiment based on DBpedia top-level classes used both as labels and features. (b) Results of the abductive type inference experiment based on the classes Activity, Device, Event, Infrastructure, MeanOfTransportation, Organisation, Person, Place and Work used both as labels and features.]
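Figure 1 reports, besides per-class precision and recall, the raw counts of true positives, false positives and false negatives; the standard way the two metrics derive from those counts can be sketched as follows (the per-class counts used here are illustrative placeholders, not the paper's data):

```python
# Precision/recall from per-class counts, as reported in Figure 1.
# The counts below are hypothetical placeholders, not the paper's data.
def precision(tp, fp):
    """Fraction of entities assigned to a class that actually belong to it."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Fraction of a class's actual entities that were assigned to it."""
    return tp / (tp + fn) if tp + fn else 0.0

counts = {"Person": (80, 20, 40), "Place": (50, 50, 50)}  # (tp, fp, fn)
for cls, (tp, fp, fn) in counts.items():
    print(cls, round(precision(tp, fp), 3), round(recall(tp, fn), 3))
```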


The function S returns a similarity score calculated locally with respect to a type, while we are interested in the score that maximizes the similarity, hence:

    T = max S(i), for all 0 ≤ i ≤ L

Assume that a certain entity Y has the following wikilink configuration with respect to the types Event, Work, Organisation, Person, Activity, Place:

         Event   Work   Organisation   Person   Activity   Place
  Y      0       1      0              1        0          0

We obtain that the entity Y is classified as Person, since the value S(Person) = (1 * 8.4 + 1 * 18.29)/87.64 = 0.3 is the highest similarity value.

In our first abductive experiment we have focused only on the top-level classes of the DBpedia taxonomy, both for class labels and for features. Both L and F are then equal to the number of top-level classes in the DBpedia taxonomy, i.e., 27.

The precision and recall of this experiment are both 44.4%, as shown in Figure 1(a). Besides the precision and recall relative to single classes, the figure shows additional metrics, like the number of true positives, false positives, false negatives, etc.

What also emerges from Figure 1(a) is that a small subset of classes (Device, Event, Infrastructure, MeanOfTransportation, Organisation, Person, Place and Work) contains the majority of the individuals. For that reason, we have tried to apply the abductive approach only to that subset, plus the class Activity, which instead has a very low precision.
In this second experiment we have used those 9 classes as labels for classifying DBpedia entities over the same 9 classes as features. Under these assumptions, the global precision has fallen from 44.4% in the previous experiment down to 36.5%, while the recall has grown from 44.4% up to 79.5%. Figure 1(b) shows the results of this experiment.

Homotype-based abduction. In this case we use abduction in order to guess the type of DBpedia individuals by using homotypes as background knowledge. A homotype is usually the most frequent (or in the top 3) wikilink type. For that reason, we want to infer the type of a resource by detecting the emerging homotype. If most of the incoming and outgoing wikilinks of a resource R are of the same type T, then we can infer, under the homotype assumption, that the
type of R is T.
Given an individual i, we can then define the set W of all the classes used to type the individuals having an incoming or an outgoing wikilink relation with i, as:

    W(i) = { ⟨n, X⟩ | n = #{ x ∈ X  s.t.  i → x ∨ i ← x } }

where

   • the notation ∈ stands here for the rdf:type property;

   • → and ← stand for outgoing and incoming dbpo:wikiPageWikiLink properties;

   • n is the number of wikilinks typed by a same class X;

   • ⟨n, X⟩ is any distinct couple that states how many instances of the class X occur in the wikilinks of i.

[...] shows the details regarding global and local precision and recall, and also reports the number of true positives, false positives and false negatives.
The homotype-based approach produces a side effect: more than one class can emerge with the same frequency of wikilinks in W. In some cases, we get ex-aequo classifications, i.e. multi-typings. Multi-typing introduces noise that produces a higher number of false positives. For example, if an entity e, whose actual type is T1, is classified with types T1, T2 and T3, then T1 will be counted as a true positive, but at the same time T2 and T3 will be counted as false positives. In general, multi-typings are not desirable. Figure 4 shows the number of ex-aequo typings for each cluster found, i.e. 88,845 occurrences with 2 ex-aequo classes, 11,572 occurrences with 3 ex-aequo classes, 1,118 occurrences with 4 ex-aequo classes, 44 occurrences with 5 ex-aequo classes and, finally, 1 occurrence with 6 ex-aequo classes, for a total of 101,580 ex-aequo classifications.
The homotype range H, emerging for an individual i, is
formalized as the type X having the highest value n in W (i),
i.e.,
        H(i) = {X   |   hn, Xi ∈ W (i)
                        =⇒ ∀hn0 , X 0 i ∈ W (i) , n ≥ n0 }
where, the notation ∈ has here the classical meaning from set
theory. Figure 2 shows how the homotype range is selected
                                                                Figure 3: Precision and recall of the homotype-
                                                                based classification.

                                                                In order to reduce the noise of multi-typings in the evalua-
                                                                tion of the homotype-based approach, we have adjusted the
                                                                performance analysis by applying the following criteria in
                                                                case of ex-aequo:


                                                                     • if among the ex-aequo classes there is the actual type
                                                                       of the entity, then count 1 true positive and 0 false
                                                                       positives;
                                                                     • if among the ex-aequo classes there is not the actual
                                                                       type of the entity, then count 0 true positives and 1
                                                                       false positive;


Figure 2: Incouming + outgoing wikilinks of dbpe-               This increased the precision up to 55.07%.
dia:Steve_Jobs grouped by their types.
                                                                Figure 5 shows the trend of the precision and recall through
grouping the sum of outgoing and incoming wikilinks for         the various experiments. Considering the abductive ap-
the resource dbpedia:Steve_Jobs grouped by their types.         proach has reported the best results, we have decided to
According to the definition of homotype given, the resource     run both the EKP-based and homotype-based classification
dbpedia:Steve_Jobs as shown in figure 2 should be typed         on a sample of 1,000 untyped resources, to be manually eval-
with the class dbpo:Person.                                     uated. The classifier was again limited to the 9 core top-level
For the classification experiment based on the homotype def-    classes as described before.
inition, we have defined a threshold  that represents the      Manual evaluation reported precision and recall with EKP-
minimum number of wikilinks that a resource should have         based classification as 5.98% and 7.9% respectively, while
in order to be classifiable. In fact, below a certain number    with homotype-based classification as 13.93% and 65.51%
of wikilinks, the homotype is less distinctive. The threshold   respectively. Figure 6 shows the details of the results of the
has been fixed to be  = 10 because, as reported by [13],       classifications performed on the sample of 1,000 untyped re-
the average number of outgoing wikilinks per resource in the    sources for the 9 top-level classes.
dbpedia_page_links_en data set of DBpedia is 10.
The homotype-based classification of individuals from the       4.    RELATED WORK
control set (with a known type) from the                        There is valuable research on knowledge extraction based on
dbpedia_instance_type_en data set produced a global pre-        the exploitation of Wikipedia. Ontology mining aims at dis-
cision of 52.14% and a global recall of 85.87%. Figure 3        covering hidden knowledge from ontological knowledge bases
                                                                 (a) Results of the EKP-based classification of the sample of
                                                                 1,000 untyped resources


Figure 4: Distribution of ax aequo classification
among the 5 clusters found.




                                                                 (b) Results of the Homotype-based classification of the sam-
                                                                 ple of 1,000 untyped resources



Figure 5: Precision and recall trend through the                 Figure 6: Results of the classifications over the
various experiments.                                             sample of 1,000 DBpedia untyped resources with
                                                                 the EKP-based classifier 6(a) and homotype-based
                                                                 one 6(b)
by using data mining and machine learning techniques [5]. [4]
proposes an extension of the k-Nearest Neighbor algorithm for
Description Logic KBs based on the exploitation of an entropy-based
dissimilarity measure, while [2, 6] make use of Support Vector
Machines (SVM) [3] rather than NN to perform automatic
classification. SVMs perform instance classification by implicitly
mapping (through a kernel function) the training data into a higher
dimensional feature space, where instances can be classified by
means of a linear classifier. The main difference between these
approaches and ours is that they analyse all the semantic properties
used for linking individuals, while we look for invariances in the
usage of only one property, i.e., dbpo:wikiPageWikiLink, which
flattens the reasoning space.

The current procedure for assigning types to DBpedia entities is
completely manual. As extensively described in [9], a limited number
of infobox templates have been defined based on the empirical
observation of invariances in the infoboxes of Wikipedia pages
having the same ontological type. Based on previous work on wikilink
invariances [13], we investigate the automatic classification of
DBpedia entities. The ontology that we use for the classification is
the DBpedia Ontology (DBPO), which results from the manual procedure
described above; hence we inherit its limited ontological coverage.
[16] presents YAGO, an ontology extracted from Wikipedia categories
and infoboxes that has been combined with taxonomic relations from
WordNet. Here the approach can be described as a reengineering task
for transforming a thesaurus, i.e. the Wikipedia category taxonomy,
into an ontology, which required accurate ontological analysis.
There is a significant overlap between YAGO- and DBPO-typed
entities, and entities that have only YAGO classes cover a small
part of the entities untyped with DBPO. Furthermore, there is a lack
of mapping between YAGO and DBPO, which makes it difficult to
exploit their merged coverage. YAGO has a larger coverage than DBPO,
but only a partial overlap with DBPO coverage; moreover, the
granularity of YAGO categories is finer and not easily reusable,
because the top level is very large. The size of the ontology, and
the fact that YAGO adopts multi-typing, considerably complicate the
task of automatic type classification.

5. DISCUSSION AND CONCLUSION
We want to discover the types of all DBpedia resources. Currently
only a subset of them (about 15%, about 22% in the recent version
3.7) have explicit types that are coherently organized into an
ontology (the DBpedia Ontology, DBPO). We investigated type
inference in two directions: (i) an inductive approach, typical of
machine learning; (ii) an abductive approach, in which we first used
two already available feature sets over the wikilink domain in order
to infer types: (a) the EKPs [13], which have been extracted from
Wikipedia through a statistical analysis of the type paths generated
by wikilinks, and (b) the notion of emerging homotype, i.e. a link
that has the same type in the subject and the object of the RDF
triple.
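As a minimal illustration, the emerging-homotype selection used by the abductive approach can be sketched as follows. The sketch assumes the wikilink type counts W(i) for a resource are already available (e.g., precomputed from the dbpedia_page_links_en and dbpedia_instance_type_en data sets); the counts shown for dbpedia:Steve_Jobs are illustrative, not actual data.

```python
from collections import Counter

# Minimum number of wikilinks below which the homotype is not
# considered distinctive; fixed to 10, the average number of
# outgoing wikilinks per resource reported in [13].
THRESHOLD = 10

def homotype_range(type_counts):
    """Return the homotype range H(i): the classes whose wikilink
    frequency n is maximal in W(i). More than one class may be
    returned, i.e. an ex-aequo (multi-typing) classification."""
    w = Counter(type_counts)
    if sum(w.values()) < THRESHOLD:
        return set()  # too few wikilinks: resource not classifiable
    top = max(w.values())
    return {cls for cls, n in w.items() if n == top}

# Hypothetical counts of incoming + outgoing wikilinks of
# dbpedia:Steve_Jobs, grouped by the types of the linked resources.
counts = {"dbpo:Person": 24, "dbpo:Work": 17, "dbpo:Organisation": 9}
print(homotype_range(counts))  # {'dbpo:Person'}
```

When more than one class reaches the maximal frequency, the function returns all of them, mirroring the ex-aequo multi-typings discussed in the evaluation, where the classification counts as a true positive only if the actual type is among the returned classes.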
We observed two possible classification biases in the DBpedia
datasets. The first bias is the ratio between the number of typed
resources having wikilinks and the total number of resources with
wikilinks, i.e., 1,518,697 out of more than 15 million. We suspect
that this bias may be the result of a partial ontological coverage,
which derives from the manual mappings to DBPO of the Wikipedia
infoboxes used to extract DBpedia resources. The second, related,
bias comes from the ratio between the average number of typed
wikilinks owned by typed resources and by untyped resources, that
is, 66% for the former and 37% for the latter.

The results seem to confirm the effect of those biases on the
classification. In fact, while the results in classifying a test set
of typed DBpedia resources – precisions of 44.4% and 55.07% with
EKPs and homotypes respectively – are relatively satisfactory
(especially considering the number of classes), we observed a
dramatic fall of precision in classifying untyped DBpedia resources,
decreasing to 5.98% with EKPs and 13.93% with homotypes.

This means that, although the results on the cognitive soundness of
EKPs in providing an effective and intuitive description of entities
of a certain type (as reported in [13]) remain valid, our hypothesis
about the distinctive capacity of EKPs is weak. This seems due to
wide overlaps among EKPs. The same overlaps emerge when applying the
homotype-based classification as well. Figure 7 shows the overlaps
among the 4 largest classes of DBpedia, i.e., dbpo:Place,
dbpo:Person, dbpo:Organisation and dbpo:Work.

Figure 7: Overlaps of classified entities among the classes Place,
Person, Organisation and Work.

Notice that the decrease in precision from the test set to the
untyped resource set spans between -38% and -41% across the
different approaches, probably revealing that approximately 40% of
DBpedia resources are true negatives with reference to the existing
DBPO, and so providing a rough quantification of the missing
ontological coverage of the current DBPO. For instance, a quick
exploration of untyped resources immediately evidences the need for
types representing important notions such as Plan, Agreement,
ScientificDiscipline, Collection, Concept, etc. It is quite
interesting to remark that these notions are on the one hand harder
to generalize from infoboxes, but on the other are relatively
established in existing foundational ontologies and ontology design
patterns [8].

An area of improvement for DBPO and DBpedia typing is therefore an
extension of the ontology, together with ways to use the extension
to type untyped resources. A second area of improvement might be
related to the imprecision deriving from the massive overlap among
the above-mentioned four core classes of DBPO.

However, we might consider a more critical attitude. While we deem
important the role of ontologies in accurately distinguishing the
semantic types of things, especially in domains and tasks requiring
fine-grained types, it might be interesting to explore an
alternative vision of systematically ambiguous ontologies, where
some types tend to merge because of their mutual dependence in the
real world. For example, an organization is typically dependent on
persons, places, and works, which in their turn can often depend on
it: social organization is a major example.

Shifting our perspective, the distinctivity weakness of both EKPs
and homotypes in classifying DBpedia resources can find some basis
in recent work [15], which confirms known semiotic assumptions about
the centrality of (systematic) ambiguity in language, and poses
significant challenges to ontological theories assuming semantic
“pedantry”. An interesting research topic in ontology design may
investigate the consequences of assigning high value to the
“clusterability” of (systematically dependent) meaning dimensions
rather than to their distinctivity.

There are many directions that these results open up. Some of them
include the update of EKPs to DBpedia 3.7 and further analysis in
order to discover more distinctive features in their extraction.
Another direction involves a mixed approach based both on inference
and on crowdsourcing through, for instance, exploratory search
methods on Linked Data, following the direction we took with Aemoo
[12]. Other useful directions include the use of indexing
techniques, deep parsing of natural language, or social network
analyses.

6. REFERENCES
 [1] C. Bizer, T. Heath, and T. Berners-Lee. Linked Data - The
     Story So Far. International Journal on Semantic Web and
     Information Systems, 4(2):1–22, 2009.
 [2] S. Bloehdorn and Y. Sure. Kernel Methods for Mining Instance
     Data in Ontologies. In K. Aberer, K.-S. Choi, N. Noy,
     D. Allemang, K.-I. Lee, L. J. B. Nixon, J. Golbeck, P. Mika,
     D. Maynard, G. Schreiber, and P. Cudré-Mauroux, editors,
     Proceedings of the 6th International Semantic Web Conference
     and the 2nd Asian Semantic Web Conference (ISWC 2007 + ASWC
     2007), volume 4825 of Lecture Notes in Computer Science,
     pages 58–71, Busan, Korea, November 2007. Springer Verlag.
 [3] N. Cristianini and J. Shawe-Taylor. Support Vector Machines
     and Other Kernel-based Learning Methods. Cambridge University
     Press, 2000.
 [4] C. d’Amato, N. Fanizzi, and F. Esposito. Query Answering and
     Ontology Population: an Inductive Approach. In M. Hauswirth,
     M. Koubarakis, and S. Bechhofer, editors, Proceedings of the
     5th European Semantic Web Conference (ESWC 2008), volume 5021
     of Lecture Notes in Computer Science, Tenerife, Spain, June
     2008. Springer Verlag.
 [5] C. d’Amato, N. Fanizzi, and F. Esposito. Inductive
     Learning for the Semantic Web: What does it buy?
     Semantic Web, 1(1):53–59, 2010.
 [6] N. Fanizzi, C. d’Amato, and F. Esposito. Statistical
     Learning for Inductive Query Answering on OWL
     Ontologies. In A. P. Sheth, S. Staab, M. Dean,
     M. Paolucci, D. Maynard, T. W. Finin, and
     K. Thirunarayan, editors, Proceedings of the 7th
     International Semantic Web Conference (ISWC
     2008), volume 5318 of Lecture Notes in Computer
     Science, pages 195–212, Karlsruhe, Germany, October
     2008. Springer.
 [7] H. Frankfurt. Peirce’s notion of abduction. The
     Journal of Philosophy, 55(14):593–597, 1958.
 [8] A. Gangemi and V. Presutti. Handbook on Ontologies,
     chapter Ontology Design Patterns. Springer, 2nd
     edition, 2009.
 [9] J. Lehmann, C. Bizer, G. Kobilarov, S. Auer,
     C. Becker, R. Cyganiak, and S. Hellmann. DBpedia -
     A Crystallization Point for the Web of Data. Journal
     of Web Semantics, 7(3):154–165, 2009.
[10] F. Manola and E. Miller. RDF Primer. W3C Recommendation,
     W3C, Feb. 2004. http://www.w3.org/TR/2004/REC-rdf-primer-
     20040210/.
[11] B. Motik, B. Parsia, and P. F. Patel-Schneider. OWL
     2 web ontology language structural specification and
     functional-style syntax. W3C recommendation, W3C,
     Oct. 2009. http://www.w3.org/TR/2009/REC-owl2-
     syntax-20091027/.
[12] A. Musetti, A. G. Nuzzolese, F. Draicchio, V. Presutti,
     E. Blomqvist, A. Gangemi, and P. Ciancarini. Aemoo:
     Exploratory search based on knowledge patterns over
     the semantic web. Semantic Web Challenge.
[13] A. G. Nuzzolese, A. Gangemi, V. Presutti, and
     P. Ciancarini. Encyclopedic Knowledge Patterns from
     Wikipedia Links. In L. Aroyo, N. Noy, and C. Welty,
     editors, Proceedings of the 10th International Semantic
     Web Conference (ISWC2011), pages 520–536.
     Springer, 2011.
[14] C. Peirce. Pragmatism as a principle and method of right
     thinking: The 1903 Harvard lectures on pragmatism. State
     University of New York Press, 1997.
[15] S. T. Piantadosi, H. Tily, and E. Gibson. The communicative
     function of ambiguity in language. Cognition, 122(3):280–291,
     2012.
[16] F. Suchanek, G. Kasneci, and G. Weikum. YAGO - A Large
     Ontology from Wikipedia and WordNet. Journal of Web
     Semantics, 6(3):203–217, 2008.