<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>April</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Type inference through the analysis of Wikipedia links</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Valentina Presutti ISTC-CNR</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Semantic Technology Lab</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Italy valentina.presutti@cnr.it</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aldo Gangemi ISTC-CNR, Semantic Technology Lab</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Andrea Giovanni Nuzzolese ISTC-CNR, STLab CS Dept., University of Bologna</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Paolo Ciancarini ISTC-CNR, STLab CS Dept., University of Bologna</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <volume>16</volume>
      <issue>2012</issue>
      <abstract>
        <p>DBpedia contains millions of untyped entities, either if we consider the native DBpedia ontology, or Yago plus WordNet. Is it possible to automatically classify those entities? Based on previous work on wikilink invariances, we wondered if wikilinks convey a knowledge rich enough for their classi cation. In this paper we give three contributions. Concerning the DBpedia link structure, we describe some measurements and notice both problems (e.g. the bias that could be induced by the incomplete ontological coverage of the DBpedia ontology), and potentials existing in current type coverage. Concerning classi cation, we present two techniques that exploit wikilinks, one based on induction from machine learning techniques, and the other on abduction. Finally, we discuss the limited results of classi cation, which con rmed our fears expressed in the description of general gures from the measurement. We also suggest some new possible directions to entity classi cation that could be taken.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        DBpedia is the largest RDF data set in Linked Data
extracted from the largest existing multi-domain knowledge
source edited by the crowds, i.e. Wikipedia. Part of
DBpedia entities are explicitly typed with classes of the
DBpedia Ontology (DBPO). This huge source of semantic data
provides a powerful knowledge base that can be exploited
as background knowledge for developing new generation of
knowledge extraction and interaction tools. Nevertheless,
most of DBpedia entities are still untyped, which limits its
exploitation at its full potential. Based on previous work on
wikilink invariances [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], we wondered if wikilinks convey a
knowledge rich enough for their automatic classi cation. In
this paper, we discuss both problems and potentials deriving
from the current type coverage of DBpedia entities.
Additionally, we present two automatic classi cation techniques
that exploit wikilinks, one based on indiction from machine
learning, and the other based on abduction. Both methods
showed limited results in terms of performance (i.e.,
precision and recall), which can be explained by analyzing the
general gures from our measurements on the DBpedia link
structure. More speci cally, (i) the mapping procedure
between Wikipedia infobox templates and DBpedia ontology
classes is conducted manually, as described in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], resulting
in a lack of ontological coverage of DBPO i.e. DBPO classes
are insu cient for typing all DBpedia entities, which in turn
results in lack of classi cation knowledge that can be used
by automatic techniques for identifying type candidates for
untyped entities, (ii) the granularity of the assigned types
is not homogeneous e.g. some entities are typed with very
general classes e.g. Person, while other similar entities have
more specialized types e.g. Musician, (iii) most of the
entities are untyped, hence making it hard to build a proper
training set for inductive learning techniques, (iv) the
distribution of link types ingoing to, and outgoing from, DBpedia
entities varies between typed and untyped entities, which
impacts on the ability of our abductive method to properly
classify untyped entities.
      </p>
      <p>After describing the resources that we have used for our
research and discussing limitations and potentials emerging
from our measurements on the general gures of the
DBpedia link structure (Section 2), we present two methods
for automatic classi cation of DBpedia entities and the
results we have obtained (Section 3). In Section 4 we discuss
related work and in Section 5 we conclude by discussing
possible new approaches to DBpedia entity classi cation that
could be taken, based on more solid grounds.</p>
    </sec>
    <sec id="sec-2">
      <title>2. MATERIALS</title>
      <p>DBpedia datasets describe about 18 million resources (3
million with an abstract and a label, less than 2 million typed),
and include more than 107 million wikilinks.</p>
      <p>The classi cation attempted in this paper is based on the
assumption that wikilink relations in Wikipedia convey a
rich knowledge that can be used to classify untyped entities
referenced in those pages. In practice, given a certain entity
The reason of this unbalance can be hypothesized to be
caused by the high frequency of homotypes, i.e. wikilinks
that have the same type on both the subject and the object
of the triple. If this hypothesis is con rmed, untyped
resources should have a high ratio of untyped outgoing links.</p>
      <p>
        As a matter of fact, homotypes are actually very frequent
(usually the most frequent, or in the top 3) wikilink types
(this observation has been made in the research reported by
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]). Therefore, such distribution of wikilinks for typed and
untyped resources is unbalanced.
      </p>
      <p>This means that if we use wikilinks and types as the features
for training or designing a good inductive model based on
the corpus of typed resources, a bias is created on the
appropriateness of such a model for classifying untyped resources.</p>
      <p>However, we wanted to check anyway: i) what precision/recall
can be obtained when using wikilink structures as features
for creating a type induction model from typed DBpedia
resources, even considering the bias constituted by the 34%
untyped wikilinks; and ii) how much the larger bias (63%),
constituted by untyped wikilinks on untyped resources, would
a ect the precision/recall established on typed resources.</p>
      <p>
        A part of our wikilink analysis for DBpedia entity classi
cation made use of 187 Knowledge Patterns that have been
extracted from DBpedia wikilink datasets, which are called
Encyclopedic Knowledge Patterns (EKPs) 4 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. EKPs
allow to fetch most relevant entity types that provide an e
ective and intuitive description of entities of a certain type. As
discussed in the next section, EKPs provide a background
knowledge to our method of abductive classi cation.
described in a Wikipedia page, our classi cation grounds on
the analysis of incoming and outgoing wikilinks from and to
that certain page.
      </p>
      <p>
        For this analysis we used DBpedia [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the RDF [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] Linked
Data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] that contains structured information extracted from
Wikipedia.
      </p>
      <p>
        The types used to classify DBpedia resources are the classes
of the DBpedia ontology 1 (DBPO). DBPO ontology is
represented in OWL [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and covers 272 classes.
      </p>
      <p>The rdf:type statements from DBpedia resources to DBPO
classes (available in the dataset dbpedia_instance_types_en)
are the result of hand-generated mappings of Wikipedia in- 3. METHODS AND RESULTS
foboxes to DBPO, and have been generated for 1,668,503 The classi cation of DBpedia entities relies on an ontology
DBpedia resources2. mapping task which de nes how Wikipedia infobox
temThose statements cover only a subset of the 15,944,381 DB- plates are mapped to classes of the DBpedia ontology. These
pedia resources available in the dbpedia_page_links_en dataset. mappings are manually speci ed using the DBpedia
MapHence, only 15.52% of resources in the DBpedia data set of ping Language 5. The mapping language makes use of
Mewikilinks are typed with DBPO classes. In the work de- diaWiki templates that allow to map infoboxes to DBpedia
scribed in this paper we investigate how to assign a DBPO ontology classes. The mappings cover only a small subset of
type to the remaining 84.48% untyped DBpedia resources.3. all Wikipedia infoboxes. As a result, so far, only a small
subTable 2 shows further details about how resources are orga- set of all DBpedia entities (1,668,503 of 15,944,381) is typed
nized in DBpedia. It emerges that the number of wikilinks with a class of the DBpedia ontology. Probably the e ort
is much bigger (107,892,317) than the number of rdf:type spent in manually writing mappings for the classi cation
triples (6,173,940). When we take into account only DBPO of DBpedia entities with respect to the DBpedia ontology
type we can observe how the number of Wikilink axioms is too expensive and the granularity and the
appropriatehaving both the subject and the object typed with DBPO ness of obtained typings are not exhaustive. As an example,
types is limited to 16,745,830. dbpedia:Walt_Disney 6 is typed as dbpo:Person, which is
doubtlessly correct, but also trivial and less appropriate than
the existing dbpo:ComicsCreator type.</p>
      <p>Our work is based on the intuition that wikilink relations
in DBpedia, i.e. instances of the dbpo:wikiPageWikiLink
property, convey a rich knowledge that can be used for
classifying DBpedia entities, but at the same time it is very
di cult to nd a good training set by using only typed
wikilinks. A rst reason is that typed resources having wikilinks
are only the 15.52% of the total resources used in the
wikilink data set. A second one derives from the fact untyped
resources, i.e., the resources we are interested to type, have
Furthermore, we analyzed the distribution of wikilinks across
typed/untyped resources. The average percentage of fully
DBPO typed wikilinks (i.e., wikilink axioms having both
the subject and the object typed with a DBPO class) per
resource is 66% with respect to the total number of
wikilink axioms per resource. Instead, the average percentage
of wikilink axioms outgoing from untyped to typed resources
is 37% per resource with respect to the total number of
wikilink axioms per resource. This means that the ratio between
typed and untyped outgoing links is 23 for typed resources
vs. 13 for untyped ones.
1http://wiki.dbpedia.org/Ontology
2such gure holds for the typed resources in the English
version of DBpedia 3.6
3Actually, excluding some entities that are not relevant:
images, categories, and disambiguation pages
4The EKP resource is available as a set of OWL ontologies
at http://ontologydesignpatterns.org/ekp/owl/
5http://mappings.dbpedia.org/index.php/Main Page
6The pre xes dbpo and dbpedia stand
for http://dbpedia.org/ontology/ and
http://dbpedia.org/resource/ respectively.
only one third of their total wikilinks typed. In order to
test this intuition, we have investigated and compared type
induction methods over DBpedia entities based on two
inference types:</p>
      <p>Inductive classi cation: works moving from speci c
observations to broader generalizations and theories,
and can be informally de ned as a \bottom-up"
approach;
Abductive classi cation: works from more general rules
(assumed from previous cases), in order to infer
presumably speci c facts. In this case we have used EKPs
and homotypes as background knowledge.</p>
      <p>Considering that DBPO has 272 classes, and that automatic
classi cation on 272 classes is very di cult, we have focused
mainly on classi cation of entities with respect to the DBPO
top-level. The granularity of the classi cation is solved in
this case by adopting a hierarchical-iterative type induction
derived by the class hierarchy in DBPO. This means that
the classi cation starts from the top level of the DBpedia
ontology composed by 27 classes, and then iteratively goes
on trying to classify an entity with one of the sub-classes of
the class selected in the previous iteration.</p>
    </sec>
    <sec id="sec-3">
      <title>3.1 Inductive type inference</title>
      <p>
        Hybridization of Machine Learning (ML) techniques with
the Semantic Web is quite e ective [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], therefore we started
our investigation trying to use a ML-based approach to the
classi cation of DBpedia entities. We have used the
kNearest Neighbor (NN) algorithm for classifying DBpedia
entities based on the closest training examples in the
labeled feature space, and by assigning the most voted class
among the training examples.
      </p>
      <p>We have designed two inductive classi cation experiments
based on the NN algorithm: (i) a baseline experiment con
gured to classify DBpedia individuals based on 272 features,
i.e., all the DBPO classes. (ii) An experiment based on
the top-level classes in the DBPO taxonomy, i.e., 27 classes,
aimed to simplify the classi cation with less features to
investigate. In both cases the training sample has been built
with the same approach described as follows: (i) the 20% of
the individuals of each class has been used for populating
the training sample. Table 3.1 shows how many individuals
have been chosen with respect to the classes Place, Person,
Work, Organisation and Activity. (ii) The NN algorithm
has been trained on a labeled feature space model in which
the individuals of the training sample and the classes of the</p>
      <p>DBPO were the rows and the feature columns of the model
respectively. For each individual we labeled the
corresponding row with its known DBPO class and we then analyzed its
graph of wikilinks in order to ll the matrix resulting from
the space-vector model. This was done marking with either
0 or 1 each intersection cell between a row corresponding to
an individual and a feature corresponding to a DBPO class
with the following criteria:
0 means that no wikilink exists between the selected
individual and any other individual typed with the
corresponding DBPO class used as feature;
1 means that at least one wikilink exists between the
selected individual and any other individual typed with
the corresponding DBPO class used as a feature.</p>
      <p>As an example, we may want to built the feature model of
the wikilinks of the entity dbpedia:Steve_Jobs with respect
to the classes dbpo:Mammal, dbpo:Scientist, dbpo:Company,
dbpo:Drug, dbpo:City, dbpo:Magazine. We fetch all the
types related to the outgoing wikilinks from dbpedia:Steve_Jobs
with a simple SPARQL query like the following:
PREFIX dbpedia: &lt;http://dbpedia.org/resource/&gt;
SELECT ?link ?type WHERE {</p>
      <p>GRAPH&lt;dbpedia_page_links_en&gt; {</p>
      <p>dbpedia:Steve_Jobs ?prop ?link
} .</p>
      <p>GRAPH&lt;dbpedia_instance_types_en&gt; {</p>
      <p>?link a ?type
}</p>
      <p>}
Supposing that all the wikilinks retrieved (actually, those
shown are only a subset) are the ones showed in table 3.1
the resulting row in feature space model deriving from
dbpedia:Steve_Jobs will be the following:</p>
      <p>Mammal Scientist Company Drug City Magazine
Steve_Jobs 0 0 1 0 1 1
Person
After the training, for each class the classi cation function
generates what is called a mean, i.e., the reference value
used to test similarities and then to classify untyped
entities. The classi cation has been performed by estimating
the euclidean distance of the features of unlabeled
individuals with respect to each mean. The lower values of euclidean
distance have been chosen to classify individuals.</p>
      <p>The performance of this approach has been evaluated on the
remaining 80% untyped individuals. We remind that such
evaluation is made considering the \best case" of resources
with a known type, in which the percentage of typed
wikilinks is high (66%). The precision deriving from the baseline
experiment has been 31.65%. The precision of the classi
cation of individuals with respect to the DBPO top-level has
been 40.27%.</p>
    </sec>
    <sec id="sec-4">
      <title>3.2 Abductive type inference</title>
      <p>
        Abduction was introduced by Peirce [
        <xref ref-type="bibr" rid="ref14 ref7">7, 14</xref>
        ], and refers to
a process oriented to an explanatory hypothesis about the
precondition P reasoning on the consequence C. Compared
to pure deduction, induction and abduction have a lower
strength, much are practical when the set of assumptions is
not complete with respect to the observed world. Induction
cannot be made fully conclusive, since the inference from
a set of cases can be only made certain in a closed world
(which is not the case with the Web). Abduction, on its
turn, is formally equivalent to the logical fallacy a rming
the consequent or Post hoc ergo propter hoc, because there
are multiple possible explanations for C. In other words,
abducing P from C involves determining that C is su cient
(or nearly su cient), but not necessary, for P.
      </p>
      <p>
        We have used an abductive approach to infer the type of
DBpedia entities with two classi cation methods:
EKP-based. We assumed Encyclopedic Knowledge
Patterns (EKPs) extracted from wikilink relations in Wikipedia
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] as our background knowledge, and as the
adbuctive consequence C, on which we infer the type of
entities. In this context, entity types are our precondition
P. We want to infer types by analysing the similarity
between rules derived by EKPs, and the con gurations
of wikilink relations obtained from untyped entities in
DBpedia.
      </p>
      <p>
        Homotype-based. We de ne homotypes as wikilinks
that have the same type on both the subject and the
object of the triple. Since homotypes are usually the
most frequent (or in the top 3) wikilink types [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ],
we want to detect emerging homotypes for untyped
resources by summing the number of outgoing and
incoming wikilinks.
      </p>
      <p>
        EKP-based type adbuction. EKPs are de ned as sets of
type paths that occur most often above a certain threshold
t [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In this context, a path relates the types of two
entities having a wikilink. The frequency of paths composing
an EKP is calculated by the path popularity. The path
popularity is de ned as the ratio of how many distinct resources
of a certain type participate as subject in a path to the total
number of resources of that type. We use path popularity
values of each EKP in order to estimate the similarity
between the wikilinks of an entity and a EKP. For doing that
we build a labeled feature space model i j composed by:
i rows, each one labeled by a di erent top-level class
of the DBPO taxonomy. This means 0 i 27;
j features as columns consisting in the number of all
the classes in the DBPO. This means 0 j 272;
cell values that contain path popularity values
occurring between the type in the row and the type in the
column.
      </p>
      <p>The following is an example of feature space model built
for the class labels Person, Place and Oganisation over
the features Event, Work, Organisation, Person, Activity,
Place:</p>
      <p>Event Work
. . . . . . . . .</p>
      <p>Person 4.45 8.4
Organisation 3.21 7.78
Place 3.14 1.86
. . . . . . . . .</p>
      <p>Organisation Person Activity Place
. . . . . . . . . . . .
22.2 18.29 6.8 27.5
24.13 13.23 2.5 31.93
8.71 8.74 1.1 61.15
. . . . . . . . . . . .</p>
      <p>In the abductive approach, di erently from the inductive
one, the labeled feature space model is not used for training
the classi cation function. In fact, it already provides
general rules based on path popularity values over the wikilink
domain used to infer types.</p>
      <p>Instead, entities to be classi ed are represented in the same
way as in the inductive approach. Therefore, they are vector
models of length j, where j is the number of features used
in the model. Values in vectors consists of either 0 or 1, in
which, again:
0 means that no wikilink exists between the selected
individual and any other individual typed with the
corresponding DBPO class used in the feature;
1 means that at least one wikilink exists between the
selected individual and any other individual typed with
the corresponding DBPO class used in the feature.
Individuals are classi ed by nding similarities to EKPs by
applying a similarity metric based on the analysis of path
popularity values. This suggests what is the closest EKP to
the con guration of wikilinks that the individual presents.
Given a labeled feature space model M and a vector I
dened as described before, the similarity function is de ned
as follows:</p>
      <p>S(i) =</p>
      <p>PF
j=0 Mi;j Ij
PF
j=0 Mi;j
where</p>
      <sec id="sec-4-1">
        <title>F is the number of DBPO classes used as features;</title>
        <p>Mi;j is the path popularity value between the i-th class
used as label and the j-th class used as feature;
Ij is the 0 or 1 value that corresponds to the fact that
the entity has or has not a wikilink relation with an
entity typed with the j-th class used as feature;
F is the number of features used in the model and
0 j F ;
0 i L, where L is the number of available labels
in the model.
(a) Results of the abductive type inference experiment based on DBpedia top-level classes used both as labels
and features.</p>
        <p>(b) Results of the abductive type inference experiment based on classes Activity, Device,
Event, Infrastructure, MeanOfTransportation, Organisation, Person, Place and Work
used both as labels and features.</p>
        <p>The function S returns a similarity score calculated locally
with respect to a type, while we are interested in the score
that maximizes the similarity, hence:</p>
        <p>T = max S(i); 8 0
i</p>
        <p>L
Assuming that for a certain entity Y with respect to the
types Event, Work, Organisation, Person, Activity, Place
has the following wikilink con guration:</p>
        <p>Y</p>
        <p>Event Work
0 1</p>
        <p>Organisation Person Activity Place
0 1 0 0
We obtain that the entity Y is classi ed as Person, since the
value S(P erson) = (1 8:4 + 1 18:29)=87:64 = 0:3 is the
highest similarity value.</p>
        <p>In our rst abductive experiment we have focused only on
the top-level classes of the DBpedia taxonomy both for class
labels and for features. Both L and F are then equal to the
number of the top-level classes in the DBpedia taxonomy,
i.e., 27.</p>
        <p>The precision and recall of this experiment are both, as
shown in gure 1(a), 44.4%. The gure shows, besides the
precision and recall relative to single classes, additional
metrics, like the number of true positives, false positives, false
negatives, etc.</p>
        <p>What also emerges from gure 1(a) is that a small subset
(Device, Event, Infrastructure, MeanOfTransportation,
Organisation, Person, Place and Work) of classes contains
the majority of the individuals. For that reason, we have
tried to apply the abductive approach only to that subset
plus the class Activity, which instead has a very low
precision.</p>
        <p>In this second experiment we have used those 9 classes as
labels for classifying DBpedia entities over the same 9 classes
as features. Under these assumptions, the global precision
has fallen from 44.4% of the previous experiment down to
36.5, while the recall has grown from 44.4% up to 79.5%.</p>
        <p>Figure 1(b) shows the results of this experiment.</p>
        <p>Homotype-based abduction. In this case we use
abduction in order to guess the type of DBpedia individuals by
using homotypes as background knowledge. An homotype
is usually the most frequent (or in the top 3) wikilink type.</p>
        <p>For that reason, we want to infer the type of a resource by
detecting the emerging homotype. If most of the incoming
and outgoing wikilinks of a resource R is of a same type T ,
then we can infer, under the homotype assumption, that the
type of R is T .</p>
        <p>Given an individual i, we can then de ne the set W of all
the classes used to type individuals having an incoming or
an outgoing wikilink relation with i, as:</p>
        <p>W (i) = fhn; Xijn = #x 2 X s.t i ! x _ i</p>
        <p>xg
where
the notation 2 stands here for the rdf:type property;
the ! and stand for outgoing and incoming
dbpo:wikiPageWikiLink properties;
n is the number of wikilinks typed by a same class X;
&lt; n; X &gt; is any distinct couple that states how many
instances of the class X occur in the wikilinks of i.</p>
        <p>The homotype range H, emerging for an individual i, is
formalized as the type X having the highest value n in W (i),
i.e.,</p>
        <p>H(i) = fX
j hn; Xi 2 W (i)
=) 8hn0; X0i 2 W (i) ; n</p>
        <p>n0g
where, the notation 2 has here the classical meaning from set
theory. Figure 2 shows how the homotype range is selected
grouping the sum of outgoing and incoming wikilinks for
the resource dbpedia:Steve_Jobs grouped by their types.</p>
        <p>According to the de nition of homotype given, the resource
dbpedia:Steve_Jobs as shown in gure 2 should be typed
with the class dbpo:Person.</p>
        <p>
          For the classi cation experiment based on the homotype
definition, we have de ned a threshold that represents the
minimum number of wikilinks that a resource should have
in order to be classi able. In fact, below a certain number
of wikilinks, the homotype is less distinctive. The threshold
has been xed to be = 10 because, as reported by [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ],
the average number of outgoing wikilinks per resource in the
dbpedia_page_links_en data set of DBpedia is 10.
        </p>
        <p>The homotype-based classi cation of individuals from the
control set (with a known type) from the
dbpedia_instance_type_en data set produced a global
precision of 52.14% and a global recall of 85.87%. Figure 3
shows the details regarding global and local precision, recall
and also reports the number of true positives, false positives
and false negatives.</p>
        <p>The homotype-based approach produces a side-e ect. In
fact, more than one class can emerge with the same
frequency of wikilinks in H. In some cases, we get ex-aequo
classi cations, i.e. multi-typings. Multi-typing introduces
noise that produces a higher number of false positives. For
example, if an entity e, whose actual type is T1, is classi ed
with types T1, T2 and T3, then T1 will be counted as a true
positive, but at the same time T2 and T3 will be counted
as false negatives. In general multi-typings are not
desirable. Figure 4 shows the number of ex-aequo typings for
each cluster found, i.e. 88,845 occurrences with 2 ex-aequo
classes, 11,572 occurrences with 3 ex-aequo classes, 1,118
occurrences with 4 ex-aequo classes, 44 occurrences with 5
ex-aequo classes and, nally, 1 occurrence with 6 ex-aequo
classes, for a total of 101,580 ex-aequo classi cations.</p>
        <p>In order to reduce the noise of multi-typings in the
evaluation of the homotype-based approach, we have adjusted the
performance analysis by applying the following criteria in
case of ex-aequo:
if among the ex-aequo classes there is the actual type
of the entity, then count 1 true positive and 0 false
positives;
if among the ex-aequo classes there is not the actual
type of the entity, then count 0 true positives and 1
false positive;</p>
      </sec>
      <sec id="sec-4-2">
        <title>This increased the precision up to 55.07%.</title>
        <p>Figure 5 shows the trend of the precision and recall through
the various experiments. Considering the abductive
approach has reported the best results, we have decided to
run both the EKP-based and homotype-based classi cation
on a sample of 1,000 untyped resources, to be manually
evaluated. The classi er was again limited to the 9 core top-level
classes as described before.</p>
        <p>Manual evaluation reported precision and recall with
EKPbased classi cation as 5.98% and 7.9% respectively, while
with homotype-based classi cation as 13.93% and 65.51%
respectively. Figure 6 shows the details of the results of the
classi cations performed on the sample of 1,000 untyped
resources for the 9 top-level classes.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. RELATED WORK</title>
      <p>
        There is valuable research on knowledge extraction based on
the exploitation of Wikipedia. Ontology mining aims at
discovering hidden knowledge from ontological knowledge bases
by using data mining and machine learning techniques [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] proposes an extension of the k-Nearest Neighbor
algorithm for Description Logic KBs based on the exploitation
of an entropy-based dissimilarity measure, while [
        <xref ref-type="bibr" rid="ref2 ref6">2, 6</xref>
        ] makes
use of Support Vector Machine (SVN) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] rather than NN
to perform automatic classi cation. SVM performs instance
classi cation by implicitly mapping (through a kernel
function) training data to the input values in a higher
dimensional feature space where instances can be classi ed by
means of a linear classi er. The main di erence between
these and our approach is that they analyse all semantic
properties used for linking individuals, while we have look
for invariances on the usage of only one property, i.e.,
dbpo:wikiPageWikiLink, which attens the reasoning space.
The current procedure for assigning type to DBpedia
entity is completely manual. As extensively described in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
a limited number of infobox templates have been de ned
based on empirical observation of invariances in infoboxes of
Wikipedia pages having the same ontological type. Based on
previous work on wikilink invariances [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], we investigate the
automatic classi cation of DBpedia entities. The ontology
that we use for the classi cation is the DBpedia Ontology
(DBPO) that results from the manual procedure described
above, hence we inherit its limited ontological coverage. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]
presents YAGO, an ontology extracted from Wikipedia
categories and infoboxes that has been combined with
taxonomic relations from WordNet. Here the approach can be
described as a reengineering task for transforming a
thesaurus, i.e. Wikipedia category taxonomy, to an ontology,
(a) Results of the EKP-based classi cation of the sample of
1,000 untyped resources
(b) Results of the Homotype-based classi cation of the
sample of 1,000 untyped resources
which required accurate ontological analysis. There is a
signi cant overlap between YAGO and DBPO typed entities,
and entities that have only YAGO classes cover a small part
of the entities untyped with DBPO. Furthermore, there is a
lack of mapping between YAGO and DBPO, which makes
it di cult to exploit their merged coverage. Yago has a
larger coverage than DBPO, but it has only an overlap with
DBPO coverage; moreover, the granularity of Yago
categories is ner, and not easily reusable, because the top level
is very large. The size of the ontology, and the fact that
Yago adopts multi-typing, complicate helplessly the task of
automatic classi cation of types.
      </p>
    </sec>
    <sec id="sec-6">
      <title>5. DISCUSSION AND CONCLUSION</title>
      <p>
        We want to discover the types of all DBpedia resources.
Currently only a subset of them (about 15%, about 22% in
recent version 3.7) have explicit types that are coherently
organized into an ontology (DBpedia Ontology, DBPO).
We investigated type inference in two directions: (i) an
inductive approach typical of machine learning; (ii) an
abductive approach, in which we rstly used two already available
feature sets over the wikilink domain in order to infer types:
(a) the EKPs [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] that have been extracted from Wikipedia
with a statistical analysis over type paths generated by
wikilinks, and (b) the notion of emerging homotype, i.e. a link
that has the same types in the subject and object of the
RDF triple.
We observed two possible classi cation biases in the
DBpedia datasets. The rst bias is the ratio between the number
of typed resources having wikilinks and the total number of
resources with wikilinks, i.e., 1,518,697 out of more than 15
million. We suspected that this bias may be the result of
a partial ontological coverage, which derives from the
manual mappings of Wikipedia infoboxes used to extract
DBpedia resources to DBPO. The second, related, bias comes
from the ratio between the average typed wikilinks owned
by typed resources and the average typed wikilinks owned
by untyped resources resources, that is 66% for the former
and 37% for the latter.
      </p>
      <p>The results seem to con rm the factor caused by those
biases over the classi cation. In fact, while the results in
classifying a test set of typed DBpedia resources { precisions
44.4% and 55.07% with EKPs and homotypes respectively {
are relatively satisfactory (specially considering the amount
of classes), we observed a dramatic fall of the precision in
classifying untyped DBpedia resources, decreasing to 5.98%
with EKPs and 13.93% with homotypes.</p>
      <p>
        This means that, besides being still valid the results about
the cognitive soundness of EKPs in providing an e ective
and intuitive description of entities of a certain type (as
reported in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]), our hypothesis about the distinctive capacity
of EKPs is weak. This seems due to wide overlaps among
EKPs. The same overlaps emerge when applying
homotypebased classi cation as well. Figure 7 shows the overlaps
among the 4 largest classes of DBpedia, i.e., dbpo:Place,
dbpo:Person, dbpo:Organisation and dbpo:Work.
      </p>
      <p>Notice that the decrease in precision from the test set
to the untyped resource set spans between -38% and -41%
across the di erent approaches, probably revealing an
approximate 40% DBpedia resources that are true negatives
with reference to the existing DBPO, so providing a rough
quanti cation of the missing ontological coverage of the
current DBPO.</p>
      <p>
        For instance, a quick exploration of untyped resources
immediately evidences the need for types representing important
notions such as Plan, Agreement, Scienti cDiscipline,
Collection, Concept, etc.. It's quite interesting to remark that
these notions are on one hand harder to generalize from
infoboxes, but are also relatively established in existing
foundational ontologies and ontology design patterns [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
An area of improvement for DBPO and DBpedia typing is
therefore an extension of the ontology and the ways to use
the extension to type untyped resources.
      </p>
      <p>A second area of improvement might be related to the
imprecision deriving from the massive overlap among the
abovementioned four core classes of DBPO.</p>
      <p>However, we might consider a more critical attitude. While
we deem important the role of ontologies in accurately
distinguishing semantic types of things, specially in domains
and tasks requiring ne-grained types, it might be
interesting to explore an alternative vision of systematically
ambiguous ontologies, where some types tend to merge because of
their mutual dependence in the real world. For example, an
organization is typically dependent on persons, places, and
works, which on their turn can be often dependent on it:
social organization is a major example.</p>
      <p>
        Shifting our perspective, the distinctivity weakness of both
EKPs and homotypes in classifying DBpedia resources can
nd some basis in recent work [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], which con rms known
semiotic assumptions about the centrality of (systematic)
ambiguity in language, and poses signi cant challenges to
ontological theories assuming semantic \pedantry". An
interesting research topic in ontology design may investigate
the consequences of assigning high value to \clusterability" of
(systematically dependent) meaning dimensions rather than
to their distinctivity.
      </p>
      <p>
        There are many directions that these results open up. Some
of them include the update of EKPs to DBpedia 3.7 and a
further analysis in order to discover more distinctive features
in their extraction. Another direction involves a mixed
approach based both on inference and crowdsourcing through,
for instance, exploratory search methods on Linked Data
following the direction we took with Aemoo [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Other useful
directions include use of indexing techniques, deep parsing
of natural language, or social network analyses.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked Data-The story so far</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems</source>
          ,
          <volume>4</volume>
          (
          <issue>2</issue>
          ):1{
          <fpage>22</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bloehdorn</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sure</surname>
          </string-name>
          .
          <article-title>Kernel Methods for Mining Instance Data in Ontologies</article-title>
          . In K. Aberer, K.-S. Choi,
          <string-name>
            <given-names>N.</given-names>
            <surname>Noy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Allemang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-I.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J. B.</given-names>
            <surname>Nixon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Golbeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Maynard</surname>
          </string-name>
          , G. Schreiber, and P. Cudre-Mauroux, editors,
          <source>Proceedings of the 6th International Semantic Web Conference and the 2nd Asian Semantic Web Conference (ISWC</source>
          <year>2007</year>
          +
          <article-title>ASWC 2007)</article-title>
          , volume
          <volume>4825</volume>
          of Lecture Notes in Computer Science, pages
          <volume>58</volume>
          {
          <fpage>71</fpage>
          ,
          <string-name>
            <surname>Busan</surname>
          </string-name>
          , Korea,
          <year>November 2007</year>
          . Springer Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Christianini</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Shawe-Taylor</surname>
          </string-name>
          .
          <article-title>Support Vector Machines</article-title>
          and
          <article-title>Other kernel-based Learning Methods</article-title>
          . Cambridge University Press,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>C. d'Amato</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Fanizzi</surname>
            , and
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Esposito</surname>
          </string-name>
          .
          <article-title>Query Answering and Ontology Population: an Inductive Approach</article-title>
          . In M. Hauswirth,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koubarakis</surname>
          </string-name>
          , and S. Bechhofer, editors,
          <source>Proceedings of the 5th European Semantic Web Conference (ESWC</source>
          <year>2008</year>
          ), volume
          <volume>5021</volume>
          of Lecture Notes in Computer Science, Tenerife, Spain, June 2008. Springer Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>C. d'Amato</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Fanizzi</surname>
            , and
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Esposito</surname>
          </string-name>
          .
          <article-title>Inductive Learning for the Semantic Web: What does it buy?</article-title>
          <source>Semantic Web</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <volume>53</volume>
          {
          <fpage>59</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Fanizzi</surname>
          </string-name>
          , C. d'Amato, and
          <string-name>
            <given-names>F.</given-names>
            <surname>Esposito</surname>
          </string-name>
          .
          <article-title>Statistical Learning for Inductive Query Answering on OWL Ontologies</article-title>
          . In A. P. Sheth,
          <string-name>
            <given-names>S.</given-names>
            <surname>Staab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Paolucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Maynard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. W.</given-names>
            <surname>Finin</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          K. Thirunarayan, editors,
          <source>Proceedings of the 7th International Semantic Web Conference (ISWC</source>
          <year>2008</year>
          ), volume
          <volume>5318</volume>
          of Lecture Notes in Computer Science, pages
          <volume>195</volume>
          {
          <fpage>212</fpage>
          ,
          <string-name>
            <surname>Karlsruhe</surname>
          </string-name>
          , Germany,
          <year>October 2008</year>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Frankfurt</surname>
          </string-name>
          .
          <article-title>Peirce's notion of abduction</article-title>
          .
          <source>The Journal of Philosophy</source>
          ,
          <volume>55</volume>
          (
          <issue>14</issue>
          ):
          <volume>593</volume>
          {
          <fpage>597</fpage>
          ,
          <year>1958</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gangemi</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Presutti</surname>
          </string-name>
          . Handbook on Ontologies,
          <source>chapter Ontology Design Patterns. Springer, 2nd edition</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          , G. Kobilarov,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann. DBpedia - A Crystallization</surname>
          </string-name>
          <article-title>Point for the Web of Data</article-title>
          .
          <source>Journal of Web Semantics</source>
          ,
          <volume>7</volume>
          (
          <issue>3</issue>
          ):
          <volume>154</volume>
          {
          <fpage>165</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Miller and M. F. RDF</surname>
          </string-name>
          <article-title>Primer. W3c recommendation, W3C</article-title>
          , feb
          <year>2004</year>
          . http://www.w3.org/TR/2004/REC-rdf-primer20040210/.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Motik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Parsia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. F.</given-names>
            <surname>Patel-Schneider</surname>
          </string-name>
          .
          <article-title>OWL 2 web ontology language structural speci cation and functional-style syntax</article-title>
          .
          <source>W3C recommendation, W3C</source>
          , Oct.
          <year>2009</year>
          . http://www.w3.org/TR/2009/REC-owl2
          <string-name>
            <surname>-</surname>
          </string-name>
          syntax-20091027/.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Musetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Nuzzolese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Draicchio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Presutti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Blomqvist</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gangemi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Ciancarini</surname>
          </string-name>
          . Aemoo:
          <article-title>Exploratory search based on knowledge patterns over the semantic web</article-title>
          .
          <source>Semantic Web Challenge.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Nuzzolese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gangemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Presutti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Ciancarini</surname>
          </string-name>
          .
          <article-title>Encyclopedic Knowledge Patterns from Wikipedia Links</article-title>
          . In L. Aroyo,
          <string-name>
            <given-names>N.</given-names>
            <surname>Noy</surname>
          </string-name>
          , and C. Welty, editors,
          <source>Proceedings of the 10th International Semantic Web Conference (ISWC2011)</source>
          , pages
          <fpage>520</fpage>
          {
          <fpage>536</fpage>
          . Springer,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Peirce</surname>
          </string-name>
          .
          <article-title>Pragmatism as a principle and method of right thinking: The 1903 Harvard lectures on pragmatism</article-title>
          .
          <source>State Univ of New York Pr</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Piantadosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tily</surname>
          </string-name>
          , and
          <string-name>
            <surname>E. Gibson.</surname>
          </string-name>
          <article-title>The communicative function of ambiguity in language</article-title>
          .
          <source>Cognition</source>
          ,
          <volume>122</volume>
          (
          <issue>3</issue>
          ):
          <volume>280</volume>
          {
          <fpage>291</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>F.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          , G. Kasneci, and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum. Yago - A Large</surname>
          </string-name>
          <article-title>Ontology from Wikipedia and WordNet</article-title>
          .
          <source>Elsevier Journal of Web Semantics</source>
          ,
          <volume>6</volume>
          (
          <issue>3</issue>
          ):
          <volume>203</volume>
          {
          <fpage>217</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>