<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>From Mentions to Ontology: A Pilot Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Octavian Popescu</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bernardo Magnini</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emanuele Pianta</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luciano Serafini</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuela Speranza</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrei Tamilin</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>ITC-irst</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Italy</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>In this paper we propose a pilot study aimed at an in-depth comprehension of the phenomena underlying Ontology Population from text. The study has been carried out on a collection of Italian news articles, which have been manually annotated at several semantic levels. More specifically, we have annotated all the textual expressions (i.e. mentions) referring to Persons; each mention has been in turn decomposed into a number of attribute/value pairs; co-reference relations among mentions have been established, resulting in the identification of entities, which, finally, have been used to populate an ontology. The study yields two significant results. First, a number of factors have been empirically identified which determine the difficulty of Ontology Population from Text and which can now be taken into account while designing automatic systems. Second, the resulting dataset is a valuable resource for training and testing single components of Ontology Population systems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>In this paper we propose an empirical investigation into
the relations between language and knowledge, aiming at
the definition of a computational framework for automatic
Ontology Population (OP) from text.</p>
      <p>
        While Ontology Population from text has received
increasing attention in recent years
        <xref ref-type="bibr" rid="ref5">(see for instance, Buitelaar
et al. 2005)</xref>
        , mostly due to its strong relationship with the
Semantic Web perspective, very little has been done in order
to provide a clear definition of the task and to establish shared
evaluation procedures and benchmarks. In this paper we
propose a pilot study aimed at an in-depth comprehension of
the phenomena underlying Ontology Population from Text
(OPTM). Specifically, we are interested in highlighting the
following aspects of the task:
• What are the major sources of difficulty of the task?
• How does OP from text relate to well known tasks in
Natural Language Processing, such as Named Entity
Recognition?
      </p>
      <p>• What kinds of reasoning capabilities are crucial for the
task?</p>
      <p>• Is there any way to simplify the task so that it can be
addressed in a modular way?</p>
      <p>• Can we devise useful metrics to evaluate system
performance?</p>
      <p>We addressed the above questions through a pilot study on a
limited amount of textual data. We added two restrictions with
respect to the general OP task: first, we considered textual
mentions instead of full text; second, we focused on
information related to PERSON entities instead of considering
all possible entities (e.g. ORGANIZATION, LOCATION, etc.).</p>
      <p>
        Mentions, as defined within the ACE (Automatic Content
Extraction)1 Entity Detection Task
        <xref ref-type="bibr" rid="ref9">(Linguistic Data
Consortium, 2004)</xref>
        are portions of text that refer to entities. As
an example, given a particular textual context, the two
mentions “George W. Bush” and “the U.S. President” refer to
the same entity, i.e. a particular instance of PERSON whose first
name is “George”, whose middle initial is “W.”, whose family
name is “Bush” and whose role is “President of the U.S.”.
      </p>
      <p>As for PERSON entities, they were selected for our pilot
study because they occur very frequently in the news document
collection we analyzed. Most of the results we obtained,
however, are likely to generalize to other types of
entities.</p>
      <p>Given the above-mentioned restrictions, the contribution of
this paper is a thorough study of Ontology Population from
Textual Mentions (OPTM). We have manually extracted a
number of relevant details concerning entities of type PERSON
from the document collection and then used them to populate a
small pre-existing ontology. The study led to two significant
results. First, a number of factors have been
empirically identified which determine the difficulty of
Ontology Population from Text and which can now be taken
into account while designing automatic systems. Second, the
resulting dataset is a valuable resource for training and testing
single components of Ontology Population.</p>
      <p>We show that the difficulty of the OPTM task is directly
correlated to two factors: (A) the difficulty of identifying
attribute/value pairs inside a given mention and (B) the
difficulty of establishing co-reference between entities based
on the values of their attributes.</p>
      <p>
        There are several advantages of OPTM that make it
appealing for OLP. First, mentions provide an obvious
simplification with respect to the more general task of
Ontology Population from text
        <xref ref-type="bibr" rid="ref5">(cf. Buitelaar et al. 2005)</xref>
        ; in
addition, mentions are well defined and there are systems for
automatic mention recognition which can provide the input for
that task. Second, since mentions have been introduced as an
evolution of the traditional Named Entity Recognition task
        <xref ref-type="bibr" rid="ref10 ref11 ref12">(see Tanev and Magnini, 2006)</xref>
        , they guarantee a reasonable
level of complexity, which makes OPTM challenging both for
the Computational Linguistics and the Knowledge
Representation communities. Third, there already exist data
annotated with mentions, delivered under the ACE initiative
        <xref ref-type="bibr" rid="ref6 ref9">(Ferro et al. 2005, Linguistic Data Consortium 2004)</xref>
        , which
make it possible to exploit machine learning approaches.
      </p>
    </sec>
    <sec id="sec-2">
      <title>1 http://www.nist.gov/speech/tests/ace</title>
      <p>The availability of annotated data allows for a better estimation of
the performance of OPTM; in particular, it is possible to
evaluate the recall of the task, i.e. the proportion of
information correctly assigned to an entity out of the total
amount of information provided by a certain mention.</p>
      <p>The paper is structured as follows. Section II provides some
background on Ontology Population and reports on relevant
related work; Section III describes the dataset of the PERSON
pilot study and compares it to the ACE dataset. Section IV
introduces a new methodology for the semantic annotation of
attribute/value pairs within textual mentions. In Section V we
describe the ontology we plan to use. Finally, Section VI
reports on a quantitative and qualitative analysis of the data,
which helps determine the main sources of difficulty of the
task. Conclusions are drawn in Section VII.</p>
    </sec>
    <sec id="sec-3">
      <title>II. RELATED WORK</title>
      <p>
        Automatic Ontology Population (OP) from texts has
recently emerged as a new field of application for knowledge
acquisition techniques
        <xref ref-type="bibr" rid="ref5">(Buitelaar et al., 2005)</xref>
        . Although there
is no widely accepted definition for the OP task, a useful
approximation has been suggested by
        <xref ref-type="bibr" rid="ref3">(Bontcheva and
Cunningham, 2003)</xref>
        as Ontology Driven Information
Extraction with the goal of extracting and classifying instances
of concepts and relations defined in an ontology, in place of
filling a template. A similar task has been approached from a
variety of perspectives, including term clustering
        <xref ref-type="bibr" rid="ref1 ref8">(Lin,
1998; Almuhareb and Poesio, 2004)</xref>
        and term categorization
        <xref ref-type="bibr" rid="ref2">(Avancini et al., 2003)</xref>
        . A rather different task is Ontology
Learning, where new concepts and relations are supposed to be
acquired with the consequence of changing the definition of
the Ontology itself
        <xref ref-type="bibr" rid="ref13">(Velardi et al. 2005)</xref>
        .
      </p>
      <p>The interest in OP is also reflected in the large number of
research projects which consider knowledge extraction from
text a key technology for feeding Semantic Web applications.
Among such projects, it is worth mentioning Vikef (Making
the Semantic Web Fly), whose main aim is to bridge the gap
between implicit information expressed in scientific
documents and its explicit representation found in knowledge
bases; and Parmenides, which is attempting to develop
technologies for the semi-automatic building and maintenance
of domain-specific ontologies.</p>
      <p>The work presented in this paper has been inspired by the
ACE Entity Detection task, which requires that the entities
mentioned in a text (e.g. PERSON, ORGANIZATION, LOCATION
and GEO-POLITICAL ENTITY) be detected. As the same entity
may be mentioned more than once in the same text, ACE
defines two inter-connected levels of annotation: the level of
the entity, which provides a representation of an object in the
world, and the level of the entity mention, which provides
information about the textual references to that object. The
information contained in the textual references to entities may
be translated into a knowledge base, and eventually into an
Ontology.</p>
    </sec>
    <sec id="sec-4">
      <title>III. DATA SET</title>
      <p>
        The input of OPTM consists of textual mentions derived
from the Italian Content Annotation Bank (I-CAB), which
consists of 525 news documents taken from the local
newspaper ‘L’Adige’2, for a total of around 180,000 words
        <xref ref-type="bibr" rid="ref10 ref11 ref12">(Magnini et al., 2006)</xref>
        . The annotation of I-CAB has been
carried out manually within the Ontotext project3, following
the ACE annotation guidelines for the Entity Detection task.
I-CAB is annotated with expressions of type
TEMPORAL_EXPRESSION and four types of entities: PERSON,
ORGANIZATION, GEO-POLITICAL ENTITY and LOCATION. Due to
the morpho-syntactic differences between the two languages,
the ACE annotation guidelines for English had to be adapted
to Italian; for instance, two specific new tags, PROCLIT and
ENCLIT, have been created to annotate clitics attached to the
beginning or the end of certain words (e.g. &lt;veder[lo]&gt;/to see
him).
      </p>
      <p>According to the ACE definition, entity mentions are
portions of text referring to entities; the extent of this portion
of text consists of an entire nominal phrase, thus including
modifiers, prepositional phrases and dependent clauses (e.g. &lt;il
[ricercatore] che lavora presso l’ITC-irst&gt;/the researcher who
works at ITC-irst).</p>
      <p>Mentions are classified according to four syntactic
categories: NAM (proper names), NOM (nominal
constructions), PRO (pronouns) and PRE (modifiers).
      </p>
      <p>[Figure 1: percentage of mentions per syntactic category (NAM, NOM, PRE, PRO) for each corpus, with the total number of mentions in parentheses: ACE ENG NWIRE (5186) and I-CAB (28353).]</p>
      <p>In spite of the adaptations to Italian, it is interesting to
notice that a comparison between I-CAB and the newswire
portion of the ACE 2004 Evaluation corpus (see Figure 1)
shows a similar proportion of NAM and NOM mentions in the
two corpora. On the other hand, there is a low percentage of
PRO mentions in Italian, which can be explained by the fact
that, unlike in English, subject pronouns in Italian can be
omitted. As for the large difference in the total number of
mentions annotated in the two corpora (22,500 and 5,186 in
I-CAB and ACE NWIRE respectively), this is proportional to
their size (around 180,000 words for I-CAB and 25,900 words
for ACE NWIRE), considering that some of the ACE entities
(i.e. FACILITY, VEHICLE, and WEAPON) are not annotated in I-CAB.</p>
    </sec>
    <sec id="sec-5">
      <title>2 http://www.ladige.it/ 3 http://tcc.itc.it/projects/ontotext/index.html</title>
      <p>As shown in Figure 2, the two corpora also present a similar
distribution as far as the number of mentions per entity is
concerned. In fact, in both cases more than 60% of the entities
are mentioned only once, while around 15% are mentioned
twice. Between 10% and 15% are mentioned three or four
times, while around 6% are mentioned between five and eight
times. The fact that the percentage of entities mentioned more
than eight times in a document is higher in the ACE corpus
than in I-CAB can be partly explained by the fact that the news
stories in ACE are on average slightly longer than those in
I-CAB (around 470 versus 350 words per document).</p>
      <p>[Figure 2: percentage of total entities by number of mentions of the entity in a document (1, 2, 3 to 4, 5 to 8, &gt; 8), for I-CAB and ACE ENG NWIRE.]</p>
      <p>IV. ATTRIBUTES for TYPE PERSON</p>
      <p>After the annotation of mentions of type PERSON reported in
the previous section, each mention was additionally annotated
in order to identify the semantic information expressed by
the mention regarding a specific entity. As an example, given
the mention “the Italian President Ciampi”, the following
attribute/value pairs were annotated: [PROVENANCE: Italian],
[ROLE: President] and [LAST_NAME: Ciampi].</p>
      <p>The definition of the set of attributes for PERSON followed
an iterative process where we considered increasing amounts
of mentions from which we derived relevant attributes. The
final set of attributes is listed in the first column of Table 1,
with respective examples reported in the second column.</p>
      <p>
        A strict methodology is required in order to ensure accurate
annotation. As general guidelines for annotation, articles and
prepositions are not admitted at the beginning of the textual
extent of a value, an exception being made in the case of the
articles in nicknames
        <xref ref-type="bibr" rid="ref10 ref11 ref12">(see Magnini et al., 2006B for a full
description of the criteria used to decide on border cases)</xref>
        .
      </p>
      <p>Attributes can be grouped into bigger units, as in the case of
the attribute JOB, which is composed of three attributes,
ACTIVITY, ROLE, and AFFILIATION, which are not independent
of each other. ACTIVITY refers to the actual activity performed
by the person, while ROLE refers to the position they occupy.
So, for instance, “politician” is a possible value of the attribute
ACTIVITY, while “leader of the Labour Party” refers to the
ROLE a person plays inside an organization. Each group of
these three attributes is associated with a mention and all the
information within a group has to be derived from the same
mention. If different pieces of information derive from distinct
mentions, we will have two separate groups. For instance, the
three co-referring mentions “the journalist of Radio Liberty”,
“the redactor of breaking news”, and “a spare time
astronomer” lead to three different groups of ACTIVITY, ROLE
and AFFILIATION. The obvious inference that the first two
mentions belong conceptually to the same group is not drawn.
This step is to be taken at a further stage.</p>
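      <p>The grouping rule described above can be illustrated with a minimal Python sketch. The class and function names are hypothetical, and the assignment of each phrase to ACTIVITY, ROLE or AFFILIATION is only an illustrative guess, not the annotation actually produced for the corpus. The point shown is that each mention carries its own JOB group and that groups from co-referring mentions are not merged at this stage.</p>
      <preformat>
```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class JobGroup:
    """ACTIVITY, ROLE and AFFILIATION values taken from one single mention."""
    activity: Optional[str] = None
    role: Optional[str] = None
    affiliation: Optional[str] = None


@dataclass
class Mention:
    """A textual mention with its attribute/value pairs (hypothetical layout)."""
    text: str
    attributes: dict = field(default_factory=dict)  # e.g. {"LAST_NAME": "Ciampi"}
    job: Optional[JobGroup] = None


def job_groups(mentions):
    """Collect one JOB group per mention; groups are never merged here."""
    return [m.job for m in mentions if m.job is not None]


# Three co-referring mentions; the attribute assignment is illustrative only.
mentions = [
    Mention("the journalist of Radio Liberty",
            job=JobGroup(activity="journalist", affiliation="Radio Liberty")),
    Mention("the redactor of breaking news",
            job=JobGroup(role="redactor of breaking news")),
    Mention("a spare time astronomer",
            job=JobGroup(activity="astronomer")),
]
groups = job_groups(mentions)
print(len(groups))
```
</preformat>
      <p>Keeping one group per mention preserves the provenance of each value, so that a later stage can decide whether two groups describe the same job.</p>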
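      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Attributes</th><th>Values</th></tr>
          </thead>
          <tbody>
            <tr><td>FIRST_NAME</td><td>Ralph, Greg</td></tr>
            <tr><td>MIDDLE_NAME</td><td>J., W.</td></tr>
            <tr><td>LAST_NAME</td><td>McCarthy, Newton</td></tr>
            <tr><td>NICKNAME</td><td>Spider, Enigmista</td></tr>
            <tr><td>TITLE</td><td>Prof., Mr.</td></tr>
            <tr><td>SEX</td><td>actress</td></tr>
            <tr><td>ACTIVITY</td><td>author, doctor</td></tr>
            <tr><td>AFFILIATION</td><td>The New York Times</td></tr>
            <tr><td>ROLE</td><td>manager, president</td></tr>
            <tr><td>PROVENANCE</td><td>South American</td></tr>
            <tr><td>FAMILY_RELATION</td><td>father, cousin</td></tr>
            <tr><td>AGE_CATEGORY</td><td>boy, girl</td></tr>
            <tr><td>HONORARY</td><td>the world champion 2000</td></tr>
            <tr><td>MISCELLANEA</td><td>The men with red shoes</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-6">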
      <p>We started with the set of 525 documents belonging to the
I-CAB corpus (see Section III), for which we have manually
annotated all PERSON entities (10039 mentions, see Table 2).
The annotation identifies both the entities mentioned within
a single document, called document entities, and the entities
mentioned across the whole set of news stories, called
collection entities. In addition, for the purposes of this work,
we decided to filter out the following mentions: (i) mentions
consisting only of one non-gender-discriminative pronoun; (ii)
nested mentions: when a mention contains a smaller
one, as in “the president Ciampi”, which includes “Ciampi”,
only the largest mention was
considered. In this way we obtained a set of 7233 mentions,
which represents the object of our study.</p>
    </sec>
    <sec id="sec-7">
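      <table-wrap>
        <label>Table 2</label>
        <table>
          <tbody>
            <tr><td>Number of documents</td><td>525</td></tr>
            <tr><td>Number of mentions</td><td>10039</td></tr>
            <tr><td>Number of meaningful mentions</td><td>7233</td></tr>
            <tr><td>Number of distinct meaningful mentions</td><td>4851</td></tr>
            <tr><td>Number of document entities</td><td>3284</td></tr>
            <tr><td>Number of collection entities</td><td>2574</td></tr>
          </tbody>
        </table>
      </table-wrap>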
      <p>The average number of meaningful mentions for an entity in
a certain document is 2.20, while the average number of
distinct meaningful mentions is 1.47. However, the variation
from the average is high: only 14% of document entities are
mentioned exactly twice. In fact, there are relatively few
entities whose mentions in news have a broad coverage in
terms of attributes, and there are quite a few whose mentions
contain just the name. A detailed analysis is carried out in
Section VI.</p>
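      <p>As a rough check, the two averages can be recomputed from the counts reported in the text (7233 meaningful mentions, 4851 distinct mentions and 3284 document entities). The sketch below assumes that each average is simply the ratio of the corresponding count to the number of document entities; the function and constant names are hypothetical.</p>
      <preformat>
```python
# Counts reported in the text for the PERSON dataset.
MEANINGFUL_MENTIONS = 7233
DISTINCT_MEANINGFUL_MENTIONS = 4851
DOCUMENT_ENTITIES = 3284


def mentions_per_entity(mentions, entities):
    """Average number of mentions per document entity (assumed: a plain ratio)."""
    return mentions / entities


avg_all = mentions_per_entity(MEANINGFUL_MENTIONS, DOCUMENT_ENTITIES)
avg_distinct = mentions_per_entity(DISTINCT_MEANINGFUL_MENTIONS, DOCUMENT_ENTITIES)
print(f"{avg_all:.2f} {avg_distinct:.2f}")
```
</preformat>
      <p>Under this assumption both ratios agree with the reported averages to within one hundredth.</p>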
    </sec>
    <sec id="sec-8">
      <title>V. ONTOLOGY</title>
      <p>The ontology adopted for the OPTM task is composed of
two main parts. The first part mirrors the mention attribute
structure and contains axioms (restrictions) on the attribute
values. In this part, which we refer to as the Entity T-Box
(ET-box), we define three main classes corresponding to the three
main entities, PERSON, ORGANIZATION and GEO-POLITICAL
ENTITY. Each of these classes is associated with the mention
attributes. An example of how the attributes are encoded in
axioms in the ET-box is provided in Table 3.</p>
      <sec id="sec-8-1">
        <table-wrap>
          <label>Table 3</label>
          <table>
            <thead>
              <tr><th>Ontology axiom</th><th>Encoded restriction</th></tr>
            </thead>
            <tbody>
              <tr><td>PERSON ⊆ (&gt;0)HAS_FIRST_NAME</td><td>Every person has at least one first name</td></tr>
              <tr><td>PERSON ⊆ (=1)HAS_LAST_NAME</td><td>Every person has exactly one last name</td></tr>
              <tr><td>DOMAIN(HAS_FIRST_NAME) = PERSON</td><td>The first argument of the relation HAS_FIRST_NAME must be a person</td></tr>
              <tr><td>RANGE(HAS_PROVENANCE) = GEOPOLITICALENTITY</td><td>The second argument of the relation HAS_PROVENANCE must be a geopolitical entity</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-8-1-1">
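          <p>To make the two cardinality restrictions concrete, the following Python sketch checks a PERSON record against them. It is a closed-world simplification written only for illustration: an ontology reasoner would treat these axioms under open-world semantics, and the record layout and function name are hypothetical.</p>
          <preformat>
```python
def check_person(person):
    """Check a PERSON record against two ET-box style cardinality restrictions.

    person maps attribute names to lists of values.
    Returns a list of violated restrictions (empty if none).
    """
    violations = []
    # PERSON must have at least one HAS_FIRST_NAME value.
    if len(person.get("FIRST_NAME", [])) == 0:
        violations.append("at least one FIRST_NAME required")
    # PERSON must have exactly one HAS_LAST_NAME value.
    if len(person.get("LAST_NAME", [])) != 1:
        violations.append("exactly one LAST_NAME required")
    return violations


print(check_person({"FIRST_NAME": ["Bob", "B."], "LAST_NAME": ["Marley"]}))
print(check_person({"LAST_NAME": ["Ciampi", "Azeglio"]}))
```
</preformat>
          <p>The first record satisfies both restrictions, while the second violates both, since it has no first name and two last names.</p>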
          <p>The second component of the ontology, called world
knowledge (WK), encodes the basic knowledge about the
world already available (see Table 4 for examples of axioms).
This ontology has been semi-automatically constructed starting
from the large amount of basic information available on the
web. Examples of such knowledge are the sets of countries,
main cities, country capitals, Italian municipalities, etc.</p>
        </sec>
      </sec>
      <sec id="sec-8-2">
        <title>ONTOLOGY AXIOM</title>
        <p>As can be seen from the above examples, WK is composed of
two types of knowledge: factual knowledge (the first two
axioms in Table 4) and generic commonsense knowledge. The
first type of knowledge can be obtained from the many
ontological resources available on the web (see for instance
swoogle.umbc.edu), while we have manually encoded the
second in the ontology.</p>
        <p>The process of OPTM combines the ontology ET-box with
WK axioms and values of attributes recognized in textual
mentions, and performs two main steps:</p>
        <p>1. For each entry recognized in the text we create a
new individual in the ontology, along with the individuals
corresponding to the attribute values</p>
        <p>2. We normalize the values by comparing the “string”
values with the individuals present in the WK.</p>
        <p>As an example of this process, consider the entry in Table 5.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <table-wrap>
        <label>Table 5</label>
        <table>
          <tbody>
            <tr><td>FIRST_NAME</td><td>Bob, B.</td></tr>
            <tr><td>LAST_NAME</td><td>Marley</td></tr>
            <tr><td>PROVENANCE</td><td>Caribbean</td></tr>
            <tr><td>ACTIVITY</td><td>musician, guitar player</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-11">
      <p>
        In the second phase, we attempt to match the values to the
individuals in the WK and the Ontology is modified according
to the result of the matching process. This process is based on
the semantic matching approach described in
        <xref ref-type="bibr" rid="ref4">(Bouquet, 2003)</xref>
        .
      </p>
      <p>In this phase the WK part of the ontology plays a crucial
role. The main goal of this phase is to find the best match
between the values of an attribute and the individuals which
are already present in the WK A-box. This process can have
two outputs. When a good-enough match is found between an
attribute value and an individual of the WK A-box, then an
equality assertion is added. Suppose for instance that the WK
A-box contains the statement</p>
      <p>STATE(Caribbean)
then the mapping process will find a high match between the
value “Caribbean” (as a string) and the individual Caribbean
(due to the syntactic similarity between the two strings, and the
fact that both are associated to individuals of type
GEOPOLITICALENTITY). As a consequence the assertion</p>
      <p>Geo_pol_entity35 = Caribbean
is asserted in the A-box. Notice that the above assertion
connects an individual of the WK with the value of an entity
contained in the entity repository of the mentions.</p>
      <p>When the mapping process does not produce a “good”
mapping (where “good” is defined with respect to a suitable distance
measure not described here), the value is transformed into an
individual and added to the WK A-box. For instance, suppose
that the mapping of the value “guitar player” does not produce
a good matching value; then the new assertion</p>
      <p>ACTIVITY(GuitarPlayer)
is added to the WK A-box and the assertion</p>
      <p>activity44 = GuitarPlayer
is added to the A-box that links WK with the A-box of the
mentions.</p>
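      <p>The normalization step can be sketched in Python as follows. This is not the semantic matching approach of Bouquet (2003): the similarity function from the standard difflib module, the threshold value and the flat dictionary standing in for the WK A-box are all assumptions made for illustration, and the class hierarchy (for example STATE as a kind of GEOPOLITICALENTITY) is ignored.</p>
      <preformat>
```python
from difflib import SequenceMatcher

# Hypothetical threshold; the paper does not specify its distance measure.
THRESHOLD = 0.8


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (a stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def normalize(value, value_type, wk_abox):
    """Match a string value against WK individuals of the same type.

    wk_abox maps individual names to their class.
    Returns (individual, created): the matched or newly created individual,
    and a flag telling whether it had to be added to the WK A-box.
    """
    candidates = [name for name, cls in wk_abox.items() if cls == value_type]
    best = max(candidates, key=lambda name: similarity(value, name), default=None)
    if best is not None and similarity(value, best) >= THRESHOLD:
        return best, False
    # No good match: turn the value into a new individual, e.g. "GuitarPlayer".
    new_individual = "".join(word.capitalize() for word in value.split())
    wk_abox[new_individual] = value_type
    return new_individual, True


wk = {"Caribbean": "GEOPOLITICALENTITY"}
print(normalize("Caribbean", "GEOPOLITICALENTITY", wk))
print(normalize("guitar player", "ACTIVITY", wk))
```
</preformat>
      <p>In the first call the value is linked to an existing individual, which corresponds to adding an equality assertion; in the second call no candidate exists, so a new individual is created and added to the A-box.</p>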
      <p>VI. PERSON DATASET ANALYSIS</p>
      <p>The difficulty of the OPTM task is directly correlated with
two factors: (A) the difficulty of identifying the attribute/value
pairs inside a given mention and (B) the difficulty of
establishing the co-reference of entities based on the values of
their attributes.</p>
      <p>In Table 7 we find the distribution of the values of the
attributes defined for PERSON. The first column lists the set of
attributes; the second column lists the number of occurrences
of each attribute; the third lists the number of different values
that the attribute actually takes; the fourth column lists the
number of collection entities which have that attribute. Using
this table as a base, we try to determine the parameters
which give us clues on the two factors above.</p>
      <sec id="sec-11-1">
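        <table-wrap>
          <label>Table 7</label>
          <table>
            <thead>
              <tr><th>Attribute</th><th>Occurrences of attribute in mentions</th><th>Different values for attribute</th><th>Collection entities with attribute</th><th>Distinct values within distinct mentions</th><th>Variability of values in attribute</th></tr>
            </thead>
            <tbody>
              <tr><td>FIRST_NAME</td><td/><td/><td/><td>13%</td><td>29%</td></tr>
              <tr><td>MIDDLE_NAME</td><td/><td/><td/><td>1%</td><td>60%</td></tr>
              <tr><td>LAST_NAME</td><td/><td/><td/><td>39%</td><td>45%</td></tr>
              <tr><td>NICKNAME</td><td/><td/><td/><td>0%</td><td>60%</td></tr>
              <tr><td>TITLE</td><td/><td/><td/><td>0%</td><td>34%</td></tr>
              <tr><td>SEX</td><td/><td/><td/><td>38%</td><td>50%</td></tr>
              <tr><td>ACTIVITY</td><td/><td/><td/><td>6%</td><td>33%</td></tr>
              <tr><td>AFFILIATION</td><td/><td/><td/><td>8%</td><td>68%</td></tr>
              <tr><td>ROLE</td><td/><td/><td/><td>4%</td><td>39%</td></tr>
              <tr><td>PROVENANCE</td><td/><td/><td/><td>4%</td><td>48%</td></tr>
              <tr><td>FAMILY_RELATION</td><td/><td/><td/><td>0%</td><td>34%</td></tr>
              <tr><td>AGE_CATEGORY</td><td/><td/><td/><td>2%</td><td>34%</td></tr>
              <tr><td>HONORARY</td><td/><td/><td/><td>1%</td><td>91%</td></tr>
              <tr><td>MISCELLANEA</td><td/><td/><td/><td>5%</td><td>97%</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>[Table 8: number of mentions by number of attributes (1 to 12). Surviving #mentions values: 34 (0,04%), 19, 4, 4, 0, 0.]</p>
        <p>A. Difficulty of identifying attribute/value pairs</p>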
        <p>The identification of attribute/value pairs requires the
correct decomposition of the mentions into non-overlapping
parts, each one carrying the value of one attribute. We are
interested in estimating the distribution of attributes inside the
mentions. Table 8 shows, in the second and fourth columns,
the number of mentions which contain respectively 1, 2, 3, …,
12 attributes. As we can see, the number of mentions having
more than 6 attributes is insignificant. On the other hand, the
number of mentions containing more than one attribute is
3564, which represents 49.27% of the total; therefore one in
two mentions is a complex mention. Usually, a complex
mention contains a SEX value, so a two-attribute
mention practically has just one attribute that might help in establishing
co-reference. However, 92% of the mentions with up to 5
attributes, which cover 96% of all mentions, contain a NAME
attribute, which, presumably, is an important piece of evidence
in deciding on co-reference.</p>
      </sec>
    </sec>
    <sec id="sec-23">
      <p>A second difficulty of correctly identifying the
attribute/value pairs comes from the combinatorial capacities
of attributes inside a complex mention. If the diversity of
attribute patterns in a complex mention is high, then the
difficulty of their recognition is also high. Table 9 shows that
the whole set of attributes is very well represented in the
complex mentions and, interestingly, the number of attributes
varies independently of the number of mentions, therefore
their combinatorial capacity is high. The difficulty of their
recognition varies accordingly.</p>
      <p>The distribution of attributes inside mentions is presented in
the second column of Table 7, in parentheses. The figures give
the probability that a person is mentioned by making reference
to a certain attribute. For example, one may expect the
LAST_NAME attribute to be present in 57% of mentions, and
the NICKNAME attribute to be present in 0.001% of the total. In
the fifth column we compute the same figures without
repetition, considering the distinct values and distinct
mentions. Considering also the figures that show the linguistic
variability of values, we may obtain the probability of seeing a
previously unseen value of a given attribute. The last column
of Table 7 shows the variability of values for each attribute.
For example, taking a mention of FIRST_NAME at random, in only
29% of the cases is that value seen in the dataset just once.</p>
      <p>The fifth column, distinct values within distinct mentions,
and the sixth, variability of values in attribute, offer us insight
into the difficulty of recognizing attribute/value pairs. The
variability might be considered as representative of the amount
of training a system needs in order to have a satisfactory
coverage of cases. Intuitively, some of the attributes are closed
classes, while others, e.g. those that take names as
values, are open classes.</p>
      <p>Recognizing certain types of attributes is probably more
important than recognizing others. If the occurrence of a
new value of an important attribute is a rare event, a system
must be very precise in catching these cases. We may assume
that high precision is more difficult to achieve than lower
precision. The “distinct” column gives us a clue on this issue. For
example, the relatively low figures for ACTIVITY, AFFILIATION and
ROLE, combined with their importance for the OPTM task, tell
us that sparseness could be an issue and that these attributes
therefore require precise treatment. Otherwise it will be
hard to achieve the expected results.</p>
      <p>Finally, we may notice that 39% of the mentions carry
information other than SEX and name-related values,
MISCELLANEA excluded. Therefore in all those cases the
ontology is enriched with substantial information about the
respective persons.</p>
      <p>B. Difficulty of establishing co-reference among entities</p>
      <p>The task of correctly identifying the value of a certain
attribute inside a given mention is worth undertaking if
the respective values play a role in other tasks, especially in
the co-reference task. A relevant factor for co-reference is the
perplexity of an attribute, i.e. the percentage of the entities
characterized by a particular value, computed as the ratio
between distinct values for a certain attribute and collection
entities having that attribute (column III / IV in Table 7). For
example, the perplexity of LAST_NAME is 14% (see Table 10).
Therefore, if we randomly take some values of LAST_NAME,
86% of them point to just one person. In the case of
SEX and MISCELLANEA, the perplexity is not defined.</p>
      <p>By comparing the perplexity of LAST_NAME and
MIDDLE_NAME one might erroneously conclude that the latter
is more discriminative. This fact is due to the small number of
examples of MIDDLE_NAME values within the PERSON dataset.
Considering the occurrences of one attribute independently of
another we may use the usual rule of thumb for the Bernoulli
distribution. That is, it is highly likely that the perplexity of
FIRST_NAME, LAST_NAME, ACTIVITY, AFFILIATION, ROLE and
PROVENANCE will not change with the addition of new
examples, as the actual numbers are high.</p>
    </sec>
    <sec id="sec-29">
      <p>We can estimate the probability that two entities selected
from different documents co-refer. Actually, this is the
estimate of the probability that two entities co-refer
conditioned by the fact that they have been correctly identified
inside the documents. We can compute this probability as the
complement of the ratio between the number of different
entities and the number of document entities in the
collection.</p>
      <p><disp-formula><tex-math><![CDATA[P(\text{co-ref}) = 1 - \frac{\#\,\text{collection-entities}}{\#\,\text{document-entities}}]]></tex-math></disp-formula></p>
      <p>From Table 2 we read these values as 2574 and 3284
respectively; therefore, for the PERSON dataset, the probability
of cross-document co-reference is approximately 22%. We
consider this figure only partially indicative, and it
is very likely to increase after inspection of bigger
corpora. This is an a posteriori probability because the number
of collection entities is known only after the whole set of
mentions has been processed.</p>
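      <p>The estimate above is a one-line computation; the Python sketch below reproduces it from the two counts quoted in the text. The function name is hypothetical.</p>
      <preformat>
```python
def coref_probability(collection_entities, document_entities):
    """Complement of the ratio between collection and document entities."""
    return 1 - collection_entities / document_entities


# Counts quoted in the text for the PERSON dataset.
p = coref_probability(2574, 3284)
print(f"{p:.3f}")
```
</preformat>
      <p>When every document entity is also a distinct collection entity, the two counts coincide and the estimate drops to 0.</p>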
      <p>A global estimator of the difficulty of co-reference is
the expectation that a correctly identified mention refers to a
new entity. This estimator shows the density of collection
entities in the mention space: let us call it co-reference
density. We can estimate the co-reference density as the ratio
between the number of distinct entities and the number of
mentions.</p>
      <p><disp-formula><tex-math>\text{coref-density} = \frac{\#\,\text{collection-entities}}{\#\,\text{mentions}}</tex-math></disp-formula></p>
      <p>The co-reference density takes values in the interval
[0, 1]. When the co-reference density tends to 0,
all the mentions refer to the same entity, while
when it tends to 1, each mention in the
collection refers to a different entity. Both limits render the
co-reference task superfluous. The co-reference
density we found in our corpus is 2574/7233 ≈ 0.36,
far from either extreme.</p>
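      <p>The behaviour of the co-reference density at its two limits can be illustrated with a small sketch. The entity identifiers below are hypothetical toy data, not taken from the corpus, and the function name coref_density is our own assumption.</p>
      <preformat>
```python
from typing import Hashable, Sequence


def coref_density(mention_entities: Sequence[Hashable]) -> float:
    """Ratio of distinct entities to mentions; one entity id per mention."""
    if not mention_entities:
        raise ValueError("at least one mention is required")
    return len(set(mention_entities)) / len(mention_entities)


# Hypothetical entity ids, one per annotated mention (not corpus data).
all_same = ["e1", "e1", "e1", "e1"]       # every mention names one entity
all_different = ["e1", "e2", "e3", "e4"]  # every mention names a new entity
mixed = ["e1", "e2", "e1", "e3", "e2", "e1"]

print(coref_density(all_same))       # 0.25, tends to 0 as mentions grow
print(coref_density(all_different))  # 1.0
print(coref_density(mixed))          # 0.5
```
</preformat>
      <p>With a finite number of mentions the lower limit is never reached exactly: when all n mentions refer to one entity the density is 1/n.</p>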
      <p>A measure that can be used as a baseline for the
co-reference task is the value of co-reference density conditioned
on knowing in advance whether two mentions
that are identical also co-refer. Let us call this measure
pseudo-co-reference density. It shows the maximum accuracy
of a system that deals with ambiguity by ignoring it. We
approximate it as the ratio between the number of distinct
entities and the number of distinct mentions.</p>
      <p><disp-formula><tex-math>\text{p-coref-density} = \frac{\#\,\text{collection-entities}}{\#\,\text{distinct-mentions}}</tex-math></disp-formula></p>
      <p>The pseudo-co-reference density for our dataset is 2574/4851 ≈
0.53. This information is not directly expressed in the
collection, so it has to be approximated. The difference
between the co-reference density and the pseudo-co-reference density
shows the increase in recall obtained by assuming that two identical
mentions refer to the same entity with probability 1. On the
other hand, the loss in accuracy might be too large (consider,
for example, the case in which two different persons happen to
have the same first name).</p>
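      <p>The following sketch illustrates the pseudo-co-reference density and the risk of merging identical mention strings. The mention strings, the entity identifiers and both function names are hypothetical and serve only as an illustration under these assumptions.</p>
      <preformat>
```python
from typing import Dict, List, Set, Tuple


def pseudo_coref_density(mentions: List[Tuple[str, str]]) -> float:
    """Ratio of distinct entities to distinct mention strings."""
    if not mentions:
        raise ValueError("at least one mention is required")
    entities = {entity for _, entity in mentions}
    strings = {text for text, _ in mentions}
    return len(entities) / len(strings)


def ambiguous_strings(mentions: List[Tuple[str, str]]) -> List[str]:
    """Mention strings that a pure string-matching merge would conflate."""
    by_string: Dict[str, Set[str]] = {}
    for text, entity in mentions:
        by_string.setdefault(text, set()).add(entity)
    return sorted(text for text, ents in by_string.items() if len(ents) > 1)


# Hypothetical (mention string, gold entity id) pairs, not corpus data.
toy = [
    ("Paolo Rossi", "e1"),
    ("Rossi", "e1"),
    ("Paolo", "e1"),
    ("Paolo", "e2"),  # a different person with the same first name
    ("Paolo Bianchi", "e2"),
]

print(pseudo_coref_density(toy))  # 2 entities / 4 distinct strings = 0.5
print(ambiguous_strings(toy))     # ['Paolo']
```
</preformat>
      <p>In this toy collection a system that merges identical strings would wrongly unify the two persons mentioned as "Paolo", which is the kind of accuracy loss discussed above.</p>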
      <p>For our dataset P(co-ref) ≈ 0.22, which means that 22% of
the document entities occur in more than one document. The
detailed distribution is presented in Table 11, where the
first and third columns list the number of collection entities
that occur in the number of documents specified in the
second and fourth columns, respectively.</p>
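      <p>[Table 11. Column headers: #documents, #entities, #documents,
#entities. Surviving cell values, in extraction order: 1, 2, 3, 4, 5,
6, 3, 4, 1, 1; the remaining cells and the row layout are not
recoverable.]</p>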
    </sec>
    <sec id="sec-30">
      <title>VII. CONCLUSION</title>
      <p>We have presented the results of a pilot study on Ontology
Population restricted to PERSON entities. One of the main
motivations of the study was to identify the critical factors that
determine the difficulty of the task.</p>
      <p>The first conclusion we draw is that textual mentions of
PERSON entities are highly structured. As a matter of fact, most
of the mentions carry information that can be easily classified
into a limited number of attributes, while only 3% of them are
categorized as MISCELLANEA. These figures strongly suggest
that the Ontology Population from Textual Mentions (OPTM)
approach is feasible and promising.</p>
      <p>Secondly, we have shown that 50% of the mentions carry more
than the value of a single attribute. This fact, combined with
the relatively low perplexity figures for some attributes, most
notably LAST_NAME, suggests a co-reference procedure based
on attribute values.</p>
      <p>Thirdly, we have computed the values of three estimators of
difficulty for entity co-reference. One of them, the
pseudo-co-reference density, can naturally be used as a baseline for the
task. We have also found that the co-reference density
is far from its possible extremes, 0 and 1, showing that
simple string matching procedures might not achieve good
results.</p>
      <p>Our future work will focus on two main issues: (i) the
use of the PERSON dataset as a training corpus for the
entity co-reference task, as a first step towards implementing a
full OPTM system; and (ii) a controlled extension of the
dataset with new data in order to understand which figures are
likely to remain stable.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Almuhareb</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poesio</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Attribute-based and value-based clustering: An evaluation</article-title>
          .
          <source>In: Proceedings of EMNLP</source>
          <year>2004</year>
          , Barcelona,
          <year>2004</year>
          ,
          <fpage>158</fpage>
          -
          <lpage>165</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Avancini</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lavelli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Magnini</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zanoli</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>Expanding Domain-Specific Lexicons by Term Categorization</article-title>
          .
          <source>In: Proceedings of SAC</source>
          <year>2003</year>
          ,
          <volume>793</volume>
          -
          <fpage>79</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bontcheva</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cunningham</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>The Semantic Web: A New Opportunity and Challenge for HLT</article-title>
          .
          <source>In: Proceedings of the Workshop HLT for the Semantic Web and Web Services at ISWC</source>
          <year>2003</year>
          ,
          Sanibel Island
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bouquet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Serafini</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Zanobini</surname>
            <given-names>S.</given-names>
          </string-name>
          <article-title>Semantic coordination: a new approach and an application</article-title>
          ,
          <source>In Second International Semantic Web Conference</source>
          , volume
          <volume>2870</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>130</fpage>
          --
          <lpage>145</lpage>
          . Springer Verlag,
          September <year>2003</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Buitelaar</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cimiano</surname>
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Magnini</surname>
            <given-names>B</given-names>
          </string-name>
          . (Eds.)
          <article-title>Ontology Learning from Text: Methods, Evaluation and Applications</article-title>
          . IOS Press,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gerber</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mani</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sundheim</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          and Wilson G. (
          <year>2005</year>
          ).
          <article-title>TIDES 2005 Standard for the Annotation of Temporal Expressions</article-title>
          .
          <source>Technical report, MITRE.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lavelli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Magnini</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Negri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pianta</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Speranza</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sprugnoli</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Italian Content Annotation Bank (I-CAB): Temporal Expressions (V. 1.0)</article-title>
          .
          <source>Technical Report T-0505-12</source>
          . ITC-irst, Trento.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>1998</year>
          ).
          <article-title>Automatic Retrieval and Clustering of Similar Words</article-title>
          .
          <source>In: Proceedings of COLING-ACL98</source>
          , Montreal, Canada,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <collab>Linguistic Data Consortium</collab>
          (
          <year>2004</year>
          ).
          <source>ACE (Automatic Content Extraction) English Annotation Guidelines for Entities, version 5.6.1</source>
          2005.05.23.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pianta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Girardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Negri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Romano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Speranza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Bartalesi</given-names>
            <surname>Lenzi</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          .
          <article-title>I-CAB: the Italian Content Annotation Bank</article-title>
          ,
          <source>In: Proceedings of LREC2006</source>
          , Genova, Italy.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pianta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Popescu</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Speranza</surname>
          </string-name>
          .
          <article-title>Ontology Population from Textual Mentions: Task Definition and Benchmark</article-title>
          .
          <source>Proceedings of the OLP2 workshop on Ontology Population and Learning</source>
          , Sydney, Australia,
          <year>2006</year>
          . Joint with ACL/Coling 2006
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Tanev</surname>
            <given-names>H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Magnini</surname>
            <given-names>B.</given-names>
          </string-name>
          <article-title>Weakly Supervised Approaches for Ontology Population</article-title>
          .
          <source>Proceedings of EACL-2006</source>
          , Trento,
          3-7 April,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Velardi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Navigli</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cuchiarelli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neri</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Evaluation of Ontolearn, a Methodology for Automatic Population of Domain Ontologies</article-title>
          . In:
          <string-name>
            <surname>Buitelaar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cimiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Magnini</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (eds.):
          <article-title>Ontology Learning from Text: Methods, Evaluation and Applications</article-title>
          , IOS Press, Amsterdam,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>