=Paper=
{{Paper
|id=Vol-201/paper-6
|storemode=property
|title=From Mentions to Ontology: A Pilot Study
|pdfUrl=https://ceur-ws.org/Vol-201/30.pdf
|volume=Vol-201
|dblpUrl=https://dblp.org/rec/conf/swap/PopescuMPSS06
}}
==From Mentions to Ontology: A Pilot Study==
From Mentions to Ontology: A Pilot Study

Octavian Popescu, Bernardo Magnini, Emanuele Pianta, Luciano Serafini, Manuela Speranza and Andrei Tamilin, ITC-irst, 38050, Povo (TN), Italy

Abstract — In this paper we propose a pilot study aimed at an in-depth comprehension of the phenomena underlying Ontology Population from text. The study has been carried out on a collection of Italian news articles, which have been manually annotated at several semantic levels. More specifically, we have annotated all the textual expressions (i.e. mentions) referring to persons; each mention has been in turn decomposed into a number of attribute/value pairs; co-reference relations among mentions have been established, resulting in the identification of entities, which, finally, have been used to populate an ontology. The study has two significant results. First, a number of factors have been empirically identified which determine the difficulty of Ontology Population from text and which can now be taken into account when designing automatic systems. Second, the resulting dataset is a valuable resource for training and testing single components of Ontology Population systems.

I. INTRODUCTION

In this paper we propose an empirical investigation into the relations between language and knowledge, aiming at the definition of a computational framework for automatic Ontology Population (OP) from text.

While Ontology Population from text has received increasing attention in recent years (see, for instance, Buitelaar et al. 2005), mostly due to its strong relationship with the Semantic Web perspective, very little has been done to provide a clear definition of the task and to establish shared evaluation procedures and benchmarks. In this paper we propose a pilot study aimed at an in-depth comprehension of the phenomena underlying Ontology Population from text. Specifically, we are interested in highlighting the following aspects of the task:

• What are the major sources of difficulty of the task?
• How does OP from text relate to well-known tasks in Natural Language Processing, such as Named Entity Recognition?
• What kinds of reasoning capabilities are crucial for the task?
• Is there any way to simplify the task so that it can be addressed in a modular way?
• Can we devise useful metrics to evaluate system performance?

We addressed the above questions through a pilot study on a limited amount of textual data. We added two restrictions with respect to the general OP task: first, we considered textual mentions instead of full text; second, we focused on information related to PERSON entities instead of considering all possible entities (e.g. ORGANIZATION, LOCATION, etc.).

Mentions, as defined within the ACE (Automatic Content Extraction)¹ Entity Detection Task (Linguistic Data Consortium, 2004), are portions of text that refer to entities. As an example, given a particular textual context, the two mentions “George W. Bush” and “the U.S. President” refer to the same entity, i.e. a particular instance of PERSON whose first name is “George”, whose middle initial is “W.”, whose family name is “Bush” and whose role is “President of the U.S.”.

¹ http://www.nist.gov/speech/tests/ace

As for PERSON entities, they were selected for our pilot study because they occur very frequently in the news document collection we analyzed. Most of the results we obtained, however, are likely to generalize to the other types of entities.

Given the above-mentioned restrictions, the contribution of this paper is a thorough study of Ontology Population from Textual Mentions (OPTM). We have manually extracted a number of relevant details concerning entities of type PERSON from the document collection and then used them to populate a small pre-existing ontology. This led to two significant results. First, a number of factors have been empirically identified which determine the difficulty of Ontology Population from text and which can now be taken into account when designing automatic systems. Second, the resulting dataset is a valuable resource for training and testing single components of Ontology Population.

We show that the difficulty of the OPTM task is directly correlated with two factors: (A) the difficulty of identifying attribute/value pairs inside a given mention and (B) the difficulty of establishing co-reference between entities based on the values of their attributes.

There are several advantages of OPTM that make it appealing for OLP. First, mentions provide an obvious simplification with respect to the more general task of Ontology Population from text (cf. Buitelaar et al. 2005); in addition, mentions are well defined and there are systems for automatic mention recognition which can provide the input for the task. Second, since mentions have been introduced as an evolution of the traditional Named Entity Recognition task (see Tanev and Magnini, 2006), they guarantee a reasonable level of complexity, which makes OPTM challenging both for the Computational Linguistics and the Knowledge Representation communities. Third, there already exist data annotated with mentions, delivered under the ACE initiative (Ferro et al. 2005, Linguistic Data Consortium 2004), which make it possible to exploit machine learning approaches.
The availability of annotated data allows for a better estimation of the performance of OPTM; in particular, it is possible to evaluate the recall of the task, i.e. the proportion of information correctly assigned to an entity out of the total amount of information provided by a certain mention.

The paper is structured as follows. Section II provides some background on Ontology Population and reports on relevant related work; Section III describes the dataset of the PERSON pilot study and compares it to the ACE dataset. Section IV introduces a new methodology for the semantic annotation of attribute/value pairs within textual mentions. In Section V we describe the ontology we plan to use. Finally, Section VI reports on a quantitative and qualitative analysis of the data, which helps determine the main sources of difficulty of the task. Conclusions are drawn in Section VII.

II. RELATED WORK

Automatic Ontology Population (OP) from texts has recently emerged as a new field of application for knowledge acquisition techniques (Buitelaar et al., 2005). Although there is no commonly agreed definition of the task, an approximation has been suggested by (Bontcheva and Cunningham, 2003) as Ontology Driven Information Extraction, with the goal of extracting and classifying instances of concepts and relations defined in an ontology, in place of filling a template. A similar task has been approached from a variety of perspectives, including term clustering (Lin, 1998; Almuhareb and Poesio, 2004) and term categorization (Avancini et al., 2003). A rather different task is Ontology Learning, where new concepts and relations are supposed to be acquired, with the consequence of changing the definition of the Ontology itself (Velardi et al., 2005).

The interest in OP is also reflected in the large number of research projects which consider knowledge extraction from text a key technology for feeding Semantic Web applications. Among such projects it is worth mentioning Vikef (Making the Semantic Web Fly), whose main aim is to bridge the gap between implicit information expressed in scientific documents and its explicit representation in knowledge bases; and Parmenides, which is attempting to develop technologies for the semi-automatic building and maintenance of domain-specific ontologies.

The work presented in this paper has been inspired by the ACE Entity Detection task, which requires that the entities mentioned in a text (e.g. PERSON, ORGANIZATION, LOCATION and GEO-POLITICAL ENTITY) be detected. As the same entity may be mentioned more than once in the same text, ACE defines two inter-connected levels of annotation: the level of the entity, which provides a representation of an object in the world, and the level of the entity mention, which provides information about the textual references to that object. The information contained in the textual references to entities may be translated into a knowledge base, and eventually into an ontology.

III. DATA SET

The input of OPTM consists of textual mentions derived from the Italian Content Annotation Bank (I-CAB), which consists of 525 news documents taken from the local newspaper ‘L’Adige’², for a total of around 180,000 words (Magnini et al., 2006). The annotation of I-CAB has been carried out manually within the Ontotext project³, following the ACE annotation guidelines for the Entity Detection task. I-CAB is annotated with expressions of type TEMPORAL_EXPRESSION and four types of entities: PERSON, ORGANIZATION, GEO-POLITICAL ENTITY and LOCATION. Due to the morpho-syntactic differences between the two languages, the ACE annotation guidelines for English had to be adapted to Italian; for instance, two specific new tags, PROCLIT and ENCLIT, have been created to annotate clitics attached to the beginning or the end of certain words (e.g. vederlo / to see him).

² http://www.ladige.it/
³ http://tcc.itc.it/projects/ontotext/index.html

According to the ACE definition, entity mentions are portions of text referring to entities; the extent of this portion of text consists of an entire nominal phrase, thus including modifiers, prepositional phrases and dependent clauses (e.g. the researcher who works at ITC-irst).

Mentions are classified according to four syntactic categories: NAM (proper names), NOM (nominal constructions), PRO (pronouns) and PRE (modifiers).

[Figure 1: horizontal bar chart comparing the percentage of mentions per syntactic category (NAM, NOM, PRE, PRO) in I-CAB (28353 mentions) and in ACE ENG NWIRE (5186 mentions).]

Fig. 1. Distribution of the four different ACE mention types in I-CAB and in the ACE 2004 Evaluation corpus (Newswire)

In spite of the adaptations to Italian, it is interesting to notice that a comparison between I-CAB and the newswire portion of the ACE 2004 Evaluation corpus (see Figure 1) shows a similar proportion of NAM and NOM mentions in the two corpora. On the other hand, there is a low percentage of PRO mentions in Italian, which can be explained by the fact that, unlike in English, subject pronouns in Italian can be omitted. As for the large difference in the total number of mentions annotated in the two corpora (22,500 and 5,186 in I-CAB and ACE NWIRE respectively), this is proportional to their size (around 180,000 words for I-CAB and 25,900 words for ACE NWIRE), considering that some of the ACE entities (i.e. FACILITY, VEHICLE and WEAPON) are not annotated in I-CAB.
As shown in Figure 2, the two corpora also present a similar distribution as far as the number of mentions per entity is concerned. In fact, in both cases more than 60% of the entities are mentioned only once, while around 15% are mentioned twice. Between 10% and 15% are mentioned three or four times, while around 6% are mentioned between five and eight times. The fact that the percentage of entities mentioned more than eight times in a document is higher in the ACE corpus than in I-CAB can be partly explained by the fact that the news stories in ACE are on average slightly longer than those in I-CAB (around 470 versus 350 words per document).

[Figure 2: bar chart of the percentage of total entities mentioned 1, 2, 3–4, 5–8, or more than 8 times in a document, for I-CAB and ACE ENG NWIRE.]

Fig. 2. Intra-document co-reference in I-CAB and in the ACE 2004 Evaluation corpus (Newswire)

IV. ATTRIBUTES FOR TYPE PERSON

After the annotation of mentions of type PERSON reported in the previous section, each mention was additionally annotated in order to individuate the semantic information expressed by the mention about a specific entity. As an example, given the mention “the Italian President Ciampi”, the following attribute/value pairs were annotated: [PROVENANCE: Italian], [ROLE: President] and [LAST_NAME: Ciampi].

The definition of the set of attributes for PERSON followed an iterative process in which we considered increasing amounts of mentions from which we derived relevant attributes. The final set of attributes is listed in the first column of Table 1, with respective examples reported in the second column.

attribute | values
FIRST_NAME | Ralph, Greg
MIDDLE_NAME | J., W.
LAST_NAME | McCarthy, Newton
NICKNAME | Spider, Enigmista
TITLE | Prof., Mr.
SEX | actress
ACTIVITY | author, doctor
AFFILIATION | The New York Times
ROLE | manager, president
PROVENANCE | South American
FAMILY_RELATION | father, cousin
AGE_CATEGORY | boy, girl
HONORARY | the world champion 2000
MISCELLANEA | the men with red shoes

Table 1. The attribute structure of PERSON
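To make the decomposition into attribute/value pairs concrete, a purely illustrative Python sketch is given below; it matches attribute values against small hand-made gazetteers. The gazetteer contents and the function name are hypothetical, and this is only a toy approximation of the annotation output, not the methodology itself.

```python
# Toy gazetteers mapping attributes to known values (hypothetical content).
GAZETTEERS = {
    "PROVENANCE": {"Italian", "South American"},
    "ROLE": {"President", "manager"},
    "LAST_NAME": {"Ciampi", "McCarthy"},
}

def decompose(mention):
    """Return the attribute/value pairs found in a mention
    (overlap handling and unknown values are omitted here)."""
    pairs = []
    for attribute, lexicon in GAZETTEERS.items():
        for value in lexicon:
            if value in mention:
                pairs.append((attribute, value))
    return pairs

print(decompose("the Italian President Ciampi"))
# [('PROVENANCE', 'Italian'), ('ROLE', 'President'), ('LAST_NAME', 'Ciampi')]
```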
A strict methodology is required in order to ensure accurate annotation. As general guidelines for annotation, articles and prepositions are not admitted at the beginning of the textual extent of a value, an exception being made for the articles in nicknames (see Magnini et al., 2006B for a full description of the criteria used to decide on border cases).

Attributes can be grouped into bigger units, as in the case of the attribute JOB, which is composed of three attributes, ACTIVITY, ROLE and AFFILIATION, which are not independent of each other. ACTIVITY refers to the actual activity performed by the person, while ROLE refers to the position they occupy. So, for instance, “politician” is a possible value of the attribute ACTIVITY, while “leader of the Labour Party” refers to the ROLE a person plays inside an organization. Each group of these three attributes is associated with a mention, and all the information within a group has to be derived from the same mention. If different pieces of information derive from distinct mentions, we will have two separate groups. For instance, the three co-referring mentions “the journalist of Radio Liberty”, “the editor of breaking news”, and “a spare-time astronomer” lead to three different groups of ACTIVITY, ROLE and AFFILIATION. The obvious inference that the first two mentions belong conceptually to the same group is not drawn; this step is to be taken at a further stage.

We started with the set of 525 documents belonging to the I-CAB corpus (see Section III), for which we have manually annotated all PERSON entities (10039 mentions, see Table 2). The annotation individuates both the entities mentioned within a single document, called document entities, and the entities mentioned across the whole set of news stories, called collection entities. In addition, for the purposes of this work, we decided to filter out the following mentions: (i) mentions consisting only of one non-gender-discriminative pronoun; (ii) nested mentions, i.e. mentions contained inside a larger one (for example, in “the president Ciampi”, the mention “Ciampi” is included in the larger one): only the largest mention was considered. In this way we obtained a set of 7233 mentions, which represents the object of our study.
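A minimal sketch of the nested-mention filter just described, assuming mentions are represented as (start, end) character spans; the representation and the helper name are hypothetical:

```python
def filter_nested(mentions):
    """Keep only maximal mentions: drop any span contained in a larger one."""
    kept = []
    for m in mentions:
        contained = any(
            other is not m and other[0] <= m[0] and m[1] <= other[1]
            for other in mentions
        )
        if not contained:
            kept.append(m)
    return kept

# "the president Ciampi" spans (0, 20); the nested "Ciampi" spans (14, 20).
print(filter_nested([(0, 20), (14, 20)]))  # [(0, 20)] -- only the largest survives
```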
Table 2 summarizes the resulting dataset.

Number of documents | 525
Number of mentions | 10039
Number of meaningful mentions | 7233
Number of distinct meaningful mentions | 4851
Number of document entities | 3284
Number of collection entities | 2574

Table 2. The PERSON dataset

The average number of meaningful mentions for an entity in a certain document is 2.20, while the average number of distinct meaningful mentions is 1.47. However, the variation around the average is high: only 14% of document entities are mentioned exactly twice. In fact, there are relatively few entities whose mentions in news have a broad coverage in terms of attributes, and there are quite a few whose mentions contain just the name. A detailed analysis is carried out in Section VI.

V. ONTOLOGY

The ontology adopted for the OPTM task is composed of two main parts. The first part mirrors the mention attribute structure and contains axioms (restrictions) on the attribute values. In this part, which we refer to as the Entity T-Box (ET-box), we define three main classes corresponding to the three main entities: PERSON, ORGANIZATION and GEO-POLITICAL ENTITY. Each of these classes is associated with the mention attributes. An example of how the attributes are encoded as axioms in the ET-box is provided in Table 3.

ONTOLOGY AXIOM | Encoded restriction
PERSON ⊆ (>0) HAS_FIRST_NAME | Every person has at least a first name
PERSON ⊆ (=1) HAS_LAST_NAME | Every person has exactly one last name
DOMAIN(HAS_FIRST_NAME) = PERSON | The first argument of the relation HAS_FIRST_NAME must be a person
RANGE(HAS_PROVENANCE) = GEOPOLITICALENTITY | The second argument of the relation HAS_PROVENANCE must be a geopolitical entity

Table 3. Description of Ontology axioms
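Axioms of this kind can be written down in standard OWL machinery. The following sketch, assuming the rdflib library and a hypothetical namespace, encodes the two cardinality restrictions and the domain axiom of Table 3 as OWL triples; it is an illustration only, not the ontology actually used in the study.

```python
from rdflib import Graph, Namespace, BNode, Literal, RDF, RDFS, OWL
from rdflib.namespace import XSD

EX = Namespace("http://example.org/optm#")  # hypothetical namespace
g = Graph()
g.bind("ex", EX)

g.add((EX.Person, RDF.type, OWL.Class))
for prop in (EX.hasFirstName, EX.hasLastName):
    g.add((prop, RDF.type, OWL.DatatypeProperty))
    g.add((prop, RDFS.domain, EX.Person))  # e.g. DOMAIN(HAS_FIRST_NAME) = PERSON

def restrict(cls, prop, kind, n):
    """Attach an OWL cardinality restriction of the given kind to a class."""
    r = BNode()
    g.add((r, RDF.type, OWL.Restriction))
    g.add((r, OWL.onProperty, prop))
    g.add((r, kind, Literal(n, datatype=XSD.nonNegativeInteger)))
    g.add((cls, RDFS.subClassOf, r))

restrict(EX.Person, EX.hasFirstName, OWL.minCardinality, 1)  # at least one first name
restrict(EX.Person, EX.hasLastName, OWL.cardinality, 1)      # exactly one last name

print(g.serialize(format="turtle"))
```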
The second component of the ontology, called world knowledge (WK), encodes the basic knowledge about the world that is already available (see Table 4 for examples of axioms). This component has been semi-automatically constructed starting from the large amount of basic information available on the web. Examples of such knowledge are the sets of countries, main cities, country capitals, Italian municipalities, etc.

ONTOLOGY AXIOM | Encoded restriction
COUNTRY(Italy) | Italy is a country
HAS_CAPITAL(Italy, Rome) | Rome is the capital of Italy
CONTINENT ⊆ GEOPOLITICALENTITY | A continent is a geopolitical entity
TOWN ⊆ GEOPOLITICALENTITY | A town is a geopolitical entity

Table 4. Description of Ontology axioms related to WK

As can be seen from the above examples, WK is composed of two types of knowledge: factual knowledge (the first two axioms in Table 4) and generic commonsense knowledge. The first type of knowledge can be obtained from the many ontological resources available on the web (see for instance swoogle.umbc.edu), while we have manually encoded the second in the ontology.

The process of OPTM combines the ontology ET-box with WK axioms and the values of attributes recognized in textual mentions, and performs two main steps (both are sketched in code after the worked example):

1. For each entry recognized in the text we create a new individual in the ontology, along with the individuals corresponding to the attribute values.
2. We normalize the values by comparing the “string” values with the individuals present in the WK.

As an example of this process, consider the entry in Table 5.

FIRST_NAME | Bob, B.
LAST_NAME | Marley
PROVENANCE | Caribbean
ACTIVITY | musician, guitar player

Table 5. Attribute/value examples

In the first phase we add the axioms in Table 6 to the ontology.

Person(person23)
HAS_FIRST_NAME(person23, first_name56)
HAS_FIRST_NAME(person23, first_name76)
HAS_LAST_NAME(person23, last_name93)
HAS_PROVENANCE(person23, geo_pol_entity35)
HAS_ACTIVITY(person23, activity43)
HAS_ACTIVITY(person23, activity44)
HAS_VALUE(first_name56, “Bob”)
HAS_VALUE(first_name76, “B.”)
HAS_VALUE(last_name93, “Marley”)
HAS_VALUE(geo_pol_entity35, “Caribbean”)
HAS_VALUE(activity43, “musician”)
HAS_VALUE(activity44, “guitar player”)

Table 6. Adding axioms to the Ontology

In the second phase, we attempt to match the values with the individuals in the WK, and the ontology is modified according to the result of the matching process. This process is based on the semantic matching approach described in (Bouquet et al., 2003).

In this phase the WK part of the ontology plays a crucial role. The main goal of this phase is to find the best match between the values of an attribute and the individuals which are already present in the WK A-box. This process can have two outputs. When a good-enough match is found between an attribute value and an individual of the WK A-box, an equality assertion is added. Suppose for instance that the WK A-box contains the statement

STATE(Caribbean)

Then the mapping process will find a high match between the value “Caribbean” (as a string) and the individual Caribbean (due to the syntactic similarity of the two strings, and to the fact that both are associated with individuals of type GEOPOLITICALENTITY). As a consequence the assertion

geo_pol_entity35 = Caribbean

is asserted in the A-box. Notice that the above assertion connects an individual of the WK with the value of an entity contained in the entity repository of the mentions.

When the mapping process does not produce a “good” mapping (where “good” is defined w.r.t. a suitable distance measure not described here), the value is transformed into an individual and added to the WK A-box. For instance, suppose that the mapping of the value “guitar player” does not produce a good matching value; then the new assertion

ACTIVITY(GuitarPlayer)

is added to the WK A-box, and the assertion

activity44 = GuitarPlayer

is added to the A-box that links the WK with the A-box of the mentions.
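To make the two phases concrete, here is a toy Python sketch (all names hypothetical). The first phase creates an individual for the entity and for each attribute value; the second phase is approximated with plain string similarity via difflib plus a type-compatibility check, standing in for the semantic matching of (Bouquet et al., 2003), which is not reproduced here.

```python
import difflib
import itertools

# Toy WK A-box: individual -> class (hypothetical content).
wk_abox = {"Caribbean": "GEOPOLITICALENTITY", "Italy": "GEOPOLITICALENTITY"}

# Expected class of each attribute's values, mirroring the ET-box ranges.
attr_range = {"PROVENANCE": "GEOPOLITICALENTITY", "ACTIVITY": "ACTIVITY"}

entry = {"FIRST_NAME": ["Bob", "B."], "LAST_NAME": ["Marley"],
         "PROVENANCE": ["Caribbean"], "ACTIVITY": ["musician", "guitar player"]}

counter = itertools.count()
abox = []

# Phase 1: one individual for the entity, one per attribute value.
person = f"person{next(counter)}"
abox.append(("Person", person))
values = {}
for attr, vals in entry.items():
    for v in vals:
        ind = f"{attr.lower()}{next(counter)}"
        abox.append((f"HAS_{attr}", person, ind))
        abox.append(("HAS_VALUE", ind, v))
        values[ind] = (attr, v)

# Phase 2: match each value against type-compatible WK individuals.
for ind, (attr, v) in values.items():
    best, score = None, 0.0
    for wk_ind, wk_class in wk_abox.items():
        if wk_class != attr_range.get(attr):
            continue  # type-incompatible individuals never match
        s = difflib.SequenceMatcher(None, v.lower(), wk_ind.lower()).ratio()
        if s > score:
            best, score = wk_ind, s
    if best is not None and score > 0.8:  # good-enough match: equality assertion
        abox.append(("EQUAL", ind, best))
    else:                                 # otherwise promote the value to the WK
        wk_abox[v] = attr_range.get(attr, attr)
        abox.append(("EQUAL", ind, v))

for assertion in abox:
    print(assertion)
```

Run on the Table 5 entry, this sketch links “Caribbean” to the existing WK individual and promotes “musician” and “guitar player” to new WK individuals, matching the behaviour described above.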
VI. PERSON DATASET ANALYSIS

As stated above, the difficulty of the OPTM task is directly correlated with two factors: (A) the difficulty of identifying the attribute/value pairs inside a given mention and (B) the difficulty of establishing the co-reference of entities based on the values of their attributes.

Table 7 shows the distribution of the values of the attributes defined for PERSON. The first column lists the set of attributes; the second column lists the number of occurrences of each attribute; the third lists the number of different values that the attribute actually takes; the fourth column lists the number of collection entities which have that attribute. Using this table as a base, we try to determine parameters which give us clues about the two factors above.

attribute | occurrences of attribute in mentions | different values for attribute | collection entities with attribute | distinct values within distinct mentions | variability of distinct values in attribute
FIRST_NAME | 2299 (31%) | 676 | 1592 | 13% | 29%
MIDDLE_NAME | 110 (1%) | 67 | 74 | 1% | 60%
LAST_NAME | 4173 (57%) | 1906 | 2191 | 39% | 45%
NICKNAME | 73 (1%) | 44 | 41 | 0% | 60%
TITLE | 73 (1%) | 25 | 47 | 0% | 34%
SEX | 3658 (50%) | 1864 | 1743 | 38% | 50%
ACTIVITY | 973 (13%) | 322 | 569 | 6% | 33%
AFFILIATION | 566 (7%) | 389 | 409 | 8% | 68%
ROLE | 531 (7%) | 211 | 317 | 4% | 39%
PROVENANCE | 469 (6%) | 226 | 367 | 4% | 48%
FAMILY_RELATION | 133 (1%) | 46 | 94 | 0% | 34%
AGE_CATEGORY | 307 (4%) | 106 | 163 | 2% | 34%
HONORARY | 69 (0%) | 63 | 53 | 1% | 91%
MISCELLANEA | 278 (3%) | 270 | 227 | 5% | 97%

Table 7. Distribution of values of attributes for PERSON
A. Difficulty of identifying attribute/value pairs

The identification of attribute/value pairs requires the correct decomposition of the mentions into non-overlapping parts, each one carrying the value of one attribute. We are interested in estimating the distribution of attributes inside the mentions. Table 8 shows, in its second and fourth columns, the number of mentions which contain respectively 1, 2, 3, …, 12 attributes. As we can see, the number of mentions having more than 6 attributes is insignificant. On the other hand, the number of mentions containing more than one attribute is 3564, which represents 49.27% of the total; therefore one in two mentions is a complex mention. Usually, a complex mention contains a SEX value, therefore a two-attribute mention practically has just one value that might help in establishing co-reference. However, 92% of the mentions with up to 5 attributes, which cover 96% of all mentions, contain a NAME attribute, which, presumably, is an important piece of evidence in deciding on co-reference.

The difficulty of correctly identifying the attribute/value pairs is directly linked to the complexity of a mention. Two values inside the same mention belong, by definition, to the same entity; without recognizing the correct boundaries of a complex mention, virtually 50% of the cases (the complex mentions) are treated badly.

#attributes | #mentions | #attributes | #mentions
1 | 3669 (50%) | 7 | 34 (0.5%)
2 | 1292 (17%) | 8 | 19
3 | 1269 (17%) | 9 | 4
4 | 486 (6%) | 10 | 4
5 | 310 (4%) | 11 | 0
6 | 146 (2%) | 12 | 0

Table 8. Number of attributes carried by mentions
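The counts in Table 8 are straightforward to derive once mentions are represented as attribute/value maps. A minimal sketch on toy data (the dict representation is hypothetical):

```python
from collections import Counter

# Each annotated mention as a map from attribute to value (toy data).
mentions = [
    {"LAST_NAME": "Ciampi", "ROLE": "President", "PROVENANCE": "Italian"},
    {"FIRST_NAME": "Greg", "LAST_NAME": "Newton"},
    {"SEX": "actress"},
]

by_size = Counter(len(m) for m in mentions)  # mentions per attribute count
complex_share = sum(n for k, n in by_size.items() if k > 1) / len(mentions)
print(by_size)                               # Counter({3: 1, 2: 1, 1: 1})
print(f"{complex_share:.1%} of mentions are complex")
```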
attribute | 2-attribute mentions | 3-attribute mentions | 4-attribute mentions
FIRST_NAME | 398 | 915 | 413
MIDDLE_NAME | 5 | 20 | 34
LAST_NAME | 467 | 1025 | 426
NICKNAME | 27 | 16 | 2
TITLE | 14 | 16 | 13
SEX | 806 | 1240 | 501
ACTIVITY | 273 | 135 | 413
AFFILIATION | 82 | 91 | 80
ROLE | 126 | 81 | 94
PROVENANCE | 81 | 134 | 156
FAMILY_RELATION | 76 | 24 | 103
AGE_CATEGORY | 139 | 62 | 12
HONORARY | 20 | 7 | 31
MISCELLANEA | 80 | 59 | 11

Table 9. Distribution of attributes in complex mentions

A second difficulty in correctly identifying attribute/value pairs comes from the combinatorial capacities of attributes inside a complex mention. If the diversity of attribute patterns in a complex mention is high, then the difficulty of their recognition is also high. Table 9 shows that the whole set of attributes is very well represented in the complex mentions and, interestingly, the number of attributes varies independently of the number of mentions; their combinatorial capacity is therefore high, and the difficulty of their recognition varies accordingly.

The distribution of attributes inside mentions is presented, in parentheses, in the second column of Table 7. The figures give the probability that a person is mentioned by making reference to a certain attribute. For example, one may expect the LAST_NAME attribute to be present in 57% of mentions, and the NICKNAME attribute in about 1% of the total. In the fifth column we compute the same figures without repetition, considering distinct values and distinct mentions. Considering also the figures that show the linguistic variability of values, we may obtain the probability of seeing a previously unseen value of a given attribute. The last column of Table 7 shows the variability of values for each attribute: for example, taking a random mention of FIRST_NAME, only in 29% of the cases is that value seen in the dataset just once.

The fifth column, distinct values within distinct mentions, and the sixth, variability of values in attribute, offer insight into the difficulty of recognizing attribute/value pairs. The variability might be considered representative of the amount of training a system needs in order to have a satisfactory coverage of cases. Intuitively, some of the attributes are closed classes, while others, e.g. those which take name values, are open classes.

Probably, the importance of recognizing certain types of attributes is bigger than for others. If the occurrence of a new value of an important attribute is a rare event, a system must be very precise in catching these cases. We may assume that a high precision is more difficult to achieve than a lower one. The “distinct” column gives us a clue on this issue. For example, the relatively low figures for ACTIVITY, AFFILIATION and ROLE, together with their importance for the OPTM task, tell us that sparseness could be an issue; a precise treatment of these attributes is therefore required, otherwise it will be hard to achieve the expected results.

Finally, we may notice that 39% of the mentions carry some information other than SEX and name-related values, MISCELLANEA excluded. Therefore in all those cases the ontology is enriched with substantial information about the respective persons.

B. Difficulty of establishing co-reference among entities

The task of correctly identifying the value of a certain attribute inside a given mention is worth undertaking if the respective values play a role in other tasks, especially in the co-reference task. A relevant factor for co-reference is the perplexity of an attribute, i.e. the percentage of entities that are not uniquely characterized by a value of that attribute, computed as the complement of the ratio between the distinct values for a certain attribute and the collection entities having that attribute (one minus column III / column IV in Table 7). For example, the perplexity of LAST_NAME is 14% (see Table 10); therefore, if we randomly take some values of LAST_NAME, 86% of them point to just one person. In the case of SEX and MISCELLANEA the perplexity is not defined.

attribute | perplexity
FIRST_NAME | 58%
MIDDLE_NAME | 10%
LAST_NAME | 14%
NICKNAME | 0%
TITLE | 47%
SEX | –
ACTIVITY | 44%
AFFILIATION | 5%
ROLE | 34%
PROVENANCE | 52%
FAMILY_RELATION | 39%
AGE_CATEGORY | 35%
HONORARY | 0%
MISCELLANEA | –

Table 10. Perplexity of PERSON attributes
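Under this reading, Table 10 can be recomputed from columns III and IV of Table 7. The sketch below reproduces the published figures to within a point of rounding for the attributes shown (a few other rows of the published table deviate further):

```python
# perplexity = 1 - (distinct values / collection entities with the attribute),
# clamped at zero when there are more values than entities (cf. NICKNAME).
table7 = {  # attribute: (different values, collection entities)
    "FIRST_NAME": (676, 1592),
    "MIDDLE_NAME": (67, 74),
    "LAST_NAME": (1906, 2191),
    "ACTIVITY": (322, 569),
    "AFFILIATION": (389, 409),
}
for attr, (values, entities) in table7.items():
    perplexity = max(0.0, 1 - values / entities)
    print(f"{attr:15s} {perplexity:.0%}")
# FIRST_NAME 58%, MIDDLE_NAME 9%, LAST_NAME 13%, ACTIVITY 43%, AFFILIATION 5%
```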
By comparing the perplexity of LAST_NAME and MIDDLE_NAME one might erroneously conclude that the latter is more discriminative; this is due to the small number of examples of MIDDLE_NAME values within the PERSON dataset. Considering the occurrences of one attribute independently of another, we may use the usual rule of thumb for the Bernoulli
distribution: that is, it is highly likely that the perplexity of FIRST_NAME, LAST_NAME, ACTIVITY, AFFILIATION, ROLE and PROVENANCE will not change with the addition of new examples, as the actual numbers are high.

We can estimate the probability that two entities selected from different documents co-refer. Actually, this is the estimate of the probability that two entities co-refer, conditioned on the fact that they have been correctly identified inside the documents. We can compute this probability as the complement of the ratio between the number of different entities and the number of document entities in the collection:

    P(co-ref) = 1 − #collection-entities / #document-entities

From Table 2 we read these values as 2574 and 3284 respectively; therefore, for the PERSON dataset, the probability of inter-document co-reference is approximately 22%. We consider this figure only partially indicative, as it is very likely to increase after inspection of bigger corpora. It is an a posteriori probability, because the number of collection entities is known only after the whole set of mentions has been processed.

A global estimator of the difficulty of co-reference is the expectation that a correctly identified mention refers to a new entity. This estimator shows the density of collection entities in the mention space: let us call it co-reference density. We can estimate the co-reference density as the ratio between the number of different entities and the number of mentions:

    coref-density = #collection-entities / #mentions

The co-reference density takes values in the interval [0, 1]. When it tends to 0, all the mentions refer to the same entity; when it tends to 1, each mention in the collection refers to a different entity. Both limits render the co-reference task superfluous. The co-reference density we found in our corpus is 2574/7233 ≈ 0.36, far from either extreme.

A measure that can be used as a baseline for the co-reference task is the value of the co-reference density conditioned on knowing in advance whether two identical mentions also co-refer. Let us call this measure pseudo-co-reference density. It shows the maximum accuracy of a system that deals with ambiguity by ignoring it. We approximate it as the ratio between the number of different entities and the number of distinct mentions:

    p-coref-density = #collection-entities / #distinct-mentions

The pseudo-co-reference density for our dataset is 2574/4851 ≈ 0.53. This information is not directly expressed in the collection, so it has to be approximated. The difference between co-reference density and pseudo-co-reference density shows the increase in recall if one considers that two identical mentions refer to the same entity with probability 1. On the other hand, the loss in accuracy might be too large (consider for example the case when two different persons happen to have the same first name).
# mentions of the mentions bring information that can be easily classified
in a limited number of attributes, while only 3% of them are
The co-reference density takes values in the interval with categorized as MISCELLANEA. These figures highly suggested
limits [0-1]. The case when the co-reference density tends to 0 that the Ontology Population from Textual Mentions (OPTM)
means that all the mentions refer to the same entity, while approach is feasible and promising.
when the value tends to 1 it means that each mention in the Secondly, we show that 50% of the mentions carry more
collection refers to a different entity. Both the limits render the than the value of a single attribute. This fact, combined with
co-reference task superfluous. The figure for co-reference the relatively low perplexity figures for some attributes, most
density we found in our corpus is 2574/7233 ≈ 0.35, and it is notably LAST_NAME, suggests a co-reference procedure based
far from being close to one of the extremes. on attributes values.
A measure, that can be used as a baseline for the co- Thirdly, we have computed the values of three estimators of
reference task, is the value of co-reference density conditioned difficulty for entity co-reference. One of them, the pseudo-co-
by the fact that one knows in advance whether two mentions reference-density, might be naturally used as a baseline for the
that are identical also co-refer. Let us call this measure task. It has been also discovered that the co-reference-density
pseudo-co-reference-density. It shows the maximum accuracy is far away from its possible extremes, 0 and 1, showing that
of a system that deals with ambiguity by ignoring it. We simple string matching procedures might not achieve good
approximate it as the ratio between the number of different results.
entities and the number of distinct mentions. Our future work will be focused on two main issues: (i) the
# collection − entities use of the PERSON dataset as training corpus for resolving the
p − coref − density = entity co-reference task, as a first step towards implementing a
# distinct − mentions full OPTM system; and (ii) a controlled extension of the
dataset with new data in order to understand which figures are
The pseudo-co-reference for our dataset is 2574/4851 ≈ likely to remain stable.
0.55. This information is not directly expressed in the
REFERENCES

1. Almuhareb, A., Poesio, M. (2004). Attribute-based and value-based clustering: An evaluation. In: Proceedings of EMNLP 2004, Barcelona, 158-165.
2. Avancini, H., Lavelli, A., Magnini, B., Sebastiani, F., Zanoli, R. (2003). Expanding Domain-Specific Lexicons by Term Categorization. In: Proceedings of SAC 2003, 793-79.
3. Bontcheva, K., Cunningham, H. (2003). The Semantic Web: A New Opportunity and Challenge for HLT. In: Proceedings of the Workshop on HLT for the Semantic Web and Web Services at ISWC 2003, Sanibel Island.
4. Bouquet, P., Serafini, L., Zanobini, S. (2003). Semantic coordination: a new approach and an application. In: Second International Semantic Web Conference, Lecture Notes in Computer Science, vol. 2870, Springer Verlag, 130-145.
5. Buitelaar, P., Cimiano, P., Magnini, B. (Eds.) (2005). Ontology Learning from Text: Methods, Evaluation and Applications. IOS Press.
6. Ferro, L., Gerber, L., Mani, I., Sundheim, B., Wilson, G. (2005). TIDES 2005 Standard for the Annotation of Temporal Expressions. Technical report, MITRE.
7. Lavelli, A., Magnini, B., Negri, M., Pianta, E., Speranza, M., Sprugnoli, R. (2005). Italian Content Annotation Bank (I-CAB): Temporal Expressions (V. 1.0). Technical Report T-0505-12, ITC-irst, Trento.
8. Lin, D. (1998). Automatic Retrieval and Clustering of Similar Words. In: Proceedings of COLING-ACL 1998, Montreal, Canada.
9. Linguistic Data Consortium (2004). ACE (Automatic Content Extraction) English Annotation Guidelines for Entities, version 5.6.1 2005.05.23.
10. Magnini, B., Pianta, E., Girardi, C., Negri, M., Romano, L., Speranza, M., Bartalesi Lenzi, V., Sprugnoli, R. (2006). I-CAB: the Italian Content Annotation Bank. In: Proceedings of LREC 2006, Genova, Italy.
11. Magnini, B., Pianta, E., Popescu, O., Speranza, M. (2006B). Ontology Population from Textual Mentions: Task Definition and Benchmark. In: Proceedings of the OLP2 Workshop on Ontology Population and Learning (joint with ACL/COLING 2006), Sydney, Australia.
12. Tanev, H., Magnini, B. (2006). Weakly Supervised Approaches for Ontology Population. In: Proceedings of EACL 2006, Trento, 3-7 April 2006.
13. Velardi, P., Navigli, R., Cucchiarelli, A., Neri, F. (2005). Evaluation of OntoLearn, a Methodology for Automatic Population of Domain Ontologies. In: Buitelaar, P., Cimiano, P., Magnini, B. (eds.): Ontology Learning from Text: Methods, Evaluation and Applications, IOS Press, Amsterdam.