From Mentions to Ontology: A Pilot Study

Octavian Popescu, Bernardo Magnini, Emanuele Pianta, Luciano Serafini, Manuela Speranza and Andrei Tamilin
ITC-irst, 38050, Povo (TN), Italy

Abstract—In this paper we propose a pilot study aimed at an in-depth comprehension of the phenomena underlying Ontology Population from text. The study has been carried out on a collection of Italian news articles, which have been manually annotated at several semantic levels. More specifically, we have annotated all the textual expressions (i.e. mentions) referring to persons; each mention has in turn been decomposed into a number of attribute/value pairs; co-reference relations among mentions have been established, resulting in the identification of entities, which, finally, have been used to populate an ontology. There are two significant results of such a study. First, a number of factors have been empirically identified which determine the difficulty of Ontology Population from text and which can now be taken into account while designing automatic systems. Second, the resulting dataset is a valuable resource for training and testing single components of Ontology Population systems.

I. INTRODUCTION

In this paper we propose an empirical investigation into the relations between language and knowledge, aiming at the definition of a computational framework for automatic Ontology Population (OP) from text.

While Ontology Population from text has received increasing attention in recent years (see, for instance, Buitelaar et al. 2005), mostly due to its strong relationship with the Semantic Web perspective, very little has been done in order to provide a clear definition of the task and to establish shared evaluation procedures and benchmarks. In this paper we propose a pilot study aimed at an in-depth comprehension of the phenomena underlying Ontology Population from Text (OPTM).
Specifically, we are interested in highlighting the following aspects of the task:
• What are the major sources of difficulty of the task?
• How does OP from text relate to well-known tasks in Natural Language Processing, such as Named Entity Recognition?
• What kinds of reasoning capabilities are crucial for the task?
• Is there any way to simplify the task so that it can be addressed in a modular way?
• Can we devise useful metrics to evaluate system performance?

We addressed the above questions through a pilot study on a limited amount of textual data. We added two restrictions with respect to the general OP task: first, we considered textual mentions instead of full text; second, we focused on information related to PERSON entities instead of considering all possible entity types (e.g. ORGANIZATION, LOCATION, etc.).

Mentions, as defined within the ACE (Automatic Content Extraction) [1] Entity Detection Task (Linguistic Data Consortium, 2004), are portions of text that refer to entities. As an example, given a particular textual context, the two mentions "George W. Bush" and "the U.S. President" refer to the same entity, i.e. a particular instance of PERSON whose first name is "George", whose middle initial is "W.", whose family name is "Bush" and whose role is "President of the U.S.".

[1] http://www.nist.gov/speech/tests/ace

As for PERSON entities, they were selected for our pilot study because they occur very frequently in the news document collection we analyzed. Most of the results we obtained, however, are likely to generalize to the other types of entities.

Given the above-mentioned restrictions, the contribution of this paper is a thorough study of Ontology Population from Textual Mentions (OPTM). We have manually extracted a number of relevant details concerning entities of type PERSON from the document collection and then used them to populate a small pre-existing ontology. This study led to two significant results. First, a number of factors have been empirically identified which determine the difficulty of Ontology Population from text and which can now be taken into account while designing automatic systems. Second, the resulting dataset is a valuable resource for training and testing single components of Ontology Population systems.

We show that the difficulty of the OPTM task is directly correlated with two factors: (A) the difficulty of identifying attribute/value pairs inside a given mention and (B) the difficulty of establishing co-reference between entities based on the values of their attributes.

There are several advantages of OPTM that make it appealing as an approach to OP. First, mentions provide an obvious simplification with respect to the more general task of Ontology Population from text (cf. Buitelaar et al. 2005); in addition, mentions are well defined and there are systems for automatic mention recognition which can provide the input for the task. Second, since mentions have been introduced as an evolution of the traditional Named Entity Recognition task (see Tanev and Magnini, 2006), they guarantee a reasonable level of complexity, which makes OPTM challenging both for the Computational Linguistics and the Knowledge Representation communities. Third, there already exist data annotated with mentions, delivered under the ACE initiative (Ferro et al. 2005, Linguistic Data Consortium 2004), which make it possible to exploit machine learning approaches.

The availability of annotated data also allows for a better estimation of the performance of OPTM; in particular, it is possible to evaluate the recall of the task, i.e. the proportion of information correctly assigned to an entity out of the total amount of information provided by a certain mention.
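Stated compactly (our own formulation of the definition just given, not notation taken from the ACE guidelines), the recall of OPTM for a given mention can be written as:

$$\mathit{recall} = \frac{\#\ \text{attribute/value pairs correctly assigned to the entity}}{\#\ \text{attribute/value pairs expressed by the mention}}$$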
The paper is structured as follows. Section II provides some background on Ontology Population and reports on relevant related work; Section III describes the dataset of the pilot study and compares it to the ACE dataset; Section IV introduces a new methodology for the semantic annotation of attribute/value pairs within textual mentions; Section V describes the ontology we use. Finally, Section VI reports on a quantitative and qualitative analysis of the data, which helps determine the main sources of difficulty of the task. Conclusions are drawn in Section VII.

II. RELATED WORK

Automatic Ontology Population (OP) from texts has recently emerged as a new field of application for knowledge acquisition techniques (Buitelaar et al., 2005). Although there is no commonly accepted definition of the task, a useful approximation has been suggested by Bontcheva and Cunningham (2003) as Ontology Driven Information Extraction, with the goal of extracting and classifying instances of concepts and relations defined in an ontology, in place of filling a template. A similar task has been approached from a variety of perspectives, including term clustering (Lin, 1998; Almuhareb and Poesio, 2004) and term categorization (Avancini et al., 2003). A rather different task is Ontology Learning, where new concepts and relations are supposed to be acquired, with the consequence of changing the definition of the ontology itself (Velardi et al. 2005).

The interest in OP is also reflected in the large number of research projects which consider knowledge extraction from text a key technology for feeding Semantic Web applications. Among such projects, it is worth mentioning Vikef (Making the Semantic Web Fly), whose main aim is to bridge the gap between the implicit information expressed in scientific documents and its explicit representation in knowledge bases, and Parmenides, which attempts to develop technologies for the semi-automatic building and maintenance of domain-specific ontologies.

The work presented in this paper has been inspired by the ACE Entity Detection task, which requires that the entities mentioned in a text (e.g. PERSON, ORGANIZATION, LOCATION and GEO-POLITICAL ENTITY) be detected. As the same entity may be mentioned more than once in the same text, ACE defines two inter-connected levels of annotation: the level of the entity, which provides a representation of an object in the world, and the level of the entity mention, which provides information about the textual references to that object. The information contained in the textual references to entities may be translated into a knowledge base, and eventually into an ontology.

III. DATA SET

The input of OPTM consists of textual mentions derived from the Italian Content Annotation Bank (I-CAB), which consists of 525 news documents taken from the local newspaper 'L'Adige' [2], for a total of around 180,000 words (Magnini et al., 2006a). The annotation of I-CAB has been carried out manually within the Ontotext project [3], following the ACE annotation guidelines for the Entity Detection task. I-CAB is annotated with expressions of type TEMPORAL_EXPRESSION and with four types of entities: PERSON, ORGANIZATION, GEO-POLITICAL ENTITY and LOCATION. Due to the morpho-syntactic differences between the two languages, the ACE annotation guidelines for English had to be adapted to Italian; for instance, two specific new tags, PROCLIT and ENCLIT, have been created to annotate clitics attached to the beginning or the end of certain words (e.g. "vederlo" / to see him).

[2] http://www.ladige.it/
[3] http://tcc.itc.it/projects/ontotext/index.html

According to the ACE definition, entity mentions are portions of text referring to entities; the extent of such a portion of text consists of an entire nominal phrase, thus including modifiers, prepositional phrases and dependent clauses (e.g. "the researcher who works at ITC-irst").

Mentions are classified according to four syntactic categories: NAM (proper names), NOM (nominal constructions), PRO (pronouns) and PRE (modifiers).

Fig. 1. Distribution of the four different ACE mention types in I-CAB and in the ACE 2004 Evaluation corpus (Newswire). [Bar chart: percentage of mentions per syntactic category (NAM, NOM, PRE, PRO) in the two corpora.]

In spite of the adaptations to Italian, it is interesting to notice that a comparison between I-CAB and the newswire portion of the ACE 2004 Evaluation corpus (see Figure 1) shows a similar proportion of NAM and NOM mentions in the two corpora. On the other hand, there is a low percentage of PRO mentions in Italian, which can be explained by the fact that, unlike in English, subject pronouns in Italian can be omitted. As for the large difference in the total number of mentions annotated in the two corpora (22,500 and 5,186 in I-CAB and ACE NWIRE respectively), this is proportional to their size (around 180,000 words for I-CAB and 25,900 words for ACE NWIRE), considering that some of the ACE entity types (i.e. FACILITY, VEHICLE, and WEAPON) are not annotated in I-CAB.
As shown in Figure 2, the two corpora also present a similar distribution as far as the number of mentions per entity is concerned. In fact, in both cases more than 60% of the entities are mentioned only once, while around 15% are mentioned twice. Between 10% and 15% are mentioned three or four times, while around 6% are mentioned between five and eight times. The fact that the percentage of entities mentioned more than eight times in a document is higher in the ACE corpus than in I-CAB can be partly explained by the fact that the news stories in ACE are on average slightly longer (around 470 versus 350 words per document).

Fig. 2. Intra-document co-reference in I-CAB and in the ACE 2004 Evaluation corpus (Newswire). [Bar chart: percentage of total entities mentioned 1, 2, 3-4, 5-8, and more than 8 times in a document.]

IV. ATTRIBUTES FOR TYPE PERSON

After the annotation of mentions of type PERSON reported in the previous section, each mention was additionally annotated in order to individuate the semantic information expressed by the mention about a specific entity. As an example, given the mention "the Italian President Ciampi", the following attribute/value pairs were annotated: [PROVENANCE: Italian], [ROLE: President] and [LAST_NAME: Ciampi].

The definition of the set of attributes for PERSON followed an iterative process in which we considered increasing amounts of mentions, from which we derived the relevant attributes. The final set of attributes is listed in the first column of Table 1, with respective examples reported in the second column.

attribute | example values
FIRST_NAME | Ralph, Greg
MIDDLE_NAME | J., W.
LAST_NAME | McCarthy, Newton
NICKNAME | Spider, Enigmista
TITLE | Prof., Mr.
SEX | actress
ACTIVITY | author, doctor
AFFILIATION | The New York Times
ROLE | manager, president
PROVENANCE | South American
FAMILY_RELATION | father, cousin
AGE_CATEGORY | boy, girl
HONORARY | the world champion 2000
MISCELLANEA | The men with red shoes

Table 1. The attribute structure of PERSON

A strict methodology is required in order to ensure accurate annotation. As a general guideline, articles and prepositions are not admitted at the beginning of the textual extent of a value, an exception being made for the articles in nicknames (see Magnini et al., 2006b for a full description of the criteria used to decide on border cases).

Attributes can be grouped into bigger units, as in the case of the attribute JOB, which is composed of three attributes, ACTIVITY, ROLE, and AFFILIATION, which are not independent of each other. ACTIVITY refers to the actual activity performed by the person, while ROLE refers to the position they occupy: so, for instance, "politician" is a possible value of the attribute ACTIVITY, while "leader of the Labour Party" refers to the ROLE a person plays inside an organization. Each group of these three attributes is associated with a mention, and all the information within a group has to be derived from the same mention. If different pieces of information derive from distinct mentions, we will have two separate groups. For instance, the three co-referring mentions "the journalist of Radio Liberty", "the editor of breaking news", and "a spare-time astronomer" lead to three different groups of ACTIVITY, ROLE and AFFILIATION; the obvious inference that the first two mentions belong conceptually to the same group is not drawn, as this step is to be taken at a further stage.
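To make the annotation scheme concrete, the sketch below shows one possible in-memory representation of an annotated mention, with ACTIVITY, ROLE and AFFILIATION tied together in per-mention JOB groups. The class and field names are our own illustration, not part of the annotation tools used in the project:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class JobGroup:
    """ACTIVITY, ROLE and AFFILIATION values derived from one and the same mention."""
    activity: Optional[str] = None
    role: Optional[str] = None
    affiliation: Optional[str] = None

@dataclass
class PersonMention:
    """One textual mention of a PERSON, decomposed into attribute/value pairs."""
    extent: str
    attributes: dict = field(default_factory=dict)  # e.g. {"LAST_NAME": "Ciampi"}
    job_groups: list = field(default_factory=list)  # JOB groups found in this mention

# The example from the text: "the Italian President Ciampi"
m1 = PersonMention(
    extent="the Italian President Ciampi",
    attributes={"PROVENANCE": "Italian", "LAST_NAME": "Ciampi"},
    job_groups=[JobGroup(role="President")],
)

# Co-referring mentions still yield separate JOB groups: the inference that
# "the journalist of Radio Liberty" and "the editor of breaking news"
# describe the same job is deliberately not drawn at this stage.
m2 = PersonMention(
    extent="the journalist of Radio Liberty",
    job_groups=[JobGroup(activity="journalist", affiliation="Radio Liberty")],
)
```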
We started with the set of 525 documents belonging to the I-CAB corpus (see Section III), for which we manually annotated all PERSON entities (10039 mentions, see Table 2). The annotation individuates both the entities mentioned within a single document, called document entities, and the entities mentioned across the whole set of news stories, called collection entities. In addition, for the purposes of this work, we decided to filter out the following mentions: (i) mentions consisting only of one non-gender-discriminative pronoun; (ii) nested mentions, i.e. when a mention contains a smaller one (for example "the president Ciampi", with "Ciampi" being the included mention), only the largest mention was considered. In this way we obtained a set of 7233 mentions, which represents the object of our study.

Number of documents | 525
Number of mentions | 10039
Number of meaningful mentions | 7233
Number of distinct meaningful mentions | 4851
Number of document entities | 3284
Number of collection entities | 2574

Table 2. The PERSON dataset

The average number of meaningful mentions for an entity in a certain document is 2.20, while the average number of distinct meaningful mentions is 1.47. However, the variation around the average is high: only 14% of the document entities are mentioned exactly twice. In fact, there are relatively few entities whose mentions in the news have a broad coverage in terms of attributes, and there are quite a few whose mentions contain just the name. A detailed analysis is carried out in Section VI.

V. ONTOLOGY

The ontology adopted for the OPTM task is composed of two main parts. The first part mirrors the mention attribute structure and contains axioms (restrictions) on the attribute values. In this part, which we refer to as the Entity T-Box (ET-box), we define three main classes corresponding to the three main entity types, PERSON, ORGANIZATION and GEO-POLITICAL ENTITY. Each of these classes is associated with the mention attributes. An example of how the attributes are encoded as axioms in the ET-box is provided in Table 3.

ONTOLOGY AXIOM | Encoded restriction
PERSON ⊆ (>0) HAS_FIRST_NAME | Every person has at least a first name
PERSON ⊆ (=1) HAS_LAST_NAME | Every person has exactly one last name
DOMAIN(HAS_FIRST_NAME) = PERSON | The first argument of the relation HAS_FIRST_NAME must be a person
RANGE(HAS_PROVENANCE) = GEOPOLITICALENTITY | The second argument of the relation HAS_PROVENANCE must be a geopolitical entity

Table 3. Description of ontology axioms
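For readers more familiar with description-logic notation, the restrictions of Table 3 can be transcribed as follows (our own transcription; the paper does not fix a concrete DL syntax):

\begin{align*}
\mathsf{PERSON} &\sqsubseteq\; \geq 1\ \mathsf{HAS\_FIRST\_NAME}\\
\mathsf{PERSON} &\sqsubseteq\; =\!1\ \mathsf{HAS\_LAST\_NAME}\\
\exists\,\mathsf{HAS\_FIRST\_NAME}.\top &\sqsubseteq\; \mathsf{PERSON} \quad\text{(domain)}\\
\top &\sqsubseteq\; \forall\,\mathsf{HAS\_PROVENANCE}.\mathsf{GEOPOLITICALENTITY} \quad\text{(range)}
\end{align*}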
The second component of the ontology, called world knowledge (WK), encodes the basic knowledge about the world which is already available (see Table 4 for examples of axioms). This part has been semi-automatically constructed starting from the large amount of basic information available on the web. Examples of such knowledge are the sets of countries, main cities, country capitals, Italian municipalities, etc.

ONTOLOGY AXIOM | Encoded restriction
COUNTRY(Italy) | Italy is a country
HAS_CAPITAL(Italy, Rome) | Rome is the capital of Italy
COUNTRY ⊆ GEOPOLITICALENTITY | A country is a geopolitical entity
TOWN ⊆ GEOPOLITICALENTITY | A town is a geopolitical entity

Table 4. Description of ontology axioms related to WK

As can be seen from the above examples, WK is composed of two types of knowledge: factual knowledge (the first two axioms in Table 4) and generic commonsense knowledge. The first type of knowledge can be obtained from the many ontological resources available on the web (see for instance swoogle.umbc.edu), while we have manually encoded the second in the ontology.

The process of OPTM combines the ontology ET-box with the WK axioms and with the values of the attributes recognized in textual mentions, and performs two main steps:
1. For each entry recognized in the text we create a new individual in the ontology, along with the individuals corresponding to the attribute values.
2. We normalize the values by comparing the "string" values with the individuals present in the WK.

As an example of this process, consider the entry in Table 5.

FIRST_NAME | Bob, B.
LAST_NAME | Marley
PROVENANCE | Caribbean
ACTIVITY | musician, guitar player

Table 5. Attribute/value examples

In the first phase we add the axioms in Table 6 to the ontology.

Person(person23)
HAS_FIRST_NAME(person23, first_name56)
HAS_FIRST_NAME(person23, first_name76)
HAS_LAST_NAME(person23, last_name93)
HAS_PROVENANCE(person23, geo_pol_entity35)
HAS_ACTIVITY(person23, activity43)
HAS_ACTIVITY(person23, activity44)
HAS_VALUE(first_name56, "Bob")
HAS_VALUE(first_name76, "B.")
HAS_VALUE(last_name93, "Marley")
HAS_VALUE(geo_pol_entity35, "Caribbean")
HAS_VALUE(activity43, "musician")
HAS_VALUE(activity44, "guitar player")

Table 6. Adding axioms to the ontology

In the second phase, we attempt to match the values to the individuals in the WK, and the ontology is modified according to the result of the matching process. This process is based on the semantic matching approach described in (Bouquet et al., 2003).

In this phase the WK part of the ontology takes a crucial role. The main goal of this phase is to find the best match between the values of an attribute and the individuals which are already present in the WK A-box. This process can have two outcomes. When a good-enough match is found between an attribute value and an individual of the WK A-box, an equality assertion is added. Suppose, for instance, that the WK A-box contains the statement

STATE(Caribbean)

Then the mapping process will find a high match between the value "Caribbean" (as a string) and the individual Caribbean (due to the syntactic similarity between the two strings, and to the fact that both are associated with individuals of type GEOPOLITICALENTITY). As a consequence, the assertion

geo_pol_entity35 = Caribbean

is asserted in the A-box. Notice that the above assertion connects an individual of the WK with the value of an entity contained in the entity repository of the mentions.

When the mapping process does not produce a "good" mapping (where "good" is defined with respect to a suitable distance measure not described here), the value is transformed into an individual and added to the WK A-box. For instance, suppose that the mapping of the value "guitar player" does not produce a good matching value; then the new assertion

ACTIVITY(GuitarPlayer)

is added to the WK A-box, and the assertion

activity44 = GuitarPlayer

is added to the A-box that links the WK with the A-box of the mentions.
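The two phases can be summarized in code. The sketch below is a minimal illustration of the process just described, with a plain string-similarity ratio standing in for the semantic matching of (Bouquet et al., 2003); all names (WKABox, populate, the 0.8 threshold) are hypothetical:

```python
from difflib import SequenceMatcher
import itertools

_ids = itertools.count()

def similarity(a: str, b: str) -> float:
    """Naive stand-in for the semantic matching of (Bouquet et al., 2003)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

class WKABox:
    """Toy world-knowledge A-box: typed individuals, e.g. {"Caribbean": "STATE"}."""
    def __init__(self, individuals):
        self.individuals = individuals

    def best_match(self, value):
        scored = [(similarity(value, ind), ind) for ind in self.individuals]
        return max(scored, default=(0.0, None))

def populate(entry, wk, assertions, threshold=0.8):
    """Phase 1: create an individual for the entry and for each attribute value.
    Phase 2: normalize every value against the WK A-box."""
    person = f"person{next(_ids)}"
    assertions.append(("Person", person))
    for attr, values in entry.items():
        for value in values:
            node = f"{attr.lower()}{next(_ids)}"
            assertions.append((f"HAS_{attr}", person, node))
            assertions.append(("HAS_VALUE", node, value))
            score, individual = wk.best_match(value)
            if individual is not None and score >= threshold:
                # good-enough match: add an equality assertion to the A-box
                assertions.append(("EQUAL", node, individual))
            else:
                # no good match: promote the value to a new WK individual
                new_individual = value.title().replace(" ", "").replace(".", "")
                wk.individuals[new_individual] = attr
                assertions.append(("EQUAL", node, new_individual))

wk = WKABox({"Caribbean": "STATE", "Italy": "COUNTRY"})
facts = []
populate({"FIRST_NAME": ["Bob", "B."], "LAST_NAME": ["Marley"],
          "PROVENANCE": ["Caribbean"], "ACTIVITY": ["musician", "guitar player"]},
         wk, facts)
# "Caribbean" matches the existing WK individual Caribbean; "guitar player"
# becomes the new individual GuitarPlayer, mirroring the example in the text.
```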
VI. PERSON DATASET ANALYSIS

The difficulty of the OPTM task is directly correlated with two factors: (A) the difficulty of identifying the attribute/value pairs inside a given mention and (B) the difficulty of establishing the co-reference of entities based on the values of their attributes.

In Table 7 we report the distribution of the values of the attributes defined for PERSON. The first column lists the set of attributes; the second lists the number of occurrences of each attribute; the third lists the number of different values that the attribute actually takes; the fourth lists the number of collection entities which have that attribute. Using this table as a base, we try to determine the parameters which give us clues on the two factors above.

Attribute | Occurrences of attribute in mentions | Different values for attribute | Collection entities with attribute | Distinct values within distinct mentions | Variability of values in attribute
FIRST_NAME | 2299 (31%) | 676 | 1592 | 13% | 29%
MIDDLE_NAME | 110 (1%) | 67 | 74 | 1% | 60%
LAST_NAME | 4173 (57%) | 1906 | 2191 | 39% | 45%
NICKNAME | 73 (1%) | 44 | 41 | 0% | 60%
TITLE | 73 (1%) | 25 | 47 | 0% | 34%
SEX | 3658 (50%) | 1864 | 1743 | 38% | 50%
ACTIVITY | 973 (13%) | 322 | 569 | 6% | 33%
AFFILIATION | 566 (7%) | 389 | 409 | 8% | 68%
ROLE | 531 (7%) | 211 | 317 | 4% | 39%
PROVENANCE | 469 (6%) | 226 | 367 | 4% | 48%
FAMILY_RELATION | 133 (1%) | 46 | 94 | 0% | 34%
AGE_CATEGORY | 307 (4%) | 106 | 163 | 2% | 34%
HONORARY | 69 (0%) | 63 | 53 | 1% | 91%
MISCELLANEA | 278 (3%) | 270 | 227 | 5% | 97%

Table 7. Distribution of values of attributes for PERSON
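Statistics such as those in Table 7 can be recomputed directly from the annotated mentions. The sketch below shows the counting logic for columns II-IV, under our assumption that mentions are available as (entity_id, {attribute: value}) records; the field names are ours:

```python
from collections import defaultdict

def attribute_stats(mentions):
    """mentions: list of (entity_id, {attribute: value}) records.
    Reproduces columns II-IV of Table 7: occurrences of each attribute,
    number of different values, and number of collection entities carrying it."""
    occurrences = defaultdict(int)
    values = defaultdict(set)
    entities = defaultdict(set)
    for entity_id, pairs in mentions:
        for attribute, value in pairs.items():
            occurrences[attribute] += 1
            values[attribute].add(value)
            entities[attribute].add(entity_id)
    return {a: {"occurrences": occurrences[a],
                "pct_of_mentions": round(100 * occurrences[a] / len(mentions)),
                "different_values": len(values[a]),
                "entities_with_attribute": len(entities[a])}
            for a in occurrences}

stats = attribute_stats([
    ("e1", {"FIRST_NAME": "Ralph", "LAST_NAME": "McCarthy"}),
    ("e1", {"LAST_NAME": "McCarthy", "ROLE": "manager"}),
    ("e2", {"FIRST_NAME": "Greg"}),
])
assert stats["LAST_NAME"]["different_values"] == 1
```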
A. Difficulty of identifying attribute/value pairs

The identification of attribute/value pairs requires the correct decomposition of the mentions into non-overlapping parts, each one carrying the value of one attribute. We are interested in estimating the distribution of attributes inside the mentions. Table 8 shows, in the second and fourth columns, the number of mentions which contain respectively 1, 2, 3, ..., 12 attributes. As we can see, the number of mentions having more than 6 attributes is insignificant. On the other hand, the number of mentions containing more than one attribute is 3564, which represents 49.27% of the total; therefore one mention in two is a complex mention. Usually, a complex mention contains a SEX value, therefore a two-attribute mention practically has just one attribute that might help in establishing co-reference. However, 92% of the mentions with up to 5 attributes, which cover 96% of all mentions, contain a NAME attribute, which, presumably, is an important piece of evidence in deciding on co-reference.

#attributes | #mentions | #attributes | #mentions
1 | 3669 (50%) | 7 | 34 (0.5%)
2 | 1292 (17%) | 8 | 19
3 | 1269 (17%) | 9 | 4
4 | 486 (6%) | 10 | 4
5 | 310 (4%) | 11 | 0
6 | 146 (2%) | 12 | 0

Table 8. Number of attributes carried by mentions

The difficulty of correctly identifying the attribute/value pairs is directly linked to the complexity of a mention. All the values inside a mention belong to the same entity; since one mention in two is complex, a system that does not recognize the correct frontiers of complex mentions treats virtually 50% of the cases badly.

A second difficulty in correctly identifying the attribute/value pairs comes from the combinatorial capacity of attributes inside a complex mention: if the diversity of attribute patterns in complex mentions is high, then the difficulty of their recognition is also high. Table 9 shows that the whole set of attributes is very well represented in the complex mentions and, interestingly, the number of attributes varies independently of the number of mentions; therefore their combinatorial capacity is high, and the difficulty of their recognition varies accordingly.

attribute | 2-attribute mentions | 3-attribute mentions | 4-attribute mentions
FIRST_NAME | 398 | 915 | 413
MIDDLE_NAME | 5 | 20 | 34
LAST_NAME | 467 | 1025 | 426
NICKNAME | 27 | 16 | 2
TITLE | 14 | 16 | 13
SEX | 806 | 1240 | 501
ACTIVITY | 273 | 135 | 413
AFFILIATION | 82 | 91 | 80
ROLE | 126 | 81 | 94
PROVENANCE | 81 | 134 | 156
FAMILY_RELATION | 76 | 24 | 103
AGE_CATEGORY | 139 | 62 | 12
HONORARY | 20 | 7 | 31
MISCELLANEA | 80 | 59 | 11

Table 9. Distribution of attributes into complex mentions

Recognizing certain types of attributes is probably more important than recognizing others. If the occurrence of a new value of an important attribute is a rare event, a system must be very precise in catching these cases, and we may assume that a high precision is more difficult to achieve than a lower one. The column of distinct values gives us a clue on this issue: for example, the relatively low figures for ACTIVITY, AFFILIATION and ROLE, combined with their importance for the OPTM task, tell us that sparseness could be an issue, and that these attributes therefore require a high-precision treatment; otherwise it will be hard to achieve the expected results.

The distribution of attributes inside mentions is presented in the second column of Table 7, in parentheses. The figures give the probability that a person is mentioned by making reference to a certain attribute: for example, one may expect the LAST_NAME attribute to be present in 57% of the mentions, and the NICKNAME attribute in 1% of the total. In the fifth column we compute the same figures without repetition, considering distinct values and distinct mentions. Considering also the figures that show the linguistic variability of values, we may estimate the probability of seeing a previously unseen value of a given attribute. The last column of Table 7 shows the variability of values for each attribute: for example, taking a mention with a FIRST_NAME at random, only in 29% of the cases is that value seen in the dataset just once.

The fifth column, distinct values within distinct mentions, and the sixth, variability of values in attribute, offer insight into the difficulty of recognizing attribute/value pairs. The variability might be considered representative of the amount of training a system needs in order to reach a satisfactory coverage of cases. Intuitively, some of the attributes are closed classes, while other attributes, e.g. those which take name values, are open classes.

Finally, we may notice that 39% of the mentions carry information other than SEX and name-related values, MISCELLANEA excluded. Therefore, in all those cases the ontology is enriched with substantial information about the respective persons.
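The variability column discussed above can be read as a hapax ratio. A minimal computation, under our reading that variability is the share of an attribute's occurrences whose value appears only once in the dataset:

```python
from collections import Counter

def variability(occurrences):
    """occurrences: all values observed for one attribute, with repetitions.
    Returns the share of occurrences whose value appears exactly once,
    i.e. our reading of the 'variability' column of Table 7."""
    counts = Counter(occurrences)
    hapax = sum(1 for value in occurrences if counts[value] == 1)
    return hapax / len(occurrences) if occurrences else 0.0

# One of the three observed values ("Greg") is seen just once:
assert round(variability(["Ralph", "Greg", "Ralph"]), 2) == 0.33
```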
B. Difficulty of establishing co-reference among entities

The task of correctly identifying the value of a certain attribute inside a given mention is worth undertaking if the respective values play a role in other tasks, especially in the co-reference task. A relevant factor for co-reference is the perplexity of an attribute, i.e. the percentage of its values that do not uniquely identify an entity, computed as the complement of the ratio between the number of distinct values for a certain attribute and the number of collection entities having that attribute (1 minus column III / column IV in Table 7). For example, the perplexity of LAST_NAME is 14% (see Table 10); therefore, if we take some values of LAST_NAME at random, 86% of them point to just one person. In the case of SEX and MISCELLANEA the perplexity is not defined.

attribute | perplexity
FIRST_NAME | 58%
MIDDLE_NAME | 10%
LAST_NAME | 14%
NICKNAME | 0%
TITLE | 47%
SEX | -
ACTIVITY | 44%
AFFILIATION | 5%
ROLE | 34%
PROVENANCE | 52%
FAMILY_RELATION | 39%
AGE_CATEGORY | 35%
HONORARY | 0%
MISCELLANEA | -

Table 10. Perplexity of PERSON attributes

By comparing the perplexity of LAST_NAME and MIDDLE_NAME one might erroneously conclude that the latter is more discriminative. This effect is due to the small number of examples of MIDDLE_NAME values within the PERSON dataset. Considering the occurrences of one attribute independently of the others, we may apply the usual rule of thumb for a Bernoulli distribution: it is highly likely that the perplexity of FIRST_NAME, LAST_NAME, ACTIVITY, AFFILIATION, ROLE and PROVENANCE will not change with the addition of new examples, as the observed counts are already high.
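Reading the perplexity as the complement of the ratio between columns III and IV of Table 7 reproduces the figures of Table 10; a quick check (our own arithmetic, with negative ratios floored at zero):

```python
def perplexity(different_values, entities_with_attribute):
    """1 - (different values / collection entities), floored at zero."""
    return max(0.0, 1 - different_values / entities_with_attribute)

print(round(perplexity(676, 1592), 2))   # FIRST_NAME -> 0.58, as in Table 10
print(round(perplexity(1906, 2191), 2))  # LAST_NAME  -> 0.13 (reported as 14%)
print(round(perplexity(44, 41), 2))      # NICKNAME   -> 0.0 (more values than entities)
```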
We can estimate the probability that two entities selected from different documents co-refer. Actually, this is an estimate of the probability that two entities co-refer conditioned by the fact that they have been correctly identified inside the documents. We can compute such a probability as the complement of the ratio between the number of different entities and the number of document entities in the collection:

$$P(\mathit{co\text{-}ref}) = 1 - \frac{\#\ \text{collection entities}}{\#\ \text{document entities}}$$

From Table 2 we read these values as 2574 and 3284 respectively; therefore, for the PERSON dataset, the probability of cross-document co-reference is approximately 22%. We consider this figure only partially indicative, and it is very likely to increase after the inspection of bigger corpora. This is an a posteriori probability, because the number of collection entities is known only after the whole set of mentions has been processed.

A global estimator of the difficulty of the co-reference task is the expectation that a correctly identified mention refers to a new entity. This estimator shows the density of collection entities in the mention space: let us call it co-reference density. We can estimate the co-reference density as the ratio between the number of different entities and the number of mentions:

$$\mathit{coref\text{-}density} = \frac{\#\ \text{collection entities}}{\#\ \text{mentions}}$$

The co-reference density takes values in the interval [0, 1]. When it tends to 0, all the mentions refer to the same entity; when it tends to 1, each mention in the collection refers to a different entity. Both limits render the co-reference task superfluous. The co-reference density we found in our corpus is 2574/7233 ≈ 0.35, which is far from either extreme.

A measure that can be used as a baseline for the co-reference task is the value of the co-reference density conditioned by the fact that one knows in advance whether two identical mentions also co-refer; let us call this measure pseudo-co-reference density. It shows the maximum accuracy of a system that deals with ambiguity by ignoring it. Since this information is not directly expressed in the collection, it has to be approximated; we approximate the pseudo-co-reference density as the ratio between the number of different entities and the number of distinct mentions:

$$\mathit{p\text{-}coref\text{-}density} = \frac{\#\ \text{collection entities}}{\#\ \text{distinct mentions}}$$

The pseudo-co-reference density for our dataset is 2574/4851 ≈ 0.53. The difference between co-reference density and pseudo-co-reference density shows the increase in recall obtained if one considers that two identical mentions refer to the same entity with probability 1. On the other hand, the loss in accuracy might be too large (consider for example the case when two different persons happen to have the same first name).

As computed above, P(co-ref) ≈ 0.22, which means that 22% of the document entities occur in more than one document. The detailed distribution is presented in Table 11, where for each number of documents we list the number of collection entities occurring in exactly that many documents.

#documents | #collection entities
1 | 2155 (84%)
2 | 286 (11%)
3 | 71 (2%)
4 | 31 (1%)
5 | 15 (0.5%)
6 | 6
7 | 3
8 | 4
9 | 1
16 | 1

Table 11. Cross-document co-reference
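All three estimators of this subsection follow directly from the counts in Table 2; as a worked summary (our own arithmetic):

```python
collection_entities = 2574   # from Table 2
document_entities = 3284
mentions = 7233
distinct_mentions = 4851

p_coref = 1 - collection_entities / document_entities            # ~0.22
coref_density = collection_entities / mentions                   # ~0.35
pseudo_coref_density = collection_entities / distinct_mentions   # ~0.53
```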
VII. CONCLUSION

We have presented the results of a pilot study on Ontology Population restricted to PERSON entities. One of the main motivations of the study was to individuate the critical factors that determine the difficulty of the task.

The first conclusion we draw is that textual mentions of PERSON entities are highly structured. As a matter of fact, most of the mentions carry information that can be easily classified into a limited number of attributes, while only 3% of them are categorized as MISCELLANEA. These figures strongly suggest that the Ontology Population from Textual Mentions (OPTM) approach is feasible and promising.

Secondly, we showed that 50% of the mentions carry more than the value of a single attribute. This fact, combined with the relatively low perplexity figures for some attributes, most notably LAST_NAME, suggests a co-reference procedure based on attribute values.

Thirdly, we have computed the values of three estimators of the difficulty of entity co-reference. One of them, the pseudo-co-reference density, might naturally be used as a baseline for the task. It has also been found that the co-reference density is far from its possible extremes, 0 and 1, showing that simple string matching procedures might not achieve good results.

Our future work will focus on two main issues: (i) the use of the PERSON dataset as a training corpus for resolving the entity co-reference task, as a first step towards implementing a full OPTM system; and (ii) a controlled extension of the dataset with new data, in order to understand which figures are likely to remain stable.

REFERENCES

1. Almuhareb, A., Poesio, M. (2004). Attribute-based and value-based clustering: An evaluation. In: Proceedings of EMNLP 2004, Barcelona, 2004, 158-165.
2. Avancini, H., Lavelli, A., Magnini, B., Sebastiani, F., Zanoli, R. (2003). Expanding Domain-Specific Lexicons by Term Categorization. In: Proceedings of SAC 2003, 793-797.
3. Bontcheva, K., Cunningham, H. (2003). The Semantic Web: A New Opportunity and Challenge for HLT. In: Proceedings of the Workshop on HLT for the Semantic Web and Web Services at ISWC 2003, Sanibel Island, 2003.
4. Bouquet, P., Serafini, L., Zanobini, S. (2003). Semantic coordination: a new approach and an application. In: Second International Semantic Web Conference, volume 2870 of Lecture Notes in Computer Science, 130-145. Springer Verlag, September 2003.
5. Buitelaar, P., Cimiano, P., Magnini, B. (Eds.) (2005). Ontology Learning from Text: Methods, Evaluation and Applications. IOS Press, Amsterdam.
6. Ferro, L., Gerber, L., Mani, I., Sundheim, B., Wilson, G. (2005). TIDES 2005 Standard for the Annotation of Temporal Expressions. Technical report, MITRE.
7. Lavelli, A., Magnini, B., Negri, M., Pianta, E., Speranza, M., Sprugnoli, R. (2005). Italian Content Annotation Bank (I-CAB): Temporal Expressions (V. 1.0). Technical Report T-0505-12, ITC-irst, Trento.
8. Lin, D. (1998). Automatic Retrieval and Clustering of Similar Words. In: Proceedings of COLING-ACL 98, Montreal, Canada, 1998.
9. Linguistic Data Consortium (2004). ACE (Automatic Content Extraction) English Annotation Guidelines for Entities, version 5.6.1 2005.05.23.
10. Magnini, B., Pianta, E., Girardi, C., Negri, M., Romano, L., Speranza, M., Bartalesi Lenzi, V., Sprugnoli, R. (2006a). I-CAB: the Italian Content Annotation Bank. In: Proceedings of LREC 2006, Genova, Italy.
11. Magnini, B., Pianta, E., Popescu, O., Speranza, M. (2006b). Ontology Population from Textual Mentions: Task Definition and Benchmark. In: Proceedings of the OLP2 Workshop on Ontology Population and Learning, Sydney, Australia, 2006. Joint with ACL/COLING 2006.
12. Tanev, H., Magnini, B. (2006). Weakly Supervised Approaches for Ontology Population. In: Proceedings of EACL 2006, Trento, 3-7 April 2006.
13. Velardi, P., Navigli, R., Cucchiarelli, A., Neri, F. (2005). Evaluation of OntoLearn, a Methodology for Automatic Population of Domain Ontologies. In: Buitelaar, P., Cimiano, P., Magnini, B. (eds.): Ontology Learning from Text: Methods, Evaluation and Applications, IOS Press, Amsterdam, 2005.