From Mentions to Ontology: A Pilot Study

Octavian Popescu, Bernardo Magnini, Emanuele Pianta, Luciano Serafini, Manuela Speranza and Andrei Tamilin
ITC-irst, 38050, Povo (TN), Italy

Abstract—In this paper we propose a pilot study aimed at an in-depth comprehension of the phenomena underlying Ontology Population from text. The study has been carried out on a collection of Italian news articles, which have been manually annotated at several semantic levels. More specifically, we have annotated all the textual expressions (i.e. mentions) referring to persons; each mention has in turn been decomposed into a number of attribute/value pairs; co-reference relations among mentions have been established, resulting in the identification of entities, which, finally, have been used to populate an ontology. There are two significant results of such a study. First, a number of factors have been empirically identified which determine the difficulty of Ontology Population from text and which can now be taken into account while designing automatic systems. Second, the resulting dataset is a valuable resource for training and testing single components of Ontology Population systems.

I. INTRODUCTION

In this paper we propose an empirical investigation into the relations between language and knowledge, aiming at the definition of a computational framework for automatic Ontology Population (OP) from text.

While Ontology Population from text has received increasing attention in recent years (see, for instance, Buitelaar et al. 2005), mostly due to its strong relationship with the Semantic Web perspective, very little has been done in order to provide a clear definition of the task and to establish shared evaluation procedures and benchmarks. In this paper we propose a pilot study aimed at an in-depth comprehension of the phenomena underlying Ontology Population from Text (OPTM).
Specifically, we are interested in highlighting the following aspects of the task:
• What are the major sources of difficulty of the task?
• How does OP from text relate to well-known tasks in Natural Language Processing, such as Named Entity Recognition?
• What kinds of reasoning capabilities are crucial for the task?
• Is there any way to simplify the task so that it can be addressed in a modular way?
• Can we devise useful metrics to evaluate system performance?

We addressed the above questions through a pilot study on a limited amount of textual data. We added two restrictions with respect to the general OP task: first, we considered textual mentions instead of full text; second, we focused on information related to PERSON entities instead of considering all possible entity types (e.g. ORGANIZATION, LOCATION, etc.).

Mentions, as defined within the ACE (Automatic Content Extraction) [1] Entity Detection Task (Linguistic Data Consortium, 2004), are portions of text that refer to entities. As an example, given a particular textual context, the two mentions "George W. Bush" and "the U.S. President" refer to the same entity, i.e. a particular instance of PERSON whose first name is "George", whose middle initial is "W.", whose family name is "Bush" and whose role is "President of the U.S.".

[1] http://www.nist.gov/speech/tests/ace

As for PERSON entities, they were selected for our pilot study because they occur very frequently in the news document collection we analyzed. Most of the results we obtained, however, are likely to generalize to the other types of entities.

Given the above-mentioned restrictions, the contribution of this paper is a thorough study of Ontology Population from Textual Mentions (OPTM). We have manually extracted a number of relevant details concerning entities of type PERSON from the document collection and then used them to populate a small pre-existing ontology. This study led to two significant results. First, a number of factors have been empirically identified which determine the difficulty of Ontology Population from text and which can now be taken into account while designing automatic systems. Second, the resulting dataset is a valuable resource for training and testing single components of Ontology Population systems.

We show that the difficulty of the OPTM task is directly correlated with two factors: (A) the difficulty of identifying attribute/value pairs inside a given mention and (B) the difficulty of establishing co-reference between entities based on the values of their attributes.

There are several advantages of OPTM that make it appealing as an approach to OP. First, mentions provide an obvious simplification with respect to the more general task of Ontology Population from text (cf. Buitelaar et al. 2005); in addition, mentions are well defined and there are systems for automatic mention recognition which can provide the input for the task. Second, since mentions have been introduced as an evolution of the traditional Named Entity Recognition task (see Tanev and Magnini, 2006), they guarantee a reasonable level of complexity, which makes OPTM challenging both for the Computational Linguistics and the Knowledge Representation communities. Third, there already exist data annotated with mentions, delivered under the ACE initiative (Ferro et al. 2005, Linguistic Data Consortium 2004), which make it possible to exploit machine learning approaches.

The availability of annotated data also allows for a better estimation of the performance of OPTM; in particular, it is possible to evaluate the recall of the task, i.e. the proportion of information correctly assigned to an entity out of the total amount of information provided by a certain mention.
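Stated compactly (our own formulation of the definition just given, not notation taken from the ACE guidelines), the recall of OPTM for a given mention can be written as:

$$\mathit{recall} = \frac{\#\ \text{attribute/value pairs correctly assigned to the entity}}{\#\ \text{attribute/value pairs expressed by the mention}}$$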
The paper is structured as follows. Section II provides some background on Ontology Population and reports on relevant related work; Section III describes the dataset of the pilot study and compares it to the ACE dataset; Section IV introduces a new methodology for the semantic annotation of attribute/value pairs within textual mentions; Section V describes the ontology we use. Finally, Section VI reports on a quantitative and qualitative analysis of the data, which helps determine the main sources of difficulty of the task. Conclusions are drawn in Section VII.

II. RELATED WORK

Automatic Ontology Population (OP) from texts has recently emerged as a new field of application for knowledge acquisition techniques (Buitelaar et al., 2005). Although there is no commonly accepted definition of the task, a useful approximation has been suggested by Bontcheva and Cunningham (2003) as Ontology Driven Information Extraction, with the goal of extracting and classifying instances of concepts and relations defined in an ontology, in place of filling a template. A similar task has been approached from a variety of perspectives, including term clustering (Lin, 1998; Almuhareb and Poesio, 2004) and term categorization (Avancini et al., 2003). A rather different task is Ontology Learning, where new concepts and relations are supposed to be acquired, with the consequence of changing the definition of the ontology itself (Velardi et al. 2005).

The interest in OP is also reflected in the large number of research projects which consider knowledge extraction from text a key technology for feeding Semantic Web applications. Among such projects, it is worth mentioning Vikef (Making the Semantic Web Fly), whose main aim is to bridge the gap between the implicit information expressed in scientific documents and its explicit representation in knowledge bases, and Parmenides, which attempts to develop technologies for the semi-automatic building and maintenance of domain-specific ontologies.

The work presented in this paper has been inspired by the ACE Entity Detection task, which requires that the entities mentioned in a text (e.g. PERSON, ORGANIZATION, LOCATION and GEO-POLITICAL ENTITY) be detected. As the same entity may be mentioned more than once in the same text, ACE defines two inter-connected levels of annotation: the level of the entity, which provides a representation of an object in the world, and the level of the entity mention, which provides information about the textual references to that object. The information contained in the textual references to entities may be translated into a knowledge base, and eventually into an ontology.

III. DATA SET

The input of OPTM consists of textual mentions derived from the Italian Content Annotation Bank (I-CAB), which consists of 525 news documents taken from the local newspaper 'L'Adige' [2], for a total of around 180,000 words (Magnini et al., 2006a). The annotation of I-CAB has been carried out manually within the Ontotext project [3], following the ACE annotation guidelines for the Entity Detection task. I-CAB is annotated with expressions of type TEMPORAL_EXPRESSION and with four types of entities: PERSON, ORGANIZATION, GEO-POLITICAL ENTITY and LOCATION. Due to the morpho-syntactic differences between the two languages, the ACE annotation guidelines for English had to be adapted to Italian; for instance, two specific new tags, PROCLIT and ENCLIT, have been created to annotate clitics attached to the beginning or the end of certain words (e.g. "vederlo" / to see him).

[2] http://www.ladige.it/
[3] http://tcc.itc.it/projects/ontotext/index.html

According to the ACE definition, entity mentions are portions of text referring to entities; the extent of such a portion of text consists of an entire nominal phrase, thus including modifiers, prepositional phrases and dependent clauses (e.g. "the researcher who works at ITC-irst").

Mentions are classified according to four syntactic categories: NAM (proper names), NOM (nominal constructions), PRO (pronouns) and PRE (modifiers).

Fig. 1. Distribution of the four different ACE mention types in I-CAB and in the ACE 2004 Evaluation corpus (Newswire). [Bar chart: percentage of mentions per syntactic category (NAM, NOM, PRE, PRO) in the two corpora.]

In spite of the adaptations to Italian, it is interesting to notice that a comparison between I-CAB and the newswire portion of the ACE 2004 Evaluation corpus (see Figure 1) shows a similar proportion of NAM and NOM mentions in the two corpora. On the other hand, there is a low percentage of PRO mentions in Italian, which can be explained by the fact that, unlike in English, subject pronouns in Italian can be omitted. As for the large difference in the total number of mentions annotated in the two corpora (22,500 and 5,186 in I-CAB and ACE NWIRE respectively), this is proportional to their size (around 180,000 words for I-CAB and 25,900 words for ACE NWIRE), considering that some of the ACE entity types (i.e. FACILITY, VEHICLE, and WEAPON) are not annotated in I-CAB.
As shown in Figure 2, the two corpora also present a similar distribution as far as the number of mentions per entity is concerned. In fact, in both cases more than 60% of the entities are mentioned only once, while around 15% are mentioned twice. Between 10% and 15% are mentioned three or four times, while around 6% are mentioned between five and eight times. The fact that the percentage of entities mentioned more than eight times in a document is higher in the ACE corpus than in I-CAB can be partly explained by the fact that the news stories in ACE are on average slightly longer (around 470 versus 350 words per document).

Fig. 2. Intra-document co-reference in I-CAB and in the ACE 2004 Evaluation corpus (Newswire). [Bar chart: percentage of total entities mentioned 1, 2, 3-4, 5-8, and more than 8 times in a document.]

IV. ATTRIBUTES FOR TYPE PERSON

After the annotation of mentions of type PERSON reported in the previous section, each mention was additionally annotated in order to individuate the semantic information expressed by the mention about a specific entity. As an example, given the mention "the Italian President Ciampi", the following attribute/value pairs were annotated: [PROVENANCE: Italian], [ROLE: President] and [LAST_NAME: Ciampi].

The definition of the set of attributes for PERSON followed an iterative process in which we considered increasing amounts of mentions, from which we derived the relevant attributes. The final set of attributes is listed in the first column of Table 1, with respective examples reported in the second column.

attribute | example values
FIRST_NAME | Ralph, Greg
MIDDLE_NAME | J., W.
LAST_NAME | McCarthy, Newton
NICKNAME | Spider, Enigmista
TITLE | Prof., Mr.
SEX | actress
ACTIVITY | author, doctor
AFFILIATION | The New York Times
ROLE | manager, president
PROVENANCE | South American
FAMILY_RELATION | father, cousin
AGE_CATEGORY | boy, girl
HONORARY | the world champion 2000
MISCELLANEA | The men with red shoes

Table 1. The attribute structure of PERSON

A strict methodology is required in order to ensure accurate annotation. As a general guideline, articles and prepositions are not admitted at the beginning of the textual extent of a value, an exception being made for the articles in nicknames (see Magnini et al., 2006b for a full description of the criteria used to decide on border cases).

Attributes can be grouped into bigger units, as in the case of the attribute JOB, which is composed of three attributes, ACTIVITY, ROLE, and AFFILIATION, which are not independent of each other. ACTIVITY refers to the actual activity performed by the person, while ROLE refers to the position they occupy: so, for instance, "politician" is a possible value of the attribute ACTIVITY, while "leader of the Labour Party" refers to the ROLE a person plays inside an organization. Each group of these three attributes is associated with a mention, and all the information within a group has to be derived from the same mention. If different pieces of information derive from distinct mentions, we will have two separate groups. For instance, the three co-referring mentions "the journalist of Radio Liberty", "the editor of breaking news", and "a spare-time astronomer" lead to three different groups of ACTIVITY, ROLE and AFFILIATION; the obvious inference that the first two mentions belong conceptually to the same group is not drawn, as this step is to be taken at a further stage.
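To make the annotation scheme concrete, the sketch below shows one possible in-memory representation of an annotated mention, with ACTIVITY, ROLE and AFFILIATION tied together in per-mention JOB groups. The class and field names are our own illustration, not part of the annotation tools used in the project:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class JobGroup:
    """ACTIVITY, ROLE and AFFILIATION values derived from one and the same mention."""
    activity: Optional[str] = None
    role: Optional[str] = None
    affiliation: Optional[str] = None

@dataclass
class PersonMention:
    """One textual mention of a PERSON, decomposed into attribute/value pairs."""
    extent: str
    attributes: dict = field(default_factory=dict)  # e.g. {"LAST_NAME": "Ciampi"}
    job_groups: list = field(default_factory=list)  # JOB groups found in this mention

# The example from the text: "the Italian President Ciampi"
m1 = PersonMention(
    extent="the Italian President Ciampi",
    attributes={"PROVENANCE": "Italian", "LAST_NAME": "Ciampi"},
    job_groups=[JobGroup(role="President")],
)

# Co-referring mentions still yield separate JOB groups: the inference that
# "the journalist of Radio Liberty" and "the editor of breaking news"
# describe the same job is deliberately not drawn at this stage.
m2 = PersonMention(
    extent="the journalist of Radio Liberty",
    job_groups=[JobGroup(activity="journalist", affiliation="Radio Liberty")],
)
```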
We started with the set of 525 documents belonging to the I-CAB corpus (see Section III), for which we manually annotated all PERSON entities (10039 mentions, see Table 2). The annotation individuates both the entities mentioned within a single document, called document entities, and the entities mentioned across the whole set of news stories, called collection entities. In addition, for the purposes of this work, we decided to filter out the following mentions: (i) mentions consisting only of one non-gender-discriminative pronoun; (ii) nested mentions, i.e. when a mention contains a smaller one (for example "the president Ciampi", with "Ciampi" being the included mention), only the largest mention was considered. In this way we obtained a set of 7233 mentions, which represents the object of our study.

Number of documents | 525
Number of mentions | 10039
Number of meaningful mentions | 7233
Number of distinct meaningful mentions | 4851
Number of document entities | 3284
Number of collection entities | 2574

Table 2. The PERSON dataset

The average number of meaningful mentions for an entity in a certain document is 2.20, while the average number of distinct meaningful mentions is 1.47. However, the variation around the average is high: only 14% of the document entities are mentioned exactly twice. In fact, there are relatively few entities whose mentions in the news have a broad coverage in terms of attributes, and there are quite a few whose mentions contain just the name. A detailed analysis is carried out in Section VI.

V. ONTOLOGY

The ontology adopted for the OPTM task is composed of two main parts. The first part mirrors the mention attribute structure and contains axioms (restrictions) on the attribute values. In this part, which we refer to as the Entity T-Box (ET-box), we define three main classes corresponding to the three main entity types, PERSON, ORGANIZATION and GEO-POLITICAL ENTITY. Each of these classes is associated with the mention attributes. An example of how the attributes are encoded as axioms in the ET-box is provided in Table 3.

ONTOLOGY AXIOM | Encoded restriction
PERSON ⊆ (>0) HAS_FIRST_NAME | Every person has at least a first name
PERSON ⊆ (=1) HAS_LAST_NAME | Every person has exactly one last name
DOMAIN(HAS_FIRST_NAME) = PERSON | The first argument of the relation HAS_FIRST_NAME must be a person
RANGE(HAS_PROVENANCE) = GEOPOLITICALENTITY | The second argument of the relation HAS_PROVENANCE must be a geopolitical entity

Table 3. Description of ontology axioms
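For readers more familiar with description-logic notation, the restrictions of Table 3 can be transcribed as follows (our own transcription; the paper does not fix a concrete DL syntax):

\begin{align*}
\mathsf{PERSON} &\sqsubseteq\; \geq 1\ \mathsf{HAS\_FIRST\_NAME}\\
\mathsf{PERSON} &\sqsubseteq\; =\!1\ \mathsf{HAS\_LAST\_NAME}\\
\exists\,\mathsf{HAS\_FIRST\_NAME}.\top &\sqsubseteq\; \mathsf{PERSON} \quad\text{(domain)}\\
\top &\sqsubseteq\; \forall\,\mathsf{HAS\_PROVENANCE}.\mathsf{GEOPOLITICALENTITY} \quad\text{(range)}
\end{align*}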
The second component of the ontology, called world knowledge (WK), encodes the basic knowledge about the world which is already available (see Table 4 for examples of axioms). This part has been semi-automatically constructed starting from the large amount of basic information available on the web. Examples of such knowledge are the sets of countries, main cities, country capitals, Italian municipalities, etc.

ONTOLOGY AXIOM | Encoded restriction
COUNTRY(Italy) | Italy is a country
HAS_CAPITAL(Italy, Rome) | Rome is the capital of Italy
COUNTRY ⊆ GEOPOLITICALENTITY | A country is a geopolitical entity
TOWN ⊆ GEOPOLITICALENTITY | A town is a geopolitical entity

Table 4. Description of ontology axioms related to WK

As can be seen from the above examples, WK is composed of two types of knowledge: factual knowledge (the first two axioms in Table 4) and generic commonsense knowledge. The first type of knowledge can be obtained from the many ontological resources available on the web (see for instance swoogle.umbc.edu), while we have manually encoded the second in the ontology.

The process of OPTM combines the ontology ET-box with the WK axioms and with the values of the attributes recognized in textual mentions, and performs two main steps:
1. For each entry recognized in the text we create a new individual in the ontology, along with the individuals corresponding to the attribute values.
2. We normalize the values by comparing the "string" values with the individuals present in the WK.

As an example of this process, consider the entry in Table 5.

FIRST_NAME | Bob, B.
LAST_NAME | Marley
PROVENANCE | Caribbean
ACTIVITY | musician, guitar player

Table 5. Attribute/value examples

In the first phase we add the axioms in Table 6 to the ontology.

Person(person23)
HAS_FIRST_NAME(person23, first_name56)
HAS_FIRST_NAME(person23, first_name76)
HAS_LAST_NAME(person23, last_name93)
HAS_PROVENANCE(person23, geo_pol_entity35)
HAS_ACTIVITY(person23, activity43)
HAS_ACTIVITY(person23, activity44)
HAS_VALUE(first_name56, "Bob")
HAS_VALUE(first_name76, "B.")
HAS_VALUE(last_name93, "Marley")
HAS_VALUE(geo_pol_entity35, "Caribbean")
HAS_VALUE(activity43, "musician")
HAS_VALUE(activity44, "guitar player")

Table 6. Adding axioms to the ontology

In the second phase, we attempt to match the values to the individuals in the WK, and the ontology is modified according to the result of the matching process. This process is based on the semantic matching approach described in (Bouquet et al., 2003).

In this phase the WK part of the ontology takes a crucial role. The main goal of this phase is to find the best match between the values of an attribute and the individuals which are already present in the WK A-box. This process can have two outcomes. When a good-enough match is found between an attribute value and an individual of the WK A-box, an equality assertion is added. Suppose, for instance, that the WK A-box contains the statement

STATE(Caribbean)

Then the mapping process will find a high match between the value "Caribbean" (as a string) and the individual Caribbean (due to the syntactic similarity between the two strings, and to the fact that both are associated with individuals of type GEOPOLITICALENTITY). As a consequence, the assertion

geo_pol_entity35 = Caribbean

is asserted in the A-box. Notice that the above assertion connects an individual of the WK with the value of an entity contained in the entity repository of the mentions.

When the mapping process does not produce a "good" mapping (where "good" is defined with respect to a suitable distance measure not described here), the value is transformed into an individual and added to the WK A-box. For instance, suppose that the mapping of the value "guitar player" does not produce a good matching value; then the new assertion

ACTIVITY(GuitarPlayer)

is added to the WK A-box, and the assertion

activity44 = GuitarPlayer

is added to the A-box that links the WK with the A-box of the mentions.
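The two phases can be summarized in code. The sketch below is a minimal illustration of the process just described, with a plain string-similarity ratio standing in for the semantic matching of (Bouquet et al., 2003); all names (WKABox, populate, the 0.8 threshold) are hypothetical:

```python
from difflib import SequenceMatcher
import itertools

_ids = itertools.count()

def similarity(a: str, b: str) -> float:
    """Naive stand-in for the semantic matching of (Bouquet et al., 2003)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

class WKABox:
    """Toy world-knowledge A-box: typed individuals, e.g. {"Caribbean": "STATE"}."""
    def __init__(self, individuals):
        self.individuals = individuals

    def best_match(self, value):
        scored = [(similarity(value, ind), ind) for ind in self.individuals]
        return max(scored, default=(0.0, None))

def populate(entry, wk, assertions, threshold=0.8):
    """Phase 1: create an individual for the entry and for each attribute value.
    Phase 2: normalize every value against the WK A-box."""
    person = f"person{next(_ids)}"
    assertions.append(("Person", person))
    for attr, values in entry.items():
        for value in values:
            node = f"{attr.lower()}{next(_ids)}"
            assertions.append((f"HAS_{attr}", person, node))
            assertions.append(("HAS_VALUE", node, value))
            score, individual = wk.best_match(value)
            if individual is not None and score >= threshold:
                # good-enough match: add an equality assertion to the A-box
                assertions.append(("EQUAL", node, individual))
            else:
                # no good match: promote the value to a new WK individual
                new_individual = value.title().replace(" ", "").replace(".", "")
                wk.individuals[new_individual] = attr
                assertions.append(("EQUAL", node, new_individual))

wk = WKABox({"Caribbean": "STATE", "Italy": "COUNTRY"})
facts = []
populate({"FIRST_NAME": ["Bob", "B."], "LAST_NAME": ["Marley"],
          "PROVENANCE": ["Caribbean"], "ACTIVITY": ["musician", "guitar player"]},
         wk, facts)
# "Caribbean" matches the existing WK individual Caribbean; "guitar player"
# becomes the new individual GuitarPlayer, mirroring the example in the text.
```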
VI. PERSON DATASET ANALYSIS

The difficulty of the OPTM task is directly correlated with two factors: (A) the difficulty of identifying the attribute/value pairs inside a given mention and (B) the difficulty of establishing the co-reference of entities based on the values of their attributes.

In Table 7 we report the distribution of the values of the attributes defined for PERSON. The first column lists the set of attributes; the second lists the number of occurrences of each attribute; the third lists the number of different values that the attribute actually takes; the fourth lists the number of collection entities which have that attribute. Using this table as a base, we try to determine the parameters which give us clues on the two factors above.

Attribute | Occurrences of attribute in mentions | Different values for attribute | Collection entities with attribute | Distinct values within distinct mentions | Variability of values in attribute
FIRST_NAME | 2299 (31%) | 676 | 1592 | 13% | 29%
MIDDLE_NAME | 110 (1%) | 67 | 74 | 1% | 60%
LAST_NAME | 4173 (57%) | 1906 | 2191 | 39% | 45%
NICKNAME | 73 (1%) | 44 | 41 | 0% | 60%
TITLE | 73 (1%) | 25 | 47 | 0% | 34%
SEX | 3658 (50%) | 1864 | 1743 | 38% | 50%
ACTIVITY | 973 (13%) | 322 | 569 | 6% | 33%
AFFILIATION | 566 (7%) | 389 | 409 | 8% | 68%
ROLE | 531 (7%) | 211 | 317 | 4% | 39%
PROVENANCE | 469 (6%) | 226 | 367 | 4% | 48%
FAMILY_RELATION | 133 (1%) | 46 | 94 | 0% | 34%
AGE_CATEGORY | 307 (4%) | 106 | 163 | 2% | 34%
HONORARY | 69 (0%) | 63 | 53 | 1% | 91%
MISCELLANEA | 278 (3%) | 270 | 227 | 5% | 97%

Table 7. Distribution of values of attributes for PERSON
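Statistics such as those in Table 7 can be recomputed directly from the annotated mentions. The sketch below shows the counting logic for columns II-IV, under our assumption that mentions are available as (entity_id, {attribute: value}) records; the field names are ours:

```python
from collections import defaultdict

def attribute_stats(mentions):
    """mentions: list of (entity_id, {attribute: value}) records.
    Reproduces columns II-IV of Table 7: occurrences of each attribute,
    number of different values, and number of collection entities carrying it."""
    occurrences = defaultdict(int)
    values = defaultdict(set)
    entities = defaultdict(set)
    for entity_id, pairs in mentions:
        for attribute, value in pairs.items():
            occurrences[attribute] += 1
            values[attribute].add(value)
            entities[attribute].add(entity_id)
    return {a: {"occurrences": occurrences[a],
                "pct_of_mentions": round(100 * occurrences[a] / len(mentions)),
                "different_values": len(values[a]),
                "entities_with_attribute": len(entities[a])}
            for a in occurrences}

stats = attribute_stats([
    ("e1", {"FIRST_NAME": "Ralph", "LAST_NAME": "McCarthy"}),
    ("e1", {"LAST_NAME": "McCarthy", "ROLE": "manager"}),
    ("e2", {"FIRST_NAME": "Greg"}),
])
assert stats["LAST_NAME"]["different_values"] == 1
```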
A. Difficulty of identifying attribute/value pairs

The identification of attribute/value pairs requires the correct decomposition of the mentions into non-overlapping parts, each one carrying the value of one attribute. We are interested in estimating the distribution of attributes inside the mentions. Table 8 shows, in the second and fourth columns, the number of mentions which contain respectively 1, 2, 3, ..., 12 attributes. As we can see, the number of mentions having more than 6 attributes is insignificant. On the other hand, the number of mentions containing more than one attribute is 3564, which represents 49.27% of the total; therefore one mention in two is a complex mention. Usually, a complex mention contains a SEX value, therefore a two-attribute mention practically has just one attribute that might help in establishing co-reference. However, 92% of the mentions with up to 5 attributes, which cover 96% of all mentions, contain a NAME attribute, which, presumably, is an important piece of evidence in deciding on co-reference.

#attributes | #mentions | #attributes | #mentions
1 | 3669 (50%) | 7 | 34 (0.5%)
2 | 1292 (17%) | 8 | 19
3 | 1269 (17%) | 9 | 4
4 | 486 (6%) | 10 | 4
5 | 310 (4%) | 11 | 0
6 | 146 (2%) | 12 | 0

Table 8. Number of attributes carried by mentions

The difficulty of correctly identifying the attribute/value pairs is directly linked to the complexity of a mention. All the values inside a mention belong to the same entity; since one mention in two is complex, a system that does not recognize the correct frontiers of complex mentions treats virtually 50% of the cases badly.

A second difficulty in correctly identifying the attribute/value pairs comes from the combinatorial capacity of attributes inside a complex mention: if the diversity of attribute patterns in complex mentions is high, then the difficulty of their recognition is also high. Table 9 shows that the whole set of attributes is very well represented in the complex mentions and, interestingly, the number of attributes varies independently of the number of mentions; therefore their combinatorial capacity is high, and the difficulty of their recognition varies accordingly.

attribute | 2-attribute mentions | 3-attribute mentions | 4-attribute mentions
FIRST_NAME | 398 | 915 | 413
MIDDLE_NAME | 5 | 20 | 34
LAST_NAME | 467 | 1025 | 426
NICKNAME | 27 | 16 | 2
TITLE | 14 | 16 | 13
SEX | 806 | 1240 | 501
ACTIVITY | 273 | 135 | 413
AFFILIATION | 82 | 91 | 80
ROLE | 126 | 81 | 94
PROVENANCE | 81 | 134 | 156
FAMILY_RELATION | 76 | 24 | 103
AGE_CATEGORY | 139 | 62 | 12
HONORARY | 20 | 7 | 31
MISCELLANEA | 80 | 59 | 11

Table 9. Distribution of attributes into complex mentions

Recognizing certain types of attributes is probably more important than recognizing others. If the occurrence of a new value of an important attribute is a rare event, a system must be very precise in catching these cases, and we may assume that a high precision is more difficult to achieve than a lower one. The column of distinct values gives us a clue on this issue: for example, the relatively low figures for ACTIVITY, AFFILIATION and ROLE, combined with their importance for the OPTM task, tell us that sparseness could be an issue, and that these attributes therefore require a high-precision treatment; otherwise it will be hard to achieve the expected results.

The distribution of attributes inside mentions is presented in the second column of Table 7, in parentheses. The figures give the probability that a person is mentioned by making reference to a certain attribute: for example, one may expect the LAST_NAME attribute to be present in 57% of the mentions, and the NICKNAME attribute in 1% of the total. In the fifth column we compute the same figures without repetition, considering distinct values and distinct mentions. Considering also the figures that show the linguistic variability of values, we may estimate the probability of seeing a previously unseen value of a given attribute. The last column of Table 7 shows the variability of values for each attribute: for example, taking a mention with a FIRST_NAME at random, only in 29% of the cases is that value seen in the dataset just once.

The fifth column, distinct values within distinct mentions, and the sixth, variability of values in attribute, offer insight into the difficulty of recognizing attribute/value pairs. The variability might be considered representative of the amount of training a system needs in order to reach a satisfactory coverage of cases. Intuitively, some of the attributes are closed classes, while other attributes, e.g. those which take name values, are open classes.

Finally, we may notice that 39% of the mentions carry information other than SEX and name-related values, MISCELLANEA excluded. Therefore, in all those cases the ontology is enriched with substantial information about the respective persons.
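The variability column discussed above can be read as a hapax ratio. A minimal computation, under our reading that variability is the share of an attribute's occurrences whose value appears only once in the dataset:

```python
from collections import Counter

def variability(occurrences):
    """occurrences: all values observed for one attribute, with repetitions.
    Returns the share of occurrences whose value appears exactly once,
    i.e. our reading of the 'variability' column of Table 7."""
    counts = Counter(occurrences)
    hapax = sum(1 for value in occurrences if counts[value] == 1)
    return hapax / len(occurrences) if occurrences else 0.0

# One of the three observed values ("Greg") is seen just once:
assert round(variability(["Ralph", "Greg", "Ralph"]), 2) == 0.33
```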
B. Difficulty of establishing co-reference among entities

The task of correctly identifying the value of a certain attribute inside a given mention is worth undertaking if the respective values play a role in other tasks, especially in the co-reference task. A relevant factor for co-reference is the perplexity of an attribute, i.e. the percentage of its values that do not uniquely identify an entity, computed as the complement of the ratio between the number of distinct values for a certain attribute and the number of collection entities having that attribute (1 minus column III / column IV in Table 7). For example, the perplexity of LAST_NAME is 14% (see Table 10); therefore, if we take some values of LAST_NAME at random, 86% of them point to just one person. In the case of SEX and MISCELLANEA the perplexity is not defined.

attribute | perplexity
FIRST_NAME | 58%
MIDDLE_NAME | 10%
LAST_NAME | 14%
NICKNAME | 0%
TITLE | 47%
SEX | -
ACTIVITY | 44%
AFFILIATION | 5%
ROLE | 34%
PROVENANCE | 52%
FAMILY_RELATION | 39%
AGE_CATEGORY | 35%
HONORARY | 0%
MISCELLANEA | -

Table 10. Perplexity of PERSON attributes

By comparing the perplexity of LAST_NAME and MIDDLE_NAME one might erroneously conclude that the latter is more discriminative. This effect is due to the small number of examples of MIDDLE_NAME values within the PERSON dataset. Considering the occurrences of one attribute independently of the others, we may apply the usual rule of thumb for a Bernoulli distribution: it is highly likely that the perplexity of FIRST_NAME, LAST_NAME, ACTIVITY, AFFILIATION, ROLE and PROVENANCE will not change with the addition of new examples, as the observed counts are already high.
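Reading the perplexity as the complement of the ratio between columns III and IV of Table 7 reproduces the figures of Table 10; a quick check (our own arithmetic, with negative ratios floored at zero):

```python
def perplexity(different_values, entities_with_attribute):
    """1 - (different values / collection entities), floored at zero."""
    return max(0.0, 1 - different_values / entities_with_attribute)

print(round(perplexity(676, 1592), 2))   # FIRST_NAME -> 0.58, as in Table 10
print(round(perplexity(1906, 2191), 2))  # LAST_NAME  -> 0.13 (reported as 14%)
print(round(perplexity(44, 41), 2))      # NICKNAME   -> 0.0 (more values than entities)
```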
We can estimate the probability that two entities selected from different documents co-refer. Actually, this is an estimate of the probability that two entities co-refer conditioned by the fact that they have been correctly identified inside the documents. We can compute such a probability as the complement of the ratio between the number of different entities and the number of document entities in the collection:

$$P(\mathit{co\text{-}ref}) = 1 - \frac{\#\ \text{collection entities}}{\#\ \text{document entities}}$$

From Table 2 we read these values as 2574 and 3284 respectively; therefore, for the PERSON dataset, the probability of cross-document co-reference is approximately 22%. We consider this figure only partially indicative, and it is very likely to increase after the inspection of bigger corpora. This is an a posteriori probability, because the number of collection entities is known only after the whole set of mentions has been processed.

A global estimator of the difficulty of the co-reference task is the expectation that a correctly identified mention refers to a new entity. This estimator shows the density of collection entities in the mention space: let us call it co-reference density. We can estimate the co-reference density as the ratio between the number of different entities and the number of mentions:

$$\mathit{coref\text{-}density} = \frac{\#\ \text{collection entities}}{\#\ \text{mentions}}$$

The co-reference density takes values in the interval [0, 1]. When it tends to 0, all the mentions refer to the same entity; when it tends to 1, each mention in the collection refers to a different entity. Both limits render the co-reference task superfluous. The co-reference density we found in our corpus is 2574/7233 ≈ 0.35, which is far from either extreme.

A measure that can be used as a baseline for the co-reference task is the value of the co-reference density conditioned by the fact that one knows in advance whether two identical mentions also co-refer; let us call this measure pseudo-co-reference density. It shows the maximum accuracy of a system that deals with ambiguity by ignoring it. Since this information is not directly expressed in the collection, it has to be approximated; we approximate the pseudo-co-reference density as the ratio between the number of different entities and the number of distinct mentions:

$$\mathit{p\text{-}coref\text{-}density} = \frac{\#\ \text{collection entities}}{\#\ \text{distinct mentions}}$$

The pseudo-co-reference density for our dataset is 2574/4851 ≈ 0.53. The difference between co-reference density and pseudo-co-reference density shows the increase in recall obtained if one considers that two identical mentions refer to the same entity with probability 1. On the other hand, the loss in accuracy might be too large (consider for example the case when two different persons happen to have the same first name).

As computed above, P(co-ref) ≈ 0.22, which means that 22% of the document entities occur in more than one document. The detailed distribution is presented in Table 11, where for each number of documents we list the number of collection entities occurring in exactly that many documents.

#documents | #collection entities
1 | 2155 (84%)
2 | 286 (11%)
3 | 71 (2%)
4 | 31 (1%)
5 | 15 (0.5%)
6 | 6
7 | 3
8 | 4
9 | 1
16 | 1

Table 11. Cross-document co-reference
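All three estimators of this subsection follow directly from the counts in Table 2; as a worked summary (our own arithmetic):

```python
collection_entities = 2574   # from Table 2
document_entities = 3284
mentions = 7233
distinct_mentions = 4851

p_coref = 1 - collection_entities / document_entities            # ~0.22
coref_density = collection_entities / mentions                   # ~0.35
pseudo_coref_density = collection_entities / distinct_mentions   # ~0.53
```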
VII. CONCLUSION

We have presented the results of a pilot study on Ontology Population restricted to PERSON entities. One of the main motivations of the study was to individuate the critical factors that determine the difficulty of the task.

The first conclusion we draw is that textual mentions of PERSON entities are highly structured. As a matter of fact, most of the mentions carry information that can be easily classified into a limited number of attributes, while only 3% of them are categorized as MISCELLANEA. These figures strongly suggest that the Ontology Population from Textual Mentions (OPTM) approach is feasible and promising.

Secondly, we showed that 50% of the mentions carry more than the value of a single attribute. This fact, combined with the relatively low perplexity figures for some attributes, most notably LAST_NAME, suggests a co-reference procedure based on attribute values.

Thirdly, we have computed the values of three estimators of the difficulty of entity co-reference. One of them, the pseudo-co-reference density, might naturally be used as a baseline for the task. It has also been found that the co-reference density is far from its possible extremes, 0 and 1, showing that simple string matching procedures might not achieve good results.

Our future work will focus on two main issues: (i) the use of the PERSON dataset as a training corpus for resolving the entity co-reference task, as a first step towards implementing a full OPTM system; and (ii) a controlled extension of the dataset with new data, in order to understand which figures are likely to remain stable.

REFERENCES

1. Almuhareb, A., Poesio, M. (2004). Attribute-based and value-based clustering: An evaluation. In: Proceedings of EMNLP 2004, Barcelona, 2004, 158-165.
2. Avancini, H., Lavelli, A., Magnini, B., Sebastiani, F., Zanoli, R. (2003). Expanding Domain-Specific Lexicons by Term Categorization. In: Proceedings of SAC 2003, 793-797.
3. Bontcheva, K., Cunningham, H. (2003). The Semantic Web: A New Opportunity and Challenge for HLT. In: Proceedings of the Workshop on HLT for the Semantic Web and Web Services at ISWC 2003, Sanibel Island, 2003.
4. Bouquet, P., Serafini, L., Zanobini, S. (2003). Semantic coordination: a new approach and an application. In: Second International Semantic Web Conference, volume 2870 of Lecture Notes in Computer Science, 130-145. Springer Verlag, September 2003.
5. Buitelaar, P., Cimiano, P., Magnini, B. (Eds.) (2005). Ontology Learning from Text: Methods, Evaluation and Applications. IOS Press, Amsterdam.
6. Ferro, L., Gerber, L., Mani, I., Sundheim, B., Wilson, G. (2005). TIDES 2005 Standard for the Annotation of Temporal Expressions. Technical report, MITRE.
7. Lavelli, A., Magnini, B., Negri, M., Pianta, E., Speranza, M., Sprugnoli, R. (2005). Italian Content Annotation Bank (I-CAB): Temporal Expressions (V. 1.0). Technical Report T-0505-12, ITC-irst, Trento.
8. Lin, D. (1998). Automatic Retrieval and Clustering of Similar Words. In: Proceedings of COLING-ACL 98, Montreal, Canada, 1998.
9. Linguistic Data Consortium (2004). ACE (Automatic Content Extraction) English Annotation Guidelines for Entities, version 5.6.1 2005.05.23.
10. Magnini, B., Pianta, E., Girardi, C., Negri, M., Romano, L., Speranza, M., Bartalesi Lenzi, V., Sprugnoli, R. (2006a). I-CAB: the Italian Content Annotation Bank. In: Proceedings of LREC 2006, Genova, Italy.
11. Magnini, B., Pianta, E., Popescu, O., Speranza, M. (2006b). Ontology Population from Textual Mentions: Task Definition and Benchmark. In: Proceedings of the OLP2 Workshop on Ontology Population and Learning, Sydney, Australia, 2006. Joint with ACL/COLING 2006.
12. Tanev, H., Magnini, B. (2006). Weakly Supervised Approaches for Ontology Population. In: Proceedings of EACL 2006, Trento, 3-7 April 2006.
13. Velardi, P., Navigli, R., Cucchiarelli, A., Neri, F. (2005). Evaluation of OntoLearn, a Methodology for Automatic Population of Domain Ontologies. In: Buitelaar, P., Cimiano, P., Magnini, B. (eds.): Ontology Learning from Text: Methods, Evaluation and Applications, IOS Press, Amsterdam, 2005.