Ontology Building Using Parallel Enumerative Structures

Mouna Kamel
Institut de Recherche en Informatique de Toulouse (IRIT) – CNRS – UPS
118, Route de Narbonne, 31062 Toulouse, France
(+33) 5 61 55 83 38
kamel@irit.fr

Bernard Rothenburger
Institut de Recherche en Informatique de Toulouse (IRIT) – CNRS – UPS
118, Route de Narbonne, 31062 Toulouse, France
(+33) 5 61 55 83 38
rothenburger@irit.fr

ABSTRACT
The semantics of a text is carried by both the natural language it contains and its layout. As ontology building processes have so far taken only plain text into consideration, our aim is to elicit its textual structure. We focus here on parallel enumerative structures because they bear implicit or explicit hierarchical relations, they have salient visual properties, and they are frequently found in corpora. We have defined a process which identifies them in a text, translates them into ontology structures and finally links such structures to the concepts of an existing ontology. We have assessed this process on Wikipedia encyclopaedic articles, as they are rich in definitions and statements and contain many enumerations. The many ontology structures we have obtained are thus used to enrich an ontology which we had automatically built from database specification documents.

Categories and Subject Descriptors
I.2.7 Natural Language Processing - Text analysis, I.2.6 Learning - Knowledge acquisition

General Terms
Algorithms, Documentation, Languages

Keywords
Ontology building and enrichment from text, layout analysis, NLP tools.

1. MOTIVATION
Many approaches have been suggested for the construction, enrichment or population of ontologies from text. They are based on lexical, syntactic, semantic or rhetorical aspects of natural language. They encompass machine learning [1], specific natural language processing tools [2], or a combination of both [3]. These methods are usually applied to plain text. However, a large variety of layouts or structures can be found in the visual presentation of a text, with a diversity of interpretations for each of them [4]. Some of them implicitly carry ontological knowledge, as shown in example 1. The meaning carried by this structure may be expressed through the sentence in example 2. In both cases, a human being may easily deduce the conceptual framework presented in figure 1.

Under IAU definitions, in the Solar System and in order of increasing distance from the Sun, there are eight planets:
• four terrestrials:
  - Mercury,
  - Venus,
  - Earth,
  - Mars.
• four gas giants:
  - Jupiter,
  - Saturn,
  - Uranus,
  - Neptune.

Example 1: a structure which carries ontological knowledge

Under IAU definitions, there are eight planets in the Solar System. In order of increasing distance from the Sun, they are the four terrestrials, Mercury, Venus, Earth, and Mars, then the four gas giants, Jupiter, Saturn, Uranus, and Neptune.

Example 2: a sentential representation of example 1

Figure 1. Conceptual network corresponding to the meaning of examples 1 and 2

In the case of sentence analysis (example 2), the automatic deduction of its formal counterpart by a Natural Language Processing (NLP) tool is a very tricky issue, which requires non-trivial tasks such as the resolution of anaphora or the design of sophisticated multi-sentence textual patterns.

However, for layout structure analysis (example 1), different parts of the knowledge are more easily identifiable thanks to lexical or typo-dispositional marks. We claim that it thus becomes easier to identify the corresponding conceptual network in an automated way. The above meaning-bearing layouts allow a straightforward identification of ontological relations: often hyperonymy, sometimes meronymy, and occasionally other relations.

We focus here on a specific kind of meaning-bearing layout that we call parallel enumerative structures (PES). Example 1 is typical of such a layout. These structures present some regularities and appear very frequently. Their analysis could be a relevant contribution to improving knowledge elicitation and modelling from text. Moreover, it would provide new triggers for the identification of new concepts or semantic relations, thereby making it possible to go beyond the classical ontology learning approaches, which only consider the plain text.

2. TRANSLATION PROCESS
An enumeration is a set of items with or without semantic relations between them. An item is a co-enumerated entity which can be discerned by typographic, dispositional and/or lexico-syntactic marks. A parallel enumeration is an enumeration which is paradigmatic (i.e. all items are functionally equivalent, textually or syntactically), visually homogeneous (i.e. all items are visually equivalent) and isolated (i.e. no item is linked to any textual unit outside the enumeration). An introductory phrase, hereafter called primer, is a phrase or a sentence which introduces an enumeration, and which is identifiable by lexico-syntactic and/or typo-dispositional marks. Finally, let us call parallel enumerative structure (PES) a vertical textual structure composed of a primer and a parallel enumeration.

Primer: There are a number of diseases and conditions affecting the gastrointestinal system, including:
Enumeration (items, each introduced by an item marker such as "1)"):
1) Cholera
2) Colorectal cancer
3) Diverticulitis

Figure 2. Composition of an enumerative structure

Broadly speaking, the idea is to translate a PES into a single ontology structure (i.e. a one- or two-level hierarchy) according to the following principles: (1) the primer contains one father concept and one semantic relation which links this father concept to the concepts contained in the items, (2) each item contains one child concept semantically related to the father concept of the primer, (3) all child concepts are considered as belonging to the same conceptual level. An example of this correspondence is the structure obtained in Figure 1 from example 1.

The syntactic structure of the primer helps to identify the father concept and the semantic relation it contains. We have characterized 3 cases:
• The primer is not syntactically correct.
  - The primer may consist of a noun phrase. This noun phrase represents the father concept and the semantic relation is the is-a relation.
  - The primer may end with a verb phrase in the active voice. The semantic class to which this verb belongs reflects the nature of the relation, and the father concept corresponds to the main term of the noun phrase which is the subject of this verb.
• The primer is complete. It contains a lexical unit taken from a gazetteer or a number which specifies the number of items. The father concept is the term which co-occurs with this lexical marker, and the relation is the is-a relation.
• The primer is syntactically correct and not complete. The father concept may be found in the subject noun phrase or in the object noun phrase of the main clause, and may possibly be detected thanks to heuristics. The relation is the is-a relation.

Our method consists of (1) identifying each enumerative structure and its different components (primer and items), (2) checking whether the enumeration is parallel, (3) identifying the father concept and the nature of the semantic relation, (4) extracting the child concepts from each item and (5) building an ontological structure. This fifth step is based on annotations produced over the four previous steps.

3. APPLICATION
Wikipedia documents are encyclopaedic and contain a lot of definitional statements and properties. Furthermore, articles are written according to a comprehensive set of editorial and structural guidelines. In effect, these guidelines encourage the writing of PES. The experiment reported in this paper concerns the enrichment of an existing ontology which is a frame of reference used to localise information relating to urbanism, environment and territorial organisations. It contains both geographical and real-world concepts. This ontology has 728 concepts. We then obtained 182 disambiguated pages which contain at least one PES (according to our criteria). From these 182 articles we exploited 276 PES, which allowed us to enrich our ontology with 349 new concepts and 201 instances that were considered relevant by the experts and knowledge engineers involved in the building of this ontology.

4. FUTURE WORK
In the short term, our idea is to combine our approach with the usual approaches to ontology learning from text. For example, in order to take better advantage of Wikipedia's articles, it would seem interesting to complement the approach of Herbelot et al. [5], which exploits plain text only. We also plan to exploit redirect links and homonym pages to maximise the number of relevant articles. On the other hand, we want to improve the analysis of enumerative structures by going beyond simple parsing, particularly regarding the primer. Authors may use complex grammatical constructions or linguistic variations in their writing, even within the enumerative structures. We then face problems of anaphora resolution, ellipsis, apposition, extraposition, rhetorical forms, etc. Also, discourse analysis must be carried out to process non-parallel enumerative structures.

5. REFERENCES
[1] Nédellec, C., Nazarenko, A.: Ontology and Information Extraction. In S. Staab & R. Studer (eds.) Handbook on Ontologies in Information Systems, Springer (2003)
[2] Giuliano, C., Lavelli, A., Romano, L.: Exploiting Shallow Linguistic Information for Relation Extraction from Biomedical Literature. In Proc. EACL (2006)
[3] Giovannetti, E., Marchi, S., Montemagni, S.: Combining Statistical Techniques and Lexico-syntactic Patterns for Semantic Relation Extraction from Text. Fifth Workshop on Semantic Web Applications and Perspectives, FAO-UN, Roma, Italy (2008)
[4] Virbel, J., Luc, C.: Le modèle d'architecture textuelle: fondements et expérimentation. Verbum, Vol. XXIII, N. 1, p. 103-123 (2001)
[5] Herbelot, A., Copestake, A.: Acquiring ontological relationships from Wikipedia using RMRS. In: Proceedings of the International Semantic Web Conference 2006 Workshop on Web Content Mining with Human Language Technologies, Athens, GA (2006)
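
As an illustration only, the five steps of the method in Section 2 can be sketched in Python for the simplest situation, the "complete primer" case, where the primer contains a number equal to the number of items and the co-occurring term is taken as the father concept. This is not the authors' implementation: the names `NUMBER_WORDS`, `ITEM_RE`, `split_structure`, `is_visually_homogeneous`, `father_from_complete_primer` and `translate` are hypothetical, the number-word list stands in for lexical resources the paper does not detail, and only the visual-homogeneity part of the parallelism check is attempted.

```python
import re

# Hypothetical number-word list; the paper's actual lexical resources are not given.
NUMBER_WORDS = {"two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

# An item line: a marker ("-", "•", "*", "1)", "1.") followed by the item text.
ITEM_RE = re.compile(r"^\s*(?P<marker>[-•*]|\d+[.)])\s+(?P<text>.+?)[,.;]?\s*$")


def marker_kind(marker):
    """Reduce a marker to its visual type, so that '1)' and '2)' compare equal."""
    return re.sub(r"\d+", "N", marker)


def split_structure(lines):
    """Step 1: the first line is the primer, the remaining lines are the items."""
    primer = lines[0].strip().rstrip(":")
    items = []
    for line in lines[1:]:
        match = ITEM_RE.match(line)
        if match is None:
            raise ValueError(f"not an item: {line!r}")
        items.append((marker_kind(match.group("marker")), match.group("text")))
    return primer, items


def is_visually_homogeneous(items):
    """Step 2 (partial): all items must share the same kind of marker."""
    return len({kind for kind, _ in items}) == 1


def father_from_complete_primer(primer, n_items):
    """Step 3, complete-primer case: find a number word equal to the item
    count and return the word that follows it as the father concept."""
    words = primer.lower().split()
    for i, word in enumerate(words[:-1]):
        if NUMBER_WORDS.get(word) == n_items:
            return words[i + 1].strip(",.;:")
    return None


def translate(lines):
    """Steps 4 and 5: one child concept per item, returned as
    (child, relation, father) triples; empty if a step cannot be applied."""
    primer, items = split_structure(lines)
    if not is_visually_homogeneous(items):
        return []
    father = father_from_complete_primer(primer, len(items))
    if father is None:
        return []
    return [(text, "is-a", father) for _, text in items]


# Inner enumeration of example 1.
pes = ["four terrestrials:", "- Mercury,", "- Venus,", "- Earth,", "- Mars."]
triples = translate(pes)
print(triples)
```

On the inner enumeration of example 1 the sketch links each planet to "terrestrials" through is-a. On the primer of Figure 2 it returns no triples, because that primer is syntactically correct but not complete and would need the heuristics mentioned in Section 2, which this sketch does not attempt.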