DAFOE: A Platform for Building Ontologies from Texts

Sylvie Szulman, CNRS/LIPN and Université Paris 13, France; ss@lipn.univ-paris13.fr
Nathalie Aussenac-Gilles, CNRS/IRIT and Université de Toulouse, France
Adeline Nazarenko, CNRS/LIPN and Université Paris 13, France
Henry Valéry Teguiak, LISI-ENSMA and CRITT-Informatique
Eric Sardet, LISI-ENSMA and CRITT-Informatique
Jean Charlet, INSERM, France; jean.charlet@spim.jussieu.fr

ABSTRACT
Although text-based ontology engineering has gained much popularity in the last 10 years, very few ontology engineering platforms exploit the full potential of the connection between texts and ontologies. We propose DAFOE, a new platform for building ontologies with a terminological component using different types of linguistic entries (text corpora, results of natural language processing tools, terminologies or thesauri). DAFOE supports knowledge structuring and conceptual modelling from these linguistic entries as well as ontology formalization. DAFOE outputs models with two main original features: an ontology articulated with a lexical component and a connection with the text or linguistic entry that motivated their definition.

Categories and Subject Descriptors
D.2.11 [Software Architectures]; I.2.4 [Knowledge Representation Formalisms and Methods]; H.2.1 [Database Management]: Logical Design - Data models

General Terms
Design

Keywords
Ontology Building, Ontology Editor, Meta-Modelization, Data Model

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
EKAW 2010 Lisbon, Portugal
Copyright 2010 ACM X-XXXXX-XX-X/XX/XX ...$10.00.

1. INTRODUCTION
DAFOE (http://dafoe4app.fr) is a new platform for building ontologies using different types of linguistic entries (text corpora, results of natural language processing tools, terminologies or thesauri). DAFOE supports knowledge structuring and conceptual modelling from these linguistic entries as well as ontology formalization. DAFOE outputs models with two main original features: an ontology articulated with a lexical component and a connection with the text or linguistic entry that motivated their definition. The requirements of the platform and its development focus 1) on integrating various kinds of currently used tools within a single modelling platform, 2) on guaranteeing persistence and traceability of the whole ontology building process, and 3) on developing the platform in an open source paradigm with possible plugin extensions.

2. TEXT-BASED ONTOLOGY ENGINEERING
There is a growing interest in ontologies and related tools, including Ontology Engineering Environments. Many of them are pure ontology editors that support the development of formal ontologies but do not assist the tasks of knowledge acquisition or structuring. Knowledge engineers are supposed to have a first ontology draft before using such tools. Since 2003, a significant shift has occurred. Firstly, a parallel has been established between ontology population from text and semantic (textual) annotation. Secondly, many projects have proved the benefit brought by Human Language Technologies (HLT), including NLP, Information Extraction, Knowledge Discovery or Text Mining, for complementary activities such as ontology learning from text and ontology population. The diversity and richness of existing HLT tools as well as the complexity of the ontology development tasks underlined the need for tool suites and platforms where knowledge engineers can define their own modelling strategy. This challenge is also one major motivation of the DAFOE project, but the platform targets more ambitious goals: better interoperability, higher robustness and an easier combination of HLT and ontology technologies.

The goal of DAFOE is both to extend the variety of HLT that can be used and to support scalable ontology engineering. It assumes that there are several ways to get an ontology, and that tools and processes must be selected according to each ontology case study. DAFOE will propose tools similar to those of Text2Onto, but human supervision will play a major role in selecting tools, validating their results and conceptualizing. Knowledge conceptualization requires that a human select and properly organize concepts and relations, but this process can be guided. The result of DAFOE will typically be a termino-ontological resource where the ontology is connected to a lexical component.

3. DATA MODEL
The DAFOE data model has to take into account various ontology building strategies, whatever information source (texts, terminologies, thesauri or human expertise) is used.

3.1 Overall Architecture
The data model is based on an established methodology for building ontologies from texts, which has inspired tools such as Terminae [2] or Text2Onto [3]. This methodology takes into account the whole process of "transforming" textual data into ontologies and splits it into different phases, which correspond to various input levels if one wants to start with a thesaurus rather than text, for instance. This methodology relies on two main ideas: 1/ textual data are an important information source to build ontologies, especially if the ontology is to be used to annotate textual documents, but 2/ textual data cannot be mapped directly into an ontology and the transformation must be mediated. The data model is therefore structured into four layers as represented in Figure 1. Each one corresponds to a specific methodological step.

Figure 1: Data model architecture. The figure shows imports through an API for NLP tools and four layers, each with its own entry point:
  Layer 0 (corpora): text visualization
  Layer 1 (terminological): linguistic study
  Layer 2 (termino-conceptual): semantic network building; SKOS export
  Layer 3 (ontological): formal ontology building; OWL export

3.2 Corpora Layer
The corpora layer is useful for the knowledge engineer who wants to build an ontology from text. He/she can build a working corpus by selecting different source documents and browse that corpus, either as plain documents or as segmented ones. In the data model the corpus is represented as a sequence of sentences, each one having a unique identifier.

3.3 Terminological Layer
The terminological layer gives a view over the domain-specific lexicon of the corpus. It gathers the terms of the domain and their relationships. Terminological knowledge is traditionally produced by NLP tools such as term extractors applied to the working corpus. The underlying assumption is threefold: text analysis can extract term candidates that are relevant for a given domain, those terms are likely to be turned into ontology concepts, and the distribution of these terms reflects their semantics [4]. DAFOE visualizes results of NLP tools such as the YaTeA term extractor [1]. NLP results are given to DAFOE through an API. The data model is extensible and may be adapted to different NLP tools.

3.4 Termino-Conceptual Layer
This layer represents a semantic structure of unambiguous termino-concepts (TC) and termino-conceptual relations (RTC). The knowledge engineer may build that layer either by importing a preexisting termino-conceptual resource such as a thesaurus or by analysing the terminological layer. In the latter case, he/she analyses the meaning of terms and relations that appear at the terminological layer with respect to each other and by looking at their occurrences. The termino-conceptual layer is pivotal for transforming linguistic elements into conceptual ones and tracing the ontology back to the linguistics.

3.5 Ontology Layer
The ontology data model makes it possible to formalize TCs and RTCs in a formal language equivalent to OWL-DL. Concepts are described as classes, individuals as instances of classes, properties between classes as object properties, and properties between a class and a value as data properties or attributes. An automatic process will translate TCs and RTCs into formal concepts in a hierarchy with inherited properties, following the usual subsumption of description logics. This translation exploits the structure of the semantic network represented in the termino-conceptual layer and the differential criteria associated with TCs and RTCs.

4. CONCLUSION
A prototype of the DAFOE platform has been implemented. DAFOE is intended to provide a variety of ontology engineering methods. As such diversity cannot be managed in a unique and static model, we adopted an extended Ontology-Based Database (OntoDB) architecture that supports model management and plugins. The strengths of the DAFOE approach are i) a precise definition of the various steps by which one can design a formal ontology; ii) a data model guaranteeing persistence and traceability of the whole ontology building process; iii) the supply of flexible methodological guidelines that support the knowledge engineer without constraint; iv) an architecture based on the MOF model and plugin adaptability to ensure extensibility of the model and processes around a core tool; v) the specification of various modelling strategies based on different inputs/outputs of the platform; vi) the final production of an ontology which is associated with a terminological component.

5. REFERENCES
[1] S. Aubin and T. Hamon. Improving term extraction with terminological resources. In T. Salakoski, F. Ginter, S. Pyysalo, and T. Pahikkala, editors, Advances in Natural Language Processing (5th International Conference on NLP, FinTAL 2006), number 4139 in LNAI, pages 380–387. Springer, August 2006.
[2] N. Aussenac-Gilles, S. Despres, and S. Szulman. The Terminae method and platform for ontology engineering from texts. In P. Buitelaar and P. Cimiano, editors, Bridging the Gap between Text and Knowledge: Selected Contributions to Ontology learning from Text. IOS Press, 2008.
[3] P. Cimiano and J. Volker. Text2onto - a framework for ontology learning and data-driven change discovery. In A. Montoyo, R. Munoz, and E. Metais, editors, Proc. of the 10th International Conference on Applications of Natural Language to Information Systems (NLDB), volume 3513 of Lecture Notes in Computer Science, pages 227–238, Alicante, Spain, 2005. Springer.
[4] Z. Harris. Mathematical Structures of Language. Interscience Publishers, 1968.