<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DAFOE: A Platform for Building Ontologies from Texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eric Sardet</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adeline Nazarenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jean Charlet</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nathalie Aussenac-Gilles</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sylvie Szulman</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CNRS/LIPN et Université Paris 13</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>INSERM</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>CNRS/IRIT et Université de Toulouse</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>CNRS/LIPN et Université Paris 13</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>LISI-ENSMA et CRITT-Informatique</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Although text-based ontology engineering has gained much popularity in the last 10 years, very few ontology engineering platforms exploit the full potential of the connection between texts and ontologies. We propose DAFOE, a new platform for building ontologies with a terminological component using different types of linguistic entries (text corpora, results of natural language processing tools, terminologies or thesauri). DAFOE supports knowledge structuring and conceptual modelling from these linguistic entries as well as ontology formalization. DAFOE outputs models with two main original features: an ontology articulated with a lexical component and a connection with the text or linguistic entry that motivated their definition.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>DAFOE is a new platform for building ontologies
using different types of linguistic entries (text corpora, results
of natural language processing tools, terminologies or
thesauri). DAFOE supports knowledge structuring and
conceptual modelling from these linguistic entries as well as
ontology formalization. DAFOE outputs models with two
main original features: an ontology articulated with a
lexical component and a connection with the text or linguistic
entry that motivated their definition. The requirements of
the platform and its development focus 1) on integrating
the various kinds of tools currently in use within a single modelling
platform, 2) on guaranteeing persistence and traceability of
the whole ontology building process, and 3) on developing
the platform in an open source paradigm with possible
plugin extensions.</p>
    </sec>
    <sec id="sec-2b">
      <title>2. TEXT-BASED ONTOLOGY ENGINEERING</title>
      <p>There is a growing interest in ontologies and related tools,
including Ontology Engineering Environments. Many of
them are pure ontology editors that support the
development of formal ontologies but do not assist the tasks of
knowledge acquisition or structuring. Knowledge engineers
are supposed to have a first ontology draft before using
such tools. Since 2003, a significant shift has occurred. Firstly,
a parallel has been established between ontology
population from text and semantic (textual) annotation. Secondly,
many projects have proved the benefit brought by Human
Language Technologies (HLT), including NLP, Information
Extraction, Knowledge Discovery and Text Mining, for
complementary activities such as ontology learning from text
and ontology population. The diversity and richness of
existing HLT tools as well as the complexity of the ontology
development tasks have underlined the need for tool suites and
platforms where the knowledge engineer can define his or her
modelling strategy. This challenge is also a major motivation of
the DAFOE project, but the platform targets more ambitious
goals: better interoperability, higher robustness and an
easier combination of HLT and ontology technologies.</p>
      <p>The goal of DAFOE is both to extend the variety of HLT
that can be used and to support scalable ontology
engineering. It rests on the claim that there are several ways to obtain an ontology,
and that tools and processes must be selected according to
each ontology case study. DAFOE will propose tools
similar to those of Text2Onto, but human supervision will play
a major role in selecting tools, validating their results and
conceptualizing. Knowledge conceptualization requires
a human to select and properly organize concepts and
relations, but this process can be guided. The result of DAFOE
will typically be a termino-ontological resource in which the
ontology is connected to a lexical component.</p>
      <p>The DAFOE data model has to accommodate various
ontology building strategies, whatever information source (texts,
terminologies, thesauri or human expertise) is used.</p>
    </sec>
    <sec id="sec-3">
      <title>3.1 Overall Architecture</title>
      <p>
        The data model is based on a valid methodology for
building ontologies from texts, which has inspired tools such as
Terminae [<xref ref-type="bibr" rid="ref2">2</xref>] or Text2Onto [<xref ref-type="bibr" rid="ref3">3</xref>]. This methodology takes
into account the whole process of "transforming" textual
data into ontologies and splits it into different phases, which
correspond to various input levels if one wants to start with
a thesaurus rather than text, for instance. This methodology
relies on two main ideas: (1) textual data are an important
information source for building ontologies, especially if the
ontology is to be used to annotate textual documents, but (2)
textual data cannot be mapped directly onto an ontology
and the transformation must be mediated. The data model
is therefore structured into four layers, as represented in
Figure 1. Each one corresponds to a specific methodological
step.
      </p>
      <p>[Figure 1: the four layers of the data model, with an entry point at each layer.
Layer 0 (corpora): text visualization.
Layer 1 (terminological): linguistic study.
Layer 2 (termino-conceptual): semantic network building, with SKOS export.
Layer 3 (ontological): formal ontology building, with OWL export.
The figure also shows imports and an API for NLP tools.]</p>
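      <p>The layered organization can be summarized in a few lines of Python. This is only an illustrative sketch: the names Layer, STEP, EXPORT and remaining_steps are hypothetical and are not part of DAFOE; the layer numbers, steps and export formats are those shown in Figure 1.</p>
      <preformat><![CDATA[
```python
from enum import IntEnum
from typing import List


class Layer(IntEnum):
    """The four layers of the data model, numbered as in Figure 1."""
    CORPORA = 0
    TERMINOLOGICAL = 1
    TERMINO_CONCEPTUAL = 2
    ONTOLOGICAL = 3


# Methodological step attached to each layer (taken from Figure 1).
STEP = {
    Layer.CORPORA: "text visualization",
    Layer.TERMINOLOGICAL: "linguistic study",
    Layer.TERMINO_CONCEPTUAL: "semantic network building",
    Layer.ONTOLOGICAL: "formal ontology building",
}

# Export format available at a layer, when there is one (taken from Figure 1).
EXPORT = {
    Layer.TERMINO_CONCEPTUAL: "SKOS",
    Layer.ONTOLOGICAL: "OWL",
}


def remaining_steps(entry: Layer) -> List[str]:
    """Steps still to be carried out when modelling starts at `entry`."""
    return [STEP[layer] for layer in Layer if layer >= entry]
```
]]></preformat>
      <p>For instance, a knowledge engineer who starts from a thesaurus would enter at Layer.TERMINO_CONCEPTUAL and skip the two text-oriented steps.</p>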
    </sec>
    <sec id="sec-4">
      <title>3.2 Corpora Layer</title>
      <p>The corpora layer is useful for the knowledge engineer
who wants to build an ontology from text. He or she can build a
working corpus by selecting different source documents and
browse that corpus, either as plain documents or as
segmented ones. In the data model, the corpus is represented as
a sequence of sentences, each one having a unique identifier.</p>
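      <p>A minimal sketch of such a corpus representation is given below, assuming a naive segmentation on sentence-final punctuation. The names Sentence and build_corpus, the identifier scheme and the segmentation rule are all assumptions made for illustration; the actual DAFOE schema is not described here.</p>
      <preformat><![CDATA[
```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Sentence:
    sentence_id: str  # unique across the whole working corpus
    document: str     # name of the source document
    text: str


def build_corpus(documents: Dict[str, str]) -> List[Sentence]:
    """Turn the selected documents into one sequence of identified sentences."""
    corpus: List[Sentence] = []
    for name, content in documents.items():
        # Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
        for part in re.split(r"(?<=[.!?])\s+", content):
            text = part.strip()
            if text:
                # The running index keeps identifiers unique across documents.
                corpus.append(Sentence(f"s{len(corpus)}", name, text))
    return corpus
```
]]></preformat>
      <p>Because every sentence keeps both its identifier and its source document, the corpus can be browsed either document by document or sentence by sentence.</p>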
    </sec>
    <sec id="sec-5">
      <title>3.3 Terminological Layer</title>
      <p>
        The terminological layer gives a view over the
domain-specific lexicon of the corpus. It gathers the terms of the
domain and their relationships. Terminological knowledge is
traditionally produced by NLP tools such as term extractors
applied to the working corpus. The underlying assumption
is threefold: text analysis can extract term candidates that
are relevant for a given domain, those terms are likely to be
turned into ontology concepts, and the distribution of these
terms reflects their semantics [<xref ref-type="bibr" rid="ref4">4</xref>]. DAFOE visualizes the results
of NLP tools such as the YaTeA term extractor [<xref ref-type="bibr" rid="ref1">1</xref>]. NLP results
are given to DAFOE through an API. The data model is
extensible and may be adapted to different NLP tools.
      </p>
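      <p>The sketch below illustrates how term candidates delivered by an extractor could be stored together with their occurrences. It assumes that the extractor output has already been reduced to (term, sentence identifier) pairs; this input format and the names Term and import_terms are hypothetical and do not describe the DAFOE API or the YaTeA output format.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class Term:
    label: str
    # Identifiers of the corpus sentences in which the term occurs.
    occurrences: List[str] = field(default_factory=list)

    @property
    def frequency(self) -> int:
        return len(self.occurrences)


def import_terms(extractor_output: Iterable[Tuple[str, str]]) -> Dict[str, Term]:
    """Group (term, sentence_id) pairs into terms with their occurrences."""
    terms: Dict[str, Term] = {}
    for label, sentence_id in extractor_output:
        key = label.lower()  # assumption: case variants denote the same term
        terms.setdefault(key, Term(key)).occurrences.append(sentence_id)
    return terms
```
]]></preformat>
      <p>Keeping the sentence identifiers with each term is what later allows the distribution of a term to be inspected in context.</p>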
    </sec>
    <sec id="sec-6">
      <title>3.4 Termino-Conceptual Layer</title>
      <p>This layer represents a semantic structure of
unambiguous termino-concepts (TC) and termino-conceptual relations
(RTC). The knowledge engineer may build that layer either by
importing a preexisting termino-conceptual resource such as a
thesaurus or by analysing the terminological layer.
In the latter case, he or she analyses the meaning of the terms and
relations that appear at the terminological layer with respect
to each other and by looking at their occurrences. The
termino-conceptual layer is pivotal for transforming
linguistic elements into conceptual ones and for tracing the ontology
back to the linguistic material.</p>
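      <p>As an illustration of this pivotal role, the following sketch stores TCs and RTCs and traces a TC back to the sentences that motivated it. The class and method names are hypothetical, and the mapping from term labels to sentence identifiers is assumed to come from the terminological layer.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TerminoConcept:  # TC
    name: str
    terms: List[str] = field(default_factory=list)  # labels of the motivating terms


@dataclass
class TCRelation:  # RTC
    name: str
    source: str
    target: str


class TerminoConceptualLayer:
    def __init__(self) -> None:
        self.concepts: Dict[str, TerminoConcept] = {}
        self.relations: List[TCRelation] = []

    def add_concept(self, name: str, terms: List[str]) -> None:
        # TCs are unambiguous: a name identifies exactly one TC.
        if name in self.concepts:
            raise ValueError(f"termino-concept {name!r} already exists")
        self.concepts[name] = TerminoConcept(name, list(terms))

    def add_relation(self, name: str, source: str, target: str) -> None:
        # An RTC may only link TCs that are already defined.
        for end in (source, target):
            if end not in self.concepts:
                raise KeyError(end)
        self.relations.append(TCRelation(name, source, target))

    def trace(self, name: str, occurrences: Dict[str, List[str]]) -> List[str]:
        """Sentence identifiers that motivated the TC called `name`."""
        ids = set()
        for term in self.concepts[name].terms:
            ids.update(occurrences.get(term, []))
        return sorted(ids)
```
]]></preformat>
      <p>The trace method is the part that matters for traceability: it goes from a conceptual element back to the terms and then to the sentences in the corpus.</p>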
    </sec>
    <sec id="sec-7">
      <title>3.5 Ontology Layer</title>
      <p>The ontology data model allows TCs and
RTCs to be formalized in a formal language equivalent to OWL-DL. Concepts
are described as classes, individuals as instances of classes,
properties between classes as object properties and
properties between a class and a value as data properties or
attributes. An automatic process will translate TCs and RTCs
into formal concepts in a hierarchy with inherited properties,
as in the usual subsumption of description logics. This
translation exploits the structure of the semantic network
represented in the termino-conceptual layer and the differential
criteria associated with TCs and RTCs.</p>
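      <p>The following sketch shows one possible shape for such a translation, emitting axioms in OWL 2 functional-style syntax. It is not the DAFOE translation process: it assumes that an RTC named "is-a" denotes subsumption, that every other RTC becomes an object property, that each TC has at most one parent, and that names contain no spaces. Differential criteria are not modelled. The function names are hypothetical.</p>
      <preformat><![CDATA[
```python
from typing import List, Tuple

# (relation name, source TC, target TC)
Relation = Tuple[str, str, str]


def to_owl_axioms(concepts: List[str], relations: List[Relation]) -> List[str]:
    """Translate TCs into classes and RTCs into subsumptions or object properties."""
    axioms = [f"Declaration(Class(:{c}))" for c in concepts]
    for name, source, target in relations:
        if name == "is-a":  # assumption: "is-a" marks the hierarchy
            axioms.append(f"SubClassOf(:{source} :{target})")
        else:
            axioms.append(f"Declaration(ObjectProperty(:{name}))")
            axioms.append(f"ObjectPropertyDomain(:{name} :{source})")
            axioms.append(f"ObjectPropertyRange(:{name} :{target})")
    return axioms


def inherited_properties(concept: str, relations: List[Relation]) -> List[str]:
    """Properties whose domain is `concept` or one of its ancestors."""
    parents = {s: t for n, s, t in relations if n == "is-a"}  # single parent assumed
    found: List[str] = []
    seen = set()
    current = concept
    while current is not None and current not in seen:  # `seen` guards against cycles
        seen.add(current)
        found += [n for n, s, _ in relations if n != "is-a" and s == current]
        current = parents.get(current)
    return found
```
]]></preformat>
      <p>In this sketch, inheritance is computed by walking up the "is-a" links, which mirrors the way a property stated on a class applies to its subclasses under subsumption.</p>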
    </sec>
    <sec id="sec-8">
      <title>4. CONCLUSION</title>
      <p>A prototype of the DAFOE platform has been
implemented. DAFOE is intended to provide a variety of
ontology engineering methods. As such diversity cannot
be managed in a unique and static model, we adopted an
extended Ontology-Based Database (OntoDB) architecture
that supports model management and plugins. The strengths
of the DAFOE approach are i) a precise definition of the
various steps by which one can design a formal ontology; ii) a
data model guaranteeing persistence and traceability of the
whole ontology building process; iii) the supply of flexible
methodological guidelines that support the knowledge
engineer without constraint; iv) an architecture based on the
MOF model and plugin adaptability to ensure extensibility
of the model and processes around a core tool; v) the
specification of various modelling strategies based on different
inputs and outputs of the platform; vi) the final production of an
ontology that is associated with a terminological component.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Aubin</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Hamon</surname>
          </string-name>
          .
          <article-title>Improving term extraction with terminological resources</article-title>
          . In T. Salakoski,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ginter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pyysalo</surname>
          </string-name>
          , and T. Pahikkala, editors,
          <source>Advances in Natural Language Processing (5th International Conference on NLP, FinTAL 2006)</source>
          , number 4139 in LNAI, pages
          <fpage>380</fpage>–<lpage>387</lpage>
          . Springer, August
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Aussenac-Gilles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Despres</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Szulman</surname>
          </string-name>
          .
          <article-title>The Terminae method and platform for ontology engineering from texts</article-title>
          . In P. Buitelaar and P. Cimiano, editors,
          <source>Bridging the Gap between Text and Knowledge: Selected Contributions to Ontology learning from Text</source>
          . IOS Press,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Völker</surname>
          </string-name>
          .
          <article-title>Text2onto - a framework for ontology learning and data-driven change discovery</article-title>
          . In A. Montoyo,
          <string-name>
            <given-names>R.</given-names>
            <surname>Munoz</surname>
          </string-name>
          , and E. Metais, editors,
          <source>Proc. of the 10th International Conference on Applications of Natural Language to Information Systems (NLDB)</source>
          , volume
          <volume>3513</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>227</fpage>–<lpage>238</lpage>
          , Alicante
          , Spain,
          <year>2005</year>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Harris</surname>
          </string-name>
          .
          <source>Mathematical Structures of Language</source>
          . Interscience Publishers,
          <year>1968</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>