<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Combining terms and named entities for modeling domain ontologies from texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nouha Omrane</string-name>
          <email>nouha.omrane@lipn.univ-paris13.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adeline Nazarenko</string-name>
          <email>adeline.nazarenko@lipn.univ-paris13.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sylvie Szulman</string-name>
          <email>sylvie.szulman@lipn.univ-paris13.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Paris 13</institution>
          ,
          <addr-line>99 av JB Clement, 93430 Villetaneuse</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Building ontologies from plain texts is still a research issue. This process cannot be fully automated, but natural language processing and methodological guidelines can help the knowledge engineer's task. In this paper we present terminae and show, through the analysis of three different experiments on policy documents, how the initial terminological approach can be guided by taking named entities into account.</p>
      </abstract>
      <kwd-group>
        <kwd>Ontology acquisition from texts</kwd>
        <kwd>terms</kwd>
        <kwd>named entities</kwd>
        <kwd>conceptualization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The first "ontology learning" approach [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] relies on distributional analysis of large acquisition corpora. It is considered an automatic approach, even if the resulting ontology needs to be manually edited afterwards. The second approach is based on the terminological analysis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] of the text. It is less automated than the previous one but is useful for applications where ontologies need to be carefully designed. This work is part of a project aiming at modeling business rules expressed in written policies. (This work was realized as part of the FP7 231875 ONTORULE project. We thank American Airlines and ArcelorMittal, who are the owners of our working corpora.) In this context, where domain ontologies are used as conceptual vocabularies for the writing of the rules of various use cases, the terminological approach is preferred, given the typical size of policies (medium-size specialised corpora, typically from 5 to 500 thousand words) and the expected quality of the ontologies. In the terminological approach, the terms of a domain form the domain-specific vocabulary and, as such, serve as a bootstrap for ontology design. Named entities are another type of domain-specific textual units that refer to well-identified domain entities. They are traditionally exploited in ontology engineering, but for populating the instance level of existing ontologies. The originality of the proposed method comes from the fact that it exploits both types of textual units to bootstrap the conceptualization process itself. Our approach is a terminological, fact-based one that is embodied in a revised version of the terminae tool [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which now takes named entities into account in addition to terms. Section 2 explains that terms and named entities can be exploited in a unified way and shows how the terminae methodology has been enriched with the output of named entity recognition rules. The last section presents three different experiments exploiting named entities in the ontology building process.
      </p>
    </sec>
    <sec id="sec-1b">
      <title>2. A COMBINED METHOD FOR BUILDING ONTOLOGIES FROM TEXTS</title>
      <p>
The terminae text-based acquisition method decomposes the acquisition process into three main levels, the terminological, termino-conceptual and conceptual (or ontological) levels, which are built on top of each other, the corpus playing the role of ground level. The transition from text to ontology must actually be mediated: ontologies cannot be "extracted" as such from texts, because conceptual models (or ontologies) and texts are different in nature. At each level, the knowledge engineer has to select the relevant items and to organise them. This process is helped by the prior terminological analysis of the text, which is automatic, and guided by the method embodied in the interfaces of the terminae tool. The overall process is represented in Figure 1. In this paper, we focus on its upper part.</p>
      <p>[Figure 1: The terminae acquisition process, from the acquisition corpus (terms and named entities, terminological level) through the termino-conceptual level (termino-concepts and termino-conceptual relations) to the conceptual level, i.e. the ontology (concepts, instances and conceptual relations).]</p>
      <p>At the linguistic level, the user has to extract from the acquisition corpus the textual units that seem relevant for the domain and use case to model. This step relies on NLP tools known as "term extractors", as well as "named entity recognizers" that extract named entities and their semantic types. The user has to revise the extracted elements and to turn the list of relevant units into a list of termino-concepts. In that process, the linguistic output is normalised, which is a way to abstract the future domain model from the textual wording and linguistics. The third acquisition step of the terminae methodology consists in formalising the list or network of termino-concepts into an ontology. The core task of ontology acquisition is the conceptualization step, which consists in choosing, structuring and defining the conceptual elements of the domain model. In this step, named entities are generally neglected. On the contrary, we consider these textual units and their semantic types from the beginning of the conceptualization phase, in the same way as we do for terms. The fact that they are identified as named entities by NLP tools does not mean that they must necessarily be turned into instances: in some cases the named entities may model concepts. The underlying modeling choices depend on the corpus and use case that are considered.</p>
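      <p>The bootstrapping step described above can be sketched in a few lines of code. This is a minimal, illustrative sketch only: the class and function names are hypothetical, and it stands in for the actual terminae implementation, which relies on dedicated term extractors and named entity recognizers. The point it illustrates is that both unit types feed the same termino-conceptual level, and that the concept-versus-instance decision for a named entity is deliberately left to the knowledge engineer.</p>

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TerminoConcept:
    """A candidate termino-concept built from one or more textual units."""
    label: str                                   # normalised surface form
    sources: set = field(default_factory=set)    # "term" and/or "named_entity"
    semantic_type: Optional[str] = None          # type proposed by the NE recognizer
    modeled_as: str = "undecided"                # "concept" or "instance": a user choice

def bootstrap(terms, named_entities):
    """Merge extractor outputs into one candidate termino-concept table.

    `terms` is an iterable of strings; `named_entities` is an iterable of
    (surface form, semantic type) pairs. Units sharing a normalised form
    are merged into a single candidate.
    """
    candidates = {}
    for t in terms:
        key = t.lower()
        tc = candidates.setdefault(key, TerminoConcept(label=key))
        tc.sources.add("term")
    for surface, sem_type in named_entities:
        key = surface.lower()
        tc = candidates.setdefault(key, TerminoConcept(label=key))
        tc.sources.add("named_entity")
        tc.semantic_type = sem_type
    return candidates

# Hypothetical toy input, loosely inspired by the mileage-policy use case.
candidates = bootstrap(
    terms=["mileage award", "member", "flight"],
    named_entities=[("American Airlines", "Organization"), ("Member", "Role")],
)
# "member" surfaces both as a term and as a named entity; whether it becomes
# a concept or an instance remains an explicit modeling decision.
```

The sketch deliberately keeps `modeled_as` set to `"undecided"` for every candidate: named entity recognition alone does not determine the modeling choice.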
      <p>The next section illustrates the various bootstrapping approaches in the context of policy modeling, taking into account the specificities of policy documents, in which passages expressing rules deserve specific attention.</p>
    </sec>
    <sec id="sec-2">
      <title>3. EXPERIMENTS</title>
      <p>We consider three use cases, each dealing with a specific type of regulation (loyalty program, decision process, rules of a game). The resulting ontologies are to be used for the modeling and formalization of the rules that are expressed in written policies. The acquisition scenario is not the same in the three experiments reported below. In the first one, the named entities are exploited to enrich an ontology that we had previously built on the basis of terms only. The second case aimed at adding linguistic information to an existing ontology and at enriching it with information coming from the acquisition corpus. In the third experiment, the named entities are really used to bootstrap the conceptualization. Even if the policy corpora do not contain numerous named entities, the three experiments show that the named entities are important to take into account.</p>
      <p>In the first experiment, the ontology is built out of a document of American Airlines (5,300 words), which explains the mileage policy to customers. In this use case, taking the named entities into account made it possible to enrich and partially populate the ontology. Compared to the initial ontology of 130 concepts, 7 new concepts and 45 instances have been added, and 15 of the existing concepts have been redefined. Except for cities, which were not interesting for the use case, all the named entities (76) have been introduced in the ontology in some way. The second use case deals with the galvanization process and the rules governing the assignment of a product, a coil (3,562 words): depending on various quality criteria, a coil can be assigned to the order (delivered to the customer), repaired or thrown away. We started modeling the domain from an existing core ontology of 12 concepts. The goal was to associate textual units to the existing concepts (for the further semantic annotation of additional documents) and to enrich the structure of the ontology with entities found in the text. We exploited the 663 terms and 105 named entities respectively extracted by YaTeA and Gate. Taking named entities into account helped to understand the details of the assignment process and to identify the relevant conceptual properties. In the third experiment, we had no preexisting information and we exploited the named entities to bootstrap the conceptualization process, in a fact-oriented approach. We started with a French "Rules of Golf" corpus (112,898 words), which describes the rules and conditions according to which a golf player must replay, lose points or quit the game. YaTeA and Gate respectively extracted 3,711 terms and 350 named entities. In this use case, where the term list was too long to be studied in detail, the analysis started with the named entities, which highlighted some core domain elements, and was progressively extended to the related terms and their interrelations. These three experiments aimed at building ontologies out of written policies. Named entity recognizers bring to light textual units that are not identified as terms but which nevertheless refer to crucial domain elements and guide the conceptualization work. Even if the "populating" hypothesis does not hold (named entities can be modeled as concepts as well as instances), named entities favour a fact-oriented approach, which counterbalances purely terminological analyses.</p>
    </sec>
    <sec id="sec-3">
      <title>4. CONCLUSION</title>
      <p>This paper shows how text-based ontology acquisition methods can be enriched by taking all types of domain-specific textual units into account, named entities as well as terms, and explains how named entities can be used in the conceptualization task.</p>
      <p>This combined approach, which is implemented in the terminae tool, is illustrated on three different experiments that all aim at building ontologies for the modeling of rules. Written policies do not contain as many named entities as press articles, for instance, but we have shown that named entities support a fact-based modeling approach that is complementary to the terminological one, which is more concept-oriented. Even when they are represented as instances at the conceptual level, named entities point out critical domain-specific elements that are important to integrate in the conceptual structure in one form or another.
(Terms and named entities were extracted with YaTeA, available at http://search.cpan.org/%7Ethhamon/Lingua-YaTeA-0.5/, and Gate, available at http://gate.ac.uk/; the "Rules of Golf" corpus is publicly available at http://www.golf.org/.)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Cimiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Ontology Learning and Population from Text: Algorithms, Evaluation and Applications</article-title>
          . Springer (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Aussenac-Gilles</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Despres</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szulman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>The TERMINAE Method and Platform for Ontology Engineering from texts</article-title>
          . In Buitelaar, P.,
          <string-name>
            <surname>Cimiano</surname>
          </string-name>
          , P., eds.:
          <article-title>Bridging the Gap between Text and Knowledge - Selected Contributions to Ontology Learning and Population from Text</article-title>
          . IOS Press (January
          <year>2008</year>
          )
          <fpage>199</fpage>
          -
          <lpage>223</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Lopes</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vieira</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Automatic extraction of composite terms for construction of ontologies: an experiment in the health care area</article-title>
          .
          <source>Electronic Journal of Communication, Information and Innovation in Health</source>
          <volume>3</volume>
          (
          <issue>1</issue>
          ) (
          <year>2009</year>
          )
          <fpage>72</fpage>
          -
          <lpage>84</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>