<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ontology Building Using Parallel Enumerative Structures</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Institut de Recherche en Informatique de Toulouse (IRIT) – CNRS – UPS</institution>
          ,
          <addr-line>118, Route de Narbonne, 31062 Toulouse</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Under IAU definitions, there are eight planets in the Solar System. In order of increasing distance from the Sun, they are the four terrestrials</institution>
          ,
          <addr-line>Mercury, Venus, Earth, and Mars, then the four gas giants, Jupiter, Saturn, Uranus, and Neptune</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>four gas giants: - Jupiter</institution>
          ,
          <addr-line>- Saturn, - Uranus, - Neptune</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>four terrestrials: - Mercury</institution>
          ,
          <addr-line>- Venus, - Earth, - Mars</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The semantics of a text is carried by both the natural language it contains and its layout. As ontology building processes have so far taken only plain text into consideration, our aim is to elicit its textual structure. We focus here on parallel enumerative structures because they bear implicit or explicit hierarchical relations, they have salient visual properties, and they are frequently found in corpora. We have defined a process which identifies them in a text, translates them into ontology structures and finally links such structures to the concepts of an existing ontology. We have assessed this process on Wikipedia encyclopaedic articles as they are rich in definitions and statements, and contain many enumerations. The many ontology structures we have obtained are thus used to enrich an ontology which we had automatically built from database specification documents.</p>
      </abstract>
      <kwd-group>
        <kwd>Ontology building and enrichment from text</kwd>
        <kwd>layout analysis</kwd>
        <kwd>NLP tools</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. MOTIVATION</title>
      <p>Many approaches have been suggested for the construction,
enrichment or population of ontologies from text. They are based
on lexical, syntactic, semantic or rhetorical aspects of natural
language. They encompass machine learning [1], specific natural
language processing tools [2], or a combination of both [3]. These
methods are usually applied to plain text. However, a large
variety of layouts or structures can be found in the visual
presentation of a text, with a diversity of interpretations for each of
them [4]. Some of them implicitly carry ontological knowledge, as
shown in example 1. The meaning carried by this structure may be
expressed through the sentence in example 2. In both cases, a
human being may easily deduce the conceptual framework
presented in Figure 1.</p>
      <p>In the case of sentence analysis (example 2), the automatic
deduction of its formal counterpart by a Natural Language
Processing (NLP) tool is a very tricky issue, which requires
non-trivial tasks such as anaphora resolution or
the design of sophisticated multi-sentence textual patterns.</p>
      <p>Bernard Rothenburger
Institut de Recherche en Informatique de Toulouse
(IRIT) – CNRS – UPS
118, Route de Narbonne 31062 Toulouse, France
(+33) 5 61 55 83 38</p>
      <p>However, for layout structure analysis (example 1), different parts
of the knowledge are more easily identifiable thanks to lexical or
typo-dispositional marks. We claim that it thus becomes easier to
identify the corresponding conceptual network automatically.
The above meaning-bearing layouts allow a
straightforward identification of ontological relations: often
hyperonymy, sometimes meronymy, and occasionally other
relations.</p>
      <p>We focus here on a specific kind of meaning-bearing layout that
we call parallel enumerative structures (PES). Example 1 is
typical of such a layout. These structures present some regularities
and appear very frequently. Their analysis could be a relevant
contribution to improving knowledge elicitation and modelling from
text. Moreover, it would provide new triggers for the
identification of new concepts or semantic relations, making it
possible to go beyond the classical ontology learning approaches,
which only consider plain text.</p>
    </sec>
    <sec id="sec-2">
      <title>2. TRANSLATION PROCESS</title>
      <p>An enumeration is a set of items with or without semantic
relations between them. An item is a co-enumerated entity which
can be discerned through typographic, dispositional and/or
lexico-syntactic marks. A parallel enumeration is a paradigmatic
enumeration (i.e. all items are functionally equivalent, textually or
syntactically) that is visually homogeneous (i.e. all items are visually
equivalent) and isolated (i.e. no item is linked to any textual unit
outside the enumeration). An introductory phrase,
hereafter called primer, is a phrase or a sentence which introduces
an enumeration, and which is identifiable by lexico-syntactic
and/or typo-dispositional marks. Finally, we call parallel
enumerative structure (PES) a vertical textual structure composed
of a primer and a parallel enumeration.</p>
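      <p>As a rough illustration of these definitions, the following Python sketch represents an enumerative structure as a primer plus a list of items, and tests only the visual homogeneity criterion. It is not the tool used in this work: the marker patterns in MARKER_STYLES, the rule that a primer ends with a colon, and the names EnumerativeStructure, marker_style and find_structures are all assumptions made for the example. The paradigmatic and isolation criteria require linguistic analysis and are not covered.</p>
      <preformat>
```python
import re
from dataclasses import dataclass

# Hypothetical marker styles; the paper does not list the marks it relies on.
MARKER_STYLES = {
    "number": re.compile(r"^\s*\d+[.)]\s+"),
    "dash": re.compile(r"^\s*[-*•]\s+"),
}


@dataclass
class EnumerativeStructure:
    primer: str
    items: list[str]
    marker_styles: list[str]

    def is_visually_homogeneous(self) -> bool:
        # Crude proxy for visual equivalence: every item uses the same marker style.
        return len(set(self.marker_styles)) == 1


def marker_style(line: str):
    """Return (style name, item text without its marker), or None if unmarked."""
    for name, pattern in MARKER_STYLES.items():
        match = pattern.match(line)
        if match:
            return name, line[match.end():].strip()
    return None


def find_structures(lines: list[str]) -> list[EnumerativeStructure]:
    """Pair each line ending in ':' with the run of marked lines that follows it."""
    structures = []
    current = None  # structure being collected, if any
    for line in lines:
        marked = marker_style(line)
        if current is not None and marked is not None:
            style, text = marked
            current.items.append(text)
            current.marker_styles.append(style)
            continue
        # Otherwise, close any structure being collected (kept only if it has items).
        if current is not None and current.items:
            structures.append(current)
        current = None
        if line.rstrip().endswith(":"):
            current = EnumerativeStructure(line.strip(), [], [])
    if current is not None and current.items:
        structures.append(current)
    return structures


example = [
    "There are a number of diseases and conditions affecting the "
    "gastrointestinal system, including:",
    "1) Cholera",
    "2) Colorectal cancer",
    "3) Diverticulitis",
]
for structure in find_structures(example):
    print(structure.primer)
    print(structure.items, structure.is_visually_homogeneous())
```
      </preformat>
      <p>Comparing marker styles is a deliberately shallow test; a fuller check would also compare indentation, font and the syntactic form of the items.</p>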
      <p>[Figure: components of an enumerative structure. Primer: "There are a number of diseases and conditions affecting the
gastrointestinal system, including:". Enumeration: three items, each introduced by an item marker:
1) Cholera, 2) Colorectal cancer, 3) Diverticulitis.]</p>
      <p>Broadly speaking, the idea is to translate a PES into a single
ontology structure (i.e. a one- or two-level hierarchy) according to
the following principles: (1) the primer contains one father
concept and one semantic relation which links this father concept
to the concepts contained in the items; (2) each item contains one
child concept semantically related to the father concept of the
primer; (3) all child concepts are considered as belonging to
the same conceptual level. An example of this correspondence is
the structure obtained in Figure 1 from example 1.</p>
      <p>The syntactic structure of the primer helps to identify the father
concept and the semantic relation it contains. We have
characterized three cases:</p>
      <list list-type="bullet">
        <list-item>
          <p>The primer is not syntactically correct.</p>
          <list list-type="bullet">
            <list-item>
              <p>The primer may consist of a noun phrase. This noun
phrase represents the father concept and the semantic relation is
is-a.</p>
            </list-item>
            <list-item>
              <p>The primer ends with a verb phrase in the active voice. The
semantic class to which this verb belongs reflects the nature of the
relation, and the father concept corresponds to the main term of
the noun phrase which is the subject of this verb.</p>
            </list-item>
          </list>
        </list-item>
        <list-item>
          <p>The primer is complete. It contains a lexical unit taken from a
gazetteer, or a number, which specifies the number of items. The
father concept is the term which co-occurs with this lexical
marker, and the relation is is-a.</p>
        </list-item>
        <list-item>
          <p>The primer is syntactically correct but not complete. The
father concept may be found in the subject noun phrase or in the
object noun phrase of the main clause, and can possibly be
detected using heuristics. The relation is is-a.</p>
        </list-item>
      </list>
      <p>Our method consists in (1) identifying each enumerative structure
and its different components (primer and items), (2) checking
whether the enumeration is parallel, (3) identifying the father
concept and the nature of the semantic relation, (4) extracting the
child concepts from each item, and (5) building an ontological
structure. This fifth step relies on the annotations produced during
the four previous steps.</p>
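      <p>To make steps (3) to (5) more concrete, the following Python sketch handles only the simplest situation described above: a complete primer in which a number announces the number of items. It is an illustration under stated assumptions, not the process implemented in this work. The lexicon NUMBER_WORDS, the names Annotation, annotate_counted_primer and build_structure, and the shortcut of taking the whole item text as the child concept are all hypothetical; no lemmatisation or parsing is performed.</p>
      <preformat>
```python
from dataclasses import dataclass

# Hypothetical number lexicon; the paper mentions a gazetteer without listing it.
NUMBER_WORDS = {"two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
                "seven": 7, "eight": 8, "nine": 9, "ten": 10}


@dataclass(frozen=True)
class Annotation:
    father: str    # father concept found in the primer (step 3)
    relation: str  # semantic relation carried by the primer (step 3)
    count: int     # number of items announced by the primer


def annotate_counted_primer(primer: str):
    """Handle only the 'complete primer' case: a number followed by a term."""
    tokens = primer.rstrip(" :").split()
    for position, token in enumerate(tokens):
        count = NUMBER_WORDS.get(token.lower())
        if count is None and token.isdigit():
            count = int(token)
        rest = " ".join(tokens[position + 1:])
        if count is not None and rest:
            # The term co-occurring with the number is taken as the father concept.
            return Annotation(father=rest, relation="is-a", count=count)
    return None


def build_structure(primer: str, children: list[str]):
    """Step 5: one (child, relation, father) triple per item, all on one level."""
    annotation = annotate_counted_primer(primer)
    if annotation is None or annotation.count != len(children):
        return []  # primer not covered by this sketch, or count mismatch
    return [(child, annotation.relation, annotation.father) for child in children]


triples = build_structure("four gas giants:",
                          ["Jupiter", "Saturn", "Uranus", "Neptune"])
for triple in triples:
    print(triple)
```
      </preformat>
      <p>Returning flat triples keeps all child concepts on the same conceptual level, in line with the third translation principle; linking the father concept to an existing ontology would be a separate step.</p>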
    </sec>
    <sec id="sec-3">
      <title>3. APPLICATION</title>
      <p>Wikipedia documents are encyclopaedic and contain many
definitional statements and properties. Furthermore, articles are
written according to a comprehensive set of editorial and
structural guidelines, which in effect encourage the writing of
PES. The experiment reported in this paper concerns the
enrichment of an existing ontology which is a frame of reference
used to localise information relating to urbanism, environment
and territorial organisations. It contains both geographical and
real-world concepts. This ontology has 728 concepts. We then
obtained 182 disambiguated pages which contain at least one PES
(according to our criteria). From these 182 articles we exploited 276
PES, which allowed us to enrich our ontology with 349 new concepts
and 201 instances that were considered relevant by the experts
and knowledge engineers involved in the building of this
ontology.</p>
    </sec>
    <sec id="sec-4">
      <title>4. FUTURE WORK</title>
      <p>In the short term, our idea is to combine our approach with the
usual approaches to ontology learning from text. For example, in order to
take better advantage of Wikipedia’s articles, it would be
interesting to complement the approach of Herbelot et al. [5], which
exploits plain text only. We also plan to exploit redirect links and
homonym pages to maximise the number of relevant articles. In
addition, we want to improve the analysis of enumerative
structures by going beyond simple parsing, particularly regarding
the primer. Authors may use complex grammatical constructions
or linguistic variations in their writing, even within
enumerative structures. We then face problems such as anaphora
resolution, ellipsis, apposition, extraposition and rhetorical forms.
Finally, discourse analysis must be carried out to process
non-parallel enumerative structures.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Nédellec</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nazarenko</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Ontology and Information Extraction</article-title>
          . In S. Staab &amp; R. Studer (eds.)
          <source>Handbook on Ontologies in Information Systems</source>
          , Springer (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Giuliano</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lavelli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Romano</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Exploiting Shallow Linguistic Information for Relation Extraction from Biomedical Literature</article-title>
          .
          <source>In Proc. EACL</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Giovannetti</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marchi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montemagni</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Combining Statistical Techniques and Lexico-syntactic Patterns for Semantic Relation Extraction from Text</article-title>
          .
          <source>Fifth workshop on Semantic Web Applications and Perspectives</source>
          , FAO-UN, Roma, Italy (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Virbel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luc</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Le modèle d'architecture textuelle: fondements et expérimentation</article-title>
          .
          <source>Verbum</source>
          , Vol.
          <volume>XXIII</volume>
          , N.
          <issue>1</issue>
          , p.
          <fpage>103</fpage>
          -
          <lpage>123</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Herbelot</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Copestake</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Acquiring ontological relationships from Wikipedia using RMRS</article-title>
          .
          <source>In: Proceedings of the International Semantic Web Conference 2006. Workshop on Web Content Mining with Human Language Technologies</source>
          , Athens, GA (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>