<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Description Logics for Ontology Extraction</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Amalia Todirașcu (University Al.I.Cuza of Iasi</institution>
          ,
          <addr-line>Romania and LIIA, ENSAIS, France) François de Beuvron, LIIA, ENSAIS</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>François Rousselot</institution>
          ,
          <addr-line>LIIA, ENSAIS</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The paper presents a prototype of a system for querying the Web in natural language (French) for a limited domain. The domain knowledge, represented in description logics (DL), is used for filtering the results of the search and is extended dynamically (when new concepts are identified in the texts) as a result of DL inference mechanisms. The conceptual hierarchy is built semi-automatically from the texts. Several small French corpora (heart surgery, newspaper articles, papers on natural language processing) have been used to experiment with the prototype. The system uses shallow natural language parsing techniques, and DL reasoning mechanisms are used to handle incomplete or incorrect user queries.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Web search engines accept user queries composed of a set of
keywords, written in a command language or in natural language. These
systems use index files for retrieving documents. Indexes can be
keywords, terms, or syntactic or semantic structures.</p>
      <p>User queries are transformed into semantic representations, which
are matched against the index items. The semantic representation of
the query can be a set of keywords or a more complex semantic
structure. The performance of these systems is evaluated
by two parameters: recall (the number of relevant documents
retrieved/the total number of relevant documents) and precision (the number of relevant
documents retrieved/the number of retrieved documents). Keyword-based
search engines provide poor recall (no handling of synonyms or of
generalization/specialization) and low precision (the answers contain
a significant amount of irrelevant information).</p>
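      <p>As a minimal illustration of the two measures (the document identifiers below are hypothetical, not from the paper), recall and precision can be computed as:</p>
      <preformat>
```python
# Recall and precision on toy data; the document identifiers are
# illustrative only, not taken from the paper.
relevant = {"d1", "d2", "d3", "d4"}   # documents actually relevant to the query
retrieved = {"d2", "d3", "d5"}        # documents returned by the engine

# relevant documents that were actually retrieved
true_hits = relevant.intersection(retrieved)

recall = len(true_hits) / len(relevant)       # 2 / 4 = 0.5
precision = len(true_hits) / len(retrieved)   # 2 / 3

print(recall, precision)
```
      </preformat>
      <p>A keyword-only engine typically lowers both numbers: synonyms of the query words shrink the intersection (hurting recall), while ambiguous keywords inflate the retrieved set (hurting precision).</p>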
      <p>
        Several modern IR (Information Retrieval) systems use
semantic resources as filters to improve search results: keywords combined with
multiple-word terms [9], their semantic variations [
        <xref ref-type="bibr" rid="ref1">4</xref>
        ], thesauri
(Corelex [2], EuroWordNet [13]), or lists of synonyms [7]. Natural
language querying needs linguistic knowledge and NLP tools, such as
conceptual sentence parsers applying patterns with case constraints
to extract information [9], or predefined case frames [8].
      </p>
      <p>The design of these systems involves off-line, time-consuming
building of resources and provides little flexibility and portability.
New concepts are identified in the texts, but only human
experts can extend the domain model. The disadvantage of these systems is
the use of predefined semantic resources (thesauri, lists of
concepts, etc.) which are not modified at runtime. IR applications deal with
incomplete and erroneous data, which requires robust parsing
methods. Deep semantic and syntactic parsing methods fail to handle
erroneous input and need a large amount of linguistic knowledge.</p>
      <p>Another approach used by IR and IE applications is the
data-driven acquisition of resources. The use of semantic
items (terms, summaries) as indexes, or of the inference capabilities of
knowledge representation formalisms, are just a few examples of the
data-driven acquisition paradigm.</p>
      <p>Document Surrogater uses phrasal terms as indexes,
eliminating the ambiguity introduced by single words used for indexing.
The algorithm uses a special module to produce a set of significant
terms from a focus file prepared by a human expert. An application
of this methodology is document summarization [15]. Another
approach matches user queries against document summaries that are stored
in index files. Summaries contain the relevant concepts and relations for
each document [11]. Special DL operators are defined for generating
document summaries [8].</p>
      <p>
        Other systems use terminological information acquired from texts,
like FASTER [
        <xref ref-type="bibr" rid="ref1">4</xref>
        ]. It identifies multi-word terms and uses them as
indexes for the document base. A terminological base is used and is
extended with new term candidates (generated by morphological
transformations on the terms).
      </p>
      <p>Other systems use DLs as the representation formalism of the domain
knowledge base. An example is CLASSIC [14], used for manually
indexing the documents of a digital library (by author name, title, and
subject), the documents being stored in XML (Extensible
Markup Language) format.</p>
      <p>We designed a prototype of a system for querying a set of documents
in natural language (French). We use semantic resources for filtering
the search, and we adopt a data-driven methodology for resource
acquisition. The domain hierarchy is represented in description logics
(DL), providing efficiency and fault tolerance to incomplete or
erroneous data. The logical inference mechanisms provided by DL are used to
extend the domain model dynamically and to complete missing
information identified from the user query. Building linguistic and
domain knowledge requires minimal effort from the human designer,
while the system integrates shallow natural language processing techniques.</p>
      <p>The system could be easily ported for another domain, due to the
dynamic maintenance of the domain knowledge base. The new
concepts inferred from the new documents are validated and added to
the existing hierarchy. The methodology is not appropriate for
unrestricted domains, due to the limited size of the ontology supported
by the system.
</p>
    </sec>
    <sec id="sec-2">
      <title>2 Description Logics</title>
      <p>Description logics (DL) are knowledge representation formalisms
related to semantic networks and frame systems ([1]).</p>
      <p>DL structures the domain knowledge on two levels: a
terminological level (T-Box), containing the axioms defining the classes of
objects of the domain (named concepts), with their properties and
relations (roles) with other objects, and an assertional level (A-Box),
containing the objects of the abstract classes (individuals). The main
reasoning service available in the T-Box is subsumption between two
concepts, determining which concept is more general. The A-Box provides
an instantiation test, determining which concept or role a given
instance instantiates.</p>
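      <p>A minimal sketch of the T-Box/A-Box split and of these two reasoning services, under the simplifying assumption that concepts are plain atomic names (the concept and individual names below are hypothetical):</p>
      <preformat>
```python
# Minimal sketch of the T-Box / A-Box split described above.
# Concepts are atomic names here; real DL systems handle complex
# descriptions. All names are illustrative only.

tbox = {
    # concept: set of direct parent concepts (subsumption axioms)
    "Person": set(),
    "Woman": {"Person"},
    "Mother": {"Woman"},
}

abox = {
    # individual: the concept it instantiates
    "mary": "Mother",
}

def subsumes(general, specific):
    """T-Box service: True if 'general' subsumes 'specific'."""
    if general == specific:
        return True
    return any(subsumes(general, parent) for parent in tbox[specific])

def instance_of(individual, concept):
    """A-Box service: does the individual belong to the concept?"""
    return subsumes(concept, abox[individual])

print(subsumes("Person", "Mother"))   # True: Person is more general
print(instance_of("mary", "Woman"))   # True: mary is a Mother, hence a Woman
```
      </preformat>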
      <p>Some of the basic logical operators which are used for creating
complex conceptual descriptions are the following:</p>
      <p>DL operator | Logic expression | DL interpretation
C = (AND D1 D2) | D1 ⊓ D2 | conjunction of conceptual descriptions
C = (OR D1 D2) | D1 ⊔ D2 | disjunction of conceptual descriptions
C = (NOT D) | ¬D | the complement of a concept
C = (SOME Rel D) | ∃Rel.D | there is at least one object belonging to D related by the relation Rel with the objects of C
C = (ALL Rel D) | ∀Rel.D | restricts the co-domain of the relation Rel to D
C = (AT-LEAST n Rel D) | ∃y1 ... yn : R(x, yi) ∧ D(yi) (1 ≤ i ≤ n) | there are at least n objects of D in relation Rel with C</p>
      <p>DLs provide powerful inference mechanisms. At the
terminological level, the main reasoning mechanism is the subsumption
relation between two concepts (detecting which concept is more general
than the other one). A concept description can be checked for
satisfiability. Classification is a partial ordering of the concepts in a
hierarchy. The A-Box provides a consistency test (i.e. whether there is a
contradiction in the set of statements). The instantiation test detects
which conceptual description is instantiated by a given instance, and
retrieval inference allows retrieving the individuals which belong to a
given concept. It also provides the possibility of reasoning about the
membership relation between pairs of individuals and relations.</p>
      <p>DLs are appropriate for applications dealing with semi-structured
or incomplete data, like IR systems. The concepts are defined by their
roles and attributes. The instances do not have to contain all the values
of the concept attributes. In some frame-based knowledge representation
formalisms, missing values are not allowed, while DL accepts defining
incomplete instances.</p>
      <p>Example. The definition (define-concept Mother (AND Woman
(SOME hasChild Child) (ALL hasAge Age))) is interpreted as: a Mother is a
Woman that has at least one child (through the relation hasChild), that
child being an instance of the concept Child. For each instance of the
concept Mother, all the instances related by hasAge must be individuals
of the concept Age.</p>
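      <p>The Lisp-style definition above can be read into a nested structure with a small reader; this is a sketch assuming well-formed input, not the system's actual parser:</p>
      <preformat>
```python
# A minimal reader for the Lisp-like DL syntax used in the example.
# Assumes well-formed, fully parenthesized input.

def tokenize(text):
    # Pad parentheses with spaces, then split on whitespace.
    return text.replace("(", " ( ").replace(")", " ) ").split()

def read(tokens):
    # Consume one expression: either an atom or a parenthesized list.
    token = tokens.pop(0)
    if token == "(":
        expr = []
        while tokens[0] != ")":
            expr.append(read(tokens))
        tokens.pop(0)  # drop the closing ")"
        return expr
    return token

definition = "(define-concept Mother (AND Woman (SOME hasChild Child) (ALL hasAge Age)))"
parsed = read(tokenize(definition))
print(parsed[1])      # Mother
print(parsed[2][0])   # AND
```
      </preformat>
      <p>The resulting nested lists mirror the structure of the conceptual description, so conjuncts and role restrictions can be walked recursively.</p>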
    </sec>
    <sec id="sec-3">
      <title>3 System Architecture</title>
      <p>The NLP modules described above process user queries or
documents to be included in the base:
a) The document is first processed by the wordcounter. The contexts of
the content words provided by this module are used for identifying new
concepts.
b) Lexical information is assigned to each word by the POS tagger.
c) The sense tagger labels the syntagms and words with conceptual
descriptions.
d) The semantic chunks are identified in the input text.
e) We identify a set of new conceptual descriptions (as a result of
heuristic rules), checked by the DL module. The new conceptual
definitions are validated by the DL module and they are added to the
domain hierarchy.</p>
      <p>If we process user input, then the instances of the concepts are
retrieved from the domain hierarchy. If new documents were processed,
then the domain hierarchy could be extended with new concepts.</p>
      <p>The content of the hierarchy is extended dynamically, as the Web
is modified every moment: new pages appear, others become unavailable.
New documents are added to the base and processed by the NLP modules for
new concept identification. While a new document is indexed, the system
identifies the concepts in the document and extends the domain model
accordingly.</p>
      <p>User queries are first processed as a set of keywords. The system
retrieves a number of documents containing these keywords. These
documents are then processed by the NLP modules for refining the search
results and for identifying new concepts.</p>
      <p>Each chunk is assigned a conceptual description. Heuristic rules
are used for combining partial semantic descriptions:
if (⟨Chunk1⟩ ⟨Border⟩ ⟨Chunk2⟩) and (Noun in Chunk1) and (Modifier in Chunk2)
then new concept (and sem(Chunk1) (SOME relation sem(Chunk2)))
where relation is the most general role in the role hierarchy. We use
this rule for combining the chunks.</p>
      <p>In this example, "le patient" and "infarctus" are semantic
chunks, containing the relevant information. The example contains
syntactic errors (the determiner "un" is missing), but the semantic
information is sufficient to understand the query and to return a
correct answer.</p>
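      <p>The chunk-combination heuristic can be sketched as follows; the chunk representation and the semantic labels are hypothetical, not the system's actual data structures:</p>
      <preformat>
```python
# Sketch of the chunk-combination heuristic described above.
# The dict-based chunk structure and the concept names are
# illustrative only.

def combine(chunk1, chunk2):
    """If chunk1 carries a noun and chunk2 a modifier, build the
    combined description (and sem(Chunk1) (SOME relation sem(Chunk2))),
    where 'relation' stands for the most general role."""
    if "Noun" in chunk1["tags"] and "Modifier" in chunk2["tags"]:
        return ["AND", chunk1["sem"], ["SOME", "relation", chunk2["sem"]]]
    return None  # the rule does not apply

chunk1 = {"text": "le patient", "tags": {"Noun"}, "sem": "Patient"}
chunk2 = {"text": "infarctus", "tags": {"Modifier"}, "sem": "Infarctus"}

print(combine(chunk1, chunk2))
# ['AND', 'Patient', ['SOME', 'relation', 'Infarctus']]
```
      </preformat>
      <p>Using the most general role keeps the rule applicable to any chunk pair; the DL classifier can later specialize or reject the resulting description.</p>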
    </sec>
    <sec id="sec-4">
      <title>3.2 Functionality</title>
    </sec>
    <sec id="sec-5">
      <title>4 DL Conceptual Hierarchy</title>
    </sec>
    <sec id="sec-6">
      <title>4.1 Initializing the ontology</title>
      <p>This section illustrates the use of the DL concept hierarchy for
handling erroneous or incomplete data. User queries are interpreted by
the NLP modules in order to extract a semantic representation. The
concepts identified in the user query are used to retrieve the
instances.</p>
      <p>Example. The user asks</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>4.3 User queries</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>