<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Description Logics for Ontology Extraction</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Amalia Todirașcu (University Al.I.Cuza of Iasi</institution>
          ,
          <addr-line>Romania and LIIA, ENSAIS, France) François de Beuvron, LIIA, ENSAIS</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>François Rousselot</institution>
          ,
          <addr-line>LIIA, ENSAIS</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The paper presents a prototype of a system for querying the Web in natural language (French) for a limited domain. The domain knowledge, represented in description logics (DL), is used for filtering the results of the search and is extended dynamically (when new concepts are identified in the texts) as a result of DL inference mechanisms. The conceptual hierarchy is built semi-automatically from the texts. Several small French corpora (heart surgery, newspaper articles, papers on natural language processing) have been used to experiment with the prototype. The system uses shallow natural language parsing techniques, and DL reasoning mechanisms are used to handle incomplete or incorrect user queries.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Web search engines accept user queries composed of a set of
keywords, written in a command language or in natural language. These
systems use index files for retrieving documents. Indexes can be
keywords, terms, or syntactic or semantic structures.</p>
      <p>User queries are transformed into semantic representations, which
are matched against the index items. The semantic representation of
the query can be a set of keywords or a more complex semantic
structure. The performance of these systems is evaluated
by two parameters: recall (the number of relevant documents
retrieved/the total number of relevant documents) and precision (the number of relevant
documents retrieved/the number of retrieved documents). Keyword-based
search engines provide poor recall (no handling of synonyms or of
generalization/specialization) and low precision (the answers contain
a significant amount of irrelevant information).</p>
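      <p>As a minimal illustration of the two measures (the document identifiers below are hypothetical, not from the paper), recall and precision can be computed as:</p>
      <preformat>
```python
# Recall and precision on toy data; the document identifiers are
# illustrative only, not taken from the paper.
relevant = {"d1", "d2", "d3", "d4"}   # documents actually relevant to the query
retrieved = {"d2", "d3", "d5"}        # documents returned by the engine

# relevant documents that were actually retrieved
true_hits = relevant.intersection(retrieved)

recall = len(true_hits) / len(relevant)       # 2 / 4 = 0.5
precision = len(true_hits) / len(retrieved)   # 2 / 3

print(recall, precision)
```
      </preformat>
      <p>A keyword-only engine typically lowers both numbers: synonyms of the query words shrink the intersection (hurting recall), while ambiguous keywords inflate the retrieved set (hurting precision).</p>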
      <p>
        Several modern IR (Information Retrieval) systems use
semantic resources as filters to improve search results: keywords combined with
multiple-word terms [9], their semantic variations [
        <xref ref-type="bibr" rid="ref1">4</xref>
        ], thesauri
(Corelex [2], EuroWordNet [13]), or lists of synonyms [7]. Natural
language querying needs linguistic knowledge and NLP tools, such as
conceptual sentence parsers applying patterns with case constraints
to extract information [9], or predefined case frames [8].
      </p>
      <p>The design of these systems involves off-line, time-consuming
building of resources and provides little flexibility and portability.
New concepts are identified in the texts, but only human
experts can extend the domain model. The disadvantage of these systems is
the use of predefined semantic resources (thesauri, lists of
concepts, etc.) which are not modified at runtime. IR applications deal with
incomplete and erroneous data, which requires robust parsing
methods. Deep semantic and syntactic parsing methods fail to handle
erroneous input and need a large amount of linguistic knowledge.</p>
      <p>Another approach used by IR and IE applications is the
data-driven acquisition of resources. The use of semantic
items (terms, summaries) as indexes, or of the inference capabilities of
knowledge representation formalisms, are just a few examples of the
data-driven acquisition paradigm.</p>
      <p>Document Surrogater uses phrasal terms as indexes,
eliminating the ambiguity introduced by single words used for indexing.
The algorithm uses a special module to produce a set of significant
terms from a focus file prepared by a human expert. An application
of this methodology is document summarization [15]. Another
approach matches user queries against document summaries that are stored
in index files. Summaries contain the relevant concepts and relations for
each document [11]. Special DL operators are defined for generating
document summaries [8].</p>
      <p>
        Other systems use terminological information acquired from texts,
like FASTER [
        <xref ref-type="bibr" rid="ref1">4</xref>
        ]. It identifies multi-word terms and uses them as
indexes for the document base. A terminological base is used and is
extended with new term candidates (generated by morphological
transformations on the terms).
      </p>
      <p>Other systems use DLs as the representation formalism of the domain
knowledge base. An example is CLASSIC [14], used for manually
indexing the documents of a digital library (by author name, title, and
subject), the documents being stored in XML (Extensible
Markup Language) format.</p>
      <p>We designed a prototype of a system for querying a set of documents
in natural language (French). We use semantic resources for filtering
the search, and we adopt a data-driven methodology for resource
acquisition. The domain hierarchy is represented in description logics
(DL), providing efficiency and fault tolerance to incomplete or
erroneous data. The logical inference mechanisms provided by DL are used to
extend the domain model dynamically and to complete missing
information identified from the user query. Building linguistic and
domain knowledge requires minimal effort from the human designer,
while the system integrates shallow natural language processing techniques.</p>
      <p>The system could be easily ported for another domain, due to the
dynamic maintenance of the domain knowledge base. The new
concepts inferred from the new documents are validated and added to
the existing hierarchy. The methodology is not appropriate for
unrestricted domains, due to the limited size of the ontology supported
by the system.
</p>
    </sec>
    <sec id="sec-2">
      <title>2 Description Logics</title>
      <p>Description logics (DL) are knowledge representation formalisms
related to semantic networks and frame systems ([1]).</p>
      <p>DL structures the domain knowledge on two levels: a
terminological level (T-Box), containing the axioms defining the classes of
objects of the domain (named concepts), with their properties and
relations (roles) with other objects, and an assertional level (A-Box),
containing the objects of the abstract classes (individuals). The main
reasoning service available in the T-Box is subsumption between two
concepts, determining which concept is more general. The A-Box provides
an instantiation test, determining which concept or role a given
instance instantiates.</p>
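      <p>A minimal sketch of the T-Box/A-Box split and of these two reasoning services, under the simplifying assumption that concepts are plain atomic names (the concept and individual names below are hypothetical):</p>
      <preformat>
```python
# Minimal sketch of the T-Box / A-Box split described above.
# Concepts are atomic names here; real DL systems handle complex
# descriptions. All names are illustrative only.

tbox = {
    # concept: set of direct parent concepts (subsumption axioms)
    "Person": set(),
    "Woman": {"Person"},
    "Mother": {"Woman"},
}

abox = {
    # individual: the concept it instantiates
    "mary": "Mother",
}

def subsumes(general, specific):
    """T-Box service: True if 'general' subsumes 'specific'."""
    if general == specific:
        return True
    return any(subsumes(general, parent) for parent in tbox[specific])

def instance_of(individual, concept):
    """A-Box service: does the individual belong to the concept?"""
    return subsumes(concept, abox[individual])

print(subsumes("Person", "Mother"))   # True: Person is more general
print(instance_of("mary", "Woman"))   # True: mary is a Mother, hence a Woman
```
      </preformat>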
      <p>Some of the basic logical operators which are used for creating
complex conceptual descriptions are the following:</p>
      <p>DL operator | Logic expression | DL interpretation
C = (AND D1 D2) | D1 ⊓ D2 | conjunction of conceptual descriptions
C = (OR D1 D2) | D1 ⊔ D2 | disjunction of conceptual descriptions
C = (NOT D) | ¬D | the complement of a concept
C = (SOME Rel D) | ∃Rel.D | there is at least one object belonging to D related by the relation Rel with the objects of C
C = (ALL Rel D) | ∀Rel.D | restricts the co-domain of the relation Rel to D
C = (AT-LEAST n Rel D) | ∃y1 ... yn : R(x, yi) ∧ D(yi) (1 ≤ i ≤ n) | there are at least n objects of D in relation Rel with C</p>
      <p>DLs provide powerful inference mechanisms. At the
terminological level, the main reasoning mechanism is the subsumption
relation between two concepts (detecting which concept is more general
than the other one). A concept description can be checked for
satisfiability. Classification is a partial ordering of the concepts in a
hierarchy. The A-Box provides a consistency test (i.e. whether there is a
contradiction in the set of statements). The instantiation test detects
which conceptual description is instantiated by a given instance, and
retrieval inference allows retrieving the individuals which belong to a
given concept. It also provides the possibility of reasoning about the
membership relation between pairs of individuals and relations.</p>
      <p>DLs are appropriate for applications dealing with semi-structured
or incomplete data, like IR systems. The concepts are defined by their
roles and attributes. The instances do not have to contain all the values
of the concept attributes. In some frame-based knowledge representation
formalisms, missing values are not allowed, while DL accepts defining
incomplete instances.</p>
      <p>Example. The definition (define-concept Mother (AND Woman
(SOME hasChild Child) (ALL hasAge Age))) is interpreted as: a Mother is a
Woman that has at least one child (through the relation hasChild), that
child being an instance of the concept Child. For each instance of the
concept Mother, all the instances related by hasAge must be individuals
of the concept Age.</p>
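      <p>The Lisp-style definition above can be read into a nested structure with a small reader; this is a sketch assuming well-formed input, not the system's actual parser:</p>
      <preformat>
```python
# A minimal reader for the Lisp-like DL syntax used in the example.
# Assumes well-formed, fully parenthesized input.

def tokenize(text):
    # Pad parentheses with spaces, then split on whitespace.
    return text.replace("(", " ( ").replace(")", " ) ").split()

def read(tokens):
    # Consume one expression: either an atom or a parenthesized list.
    token = tokens.pop(0)
    if token == "(":
        expr = []
        while tokens[0] != ")":
            expr.append(read(tokens))
        tokens.pop(0)  # drop the closing ")"
        return expr
    return token

definition = "(define-concept Mother (AND Woman (SOME hasChild Child) (ALL hasAge Age)))"
parsed = read(tokenize(definition))
print(parsed[1])      # Mother
print(parsed[2][0])   # AND
```
      </preformat>
      <p>The resulting nested lists mirror the structure of the conceptual description, so conjuncts and role restrictions can be walked recursively.</p>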
    </sec>
    <sec id="sec-3">
      <title>3 System Architecture</title>
      <p>The NLP modules described above process user queries or
documents to be included in the base:
a) The document is first processed by the wordcounter. The contexts of
the content words provided by this module are used for identifying new
concepts.
b) Lexical information is assigned to each word by the POS tagger.
c) The sense tagger labels the syntagms and words with conceptual
descriptions.
d) The semantic chunks are identified in the input text.
e) We identify a set of new conceptual descriptions (as a result of
heuristic rules), checked by the DL module. The new conceptual
definitions are validated by the DL module and they are added to the
domain hierarchy.</p>
      <p>If we process user input, then the instances of the concepts are
retrieved from the domain hierarchy. If new documents were processed,
then the domain hierarchy could be extended with new concepts.</p>
      <p>The content of the hierarchy is extended dynamically, as the Web
is modified every moment: new pages appear, others become unavailable.
New documents are added to the base and processed by the NLP modules for
new concept identification. While a new document is indexed, the system
identifies the concepts in the document and extends the domain model
accordingly.</p>
      <p>User queries are first processed as a set of keywords. The system
retrieves a number of documents containing these keywords. These
documents are then processed by the NLP modules for refining the search
results and for identifying new concepts.</p>
      <p>Each chunk is assigned a conceptual description. Heuristic rules
are used for combining partial semantic descriptions:
if (⟨Chunk1⟩ ⟨Border⟩ ⟨Chunk2⟩) and (Noun in Chunk1) and (Modifier in Chunk2)
then new concept (and sem(Chunk1) (SOME relation sem(Chunk2)))
where relation is the most general role in the role hierarchy. We use
this rule for combining the chunks.</p>
      <p>In this example, "le patient" and "infarctus" are semantic
chunks, containing the relevant information. The example contains
syntactic errors (the determiner "un" is missing), but the semantic
information is sufficient to understand the query and to return a
correct answer.</p>
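      <p>The chunk-combination heuristic can be sketched as follows; the chunk representation and the semantic labels are hypothetical, not the system's actual data structures:</p>
      <preformat>
```python
# Sketch of the chunk-combination heuristic described above.
# The dict-based chunk structure and the concept names are
# illustrative only.

def combine(chunk1, chunk2):
    """If chunk1 carries a noun and chunk2 a modifier, build the
    combined description (and sem(Chunk1) (SOME relation sem(Chunk2))),
    where 'relation' stands for the most general role."""
    if "Noun" in chunk1["tags"] and "Modifier" in chunk2["tags"]:
        return ["AND", chunk1["sem"], ["SOME", "relation", chunk2["sem"]]]
    return None  # the rule does not apply

chunk1 = {"text": "le patient", "tags": {"Noun"}, "sem": "Patient"}
chunk2 = {"text": "infarctus", "tags": {"Modifier"}, "sem": "Infarctus"}

print(combine(chunk1, chunk2))
# ['AND', 'Patient', ['SOME', 'relation', 'Infarctus']]
```
      </preformat>
      <p>Using the most general role keeps the rule applicable to any chunk pair; the DL classifier can later specialize or reject the resulting description.</p>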
    </sec>
    <sec id="sec-4">
      <title>3.2 Functionality</title>
    </sec>
    <sec id="sec-5">
      <title>4 DL Conceptual Hierarchy</title>
    </sec>
    <sec id="sec-6">
      <title>4.1 Initializing the ontology</title>
      <p>This section illustrates the use of the DL concept hierarchy for
handling erroneous or incomplete data. User queries are interpreted by
the NLP modules in order to extract a semantic representation. The
concepts identified in the user query are used to retrieve the
instances.</p>
      <p>Example. The user asks</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>4.3 User queries</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>