=Paper=
{{Paper
|id=Vol-40/paper-7
|storemode=property
|title=Learning Ontologies for the Semantic Web
|pdfUrl=https://ceur-ws.org/Vol-40/maedche+staab.pdf
|volume=Vol-40
}}
==Learning Ontologies for the Semantic Web==
<pdf width="1500px">https://ceur-ws.org/Vol-40/maedche+staab.pdf</pdf>
<pre>
                       Learning Ontologies for the Semantic Web

                            Alexander Maedche                                                         Steffen Staab
                                Institute AIFB,                                                      Institute AIFB,
                            University of Karlsruhe,                                             University of Karlsruhe,
                          76128 Karlsruhe, Germany                                             76128 Karlsruhe, Germany
                              Ontoprise GmbH,                                                      Ontoprise GmbH,
                           Haid-und-Neu Strasse 7                                               Haid-und-Neu Strasse 7
                          76131 Karlsruhe, Germany                                             76131 Karlsruhe, Germany
                      ama@aifb.uni-karlsruhe.de                                              sst@aifb.uni-karlsruhe.de


Permission to make digital or hard copies of all or part of this work for personal or classroom
use is granted without fee provided that copies are not made or distributed for profit or
commercial advantage and that copies bear this notice and the full citation on the first page.
To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior
specific permission by the authors.
Semantic Web Workshop 2001 Hongkong, China
Copyright by the authors.

ABSTRACT
The Semantic Web relies heavily on the formal ontologies that structure underlying data for the
purpose of comprehensive and transportable machine understanding. Therefore, the success of the
Semantic Web depends strongly on the proliferation of ontologies, which requires fast and easy
engineering of ontologies and avoidance of a knowledge acquisition bottleneck.
   Ontology Learning greatly facilitates the construction of ontologies by the ontology engineer.
The vision of ontology learning that we propose here includes a number of complementary
disciplines that feed on different types of unstructured, semi-structured and fully structured
data in order to support a semi-automatic, cooperative ontology engineering process. Our ontology
learning framework proceeds through ontology import, extraction, pruning, refinement, and
evaluation, giving the ontology engineer a wealth of coordinated tools for ontology modeling.
Besides the general framework and architecture, we show in this paper some exemplary techniques
in the ontology learning cycle that we have implemented in our ontology learning environment,
Text-To-Onto, such as ontology learning from free text, from dictionaries, or from legacy
ontologies, and refer to some others that need to complement the complete architecture, such as
reverse engineering of ontologies from database schemata or learning from XML documents.

1.     ONTOLOGIES FOR THE SEMANTIC WEB
   Conceptual structures that define an underlying ontology are germane to the idea of machine
processable data on the Semantic Web. Ontologies are (meta)data schemas, providing a controlled
vocabulary of concepts, each with an explicitly defined and machine processable semantics. By
defining shared and common domain theories, ontologies help both people and machines to
communicate concisely, supporting the exchange of semantics and not only syntax. Hence, the
cheap and fast construction of domain-specific ontologies is crucial for the success and the
proliferation of the Semantic Web.
   Though ontology engineering tools have become mature over the last decade (cf. [9]), the
manual acquisition of ontologies still remains a tedious, cumbersome task resulting easily in a
knowledge acquisition bottleneck. Having developed our ontology engineering workbench, OntoEdit,
we had to face exactly this issue; in particular, we were asked questions like

   - Can you develop an ontology fast? (time)
   - Is it difficult to build an ontology? (difficulty)
   - How do you know that you've got the ontology right? (confidence)

   In fact, these problems of time, difficulty and confidence that we ended up with were similar
to what knowledge engineers had dealt with over the last two decades when they elaborated on
methodologies for knowledge acquisition or workbenches for defining knowledge bases. A method
that proved extremely beneficial for the knowledge acquisition task was the integration of
knowledge acquisition with machine learning techniques [33]. The drawback of these approaches,
e.g. the work described in [21], however, was their rather strong focus on structured knowledge
or data bases, from which they induced their rules.
   In contrast, in the Web environment that we encounter when building Web ontologies, the
structured knowledge or data base is rather the exception than the norm. Hence, intelligent
means for an ontology engineer take on a different meaning than the (very seminal) integration
architectures for more conventional knowledge acquisition [7].
   Our notion of Ontology Learning aims at the integration of a multitude of disciplines in
order to facilitate the construction of ontologies, in particular machine learning. Because the
fully automatic acquisition of knowledge by machines remains in the distant future, we consider
the process of ontology learning as semi-automatic with human intervention, adopting the paradigm
of balanced cooperative modeling [20] for the construction of ontologies for the Semantic Web.
With this objective in mind, we have built an architecture that combines knowledge acquisition
with machine learning, feeding on the resources that we nowadays find on the syntactic Web, viz.
free text, semi-structured text, schema definitions
(DTDs), etc. Thereby, modules in our framework serve different steps in the engineering cycle,
which here consists of the following five steps (cf. Figure 1):
   First, existing ontologies are imported and reused by merging existing structures or defining
mapping rules between existing structures and the ontology to be established. For instance, [26]
describe how ontological structures contained in Cyc are used in order to facilitate the
construction of a domain-specific ontology. Second, in the ontology extraction phase major parts
of the target ontology are modeled with learning support feeding from web documents. Third, this
rough outline of the target ontology needs to be pruned in order to better adjust the ontology
to its prime purpose. Fourth, ontology refinement profits from the given domain ontology, but
completes the ontology at a fine granularity (also in contrast to extraction). Fifth, the prime
target application serves as a measure for validating the resulting ontology [31]. Finally, one
may revolve again in this cycle, e.g. for including new domains into the constructed ontology or
for maintaining and updating its scope.

2.   THE ONTOLOGY LEARNING KILLER APPLICATION
   Though ontologies and their underlying data in the Semantic Web are envisioned to be reusable
for a wide range of possibly unforeseen applications^1, a particular target application remains
the touchstone for a given ontology. In our case, we have been dealing with ontology-based
knowledge portals that structure Web content and that allow for structured provisioning and
accessing of data [29, 30]. Knowledge portals are information intermediaries for knowledge
accessing and sharing on the Web. The development of a knowledge portal consists of the tasks of
structuring the knowledge, establishing means for providing new knowledge and accessing the
knowledge contained in the portal.
   A considerable part of development and maintenance of the portal lies in integrating legacy
information as well as in constructing and maintaining the ontology in vast, often unknown,
terrain. For instance, a knowledge portal may focus on the electronics sector, integrating
comparative shopping in conjunction with manuals, reports and opinions about current electronic
products. The creation of the background ontology for this knowledge portal involves tremendous
efforts for engineering the conceptual structures that underlie existing warehouse databases,
product catalogues, user manuals, test reports and newsgroup discussions. Correspondingly,
ontology structures must be constructed from database schemata, a given product thesaurus (like
BMEcat), XML documents and document type definitions (DTDs), and free texts. Still worse,
significant parts of these (meta-)data change extremely fast and, hence, require a regular
update of the corresponding ontology parts.
   Thus, very different types of (meta-)data might be useful input for the construction of the
ontology. However, in practice one needs comprehensive support^2 in order to deal with this
wealth. Hence, there comes the need for a range of different techniques: Structured data and
meta data require reverse engineering approaches, free text may contribute to ontology learning
directly or through information extraction approaches.^3 Semi-structured data may finally
require and profit from both.
   In the following we elaborate on our ontology learning framework. Thereby we approach
different techniques for different types of data, showing parts of our architecture, its current
status, and parts that may complement our current Text-To-Onto environment.
   A general overview of ontology learning techniques as well as corresponding references may be
found in Section 9.

^1 Just like Tim Berners-Lee did not foresee online auctions being a common business model of
   the Web of 2000.
^2 Comprehensive support with ontology learning need not necessarily imply the top-notch
   learning algorithms, but may rely more heavily on appropriate tool support and methodology.
^3 In fact, ontology learning for free text serves a double purpose. On the one hand it yields a
   readily exploitable ontology for Semantic Web purposes, on the other hand it often returns
   improved information extraction and natural language understanding means adjusted to the
   learned ontology, cf. [10].

3. AN ARCHITECTURE FOR ONTOLOGY LEARNING
   Given the task of constructing and maintaining an ontology for a Semantic Web application,
e.g. for an ontology-based knowledge portal that we have been dealing with (cf. [29]), we have
produced a wish list of what kind of support we would fancy.

3.1 Ontology Engineering Workbench OntoEdit
   As core to our approach we have built a graphical user interface to support the ontology
engineering process manually performed by the ontology engineer. Here, we offer sophisticated
graphical means for manual modeling and refining the final ontology. Different views are offered
to the user targeting the epistemological level rather than a particular representation
language. However, the ontological structures built there may be exported to standard Semantic
Web representation languages, such as OIL and DAML-ONT, as well as our own F-Logic based
extensions of RDF(S). In addition, executable representations for constraint checking and
application debugging can be generated and then accessed via SilRi^4, our F-Logic inference
engine, that is directly connected with OntoEdit.

^4 http://www.ontoprise.com/ (then download area).

   The sophisticated ontology engineering tools we knew, e.g. the Protégé modeling environment
for knowledge-based systems [9], would offer capabilities roughly comparable to OntoEdit.
However, given the task of constructing a knowledge portal, we found that there was this large
conceptual gap between the ontology engineering tool and the input (often legacy data), such as
Web documents, Web document schemata, databases on the Web, and Web ontologies, which ultimately
determined the target ontology. Into this void we have positioned new components of our ontology
learning architecture (cf. Figure 2). The new components support the ontology engineer in
importing existing ontology primitives, extracting new ones, pruning given ones, or refining
with additional ontology primitives. In our case, the ontology primitives comprise:

   - a set of strings that describe lexical entries L for concepts and relations;
[Figure: cycle of the steps Import / Reuse, Extract, Prune, Refine and Apply around a central
 Domain Ontology, labeled Ontology Learning and fed by Legacy + Application Data]

                                      Figure 1: Ontology Learning process steps

   - a set of concepts^5, C;
   - a taxonomy of concepts with multiple inheritance (heterarchy) HC;
   - a set of non-taxonomic relations, R, described by their domain and range restrictions;
   - a heterarchy of relations, i.e. a set of taxonomic relations HR;
   - relations F and G that relate concepts and relations with their lexical entries,
     respectively; and, finally,
   - a set of axioms A that describe additional constraints on the ontology and allow to make
     implicit facts explicit [29].

   This structure corresponds closely to RDFS; the one exception is the explicit consideration
of lexical entries. The separation of concept reference and concept denotation, which may be
easily expressed in RDF, allows to provide very domain-specific ontologies without incurring an
instantaneous conflict when merging ontologies, a standard request in the Semantic Web. For
instance, the lexical entry "school" may refer to a building in ontology A, but to an
organization in ontology B, or to both in ontology C. Also, in ontology A the concept referred
to in English by "school" and "school building" may be referred to in German by "Schule" and
"Schulgebäude".

^5 Concepts in our framework are roughly akin to synsets in WordNet [19].

   Ontology learning relies on ontology structures given along these lines and on input data as
described above in order to propose new knowledge about reasonably interesting concepts,
relations, lexical entries, or about links between these entities, proposing the addition, the
deletion, or the merging of some of them. The results of the ontology learning process are
presented to the ontology engineer by the graphical result set representation (cf. Figure 4 for
an example of how extracted properties may be presented). The ontology engineer may then browse
the results and decide to follow, delete, or modify the proposals in accordance with the purpose
of her task.

4. COMPONENTS FOR LEARNING ONTOLOGIES
   Integrating the considerations from above into a coherent generic architecture for extracting
and maintaining ontologies from data on the Web, we have identified several core components.
There are (i) a generic management component dealing with delegation of tasks and constituting
the infrastructure backbone, (ii) a resource processing component working on input data from the
Web including, in particular, a natural language processing system, (iii) an algorithm library
working on the output of the resource processing component as well as the ontology structures
sketched above and returning result sets also mentioned above and (iv) the graphical user
interface for ontology engineering, OntoEdit.
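
The ontology primitives of Section 3.1 (L, C, HC, R, HR, F, G, A) can be pictured as a small
data structure. The following is only a minimal sketch under our own assumptions, not the
representation used in OntoEdit or Text-To-Onto: the class name `Ontology`, its field and method
names, and the concept identifiers (`Building`, `SchoolBuilding`, ...) are all hypothetical. It
shows how keeping lexical entries (L, F) apart from concepts (C) lets the entry "school" denote
two concepts at once, as in "ontology C" of the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Hypothetical container for the primitives L, C, HC, R, HR, F, G, A."""
    lexicon: set = field(default_factory=set)             # L: lexical entries (strings)
    concepts: set = field(default_factory=set)            # C: concept identifiers
    concept_parents: dict = field(default_factory=dict)   # HC: concept -> direct super-concepts
    relations: dict = field(default_factory=dict)         # R: relation -> (domain, range)
    relation_parents: dict = field(default_factory=dict)  # HR: relation -> direct super-relations
    concept_lex: set = field(default_factory=set)         # F: (lexical entry, concept) pairs
    relation_lex: set = field(default_factory=set)        # G: (lexical entry, relation) pairs
    axioms: list = field(default_factory=list)            # A: kept as opaque strings here

    def add_concept(self, concept, parents=(), entries=()):
        """Register a concept, its direct super-concepts and its lexical entries."""
        self.concepts.add(concept)
        self.concept_parents.setdefault(concept, set()).update(parents)
        for entry in entries:
            self.lexicon.add(entry)
            self.concept_lex.add((entry, concept))

    def concepts_for(self, entry):
        """All concepts that a lexical entry refers to (via F)."""
        return {c for (e, c) in self.concept_lex if e == entry}

    def ancestors(self, concept):
        """All direct and indirect super-concepts; multiple parents are allowed."""
        seen, stack = set(), [concept]
        while stack:
            for parent in self.concept_parents.get(stack.pop(), set()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# "Ontology C" from the text: the entry "school" refers to two concepts.
onto_c = Ontology()
onto_c.add_concept("Building")
onto_c.add_concept("Organization")
onto_c.add_concept("SchoolBuilding", parents={"Building"},
                   entries={"school", "school building", "Schule", "Schulgebäude"})
onto_c.add_concept("SchoolOrganization", parents={"Organization"}, entries={"school"})
```

Here `onto_c.concepts_for("school")` yields both school concepts, whereas
`onto_c.concepts_for("Schulgebäude")` yields only the building concept, so merging ontologies
does not force a single meaning onto a shared word.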
[Figure: diagram with the labels Web documents (Crawl corpus), DTD (Import semi-structured
 schema), Legacy databases (Import schema), Ontology / WordNet / O1 / O2 (Import existing
 ontologies), Management Component, Resource Processing Component, NLP System,
 Lexicon 1 ... Lexicon n, Domain Ontology, Algorithm Library, Result Set, OntoEdit,
 Inference Engine(s), Ontology Engineer]

                         Figure 2: Architecture for Learning Ontologies for the Semantic Web
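
The interplay of the four components of Figure 2 can be read as a simple pipeline: resources are
turned into relational rows, algorithms turn rows into proposals, and the ontology engineer
reviews the combined result set. The sketch below is an illustration under our own assumptions,
not the Text-To-Onto implementation: `Proposal`, `tokenize`, `frequent_terms` and `run_cycle`
are hypothetical names, and the frequency threshold of two is arbitrary.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    action: str  # "add", "delete" or "merge"
    kind: str    # e.g. "lexical_entry", "concept", "relation"
    value: str


def tokenize(resource):
    """Toy resource processor: one (token,) row per lower-cased word."""
    return [(word.strip(".,").lower(),) for word in resource.split()]


def frequent_terms(rows):
    """Toy library algorithm: propose every token seen at least twice."""
    counts = Counter(row[0] for row in rows)
    return [Proposal("add", "lexical_entry", term)
            for term, n in counts.items() if n >= 2]


def run_cycle(resources, processor, algorithms, review):
    """Management step: process resources, run algorithms, combine, review."""
    rows = [row for resource in resources for row in processor(resource)]
    result_set = []
    for algorithm in algorithms:
        for proposal in algorithm(rows):
            if proposal not in result_set:  # combine results without duplicates
                result_set.append(proposal)
    # The ontology engineer (here: a callback) follows or rejects each proposal.
    return [proposal for proposal in result_set if review(proposal)]


accepted = run_cycle(
    ["The hotel offers a sauna.", "Each hotel has a restaurant."],
    processor=tokenize,
    algorithms=[frequent_terms],
    review=lambda proposal: len(proposal.value) > 3,
)
```

In this toy run both "hotel" and "a" occur twice and are proposed, and the review callback keeps
only "hotel". The human decision is modeled as a plain function so that the semi-automatic
character of the cycle stays visible.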

4.1 Management component
   The ontology engineer uses the management component to select input data, i.e. relevant
resources such as HTML & XML documents, document type definitions, databases, or existing
ontologies that are exploited in the further discovery process. Secondly, using the management
component, the ontology engineer also chooses among a set of resource processing methods
available at the resource processing component and among a set of algorithms available in the
algorithm library.
   Furthermore, the management component even supports the ontology engineer in discovering
task-relevant legacy data, e.g. an ontology-based crawler gathers HTML documents that are
relevant to a given core ontology and an RDF crawler follows URIs (i.e., unique identifiers in
XML/RDF) that are also URLs in order to cover parts of the so far tiny, but growing Semantic Web.

4.2 Resource processing component
   Resource processing strategies differ depending on the type of input data made available:

   - HTML documents may be indexed and reduced to free text.
   - Semi-structured documents, like dictionaries, may be transformed into a predefined
     relational structure.
   - Semi-structured and structured schema data (like DTDs, structured database schemata, and
     existing ontologies) are handled following different strategies for import as described
     later in this paper.
   - For processing free natural text our system accesses the natural language processing system
     SMES (Saarbrücken Message Extraction System), a shallow text processor for German (cf.
     [24]). SMES comprises a tokenizer based on regular expressions, a lexical analysis
     component including various word lexicons, a morphological analysis module, a named entity
     recognizer, a part-of-speech tagger and a chunk parser.

   After first preprocessing according to one of these or similar strategies, the resource
processing module transforms the data into an algorithm-specific relational representation.

4.3 Algorithm Library
   As described above an ontology may be described by a number of sets of concepts, relations,
lexical entries, and links between these entities. An existing ontology definition (including
L, C, HC, R, HR, A, F, G) may be acquired using various algorithms working on this definition
and the preprocessed input data. While specific algorithms may greatly vary from one type of
input to the next, there is also considerable overlap concerning underlying learning approaches
like association rules, formal concept analysis, or clustering. Hence, we may reuse algorithms
from the library for acquiring different parts of the ontology definition.
   Subsequently, we introduce some of these algorithms available in our implementation. In
general, we use a multi-strategy learning and result combination approach, i.e. each algorithm
that is plugged into the library generates normalized results that adhere to the ontology
structures sketched above and that may be combined into a coherent ontology definition.

5.   IMPORT & REUSE
   Given our experiences in medicine, telecommunication, and insurance, we expect that for
almost any commercially significant domain there are some kind of domain conceptualizations
available. Thus, we need mechanisms and strategies to import & reuse domain conceptualizations
from existing (schema) structures. Thereby, the conceptualizations may be recovered, e.g., from
legacy database schemata, document-type definitions (DTDs), or from existing ontologies that
conceptualize some relevant part of the target ontology.
   In the first part of the import & reuse step, the schema structures are identified and their
general content needs to be discussed with domain experts. Each of these knowledge sources must
be imported separately. Import may be performed manually, which may include the manual
definition of transformation rules. Alternatively, reverse engineering tools, such as exist for
recovering extended entity-relationship diagrams from the SQL description of a given database
(cf. reference [32, 14] in survey, Table 1), may facilitate the recovery of conceptual
structures.
   In the second part of the import & reuse step, imported conceptual structures need to be
merged or aligned in order to constitute a single common ground from which to take off into the
subsequent ontology learning phases of extracting, pruning and refining. While the general
research issue concerning merging and aligning is still an open problem, recent proposals (e.g.,
[25]) have shown how to improve the manual process of merging/aligning. Existing methods for
merging/aligning mostly rely on matching heuristics for proposing the merge of concepts and
similar knowledge-base operations. Our current research also integrates mechanisms that use an
application data oriented, bottom-up approach. For instance, formal concept analysis allows to
discover patterns between application data on the one hand and the usage of concepts and
relations and the semantics given by their heterarchies on the other hand in a formally concise
way (cf. reference [8] in survey, Table 1, on formal concept analysis).
   Overall, the import and reuse step in ontology learning seems to be the one that is the
hardest to generalize. The task may remind vaguely of the general problems with data warehousing
adding, however, challenging problems of its own.

6. EXTRACTING ONTOLOGIES
   In the ontology extraction phase of the ontology learning process, major parts, i.e. the
complete ontology or large chunks reflecting a new subdomain of the ontology, are modeled with
learning support exploiting various types of (Web) sources. Thereby, ontology learning
techniques partially rely on given ontology parts. Thus, we here encounter an iterative model
where previous revisions through the ontology learning cycle may propel subsequent ones and more
sophisticated algorithms may work on structures proposed by more straightforward ones before.
   Describing this phase, we sketch some of the techniques and algorithms that have been
embedded in our framework and implemented in our ontology learning environment Text-To-Onto (cf.
Figure 3). Doing so, we cover a very substantial part of the overall ontology learning task in
the extraction phase. Text-To-Onto proposes many different ontology components, which we have
described above (i.e. L, C, R, ...), to the ontology engineer feeding on several types of input.

6.1 Lexical Entry & Concept Extraction
   This technique is one of the baseline methods applied in our framework for acquiring lexical
entries with corresponding concepts. In Text-To-Onto, web documents are morphologically
processed, including the treatment of multi-word terms such as "database reverse engineering" by
N-grams, a simple statistical means. Based on this text preprocessing, term extraction
techniques, which are based on (weighted) statistical frequencies, are applied in order to
propose new lexical entries for L.
   Often, the ontology engineer follows the proposal by the lexical entry & concept extraction
mechanism and includes a new lexical entry in the ontology. Because the new lexical entry comes
without an associated concept, the ontology engineer must then decide (possibly with help from
further processing) whether to introduce a new concept or link the new lexical entry to an
existing concept.

6.2 Hierarchical Concept Clustering
   Given a lexicon and a set of concepts, one major next step is the taxonomic classification of
concepts. One generally applicable method in this regard is hierarchical clustering.
Hierarchical clustering exploits the similarity of items in order to propose a hierarchy of item
categories. The similarity measure is defined on the properties of items.
   Given the task of extracting a hierarchy from natural language text, adjacency of terms or
syntactical relationships between terms are two properties that yield considerable descriptive
power to induce the semantic hierarchy of concepts related to these terms.
   A sophisticated example for hierarchical clustering is given by Faure & Nédellec (cf.
reference [6] in survey, Table 1): They present a cooperative machine learning system, ASIUM,
which acquires taxonomic relations and subcategorization frames of verbs based on syntactic
input. The ASIUM system hierarchically clusters nouns based on the verbs that they are
syntactically related with and vice versa. Thus, they cooperatively extend the lexicon, the set
of concepts, and the concept heterarchy (L, C, HC).

6.3 Dictionary Parsing
   Machine-readable dictionaries (MRD) are frequently available for many domains. Though their
internal structure is free text to a large extent, there are comparatively few patterns that are
used to give text definitions. Hence, MRDs exhibit a large degree of regularity that may be
exploited for extracting a domain conceptualization and proposing it to the ontology engineer.
   Text-To-Onto has been used to generate a taxonomy of concepts from a machine-readable
dictionary of an insurance
                     Figure 3: Screenshot of our Ontology Learning Workbench Text-To-Onto

company (cf. reference [15] in survey, Table 1). As in term      taxonomy, e.g. "snacks are purchased together with drinks"
extraction from free text, morphological processing is           rather than "chips are purchased with beer" and "peanuts
applied, this time complementing several pattern-matching        are purchased with soda".
heuristics. For example, the dictionary contained                   In Text-To-Onto (cf. reference [17] in survey, Table 1) we
the following entry:                                             use a modification of the generalized association rule learn-
                                                                 ing algorithm for discovering properties between classes. A
  Automatic Debit Transfer: Electronic service arising           given class hierarchy HC serves as background knowledge.
  from a debit authorization of the Yellow Account holder        Pairs of syntactically related classes (e.g. pair(festival,island)
  for a recipient to debit bills that fall due direct from the   describing the head-modifier relationship contained in the
  account.                                                       sentence "The festival on Usedom^6 attracts tourists from all
  Several heuristics were applied to the morphologically an-     over the world.") are given as input to the algorithm. The
alyzed definitions. For instance, one simple heuristic relates   algorithm generates association rules comparing the rele-
the definition term, here "automatic debit transfer", with       vance of different rules while climbing up and/or down the
the first noun phrase occurring in the definition, here "elec-   taxonomy. The apparently most relevant binary rules are
tronic service". Their corresponding concepts are linked in      proposed to the ontology engineer for modeling relations into
the heterarchy HC:                                               the ontology, thus extending R.
 HC(automatic debit transfer, electronic service).                  As the number of generated rules is typically high, we
  Applying this heuristic iteratively, one may propose large     offer various modes of interaction. For example, it is possi-
parts of the target ontology, more precisely L, C and HC, to     ble to restrict the number of suggested relations by defining
the ontology engineer. In fact, because verbs tend to be         so-called restriction classes that have to participate in the
modeled as relations, R (and the linkage between R and L)        relations that are extracted. Another way of focusing is the
may be extended in this way, too.                                flexible enabling / disabling of the use of taxonomic knowl-
                                                                 edge for extracting relations.
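
The dictionary heuristic of Section 6.3 (link the defined term to the first noun phrase of its definition) can be illustrated with a minimal Python sketch. It is not the Text-To-Onto implementation: the original applies morphological analysis, whereas this sketch approximates the first noun phrase by reading words up to a small, hypothetical list of boundary words. The names BOUNDARY, DETERMINERS, first_noun_phrase and propose_taxonomy are assumptions made for illustration.

```python
import re

# Hypothetical boundary words that end the leading noun phrase. A real system
# would rely on morphological analysis and phrase chunking instead.
BOUNDARY = {"arising", "of", "for", "from", "that", "which", "to",
            "in", "on", "by", "with", "is", "are"}
DETERMINERS = {"a", "an", "the"}


def first_noun_phrase(definition):
    """Approximate the first noun phrase of a dictionary definition."""
    tokens = re.findall(r"[A-Za-z-]+", definition.lower())
    # Skip leading determiners such as "a" or "the".
    while tokens and tokens[0] in DETERMINERS:
        tokens = tokens[1:]
    phrase = []
    for token in tokens:
        if token in BOUNDARY:
            break
        phrase.append(token)
    return " ".join(phrase)


def propose_taxonomy(entries):
    """Return proposed HC pairs (sub-concept, super-concept), one per entry."""
    hc = set()
    for term, definition in entries.items():
        parent = first_noun_phrase(definition)
        if parent:
            hc.add((term.lower(), parent))
    return hc


entries = {
    "Automatic Debit Transfer": (
        "Electronic service arising from a debit authorization of the "
        "Yellow Account holder for a recipient to debit bills that fall "
        "due direct from the account."
    ),
}
proposals = propose_taxonomy(entries)
```

For the entry quoted above, the sketch proposes the pair (automatic debit transfer, electronic service). Such proposals would still go to the ontology engineer for confirmation rather than being added automatically.
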
6.4 Association Rules                                               Results are presented in various views, as depicted in
   Association rule learning algorithms are typically used for   Figure 4. A generalized relation that
prototypical applications of data mining, like finding associ-   may be induced by the partially given example data above
ations that occur between items, e.g. supermarket products,      may be the property(event,area), which may be named by
in a set of transactions, e.g. customers' purchases. The gen-
eralized association rule learning algorithm extends its base-   6
                                                                   Usedom is an island located in the north-east of Germany in
line by aiming at descriptions at the appropriate level of the   the Baltic Sea.
the ontology engineer as locatedIn, viz. events are located      specific taxonomies. An ontology is incrementally updated
in an area (thus extending L and F). The user may add the        as new concepts are acquired from text. The acquisition pro-
extracted relations to the ontology by drag-and-drop. To ex-     cess is centered around the linguistic and conceptual "qual-
plore and determine the right aggregation level of adding a      ity" of various forms of evidence underlying the generation
relation to the ontology, the user may browse the hierarchy      and refinement of concept hypotheses. In particular they
view on extracted properties as given in the left part of Fig-   consider semantic conflicts and analogous semantic struc-
ure 4. This view may also support the ontology engineer          tures from the knowledge base into the ontology in order to
in defining appropriate subPropertyOf relations between          determine the quality of a particular proposal. Thus, they
properties, such as subPropertyOf(hasDoubleRoom,hasRoom)         extend an existing ontology with new lexical entries for L,
(thereby extending HR).                                          new concepts for C and new relations for HC.
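
The idea behind generalized association rules in Section 6.4 can be sketched in Python using the supermarket example from the text. This is a minimal illustration and not the modified algorithm used in Text-To-Onto: it simply extends every transaction with the taxonomy ancestors of its items and then computes support and confidence for item pairs. The taxonomy PARENT, the thresholds and all function names are assumptions made for illustration.

```python
from itertools import permutations

# Hypothetical taxonomy (child -> parent), mirroring the example in the text.
PARENT = {"chips": "snacks", "peanuts": "snacks",
          "beer": "drinks", "soda": "drinks"}


def ancestors(item):
    """Return all ancestors of an item, nearest first."""
    result = []
    while item in PARENT:
        item = PARENT[item]
        result.append(item)
    return result


def extend(transaction):
    """Add the ancestors of every item to the transaction."""
    extended = set(transaction)
    for item in transaction:
        extended.update(ancestors(item))
    return extended


def rules(transactions, min_support=0.5, min_confidence=0.5):
    """Map each rule (x, y), read as x => y, to (support, confidence)."""
    extended = [extend(t) for t in transactions]
    n = len(extended)
    items = set().union(*extended)
    found = {}
    for x, y in permutations(sorted(items), 2):
        # A rule between an item and its own ancestor is trivially true.
        if y in ancestors(x) or x in ancestors(y):
            continue
        both = sum(1 for t in extended if x in t and y in t)
        has_x = sum(1 for t in extended if x in t)
        support = both / n
        confidence = both / has_x if has_x else 0.0
        if support >= min_support and confidence >= min_confidence:
            found[(x, y)] = (support, confidence)
    return found


transactions = [{"chips", "beer"}, {"peanuts", "soda"}]
result = rules(transactions)
```

On these two transactions the generalized rule snacks => drinks holds in every transaction, while chips => beer covers only half of them, which is why the higher taxonomy level is the more useful description here.
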

7.   PRUNING THE ONTOLOGY                                        9. RELATED WORK
   A common theme of modeling in various disciplines is the         Until recently ontology learning per se, i.e. for comprehen-
balance between completeness and scarcity of the domain          sive construction of ontologies, has not existed. We here give
model. It is a widely held belief that targeting completeness    the reader a comprehensive overview of existing work that
for the domain model on the one hand appears to be prac-         has actually researched and practiced techniques for solving
tically unmanageable and computationally intractable, and        parts of the overall problem of ontology learning.
targeting the scarcest model on the other hand is overly lim-       There are only a few approaches that described the de-
iting with regard to expressiveness. Hence, what we strive       velopment of frameworks and workbenches for extracting
for is a balance between these two that works in practice.       ontologies from data: Faure & Nedellec [6] present a co-
We aim at a model that captures a rich conceptualization         operative machine learning system, ASIUM, which acquires
of the target domain, but that excludes parts that are out       taxonomic relations and subcategorization frames of verbs
of its focus. The import & reuse of ontologies as well as the    based on syntactic input. The ASIUM system hierarchically
extraction of ontologies considerably pull the lever of the      clusters nouns based on the verbs that they are syntactically
scale into the imbalance where out-of-focus concepts reign.      related with and vice versa. Thus, they cooperatively extend
Therefore, we pursue the appropriate diminishing of the on-      the lexicon, the set of concepts, and the concept heterarchy
tology in the pruning phase.                                     (L; C ; HC ).
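
Section 7 bases its pruning proposals on absolute or relative counts of term frequency in application-specific documents. A minimal Python sketch of that idea follows. The paper does not spell out the exact measure, so the sketch assumes that "relative" means a concept's share of all concept-term occurrences. The lexicon, the thresholds and the function names are assumptions made for illustration, and only single-word terms are matched.

```python
import re
from collections import Counter


def term_counts(documents, lexicon):
    """Count how often the lexical entries of each concept occur."""
    counts = Counter({concept: 0 for concept in lexicon})
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())
        for concept, terms in lexicon.items():
            counts[concept] += sum(tokens.count(term) for term in terms)
    return counts


def propose_pruning(documents, lexicon, min_count=1, min_share=0.0):
    """Propose concepts whose absolute count or relative share is too low."""
    counts = term_counts(documents, lexicon)
    total = sum(counts.values()) or 1  # avoid division by zero
    return sorted(
        concept for concept, n in counts.items()
        if n < min_count or n / total < min_share
    )


# Hypothetical lexicon: concept -> single-word lexical entries.
lexicon = {
    "Hotel": ["hotel", "hotels"],
    "Festival": ["festival"],
    "Reactor": ["reactor"],
}
documents = [
    "The hotel hosts a festival.",
    "Hotels near the festival are full.",
]
candidates = propose_pruning(documents, lexicon)
```

Here the concept Reactor never occurs in the documents and is the only pruning candidate. As with extraction, the result is a proposal for the ontology engineer, not an automatic deletion.
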
   There are at least two dimensions to the problem of              Hahn and Schnattinger [11] introduced a methodology for
pruning. First, one needs to clarify how the pruning             the maintenance of domain-specific taxonomies. An ontol-
of particular parts of the ontology (e.g., the removal of a      ogy is incrementally updated as new concepts are acquired
concept or a relation) affects the rest. For instance, Peter-    from real-world texts. The acquisition process is centered
son et al. [26] have described strategies that leave the user    around linguistic and conceptual "quality" of various forms
with a coherent ontology (i.e. no dangling or broken links).     of evidence underlying the generation and refinement of con-
Second, one may consider strategies for proposing ontology       cept hypotheses. Their ontology learning approach is em-
items that should be either kept or pruned. We have inves-       bedded in a framework for natural language understanding,
tigated several mechanisms for generating proposals from         named SynDiKATe [10].
application data. Given a set of application-specific docu-         Mikheev & Finch [18] have presented their KAWB Work-
ments, there are several strategies for pruning the ontology.    bench for "Acquisition of Domain Knowledge from Natural
They are based on absolute or relative counts of frequency       Language". The workbench comprises a set of compu-
of terms (cf. reference [15] in survey, Table 1).                tational tools for uncovering internal structure in natural
                                                                 language texts. The main idea behind the workbench is
                                                                 the independence of the text representation and text anal-
8.   REFINING THE ONTOLOGY                                       ysis phases. At the representation phase the text is con-
   Refining plays a similar role to extracting. Their difference verted from a sequence of characters to features of interest
lies on a sliding scale rather than in a clear-cut distinc-      by means of the annotation tools. At the analysis phase
tion. While extracting serves mostly for cooperative mod-        those features are used by statistics gathering and infer-
eling of the overall ontology (or at least of very significant   ence tools for finding significant correlations in the texts.
chunks of it), the refinement phase is about fine-tuning the     The analysis tools are independent of particular assumptions
target ontology and the support of its evolving nature. The      about the nature of the feature-set and work on the abstract
refinement phase may use data that comes from the con-           level of feature elements represented as SGML items.
crete Semantic Web application, e.g. log files of user queries      Much work in a number of disciplines (computational
or generic user data. Adapting and refining the ontology         linguistics, information retrieval, machine learning, databases,
with respect to user requirements plays a major role in the      software engineering) has actually researched and prac-
acceptance of the application and its further development.       ticed techniques for solving part of the overall problem.
   In principle, the same algorithms may be used for extrac-     Hence, techniques and methods relevant for ontology learn-
tion as for refinement. However, during refinement one must      ing may be found under terms like the acquisition of selec-
consider in detail the existing ontology and the existing con-   tional restrictions (cf. Resnik [27] and Basili et al. [2]), word
nections into the ontology, while extraction works more often    sense disambiguation and learning of word senses (cf. Hast-
than not practically from scratch.                               ings [34]), the computation of concept lattices from formal
   A prototypical approach for refinement (though not for        contexts (cf. Ganter & Wille [8]) and Reverse Engineering
extraction!) has been presented by Hahn & Schnattinger           in software engineering (cf. Mueller et al. [23]).
(cf. reference [11] in survey, Table 1). They have introduced       Ontology Learning puts a number of research activities,
a methodology for automating the maintenance of domain-          which focus on different types of inputs, but share their tar-
                                     Figure 4: Result Presentation in Text-To-Onto

get of a common domain conceptualization, into one per-           Acknowledgements. We thank our students, Dirk Wenke
spective. One may recognize that these activities are spread      and Raphael Volz, for work on OntoEdit and Text-To-Onto.
across very many communities, incurring references from           Research for this paper was partially financed by Ontoprise
20 completely different events / journals.                        GmbH, Karlsruhe, Germany, by US Air Force in the DARPA
                                                                  DAML project "OntoAgents", by European Union in the
                                                                  IST-1999-10132 project "On-To-Knowledge", and by Ger-
                                                                  man BMBF in the project "GETESS" (01IN901C0).
10. CHALLENGES
   Ontology Learning may add significant leverage to the          11. REFERENCES
Semantic Web, because it propels the construction of do-           [1] H. Assadi. Construction of a regional ontology from
main ontologies, which are needed quickly and cheaply for              text and its use within a documentary system. In
the Semantic Web to succeed. We have presented a compre-               Proceedings of the International Conference on Formal
hensive framework for Ontology Learning that crosses the               Ontology and Information Systems - FOIS'98, Trento,
boundaries of single disciplines, touching on a number of              Italy, 1998.
challenges. Table 1 gives a survey of what types of tech-
niques should be included in a full-fledged ontology learning      [2] R. Basili, M. T. Pazienza, and P. Velardi. Acquisition
and engineering environment. The good news, however, is                of selectional patterns in a sublanguage. Machine
that one does not need perfect or optimal support for co-              Translation, 8(1):175-201, 1993.
operative modeling of ontologies. At least according to our        [3] Paul Buitelaar. CoreLex: Systematic Polysemy and
experience "cheap" methods in an integrated environment                Underspecification. PhD thesis, Brandeis University,
may yield tremendous help for the ontology engineer.                   Department of Computer Science, 1998.
   While a number of problems remain with the single disci-        [4] A. Doan, P. Domingos, and A. Levy. Learning Source
plines, some more challenges come up regarding the partic-             Descriptions for Data Integration. In Proceedings of
ular problem of Ontology Learning for the Semantic Web.                the International Workshop on The Web and
First, with the XML-based namespace mechanisms the no-                 Databases (WebDB-2000), 2000.
tion of an ontology with well-defined boundaries, e.g. only        [5] F. Esposito, S. Ferilli, N. Fanizzi, and G. Semeraro.
definitions that are in one file, will disappear. Rather, the          Learning from parsed sentences with inthelex. In
Semantic Web may yield an "amoeba-like" structure regard-              Proceedings of Learning Language in Logic Workshop
ing ontology boundaries, because ontologies refer to each              (LLL-2000), Lisbon, Portugal, 2000.
other and import each other (cf. e.g. the DAML-ONT prim-           [6] D. Faure and C. Nedellec. A corpus-based conceptual
itive import). However, it is not yet clear what the semantics         clustering method for verb frames and ontology
of these structures will look like. In light of these facts the        acquisition. In LREC workshop on adapting lexical
importance of methods like ontology pruning and crawling of            and corpus resources to sublanguages and applications,
ontologies will increase drastically. Second, we have so               Granada, Spain, 1998.
far restricted our attention in ontology learning to the con-      [7] B. Gaines and M. Shaw. Integrated knowledge
ceptual structures that are (almost) contained in RDF(S)               acquisition architectures. Journal of Intelligent
proper. Additional semantic layers on top of RDF (e.g. fu-             Information Systems, 1(1), 1992.
ture OIL or DAML-ONT with axioms, A) will require new              [8] B. Ganter and R. Wille. Formal Concept Analysis:
means for improved ontology engineering with axioms, too!              Mathematical Foundations. Springer, Berlin -
                               Table 1: Classification of Ontology Learning Approaches
    Domain                      Method                           Features used                   Prime purpose         Papers
    Free Text                   Clustering                       Syntax                          Extract               Buitelaar [3], Assadi [1] and Faure & Nedellec [6]
                                Inductive Logic Programming      Syntax, Logic representation    Extract               Esposito et al. [5]
                                Association rules                Syntax, Tokens                  Extract               Maedche & Staab [17]
                                Frequency-based                  Syntax                          Prune                 Kietz et al. [15]
                                Pattern-Matching                                                 Extract               Morin [22]
                                Classification                   Syntax, Semantics               Refine                Schnattinger & Hahn [11]
    Dictionary                  Information extraction           Syntax                          Extract               Hearst [12], Wilks [35] and Kietz et al. [15]
                                Page rank                        Tokens                                                Jannink & Wiederhold [13]
    Knowledge base              Concept Induction, A-Box mining  Relations                       Extract               Kietz & Morik [16] and Schlobach [28]
    Semi-structured schemata    Naive Bayes                      Relations                       Reverse engineering   Doan et al. [4]
    Relational schemata         Data Correlation                 Relations                       Reverse engineering   Johannesson [14] and Tari et al. [32]


     Heidelberg - New York, 1999.                                      pages 372-379, 1997.
 [9] E. Grosso, H. Eriksson, R. Fergerson, S. Tu, and             [19] G. Miller. Wordnet: A lexical database for English.
     M. Musen. Knowledge modeling at the millennium -                  CACM, 38(11):39-41, 1995.
     the design and evolution of protege-2000. In                 [20] K. Morik. Balanced cooperative modeling. Machine
     Proceedings of KAW-99, Banff, Canada, 1999.                       Learning, 11:217-235, 1993.
[10] U. Hahn and M. Romacker. Content management in               [21] K. Morik, S. Wrobel, J.-U. Kietz, and W. Emde.
     the SynDiKATe system - how technical documents are                Knowledge acquisition and machine learning: Theory,
     automatically transformed to text knowledge bases.                methods, and applications. Academic Press, London,
     Data & Knowledge Engineering, 35:137-159, 2000.                   1993.
[11] U. Hahn and K. Schnattinger. Towards text                    [22] E. Morin. Automatic acquisition of semantic relations
     knowledge engineering. In Proc. of AAAI '98, pages                between terms from technical corpora. In Proc. of the
     129-144, 1998.                                                    Fifth International Congress on Terminology and
[12] M.A. Hearst. Automatic acquisition of hyponyms from               Knowledge Engineering - TKE'99, 1999.
     large text corpora. In Proceedings of the 14th               [23] H. A. Mueller, J. H. Jahnke, D. B. Smith, M.-A.
     International Conference on Computational                         Storey, S. R. Tilley, and K. Wong. Reverse
     Linguistics. Nantes, France, 1992.                                Engineering: A Roadmap. In Proceedings of the 22nd
[13] J. Jannink and G. Wiederhold. Thesaurus entry                     International Conference on Software Engineering
     extraction from an on-line dictionary. In Proceedings             (ICSE-2000), Limerick, Ireland. Springer, 2000.
     of Fusion '99, Sunnyvale CA, July 1999.                      [24] G. Neumann, R. Backofen, J. Baur, M. Becker, and
     http://www-db.stanford.edu/SKC/publications.html.                 C. Braun. An information extraction core system for
[14] P. Johannesson. A method for transforming relational              real world German text processing. In ANLP'97 -
     schemas into conceptual schemas. In M. Rusinkiewicz,              Proceedings of the Conference on Applied Natural
     editor, 10th International Conference on Data                     Language Processing, pages 208-215, Washington,
     Engineering, pages 115-122, Houston, 1994. IEEE                   USA, 1997.
     Press.                                                       [25] N. Fridman Noy and M. A. Musen. PROMPT:
[15] J.-U. Kietz, A. Maedche, and R. Volz. Semi-automatic              Algorithm and Tool for Automated Ontology Merging
     ontology acquisition from a corporate intranet. In                and Alignment. In Proceedings of the 17th National
     International Conference on Grammar Inference                     Conf. on Artificial Intelligence (AAAI'2000), Austin,
     (ICGI-2000), to appear: Lecture Notes in Artificial               Texas. MIT Press/AAAI Press, 2000.
     Intelligence, LNAI, 2000.                                    [26] B. Peterson, W.A. Andersen, and J. Engel. Knowledge
[16] J.-U. Kietz and K. Morik. A polynomial approach to                bus: Generating application-focused databases from
     the constructive induction of structural knowledge.               large ontologies. In Proc. of KRDB 1998, Seattle,
     Machine Learning, 14(2):193-218, 1994.                            Washington, USA, pages 2.1-2.10, 1998.
[17] A. Maedche and S. Staab. Discovering conceptual              [27] P. Resnik. Selection and Information: A Class-based
     relations from text. In Proceedings of ECAI-2000. IOS             Approach to Lexical Relationships. PhD thesis,
     Press, Amsterdam, 2000.                                           University of Pennsylvania, 1993.
[18] A. Mikheev and S. Finch. A workbench for finding             [28] S. Schlobach. Assertional mining in description logics.
     structure in text. In Proceedings of the 5th                      In Proceedings of the 2000 International Workshop on
     Conference on Applied Natural Language Processing                 Description Logics (DL2000), 2000.
     - ANLP'97, March 1997, Washington DC, USA,                        http://SunSITE.Informatik.RWTH-
     Aachen.DE/Publications/CEUR-WS/Vol-33/.
[29] S. Staab, J. Angele, S. Decker, M. Erdmann,
     A. Hotho, A. Maedche, H.-P. Schnurr, R. Studer, and
     Y. Sure. Semantic community web portals. Proc. of
     WWW9 / Computer Networks, 33(1-6):473-491, 2000.
[30] S. Staab and A. Maedche. Knowledge portals -
     ontologies at work. AI Magazine, 21(2), Summer 2001.
[31] S. Staab, H.-P. Schnurr, R. Studer, and Y. Sure.
     Knowledge processes and ontologies. IEEE Intelligent
     Systems, 16(1), 2001.
[32] Z. Tari, O. Bukhres, J. Stokes, and S. Hammoudi. The
     Reengineering of Relational Databases based on Key
     and Data Correlations. In Proceedings of the 7th
     Conference on Database Semantics (DS-7), 7-10
     October 1997, Leysin, Switzerland. Chapman & Hall,
     1998.
[33] G. Webb, J. Wells, and Z. Zheng. An experimental
     evaluation of integrating machine learning with
     knowledge acquisition. Machine Learning, 35(1):5-23,
     1999.
[34] P. Wiemer-Hastings, A. Graesser, and
     K. Wiemer-Hastings. Inferring the meaning of verbs
     from context. In Proceedings of the Twentieth Annual
     Conference of the Cognitive Science Society, 1998.
[35] Y. Wilks, B. Slator, and L. Guthrie. Electric Words:
     Dictionaries, Computers, and Meanings. MIT Press,
     Cambridge, MA, 1996.

</pre>