=Paper=
{{Paper
|id=Vol-40/paper-7
|storemode=property
|title=Learning Ontologies for the Semantic Web
|pdfUrl=https://ceur-ws.org/Vol-40/maedche+staab.pdf
|volume=Vol-40
}}
==Learning Ontologies for the Semantic Web==
Alexander Maedche, Institute AIFB, University of Karlsruhe, 76128 Karlsruhe, Germany, and Ontoprise GmbH, Haid-und-Neu Strasse 7, 76131 Karlsruhe, Germany, ama@aifb.uni-karlsruhe.de
Steffen Staab, Institute AIFB, University of Karlsruhe, 76128 Karlsruhe, Germany, and Ontoprise GmbH, Haid-und-Neu Strasse 7, 76131 Karlsruhe, Germany, sst@aifb.uni-karlsruhe.de
ABSTRACT
The Semantic Web relies heavily on the formal ontologies that structure underlying data for the purpose of comprehensive and transportable machine understanding. Therefore, the success of the Semantic Web depends strongly on the proliferation of ontologies, which requires fast and easy engineering of ontologies and avoidance of a knowledge acquisition bottleneck.

Ontology Learning greatly facilitates the construction of ontologies by the ontology engineer. The vision of ontology learning that we propose here includes a number of complementary disciplines that feed on different types of unstructured, semi-structured and fully structured data in order to support a semi-automatic, cooperative ontology engineering process. Our ontology learning framework proceeds through ontology import, extraction, pruning, refinement, and evaluation, giving the ontology engineer a wealth of coordinated tools for ontology modeling. Besides the general framework and architecture, we show in this paper some exemplary techniques in the ontology learning cycle that we have implemented in our ontology learning environment, Text-To-Onto, such as ontology learning from free text, from dictionaries, or from legacy ontologies, and refer to some others that need to complement the complete architecture, such as reverse engineering of ontologies from database schemata or learning from XML documents.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission by the authors.
Semantic Web Workshop 2001 Hongkong, China
Copyright by the authors.

1. ONTOLOGIES FOR THE SEMANTIC WEB
Conceptual structures that define an underlying ontology are germane to the idea of machine processable data on the Semantic Web. Ontologies are (meta)data schemas, providing a controlled vocabulary of concepts, each with an explicitly defined and machine processable semantics. By defining shared and common domain theories, ontologies help both people and machines to communicate concisely, supporting the exchange of semantics and not only syntax. Hence, the cheap and fast construction of domain-specific ontologies is crucial for the success and the proliferation of the Semantic Web.

Though ontology engineering tools have become mature over the last decade (cf. [9]), the manual acquisition of ontologies still remains a tedious, cumbersome task that easily results in a knowledge acquisition bottleneck. Having developed our ontology engineering workbench, OntoEdit, we had to face exactly this issue; in particular, we were asked questions like:
* Can you develop an ontology fast? (time)
* Is it difficult to build an ontology? (difficulty)
* How do you know that you've got the ontology right? (confidence)

In fact, these problems of time, difficulty and confidence that we ended up with were similar to what knowledge engineers had dealt with over the last two decades when they elaborated on methodologies for knowledge acquisition or workbenches for defining knowledge bases. A method that proved extremely beneficial for the knowledge acquisition task was the integration of knowledge acquisition with machine learning techniques [33]. The drawback of these approaches, e.g. the work described in [21], however, was their rather strong focus on structured knowledge or data bases, from which they induced their rules.

In contrast, in the Web environment that we encounter when building Web ontologies, the structured knowledge or data base is the exception rather than the norm. Hence, intelligent means for an ontology engineer take on a different meaning than the (very seminal) integration architectures for more conventional knowledge acquisition [7].

Our notion of Ontology Learning aims at the integration of a multitude of disciplines, in particular machine learning, in order to facilitate the construction of ontologies. Because the fully automatic acquisition of knowledge by machines remains in the distant future, we consider the process of ontology learning as semi-automatic with human intervention, adopting the paradigm of balanced cooperative modeling [20] for the construction of ontologies for the Semantic Web. With this objective in mind, we have built an architecture that combines knowledge acquisition with machine learning, feeding on the resources that we nowadays find on the syntactic Web, viz. free text, semi-structured text, schema definitions
(DTDs), etc. Thereby, modules in our framework serve different steps in the engineering cycle, which here consists of the following five steps (cf. Figure 1):

First, existing ontologies are imported and reused by merging existing structures or defining mapping rules between existing structures and the ontology to be established. For instance, [26] describe how ontological structures contained in Cyc are used in order to facilitate the construction of a domain-specific ontology. Second, in the ontology extraction phase major parts of the target ontology are modeled with learning support feeding from web documents. Third, this rough outline of the target ontology needs to be pruned in order to better adjust the ontology to its prime purpose. Fourth, ontology refinement profits from the given domain ontology, but completes the ontology at a fine granularity (also in contrast to extraction). Fifth, the prime target application serves as a measure for validating the resulting ontology [31]. Finally, one may revolve again in this cycle, e.g. for including new domains into the constructed ontology or for maintaining and updating its scope.

Figure 1: Ontology Learning process steps (a cycle of Import / Reuse, Extract, Prune, Refine and Apply around the Domain Ontology, fed by Legacy + Application Data)

2. THE ONTOLOGY LEARNING KILLER APPLICATION
Though ontologies and their underlying data in the Semantic Web are envisioned to be reusable for a wide range of possibly unforeseen applications (footnote 1), a particular target application remains the touchstone for a given ontology. In our case, we have been dealing with ontology-based knowledge portals that structure Web content and that allow for structured provisioning and accessing of data [29, 30]. Knowledge portals are information intermediaries for knowledge accessing and sharing on the Web. The development of a knowledge portal consists of the tasks of structuring the knowledge, establishing means for providing new knowledge and accessing the knowledge contained in the portal.

Footnote 1: Just like Tim Berners-Lee did not foresee online auctions being a common business model of the Web of 2000.

A considerable part of development and maintenance of the portal lies in integrating legacy information as well as in constructing and maintaining the ontology in vast, often unknown, terrain. For instance, a knowledge portal may focus on the electronics sector, integrating comparative shopping in conjunction with manuals, reports and opinions about current electronic products. The creation of the background ontology for this knowledge portal involves tremendous efforts for engineering the conceptual structures that underlie existing warehouse databases, product catalogues, user manuals, test reports and newsgroup discussions. Correspondingly, ontology structures must be constructed from database schemata, a given product thesaurus (like BMEcat), XML documents and document type definitions (DTDs), and free texts. Still worse, significant parts of these (meta-)data change extremely fast and, hence, require a regular update of the corresponding ontology parts.

Thus, very different types of (meta-)data might be useful input for the construction of the ontology. However, in practice one needs comprehensive support (footnote 2) in order to deal with this wealth. Hence, there comes the need for a range of different techniques: structured data and meta data require reverse engineering approaches, while free text may contribute to ontology learning directly or through information extraction approaches (footnote 3). Semi-structured data may finally require and profit from both.

Footnote 2: Comprehensive support with ontology learning need not necessarily imply the top-notch learning algorithms, but may rely more heavily on appropriate tool support and methodology.

Footnote 3: In fact, ontology learning for free text serves a double purpose. On the one hand it yields a readily exploitable ontology for Semantic Web purposes, on the other hand it often returns improved information extraction and natural language understanding means adjusted to the learned ontology, cf. [10].

In the following we elaborate on our ontology learning framework. Thereby we approach different techniques for different types of data, showing parts of our architecture, its current status, and parts that may complement our current Text-To-Onto environment.

A general overview of ontology learning techniques as well as corresponding references may be found in Section 9.

3. AN ARCHITECTURE FOR ONTOLOGY LEARNING
Given the task of constructing and maintaining an ontology for a Semantic Web application, e.g. for an ontology-based knowledge portal that we have been dealing with (cf. [29]), we have produced a wish list of what kind of support we would fancy.

3.1 Ontology Engineering Workbench OntoEdit
As core to our approach we have built a graphical user interface to support the ontology engineering process manually performed by the ontology engineer. Here, we offer sophisticated graphical means for manual modeling and refining the final ontology. Different views are offered to the user targeting the epistemological level rather than a particular representation language. However, the ontological structures built there may be exported to standard Semantic Web representation languages, such as OIL and DAML-ONT, as well as our own F-Logic based extensions of RDF(S). In addition, executable representations for constraint checking and application debugging can be generated and then accessed via SilRi (footnote 4), our F-Logic inference engine, which is directly connected with OntoEdit.

Footnote 4: http://www.ontoprise.com/ (then download area).

The sophisticated ontology engineering tools we knew, e.g. the Protégé modeling environment for knowledge-based systems [9], would offer capabilities roughly comparable to OntoEdit. However, given the task of constructing a knowledge portal, we found that there was a large conceptual gap between the ontology engineering tool and the input (often legacy data), such as Web documents, Web document schemata, databases on the Web, and Web ontologies, which ultimately determined the target ontology. Into this void we have positioned new components of our ontology learning architecture (cf. Figure 2). The new components support the ontology engineer in importing existing ontology primitives, extracting new ones, pruning given ones, or refining with additional ontology primitives. In our case, the ontology primitives comprise:
* a set of strings that describe lexical entries L for concepts and relations;
* a set of concepts (footnote 5), C;
* a taxonomy of concepts with multiple inheritance (heterarchy) HC;
* a set of non-taxonomic relations, R, described by their domain and range restrictions;
* a heterarchy of relations, i.e. a set of taxonomic relations HR;
* relations F and G that relate concepts and relations with their lexical entries, respectively; and, finally,
* a set of axioms A that describe additional constraints on the ontology and allow implicit facts to be made explicit [29].

Footnote 5: Concepts in our framework are roughly akin to synsets in WordNet [19].

This structure corresponds closely to RDFS; the one exception is the explicit consideration of lexical entries. The separation of concept reference and concept denotation, which may be easily expressed in RDF, allows very domain-specific ontologies to be provided without incurring an instantaneous conflict when merging ontologies, a standard request in the Semantic Web. For instance, the lexical entry "school" may refer to a building in ontology A, but to an organization in ontology B, or to both in ontology C. Also, in ontology A the concept referred to in English by "school" and "school building" may be referred to in German by "Schule" and "Schulgebäude".

Ontology learning relies on ontology structures given along these lines and on input data as described above in order to propose new knowledge about reasonably interesting concepts, relations, lexical entries, or about links between these entities, proposing the addition, the deletion, or the merging of some of them. The results of the ontology learning process are presented to the ontology engineer by the graphical result set representation (cf. Figure 4 for an example of how extracted properties may be presented). The ontology engineer may then browse the results and decide to follow, delete, or modify the proposals in accordance with the purpose of her task.

4. COMPONENTS FOR LEARNING ONTOLOGIES
Integrating the considerations from above into a coherent generic architecture for extracting and maintaining ontologies from data on the Web, we have identified several core components. There are (i) a generic management component dealing with delegation of tasks and constituting the infrastructure backbone, (ii) a resource processing component working on input data from the Web including, in particular, a natural language processing system, (iii) an algorithm library working on the output of the resource processing component as well as the ontology structures sketched above and returning the result sets also mentioned above, and (iv) the graphical user interface for ontology engineering, OntoEdit.
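The ontology primitives listed in Section 3.1 (L, C, HC, R, HR, F, G, A) can be pictured as a small data structure. The following is only a minimal sketch under our own assumptions: the class name `Ontology`, its field and method names, and the concept names in `school_example` are hypothetical and are not taken from OntoEdit or Text-To-Onto. It illustrates why keeping lexical entries separate from concepts lets one entry such as "school" denote several concepts.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Ontology:
    """Hypothetical container for the primitives L, C, HC, R, HR, F, G, A."""
    lexical_entries: Set[str] = field(default_factory=set)                   # L
    concepts: Set[str] = field(default_factory=set)                          # C
    concept_heterarchy: Set[Tuple[str, str]] = field(default_factory=set)    # HC: (sub, super)
    relations: Dict[str, Tuple[str, str]] = field(default_factory=dict)      # R: name -> (domain, range)
    relation_heterarchy: Set[Tuple[str, str]] = field(default_factory=set)   # HR: (sub, super)
    concept_lexicon: Set[Tuple[str, str]] = field(default_factory=set)       # F: (entry, concept)
    relation_lexicon: Set[Tuple[str, str]] = field(default_factory=set)      # G: (entry, relation)
    axioms: List[str] = field(default_factory=list)                          # A

    def add_concept_label(self, entry: str, concept: str) -> None:
        """Link a lexical entry to a concept, creating both if needed."""
        self.lexical_entries.add(entry)
        self.concepts.add(concept)
        self.concept_lexicon.add((entry, concept))

    def concepts_for(self, entry: str) -> Set[str]:
        """All concepts that a lexical entry may denote."""
        return {c for (e, c) in self.concept_lexicon if e == entry}

    def superconcepts(self, concept: str) -> Set[str]:
        """Transitive superconcepts; a heterarchy allows several parents."""
        seen: Set[str] = set()
        stack = [concept]
        while stack:
            current = stack.pop()
            for sub, sup in self.concept_heterarchy:
                if sub == current and sup not in seen:
                    seen.add(sup)
                    stack.append(sup)
        return seen


def school_example() -> Ontology:
    """An 'ontology C' in which "school" denotes a building and an organization."""
    onto = Ontology()
    onto.add_concept_label("school", "SchoolBuilding")
    onto.add_concept_label("school", "SchoolOrganization")
    onto.add_concept_label("Schule", "SchoolBuilding")
    # Hypothetical upper concepts, added only to show the heterarchy lookup.
    onto.concepts.update({"Building", "Thing"})
    onto.concept_heterarchy.update({("SchoolBuilding", "Building"), ("Building", "Thing")})
    return onto
```

Storing F as a set of (entry, concept) pairs rather than as a function is a design choice of this sketch; it keeps the many-to-many link between words and concepts explicit, which is what the "school" example requires.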
Figure 2: Architecture for Learning Ontologies for the Semantic Web (elements shown: Web documents, DTDs, legacy databases, existing ontologies O1 and O2, and WordNet as inputs; crawl and import steps for corpus, semi-structured schema, schema and existing ontologies; Management Component; Resource Processing Component with NLP System and Lexicon 1 ... Lexicon n; Algorithm Library; Result Set; Domain Ontology; OntoEdit; Ontology Engineer; Inference Engine(s))
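The interplay of the components named in Section 4 (management component, resource processing, algorithm library, result set) might be sketched roughly as follows. This is an illustrative assumption only, not the Text-To-Onto implementation: `ManagementComponent`, `Proposal`, `free_text` and `frequent_terms` are hypothetical names, and the toy frequency count merely stands in for a real learning algorithm.

```python
from collections import Counter
from typing import Callable, Dict, List, NamedTuple


class Proposal(NamedTuple):
    kind: str       # e.g. "lexical_entry", "taxonomic", "relation"
    payload: tuple  # the proposed item
    source: str     # name of the algorithm that produced it


ResourceProcessor = Callable[[str], List[str]]
Algorithm = Callable[[List[str]], List[Proposal]]


class ManagementComponent:
    """Delegates tasks to registered resource processors and algorithms."""

    def __init__(self) -> None:
        self.processors: Dict[str, ResourceProcessor] = {}
        self.algorithms: Dict[str, Algorithm] = {}

    def register_processor(self, name: str, fn: ResourceProcessor) -> None:
        self.processors[name] = fn

    def register_algorithm(self, name: str, fn: Algorithm) -> None:
        self.algorithms[name] = fn

    def run(self, resources: List[str], processor: str,
            algorithms: List[str]) -> List[Proposal]:
        # Resource processing: turn every selected resource into tokens.
        preprocessed: List[str] = []
        for resource in resources:
            preprocessed.extend(self.processors[processor](resource))
        # Algorithm library: each algorithm adds proposals to one result set.
        result_set: List[Proposal] = []
        for name in algorithms:
            result_set.extend(self.algorithms[name](preprocessed))
        return result_set


def free_text(resource: str) -> List[str]:
    """Toy resource processor: lower-case whitespace tokenization."""
    return resource.lower().split()


def frequent_terms(tokens: List[str]) -> List[Proposal]:
    """Toy algorithm: propose tokens seen at least twice as lexical entries."""
    counts = Counter(tokens)
    return [Proposal("lexical_entry", (term,), "frequent_terms")
            for term, n in sorted(counts.items()) if n >= 2]


manager = ManagementComponent()
manager.register_processor("free_text", free_text)
manager.register_algorithm("frequent_terms", frequent_terms)
proposals = manager.run(["Hotel room", "hotel beach"], "free_text", ["frequent_terms"])
```

In this sketch every algorithm returns the same `Proposal` shape, so results from different algorithms can be collected in one list for the ontology engineer to accept, modify or reject.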
4.1 Management component
The ontology engineer uses the management component to select input data, i.e. relevant resources such as HTML & XML documents, document type definitions, databases, or existing ontologies that are exploited in the further discovery process. Secondly, using the management component, the ontology engineer also chooses among a set of resource processing methods available at the resource processing component and among a set of algorithms available in the algorithm library.

Furthermore, the management component even supports the ontology engineer in discovering task-relevant legacy data, e.g. an ontology-based crawler gathers HTML documents that are relevant to a given core ontology and an RDF crawler follows URIs (i.e., unique identifiers in XML/RDF) that are also URLs in order to cover parts of the so far tiny, but growing Semantic Web.

4.2 Resource processing component
Resource processing strategies differ depending on the type of input data made available:
* HTML documents may be indexed and reduced to free text.
* Semi-structured documents, like dictionaries, may be transformed into a predefined relational structure.
* Semi-structured and structured schema data (like DTDs, structured database schemata, and existing ontologies) are handled following different strategies for import as described later in this paper.
* For processing free natural text our system accesses the natural language processing system SMES (Saarbrücken Message Extraction System), a shallow text processor for German (cf. [24]). SMES comprises a tokenizer based on regular expressions, a lexical analysis component including various word lexicons, a morphological analysis module, a named entity recognizer, a part-of-speech tagger and a chunk parser.

After first preprocessing according to one of these or similar strategies, the resource processing module transforms the data into an algorithm-specific relational representation.

4.3 Algorithm Library
As described above an ontology may be described by a number of sets of concepts, relations, lexical entries, and links between these entities. An existing ontology definition (including L; C; HC; R; HR; A; F; G) may be acquired using various algorithms working on this definition and the preprocessed input data. While specific algorithms may greatly vary from one type of input to the next, there is also considerable overlap concerning underlying learning approaches
like association rules, formal concept analysis, or clustering. Hence, we may reuse algorithms from the library for acquiring different parts of the ontology definition.

Subsequently, we introduce some of these algorithms available in our implementation. In general, we use a multi-strategy learning and result combination approach, i.e. each algorithm that is plugged into the library generates normalized results that adhere to the ontology structures sketched above and that may be combined into a coherent ontology definition.

5. IMPORT & REUSE
Given our experiences in medicine, telecommunication, and insurance, we expect that for almost any commercially significant domain some kind of domain conceptualization is available. Thus, we need mechanisms and strategies to import & reuse domain conceptualizations from existing (schema) structures. Thereby, the conceptualizations may be recovered, e.g., from legacy database schemata, document-type definitions (DTDs), or from existing ontologies that conceptualize some relevant part of the target ontology.

In the first part of the import & reuse step, the schema structures are identified and their general content needs to be discussed with domain experts. Each of these knowledge sources must be imported separately. Import may be performed manually, which may include the manual definition of transformation rules. Alternatively, reverse engineering tools, such as exist for recovering extended entity-relationship diagrams from the SQL description of a given database (cf. reference [32, 14] in survey, Table 1), may facilitate the recovery of conceptual structures.

In the second part of the import & reuse step, imported conceptual structures need to be merged or aligned in order to constitute a single common ground from which to take off into the subsequent ontology learning phases of extracting, pruning and refining. While the general research issue concerning merging and aligning is still an open problem, recent proposals (e.g., [25]) have shown how to improve the manual process of merging/aligning. Existing methods for merging/aligning mostly rely on matching heuristics for proposing the merge of concepts and similar knowledge-base operations. Our current research also integrates mechanisms that use an application-data-oriented, bottom-up approach. For instance, formal concept analysis allows one to discover patterns between application data on the one hand and the usage of concepts and relations and the semantics given by their heterarchies on the other hand in a formally concise way (cf. reference [8] in survey, Table 1, on formal concept analysis).

Overall, the import and reuse step in ontology learning seems to be the one that is the hardest to generalize. The task is vaguely reminiscent of the general problems with data warehousing, adding, however, challenging problems of its own.

6. EXTRACTING ONTOLOGIES
In the ontology extraction phase of the ontology learning process, major parts, i.e. the complete ontology or large chunks reflecting a new subdomain of the ontology, are modeled with learning support exploiting various types of (Web) sources. Thereby, ontology learning techniques partially rely on given ontology parts. Thus, we here encounter an iterative model where previous revisions through the ontology learning cycle may propel subsequent ones and more sophisticated algorithms may work on structures proposed earlier by more straightforward ones.

Describing this phase, we sketch some of the techniques and algorithms that have been embedded in our framework and implemented in our ontology learning environment Text-To-Onto (cf. Figure 3). Doing so, we cover a very substantial part of the overall ontology learning task in the extraction phase. Text-To-Onto proposes many different ontology components, which we have described above (i.e. L; C; R; ...), to the ontology engineer, feeding on several types of input.

Figure 3: Screenshot of our Ontology Learning Workbench Text-To-Onto

6.1 Lexical Entry & Concept Extraction
This technique is one of the baseline methods applied in our framework for acquiring lexical entries with corresponding concepts. In Text-To-Onto, web documents are morphologically processed, including the treatment of multi-word terms such as "database reverse engineering" by N-grams, a simple statistical means. Based on this text preprocessing, term extraction techniques, which are based on (weighted) statistical frequencies, are applied in order to propose new lexical entries for L.

Often, the ontology engineer follows the proposal by the lexical entry & concept extraction mechanism and includes a new lexical entry in the ontology. Because the new lexical entry comes without an associated concept, the ontology engineer must then decide (possibly with help from further processing) whether to introduce a new concept or link the new lexical entry to an existing concept.

6.2 Hierarchical Concept Clustering
Given a lexicon and a set of concepts, one major next step is the taxonomic classification of concepts. One generally applicable method in this regard is hierarchical clustering. Hierarchical clustering exploits the similarity of items in order to propose a hierarchy of item categories. The similarity measure is defined on the properties of items.

Given the task of extracting a hierarchy from natural language text, adjacency of terms or syntactical relationships between terms are two properties that yield considerable descriptive power to induce the semantic hierarchy of concepts related to these terms.

A sophisticated example for hierarchical clustering is given by Faure & Nedellec (cf. reference [6] in survey, Table 1): They present a cooperative machine learning system, ASIUM, which acquires taxonomic relations and subcategorization frames of verbs based on syntactic input. The ASIUM system hierarchically clusters nouns based on the verbs that they are syntactically related with and vice versa. Thus, they cooperatively extend the lexicon, the set of concepts, and the concept heterarchy (L; C; HC).

6.3 Dictionary Parsing
Machine-readable dictionaries (MRD) are frequently available for many domains. Though their internal structure is free text to a large extent, there are comparatively few patterns that are used to give text definitions. Hence, MRDs exhibit a large degree of regularity that may be exploited for extracting a domain conceptualization and proposing it to the ontology engineer.

Text-To-Onto has been used to generate a taxonomy of concepts from a machine-readable dictionary of an insurance company (cf. reference [15] in survey, Table 1). As with term extraction from free text, morphological processing is applied, this time, however, complementing several pattern-matching heuristics. For example, the dictionary contained the following entry:

Automatic Debit Transfer: Electronic service arising from a debit authorization of the Yellow Account holder for a recipient to debit bills that fall due direct from the account.

Several heuristics were applied to the morphologically analyzed definitions. For instance, one simple heuristic relates the definition term, here "automatic debit transfer", with the first noun phrase occurring in the definition, here "electronic service". Their corresponding concepts are linked in the heterarchy HC:

HC(automatic debit transfer, electronic service).

Applying this heuristic iteratively, one may propose large parts of the target ontology, more precisely L; C and HC, to the ontology engineer. In fact, because verbs tend to be modeled as relations, R (and the linkage between R and L) may be extended in this way, too.

6.4 Association Rules
Association rule learning algorithms are typically used for prototypical applications of data mining, like finding associations that occur between items, e.g. supermarket products, in a set of transactions, e.g. customers' purchases. The generalized association rule learning algorithm extends its baseline by aiming at descriptions at the appropriate level of the taxonomy, e.g. "snacks are purchased together with drinks" rather than "chips are purchased with beer" and "peanuts are purchased with soda".

In Text-To-Onto (cf. reference [17] in survey, Table 1) we use a modification of the generalized association rule learning algorithm for discovering properties between classes. A given class hierarchy HC serves as background knowledge. Pairs of syntactically related classes (e.g. pair(festival,island) describing the head-modifier relationship contained in the sentence "The festival on Usedom (footnote 6) attracts tourists from all over the world.") are given as input to the algorithm. The algorithm generates association rules comparing the relevance of different rules while climbing up and/or down the taxonomy. The apparently most relevant binary rules are proposed to the ontology engineer for modeling relations into the ontology, thus extending R.

Footnote 6: Usedom is an island located in the north-east of Germany in the Baltic Sea.

As the number of generated rules is typically high, we offer various modes of interaction. For example, it is possible to restrict the number of suggested relations by defining so-called restriction classes that have to participate in the relations that are extracted. Another way of focusing is the flexible enabling / disabling of the use of taxonomic knowledge for extracting relations.

Results are presented offering various views onto the results as depicted in Figure 4. A generalized relation that may be induced by the partially given example data above may be the property(event,area), which may be named by
the ontology engineer as locatedIn, viz. events are located in an area (thus extending L and F). The user may add the extracted relations to the ontology by drag-and-drop. To explore and determine the right aggregation level of adding a relation to the ontology, the user may browse the hierarchy view on extracted properties as given in the left part of Figure 4. This view may also support the ontology engineer in defining appropriate subPropertyOf relations between properties, such as subPropertyOf(hasDoubleRoom,hasRoom) (thereby extending HR).

Figure 4: Result Presentation in Text-To-Onto

7. PRUNING THE ONTOLOGY
A common theme of modeling in various disciplines is the balance between completeness and scarcity of the domain model. It is a widely held belief that targeting completeness for the domain model on the one hand appears to be practically unmanageable and computationally intractable, and targeting the scarcest model on the other hand is overly limiting with regard to expressiveness. Hence, what we strive for is a balance between these two that really works. We aim at a model that captures a rich conceptualization of the target domain, but that excludes parts that are out of its focus. The import & reuse of ontologies as well as the extraction of ontologies considerably pull the lever of the scale into the imbalance where out-of-focus concepts reign. Therefore, we pursue the appropriate diminishing of the ontology in the pruning phase.

There are at least two dimensions to look at the problem of pruning. First, one needs to clarify how the pruning of particular parts of the ontology (e.g., the removal of a concept or a relation) affects the rest. For instance, Peterson et al. [26] have described strategies that leave the user with a coherent ontology (i.e. no dangling or broken links). Second, one may consider strategies for proposing ontology items that should be either kept or pruned. We have investigated several mechanisms for generating proposals from application data. Given a set of application-specific documents there are several strategies for pruning the ontology. They are based on absolute or relative counts of frequency of terms (cf. reference [15] in survey, Table 1).

8. REFINING THE ONTOLOGY
Refining plays a similar role to extracting. Their difference lies on a sliding scale rather than in a clear-cut distinction. While extracting serves mostly for cooperative modeling of the overall ontology (or at least of very significant chunks of it), the refinement phase is about fine tuning the target ontology and the support of its evolving nature. The refinement phase may use data that comes from the concrete Semantic Web application, e.g. log files of user queries or generic user data. Adapting and refining the ontology with respect to user requirements plays a major role for the acceptance of the application and its further development.

In principle, the same algorithms may be used for extraction as for refinement. However, during refinement one must consider in detail the existing ontology and the existing connections into the ontology, while extraction works more often than not practically from scratch.

A prototypical approach for refinement (though not for extraction!) has been presented by Hahn & Schnattinger (cf. reference [11] in survey, Table 1). They have introduced a methodology for automating the maintenance of domain-specific taxonomies. An ontology is incrementally updated as new concepts are acquired from text. The acquisition process is centered around the linguistic and conceptual "quality" of various forms of evidence underlying the generation and refinement of concept hypotheses. In particular they consider semantic conflicts and analogous semantic structures from the knowledge base into the ontology in order to determine the quality of a particular proposal. Thus, they extend an existing ontology with new lexical entries for L, new concepts for C and new relations for HC.

9. RELATED WORK
Until recently ontology learning per se, i.e. for comprehensive construction of ontologies, has not existed. We here give the reader a comprehensive overview of existing work that has actually researched and practiced techniques for solving parts of the overall problem of ontology learning.

There are only a few approaches that described the development of frameworks and workbenches for extracting ontologies from data: Faure & Nedellec [6] present a cooperative machine learning system, ASIUM, which acquires taxonomic relations and subcategorization frames of verbs based on syntactic input. The ASIUM system hierarchically clusters nouns based on the verbs that they are syntactically related with and vice versa. Thus, they cooperatively extend the lexicon, the set of concepts, and the concept heterarchy (L; C; HC).

Hahn and Schnattinger [11] introduced a methodology for the maintenance of domain-specific taxonomies. An ontology is incrementally updated as new concepts are acquired from real-world texts. The acquisition process is centered around linguistic and conceptual "quality" of various forms of evidence underlying the generation and refinement of concept hypotheses. Their ontology learning approach is embedded in a framework for natural language understanding, named Syndicate [10].

Mikheev & Finch [18] have presented their KAWB Workbench for "Acquisition of Domain Knowledge from Natural Language". The workbench comprises a set of computational tools for uncovering internal structure in natural language texts. The main idea behind the workbench is the independence of the text representation and text analysis phases. At the representation phase the text is converted from a sequence of characters to features of interest by means of the annotation tools. At the analysis phase those features are used by statistics gathering and inference tools for finding significant correlations in the texts. The analysis tools are independent of particular assumptions about the nature of the feature-set and work on the abstract level of feature elements represented as SGML items.

Much work in a number of disciplines (computational linguistics, information retrieval, machine learning, databases, software engineering) has actually researched and practiced techniques for solving part of the overall problem. Hence, techniques and methods relevant for ontology learning may be found under terms like the acquisition of selectional restrictions (cf. Resnik [27] and Basili et al. [2]), word sense disambiguation and learning of word senses (cf. Hastings [34]), the computation of concept lattices from formal contexts (cf. Ganter & Wille [8]) and Reverse Engineering in software engineering (cf. Mueller et al. [23]).

Ontology Learning puts a number of research activities, which focus on different types of inputs, but share their tar-
get of a common domain conceptualization, into one per- Acknowledgements.. We thank our students, Dirk Wenke
spective. One may recognize that these activities are spread and Raphael Volz, for work at OntoEdit and Text-To-Onto.
between very many communities incurring references from Research for this paper was partially nanced by Ontoprise
20 completely di erent events / journals. GmbH, Karlsruhe, Germany, by US Air Force in the DARPA
DAML project \OntoAgents", by European Union in the
IST-1999-10132 project \On-To-Knowledge", and by Ger-
man BMBF in the project \GETESS" (01IN901C0).
10. CHALLENGES
Ontology Learning may add significant leverage to the Semantic Web, because it propels the construction of domain ontologies, which are needed quickly and cheaply for the Semantic Web to succeed. We have presented a comprehensive framework for Ontology Learning that crosses the boundaries of single disciplines, touching on a number of challenges. Table 1 gives a survey of what types of techniques should be included in a full-fledged ontology learning and engineering environment. The good news, however, is that one does not need perfect or optimal support for cooperative modeling of ontologies. At least according to our experience, "cheap" methods in an integrated environment may yield tremendous help for the ontology engineer.
While a number of problems remain within the single disciplines, some more challenges come up regarding the particular problem of Ontology Learning for the Semantic Web. First, with the XML-based namespace mechanisms the notion of an ontology with well-defined boundaries, e.g. only definitions that are in one file, will disappear. Rather, the Semantic Web may yield an "amoeba-like" structure regarding ontology boundaries, because ontologies refer to each other and import each other (cf. e.g. the DAML-ONT primitive import). However, it is not yet clear what the semantics of these structures will look like. In light of these facts the importance of methods like ontology pruning and crawling of ontologies will increase drastically. Second, we have so far restricted our attention in ontology learning to the conceptual structures that are (almost) contained in RDF(S) proper. Additional semantic layers on top of RDF (e.g. future OIL or DAML-ONT with axioms, A) will require new means for improved ontology engineering with axioms, too!

{| class="wikitable"
|+ Table 1: Classification of Ontology Learning Approaches
! Domain !! Method !! Features used !! Prime purpose !! Papers
|-
| rowspan="6" | Free Text || Clustering || Syntax || Extract || Buitelaar [3], Assadi [1] and Faure & Nedellec [6]
|-
| Inductive Logic Programming || Syntax, Logic representation || Extract || Esposito et al. [5]
|-
| Association rules || Syntax, Tokens || Extract || Maedche & Staab [17]
|-
| Frequency-based || Syntax || Prune || Kietz et al. [15]
|-
| Pattern-Matching || || Extract || Morin [22]
|-
| Classification || Syntax, Semantics || Refine || Schnattinger & Hahn [11]
|-
| rowspan="2" | Dictionary || Information extraction || Syntax || Extract || Hearst [12], Wilks [35] and Kietz et al. [15]
|-
| Page rank || Tokens || || Jannink & Wiederhold [13]
|-
| Knowledge base || Concept Induction, A-Box mining || Relations || Extract || Kietz & Morik [16] and Schlobach [28]
|-
| Semi-structured schemata || Naive Bayes || Relations || Reverse engineering || Doan et al. [4]
|-
| Relational schemata || Data Correlation || Relations || Reverse engineering || Johannesson [14] and Tari et al. [32]
|}

11. REFERENCES
[1] H. Assadi. Construction of a regional ontology from text and its use within a documentary system. In Proceedings of the International Conference on Formal Ontology and Information Systems - FOIS'98, Trento, Italy, 1998.
[2] R. Basili, M. T. Pazienza, and P. Velardi. Acquisition of selectional patterns in a sublanguage. Machine Translation, 8(1):175-201, 1993.
[3] Paul Buitelaar. CoreLex: Systematic Polysemy and Underspecification. PhD thesis, Brandeis University, Department of Computer Science, 1998.
[4] A. Doan, P. Domingos, and A. Levy. Learning Source Descriptions for Data Integration. In Proceedings of the International Workshop on The Web and Databases (WebDB-2000), 2000.
[5] F. Esposito, S. Ferilli, N. Fanizzi, and G. Semeraro. Learning from parsed sentences with INTHELEX. In Proceedings of Learning Language in Logic Workshop (LLL-2000), Lisbon, Portugal, 2000.
[6] D. Faure and C. Nedellec. A corpus-based conceptual clustering method for verb frames and ontology acquisition. In LREC workshop on adapting lexical and corpus resources to sublanguages and applications, Granada, Spain, 1998.
[7] B. Gaines and M. Shaw. Integrated knowledge acquisition architectures. Journal of Intelligent Information Systems, 1(1), 1992.
[8] B. Ganter and R. Wille. Formal Concept Analysis: Mathematical Foundations. Springer, Berlin -
Heidelberg - New York, 1999.
[9] E. Grosso, H. Eriksson, R. Fergerson, S. Tu, and M. Musen. Knowledge modeling at the millennium - the design and evolution of Protege-2000. In Proceedings of KAW-99, Banff, Canada, 1999.
[10] U. Hahn and M. Romacker. Content management in the Syndikate system - how technical documents are automatically transformed to text knowledge bases. Data & Knowledge Engineering, 35:137-159, 2000.
[11] U. Hahn and K. Schnattinger. Towards text knowledge engineering. In Proc. of AAAI '98, pages 129-144, 1998.
[12] M. A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics, Nantes, France, 1992.
[13] J. Jannink and G. Wiederhold. Thesaurus entry extraction from an on-line dictionary. In Proceedings of Fusion '99, Sunnyvale CA, July 1999. http://www-db.stanford.edu/SKC/publications.html.
[14] P. Johannesson. A method for transforming relational schemas into conceptual schemas. In M. Rusinkiewicz, editor, 10th International Conference on Data Engineering, pages 115-122, Houston, 1994. IEEE Press.
[15] J.-U. Kietz, A. Maedche, and R. Volz. Semi-automatic ontology acquisition from a corporate intranet. In International Conference on Grammar Inference (ICGI-2000), to appear: Lecture Notes in Artificial Intelligence, LNAI, 2000.
[16] J.-U. Kietz and K. Morik. A polynomial approach to the constructive induction of structural knowledge. Machine Learning, 14(2):193-218, 1994.
[17] A. Maedche and S. Staab. Discovering conceptual relations from text. In Proceedings of ECAI-2000. IOS Press, Amsterdam, 2000.
[18] A. Mikheev and S. Finch. A workbench for finding structure in text. In Proceedings of the 5th Conference on Applied Natural Language Processing - ANLP'97, March 1997, Washington DC, USA, pages 372-379, 1997.
[19] G. Miller. Wordnet: A lexical database for English. CACM, 38(11):39-41, 1995.
[20] K. Morik. Balanced cooperative modeling. Machine Learning, 11:217-235, 1993.
[21] K. Morik, S. Wrobel, J.-U. Kietz, and W. Emde. Knowledge acquisition and machine learning: Theory, methods, and applications. Academic Press, London, 1993.
[22] E. Morin. Automatic acquisition of semantic relations between terms from technical corpora. In Proc. of the Fifth International Congress on Terminology and Knowledge Engineering - TKE'99, 1999.
[23] H. A. Mueller, J. H. Jahnke, D. B. Smith, M.-A. Storey, S. R. Tilley, and K. Wong. Reverse Engineering: A Roadmap. In Proceedings of the 22nd International Conference on Software Engineering (ICSE-2000), Limerick, Ireland. Springer, 2000.
[24] G. Neumann, R. Backofen, J. Baur, M. Becker, and C. Braun. An information extraction core system for real world German text processing. In ANLP'97 - Proceedings of the Conference on Applied Natural Language Processing, pages 208-215, Washington, USA, 1997.
[25] N. Fridman Noy and M. A. Musen. PROMPT: Algorithm and Tool for Automated Ontology Merging and Alignment. In Proceedings of the 17th National Conf. on Artificial Intelligence (AAAI'2000), Austin, Texas. MIT Press/AAAI Press, 2000.
[26] B. Peterson, W. A. Andersen, and J. Engel. Knowledge bus: Generating application-focused databases from large ontologies. In Proc. of KRDB 1998, Seattle, Washington, USA, pages 2.1-2.10, 1998.
[27] P. Resnik. Selection and Information: A Class-based Approach to Lexical Relationships. PhD thesis, University of Pennsylvania, 1993.
[28] S. Schlobach. Assertional mining in description logics. In Proceedings of the 2000 International Workshop on Description Logics (DL2000), 2000. http://SunSITE.Informatik.RWTH-Aachen.DE/Publications/CEUR-WS/Vol-33/.
[29] S. Staab, J. Angele, S. Decker, M. Erdmann, A. Hotho, A. Maedche, H.-P. Schnurr, R. Studer, and Y. Sure. Semantic community web portals. Proc. of WWW9 / Computer Networks, 33(1-6):473-491, 2000.
[30] S. Staab and A. Maedche. Knowledge portals - ontologies at work. AI Magazine, 21(2), Summer 2001.
[31] S. Staab, H.-P. Schnurr, R. Studer, and Y. Sure. Knowledge processes and ontologies. IEEE Intelligent Systems, 16(1), 2001.
[32] Z. Tari, O. Bukhres, J. Stokes, and S. Hammoudi. The Reengineering of Relational Databases based on Key and Data Correlations. In Proceedings of the 7th Conference on Database Semantics (DS-7), 7-10 October 1997, Leysin, Switzerland. Chapman & Hall, 1998.
[33] G. Webb, J. Wells, and Z. Zheng. An experimental evaluation of integrating machine learning with knowledge acquisition. Machine Learning, 35(1):5-23, 1999.
[34] P. Wiemer-Hastings, A. Graesser, and K. Wiemer-Hastings. Inferring the meaning of verbs from context. In Proceedings of the Twentieth Annual Conference of the Cognitive Science Society, 1998.
[35] Y. Wilks, B. Slator, and L. Guthrie. Electric Words: Dictionaries, Computers, and Meanings. MIT Press, Cambridge, MA, 1996.