
Learning Ontologies for the Semantic Web

Alexander Maedche

ama@aifb.uni-karlsruhe.de

Steffen Staab

sst@aifb.uni-karlsruhe.de

Institute AIFB, University of Karlsruhe, 76128 Karlsruhe, Germany

Ontoprise GmbH, Haid-und-Neu Strasse 7, 76131 Karlsruhe, Germany

1999

The Semantic Web relies heavily on the formal ontologies that structure underlying data for the purpose of comprehensive and transportable machine understanding. Therefore, the success of the Semantic Web depends strongly on the proliferation of ontologies, which requires fast and easy engineering of ontologies and avoidance of a knowledge acquisition bottleneck. Ontology Learning greatly facilitates the construction of ontologies by the ontology engineer. The vision of ontology learning that we propose here includes a number of complementary disciplines that feed on different types of unstructured, semi-structured and fully structured data in order to support a semi-automatic, cooperative ontology engineering process. Our ontology learning framework proceeds through ontology import, extraction, pruning, refinement, and evaluation, giving the ontology engineer a wealth of coordinated tools for ontology modeling. Besides the general framework and architecture, we show in this paper some exemplary techniques in the ontology learning cycle that we have implemented in our ontology learning environment, Text-To-Onto, such as ontology learning from free text, from dictionaries, or from legacy ontologies, and refer to some others that need to complement the complete architecture, such as reverse engineering of ontologies from database schemata or learning from XML documents.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission by the authors.

Semantic Web Workshop 2001 Hongkong, China

Copyright by the authors.

1. ONTOLOGIES FOR THE SEMANTIC WEB

The cheap and fast construction of domain-specific ontologies is crucial for the success and the proliferation of the Semantic Web.

Though ontology engineering tools have become mature over the last decade (cf. [9]), the manual acquisition of ontologies still remains a tedious, cumbersome task that easily results in a knowledge acquisition bottleneck. Having developed our ontology engineering workbench, OntoEdit, we had to face exactly this issue; in particular, we were given questions concerning the time, difficulty and confidence involved in building an ontology.

In fact, these problems of time, difficulty and confidence that we ended up with were similar to what knowledge engineers had dealt with over the last two decades when they elaborated on methodologies for knowledge acquisition or workbenches for defining knowledge bases. A method that proved extremely beneficial for the knowledge acquisition task was the integration of knowledge acquisition with machine learning techniques [33]. The drawback of these approaches, e.g. the work described in [21], however, was their rather strong focus on structured knowledge or data bases, from which they induced their rules.

In contrast, in the Web environment that we encounter when building Web ontologies, the structured knowledge or data base is rather the exception than the norm. Hence, intelligent means for an ontology engineer take on a different meaning than the (very seminal) integration architectures for more conventional knowledge acquisition [7].

Our notion of Ontology Learning aims at the integration of a multitude of disciplines in order to facilitate the construction of ontologies, in particular machine learning. Because the fully automatic acquisition of knowledge by machines remains in the distant future, we consider the process of ontology learning as semi-automatic with human intervention, adopting the paradigm of balanced cooperative modeling [20] for the construction of ontologies for the Semantic Web. With this objective in mind, we have built an architecture that combines knowledge acquisition with machine learning, feeding on the resources that we nowadays find on the syntactic Web, viz. free text, semi-structured text, schema definitions (DTDs), etc. Thereby, modules in our framework serve different steps in the engineering cycle, which here consists of the following five steps (cf. Figure 1):

First, existing ontologies are imported and reused by merging existing structures or defining mapping rules between existing structures and the ontology to be established. For instance, [26] describe how ontological structures contained in Cyc are used in order to facilitate the construction of a domain-specific ontology. Second, in the ontology extraction phase major parts of the target ontology are modeled with learning support feeding from web documents. Third, this rough outline of the target ontology needs to be pruned in order to better adjust the ontology to its prime purpose. Fourth, ontology refinement profits from the given domain ontology, but completes the ontology at a fine granularity (also in contrast to extraction). Fifth, the prime target application serves as a measure for validating the resulting ontology [31]. Finally, one may revolve again in this cycle, e.g. for including new domains into the constructed ontology or for maintaining and updating its scope.

Though ontologies and their underlying data in the Semantic Web are envisioned to be reusable for a wide range of possibly unforeseen applications, a particular target application remains the touchstone for a given ontology. In our case, we have been dealing with ontology-based knowledge portals that structure Web content and that allow for structured provisioning and accessing of data [29, 30]. Knowledge portals are information intermediaries for knowledge accessing and sharing on the Web. The development of a knowledge portal consists of the tasks of structuring the knowledge, establishing means for providing new knowledge and accessing the knowledge contained in the portal.

A considerable part of development and maintenance of the portal lies in integrating legacy information as well as in constructing and maintaining the ontology in vast, often unknown, terrain. For instance, a knowledge portal may focus on the electronics sector, integrating comparative shopping in conjunction with manuals, reports and opinions about current electronic products. The creation of the background ontology for this knowledge portal involves tremendous efforts for engineering the conceptual structures that underlie existing warehouse databases, product catalogues, user manuals, test reports and newsgroup discussions. Correspondingly, ontology structures must be constructed from database schemata, a given product thesaurus (like BMEcat), XML documents and document type definitions (DTDs), and free texts. Still worse, significant parts of these (meta-)data change extremely fast and, hence, require a regular update of the corresponding ontology parts.

Thus, very different types of (meta-)data might be useful input for the construction of the ontology. However, in practice one needs comprehensive support in order to deal with this wealth. Hence, there comes the need for a range of different techniques: structured data and meta data require reverse engineering approaches, free text may contribute to ontology learning directly or through information extraction approaches. Semi-structured data may finally require and profit from both.

In the following we elaborate on our ontology learning framework. Thereby we approach different techniques for different types of data, showing parts of our architecture, its current status, and parts that may complement our current Text-To-Onto environment. A general overview of ontology learning techniques as well as corresponding references may be found in Section 9.

3. AN ARCHITECTURE FOR ONTOLOGY LEARNING

As core to our approach we have built a graphical user interface to support the ontology engineering process manually performed by the ontology engineer. Here, we offer sophisticated graphical means for manual modeling and refining the final ontology. Different views are offered to the user, targeting the epistemological level rather than a particular representation language. However, the ontological structures built there may be exported to standard Semantic Web representation languages, such as OIL and DAML-ONT, as well as our own F-Logic based extensions of RDF(S). In addition, executable representations for constraint checking and application debugging can be generated and then accessed via SilRi, our F-Logic inference engine, that is directly connected with OntoEdit.

The sophisticated ontology engineering tools we knew, e.g. the Protege modeling environment for knowledge-based systems [9], would offer capabilities roughly comparable to OntoEdit. However, given the task of constructing a knowledge portal, we found that there was this large conceptual bridge between the ontology engineering tool and the input (often legacy data), such as Web documents, Web document schemata, databases on the Web, and Web ontologies, which ultimately determined the target ontology. Into this void we have positioned new components of our ontology learning architecture (cf. Figure 2). The new components support the ontology engineer in importing existing ontology primitives, extracting new ones, pruning given ones, or refining with additional ontology primitives. In our case, the ontology primitives comprise: [...] a taxonomy of concepts with multiple inheritance (heterarchy) HC; [...]

[Figure 2: architecture for ontology learning. Recoverable labels: Lexicon 1 ... Lexicon n; Ontology Engineer.]

4. COMPONENTS FOR LEARNING ONTOLOGIES

4.1 Management component

4.2 Resource processing component

4.3 Algorithm Library

As described above, an ontology may be described by a number of sets of concepts, relations, lexical entries, and links between these entities. An existing ontology definition (including L, C, HC, R, HR, A, F, G) may be acquired using various algorithms working on this definition and the preprocessed input data. While specific algorithms may greatly vary from one type of input to the next, there is also considerable overlap concerning underlying learning approaches.
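To make the ontology definition (L, C, HC, R, HR, A, F, G) more tangible, the following minimal Python sketch shows one way such primitives could be held in memory. It is an illustration under stated assumptions, not the data model of OntoEdit or Text-To-Onto: the class and method names are hypothetical, the reading of F and G as links from lexical entries to concepts and to relations respectively is assumed, and axioms are kept as opaque strings. The names `locatedIn`, `hasRoom` and `hasDoubleRoom` are taken from the paper's own examples; `CityFestival` and `Attraction` are invented for the demonstration.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Hypothetical container for the ontology primitives named in the text."""
    lexicon: set = field(default_factory=set)             # L: lexical entries (strings)
    concepts: set = field(default_factory=set)            # C: concept identifiers
    concept_parents: dict = field(default_factory=dict)   # HC: concept -> set of super-concepts
    relations: set = field(default_factory=set)           # R: relation identifiers
    relation_parents: dict = field(default_factory=dict)  # HR: relation -> set of super-relations
    concept_labels: set = field(default_factory=set)      # assumed F: (lexical entry, concept)
    relation_labels: set = field(default_factory=set)     # assumed G: (lexical entry, relation)
    axioms: list = field(default_factory=list)            # A: axioms as opaque strings

    def add_concept(self, name, label, parents=()):
        """Declare a concept, its lexical entry, and its super-concepts."""
        self.concepts.add(name)
        self.lexicon.add(label)
        self.concept_labels.add((label, name))
        self.concept_parents.setdefault(name, set()).update(parents)

    def add_relation(self, name, label, parents=()):
        """Declare a relation, its lexical entry, and its super-relations."""
        self.relations.add(name)
        self.lexicon.add(label)
        self.relation_labels.add((label, name))
        self.relation_parents.setdefault(name, set()).update(parents)

    def super_concepts(self, name):
        """All super-concepts reachable through HC (multiple inheritance allowed)."""
        seen, stack = set(), list(self.concept_parents.get(name, ()))
        while stack:
            parent = stack.pop()
            if parent not in seen:
                seen.add(parent)
                stack.extend(self.concept_parents.get(parent, ()))
        return seen

    def dangling_links(self):
        """HC/HR links whose target has never been declared."""
        missing = set()
        for child, parents in self.concept_parents.items():
            missing |= {(child, p) for p in parents if p not in self.concepts}
        for child, parents in self.relation_parents.items():
            missing |= {(child, p) for p in parents if p not in self.relations}
        return missing


onto = Ontology()
onto.add_concept("Area", "area")
onto.add_concept("Event", "event")
# Hypothetical concept with two parents; "Attraction" is deliberately undeclared.
onto.add_concept("CityFestival", "city festival", parents={"Event", "Attraction"})
onto.add_relation("locatedIn", "located in")
onto.add_relation("hasRoom", "has room")
onto.add_relation("hasDoubleRoom", "has double room", parents={"hasRoom"})

print(onto.super_concepts("CityFestival"))
print(onto.dangling_links())
```

Keeping the heterarchies as plain parent maps makes it cheap to detect dangling links, the kind of coherence check that matters once parts of an ontology are imported or removed.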

5. IMPORT & REUSE

6.1 Lexical Entry & Concept Extraction

[...] the ontology engineer as locatedIn, viz. events are located in an area (thus extending L and F). The user may add the extracted relations to the ontology by drag-and-drop. To explore and determine the right aggregation level of adding a relation to the ontology, the user may browse the hierarchy view on extracted properties as given in the left part of Figure 4. This view may also support the ontology engineer in defining appropriate subPropertyOf relations between properties, such as subPropertyOf(hasDoubleRoom,hasRoom) (thereby extending HR).

7. PRUNING THE ONTOLOGY

A common theme of modeling in various disciplines is the balance between completeness and scarcity of the domain model. It is a widely held belief that targeting completeness for the domain model on the one hand appears to be practically unmanageable and computationally intractable, and targeting the scarcest model on the other hand is overly limiting with regard to expressiveness. Hence, what we strive for is a workable balance between these two. We aim at a model that captures a rich conceptualization of the target domain, but that excludes parts that are out of its focus. The import & reuse of ontologies as well as the extraction of ontologies considerably pull the lever of the scale into the imbalance where out-of-focus concepts reign. Therefore, we pursue the appropriate diminishing of the ontology in the pruning phase.

There are at least two dimensions to look at the problem of pruning. First, one needs to clarify how the pruning of particular parts of the ontology (e.g., the removal of a concept or a relation) affects the rest. For instance, Peterson et al. [26] have described strategies that leave the user with a coherent ontology (i.e. no dangling or broken links). Second, one may consider strategies for proposing ontology items that should be either kept or pruned. We have investigated several mechanisms for generating proposals from application data. Given a set of application-specific documents there are several strategies for pruning the ontology. They are based on absolute or relative counts of frequency of terms (cf. reference [15] in survey, Table 1).

8. REFINING THE ONTOLOGY

Refining plays a similar role as extracting. Their difference exists rather on a sliding scale than by a clear-cut distinction. While extracting serves mostly for cooperative modeling of the overall ontology (or at least of very significant chunks of it), the refinement phase is about fine tuning the target ontology and the support of its evolving nature. The refinement phase may use data that comes from the concrete Semantic Web application, e.g. log files of user queries or generic user data. Adapting and refining the ontology with respect to user requirements plays a major role for the acceptance of the application and its further development.

In principle, the same algorithms may be used for extraction as for refinement. However, during refinement one must consider in detail the existing ontology and the existing connections into the ontology, while extraction works more often than not practically from scratch.

A prototypical approach for refinement (though not for extraction!) has been presented by Hahn & Schnattinger (cf. reference [11] in survey, Table 1). They have introduced a methodology for automating the maintenance of domain-[...]

9. RELATED WORK

Table 1: Classification of Ontology Learning Approaches

Domain | Method | Features used | Prime purpose | Papers
[...] | [...] | Syntax | Extract | Buitelaar [3], Assadi [1] and Faure & Nedellec [6]
 | Inductive Logic Programming | | Extract | Esposito et al. [5]
 | Association rules | | | Maedche & Staab [17]
 | Frequency-based | | | Kietz et al. [15]
 | Pattern-Matching | | | Morin [22]
 | Classification | Syntax, Semantics | | Schnattinger & Hahn [11]
Dictionary | Information extraction | Syntax | | Hearst [12], Wilks [35] and Kietz et al. [15]
 | Page rank | Tokens | | Jannink & Wiederhold [13]
Knowledge base | Concept Induction, A-Box mining | Relations | Extract | Kietz & Morik [16] and Schlobach [28]
Semi-structured schemata | Naive Bayes | Relations | Reverse engineering | Doan et al. [4]
Relational schemata | Data Correlation | Relations | Reverse engineering | Johannesson [14] and Tari et al. [32]

10. CHALLENGES

Ontology Learning may add significant leverage to the Semantic Web, because it propels the construction of domain ontologies, which are needed quickly and cheaply for the Semantic Web to succeed. We have presented a comprehensive framework for Ontology Learning that crosses the boundaries of single disciplines, touching on a number of challenges. Table 1 gives a survey of what types of techniques should be included in a full-fledged ontology learning and engineering environment. The good news however is that one does not need perfect or optimal support for cooperative modeling of ontologies. At least according to our experience "cheap" methods in an integrated environment may yield tremendous help for the ontology engineer.

While a number of problems remain with the single disciplines, some more challenges come up regarding the particular problem of Ontology Learning for the Semantic Web. First, with the XML-based namespace mechanisms the notion of an ontology with well-defined boundaries, e.g. only definitions that are in one file, will disappear. Rather, the Semantic Web may yield an "amoeba-like" structure regarding ontology boundaries, because ontologies refer to each other and import each other (cf. e.g. the DAML-ONT primitive import). However, it is not yet clear what the semantics of these structures will look like. In light of these facts the importance of methods like ontology pruning and crawling of ontologies will drastically increase still. Second, we have so far restricted our attention in ontology learning to the conceptual structures that are (almost) contained in RDF(S) proper. Additional semantic layers on top of RDF (e.g. future OIL or DAML-ONT with axioms, A) will require new means for improved ontology engineering with axioms, too!

11. REFERENCES

[1] H. Assadi. Construction of a regional ontology from text and its use within a documentary system. In Proceedings of the International Conference on Formal Ontology and Information Systems - FOIS’98, Trento, Italy, 1998.

[2] R. Basili, M. T. Pazienza, and P. Velardi. Acquisition of selectional patterns in a sublanguage. Machine Translation, 8(1):175-201, 1993.

[3] Paul Buitelaar. CoreLex: Systematic Polysemy and Underspecification. PhD thesis, Brandeis University, Department of Computer Science, 1998.

[4] A. Doan, P. Domingos, and A. Levy. Learning Source Descriptions for Data Integration. In Proceedings of the International Workshop on The Web and Databases (WebDB-2000), 2000.

[5] F. Esposito, S. Ferilli, N. Fanizzi, and G. Semeraro. Learning from parsed sentences with inthelex. In Proceedings of Learning Language in Logic Workshop (LLL-2000), Lisbon, Portugal, 2000.

[6] D. Faure and C. Nedellec. A corpus-based conceptual clustering method for verb frames and ontology acquisition. In LREC workshop on adapting lexical and corpus resources to sublanguages and applications, Granada, Spain, 1998.

[7] B. Gaines and M. Shaw. Integrated knowledge acquisition architectures. Journal of Intelligent Information Systems, 1(1), 1992.

[8] B. Ganter and R. Wille. Formal Concept Analysis: Mathematical Foundations. Springer, Berlin - Heidelberg - New York, 1999.

[9] E. Grosso, H. Eriksson, R. Fergerson, S. Tu, and M. Musen. Knowledge modeling at the millennium - the design and evolution of protege-2000. In Proceedings of KAW-99, Banff, Canada, 1999.

[10] U. Hahn and M. Romacker. Content management in the syndikate system - how technical documents are automatically transformed to text knowledge bases. Data & Knowledge Engineering, 35:137-159, 2000.

[11] U. Hahn and K. Schnattinger. Towards text knowledge engineering. In Proc. of AAAI ’98, pages 129-144, 1998.

[12] M.A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics. Nantes, France, 1992.

[13] J. Jannink and G. Wiederhold. Thesaurus entry extraction from an on-line dictionary. In Proceedings of Fusion ’99, Sunnyvale CA, July 1999. http://www-db.stanford.edu/SKC/publications.html.

[14] P. Johannesson. A method for transforming relational schemas into conceptual schemas. In M. Rusinkiewicz, editor, 10th International Conference on Data Engineering, pages 115-122, Houston, 1994. IEEE Press.

[15] J.-U. Kietz, A. Maedche, and R. Volz. Semi-automatic ontology acquisition from a corporate intranet. In International Conference on Grammar Inference (ICGI-2000), to appear: Lecture Notes in Artificial Intelligence, LNAI, 2000.

[16] J.-U. Kietz and K. Morik. A polynomial approach to the constructive induction of structural knowledge. Machine Learning, 14(2):193-218, 1994.

[17] A. Maedche and S. Staab. Discovering conceptual relations from text. In Proceedings of ECAI-2000. IOS Press, Amsterdam, 2000.

[18] A. Mikheev and S. Finch. A workbench for finding structure in text. In Proceedings of the 5th Conference on Applied Natural Language Processing - ANLP’97, March 1997, Washington DC, USA, pages 372-379, 1997.

[19] G. Miller. Wordnet: A lexical database for English. CACM, 38(11):39-41, 1995.

[20] K. Morik. Balanced cooperative modeling. Machine Learning, 11:217-235, 1993.

[21] K. Morik, S. Wrobel, J.-U. Kietz, and W. Emde. Knowledge acquisition and machine learning: Theory, methods, and applications. Academic Press, London, 1993.

[22] E. Morin. Automatic acquisition of semantic relations between terms from technical corpora. In Proc. of the Fifth International Congress on Terminology and Knowledge Engineering - TKE’99, 1999.

[23] H. A. Mueller, J. H. Jahnke, D. B. Smith, M.-A. Storey, S. R. Tilley, and K. Wong. Reverse Engineering: A Roadmap. In Proceedings of the 22nd International Conference on Software Engineering (ICSE-2000), Limerick, Ireland. Springer, 2000.

[24] G. Neumann, R. Backofen, J. Baur, M. Becker, and C. Braun. An information extraction core system for real world german text processing. In ANLP’97 - Proceedings of the Conference on Applied Natural Language Processing, pages 208-215, Washington, USA, 1997.

[25] N. Fridman Noy and M. A. Musen. PROMPT: Algorithm and Tool for Automated Ontology Merging and Alignment. In Proceedings of the 17th National Conf. on Artificial Intelligence (AAAI’2000), Austin, Texas. MIT Press/AAAI Press, 2000.

[26] B. Peterson, W.A. Andersen, and J. Engel. Knowledge bus: Generating application-focused databases from large ontologies. In Proc of KRDB 1998, Seattle, Washington, USA, pages 2.1-2.10, 1998.

[27] P. Resnik. Selection and Information: A Class-based Approach to Lexical Relationships. PhD thesis, University of Pennsylvania, 1993.

[28] S. Schlobach. Assertional mining in description logics. In Proceedings of the 2000 International Workshop on Description Logics (DL2000), 2000. http://SunSITE.Informatik.RWTH