Acquisition of Ontological Knowledge from Canonical Documents

Raphael Malyankar
Dept. of Computer Science and Engineering
Arizona State University
Tempe, AZ 85287, USA.
E-mail: rmm@acm.org

(Position Paper)

Abstract

This paper describes experiences with quasi-automated creation of a computational ontology for maritime information from a mixed collection of source material. Based on these experiences, hypotheses and conclusions concerning the creation of computational ontologies for engineering and other technical or scientific domains are presented. Heuristics for resolving anomalies in ontologies generated from mixed sources are also described.

1 Introduction

This paper describes our experiences with ontology acquisition in the context of maritime information. Ontological information is acquired from multiple types of sources, including standards documents, database schemas, lexicons, collections of symbology definitions, and also by inference from semi-structured documents. This is followed by a description of the computational approach to rationalization, alignment, and merging of the ontological information derived from these sources. The computational ontology thus created is intended to be used in creating a Maritime Information Markup Language (MIML) for tagging of documents in this domain. An example of the kind of application that will be enabled is a question-answering system that extracts only necessary and relevant information from marked-up text documents.

The observations and heuristics described in this paper apply to domains (here, maritime information) where ontological knowledge must be acquired from different types of source material. It appears that in some domains, the sub-ontologies thus generated are likely to differ not only linguistically, but also in their topological profiles (i.e., depth and other structure). The heuristics described in this paper are designed for a computational approach to combining such sub-ontologies.

2 Sources of Ontological Knowledge

The sources used for ontological knowledge were selected from a canonical set, that is, they are documents that are accepted within the domain as normative and that are widely used.

2.1 Standards Documents

The most recent normative standard for digital nautical chart content is the S-57 Standard [International Hydrographic Organization, 1996]. The ‘object catalog’ section of this document consists of a list of chart entities, definitions, and entity attributes, which gives us a collection (sic) of domain entities that can be considered canonical as far as the scope of the standard goes. Extraction from this ‘object catalog’ was automated by using graph traversal programs that exploit links between entities and attributes in the object catalog. The automated extraction resulted in 173 classes and 186 slots. A comparison of 10% (selected at random) of the extracted information with the original source indicated error rates of 8% to 20% (for different categories of ontological knowledge: classes, types, and attributes). The additional effort needed to reduce this error in the automated portion of the extraction was not undertaken, as it proved not very laborious to make the corrections by hand (about 10 hours for a non-expert who compared the extracted ontology with the original source).

A second source was the Spatial Data Transfer Standard [FGDC, 1998]. The parts we used were the sections that list ‘included terms’ (analogous to a synonym list) and attribute definitions. Extraction from this was less satisfactory in some ways, since these sections are less rigorous than the object catalog of the S-57 standard, but, on the other hand, the synonym list covers more of the terms used in practice.

While the S-57 standard is normative, there are two deficiencies involved in using it:

1. It is limited in scope. This standard covers only objects (entities) that are used in digital nautical charts. Important concepts such as weather conditions are not mentioned at all, and other concepts such as tides are mentioned only incidentally or in an implicit manner, for example in defining entity classes and as attribute qualifiers for entities (e.g., foreshore areas, the part of the shore covered and uncovered by tides).

2. It uses a restricted terminology, i.e., usually only one of multiple synonymous terms. The ‘missing’ terms are sometimes used in other documents and it is necessary to establish synonym relationships to facilitate understanding.

Further semantic structure is induced from lexical clues and attribute sets. The heuristics used for this induction process currently consist of lexical clues from the linguistic similarity of entity names and entity definitions, and comparison of attribute sets to compute measures of the semantic distance between attributes. For example, there are multiple “beacon” objects in the object catalog (“cardinal” beacon, “danger” beacon, etc.). Lexical comparison of the object names for these several classes, and of the descriptions associated with these classes (also scraped from the abovementioned object catalog), indicated the possibility of a ‘beacon’ class as a superclass for these several classes. This is further described in Section 4.

2.2 Databases and Schemas

The primary database we have used so far is the sample Digital Nautical Chart (DNC) data files available from NIMA. It has somewhat more semantic structure than the aforementioned standards, consisting as it does of feature classifications organized by ‘layers’, for example, environmental features, cultural features, land cover features, etc. (‘Feature’, as used in the domain, is equivalent to ‘class’.) Induction of ontological knowledge from this consisted of mapping the structure to a class hierarchy. This mapping was also done automatically from the schema for the database. It resulted in 134 classes, of which 118 are feature classes, 12 are coverage classes, and 4 are geographic structure type (point, line, area, or text) classes.

As with the S-57 standard, this database and schema cover only chart entities, and the terminology is even more restricted (and to some extent, more opaque) than that of the S-57 standard, due to the use of abbreviated names for entities and attributes, and the lack of textual definitions.

2.3 Lexicons and Symbology Definitions

A separate effort used Protégé [Grosso et al., 1999] and a standard collection of symbology definitions from NOAA’s Chart No. 1 [National Oceanic and Atmospheric Administration, 1997] to create an ontology of navigation aids, hazards, and other entities. Chart No. 1 is a collection of symbology for nautical charts accompanied by brief definitions of what each symbol stands for. It is organized semantically (in that related symbols are in the same section or subsection). This was supplemented with a widely popular publication on navigation and seamanship (Chapman Piloting [Maloney, 1999]) and an online dictionary of chart terms (discovered and used by the creator, a student unfamiliar with nautical terms). Ontology creation based on these documents consisted of manual entry of information using Protégé, due to the lack of electronic versions of the symbology definitions. Approximately 500 classes and 100 slots resulted from this effort, which was carried out by non-experts using the publications mentioned. (The paucity of slots is due to the nature of the documents, which contain little mention of details corresponding to symbols.)

2.4 Semi-Structured Normative Material

The United States Coast Pilot is a 9-volume series containing information that is important to navigators of US coastal waters (including the Great Lakes) but which cannot be included in a nautical chart. Each volume consists of 200 to 300 pages or more of two-column text in 10-point type. Included are photographs, diagrams, and small maps. The flow of text follows the coastline geographically, e.g., from north to south. This is a ‘lightly structured’ document, with each volume containing a preliminary chapter on navigation regulations (which includes a compendium of rules and regulations, specifications of environmentally protected zones, restricted areas, etc.), followed by chapters dealing with successive sectors of the coast. Each chapter is further divided into sections (still in geographical order); each section is further divided into sub-sections and paragraphs describing special hazards, recognizable landmarks, facilities, etc. The internal structure of subsections and paragraphs provides taxonomical hints, indicating, for example, which leaf entities are categorizable as sub-classes of weather conditions, as well as providing a small amount of additional taxonomical information that extends taxonomies derived from other classes (e.g., tide races as a form of navigational hazard). The Coast Pilot is normative (in the sense of using well-understood terms) and comprehensive. A version marked up with XML would have proved invaluable for ontology learning, but there is no such version available at this time.

2.5 Other Sources

Online content proved a useful and irreplaceable source of some information, especially attributes relating to weather data. Entry of this part was entirely manual. Other sources to be used include the Ports list and Light list, for information on port facilities and navigation aids respectively.

3 Alignment, Merging, and Rationalization

We have discovered that though there is a certain amount of duplication between the above sources, they are largely independent and produce different parts of the taxonomy for the maritime information domain as a whole, and sometimes different taxonomical structures for some parts of the domain. The need to merge and align the ontologies generated from the sources mentioned naturally arises, along with the need to reconcile conflicts between different ontologies. This section describes the major issues arising in combining different ontologies, and the techniques adopted to resolve them. In addition, we are using some of these heuristics to rationalize individual ontologies by detecting anomalies in their structure.

3.1 Alignment and Merging

There are at least two distinct taxonomic hierarchies in our source material: (i) a classification into point, area, or line features, and (ii) a different, natural, semantic hierarchy (natural in the sense that it is the categorization that a human tends to create). Item (i) is attributable to the original purpose of the standards document that produced such a taxonomy: it was intended for geographical information systems, and therefore its point of view is that of a computer graphics system instead of a knowledge-based system. Alignment of the ‘sub-ontologies’ consists of assembling a jigsaw puzzle in the sense of [Noy and Musen, 2000].

[Figure 1: Merging Similar Classes. Two trees are shown: in one, Cardinal Beacon, Isolated Beacon, and Lateral Beacon are direct subclasses of Navigation Aid; in the other, they are grouped under an intermediate Beacon class below Navigation Aid.]

3.2 Resolution of Structure Mismatches

Another issue is structure mismatch, leading to what can be called the reification question: should a concept distinguishing two entities be made manifest through distinct values for a slot, or should the distinction be manifest as a type within the class (thus giving distinct sub-classes)? We have discovered that automated extraction from an object catalog or schema tends to produce shallow, bushy class hierarchies (i.e., it prefers translating distinctions into a range for an attribute slot), while manual creation tends to create deeper and less bushy type hierarchies. It appears that choosing between the two may be merely a question of convenience of utilization, but investigations into this issue continue. (This difference may be a characteristic of the source of ontological knowledge: databases vs. other source material.) The immediate issue raised by this is that ontology merging or assembly will need to resolve questions of whether to sub-class a class from one partial ontology, or de-sub-class a corresponding collection of classes in the other, and how to detect this problem, i.e., identify which slot can be used as a sub-class type.

3.3 Rationalization

The term ‘rationalization’ is used here to mean removal of anomalies within a single ontology, such as slots with different names but playing the same role, multiple indistinguishable (or almost indistinguishable) sibling classes that are not specializations of their own distinguished abstract class, etc. Some such situations are justified and necessary, but where ontologies are generated automatically, it appears that numerous such anomalies may creep in.

4 Computational Approach

A computational method for solving the problems described earlier has been designed and partially implemented. The approach to combining the ontologies and resolving conflicts is reinforcement-based in that multiple heuristics are applied to detect candidates for merging, renaming and other operations. Instead of making suggestions to a user based on triggering single rules, the set of recommendations obtained by applying all applicable heuristics is presented to the user (as a list of positive or negative recommendations for possible actions); the user is expected to decide based on the evidence presented and considerations that may not be known to the computational recommender. The current set of heuristics, and the recommendations indicated by them, is described below:

Rule 1: Classes whose names are linguistically synonymous are suggested as candidates for merging. Distance between classes is measured in terms of the use of synonyms within class names. For example, two different ontologies contain ‘Bridge’ classes (the same word is used in each). Further, cognate terms are discovered by looking for meaningful synonyms within the class name. Figure 1 shows an instance of such cognate names (the different kinds of beacons). A merger recommendation is issued when this rule is triggered.

Rule 2: Class pairs which have a high proportion of slot names that are linguistically synonymous, and sufficiently low differences in the rest of their slots, are nominated as candidates for merger or alignment. As for Rule 1, distance between slot names is measured in terms of the appearance of synonyms.

Comparison of two classes $C_1$ and $C_2$ with slot sets $SL_1$ and $SL_2$ respectively returns a 3-tuple $(C, D_{12}, D_{21})$, where $C$ is a numeric value representing the degree of commonality of the slot sets, and $D_{12}$ and $D_{21}$ are numeric values representing the respective difference sets between the individual slot sets $SL_1$ and $SL_2$ and the union set $SL_1 \cup SL_2$ of all the slots for either class. For example, $D_{12}$ can be computed as the number of slots of $C_1$ that are not synonyms of slots of $C_2$. This computation is similar to that described by Chalupsky [2000], but uses individual elements instead of an all-round measure computed by combining the 3 numeric values.

Rule 2 recommends merger/alignment if $C > 0$ and both $D_{12}$ and $D_{21}$ fall below a threshold, where the threshold is chosen to minimize spurious positive recommendations.

Rule 3: Conceptual relatedness for class pairs is computed by comparing the class names using a lexicon of ‘included terms’, derived from the SDTS [FGDC, 1998]. This means that hypernym/hyponym relationships between terms within class names are included, in contrast to Rule 1, which uses synonyms. The reason is that the ‘included terms’ are expected to result in alignment operations instead of merger operations.

[Figure 2: Structure mismatches. One ontology has a Seabed class with subclasses Sand, Mud, and Rock; the other has a single Seabed class with a SeaBedType slot whose values are Sand, Mud, and Rocky.]

Similarity comparison in our heuristics is keyword based, in that it assumes (supported by human observation of the class and slot names) that names are of the general form {QualifyingTerm KeyTerm} (or AdjectivalPhrase Noun). Greater importance is given to the KeyTerm in computing semantic closeness, since the QualifyingTerm portion generally appears to define a sub-type of an abstract class denoted by KeyTerm. A consequent limitation is that special requirements on the internal structure of class and slot names must be imposed, and further, the heuristics produce spurious results in several cases.

Partial synonyms (complex names with synonymous key terms) are recommended as candidates for abstraction or merger, e.g., by merging their superclasses.

Rule 4: Concept similarity for class pairs is computed by comparing the names of their slots, using the same lexicon as before. The resultant recommendation suggests mergers of classes.

Rule 5: Sibling classes without unique slots, i.e., those that have only inherited slots, are examined. The implied solution is to merge the two into their parent class or introduce an intermediate class and add a type or equivalent slot to the immediate super class. (But see Rules 8 and 9 for possible reasons not to accept the recommendations generated by this rule.)

Rule 6: Subsumption relationships are detected by comparison of slot names as in Rule 2, but the implication and the conditions that must be satisfied by $C$, $D_{12}$, and $D_{21}$ are now: $C_1$ is a subclass of $C_2$ if $C > 0$, $D_{12} > 0$, and $D_{21} = 0$.

Rule 7: This heuristic is intended to detect structure mismatches of the type vs. subclass category described earlier. Figure 2 shows an instance of such a mismatch, arising from capturing the same information from different sources. Siblings $X_a, X_b, \ldots$ of class $X$ are compared to allowed value ranges for slot $S$ of class $Y$; if the allowed values for slot $S$ match (that is, are linguistically close to) the names of siblings $X_a, X_b, \ldots$, a structure mismatch is indicated. This rule is applicable when values are categorical variables. This rule detects the commonality between different classes, each corresponding to a sea floor characteristic type (sand, pebbles, etc.), combining them into a single feature with the sea-floor type as a slot.

Two further rules are being implemented; these operate not on the ontologies themselves, but on the knowledge base, methods used for accessing it, and its contents:

Rule 8: Determine how often the instances of a class are retrieved in isolation. If there are many requests for entities of a specific class, there may be implementation reasons for retaining the class as a unique class. This rule, of course, can be effectuated only after a study of actual use of the ontology.

Rule 9: Determine the population of instances for each concrete class, and compare with those for its siblings or merger candidates. If the population size is large, or if there is significant skew in the population of merger candidates, there may be implementation reasons (e.g., if instances are ultimately retrieved from a database) for retaining distinct classes. As with Rule 8, this heuristic can be investigated only after populating the underlying knowledge store (database, frames, etc.).

Rules 8 and 9 are expected to produce contra-indications when triggered, i.e., recommend against mergers or alignment.

Instead of applying rules individually and effecting their suggestions as detected, we use them to detect problems and suggest changes; the changes actually effectuated are expected to be those suggested by multiple rules, i.e., those supported by multiple forms of evidence.

5 Implementation

All but one of the ontologies extracted are currently in the format used by the Protégé tool. However, implementation of the rules above is currently ‘off-line’ as far as Protégé is concerned, that is, it is being done by a separate program that uses a translation of the ontologies into a different format. This was adopted due to the necessity of including the ontologies in a Web server back-end program for extraneous reasons (the question answering site mentioned earlier). Currently individual rules are applied to pairs of ontologies, and suggestions (and contra-indications) are printed for separate evaluation by a human user. Work on incorporating these rules into a Protégé plugin will commence shortly.

6 Related Work

Noy and Musen [1999; 2000] describe an algorithm and tool for merging ontologies in Protégé. Chalupsky [2000] describes OntoMorph, a tool for translating symbolic knowledge from one KR formalism to another, and describes ontology alignment in [Chalupsky et al., 1997]. Hovy [1998] describes a procedure for ontology alignment and heuristics for suggestions, including pattern matching on strings, hierarchy matching and data/form heuristics.

Ontology analysis and merging in Chimæra is described in [McGuiness et al., 2000]. Syntactic analysis of class and slot names, taxonomic resolution, and semantic evaluation (for example, slot/value type checking and domain-range mismatches) are also discussed.

All the current methods for ontology alignment and merging generally use linguistic methods of determining similarity for class and slot names, as is done in some of the heuristics described in Section 4 of this paper. Our approach appears to differ from those described in the form and utilization of the results of comparisons, and apparently also in the use of multi-criterion indicators/contra-indicators for suggesting operations as compared to computing a single score. Further, an additional heuristic is used for concept (class) linking, by comparing similarities between the member slots of classes. Structure mismatches are also mentioned by Chalupsky. Access convenience and instance population-based heuristics (Rules 8 and 9) have not been discussed in descriptions of ontology merging and alignment.

7 Conclusion

The source material described here constitutes in a sense a canon for the domain of maritime information, in that the collection is (except for the items in Section 2.5) normative and comprehensive for this domain. Based on our observations while deriving ontological knowledge from it, the following positions and hypotheses are put forward, admittedly on the basis of a single experience:

• No single source (standard, schema, etc.) will suffice for a reasonably complete computational ontology. This fairly tame conclusion has been remarked on by other groups, and leads to the next:

• No single type of source will suffice for learning a computational ontology; i.e., it will be necessary to include multiple kinds (structured, semi-structured, lexicon-like, etc.) of sources; further, after the possibilities of ‘organized’ or standardized sources have been exhausted, it will be necessary to fill in the gaps with inductions from unstructured or ‘free-form’ content; this means that no single means of ontology learning will suffice for a reasonably complete ontology.

• Ontological information extracted from different sources will be in qualitatively different structural forms; therefore, an attempt at combining these different sub-ontologies into an overall whole will need to resolve these structural differences before any other form of merging can be usefully applied.

• The above will hold even for a domain that has experienced significant organization and standardization efforts.

A computational approach for resolving anomalies in ontological knowledge that exhibits the characteristics mentioned above was also presented, and investigations into its use and applicability are ongoing.

Acknowledgments

The efforts of Helen Wu and Koi-Sang “Leo” Leong in entering ontological information and scraping ontological information from on-line sources are gratefully acknowledged. This work was partially supported by the National Science Foundation under grant EIA-9983267, NOAA, and the U.S. Coast Guard. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of these agencies.

References

[Chalupsky et al., 1997] H. Chalupsky, E. Hovy, and T. Russ. Progress on an automatic ontology alignment methodology, 1997. ksl-web.stanford.edu/onto-std/hovy/index.htm.

[Chalupsky, 2000] H. Chalupsky. Ontomorph: a translation system for symbolic knowledge. In A.G. Cohn, F. Giunchiglia, and B. Selman, editors, Principles of Knowledge Representation and Reasoning: Proceedings of the Seventh International Conference (KR2000), San Francisco, CA. Morgan Kaufmann, 2000.

[FGDC, 1998] FGDC. Spatial data transfer standard. Federal Geographic Data Committee, U. S. Geological Survey. Proposed standard, 1998.

[Grosso et al., 1999] W. E. Grosso, H. Eriksson, R. W. Fergerson, J. H. Gennari, S. W. Tu, and M. A. Musen. Knowledge modeling at the millennium (the design and evolution of Protégé-2000). Technical report, Stanford University, Institute for Medical Informatics, Stanford, CA, 1999. Technical Report SMI-1999-0801.

[Hovy, 1998] E.H. Hovy. Combining and standardizing large-scale, practical ontologies for machine translation and other uses. In Proceedings of the 1st International Conference on Language Resources and Evaluation (LREC). Granada, Spain, 1998.

[International Hydrographic Organization, 1996] International Hydrographic Organization. IHO transfer standards for digital hydrographic data, edition 3.0, 1996.

[Maloney, 1999] Elbert S. Maloney. Chapman Piloting: Seamanship and Boat Handling. Hearst Marine Books, New York, 63rd edition, 1999.

[McGuiness et al., 2000] D. McGuiness, R. Fikes, J. Rice, and S. Wilder. An environment for merging and testing large ontologies. In Proceedings of the Seventh International Conference on Principles of Knowledge Representation and Reasoning (KR2000), Breckenridge, Colorado, April 2000. Tech. report KSL-00-16, Knowledge Systems Laboratory, Stanford University.

[National Oceanic and Atmospheric Administration, 1997] National Oceanic and Atmospheric Administration. Chart no. 1: Nautical chart symbols, abbreviations, and terms, 1997.

[Noy and Musen, 1999] N. F. Noy and M. Musen. SMART: Automated support for ontology merging and alignment. In Twelfth Workshop on Knowledge Acquisition, Modeling, and Management, Banff, Canada, 1999.

[Noy and Musen, 2000] N. F. Noy and M. A. Musen. PROMPT: Algorithm and tool for automated ontology merging and alignment. Technical report, Stanford University, Institute for Medical Informatics, Stanford, CA, 2000. Technical Report SMI-2000-0831.
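
The slot-set comparison used by Rules 2 and 6 (Section 4) can be illustrated with a short Python sketch. It is not the program described in Section 5. The synonym table, the function names, and the example slot names are all hypothetical, and the threshold for Rule 2 is left as a parameter because the paper gives no value for it.

```python
from typing import Dict, Set, Tuple

# Hypothetical synonym lexicon mapping a slot name to a canonical term.
# The entries are invented for illustration only.
SYNONYMS: Dict[str, str] = {
    "colour": "color",
    "elevation": "height",
}


def canonical(name: str) -> str:
    """Normalize a slot name and replace it by its canonical synonym, if any."""
    key = name.strip().lower()
    return SYNONYMS.get(key, key)


def compare_slots(sl1: Set[str], sl2: Set[str]) -> Tuple[int, int, int]:
    """Return (C, D12, D21) for two slot sets.

    C   : number of canonical slot terms shared by both classes
    D12 : number of slots of class 1 with no synonym among the slots of class 2
    D21 : number of slots of class 2 with no synonym among the slots of class 1
    """
    k1 = {canonical(s) for s in sl1}
    k2 = {canonical(s) for s in sl2}
    c = len(k1 & k2)
    d12 = sum(1 for s in sl1 if canonical(s) not in k2)
    d21 = sum(1 for s in sl2 if canonical(s) not in k1)
    return c, d12, d21


def rule2_merge_candidate(c: int, d12: int, d21: int, threshold: int) -> bool:
    """Rule 2: some commonality and both differences below the threshold."""
    return c > 0 and d12 < threshold and d21 < threshold


def rule6_is_subclass(c: int, d12: int, d21: int) -> bool:
    """Rule 6: class 1 is a subclass of class 2 if C > 0, D12 > 0, D21 = 0."""
    return c > 0 and d12 > 0 and d21 == 0


# Illustrative slot sets (not taken from any of the source ontologies).
beacon = {"name", "colour", "height"}
cardinal_beacon = {"name", "color", "elevation", "orientation"}

result = compare_slots(cardinal_beacon, beacon)
print(result, rule6_is_subclass(*result))
```

Keeping the three values separate, rather than folding them into one score, mirrors the paper's remark that it uses individual elements instead of an all-round measure; the same tuple then serves both the merger test and the subsumption test.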
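
Rule 7 (Section 4) compares the names of sibling classes in one ontology with the allowed values of a categorical slot in another. The sketch below reproduces the Seabed example of Figure 2 under stated assumptions: "linguistically close" is approximated by a character-level similarity ratio from Python's standard `difflib` module, which is a stand-in chosen here and not the measure used in the paper, and the cutoff of 0.8 and the function names are likewise hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple


def close(a: str, b: str, cutoff: float = 0.8) -> bool:
    """Stand-in for 'linguistically close': case-insensitive string similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= cutoff


def match_values_to_siblings(
    siblings: Sequence[str], values: Sequence[str], cutoff: float = 0.8
) -> List[Tuple[str, str]]:
    """Pair each allowed slot value with the first sibling class name close to it."""
    pairs = []
    for value in values:
        for sibling in siblings:
            if close(sibling, value, cutoff):
                pairs.append((sibling, value))
                break
    return pairs


def structure_mismatch(
    siblings: Sequence[str], values: Sequence[str], cutoff: float = 0.8
) -> bool:
    """Indicate a type-vs-subclass mismatch when every allowed value of the
    slot corresponds to one of the sibling class names."""
    pairs = match_values_to_siblings(siblings, values, cutoff)
    return len(values) > 0 and len(pairs) == len(values)


# Figure 2: subclasses of Seabed in one ontology, SeaBedType values in the other.
seabed_subclasses = ["Sand", "Mud", "Rock"]
seabed_type_values = ["Sand", "Mud", "Rocky"]

print(match_values_to_siblings(seabed_subclasses, seabed_type_values))
print(structure_mismatch(seabed_subclasses, seabed_type_values))
```

Requiring every allowed value to find a matching sibling is a deliberately strict choice for this sketch; a looser variant could accept a partial overlap and report it as weaker evidence.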
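
The way recommendations from several rules are collected and shown together, with Rules 8 and 9 contributing contra-indications, can be sketched as follows. The `Recommendation` record, the action labels, and the class names are hypothetical; the sketch only groups evidence for a human reader and, in keeping with Section 6, does not compute a single combined score.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, Tuple[str, ...]]


@dataclass(frozen=True)
class Recommendation:
    rule: int                   # number of the rule that fired
    action: str                 # hypothetical labels such as "merge" or "align"
    classes: Tuple[str, str]    # the class pair the suggestion refers to
    positive: bool = True       # False marks a contra-indication


def collate(recs: Iterable[Recommendation]) -> Dict[Key, Dict[str, List[int]]]:
    """Group rule numbers by (action, class pair), keeping evidence for and
    against each possible operation in separate lists."""
    grouped: Dict[Key, Dict[str, List[int]]] = defaultdict(
        lambda: {"for": [], "against": []}
    )
    for rec in recs:
        key = (rec.action, tuple(sorted(rec.classes)))
        grouped[key]["for" if rec.positive else "against"].append(rec.rule)
    return dict(grouped)


def multiply_supported(
    grouped: Dict[Key, Dict[str, List[int]]], min_rules: int = 2
) -> List[Key]:
    """Operations suggested by at least `min_rules` distinct rules."""
    return [k for k, ev in grouped.items() if len(set(ev["for"])) >= min_rules]


# Invented example: two rules suggest a merger, Rule 9 argues against it.
recs = [
    Recommendation(1, "merge", ("Bridge", "Span")),
    Recommendation(2, "merge", ("Span", "Bridge")),
    Recommendation(9, "merge", ("Bridge", "Span"), positive=False),
    Recommendation(3, "align", ("Tide", "TideRace")),
]

evidence = collate(recs)
for key in multiply_supported(evidence):
    print(key, evidence[key])
```

The contra-indications are kept alongside the supporting rules instead of cancelling them out, so that the final decision stays with the user.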