Acquisition of Ontological Knowledge from Canonical Documents
                                                 Raphael Malyankar
                                      Dept. of Computer Science and Engineering
                                                Arizona State University
                                               Tempe, AZ 85287, USA.
                                                E-mail: rmm@acm.org
                                                    (Position Paper)

Abstract

This paper describes experiences with quasi-automated creation of a computational ontology for maritime information from a mixed collection of source material. Based on these experiences, hypotheses and conclusions concerning the creation of computational ontologies for engineering and other technical or scientific domains are presented. Heuristics for resolving anomalies in ontologies generated from mixed sources are also described.

1 Introduction

This paper describes our experiences with ontology acquisition in the context of maritime information. Ontological information is acquired from multiple types of sources, including standards documents, database schemas, lexicons, collections of symbology definitions, and also by inference from semi-structured documents. This is followed by a description of the computational approach to rationalization, alignment, and merging of the ontological information derived from these sources. The computational ontology thus created is intended to be used in creating a Maritime Information Markup Language (MIML) for tagging of documents in this domain. An example of the kind of application that will be enabled is a question-answering system that extracts only necessary and relevant information from marked-up text documents.

The observations and heuristics described in this paper apply to domains (here, maritime information) where ontological knowledge must be acquired from different types of source material. It appears that in some domains, the sub-ontologies thus generated are likely to differ not only linguistically, but also in their topological profiles (i.e., depth and other structure). The heuristics described in this paper are designed for a computational approach to combining such sub-ontologies.

2 Sources of Ontological Knowledge

The sources used for ontological knowledge were selected from a canonical set, that is, they are documents that are accepted within the domain as normative and are widely used.

2.1 Standards Documents

The most recent normative standard for digital nautical chart content is the S-57 standard [International Hydrographic Organization, 1996]. The ‘object catalog’ section of this document consists of a list of chart entities, definitions, and entity attributes, which gives us a collection (sic) of domain entities that can be considered canonical as far as the scope of the standard goes. Extraction from this ‘object catalog’ was automated by using graph traversal programs that exploit links between entities and attributes in the object catalog. The automated extraction resulted in 173 classes and 186 slots. A comparison of 10% (selected at random) of the extracted information with the original source indicated error rates of 8% to 20% (for different categories of ontological knowledge: classes, types, and attributes). The additional effort needed to reduce this error in the automated portion of the extraction was not undertaken, as it proved not to be a very laborious task to make the corrections by hand (about 10 hours for a non-expert who compared the extracted ontology with the original source).

A second source was the Spatial Data Transfer Standard [FGDC, 1998]. The parts we used were the sections that list ‘included terms’ (analogous to a synonym list) and attribute definitions. Extraction from this was less satisfactory in some ways, since these sections are less rigorous than the object catalog of the S-57 standard, but, on the other hand, the synonym list covers more of the terms used in practice.

While the S-57 standard is normative, there are two deficiencies involved in using it:

  1. It is limited in scope. This standard covers only objects (entities) that are used in digital nautical charts. Important concepts such as weather conditions are not mentioned at all, and other concepts such as tides are mentioned only incidentally or in an implicit manner, for example in defining entity classes and as attribute qualifiers for entities (e.g., foreshore areas, the part of the shore covered and uncovered by tides).

  2. It uses a restricted terminology, i.e., usually only one of multiple synonymous terms. The ‘missing’ terms are sometimes used in other documents and it is necessary to establish synonym relationships to facilitate understanding.

Further semantic structure is induced from lexical clues and attribute sets. The heuristics used for this induction process currently consist of lexical clues from the linguistic similarity of entity names and entity definitions, and comparison of attribute sets to compute measures of the semantic distance between attributes. For example, there are multiple “beacon” objects in the object catalog (“cardinal” beacon, “danger” beacon, etc.). Lexical comparison of the object names for these several classes, and of the descriptions associated with these classes (also scraped from the above-mentioned object catalog), indicated the possibility of a ‘beacon’ class as a superclass for these several classes. This is further described in Section 4.

2.2 Databases and Schemas

The primary database we have used so far is the sample Digital Nautical Chart (DNC) data files available from NIMA. It has somewhat more semantic structure than the aforementioned standards, consisting as it does of feature classifications organized by ‘layers’, for example, environmental features, cultural features, land cover features, etc. (‘Feature’, as used in the domain, is equivalent to ‘class’.) Induction of ontological knowledge from this consisted of mapping the structure to a class hierarchy. This mapping was also done automatically from the schema for the database. It resulted in 134 classes, of which 118 are feature classes, 12 are coverage classes, and 4 are geographic structure type (point, line, area, or text) classes.

As with the S-57 standard, this database and schema covers only chart entities, and the terminology is even more restricted (and to some extent, more opaque) than the S-57 standard, due to the use of abbreviated names for entities and attributes, and the lack of textual definitions.

2.3 Lexicons and Symbology Definitions

A separate effort used Protégé [Grosso et al., 1999] and a standard collection of symbology definitions from NOAA’s Chart No. 1 [National Oceanic and Atmospheric Administration, 1997] to create an ontology of navigation aids, hazards, and other entities. Chart No. 1 is a collection of symbology for nautical charts accompanied by brief definitions of what each symbol stands for. It is organized semantically (in that related symbols are in the same section or subsection). This was supplemented with a widely popular publication on navigation and seamanship (Chapman Piloting [Maloney, 1999]) and an online dictionary of chart terms (discovered and used by the creator, a student unfamiliar with nautical terms). Ontology creation based on these documents consisted of manual entry of information using Protégé, due to the lack of electronic versions of the symbology definitions. Approximately 500 classes and 100 slots resulted from this effort, which was carried out by non-experts using the publications mentioned. (The paucity of slots is due to the nature of the documents, which contain little mention of details corresponding to symbols.)

2.4 Semi-Structured Normative Material

The United States Coast Pilot is a 9-volume series containing information that is important to navigators of US coastal waters (including the Great Lakes) but which cannot be included in a nautical chart. Each volume consists of 200 to 300 pages or more of two-column text in 10-point type. Included are photographs, diagrams, and small maps. The flow of text follows the coastline geographically, e.g., from north to south. This is a ‘lightly structured’ document: each volume begins with a preliminary chapter of navigation regulations (which includes a compendium of rules and regulations, specifications of environmentally protected zones, restricted areas, etc.), followed by chapters dealing with successive sectors of the coast. Each chapter is further divided into sections (still in geographical order); each section is further divided into sub-sections and paragraphs describing special hazards, recognizable landmarks, facilities, etc. The internal structure of subsections and paragraphs provides taxonomical hints, indicating, for example, which leaf entities are categorizable as sub-classes of weather conditions, as well as providing a small amount of additional taxonomical information that extends taxonomies derived from other classes (e.g., tide races as a form of navigational hazard). The Coast Pilot is normative (in the sense of using well-understood terms) and comprehensive. A version marked up with XML would have proved invaluable for ontology learning, but there is no such version available at this time.

2.5 Other Sources

Online content proved a useful and irreplaceable source of some information, especially attributes relating to weather data. Entry of this part was entirely manual. Other sources to be used include the Ports list and Light list, for information on port facilities and navigation aids respectively.

3 Alignment, Merging, and Rationalization

We have discovered that though there is a certain amount of duplication between the above sources, they are largely independent and produce different parts of the taxonomy for the maritime information domain as a whole, and sometimes different taxonomical structures for some parts of the domain. The need to merge and align the ontologies generated from the sources mentioned naturally arises, along with the need to reconcile conflicts between different ontologies. This section describes the major issues arising in combining different ontologies, and the techniques adopted to resolve them. In addition, we are using some of these heuristics to rationalize individual ontologies by detecting anomalies in their structure.

3.1 Alignment and Merging

There are at least two distinct taxonomic hierarchies in our source material: (i) a classification into point, area, or line features, and (ii) a different, natural, semantic hierarchy (natural in the sense that it is the categorization that a human tends to create). Item (i) is attributable to the original purpose of the standards document that produced such a taxonomy: it was intended for geographical information systems and therefore its point of view is that of a computer graphics system instead of a knowledge-based system. Alignment of the ‘sub-ontologies’ consists of assembling a jigsaw puzzle in the sense of [Noy and Musen, 2000].

    Navigation Aid                      Navigation Aid
      Cardinal Beacon                     Beacon
      Isolated Beacon                       Cardinal Beacon
      Lateral Beacon                        Isolated Beacon
      .....                                 Lateral Beacon
                                          .....

                                                 Figure 1: Merging Similar Classes
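
The transformation in Figure 1 can be illustrated with a minimal Python sketch. It is not the program used in this work: it assumes that the last word of a multi-word class name is its key term (the {QualifyingTerm KeyTerm} form discussed in Section 4), and the function name `propose_superclasses`, the `min_group` parameter, and the sample sibling `Lighthouse` are hypothetical.

```python
from collections import defaultdict


def propose_superclasses(children, min_group=2):
    """Group sibling class names by their last word (assumed key term).

    Returns {key_term: [member names]} for each key term shared by at
    least `min_group` siblings. A key term that already exists as a
    sibling class is skipped, since no new class would be needed.
    """
    groups = defaultdict(list)
    for name in children:
        words = name.split()
        if len(words) > 1:  # single-word names carry no qualifying term
            groups[words[-1]].append(name)
    return {
        key: sorted(members)
        for key, members in groups.items()
        if len(members) >= min_group and key not in children
    }


# Siblings under 'Navigation Aid' as in the left-hand tree of Figure 1;
# 'Lighthouse' is an illustrative extra sibling, not taken from the sources.
siblings = ["Cardinal Beacon", "Isolated Beacon", "Lateral Beacon", "Lighthouse"]
print(propose_superclasses(siblings))
```

Each proposed key term corresponds to an intermediate class (here ‘Beacon’) that a user could choose to insert between the parent and the grouped siblings.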

3.2 Resolution of Structure Mismatches

Another issue is structure mismatch, leading to what can be called the reification question: should a concept distinguishing two entities be made manifest through distinct values for a slot, or should the distinction be manifest as a type within the class (thus giving distinct sub-classes)? We have discovered that automated extraction from an object catalog or schema tends to produce shallow, bushy class hierarchies (i.e., it prefers translating distinctions into a range for an attribute slot), while manual creation tends to create deeper and less bushy type hierarchies. It appears that choosing between the two may be merely a question of convenience of utilization, but investigations into this issue continue. (This difference may be a characteristic of the source of ontological knowledge: databases vs. other source material.) The immediate issue raised by this is that ontology merging or assembly will need to resolve questions of whether to sub-class a class from one partial ontology, or de-sub-class a corresponding collection of classes in the other, and how to detect this problem, i.e., identify which slot can be used as a sub-class type.

3.3 Rationalization

The term ‘rationalization’ is used here to mean removal of anomalies within a single ontology, such as slots with different names but playing the same role, multiple indistinguishable (or almost indistinguishable) sibling classes that are not specializations of their own distinguished abstract class, etc. Some such situations are justified and necessary, but where ontologies are generated automatically, it appears that numerous such anomalies may creep in.

4 Computational Approach

A computational method for solving the problems described earlier has been designed and partially implemented. The approach to combining the ontologies and resolving conflicts is reinforcement-based in that multiple heuristics are applied to detect candidates for merging, renaming, and other operations. Instead of making suggestions to a user based on triggering single rules, the set of recommendations obtained by applying all applicable heuristics is presented to the user (as a list of positive or negative recommendations for possible actions); the user is expected to decide based on the evidence presented and considerations that may not be known to the computational recommender. The current set of heuristics, and the recommendations indicated by them, is described below:

Rule 1: Classes whose names are linguistically synonymous are suggested as candidates for merging. Distance between classes is measured in terms of the use of synonyms within class names. For example, two different ontologies contain ‘Bridge’ classes (the same word is used in each). Further, cognate terms are discovered by looking for meaningful synonyms within the class name. Figure 1 shows an instance of such cognate names (the different kinds of beacons). A merger recommendation is issued when this rule is triggered.

Rule 2: Class pairs which have a high proportion of slot names that are linguistically synonymous, and sufficiently low differences in the rest of their slots, are nominated as candidates for merger or alignment. As for Rule 1, distance between slot names is measured in terms of the appearance of synonyms.

Comparison of two classes C1 and C2 with slot sets SL1 and SL2 respectively returns a 3-tuple (C, D12, D21), where C is a numeric value representing the degree of commonality of the slot sets and D12 and D21 are numeric values representing the respective difference sets between the individual slot sets SL1 and SL2 and the union set SL1 ∪ SL2 of all the slots for either class. For example, D12 can be computed as the number of slots of C1 that are not synonyms of slots of C2. This computation is similar to that described by Chalupsky [2000], but uses individual elements instead of an all-round measure computed by combining the 3 numeric values.

Rule 2 recommends merger/alignment if C > 0 and D12 and D21 are both below a threshold, where the threshold is chosen to minimize spurious positive recommendations.

Rule 3: Conceptual relatedness for class pairs is computed by comparing the class names using a lexicon of ‘included terms’, derived from the SDTS [FGDC, 1998]. This means that hypernym/hyponym relationships between terms within class names are included, in contrast to Rule 1, which uses synonyms. The reason is that the ‘included terms’ are expected to be likely to result in alignment operations instead of merger operations.
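
The slot-set comparison behind Rule 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used here: synonymy is approximated by a hand-written lookup table, C is taken to be the number of shared slots, and the function names, the default `threshold`, and the sample slot names are all hypothetical.

```python
def canonical(name, synonyms):
    """Map a slot name to a canonical form using an assumed synonym table."""
    key = name.strip().lower()
    return synonyms.get(key, key)


def compare_slots(sl1, sl2, synonyms=None):
    """Return (C, D12, D21) for two slot-name collections.

    C   = number of slots common to both classes (after synonym mapping)
    D12 = number of slots of class 1 with no synonym among slots of class 2
    D21 = number of slots of class 2 with no synonym among slots of class 1
    """
    synonyms = synonyms or {}
    s1 = {canonical(s, synonyms) for s in sl1}
    s2 = {canonical(s, synonyms) for s in sl2}
    return len(s1 & s2), len(s1 - s2), len(s2 - s1)


def rule2_merge_candidate(c, d12, d21, threshold=2):
    """Rule 2 condition: some commonality, both differences below threshold."""
    return c > 0 and d12 < threshold and d21 < threshold


# Hypothetical slot names and synonym table, for illustration only.
SYNONYMS = {"elevation": "height", "colour": "color"}
light_beacon = ["elevation", "color", "name", "light characteristic"]
beacon = ["height", "colour", "name"]

c, d12, d21 = compare_slots(light_beacon, beacon, SYNONYMS)
print((c, d12, d21))                       # (3, 1, 0)
print(rule2_merge_candidate(c, d12, d21))  # True
```

Returning the three values separately, rather than folding them into one score, keeps them available to other rules that test C, D12, and D21 individually.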

    Seabed                              Seabed
      Sand                                SeaBedType: Sand, Mud, Rocky
      Mud
      Rock

                                                    Figure 2: Structure mismatches
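
The mismatch in Figure 2 can be detected mechanically by comparing the names of sibling classes in one ontology with the allowed values of a categorical slot in the other, which is the check described as Rule 7 in Section 4. The Python sketch below is illustrative only: it substitutes the standard-library `difflib.SequenceMatcher` string ratio for the linguistic closeness measure, and the function names and the `cutoff` and `min_fraction` parameters are assumptions.

```python
from difflib import SequenceMatcher


def close(a, b, cutoff=0.8):
    """Crude stand-in for linguistic closeness: case-insensitive string ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= cutoff


def structure_mismatch(sibling_names, slot_values, cutoff=0.8, min_fraction=0.5):
    """Compare sibling class names with the allowed values of a slot.

    Returns (flag, matched_values); flag is True when at least
    `min_fraction` of the slot values are close to some sibling name.
    """
    if not slot_values:
        return False, []
    matched = [
        value for value in slot_values
        if any(close(value, name, cutoff) for name in sibling_names)
    ]
    return len(matched) / len(slot_values) >= min_fraction, matched


# The two representations shown in Figure 2.
subclasses_of_seabed = ["Sand", "Mud", "Rock"]
seabed_type_values = ["Sand", "Mud", "Rocky"]

print(structure_mismatch(subclasses_of_seabed, seabed_type_values))
```

A positive result only indicates that the two ontologies encode the same distinction differently; whether to keep the sub-classes or the slot remains a decision for the user.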

Similarity comparison in our heuristics is keyword based, in that it assumes (supported by human observation of the class and slot names) that names are of the general form {QualifyingTerm KeyTerm} (or AdjectivalPhrase Noun). Greater importance is given to the KeyTerm in computing semantic closeness, since the QualifyingTerm portion generally appears to define a sub-type of an abstract class denoted by KeyTerm. A consequent limitation is that special requirements on the internal structure of class and slot names must be imposed, and further, the heuristics produce spurious results in several cases.

Partial synonyms (complex names with synonymous key terms) are recommended as candidates for abstraction or merger, e.g., by merging their superclasses.

Rule 4: Concept similarity for class pairs is computed by comparing the names of their slots, using the same lexicon as before. The resultant recommendation suggests mergers of classes.

Rule 5: Sibling classes without unique slots, i.e., those that have only inherited slots, are examined. The implied solution is to merge the two into their parent class or introduce an intermediate class and add a type or equivalent slot to the immediate super class. (But see Rules 8 and 9 for possible reasons not to accept the recommendations generated by this rule.)

Rule 6: Subsumption relationships are detected by comparison of slot names as in Rule 2, but the conditions on C, D12, and D21, and the implication drawn from them, are now: C1 is a subclass of C2 if C > 0, D12 > 0, and D21 = 0.

Rule 7: This heuristic is intended to detect structure mismatches of the type vs. subclass category described earlier. Figure 2 shows an instance of such a mismatch, arising from capturing the same information from different sources. Siblings Xa, Xb, ... of class X are compared to allowed value ranges for slot S of class Y; if the allowed values for slot S match (that is, are linguistically close to) the names of siblings Xa, Xb, ..., a structure mismatch is indicated. This rule is applicable when values are categorical variables. This rule detects the commonality between different classes, each corresponding to a sea floor characteristic type (sand, pebbles, etc.), combining them into a single feature with the sea-floor type as a slot.

Two further rules are being implemented; these operate not on the ontologies themselves, but on the knowledge base, methods used for accessing it, and its contents:

Rule 8: Determine how often the instances of a class are retrieved in isolation. If there are many requests for entities of a specific class, there may be implementation reasons for retaining the class as a unique class. This rule, of course, can be applied only after a study of actual use of the ontology.

Rule 9: Determine the population of instances for each concrete class, and compare with those for its siblings or merger candidates. If the population size is large, or if there is significant skew in the population of merger candidates, there may be implementation reasons (e.g., if instances are ultimately retrieved from a database) for retaining distinct classes. As with Rule 8, this heuristic can be investigated only after populating the underlying knowledge store (database, frames, etc.).

Rules 8 and 9 are expected to produce contra-indications when triggered, i.e., recommend against mergers or alignment.

Instead of applying rules individually and effecting their suggestions as detected, we use them to detect problems and suggest changes; the changes actually made are expected to be those suggested by multiple rules, i.e., those supported by multiple forms of evidence.

5 Implementation

All but one of the ontologies extracted are currently in the format used by the Protégé tool. However, implementation of the rules above is currently ‘off-line’ as far as Protégé is concerned, that is, it is being done by a separate program that uses a translation of the ontologies into a different format. This was adopted due to the necessity of including the ontologies in a Web server back-end program for extraneous reasons (the question answering site mentioned earlier). Currently individual rules are applied to pairs of ontologies and suggestions (and contra-indications) printed for separate evaluation by a human user. Work on incorporating these rules into a Protégé plugin will commence shortly.
6 Related Work

Noy and Musen [1999; 2000] describe an algorithm and tool for merging ontologies in Protégé. Chalupsky [2000] describes OntoMorph, a tool for translating symbolic knowledge from one KR formalism to another, and describes ontology alignment in [Chalupsky et al., 1997]. Hovy [1998] describes a procedure for ontology alignment and heuristics for suggestions, including pattern matching on strings, hierarchy matching, and data/form heuristics.

Ontology analysis and merging in Chimæra is described in [McGuinness et al., 2000]. Syntactic analysis of class and slot names, taxonomic resolution, and semantic evaluation (for example, slot/value type checking and domain-range mismatches) are also discussed.

All the current methods for ontology alignment and merging generally use linguistic methods of determining similarity for class and slot names, as is done in some of the heuristics described in Section 4 of this paper. Our approach appears to differ from those described in the form and utilization of the results of comparisons, and apparently also in the use of multi-criterion indicators/contra-indicators for suggesting operations as compared to computing a single score. Further, an additional heuristic is used for concept (class) linking, by comparing similarities between the member slots of classes. Structure mismatches are also mentioned by Chalupsky. Access convenience and instance population-based heuristics (Rules 8 and 9) have not been discussed in descriptions of ontology merging and alignment.

7 Conclusion

The source material described here constitutes in a sense a canon for the domain of maritime information, in that the collection is (except for the items in Section 2.5) normative and comprehensive for the domain of maritime information. Based on our observations while deriving ontological knowledge from it, the following positions and hypotheses are put forward, admittedly on the basis of a single experience:

  • No single source (standard, schema, etc.) will suffice for a reasonably complete computational ontology. This fairly tame conclusion has been remarked on by other groups, and leads to the next:

  • No single type of source will suffice for learning a computational ontology; i.e., it will be necessary to include multiple kinds (structured, semi-structured, lexicon-like, etc.) of sources; further, after the possibilities of ‘organized’ or standardized sources have been exhausted, it will be necessary to fill in the gaps with inductions from unstructured or ‘free-form’ content; this means that no single means of ontology learning will suffice for a reasonably complete ontology.

  • Ontological information extracted from different sources will be in qualitatively different structural forms; therefore, an attempt at combining these different sub-ontologies into an overall whole will need to resolve these structural differences before any other form of merging can be usefully applied.

  • The above will hold even for a domain that has experienced significant organization and standardization efforts.

A computational approach for resolving anomalies in ontological knowledge that exhibits the characteristics mentioned above was also presented, and investigations into its use and applicability are ongoing.

Acknowledgments

The efforts of Helen Wu and Koi-Sang “Leo” Leong in entering ontological information and scraping ontological information from on-line sources are gratefully acknowledged. This work was partially supported by the National Science Foundation under grant EIA-9983267, NOAA, and the U.S. Coast Guard. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of these agencies.

References

[Chalupsky et al., 1997] H. Chalupsky, E. Hovy, and T. Russ. Progress on an automatic ontology alignment methodology, 1997. ksl-web.stanford.edu/onto-std/hovy/index.htm.

[Chalupsky, 2000] H. Chalupsky. OntoMorph: a translation system for symbolic knowledge. In A.G. Cohn, F. Giunchiglia, and B. Selman, editors, Principles of Knowledge Representation and Reasoning: Proceedings of the Seventh International Conference (KR2000), San Francisco, CA. Morgan Kaufmann, 2000.

[FGDC, 1998] FGDC. Spatial data transfer standard. Federal Geographic Data Committee, U. S. Geological Survey. Proposed standard, 1998.

[Grosso et al., 1999] W. E. Grosso, H. Eriksson, R. W. Fergerson, J. H. Gennari, S. W. Tu, and M. A. Musen. Knowledge modeling at the millennium (the design and evolution of Protégé-2000). Technical report, Stanford University, Institute for Medical Informatics, Stanford, CA, 1999. Technical Report SMI-1999-0801.

[Hovy, 1998] E.H. Hovy. Combining and standardizing large-scale, practical ontologies for machine translation and other uses. In Proceedings of the 1st International Conference on Language Resources and Evaluation (LREC). Granada, Spain, 1998.

[International Hydrographic Organization, 1996] International Hydrographic Organization. IHO transfer standards for digital hydrographic data, edition 3.0, 1996.

[Maloney, 1999] Elbert S. Maloney. Chapman Piloting: Seamanship and Boat Handling. Hearst Marine Books, New York, 63rd edition, 1999.

[McGuinness et al., 2000] D. McGuinness, R. Fikes, J. Rice, and S. Wilder. An environment for merging and testing large ontologies. In Proceedings of the Seventh International Conference on Principles of Knowledge Representation and Reasoning (KR2000), Breckenridge, Colorado, April 2000. Tech. report KSL-00-16, Knowledge Systems Laboratory, Stanford University.

[National Oceanic and Atmospheric Administration, 1997] National Oceanic and Atmospheric Administration. Chart no. 1: Nautical chart symbols, abbreviations, and terms, 1997.

[Noy and Musen, 1999] N. F. Noy and M. Musen. SMART: Automated support for ontology merging and alignment. In Twelfth Workshop on Knowledge Acquisition, Modeling, and Management, Banff, Canada, 1999.

[Noy and Musen, 2000] N. F. Noy and M. A. Musen. PROMPT: Algorithm and tool for automated ontology merging and alignment. Technical report, Stanford University, Institute for Medical Informatics, Stanford, CA, 2000. Technical Report SMI-2000-0831.