=Paper= {{Paper |id=Vol-38/paper-9 |storemode=property |title=Acquisition of Ontological Knowledge from Canonical Documents |pdfUrl=https://ceur-ws.org/Vol-38/malaynaker.pdf |volume=Vol-38 |dblpUrl=https://dblp.org/rec/conf/ijcai/Malyankar01 }} ==Acquisition of Ontological Knowledge from Canonical Documents== https://ceur-ws.org/Vol-38/malaynaker.pdf

Acquisition of Ontological Knowledge from Canonical Documents
Raphael Malyankar
Dept. of Computer Science and Engineering
Arizona State University
Tempe, AZ 85287, USA.
E-mail: rmm@acm.org
(Position Paper)

Abstract

This paper describes experiences with quasi-automated creation of a computational ontology for maritime information from a mixed collection of source material. Based on these experiences, hypotheses and conclusions concerning the creation of computational ontologies for engineering and other technical or scientific domains are presented. Heuristics for resolving anomalies in ontologies generated from mixed sources are also described.

1 Introduction

This paper describes our experiences with ontology acquisition in the context of maritime information. Ontological information is acquired from multiple types of sources, including standards documents, database schemas, lexicons, collections of symbology definitions, and also by inference from semi-structured documents. This is followed by a description of the computational approach to rationalization, alignment, and merging of the ontological information derived from these sources. The computational ontology thus created is intended to be used in creating a Maritime Information Markup Language (MIML) for tagging of documents in this domain. An example of the kind of application that will be enabled is a question-answering system that extracts only necessary and relevant information from marked-up text documents.

The observations and heuristics described in this paper apply to domains (here, maritime information) where ontological knowledge must be acquired from different types of source material. It appears that in some domains, the sub-ontologies thus generated are likely to differ not only linguistically, but also in their topological profiles (i.e., depth and other structure). The heuristics described in this paper are designed for a computational approach to combining such sub-ontologies.

2 Sources of Ontological Knowledge

The sources used for ontological knowledge were selected from a canonical set, that is, they are documents that are accepted within the domain as normative and that are widely used.

2.1 Standards Documents

The most recent normative standard for digital nautical chart content is the S-57 standard [International Hydrographic Organization, 1996]. The ‘object catalog’ section of this document consists of a list of chart entities, definitions, and entity attributes, which gives us a collection (sic) of domain entities that can be considered canonical as far as the scope of the standard goes. Extraction from this ‘object catalog’ was automated by using graph traversal programs that exploit links between entities and attributes in the object catalog. The automated extraction resulted in 173 classes and 186 slots. A comparison of 10% (selected at random) of the extracted information with the original source indicated error rates of 8% to 20% (for different categories of ontological knowledge: classes/types/attributes). The additional effort needed to reduce this error in the automated portion of the extraction was not undertaken, as making the corrections by hand did not prove very laborious (about 10 hours for a non-expert who compared the extracted ontology with the original source).

A second source was the Spatial Data Transfer Standard [FGDC, 1998]. The parts we used were the sections that list ‘included terms’ (analogous to a synonym list) and attribute definitions. Extraction from this was less satisfactory in some ways, since these sections are less rigorous than the object catalog of the S-57 standard, but, on the other hand, the synonym list covers more of the terms used in practice.

While the S-57 standard is normative, there are two deficiencies involved in using it:

1. It is limited in scope. This standard covers only objects (entities) that are used in digital nautical charts. Important concepts such as weather conditions are not mentioned at all, and other concepts such as tides are mentioned only incidentally or in an implicit manner, for example in defining entity classes and as attribute qualifiers for entities (e.g., foreshore areas, the part of the shore covered and uncovered by tides).

2. It uses a restricted terminology, i.e., usually only one of multiple synonymous terms. The ‘missing’ terms are sometimes used in other documents and it is necessary to establish synonym relationships to facilitate understanding.

Further semantic structure is induced from lexical clues and attribute sets. The heuristics used for this induction process currently consist of lexical clues from the linguistic similarity of entity names and entity definitions, and comparison of attribute sets to compute measures of the semantic distance between attributes. For example, there are multiple ‘beacon’ objects in the object catalog (‘cardinal’ beacon, ‘danger’ beacon, etc.). Lexical comparison of the object names for these several classes, and of the descriptions associated with these classes (also scraped from the abovementioned object catalog), indicated the possibility of a ‘beacon’ class as a superclass for these several classes. This is further described in Section 4.

2.2 Databases and Schemas

The primary database we have used so far is the sample Digital Nautical Chart (DNC) data files available from NIMA. It has somewhat more semantic structure than the aforementioned standards, consisting as it does of feature classifications organized by ‘layers’, for example, environmental features, cultural features, land cover features, etc. (‘Feature’, as used in the domain, is equivalent to ‘class’.) Induction of ontological knowledge from this consisted of mapping the structure to a class hierarchy. This mapping was also done automatically from the schema for the database. It resulted in 134 classes, of which 118 are feature classes, 12 are coverage classes, and 4 are geographic structure type (point, line, area, or text) classes.

As with the S-57 standard, this database and schema covers only chart entities, and the terminology is even more restricted (and to some extent, more opaque) than the S-57 standard, due to the use of abbreviated names for entities and attributes, and the lack of textual definitions.

2.3 Lexicons and Symbology Definitions

A separate effort used Protégé [Grosso et al., 1999] and a standard collection of symbology definitions from NOAA’s Chart No. 1 [National Oceanic and Atmospheric Administration, 1997] to create an ontology of navigation aids, hazards, and other entities. Chart No. 1 is a collection of symbology for nautical charts accompanied by brief definitions of what each symbol stands for. It is organized semantically (in that related symbols are in the same section or subsection). This was supplemented with a widely popular publication on navigation and seamanship (Chapman Piloting [Maloney, 1999]) and an online dictionary of chart terms (discovered and used by the creator, a student unfamiliar with nautical terms). Ontology creation based on these documents consisted of manual entry of information using Protégé, due to the lack of electronic versions of the symbology definitions. Approximately 500 classes and 100 slots resulted from this effort, which was carried out by non-experts using the publications mentioned. (The paucity of slots is due to the nature of the documents, which contain little mention of details corresponding to symbols.)

2.4 Semi-Structured Normative Material

The United States Coast Pilot is a 9-volume series containing information that is important to navigators of US coastal waters (including the Great Lakes) but which cannot be included in a nautical chart. Each volume consists of 200 to 300 pages or more of two-column text in 10-point type. Included are photographs, diagrams, and small maps. The flow of text follows the coastline geographically, e.g., from north to south. This is a ‘lightly structured’ document, with each volume containing a preliminary chapter on navigation regulations (which includes a compendium of rules and regulations, specifications of environmentally protected zones, restricted areas, etc.), followed by chapters dealing with successive sectors of the coast. Each chapter is further divided into sections (still in geographical order); each section is further divided into sub-sections and paragraphs describing special hazards, recognizable landmarks, facilities, etc. The internal structure of subsections and paragraphs provides taxonomical hints, indicating, for example, which leaf entities are categorizable as sub-classes of weather conditions, as well as providing a small amount of additional taxonomical information that extends taxonomies derived from other classes (e.g., tide races as a form of navigational hazard). The Coast Pilot is normative (in the sense of using well-understood terms) and comprehensive. A version marked up with XML would have proved invaluable for ontology learning, but there is no such version available at this time.

2.5 Other Sources

Online content proved a useful and irreplaceable source of some information, especially attributes relating to weather data. Entry of this part was entirely manual. Other sources to be used include the Ports list and Light list, for information on port facilities and navigation aids respectively.

3 Alignment, Merging, and Rationalization

We have discovered that though there is a certain amount of duplication between the above sources, they are largely independent and produce different parts of the taxonomy for the maritime information domain as a whole, and sometimes different taxonomical structures for some parts of the domain. The need to merge and align the ontologies generated from the sources mentioned naturally arises, along with the need to reconcile conflicts between different ontologies. This section describes the major issues arising in combining different ontologies, and the techniques adopted to resolve them. In addition, we are using some of these heuristics to rationalize individual ontologies by detecting anomalies in their structure.

3.1 Alignment and Merging

There are at least two distinct taxonomic hierarchies in our source material: (i) a classification into point, area, or line features, and (ii) a different, natural, semantic hierarchy (natural in the sense that it is the categorization that a human tends to create). Item (i) is attributable to the original purpose of the standards document that produced such a taxonomy: it was intended for geographical information systems, and therefore its point of view is that of a computer graphics system instead of a knowledge-based system. Alignment of the ‘sub-ontologies’ consists of assembling a jigsaw puzzle in the sense of [Noy and Musen, 2000].
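[Figure: two class hierarchies. In the first, Navigation Aid has the direct subclasses Cardinal Beacon, Isolated Beacon, Lateral Beacon, and others. In the second, Navigation Aid has a subclass Beacon (among others), which in turn has the subclasses Cardinal Beacon, Isolated Beacon, and Lateral Beacon.]

Figure 1: Merging Similar Classes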

3.2 Resolution of Structure Mismatches

Another issue is structure mismatch, leading to what can be called the reification question: should a concept distinguishing two entities be made manifest through distinct values for a slot, or should the distinction be manifest as a type within the class (thus giving distinct sub-classes)? We have discovered that automated extraction from an object catalog or schema tends to produce shallow, bushy class hierarchies (i.e., it prefers translating distinctions into a range for an attribute slot), while manual creation tends to create deeper and less bushy type hierarchies. It appears that choosing between the two may be merely a question of convenience of utilization, but investigations into this issue continue. (This difference may be a characteristic of the source of ontological knowledge: databases vs. other source material.) The immediate issue raised by this is that ontology merging or assembly will need to resolve questions of whether to sub-class a class from one partial ontology, or de-sub-class a corresponding collection of classes in the other, and how to detect this problem, i.e., identify which slot can be used as a sub-class type.

3.3 Rationalization

The term ‘rationalization’ is used here to mean removal of anomalies within a single ontology, such as slots with different names but playing the same role, multiple indistinguishable (or almost indistinguishable) sibling classes that are not specializations of their own distinguished abstract class, etc. Some such situations are justified and necessary, but where ontologies are generated automatically, it appears that numerous such anomalies may creep in.

4 Computational Approach

A computational method for solving the problems described earlier has been designed and partially implemented. The approach to combining the ontologies and resolving conflicts is reinforcement-based in that multiple heuristics are applied to detect candidates for merging, renaming, and other operations. Instead of making suggestions to a user based on triggering single rules, the set of recommendations obtained by applying all applicable heuristics is presented to the user (as a list of positive or negative recommendations for possible actions); the user is expected to decide based on the evidence presented and considerations that may not be known to the computational recommender. The current set of heuristics, and the recommendations indicated by them, is described below:

Rule 1: Classes whose names are linguistically synonymous are suggested as candidates for merging. Distance between classes is measured in terms of the use of synonyms within class names. For example, two different ontologies contain ‘Bridge’ classes (the same word is used in each). Further, cognate terms are discovered by looking for meaningful synonyms within the class name. Figure 1 shows an instance of such cognate names (the different kinds of beacons). A merger recommendation is issued when this rule is triggered.

Rule 2: Class pairs which have a high proportion of slot names that are linguistically synonymous, and sufficiently low differences in the rest of their slots, are nominated as candidates for merger or alignment. As for Rule 1, distance between slot names is measured in terms of the appearance of synonyms.

Comparison of two classes C1 and C2 with slot sets SL1 and SL2 respectively returns a 3-tuple (C, D12, D21), where C is a numeric value representing the degree of commonality of the slot sets, and D12 and D21 are numeric values representing the respective difference sets between the individual slot sets SL1 and SL2 and the union set SL1 ∪ SL2 of all the slots for either class. For example, D12 can be computed as the number of slots of C1 that are not synonyms of slots of C2. This computation is similar to that described by Chalupsky [2000], but uses individual elements instead of an all-round measure computed by combining the 3 numeric values.

Rule 2 recommends merger/alignment if C > 0 and D12 and D21 are both below a threshold, where the threshold is chosen to minimize spurious positive recommendations.

Rule 3: Conceptual relatedness for class pairs is computed by comparing the class names using a lexicon of ‘included terms’, derived from the SDTS [FGDC, 1998]. This means that hypernym/hyponym relationships between terms within class names are included, in contrast to Rule 1, which uses synonyms. The reason is that the ‘included terms’ are expected to be likely to result in alignment operations instead of merger operations.
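The slot-set comparison behind Rule 2 could be sketched as follows. This is only an illustration under stated assumptions, not the paper's implementation: the synonym table `SYNONYMS`, the helper names, the use of simple counts for C, D12, and D21, and the default threshold value are all hypothetical choices made for the example.

```python
from typing import Dict, Set, Tuple

# Hypothetical synonym lexicon: each term maps to a canonical form.
# A real system would draw this from a lexicon such as the 'included terms'.
SYNONYMS: Dict[str, str] = {
    "height": "elevation",
    "colour": "color",
}


def canonical(name: str) -> str:
    """Lower-case a slot name and map it to its canonical synonym, if any."""
    key = name.strip().lower()
    return SYNONYMS.get(key, key)


def compare_slots(sl1: Set[str], sl2: Set[str]) -> Tuple[int, int, int]:
    """Return (C, D12, D21) for two slot sets.

    C   = number of canonical slot names the two classes share
    D12 = number of slots of class 1 with no synonym among the slots of class 2
    D21 = number of slots of class 2 with no synonym among the slots of class 1
    Using plain counts is an assumption; the paper only says 'numeric values'.
    """
    a = {canonical(s) for s in sl1}
    b = {canonical(s) for s in sl2}
    c = len(a & b)
    d12 = sum(1 for s in sl1 if canonical(s) not in b)
    d21 = sum(1 for s in sl2 if canonical(s) not in a)
    return c, d12, d21


def rule2_recommends(sl1: Set[str], sl2: Set[str], threshold: int = 2) -> bool:
    """Rule 2: recommend merger/alignment if C > 0 and both differences are small."""
    c, d12, d21 = compare_slots(sl1, sl2)
    return c > 0 and d12 < threshold and d21 < threshold
```

For example, comparing the slot sets `{"Height", "Colour", "Name"}` and `{"elevation", "color", "range"}` yields two shared slots and one unmatched slot on each side, so the pair would be nominated with the default threshold. Keeping the three values separate, rather than folding them into one score, mirrors the paper's remark that individual elements are used.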
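[Figure: two representations of the same information. In the first, class Seabed has the subclasses Sand, Mud, and Rock. In the second, class Seabed has a slot SeaBedType with the allowed values Sand, Mud, and Rocky.]

Figure 2: Structure mismatches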
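The kind of mismatch shown in Figure 2 is what Rule 7 in Section 4 is meant to detect: sibling class names in one ontology are compared with the allowed values of a categorical slot in another. A minimal sketch of such a check is given here. It assumes a crude character-level similarity (`difflib.SequenceMatcher`) as a stand-in for "linguistically close"; the function names, the cutoff, and the minimum matched fraction are hypothetical and are not taken from the paper.

```python
from difflib import SequenceMatcher
from typing import Iterable, List, Tuple


def close(a: str, b: str, cutoff: float = 0.7) -> bool:
    """Stand-in for 'linguistically close': character similarity ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= cutoff


def structure_mismatch(
    sibling_names: Iterable[str],
    allowed_values: Iterable[str],
    min_fraction: float = 0.5,
) -> List[Tuple[str, str]]:
    """Return (sibling, value) pairs if enough siblings match slot values.

    An empty list means no structure mismatch is indicated. The
    min_fraction requirement is an assumption added for this sketch.
    """
    siblings = list(sibling_names)
    values = list(allowed_values)
    pairs = [(s, v) for s in siblings for v in values if close(s, v)]
    matched = {s for s, _ in pairs}
    if siblings and len(matched) / len(siblings) >= min_fraction:
        return pairs
    return []
```

Applied to the Figure 2 example, the siblings Sand, Mud, and Rock would be paired with the slot values Sand, Mud, and Rocky, which signals that the two ontologies encode the same distinction in different structural forms.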

Similarity comparison in our heuristics is keyword based, in that it assumes (supported by human observation of the class and slot names) that names are of the general form {QualifyingTerm KeyTerm} (or AdjectivalPhrase Noun). Greater importance is given to the KeyTerm in computing semantic closeness, since the QualifyingTerm portion generally appears to define a sub-type of an abstract class denoted by KeyTerm. A consequent limitation is that special requirements on the internal structure of class and slot names must be imposed, and further, the heuristics produce spurious results in several cases.

Partial synonyms (complex names with synonymous key terms) are recommended as candidates for abstraction or merger, e.g., by merging their superclasses.

Rule 4: Concept similarity for class pairs is computed by comparing the names of their slots, using the same lexicon as before. The resultant recommendation suggests mergers of classes.

Rule 5: Sibling classes without unique slots, i.e., those that have only inherited slots, are examined. The implied solution is to merge the siblings into their parent class, or to introduce an intermediate class and add a type or equivalent slot to the immediate superclass. (But see Rules 8 and 9 for possible reasons not to accept the recommendations generated by this rule.)

Rule 6: Subsumption relationships are detected by comparison of slot names as in Rule 2, but the implication and the conditions that must be satisfied by C, D12, and D21 are now: C1 is a subclass of C2 if C > 0, D12 > 0, and D21 = 0.

Rule 7: This heuristic is intended to detect structure mismatches of the type vs. subclass category described earlier. Figure 2 shows an instance of such a mismatch, arising from capturing the same information from different sources. Siblings Xa, Xb, ... of class X are compared to allowed value ranges for slot S of class Y; if the allowed values for slot S match (that is, are linguistically close to) the names of siblings Xa, Xb, ..., a structure mismatch is indicated. This rule is applicable when values are categorical variables. This rule detects the commonality between different classes, each corresponding to a sea floor characteristic type (sand, pebbles, etc.), combining them into a single feature with the sea-floor type as a slot.

Two further rules are being implemented; these operate not on the ontologies themselves, but on the knowledge base, methods used for accessing it, and its contents:

Rule 8: Determine how often the instances of a class are retrieved in isolation. If there are many requests for entities of a specific class, there may be implementation reasons for retaining the class as a unique class. This rule, of course, can be applied only after a study of actual use of the ontology.

Rule 9: Determine the population of instances for each concrete class, and compare with those for its siblings or merger candidates. If the population size is large, or if there is significant skew in the population of merger candidates, there may be implementation reasons (e.g., if instances are ultimately retrieved from a database) for retaining distinct classes. As with Rule 8, this heuristic can be investigated only after populating the underlying knowledge store (database, frames, etc.).

Rules 8 and 9 are expected to produce contra-indications when triggered, i.e., recommend against mergers or alignment.

Instead of applying rules individually and effecting their suggestions as detected, we use them to detect problems and suggest changes; the changes actually carried out are expected to be those suggested by multiple rules, i.e., those supported by multiple forms of evidence.

5 Implementation

All but one of the ontologies extracted are currently in the format used by the Protégé tool. However, implementation of the rules above is currently ‘off-line’ as far as Protégé is concerned, that is, it is being done by a separate program that uses a translation of the ontologies into a different format. This was adopted due to the necessity of including the ontologies in a Web server back-end program for extraneous reasons (the question answering site mentioned earlier). Currently individual rules are applied to pairs of ontologies, and suggestions (and contra-indications) are printed for separate evaluation by a human user. Work on incorporating these rules into a Protégé plugin will commence shortly.
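The idea of acting only on changes supported by several rules, while still showing contra-indications to the user, could be sketched as below. This is a hypothetical illustration: the `Recommendation` record, the `summarize` function, and the `min_support` parameter are assumptions introduced for the example and do not describe the actual off-line program.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Recommendation:
    rule: int                  # number of the rule that fired
    action: str                # e.g. "merge" or "align"
    classes: Tuple[str, str]   # the class pair concerned
    positive: bool = True      # False marks a contra-indication (Rules 8, 9)


def summarize(
    recs: Iterable[Recommendation], min_support: int = 2
) -> Dict[Tuple[str, Tuple[str, ...]], Dict[str, List[int]]]:
    """Group rule outputs per (action, class pair) for review by a user.

    Only actions supported by at least `min_support` distinct rules are
    reported; contra-indicating rules are listed alongside, not applied.
    """
    grouped = defaultdict(lambda: {"for": [], "against": []})
    for r in recs:
        # Sort the pair so (A, B) and (B, A) count as the same candidate.
        key = (r.action, tuple(sorted(r.classes)))
        grouped[key]["for" if r.positive else "against"].append(r.rule)

    report = {}
    for key, evidence in grouped.items():
        supporting = sorted(set(evidence["for"]))
        if len(supporting) >= min_support:
            report[key] = {
                "for": supporting,
                "against": sorted(set(evidence["against"])),
            }
    return report
```

The report deliberately stops short of applying any change: consistent with the approach described above, the final decision is left to a human who may know of considerations the rules cannot see.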
6 Related Work

Noy and Musen [1999; 2000] describe an algorithm and tool for merging ontologies in Protégé. Chalupsky [2000] describes OntoMorph, a tool for translating symbolic knowledge from one KR formalism to another, and describes ontology alignment in [Chalupsky et al., 1997]. Hovy [1998] describes a procedure for ontology alignment and heuristics for suggestions, including pattern matching on strings, hierarchy matching, and data/form heuristics.

Ontology analysis and merging in Chimæra is described in [McGuiness et al., 2000]. Syntactic analysis of class and slot names, taxonomic resolution, and semantic evaluation (for example, slot/value type checking and domain-range mismatches) are also discussed.

All the current methods for ontology alignment and merging generally use linguistic methods of determining similarity for class and slot names, as is done in some of the heuristics described in Section 4 of this paper. Our approach appears to differ from those described in the form and utilization of the results of comparisons, and apparently also in the use of multi-criterion indicators/contra-indicators for suggesting operations, as compared to computing a single score. Further, an additional heuristic is used for concept (class) linking, by comparing similarities between the member slots of classes. Structure mismatches are also mentioned by Chalupsky. Access convenience and instance population-based heuristics (Rules 8 and 9) have not been discussed in descriptions of ontology merging and alignment.

7 Conclusion

The source material described here constitutes in a sense a canon for the domain of maritime information, in that the collection is (except for the items in Section 2.5) normative and comprehensive for the domain of maritime information. Based on our observations while deriving ontological knowledge from it, the following positions and hypotheses are put forward, admittedly on the basis of a single experience:

• No single source (standard, schema, etc.) will suffice for a reasonably complete computational ontology. This fairly tame conclusion has been remarked on by other groups, and leads to the next:

• No single type of source will suffice for learning a computational ontology; i.e., it will be necessary to include multiple kinds (structured, semi-structured, lexicon-like, etc.) of sources; further, after the possibilities of ‘organized’ or standardized sources have been exhausted, it will be necessary to fill in the gaps with inductions from unstructured or ‘free-form’ content; this means that no single means of ontology learning will suffice for a reasonably complete ontology.

• Ontological information extracted from different sources will be in qualitatively different structural forms; therefore, an attempt at combining these different sub-ontologies into an overall whole will need to resolve these structural differences before any other form of merging can be usefully applied.

• The above will hold even for a domain that has experienced significant organization and standardization efforts.

A computational approach for resolving anomalies in ontological knowledge that exhibits the characteristics mentioned above was also presented, and investigations into its use and applicability are ongoing.

Acknowledgments

The efforts of Helen Wu and Koi-Sang “Leo” Leong in entering ontological information and scraping ontological information from on-line sources are gratefully acknowledged. This work was partially supported by the National Science Foundation under grant EIA-9983267, NOAA, and the U.S. Coast Guard. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of these agencies.

References

[Chalupsky et al., 1997] H. Chalupsky, E. Hovy, and T. Russ. Progress on an automatic ontology alignment methodology, 1997. ksl-web.stanford.edu/onto-std/hovy/index.htm.

[Chalupsky, 2000] H. Chalupsky. OntoMorph: a translation system for symbolic knowledge. In A.G. Cohn, F. Giunchiglia, and B. Selman, editors, Principles of Knowledge Representation and Reasoning: Proceedings of the Seventh International Conference (KR2000), San Francisco, CA. Morgan Kaufmann, 2000.

[FGDC, 1998] FGDC. Spatial data transfer standard. Federal Geographic Data Committee, U.S. Geological Survey. Proposed standard, 1998.

[Grosso et al., 1999] W. E. Grosso, H. Eriksson, R. W. Fergerson, J. H. Gennari, S. W. Tu, and M. A. Musen. Knowledge modeling at the millennium (the design and evolution of Protege-2000). Technical report, Stanford University, Institute for Medical Informatics, Stanford, CA, 1999. Technical Report SMI-1999-0801.

[Hovy, 1998] E.H. Hovy. Combining and standardizing large-scale, practical ontologies for machine translation and other uses. In Proceedings of the 1st International Conference on Language Resources and Evaluation (LREC). Granada, Spain, 1998.

[International Hydrographic Organization, 1996] International Hydrographic Organization. IHO transfer standards for digital hydrographic data, edition 3.0, 1996.

[Maloney, 1999] Elbert S. Maloney. Chapman Piloting: Seamanship and Boat Handling. Hearst Marine Books, New York, 63rd edition, 1999.

[McGuiness et al., 2000] D. McGuiness, R. Fikes, J. Rice, and S. Wilder. An environment for merging and testing large ontologies. In Proceedings of the Seventh International Conference on Principles of Knowledge Representation and Reasoning (KR2000), Breckenridge, Colorado, April 2000. Tech. report KSL-00-16, Knowledge Systems Laboratory, Stanford University.

[National Oceanic and Atmospheric Administration, 1997] National Oceanic and Atmospheric Administration. Chart no. 1: Nautical chart symbols, abbreviations, and terms, 1997.

[Noy and Musen, 1999] N. F. Noy and M. Musen. SMART: Automated support for ontology merging and alignment. In Twelfth Workshop on Knowledge Acquisition, Modeling, and Management, Banff, Canada, 1999.

[Noy and Musen, 2000] N. F. Noy and M. A. Musen. PROMPT: Algorithm and tool for automated ontology merging and alignment. Technical report, Stanford University, Institute for Medical Informatics, Stanford, CA, 2000. Technical Report SMI-2000-0831.