Building a domain ontology from glossaries:
              a general methodology

             Loris Bozzato, Mauro Ferrari, and Alberto Trombetta

                   Dipartimento di Informatica e Comunicazione
                        Università degli Studi dell’Insubria
                        Via Mazzini 5, 21100, Varese, Italy


      Abstract. We propose a general methodology to build up a domain
      ontology from one or more domain glossaries. The particular feature of
      this methodology is in the parallel construction of a domain ontology
      and a complete domain terminology. In this paper we fully describe the
      methodology phases and we apply them to a real-world example from
      the medical domain.


1   Introduction

Nowadays, ontologies have become a relevant representation formalism and many
application domains are considering their adoption. This attention claims for
methods for reusing domain knowledge resources in the development of domain
ontologies. Accordingly, in this paper we discuss a general methodology to create
a domain ontology from one or more domain glossaries. Before explaining the
actual steps of the methodology, let us introduce some of the ideas that led to
its composition. Basically, the methodology is to be manually carried out by
the ontology developers with the help from domain experts. Some steps of the
methodology cannot be easily automatized because an explicit interpretation of
the semantics of terms is required as, e.g., in the initial phase of clustering. In
analogy to other proposals for the translation of thesauri [9, 20], the translation
is filtered by the interpretation of the domain experts. Hence, the proposed
methodology provides a way for defining a domain ontology from a domain
glossary and not a simple translation. A particular feature of our methodology
is that, while building the domain ontology, one also reconstructs a complete
terminology for the domain of interest. However, these two representations are
distinct: the terminology should not necessarily be complete with respect to
entities external to the domain (or to entities that are not essential or specific
for the examined domain), while the ontology should even represent concepts
without a specific name in the terminology and be complete with respect to the
represented domain. Another feature of our methodology is that it allows one to
build up a complete domain ontology (i.e., without undefined objects) starting
from a possibly incomplete glossary: this is obtained by a sort of saturation on the
glossary. We restrict our attention to input glossaries with unstructured terms,
as opposed to what is assumed, for example, in thesauri [9, 16, 20]. We define a
glossary as a list of (possibly lexicographically ordered) lemmas; each lemma is
composed by a term and a textual definition that provides the meaning of the
lemma; references to others lemmas can appear in the definition. Our definition
of glossary is coherent with the one of [1, 18] and the definition of lexicon in [8].


2     Example: an ontology for orthodontic terminology

In order to illustrate our methodology, we will employ a real-world motivating
example, taken from an ongoing collaborative effort between our Department
and the Department of Clinical and Biological Sciences. In fact, this project also
constituted the motivation for the definition of our methodology. The objective
of the project is in the definition and representation of a standard terminology
for orthodontics: the motivation for this research can be found in the complexity
of the orthodontic technical terminology and in the lack of a common reference
standard for such terminology.
    The complex relations among orthodontic terms suggested a structured rep-
resentation as an ontology: this representation allows to formally define semantic
relationships among domain concepts and explore implicit connections by auto-
mated inference. As for the representation formalism, we have employed OWL
(Web Ontology Language) [21]. Its choice has been dictated by the well-founded
semantics of the language and its wide acceptance as a de-facto standard.
    The basis for the development of the domain terminology has been identified
in the AAO Orthodontic Glossary [15], a simple yet quite comprehensive glossary
developed by the American Association of Orthodontists 1 . The particular form
of the initial data led to the present proposal for a general ontology development
methodology from one or more glossaries.
    In the next section, we will formulate a running example based on terms
extracted from the original glossary that covers the first steps of development of
the actual project ontology.


3     Methodology specification

In this section we present the phases of our methodology using the following
schema. For each phase we describe the required activities and the input and
output documents, possibly enriched with an example of their structure. In some
cases, we also provide some possible extensions or guidelines for the presented
step. In the following examples, starting from a minimal (and incomplete) frag-
ment of the initial glossary, we will concentrate on the development of the clus-
ter for AnatomicPart and we show how to combine this with the cluster for
AbnormalCondition in the conceptual schema.

1
    http://www.braces.org/


                                         2
Phase 1 – Clustering

Group terms of the original glossary in clusters. To define clusters, follow the is-a
direction from terms to clusters: that is, every term should be seen as instance
of a “type” defined by a cluster. Clusters do not have to be disjoint: however,
terms must be classified in their most specific clusters. Some cluster inclusions
can be suggested in this phase.

Input: original glossary (one or more).
    Example:
    Gingiva The tissue that surrounds the teeth, consisting of a fibrous tissue
       that is continuous with the periodontal ligament and mucosal covering.
    Ligament, periodontal See periodontal ligament.

Output: table of clusters. A grouping by cluster of the terms from the original
  glossary.
    Example:
               Cluster       Anatomic Part
                             Element that is part of the human
               Description
                             anatomy.
               Elements      Term                   Source
                             Gingiva               O
                             Ligament, periodontal O
                             Periodontal ligament O


An alternative way to define clusters of terms can be represented by a special-
ization of a general superstructure: the structure could be either a foundational
ontology (as, for example, the well-known DOLCE ontology [11]) or a general
high level ontology that is specific for the domain of interest.
    As for the structure of the documents, the table of clusters should contain
a textual definition for each cluster, in order to clarify the intended meaning of
the grouping. It is also desirable to maintain a column of information about the
source glossary of each term. When suggesting inclusions between clusters, this
should be specified in the table of clusters by indicating the superclusters of each
cluster.


Phase 2 – Saturation

Find every term in the definitions that does not appear as a lemma in the
glossary. Add the new terms to the glossary, providing them with a description:
the description can also be a reference (e.g., see...) to present terms. Classify the
new terms in the previously found clusters.

Input: original glossary, table of clusters.

                                         3
Output: updated table of clusters, updated glossary.
    Example:
               Cluster       Anatomic Part
                             Element that is part of the human
               Description
                             anatomy.
               Elements      Term                   Source
                             Gingiva               O
                             Ligament, periodontal O
                             Periodontal ligament O
                             Tissue                A
                             Tooth                 A


In the output document for this phase it is useful to keep track of the added
terms: in the given example, in the Source column we specify with “A” that a
term has been added, and with “O” if it belongs to the original terms. During
this phase, it is possible to enrich the original glossary with other terms not
contained in the original glossary.


Phase 3 – Relationship identification

Find relationships between terms, extracted from term definitions. Proceed a
cluster at a time, searching for the distinctive relations of each cluster. Classify
the relations under their respective cluster. An early informal description of the
meaning and features (e.g., intended range and domain) of each relation should
also be given.

Input: glossary, table of clusters.
Output: table of relations. Analogously to the table of clusters, it contains a
   grouping of the identified relations for each cluster and a description of their
   meaning.
    Example:
                                  Anatomic Part
    Relation         Range           Domain          Description
    hasContinuity    Anatomic Part Anatomic Part Continuity between two a-
                                                 natomic parts.
    isSupportOf      Anatomic Part Anatomic Part Relates a part with its sup-
                                                 porting parts.
    isSurroundingOf Anatomic Part Anatomic Part Relates a part with its sur-
                                                rounding parts.
    isPartOf         Anatomic Part Anatomic Part Relates a part to the parts
                                                 it is direct component of.


                                         4
Phase 4 – Disambiguation
Group terms with their synonyms (e.g., see...). Determine, among the synonyms,
the preferred term to represent each concept.
Input: table of clusters.
Output: table of concepts. Refinement of the table of clusters in which terms
   are grouped to their synonyms.
    Example:
                                      Anatomic Part
                       Concept              Synonym
                       Gingiva
                       Periodontal ligament Ligament, periodontal
                       Tissue
                       Tooth


Phase 5 – Class grouping
Find concepts that represent subclasses of the cluster they are in. Group concepts
belonging to the new found classes. In general, try to create a partition of the
cluster, possibly adding new classes at the same level of the found subclasses.
Input: table of concepts.
Output: table of concepts, updated with class groupings.
    Example:
                                      Anatomic Part
               Class        Concept              Synonym
               Tissue       Gingiva
                            Periodontal ligament Ligament, periodontal
               (Generic) Tooth


Phase 6 – Conceptual modelling
Using a graphical modelling formalism (such as E-R diagrams or UML models),
proceed through the following steps:
1. Skeleton schema: group clusters in areas. Find abstract relationships between
   them, by generalization of the previously found relations.
2. Refinement: separately develop every area, starting from the subclasses found
   in the previous phase. Define inclusions between classes and relations and
   refine features and restrictions of the relations.
3. Integration: build the final model by combining area schemas and refining
   relationships among areas.


                                            5
Input: table of concepts, table of relations.
Output: ontology conceptual model. A graphical representation of the concep-
   tual structure of the domain ontology.
    Example:
    A skeleton schema for the clusters AnatomicPart and AbnormalCondition
    is shown in Figure 1. In Figure 2 we show the refinement for the class
    AnatomicPart and for its relation hasContinuity. By joining this with the
    refinement for AbnormalCondition (not shown), we obtain the final
    integration schema in Figure 3.

We do not put a constraint on the graphical language used to develop this phase,
as long as it is enough usable and expressive to develop the ontology conceptual
model: in our examples we have adopted an UML metamodel inspired to the
one of [2].

Phase 7 – Schema representation
Represent class and property hierarchies in the final representation language
(OWL, description logics).
Input: ontology conceptual model.
Output: ontology schema. Final representation of the class and property tax-
   onomies.

Phase 8 – Ontology representation
Represent class instances as individuals; encode textual definitions for each con-
cept by means of property values and restrictions.
Input: ontology conceptual model, ontology schema, glossary.
Output: domain ontology. Final representation of the domain ontology, com-
   plete with information about classes, relations and individuals.
The previous phase should be formalized to explain how to encode the tex-
tual definitions in terms of relations and logical connectives. This formalization
clearly depends on the expressivity of the final representation language and its
constructors.

Phase 9 – Annotation
Annotate each class and individual with information derived from glossary: the
most relevant information is preferred term, synonyms, textual definition.
Input: domain ontology, table of concepts, glossary.
Output: annotated domain ontology.
    Example:
    An example of a domain ontology enriched with an external annotation is
    given in Figure 4.


                                         6
                   Fig. 1. Skeleton schema


Fig. 2. Refinement for AnatomicPart and relation hasContinuity


                  Fig. 3. Integration schema


                              7
             Fig. 4. Example of a domain ontology external annotation


With the last example we have shown a possible way to annotate the final
domain ontology: we propose to include the annotations in a separate ontology,
in which the description of each concept appear as an individual. For every
description, the information about the described concept (such as its preferred
term and synonyms) is linked as datatype properties. In this way, it is easy to
uniformly associate different descriptions for each concept (for example, distinct
by language or target users), regardless of the representation of the concept as
a class, individual or property.


4     Related works

As ontologies have been applied in the modeling of domains of ever increasing
complexity, a principled approach to their design is necessary. As such, the field
of ontology engineering has become a rich and articulated research area (see, for
example, the effort in the development of a general methodology for ontology
networks in the EU funded project NeOn 2 ). We give a brief outline of the field
in this section: for more complete and detailed surveys, see [6, 13, 18].
    Most of the proposals of general methodologies for ontology development are
inspired by a top-down approach: well-known examples are the DOLCE method-
ology [11] and Ontology Development 101 [12]. A slightly different approach is
taken by Methontology [7], which differentiates the activities revolving around
ontology development not only on a technical view, but also in the managing and
supporting activities. Such attention on collateral activities is shared for example
with OTKM [19], which develops the idea of ontology requirements specification
2
    http://www.neon-project.org/


                                         8
document (ORSD). Recent methodology proposals, like DILIGENT [14], also
introduce methods for ontology development in a collaborative setting.
    Such general approaches are very interesting in that they show the benefits
of a structured and domain independent approach when dealing with ontology
development. However, it is widely accepted that taking into account relevant
domain features from the beginning of the ontology development process yields
more effective results: as such, more specialized, domain dependent methodolo-
gies have been proposed. Moreover, as emphasized in [18], the reuse of existing
domain knowledge in ontology development allows to reduce the effort for the
representation of the domain and permits to build ontologies based on already
consensuated and well accepted knowledge sources. In fact, also from what has
been noted in Section 1, our work can be located in the ontology engineering
subfield dealing with non ontological resource reuse and reengineering, in par-
ticular in the definition of methodologies for reengineering a given set of (more
or less structured) vocabularies or thesauri. Notable works in this area are [9,
16, 20], while in the related field of lexicon to ontology translation we cite [3–5,
10]. In a wider scope, by the definitions given in [18], this work can be seen as a
proposal for a non ontological resource reengineering pattern for glossaries: this
view also suggest that our methodology can be “plugged-in” in a more general
framework for ontology development and lifecycle.


5   Conclusion and future works

In this paper we have discussed a methodology for the construction of a domain
ontology on the basis of one or more (possibly incomplete) glossaries: we have
also discussed how this methodology allows to build a complete terminology for
the domain of interest. As a motivating example, we have applied it to a medical
domain case. In the following, we summarize some of the directions for future
work, both for the original project and for the methodology refinement. As for
the orthodontic terminology development project, our next step is to complete
the ontology with terms not yet considered, also by solving some current repre-
sentation issues. Before proceeding to the publication and diffusion of the final
ontology, the quality of the ontology has to be checked by domain experts. Also
to help in this publication and valuation activity, we are currently developing a
web application for the consultation of the ontology: the application will allow
users to explore the ontology structure starting from terms in the final glossary.
Our future work in the methodology refinement should include a formal defi-
nition for transformations from textual descriptions of concepts to their logic
representation. Similarly, we need also to define formal definitions for methods
for extracting relations from textual descriptions of concepts. Other interesting
directions of work concern the analysis on the potential automatization of some
phases and, on the ontology evaluation side, the study of the formal properties
of the resulting ontologies. The first can involve the use of NLP techniques for
clustering or saturation phases. Both automatization and evaluation can benefit
from the application of our methodology in different domains.

                                         9
References
 1. ISO 1087-1:2000. Terminology. Vocabulary. Part 1: Theory and application.
 2. S. Brockmans and P. Haase. A Metamodel and UML Profile for Networked On-
    tologies - A Complete Reference. Technical report, Universität Karlsruhe (TH),
    Apr. 2006.
 3. P. Buitelaar, M. Sintek, and M. Kiesel. A Multilingual/Multimedia Lexicon Model
    for Ontologies. In ESWC 2006, pages 502–513, 2006.
 4. F. Busa, N. Calzolari, A. Lenci, and J. Pustejovsky. Building a Semantic Lexi-
    con: Structuring and Generating Concepts. In Computing Meaning, pages 29–51.
    Kluwer Academic Publishers, 2001.
 5. P. Cimiano, P. Haase, M. Herold, M. Mantel, and P. Buitelaar. LexOnto: A Model
    for Ontology Lexicons for Ontology-based NLP. In Proceedings of the OntoLex07
    Workshop, held in conjunction with ISWC’07, 2007.
 6. M. Cristiani and R. Cuel. Methodologies for the Semantic Web: state-of-art of
    ontology methodology. SIGSEMIS Bulletin, 1(2):103–112, 2004.
 7. M. Fernández-López, A. Gómez-Pérez, and N. Juristo. METHONTOLOGY: From
    Ontological Art Towards Ontological Engineering. In Spring Symposium on Onto-
    logical Engineering of AAAI, pages 33–40, 1997.
 8. G. Hirst. Ontology and the Lexicon. In Staab and Studer [17], pages 209–230.
 9. E. Hyvönen. Miksi asiasanastot eivät riitä vaan tarvitaan ontologioita? (why the-
    sauri are not enough but ontologies are needed?). Tietolinja, (2), 2005. In Finnish.
10. A. Lenci. Building an Ontology for the Lexicon: Semantic Types and Word Mean-
    ing. In Ontology-Based Interpretation of Noun Phrases, pages 103–120, 2001.
11. C. Masolo, S. Borgo, A. Gangemi, N. Guarino, and A. Oltramari. Ontologies library
    (final). WonderWeb Deliverable D18, ISTC-CNR, Padova, Italy, Dec. 2003.
12. N. F. Noy and D. McGuinness. Ontology development 101: A guide to creating
    your first ontology. Technical Report KSL-01-05, Stanford KSL, 2001.
13. H. S. Pinto and J. P. Martins. Ontologies: How can They be Built? Knowl. Inf.
    Syst., 6(4):441–464, 2004.
14. H. S. Pinto, S. Staab, and C. Tempich. DILIGENT: Towards a fine-grained method-
    ology for Distributed, Loosely-controlled and evolving Engineering of oNTologies.
    In ECAI 2004, pages 393–397, 2004.
15. D. R. Poulton, C. J. Burstone, T. M. Graber, L. E. Keso, and D. L.
    Turpin. AAO Orthodontic Glossary. On-line version: http://www.braces.org/
    healthcareprofessionals/dentists/glossary/.
16. D. Soergel, B. Lauser, A. Liang, F. Fisseha, J. Keizer, and S. Katz. Reengineering
    Thesauri for New Applications: The AGROVOC Example. Journal of Digital
    Information, 4(4), 2004.
17. S. Staab and R. Studer, editors. Handbook on Ontologies. International Handbooks
    on Information Systems. Springer, 2004.
18. M. C. Suárez-Figueroa. D5.4.1. NeOn Methodology for Building Contextualized
    Ontology Networks. NeOn Project Deliverable D5.4.1/v1.0, NeOn, Feb. 2008.
19. Y. Sure, S. Staab, and R. Studer. On-To-Knowledge Methodology (OTKM). In
    Staab and Studer [17], pages 117–132.
20. M. van Assem, M. R. Menken, G. Schreiber, J. Wielemaker, and B. J. Wielinga.
    A Method for Converting Thesauri to RDF/OWL. In ISWC 2004, volume 3298 of
    LNCS, pages 17–31. Springer, 2004.
21. F. van Harmelen and D. L. McGuinness. OWL Web Ontology Language
    Overview. W3C Recommendation, W3C, Feb. 2004. http://www.w3.org/TR/
    2004/REC-owl-features-20040210/.


                                          10