Use of shared lexical resources for efficient
                ontological engineering

     Antonio Jimeno-Yepes1 , Ernesto Jiménez-Ruiz2 , Rafael Berlanga2 , and
                       Dietrich Rebholz-Schuhmann1
                         1
                         European Bioinformatics Institute,
             Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
                           {yepes,rebholz}@ebi.ac.uk
                   2
                     Dept. of Computer Systems and Languages,
                            Universitat Jaume I, Spain,
                          {ejimenez,berlanga}@uji.es


       Abstract This paper is intended to approach one of the main problems
       in ontology engineering: the lack of a shared terminology. Nowadays there
       exists several biomedical ontologies describing overlapping domains, but
       there is not a clear correspondence between the concepts that are sup-
       posed to be equivalent or just similar. These resources are quite precious
       but their integration and further development are expensive. Termino-
       logical or lexical resources may support the ontological development in
       several stages of the lifecycle of the ontology including ontology integra-
       tion and the labeling of concepts. In this paper we investigate the use of
       lexical resources during the ontology lifecycle using the example of the
       Health-e-Child (HeC) project. We claim that the proper creation and
       use of a shared lexicon is a cornerstone for the successful application of
       the Semantic Web technology within life sciences.


1     Introduction

Large domain ontologies are emerging from collaborative efforts in the life sci-
ences, being its main aim to achieve the interoperability among the different
research resources by assuming a common conceptualization. These resources
mainly consist of both domain ontologies and terminological resources (e.g. the-
sauri), which allow researchers to process, store and share the ever increasing
knowledge derived from their experiments. So far, these two kinds of resources
have usually lived apart, being its later integration a very hard task. However,
some exceptions exist where the lexicon is integrated with a semantic network
(e.g. the Unified Medical Language System3 ). In this paper, we show that both
cases have serious drawbacks.
    Instead, we propose a loose coupling between the domain ontologies and a
unique lexicon. Along this paper we show that the use and maintenance of such
a shared lexicon will enable both a better integration of domain ontologies with
3
    http://www.nlm.nih.gov/research/umls/
existing lexical resources and the proper evolution of the lexicon according to
these ontologies. We claim that the use of a shared lexicon will ease some of the
problems present during the development of ontologies and the interoperability
of the ontologies.
    In this paper we assume that domain ontologies and lexicons have different
purposes, and therefore they cannot be treated with the same techniques nor
simply merged into a common resource. A lexicon consists of a compendium of
words enriched with information of its usage [1], being concern with the linguis-
tic properties of words. We may encounter as well the term terminology, which
is usually referred as a specialized lexicon [2]. Instead, a domain ontology is an
explicit specialization of a conceptualization [3]. Domain ontologies have much
more specific purposes than lexicons, as their intended consumers are computer
applications rather than humans. Thus, domain ontologies do not need to care
about variants and syntactic categories of the terms they use. In addition, the
specific purpose of the ontologies motivates the development of different ontolo-
gies that still can label the concepts based on a shared lexicon.
    Regarding the semantically equivalent groups defined in a lexicon (e.g. the-
saurus entries, synsets, etc.), they also present significant differences with respect
to the concepts of domain ontologies. These semantic groups do not offer a clear
cut on their meaning as in an ontology where the concepts present disjoint inter-
pretations. Rather lexicons present fuzzy frontiers to allow the slightly different
interpretations humans can express with them.
    In Figure 1 we have ordered the existing formalisms (denoted by boxes)
according to their semantic expressiveness. Existing biomedical resources are
placed to their closer formalism. Genuine lexical resources are placed closer to
the left of the diagram like the Biolexicon[4], that contains terminology from
several resources with some linguistic relevant information. We find as well the
UMLS Specialist lexicon that has been used within several NLP and text mining
applications. Closer to the limit between a lexicon and an ontology we find
several resources that include links between lexical entries (e.g. UNIPROT).
More complex resources lie in between the definition of ontology and lexicon like
the NCI Metathesaurus, MeSH, ICD, the UMLS Metathesaurus and the OBO
ontologies that account for more complex representations similar to semantic
networks. Finally, at the end of the spectrum we find more formal ontologies
such as FMA or Galen, which expresses stronger semantics. Unfortunately, these
formal ontologies usually lack of lexical entries. As mentioned before, the aim of
this paper is to approach the problem of making these resources interoperable.
    The selected examples and use cases presented in this paper come from the
the application domain of the EC FP6 Health-e-Child (HeC) project [7] that
aims to develop an integrated health care platform for European paediatrics and
decision support tools to access personalized health information. HeC project is
mainly focused in paediatric heart diseases, inflammatory diseases (e.g. Juvenile
Idiopathic Arthritis) and brain tumours.
    The paper is organized as follows. Section 2 presents the ontology lifecycle
steps and motivates the relevance of having and using a shared lexicon/thesaurus.
                                                 Concepts and Formal Ontology
       Terms & Linguistic Relationships

                      Simple                                                          Complex
                                      Terminology              Frames
                     Taxonomy                                                          Logics
  Weak                          Terminologia
                    HGNC         Anatomica UMLS
Semantics                                                                    GALEN
               UMLS Lex          WordNet ICD         UMLS SN             (Grail - OWL)


               BioLexicon UniProtKB     NCI      OBO            FMA SNOMED
                          Taxonomy                             (Protégé)        Strong
                                                                             (KRSS)
                  UniProt                     MesH             (Frames)
                                                                                 Semantics
    Terms/                                      Semantic              Description
                             Thesauri
    Glossary                                    Network                 Logics


               Figure 1. Adapted Ontology Spectrum based on [5, 6, 2]


A discussion about current efforts, limitations and desired requirements for a
shared lexicon are presented in Section 3, moreover the main lexicon engineering
techniques are introduced. Section 4 comments the main experiences carried out
within the HeC domain. Finally some conclusions are given in Section 5.


2     The Role of Lexicons in the Ontology Lifecycle
Lexical forms present in available resources can be used for labeling ontological
concepts. The reuse of these labels in different ontologies in combination with
a proper definition of the ontological concepts may enable better integration of
ontologies. This section is intended to show the main problems that experts,
knowledge engineers and ontology engineers find in the different stages of the
lifecycle of the ontology development and how the use of a shared lexicon could
ease these problems.
    In this paper we adopt the METHONTOLOGY methodology [8] to illustrate
how a shared lexicon can help the development of an ontology and vice versa.
METHONTOLOGY proposes several steps for the lifecycle of an ontology: Re-
quirements Specification, Knowledge Acquisition, Conceptualization, Integration
with top ontologies, Implementation, Evaluation and Evolution/Maintenance.
As Figure 2 shows, the shared lexicon interacts with almost all the development
phases. Moreover, external resources like domain protocols, domain ontologies
and research articles will also play an important role as sources of knowledge.
In the following subsections we describe in detail the role of the lexicon at each
development phase.

2.1    Requirements Specification
Within the objectives of the HeC project, an ontology to describe a kind of
arthritis called JIA (Juvenile Idiopathic Arthritis) requires to be created. This
Figure 2. The Lexicon within the Ontology Life Cycle. Solid arrows represent an
essential role, whereas dashed arrows mean auxiliary role.


ontology is intended to represent the involved knowledge in JIA by means of
different levels of granularity: molecular (e.g. genomic and proteomic data), cel-
lular (e.g. results of blood tests), tissue (e.g. synovial fluid tests), organ (e.g.
affected joints), body (e.g. damage index, rheumatology examinations, treat-
ments), population (e.g. epidemiological studies). The purpose of this multilevel
representation is to give a complete characterization of the different JIA sub-
types in order to provide a rich ontological layer to the HeC System. This se-
mantic layer will be applied in Query Enhancement over the patient data, and
in the Decision Support Systems. JIA is a rare kind of Arthritis and there is
not yet a consensus about its classification nor even its name [9]. So far, three
classification schemes have been proposed, namely: ACR (American College of
Rheumatology), which uses Juvenile Rheumatoid Arthritis (JRA) as preferred
name and proposes three disease subtypes, EULAR (European League Against
Rheumatism), which opts for Juvenile Chronic Arthritis (JCA) and proposes
six disease subtypes, and finally ILAR (International League of Associations for
Rheumatology) which prefers JIA and proposes eight subtypes. In this stage, a
classification criterion should be chosen and the initial set of terms for describing
the disease and subtypes must be defined. Clearly, the use of a lexicon would
make easier the selection of terms (synonyms) for labeling the desired concepts.


2.2   Knowledge Acquisition

The knowledge acquisition in HeC is based on a set of medical protocols (in
[10] several techniques to automatically extract the main concepts from HeC
acquisition protocols are proposed) and the correspondent specifications of the
mentioned classification criterion. Each subtype of JIA is characterized by af-
fecting different set and number of joints, the occurrence of some symptoms like
fever or rash, the laboratory tests that are analysed, the different treatments
that are applied, etc. The development of the ontology from scratch would im-
ply the conceptualization of the different joints of the body, the classification
of the drugs for the treatments, the characterization of the different laboratory
tests, etc. Nevertheless this knowledge is already well known by the community
(unlike JIA) and it is assumed to be already defined in the available biomedi-
cal ontologies. As far as we know, the NCI thesaurus4 , the GALEN ontology5 ,
and the OBO ontologies6 contains information that is relevant to JIA such as
descriptions of diseases, drugs, laboratory tests, cells, human anatomy, etc.
     The reuse of knowledge represented in ontologies (see [11] for a survey) could
be interesting due to the following reasons: (a) developers save time through
reusing existing ontologies rather than writing their own; (b) the used knowl-
edge is commonly accepted by the community and used in similar applications;
(c) developers are not always experts in all the areas covered by a concrete
disease (i.e. drug classification). However, in practice important drawbacks arise
when merging ontologies. In this case, Ontology Matching 7 should be performed,
that is, to discover the correspondences between entities of the different ontolo-
gies. This task is rather hard [12] since in most cases there is not a common
nomenclature for the entity names. String matching techniques could provide an
approximate results in some cases like “NCI:Juvenile Rheumatoid Arthritis”
and “Galen:JuvenileArthritis”. However in other examples like “DiseaseOntol-
ogy:Chronic Childhood Arthritis” 8 additional knowledge should be provided in
other to establish the matching between concepts.
     Additionally, Semantic Compatibility should be also taken into account. Once
the lexical correspondence between concepts has been established, the ontologies
(or ontology modules) can be merged. At this point new challenges about the
semantic compatibility between the ontology axioms (e.g. unsatisfiability when
merging) arise, but they are outside the scope of this paper.
     Currently there exist several efforts in the creation of large biomedical on-
tologies. However, it seems they are evolving in a rather independent way. We
can understand that the conceptualization and formalization evolve with respect
to the specific requirements of a specific application, but the used nomenclature
should be shared. For example, the use of the concept Chronic Childhood Arthri-
tis 9 could vary between different domain ontologies, but the used term (JIA, JCA
or JRA) should refer to the same entity in the domain. The reuse of terms for
labels from a shared lexicon (e.g. UMLS) will relax considerably the required
4
  NCI thesaurus: ftp://ftp1.nci.nih.gov/pub/cacore/EVS/NCI_Thesaurus
5
  GALEN Ontology: http://www.co-ode.org/galen
6
  Open Biomedical Ontologies: http://www.obofoundry.org/
7
  Ontology Matching Initiative: http://www.ontologymatching.org/
8
  Disease Ontology is an example of ontology enriched with synonyms coming from
  shared thesauri like UMLS or ICD. However, we propose that ontologies should only
  maintain a link to the correspondent thesaurus, see Section 3.4
9
  Chronic Childhood Arthritis is the preferred concept name in UMLS
matching tasks between ontologies. Ontology concepts could use any preferred
nomenclature (no spaces, use of hyphens, acronyms, short expressions, etc.) but
they will be annotated with a unique concept interpretation, that is, they will
point to an entry in the shared lexicon.
    As commented earlier, knowledge acquisition can require merging different
sources and ontologies. METHONTOLOGY already proposes the creation of a
glossary to enrich ontologies with synonyms and definitions in order to facilitate
the integration with other resources. Undoubtedly, this proposal should be kept,
but we should go further by making such a glossary available to all community.

2.3   Conceptualization
As commented above, the granularity of the ontology will be connected to the
purposes of the application, in this sense, the same entry in a lexicon could
have different interpretations within different ontologies. This characteristic is
related to the localized semantics proposed in [13], in which the concept context
is defined as local models representing a partial or concrete view of the domain.
For our purposes the concepts and theory treated in [13] are rather complex but
the general idea of the local use of a shared concept is important. For example,
following the mentioned classification criteria, the concept Chronic Childhood
Arthritis may have the interpretations given in axioms 1 to 3.


         ACR : JRA ≡ SystemicJRA t PolyArticularJRA t PauciarticularJRA            (1)
      EULAR : JCA ≡ SystemicJCA t PolyArticularJCA t PauciarticularJCA t
                      Juvenile Psoriatic Arth. t Juvenile Ankylosing Spondylitys   (2)
         ILAR : JIA ≡ SystemicJIA t PolyArticularJIA t OligoarticularJIA t
                      Psoriatic Arthritis t Enthesisrelated Arthritis              (3)


    Such interpretations may belong to three different JIA ontologies used, prob-
ably, for different application purposes. If in some moment they are required to
be integrated in only one ontology (perhaps a consensus is achieved and it is
established a unique classification criteria) the matching between labels (terms)
would be easier if a shared lexicon was used to annotate the concepts (i.e. Chronic
Childhood Arthritis ≡ JIA ≡ JCA ≡ JRA). The semantic integration, as com-
mented in Section 2.2, will depend on the compatibility of the used axioms within
the conceptualization and formalization of the merged JIA ontologies.
    It is worth mentioning that the design requirements of an ontology may in-
volve concepts with labels that are not present in most of the available lexicons.
For example, not all the subtypes of JIA are properly described in UMLS. As
commented, lexicons will help ontologies to use a common nomenclature, but
ontologies will also help lexicons to evolve. In general, ontologies will require a
finer granularity than the initially expressed by lexicons and will demand the
necessity of new concepts given the specific requirements of the domain. Obvi-
ously, a new challenge arises, that is, how to maintain consensual and shared
lexicons up-to-date with respect to the new specific ontologies and their evolu-
tions. Additionally, hypernym relationships within the lexicon may be useful in
order to check the coherence of the ontology conceptualization, that is, it may
be helpful to consider desired subsumptions (e.g. JIA v Systemic Disease) or
even to avoid non desired ones (e.g. JIA v N on Systemic Disease).


2.4     Evolution and Maintenance

The evolution and maintenance (addition of new concepts, the deletion of obso-
lete ones, the re-structuring of the already defined concepts, the addition of new
facts, etc.) of an ontology may be produced due to different reasons: requirements
changed, the domain has changed (e.g. new facts were discovered) or the point
of view of the domain changed (e.g. use of a different classification criterion).
The evolutions will imply to come back to previous steps in order to acquire
new knowledge and to integrate this knowledge within the ontology. Again, the
lexicon will play a key point providing the concepts necessities when possible or
being updated with new ontology requirements in order to keep up-to-date for
further ontology demands.
    In biomedicine the change and extension of the domain evolves quickly. Pub-
lications represent an important source of brand new facts of domain knowledge.
For example Medline10 indexes more than 800,000 new journal papers per year
containing the last research done in more than 700 topics. Text mining tech-
niques try to identify within the text concepts and facts relating them. These
techniques usually use domain lexicons in order to detect interesting entities
within text. However several studies (e.g. [14]) have already shown that the link
between the most relevant biomedical resources and the literature is not obvi-
ous. This is not only due to the complexity of the required matching algorithms
but also due to the decouple of the ontology/lexicon development effort and the
literature. In an important number of cases current lexical resources (in Section
3.1 some examples are given) do not provide useful synonyms to be detected
within the text. In order to overcome these problems lexicons should better se-
lect the synonyms to characterize their concepts considering, at least, the lexical
variants used in texts.


3      Towards a Thesaurus for Life Sciences

We have presented the relevance of the lexicon in the ontology lifecycle and how
this lexicon could be updated accordingly. Basically, the lexicon will provide
the necessary terminology required for the existing concepts. In case there is no
entry (e.g. subtypes of JIA) for a concept, the current process may suggest the
creation of this new entry.
    The proper creation of new entries will require the selection of the appropriate
terms (i.e. preferred name and synonyms). These terms may be provided by a
10
     Medline: http://medlineplus.gov/
community effort, where several domain experts study the appropriate set of
terms, and/or using natural language processing (NLP) and text mining [15]
to extract such terms from the literature [16]. A proposal for automatic term
management (ATM) can be found in [17]. This approach identifies three modules.
The first module is about automatic term recognition, which identifies lexical
structures that can be mapped to domain concepts. The second module implies
term structuring for identifying relevant relations or term associations, mainly
by using classification and clustering techniques. The last module consists of
an intelligent term manager that in addition of storing the terms accordingly it
may provide links and definitions to existing resources. Existing resources can
be reused either to train the classifiers or to use them in dictionary approaches
to term recognition. In addition to this approach we can use approaches that
collect existing structures from available terminological resources. For example,
UMLS is the result of merging several medical resources and thesauri. In this
case, similar issues to the alignment of ontologies have to be addressed. It is
worth mentioning that even this approach requires ATM solutions for extending
and maintaining the resulting meta thesaurus.
    As a consequence, the existence of a common thesaurus can help to link
concepts from existing resources at the same time that it ensures no duplicate
entries for the same concepts. This thesaurus will collect the different terms in a
common repository allowing ontologies to be linked accordingly. Thus, the final
scenario consists of one thesaurus and many specific ontologies. These ontologies
may be designed according to different criteria, for they are usually applied in
different contexts. We find the best example in the OBO ontologies where several
ontologies can overlap in some of their concepts.
    The generation of a common thesaurus requires the resolution of several
issues like an agreement concerning the meaning of the entries in the lexicon.
As we have seen, JIA already presents a difficult conceptualization even among
domain experts. The outcome of the research in the field may require not only to
create new concepts but also to split existing ones. This will imply the necessity
of maintaining the ontology up-to-date, since some of the links have become
obsolete (see Section 3.4). Additionally, another way of solving the problem
would consist of the generation of several versions.
    Although current approaches represent an important initiative for the con-
struction of a shared lexicon they still lack some important requirements to al-
low a straight forward interoperability with ontologies and text resources. Next
section presents the main limitations of current efforts and proposes some re-
quirements to be followed in order to get the intended lexicon.

3.1   Limitations of current reference lexicons
The UMLS Metathesaurus (UMLS-Meta) represents the best effort for the cre-
ation of a reference thesaurus. However it has several drawbacks, most of them
because of its complexity, since in some cases the UMLS-Meta is closer to an
ontology than to our intended thesaurus/lexicon. The UMLS-Meta contains con-
cepts from more than 100 terminologies, classifications, and thesauri, for exam-
ple: MesH, SNOMED CT or ICD. This makes UMLS-Meta a really rich source of
knowledge, but also a source of ambiguity, redundancy and meaningless entries.
In the literature we can find some efforts [18, 19] to normalize the UMLS-Meta by
filtering redundancy and solving a basic level of the ambiguity 11 . However some
ambiguity cases are rather hard to solve. This is the case of the term Prostate
Cancer which has associated two UMLS-Meta entries: C0600139 and C0376358.
Both concepts refer to the Neoplastic Processes, Carcinoma of prostate and Ma-
lignant tumor of prostate, respectively. These Neoplastic Processes have a close
relationship, indeed the former is represented as a child of the later within the
NCI and UMLS-Meta taxonomies.
     After filtering redundant cases and solving some of the trivial ambiguity prob-
lems the UMLS-Meta still contains a huge number of concept labels that surely
will not have a correspondence neither in ontology labels nor texts. Next we
present some representative cases (extracted from a portion of UMLS-Meta re-
lated to the JIA domain) that our intended thesaurus should avoid:

Descriptive names Some synonyms are closer to a text definition than to a
   concept name. For example, Therapeutic or Preventive concepts C0199105:
   “Anaesthesia for open procedure on knee joint Procedure”, and C0580168:
   “Amputation of finger through distal interphalangeal joint”. Nevertheless,
   not all concepts can be described with a few words. Indeed, such complex con-
   cepts should be described in formal ontologies by combining somehow smaller
   units of meaning of the lexicon, e.g. concept C0580168 can be formally de-
   scribed as Amputationu∃involve.F ingeru∃through.InterphalangealJoint,
   where the semantics for each of its elements is defined in a formal ontology.
   Additionally, each of the concept constituents can be linked to entries of the
   lexicon.
Parametrization in the label The Clinical Drug C1614077 has the preferred
   name “Etanercept 50 mg/mL subcutaneous solution”. This term indicates
   not only the drug name but also the dosage for this pharmaceutical prod-
   uct. Therefore the lexicon should contain only the generic name, and then
   the formal ontologies should represent “Etanercept 50 mg/mL subcutaneous
   solution” as either a subclass of “Etanercept” or just as an instance.
Complex nomenclature Chemicals formulae as in concept C0255404: “N-
   methyltropan-3-yl 2-(4-bromophenyl)propionate” are useful as a definition
   of the concept, but an ontology concept or a thesaurus term should not use
   this nomenclature. Moreover, entities detected in text rarely will match with
   this term.
Inappropriate syntax Concept C0366794 with Semantic Type Clinical At-
   tribute has the preferred name string “Hemoglobin C/Hemoglobin.total:Mass
   Fraction:Point in time:Whole blood”. Obviously this string is encoding some
   data perhaps only understandable in the source vocabulary.

    As commented above, UMLS-Meta is more complex than a simple thesaurus
or a glossary of terms and it does not only contain synonymy relations but also
11
     Filtering UMLS and solving Ambiguity: http://skr.nlm.nih.gov/papers/
inclusion relations like hyponymy and hypernymy (i.e. is-a or subsumption rela-
tions in ontologies) and part-whole relationships like meronymy and holonymy
(i.e. has-part, part-of). This makes UMLS-Meta really hard to evolve and main-
tain properly, and the inclusion of new vocabularies may introduce unexpected
classifications of the concepts (i.e. cycles). The desired thesauri should contain
a clearer hierarchy with or only hypernymy or only meronymy. The granularity
of the lexicon hierarchy could vary from a top level ontology classification (e.g.
UMLS Semantic Network) to fine granularity hierarchies like the OBO classifica-
tions or the UMLS-Meta hypernymy hierarchy itself. More complex classification
of the concepts should be delegated to the ontology conceptualization process.


3.2   Limitations of current reference ontologies

The OBO ontologies present a huge community effort in the development of on-
tologies, but we still miss the use of a common lexicon/thesaurus to normalize
the used nomenclature. Moreover, the OBO ontologies, like the UMLS-Meta ,
are in the middle of what we expect from an ontology and from a lexicon. The
underlying logic of the OBO ontologies is not too complex, being in most cases
limited to simple taxonomies (e.g. Disease Ontology). The Gene Ontology has
also assertions but in the most of cases they refer to concept metadata. The
use of more complex logic would give more expressive power to express complex
concepts that can not be described only by a name an a set of subsumptions.
Moreover this kind of ontologies will provide a framework to classify facts or
concepts of the world according to a concept definition without making an ex-
plicit specification of the subsumption. This would make easier the introduction
of new concepts within the hierarchy by only giving the definition of the con-
cept. For example, from the simple set of axioms 4 to 6 we would infer that
JIA v Systemic Disease, but without defining explicitly this axiom.


          Disease u ∃affects.Whole Body v Systemic Disease                     (4)
                                   JIA v Arthritis u ∃affects.Whole Body       (5)
                               Arthritis v Disease                             (6)


    Nevertheless, complex logics have also drawbacks in the sense of computabil-
ity and therefore a good balance between efficiency and expressivity should be
achieved. Apart from the expressivity issues, the OBO ontologies also present
some lexical problems. On one hand some of the concepts names used in OBO
ontologies, like some UMLS-Meta entries, are closer to definitions than to a
concept name (e.g. GO:0007180 “transforming growth factor beta ligand binding
to type II receptor” (biological process) or GO:0016456 “X chromosome located
dosage compensation complex, transcription activating” (cellular component)).
Like in UMLS, these concept names are of little help when performing for ex-
ample text mining tasks. On the other hand, these ontologies are overloaded
with too much metadata (i.e. synonyms, definitions, references) making hard
their management. For example, the Human Disease Ontology contains 14772
classes (and 1 property), 18593 subsumption relationships and 442168 entity an-
notations (i.e. synonyms, references to entries of other thesaurus, mainly UMLS,
ICD, SNOMED and MESH), therefore an average of almost 30 annotations per
class. The case of the Gene Ontology is similar, containing more than 450000 en-
tity annotation axioms for less than 30000 classes, and 150000 assertion axioms
being used as a annotation values.
    More formal ontologies like the Foundational Model of Anatomy (FMA is
available as Protégé Frames) and Galen (available in Grail and OWL) seem to
be projects being developed independently with respect to the UMLS-Meta and
OBO foundry efforts. Galen contains some information about synonymy but, as
far as we know, it does not provide an explicit connection with a public lexi-
con, indeed they present the problem of label (i.e. term) selection12 to better
describe the concepts without ambiguity. As commented previously some con-
cepts are hard to describe and the selection of a proper label for them is not
a straightforward task. The lexicon should provide a consensual term for the
label and the corresponding definition. As known, natural language could be
rather ambiguous when describing complex and similar concepts, for this reason
the ontology should provide a logic based and non-ambiguous description of the
desired concept. On the other hand, the development of FMA13 represents a
quite interesting initiative since FMA uses Terminologia Anatomica (TA) [20]
as an official source of anatomical terms. In this way they are making a clear
distinction between terms and concepts, and which role they have within the
terminology and the ontology respectively. Perhaps TA it does not represents
the desired lexicon since is not general enough but it seems to overcome some of
the proposed limitations.


3.3   Necessity of a Lexicon for properties

Finally, the use of the proper properties in the ontology lifecycle will have also
an essential role. On the one hand a lexicon of properties will be really helpful in
the correct connection between concepts, on the other hand the identification of
which properties are mainly used to relate concepts will help text mining tech-
niques to discover interesting knowledge from texts. In the literature we can find
some efforts in this line, mainly the ones proposing Ontology Design Patterns
(e.g. [21]) in which the set of allowed properties and expression are established
in advance. Additionally the UMLS Semantic Network14 also provides a speci-
fication of 96 properties (86 of them with an associated inverse). These efforts
represent good initiatives but they should be integrated within a standard and
shared lexicon, that is, the desired lexicon not only should store information
about terms but also the proper description of the properties that will relate
terms.
12
   Problems of Labels: http://www.opengalen.org/themodel/labels.html
13
   About FMA: http://sig.biostr.washington.edu/projects/fm/FME/aboutFME.html
14
   UMLS Semantic Network: http://semanticnetwork.nlm.nih.gov/
3.4   Thesauri-Ontology Linkage
As commented previously, we intend to have per each domain one shared the-
saurus and several ontologies using subsets of the thesaurus concepts for differ-
ent application purposes. Each ontology concept will be annotated (i.e. entity
annotation axioms in OWL[22]) with the corresponding term identifier of the
thesaurus. Optionally information from the lexicon can be integrated in the on-
tology to make faster its processing.
    The desired thesaurus will require a unique entry identifier, the link to the
words representing the terms including the preferred term and the synonyms.
The different words can be kept in a common table referenced by the entries in
the thesaurus that allows ambiguity analysis, i.e. how many entries are related
to the same word. Metadata added to the thesaurus entries will ease the search
for existing entries and solve ambiguous cases. The link of each entry to a Se-
mantic Category (e.g. disease, gene, drug, organ, etc.) has been shown helpful
for disambiguation purposes in many fields. Additionally, Semantic Categories
can belong a top ontology similar to the UMLS Semantic Network. In this case,
the finer the granularity of this semantic network is, the more precise will be the
searches. However, if such a network is too intricate, the resulting lexicon will
be hard to maintain for coherence. Thus, lexicon entries and their relationships
must be just focused on the definition, origin and purposes of their entries ac-
cording to the community requirements. Table 1 shows an example of entry for
the desired lexicon:

Identifier:            SWAT4LS0000001
Preferred Name:        Chronic Childhood Arthritis
Pref. Ontology Label: Juvenile idiopathic arthritis
Synonyms:              Juvenile idiopathic arthritis, Juvenile rheumatoid arthritis,
                       Juvenile arthritis, JIA, JRA, JCA, . . .
Semantic Category: Disease
Hypernymy:             Rheumatoid Arthritis
Definition:            Rheumatoid arthritis of children occurring in three . . .
Status:                Up-to-date
        Table 1. Example of Lexicon Entry (Source: UMLS Metathesaurus)


    The evolution of the ontology may imply changes over the thesaurus like
the addition of new entries, the deprecation of obsolete entries or the split of
entries in several ones. Obviously the evolution of the thesaurus will also affect
the referencing ontologies. For this reason, the lexicon should release stable ver-
sions periodically if important changes were made. Moreover each entry of the
thesaurus should also have metadata about the status of the entry, indicating if
the entry is being reviewed (new entries), is obsolete (pointing to which entry
or entries should be used instead), or just if the entry is up-to-date. Referencing
ontologies should periodically check if the referenced version of thesaurus is the
last one and if the used lexical entries suffered any change or become obsolete.
   The ontology and the lexicon are going to be quite interconnected during all
the ontology lifecycle stages, therefore the used ontology editor should allow the
connection to the lexicon in order to search for lexicon terms and to annotate
ontology concepts with the proper lexicon entry. The UMLS Tab15 for the ontol-
ogy editor Protégé16 was a good initiative trying to integrate UMLS-Meta within
the ontology lifecycle. On the other hand, the OBO ontology editor17 also allows
the linking (i.e. cross references) of defined concepts to synonyms coming from
other resources.


4    Experiences within the Health-e-Child Project
The Health-e-Child project has provide us an excellent real application domain
for our experiments. The Biomedical Knowledge Representation Workpackage is
intended to give an ontology-based representation of the HeC domains (e.g. JIA
disease) and to link that knowledge with external resources (e.g. text resources,
thesauri, biomedical databases, etc.) We have mainly focused our efforts on the
linkage to external knowledge. For this end we have worked on three main issues:
text mining, annotation of medical protocols and ontology reuse. Currently, the
development of the HeC ontologies is still an ongoing task.
    The work presented in [19] analyses different techniques to annotate textual
resources with UMLS-Meta terms, and it compares the results with an anno-
tated corpus. Concerning the recall results, we found that some lexical variants
are not covered by UMLS-Meta, that is, it lacks the desired entry or it does
not provided the proper synonym to identify the concept. Concerning precision
results, ambiguous entries within UMLS-Meta and partial annotations usually
lead to errors. As commented in Section 3.1, UMLS-Meta represents the main
effort on building a medical reference thesaurus, however it still needs to be
further polished, refined and extended.
    Within this project, another interesting task is to extract information con-
tained in medical protocols (e.g. patient data forms) [10]. For this purpose, we
regard these medical protocols as a set of input controls (input fields in patient
data forms), where each control has an associated text label (e.g. Date of Di-
agnosis, Bone Erosion Evaluation (BEE)). UMLS-Meta based annotations[19]
were used to assign a set of UMLS-Meta terms to each form control. After-
wards, a set of logical representations are associated to each form control (e.g.
BEE v ∃hasU M LS.C0587240 u ∃hasU M LS.C1261322). Moreover, this logic rep-
resentations have been integrated within a classification purpose ontology (see
[10] for a more comprehensive explanation) which aims to classify controls into
categories (e.g. Medical Procedure, Measurement, etc.). Again, incomplete and
wrong annotations due to ambiguous entries were the main problems. Unlike
text mining, where wrong annotations not necessarily have an important conse-
quence, wrong and incomplete annotations may imply a wrong characterization
15
   UMLS tab: http://protegewiki.stanford.edu/index.php/UMLS Tab
16
   Protégé: http://protege.stanford.edu/
17
   OBO-Edit: http://oboedit.org/
of the medical protocols. Hence, a richer and cleaner controlled vocabulary will
be necessary in order to improve the quality of semantic annotations.
    Finally, regarding our ontology reuse, our main experience stems from build-
ing modules from Galen and NCI ontologies [23]. Modules allow us to extract
the desired portion of knowledge from a target ontology, given a set of concepts
of interest (e.g. Juvenile Arthritis, Joint). However, the definition of the ex-
act concept labels of interest became a really hard and ontology dependent
task since no common terminology were used in NCI and Galen. Moreover
the integration of the extracted modules was again a cornerstone since not
only a different conceptualization were found but also different concepts names
representing the same reality (e.g. “NCI:Juvenile Rheumatoid Arthritis” and
“Galen:JuvenileArthritis”).


5   Conclusions

In this paper we have presented a still opened issue: the necessity of use and
maintenance of a lexicon for ontology engineering, specially for the Life Sci-
ences. We have also emphasized the main limitations and problems of current
approaches, which should be better coordinated, integrated and reused. The gap
between knowledge representation languages and domain expert skills is another
important issue to be addressed. In this way, very expressive languages like OWL
are being used for representing simple taxonomies. Instead, defining more com-
plex biomedical concepts requires good skills in Description Logics, which are
difficult to understand by domain experts.
    Future work will be focused on applying the ideas of this paper to the de-
velopment of the HeC domain ontologies. We also aim at creating a light-weight
thesaurus following the guidelines of this paper so that it provides all the neces-
sary lexical information required by the HeC ontologies and their applications.
Moreover, we will study how to filter and enrich existing lexical resources in
order to create this new thesaurus.


Acknowledgments
The authors wish to thank the EU project Health-e-Child (IST 2004-027749) for
providing us the application domain. This work has been partially funded by the
Spanish National Research Program (contract number TIN2005-09098-C05-04).
Ernesto Jimenez-Ruiz was supported by the PhD Fellowship Program of the
Generalitat Valenciana. Antonio Jimeno-Yepes was supported by funding from
the EC STREP project BOOTStrep (FP6-028099, http://www.bootstrep.org).


References

 [1] Hirst, G.: Ontology and the lexicon. In: Handbook on Ontologies in Information
     Systems, Springer (2004) 209–230
 [2] Bodenreider, O.: Lexical, terminological and ontological resources for biological
     text mining. In: Text mining for biology and biomedicine. Artech House. (2006)
 [3] Gruber, T.R.: Towards Principles for the Design of Ontologies Used for Knowledge
     Sharing. In Guarino, N., Poli, R., eds.: Formal Ontology in Conceptual Analysis
     and Knowledge Representation. (1993)
 [4] Pezik, P., Jimeno, A., Lee, V., Rebholz-Schuhmann, D.: Static dictionary features
     for term polysemy identification. Building and evaluating resources for biomedical
     text mining, LREC Workshop (2008)
 [5] Bechhofer, S.: Ontology language standardisation efforts. OntoWeb. Technical
     Report. http://www.ontoweb.org/About/Deliverables/d4.0.pdf (2002)
 [6] McGuinness, D.L.: Ontologies come of age. In: Spinning the Semantic Web:
     Bringing the World Wide Web to Its Full Potential. (2003)
 [7] Freund, J., et al.: Health-e-child: An integrated biomedical platform for grid-based
     pediatrics. In: Proc of Health-Grid 2006, Valancia, Spain (2006) 259–270
 [8] Fernandez, M., Gomez-Perez, A., Juristo, N.: Methontology: from ontological art
     towards ontological engineering. In: Proceedings of the AAAI. (1997)
 [9] Duffy, C.M., et al.: Nomenclature and classification in chronic childhood arthritis:
     Time for a change? Arthritis and Rheumatism 52(2) (2005) 382–385
[10] Berlanga, R., Jimenez-Ruiz, E., et al.: Medical data integration and the semantic
     annotation of medical protocols. In: The 21th IEEE International Symposium on
     Computer-Based Medical Systems (CBMS). (2008)
[11] Pinto, H.S., Martins, J.P.: Reusing ontologies. In: AAAI 2000 Spring Symposium
     on Bringing Knowledge to Business Processes, AAAI Press (2000) 77–84
[12] Shvaiko, P., Euzenat, J.: Ten challenges for ontology matching. In: Proceedings
     of ODBASE. (2008)
[13] Bouquet, P., Giunchiglia, F., Harmelen, F., Serafini, L., Stuckenschmidt, H.: C-
     OWL: Contextualizing ontologies. In: Proc. of ISWC. LNCS 2870 (2003)
[14] Beisswanger, E., Poprat, M., Hahn, U.: Lexical Properties of OBO Ontology Class
     Names and Synonyms. In: 3rd International Symposium on Semantic Mining in
     Biomedicine. (2008)
[15] Spasić, I., Schober, D., Sansone, S., Rebholz-Schuhmann, D., Kell, D., Paton, N.:
     Facilitating the development of controlled vocabularies for metabolomics tech-
     nologies with text mining. BMC Bioinformatics 9(5) (2008) S5
[16] Frantzi, K., Ananiadou, S., Mima, H.: Automatic recognition of multi-word terms:
     the c-value/nc-value method. International Journal on Digital Libraries (2000)
[17] Ananiadou, S., Nenadic, G.: Automatic terminology management in biomedicine.
     Text mining for biology and biomedicine. Artech House (2006) 67–97
[18] Aronson, A.R.: Mapping text to the umls metathesaurus. Technical report:
     http://skr.nlm.nih.gov/papers/index.shtml (2001)
[19] Jimeno, A., Jimenez-Ruiz, E., Lee, V., Gaudan, S., Berlanga, R., Rebholz-
     Schuhmann, D.: Assessment of disease named entity recognition on a corpus
     of annotated sentences. BMC Bioinformatics 9(Suppl 3) (2008) S3
[20] Rosse, C.: Terminologia anatomica: Considered from the perspective of next-
     generation knowledge sources. Clinical Anatomy 14(2) (2001) 120–133
[21] Egana, M., Antezana, E., Kuiper, M., Stevens, R.: Ontology design patterns for
     bio-ontologies: a case study on the cell cycle ontology. BMC Bioinformatics (2008)
[22] Cuenca-Grau, B., Horrocks, I., Motik, B., Parsia, B., Patel-Schneider, P., Sattler,
     U.: OWL 2: The next step for OWL. Journal of Web Semantics (2008) To Appear.
[23] Jimenez-Ruiz, E., et al.: Safe and economic re-use of ontologies: A logic-based
     methodology and tool support. In: European Semantic Web Conference. (2008)