=Paper=
{{Paper
|id=None
|storemode=property
|title=BLOOMS on AgreementMaker: results for OAEI 2010
|pdfUrl=https://ceur-ws.org/Vol-689/oaei10_paper3.pdf
|volume=Vol-689
|dblpUrl=https://dblp.org/rec/conf/semweb/PesquitaSCC10
}}
==BLOOMS on AgreementMaker: results for OAEI 2010==
Catia Pesquita 1, Cosmin Stroe 2, Isabel F. Cruz 2, Francisco M. Couto 1
1 Faculdade de Ciências da Universidade de Lisboa, Portugal
cpesquita at xldb.di.fc.ul.pt, fcouto at di.fc.ul.pt
2 ADVIS Lab, Department of Computer Science, University of Illinois at Chicago
cstroe1@cs.uic.edu, ifc@cs.uic.edu
Abstract. BLOOMS is an ontology matching method developed as part of an ontology
extension system for biomedical ontologies. It combines two lexical similarity measures with
similarity propagation. These matchers are applied sequentially, following their precision yield:
first lexical similarity based on exact matches, followed by partial matches, and finally these
similarities are propagated throughout the ontologies. Partial matches are based on the
specificity of words within the ontologies' vocabularies. Semantic propagation of similarities is
made according to the semantic distance between ontology concepts given by semantic
similarity measures. Alignments are extracted after each matcher, to favor precision, since
BLOOMS was specifically designed to be as automated as possible. For the participation in
OAEI 2010 BLOOMS was integrated into the AgreementMaker system, which provided
ontology loading and navigation capabilities. We participated only in the anatomy track, in
tasks #1 and #2 (f-measure and precision), given that BLOOMS was specifically designed for
the automated matching of biomedical ontologies. We obtained encouraging results with an
f-measure of 0.828 for task #1 and a precision of 0.967 for task #2. Although the current
implementation of BLOOMS results in very good precision values, recall is below that of the
highest performing systems. This motivates our future work in improving our semantic
propagation algorithm and exploiting external resources.
1 Presentation of the system
BLOOMS is an ontology matching method specifically intended for application to
biomedical ontologies. The matching of biomedical ontologies has become a focus of
interest in recent years due to the increasingly important role that biomedical
ontologies are playing in the knowledge revolution that has swept the Life Sciences
domain in the last decade. The pressing need for these resources resulted in the
parallel development of ontologies by different groups and institutions, giving rise not
only to different ontologies covering the same domain, but also to a lack of shared
standards and logical links between related ontologies. The alignment of biomedical
ontologies is thus crucial to take full advantage of them.
Biomedical ontologies present specific challenges and opportunities for their
alignment. One relevant feature of many biomedical ontologies that hinders their
alignment is their size: for instance, the Gene Ontology contains over 30,000 concepts
and ChEBI over 500,000. Many of the systems developed for other domains have
difficulty in handling such large ontologies. On the other hand, most biomedical
ontologies support few types of relationships, which can hinder the performance of
matchers that explore more complex structures. Also, in most biomedical ontologies
edges do not all represent the same semantic distance between concepts; for instance,
edges deeper in the ontology usually represent shorter distances than edges closer to
the root concept.
Another relevant feature is the rich textual information in the form of concept names,
synonyms and definitions that most biomedical ontologies have. This can play a
crucial role in matching algorithms that exploit lexical resources but it can also be an
obstacle since biomedical terminology has a high degree of ambiguity.
In recent years OAEI has been the major playing field for biomedical ontology
alignment, through its anatomy track. One important finding of previous OAEI anatomy
tracks is that several matches are rather trivial and can be found by simple string
comparison techniques. Based on this notion, the work in [1] has applied a simple
string matching algorithm to several ontologies available in the NCBO BioPortal, and
reported high levels of precision in most cases. There are several possible
explanations for this, including the simple structure of most biomedical ontologies,
their high number of synonyms and low language variability. To improve on the
results of simple string matching, the most successful systems in previous OAEI
editions [2,3] have shown the advantages of two distinct strategies: (1) exploitation of
external knowledge and (2) composition of different matchers followed by
propagation of similarity. The first strategy uses background knowledge resources
such as the UMLS to support lexical matching of concepts [4-6]. The second strategy
propagates similarities between ontology concepts throughout the ontology graphs,
based on the assumption that a match between two concepts should contribute to the
match of their adjacent concepts, according to a propagation factor [7].
BLOOMS was designed to build on the success of simple lexical matching
methods, while still finding alignments where lexical similarity is low, by using
global computation techniques. It couples a lexical matching algorithm based on the
specificity of words in the ontology vocabulary, with a novel global similarity
computation approach that takes into account the semantic variability of edges.
1.1 State, purpose, general statement
The original purpose of BLOOMS is to provide the ontology matching component of
an ontology extension system called Auxesia. This system combines ontology
matching and ontology learning techniques to propose new concepts and relations to
biomedical ontologies. Consequently, BLOOMS was specifically designed to match
biomedical ontologies in a fully automated fashion, favoring precision over recall.
Although BLOOMS was specifically designed to be applied to biomedical ontologies,
its current implementation is domain-independent since it can function without
external forms of knowledge. To capitalize on the specific characteristics of most
biomedical ontologies, BLOOMS joins a lexical matcher to exploit the rich textual
component with a global similarity computation technique to handle the cases where
synonyms exist but are not shared between ontologies. Furthermore, BLOOMS can
also exploit annotation corpora, which are available for some biomedical ontologies,
to improve the propagation of similarity.
1.2 Specific techniques used
BLOOMS has a sequential architecture composed of three distinct matchers: Exact,
Partial and Semantic Broadcast Match. While the first two matchers are based on
lexical similarity, the final one is based on the propagation of previously calculated
similarities throughout the ontology graph. Figure 1 depicts the general structure of
BLOOMS.
Figure 1. Diagram of BLOOMS architecture. Given two ontologies, BLOOMS first
extracts alignments based on Exact matches, then on Partial matches, and finally it
propagates the similarities generated by those two strategies using the Semantic
Broadcast approach.
1.2.1 Lexical similarity
Exact and Partial matchers use lexical similarity based on textual descriptions of
ontology concepts. Textual descriptors of concepts include their labels, synonyms and
definitions. Since ontology concepts usually have several textual descriptors, the
similarity between two ontology concepts is given by the maximum similarity over all
possible pairs of descriptors.
The first matcher, Exact Match, is run on textual descriptions after normalization and
corresponds to a simple exact match, where the score is either 1 or 0.
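The following is a minimal Python sketch of the Exact Match step combined with the maximum-over-descriptors rule described above. It is an illustration only, not the BLOOMS implementation: the function names (`normalize`, `exact_descriptor_sim`, `concept_similarity`) are hypothetical, and the normalization shown (lowercasing, stripping diacritics, collapsing whitespace) is an assumption, since the text does not list the exact normalization steps.

```python
import unicodedata


def normalize(descriptor):
    """Assumed normalization: strip diacritics, lowercase, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", descriptor)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_marks.lower().split())


def exact_descriptor_sim(desc_a, desc_b):
    """Score 1.0 if the normalized descriptors are identical, else 0.0."""
    return 1.0 if normalize(desc_a) == normalize(desc_b) else 0.0


def concept_similarity(descriptors_a, descriptors_b, descriptor_sim):
    """Maximum similarity over all pairs of descriptors of two concepts."""
    return max(
        (descriptor_sim(a, b) for a in descriptors_a for b in descriptors_b),
        default=0.0,  # a concept without descriptors matches nothing
    )
```

Because `concept_similarity` takes the descriptor-level measure as a parameter, the same function could in principle be reused with a partial-match measure in place of `exact_descriptor_sim`.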
The second matcher, Partial Match, is applied after processing all concepts' labels,
synonyms and definitions by tokenizing strings into words, removing stopwords,
normalizing diacritics and special characters, and finally stemming
(Snowball). If the concepts share some of the words in their descriptors, i.e. are partial
matches, the final score is given by a Jaccard similarity, calculated as the
number of words shared by the two concepts over the total number of distinct words
in either concept. Alternatively, each word can be weighted by its evidence content.
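As an illustration of the unweighted Partial Match score, the sketch below computes a Jaccard similarity between the word sets of two descriptors. It rests on several assumptions: the stopword list is a toy placeholder, the Snowball stemming step is omitted to keep the example dependency-free, and the names `tokens` and `jaccard` are hypothetical rather than taken from BLOOMS.

```python
import re

# Toy stopword list; the list actually used by BLOOMS is not given in the text.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


def tokens(descriptor):
    """Lowercase, split into alphanumeric words, drop stopwords (no stemming)."""
    words = re.findall(r"[a-z0-9]+", descriptor.lower())
    return {w for w in words if w not in STOPWORDS}


def jaccard(desc_a, desc_b):
    """Shared words divided by distinct words in either descriptor."""
    words_a, words_b = tokens(desc_a), tokens(desc_b)
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0
```

For example, "wall of the urinary bladder" and "urinary bladder" share two of three distinct words after stopword removal, giving a score of 2/3.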
The notion of evidence content (EC) of a word [1] is based on information theory and
can be considered a term relevance measure, since it measures the relevance of a
word within the vocabulary of an ontology. It is calculated as the negative logarithm
of the relative frequency of a word in the ontology vocabulary:
$$EC(word) = -\log \mathit{freq}(word), \quad word \in V_{ontology}$$
The ontology vocabulary corresponds to all words in all descriptors of all concepts in
the ontology. The final frequency of a word within an ontology corresponds to the
number of concepts that contain it in any of their descriptors. This means that a word
that appears multiple times in the label, definition or synonyms of a concept is only
counted once, preventing bias towards concepts that have many synonyms with very
similar word sets. The evidence content of words that are common to both ontologies
is given by the average of their ECs within each ontology.
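The evidence content computation can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the authors' code: the relative frequency is taken as the number of concepts containing a word divided by the total number of concepts, the logarithm is natural, and the weighted Jaccard is formed as the summed EC of shared words over the summed EC of all distinct words. None of these three choices is specified in the text, and the names `evidence_content` and `weighted_jaccard` are hypothetical.

```python
import math


def evidence_content(concept_words):
    """EC per word, given a dict: concept id -> set of words in its descriptors.

    Each word counts once per concept, however many descriptors repeat it.
    Assumption: relative frequency = concepts containing word / all concepts.
    """
    total = len(concept_words)
    counts = {}
    for words in concept_words.values():
        for word in set(words):
            counts[word] = counts.get(word, 0) + 1
    return {word: -math.log(n / total) for word, n in counts.items()}


def weighted_jaccard(words_a, words_b, ec_a, ec_b):
    """Jaccard with each word weighted by its EC (assumed weighting scheme)."""

    def weight(word):
        # Average the ECs when the word occurs in both ontologies.
        values = [ec[word] for ec in (ec_a, ec_b) if word in ec]
        return sum(values) / len(values) if values else 0.0

    shared = sum(weight(w) for w in words_a & words_b)
    total = sum(weight(w) for w in words_a | words_b)
    return shared / total if total > 0 else 0.0
```

Under this weighting, a frequent word such as "bladder" contributes less to the score than a rare word, so two concepts that share only common words receive a lower similarity than the unweighted Jaccard would give them.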
1.2.2 Semantic Broadcast
After the lexical similarities are computed, they are used as input for a global
similarity computation technique, Semantic Broadcast (SB). This novel approach
takes into account that the edges in the ontology graph do not all convey the same
semantic distance between concepts.
This strategy is based on the notion that concepts whose relatives are similar should
also be similar. A relative of a concept is an ancestor or a descendant whose distance
to the concept is smaller than a factor d. To the initial similarity between concepts, SB
adds the sum of all similarities of the alignments between all relatives weighted by
their semantic gap sG, to a maximum contribution of a factor c. This is given by the
following:
$$Sim_{final}(c_a, c_b) = Sim_{lex}(c_a, c_b) + c \sum_{r_i, r_j \,\mid\, D(r_i, c_a) \wedge D(r_j, c_b)} Sim_{lex}(r_i, r_j) \cdot sG(c_a, r_i, c_b, r_j)$$