<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <abstract>
        <p>BLOOMS is an ontology matching method developed as part of an ontology extension system for biomedical ontologies. It combines two lexical similarity measures with similarity propagation. These matchers are applied sequentially, in decreasing order of expected precision: first lexical similarity based on exact matches, then partial matches, and finally the propagation of these similarities throughout the ontologies. Partial matches are based on the specificity of words within the ontologies' vocabularies. Semantic propagation of similarities follows the semantic distance between ontology concepts given by semantic similarity measures. Alignments are extracted after each matcher, to favor precision, since BLOOMS was specifically designed to be as automated as possible. For its participation in OAEI 2010, BLOOMS was integrated into the AgreementMaker system, which provided ontology loading and navigation capabilities. We participated only in the anatomy track, in tasks #1 and #2 (f-measure and precision), given that BLOOMS was specifically designed for the automated matching of biomedical ontologies. We obtained encouraging results, with an f-measure of 0.828 for task #1 and a precision of 0.967 for task #2. Although the current implementation of BLOOMS yields very good precision values, its recall is below that of the highest performing systems. This motivates our future work on improving our semantic propagation algorithm and exploiting external resources.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>1 Faculdade de Ciencias da Universidade de Lisboa, Portugal
cpesquita-at-xldb.di.fc.ul.pt, fcouto-at-di.fc.ul.pt
2 ADVIS Lab, Department of Computer Science, University of Illinois at Chicago
cstroe1@cs.uic.edu, ifc@cs.uic.edu</p>
      <p>BLOOMS is an ontology matching method specifically intended for application to biomedical ontologies. The matching of biomedical ontologies has become a focus of interest in recent years due to the increasingly important role that biomedical ontologies are playing in the knowledge revolution that has swept the Life Sciences domain in the last decade. The pressing need for these resources resulted in the parallel development of ontologies by different groups and institutions, giving rise not only to different ontologies covering the same domain, but also to a lack of shared standards and logical links between related ontologies. The alignment of biomedical ontologies is thus crucial to take full advantage of them.</p>
      <p>Biomedical ontologies present specific challenges and opportunities for their alignment. One relevant feature of many biomedical ontologies that hinders their alignment is their size: for instance, the Gene Ontology contains over 30,000 concepts and ChEBI over 500,000. Many of the systems developed for other domains have difficulty in handling such large ontologies. On the other hand, most biomedical ontologies support few types of relationships, which can hinder the performance of matchers that explore more complex structures. Also, in most biomedical ontologies edges do not all represent the same semantic distance between concepts: for instance, edges deeper in the ontology usually represent shorter distances than edges closer to the root concept.</p>
      <p>
        Another relevant feature is the rich textual information, in the form of concept names, synonyms and definitions, that most biomedical ontologies have. This can play a crucial role in matching algorithms that exploit lexical resources, but it can also be an obstacle, since biomedical terminology has a high degree of ambiguity.
      </p>
      <p>
        In recent years OAEI has been the main playing field for the alignment of biomedical ontologies, through its anatomy track. One important finding of previous OAEI anatomy tracks is that several matches are rather trivial and can be found by simple string comparison techniques. Based on this notion, the work in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] applied a simple string matching algorithm to several ontologies available in the NCBO BioPortal, and reported high levels of precision in most cases. There are several possible explanations for this, including the simple structure of most biomedical ontologies, their high number of synonyms and low language variability. To improve on the results of simple string matching, the most successful systems in previous OAEI editions [
        <xref ref-type="bibr" rid="ref2 ref3">2,3</xref>
        ] have shown the advantages of two distinct strategies: (1) exploitation of external knowledge and (2) composition of different matchers followed by propagation of similarity. The first strategy uses background knowledge resources such as the UMLS to support lexical matching of concepts [
        <xref ref-type="bibr" rid="ref4 ref5">4-6</xref>
        ]. The second strategy propagates similarities between ontology concepts throughout the ontology graphs, based on the assumption that a match between two concepts should contribute to the match of their adjacent concepts, according to a propagation factor [
        <xref ref-type="bibr" rid="ref6">7</xref>
        ].
      </p>
      <p>
        BLOOMS was designed to build on the success of simple lexical matching methods, while still finding alignments where lexical similarity is low, by using global computation techniques. It couples a lexical matching algorithm based on the specificity of words in the ontology vocabulary with a novel global similarity computation approach that takes into account the semantic variability of edges.
      </p>
    </sec>
    <sec id="sec-2">
      <title>1.1 State, purpose, general statement</title>
      <p>The original purpose of BLOOMS is to provide the ontology matching component of an ontology extension system called Auxesia. This system combines ontology matching and ontology learning techniques to propose new concepts and relations to biomedical ontologies. Consequently, BLOOMS was specifically designed to match biomedical ontologies in a fully automated fashion, favoring precision over recall.</p>
      <p>Although BLOOMS was specifically designed to be applied to biomedical ontologies, its current implementation is domain-independent, since it can function without external forms of knowledge. To capitalize on the specific characteristics of most biomedical ontologies, BLOOMS joins a lexical matcher, which exploits the rich textual component, with a global similarity computation technique, which handles the cases where synonyms exist but are not shared between ontologies. Furthermore, BLOOMS can also exploit annotation corpora, which are available for some biomedical ontologies, to improve the propagation of similarity.</p>
    </sec>
    <sec id="sec-3">
      <title>1.2 Specific techniques used</title>
      <p>BLOOMS has a sequential architecture composed of three distinct matchers: Exact Match, Partial Match and Semantic Broadcast. While the first two matchers are based on lexical similarity, the final one is based on the propagation of previously calculated similarities throughout the ontology graph. Figure 1 depicts the general structure of BLOOMS.</p>
    </sec>
    <sec id="sec-4">
      <title>1.2.1 Lexical similarity</title>
      <p>The Exact and Partial matchers use lexical similarity based on the textual descriptors of ontology concepts: their labels, synonyms and definitions. Since ontology concepts usually have several textual descriptors, the similarity between two ontology concepts is given by the maximum similarity over all possible combinations of descriptors.</p>
      <p>The first matcher, Exact Match, is run on textual descriptors after normalization and corresponds to a simple exact match, where the score is either 1 or 0.</p>
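      <p>The maximum-over-descriptors rule and the Exact Match score can be illustrated with a minimal Python sketch. The function names, the normalization rule (lowercasing and whitespace collapsing) and the sample descriptors are assumptions made for illustration and are not taken from the BLOOMS implementation.</p>
      <preformat><![CDATA[
```python
from itertools import product


def normalize(text):
    """Lowercase and collapse whitespace (a simplified stand-in for the
    normalization step, whose exact rules are not given in the text)."""
    return " ".join(text.lower().split())


def exact_match(desc_a, desc_b):
    """Exact Match score for one pair of descriptors: 1.0 or 0.0."""
    return 1.0 if normalize(desc_a) == normalize(desc_b) else 0.0


def concept_similarity(descs_a, descs_b, pair_score=exact_match):
    """Maximum pair_score over all combinations of descriptors."""
    return max(
        (pair_score(a, b) for a, b in product(descs_a, descs_b)),
        default=0.0,
    )


# Hypothetical descriptors (label plus synonyms) for two concepts.
mouse = ["Urinary Bladder", "bladder"]
human = ["urinary  bladder", "vesica urinaria"]
print(concept_similarity(mouse, human))  # 1.0
```
]]></preformat>
      <p>Passing a different pair_score function reuses the same maximum rule for a partial matcher.</p>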
      <p>
        The second matcher, Partial Match, is applied after processing all concepts' labels, synonyms and definitions: strings are tokenized into words, stopwords are removed, diacritics and special characters are normalized, and finally words are stemmed (Snowball). If the concepts share some of the words in their descriptors, i.e. are partial matches, the final score is given by a Jaccard similarity: the number of words shared by the two concepts divided by the number of distinct words found in either of them. Alternatively, each word can be weighted by its evidence content.
      </p>
      <p>
        The notion of evidence content (EC) of a word [
        <xref ref-type="bibr" rid="ref6">8</xref>
        ] is based on information theory and can be considered a term relevance measure, since it measures the relevance of a word within the vocabulary of an ontology. It is calculated as the negative logarithm of the relative frequency of a word in the ontology vocabulary:
      </p>
      <p><disp-formula><tex-math>EC(word) = -\log freq(word), \quad word \in V_{ontology}</tex-math></disp-formula>
The ontology vocabulary corresponds to all words in all descriptors of all concepts in the ontology. The final frequency of a word within an ontology corresponds to the number of concepts that contain it in any of their descriptors. This means that a word that appears multiple times in the label, definition or synonyms of a concept is only counted once, preventing bias towards concepts that have many synonyms with very similar word sets. The evidence content of words that are common to both ontologies is given by the average of their ECs within each ontology.</p>
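      <p>The following Python sketch shows one way the plain and EC-weighted Jaccard scores could be computed. It is illustrative only: the relative frequency is assumed to be the fraction of concepts containing the word, the weighted score is assumed to be the summed EC of shared words over the summed EC of all distinct words, stop-word removal and stemming are omitted, and the toy ontologies are invented.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def concept_words(descriptors):
    """All distinct lowercase words found in any descriptor of a concept."""
    words = set()
    for d in descriptors:
        words |= set(d.lower().split())
    return words


def evidence_content(ontology):
    """EC(word) = -log(freq(word)); freq is assumed to be the fraction
    of concepts whose descriptors contain the word."""
    counts = Counter()
    for descriptors in ontology.values():
        counts.update(concept_words(descriptors))  # counted once per concept
    n = len(ontology)
    return {w: -math.log(k / n) for w, k in counts.items()}


def combined_ec(word, ec_a, ec_b):
    """Average the ECs of a word present in both ontologies; otherwise
    fall back to the single available value (an assumption)."""
    values = [ec[word] for ec in (ec_a, ec_b) if word in ec]
    return sum(values) / len(values)


def jaccard(words_a, words_b, weight=lambda w: 1.0):
    """Shared words over all distinct words, optionally weighted."""
    total = sum(weight(w) for w in words_a | words_b)
    if total == 0:
        return 0.0
    return sum(weight(w) for w in words_a.intersection(words_b)) / total


# Invented toy ontologies: concept id -> list of descriptors.
onto_a = {"a1": ["left lung"], "a2": ["right lung"],
          "a3": ["left kidney"], "a4": ["heart"]}
onto_b = {"b1": ["lung left lobe"], "b2": ["kidney"],
          "b3": ["heart valve"], "b4": ["lung"]}

ec_a, ec_b = evidence_content(onto_a), evidence_content(onto_b)
words_a, words_b = concept_words(onto_a["a1"]), concept_words(onto_b["b1"])

plain = jaccard(words_a, words_b)
weighted = jaccard(words_a, words_b, lambda w: combined_ec(w, ec_a, ec_b))
print(plain, weighted)
```
]]></preformat>
      <p>In this toy case the weighted score is lower than the plain one, because the unshared word is rarer, and therefore carries more evidence, than the shared ones.</p>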
    </sec>
    <sec id="sec-5">
      <title>1.2.2 Semantic Broadcast</title>
      <p>After the lexical similarities are computed, they are used as input for a global similarity computation technique, Semantic Broadcast (SB). This novel approach takes into account that the edges in the ontology graph do not all convey the same semantic distance between concepts.</p>
      <p>This strategy is based on the notion that concepts whose relatives are similar should also be similar. A relative of a concept is an ancestor or a descendant whose distance to the concept is smaller than a factor d. To the initial similarity between concepts, SB adds the sum of all similarities of the alignments between all relatives, weighted by their semantic gap sG, to a maximum contribution of a factor c. This is given by the following:</p>
      <p><disp-formula><tex-math>Sim_{final}(c_a, c_b) = Sim_{lex}(c_a, c_b) + c \sum_{\substack{D(r_i, c_a) &lt; d \,\wedge\, D(r_j, c_b) &lt; d \\ (r_i, r_j) \in A}} Sim_{lex}(r_i, r_j) \cdot sG(c_a, r_i, c_b, r_j)</tex-math></disp-formula>
where c<sub>a</sub> and c<sub>b</sub> are concepts from ontologies a and b, and r<sub>i</sub> and r<sub>j</sub> are relatives of c<sub>a</sub> and c<sub>b</sub> at a distance D smaller than a factor d whose match belongs to the set of extracted alignments A.</p>
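      <p>A direct transcription of this formula into Python could look like the sketch below. The data structures (dictionaries of relatives with their distances, a set of aligned pairs), the function name and the constant placeholder used for sG are assumptions for illustration; the actual form of sG depends on the semantic similarity measure chosen.</p>
      <preformat><![CDATA[
```python
def sim_final(ca, cb, sim_lex, alignments, relatives_a, relatives_b,
              semantic_gap, c, d):
    """Sim_final(ca, cb) following the formula above.

    sim_lex      -- dict mapping (concept_a, concept_b) to Sim_lex
    alignments   -- set A of already extracted (concept_a, concept_b) pairs
    relatives_a  -- dict: concept -> {relative: distance D} in ontology a
    relatives_b  -- the same for ontology b
    semantic_gap -- callable sG(ca, ri, cb, rj), left open here
    """
    total = 0.0
    for ri, dist_i in relatives_a.get(ca, {}).items():
        for rj, dist_j in relatives_b.get(cb, {}).items():
            within_d = d > dist_i and d > dist_j
            if within_d and (ri, rj) in alignments:
                total += sim_lex.get((ri, rj), 0.0) * semantic_gap(ca, ri, cb, rj)
    return sim_lex.get((ca, cb), 0.0) + c * total


# Invented example: the parents are already aligned, the children are
# only a weak lexical match.
sim_lex = {("limb_a", "limb_b"): 1.0, ("forelimb", "upper limb"): 0.5}
alignments = {("limb_a", "limb_b")}
relatives_a = {"forelimb": {"limb_a": 1}}
relatives_b = {"upper limb": {"limb_b": 1}}


def constant_gap(ca, ri, cb, rj):
    """Placeholder sG that weights every pair of relatives equally."""
    return 1.0


boosted = sim_final("forelimb", "upper limb", sim_lex, alignments,
                    relatives_a, relatives_b, constant_gap, c=0.4, d=2)
print(boosted)
```
]]></preformat>
      <p>With these inputs the aligned parents contribute 0.4 to the pair, raising its similarity from 0.5 to about 0.9. With d set to 1 no relative qualifies and the lexical score is returned unchanged.</p>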
      <p>
        The semantic gap between two matches corresponds to the inverse of the average semantic similarity between the two concepts from each ontology. Several metrics can be used to calculate the similarity between ontology concepts; in particular, measures based on information content have been shown to be successful [
        <xref ref-type="bibr" rid="ref6">9</xref>
        ]. In BLOOMS we currently implement three information content based similarity measures: Resnik [
        <xref ref-type="bibr" rid="ref6">10</xref>
        ], Lin [
        <xref ref-type="bibr" rid="ref7">11</xref>
        ] and a simple semantic difference between each concept's ICs. The information content of an ontology concept is a measure of its specificity in a given corpus. Many biomedical ontologies possess annotation corpora that are suited to this application. Nevertheless, semantic similarity can also be given by simpler methods based on edge distance and depth.
      </p>
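      <p>For readers unfamiliar with these measures, the sketch below gives the standard definitions of information content, Resnik similarity and Lin similarity in Python. The parent map, the annotation counts and the function names are invented for illustration, and counts are assumed to already include annotations inherited from descendants.</p>
      <preformat><![CDATA[
```python
import math


def information_content(counts, total):
    """IC(c) = -log p(c), with p(c) the share of annotations on c or on
    its descendants (counts are assumed to be propagated upward)."""
    return {c: -math.log(k / total) for c, k in counts.items()}


def ancestors(concept, parents):
    """The concept itself plus everything reachable through parent links."""
    seen, stack = set(), [concept]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def resnik(a, b, parents, ic):
    """IC of the most informative common ancestor of a and b."""
    common = ancestors(a, parents).intersection(ancestors(b, parents))
    return max((ic[c] for c in common), default=0.0)


def lin(a, b, parents, ic):
    """Resnik similarity scaled by the ICs of the two concepts."""
    denom = ic[a] + ic[b]
    return 2 * resnik(a, b, parents, ic) / denom if denom else 0.0


# Invented toy hierarchy and annotation counts.
parents = {"lung": ["organ"], "heart": ["organ"],
           "organ": ["anatomical entity"], "anatomical entity": []}
counts = {"anatomical entity": 16, "organ": 8, "heart": 4, "lung": 2}
ic = information_content(counts, total=16)

print(resnik("lung", "heart", parents, ic))
print(lin("lung", "heart", parents, ic))
```
]]></preformat>
      <p>Here the most informative common ancestor of the two concepts is "organ", so the Resnik score equals its IC and the Lin score rescales that value by the ICs of the two concepts.</p>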
      <p>Semantic Broadcast can also be applied iteratively, with each new run using the similarity matrix produced by the previous one.</p>
    </sec>
    <sec id="sec-6">
      <title>1.2.3 Alignment Extraction</title>
      <p>Alignment extraction in BLOOMS is sequential. After each matcher is run, alignments are extracted according to a predefined threshold of similarity and cardinality of matches, so that the concepts already aligned are not processed by matchers further down the line. Each successive matcher has its own predefined threshold.</p>
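      <p>The sequential extraction scheme can be sketched in Python as follows. The greedy one-to-one selection, the use of a greater-or-equal threshold test and the toy matchers are assumptions for illustration; the text states only that extraction uses a similarity threshold and a cardinality constraint after each matcher.</p>
      <preformat><![CDATA[
```python
def extract(scores, threshold):
    """Greedy one-to-one extraction: best-scoring pairs first, each
    concept used at most once (the cardinality rule is an assumption)."""
    used_a, used_b, accepted = set(), set(), {}
    for (a, b), s in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s >= threshold and a not in used_a and b not in used_b:
            accepted[(a, b)] = s
            used_a.add(a)
            used_b.add(b)
    return accepted


def run_pipeline(concepts_a, concepts_b, matchers):
    """Apply (matcher, threshold) steps in order, removing aligned
    concepts before the next matcher runs."""
    remaining_a, remaining_b = set(concepts_a), set(concepts_b)
    alignments = {}
    for matcher, threshold in matchers:
        scores = matcher(remaining_a, remaining_b, alignments)
        found = extract(scores, threshold)
        alignments.update(found)
        remaining_a -= {a for a, _ in found}
        remaining_b -= {b for _, b in found}
    return alignments


# Toy matchers standing in for Exact Match and Partial Match.
def exact(as_, bs, _aligned):
    return {(a, b): 1.0 for a in as_ for b in bs if a == b}


def word_overlap(as_, bs, _aligned):
    scores = {}
    for a in as_:
        for b in bs:
            wa, wb = set(a.split()), set(b.split())
            scores[(a, b)] = len(wa.intersection(wb)) / len(wa | wb)
    return scores


result = run_pipeline(
    {"lung", "left kidney"},
    {"lung", "kidney left"},
    [(exact, 1.0), (word_overlap, 0.9)],
)
print(result)
```
]]></preformat>
      <p>Each matcher also receives the alignments found so far, which is what a propagation step such as Semantic Broadcast would need as input.</p>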
    </sec>
    <sec id="sec-7">
      <title>1.3 Adaptations made for the evaluation</title>
      <p>With the purpose of participating in OAEI, BLOOMS was integrated into the AgreementMaker system [7] due to its extensible and modular architecture. We were particularly interested in benefiting from its ontology loading and navigation capabilities, and from its layered architecture, which allows for serial composition, since our approach combines two matching methods that need to be applied sequentially. Furthermore, we also exploited the visual interface during the optimization of our matching strategy: although it is not a requirement for our methods, we found it extremely useful, since it supports a very quick and intuitive evaluation.</p>
      <p>Since neither the mouse nor the human anatomy ontology has an annotation corpus, the Semantic Broadcast algorithm used a semantic similarity measure based on edge distance and depth, where similarity decreases with the number of edges between two concepts, and edges further away from the root correspond to higher levels of similarity.
      </p>
    </sec>
    <sec id="sec-8">
      <title>2 Results</title>
      <p>BLOOMS was only submitted to the anatomy track, since it is being specifically developed to handle biomedical ontologies. The anatomy track contains 4 tasks: in the first three tasks, matchers should be optimized to favor f-measure, precision and recall, in turn. In the fourth task, an initial set of alignments is given, which can be used to improve the matchers' performance. In addition to the classical measures of precision, recall and f-measure, the OAEI initiative also employs recall+, which measures the recall of non-trivial matches, since in the anatomy track a large proportion of matches can be achieved using simple string matching techniques.</p>
      <p>We only participated in tasks #1 and #2, since BLOOMS is designed to favor precision.</p>
      <p>2.1 anatomy</p>
      <p>Taking advantage of the SEALS platform, we ran several distinct configurations of BLOOMS, testing different parameters and also analyzing the contribution of each matcher to the final alignment.</p>
      <p>We found that after the first matcher is run, the alignments produced have a very high precision (0.98), but the recall is somewhat low (0.63). Each of the following matchers increases recall while slightly decreasing precision, which was expected given the increasing laxity they provide.</p>
      <p>We also found that weighting the partial match score using word evidence content did not significantly alter results when compared to the simple Jaccard similarity.</p>
      <p>For task #1 we used a Partial Match threshold of 0.9 and a final threshold of 0.4. Semantic Broadcast was run to propagate similarities through ancestors and descendants at a maximum distance of 2, and contribution was set to 0.4. Using the SEALS evaluation platform, we obtained 0.954 precision and 0.731 recall, for a final f-measure of 0.828 and a recall+ of 0.315.</p>
      <p>For task #2 we used a Partial Match threshold of 0.9 and did not use Semantic Broadcast. With this strategy, we ensured a higher precision, of 0.967. However, recall was not much lower than in task #1, at 0.725, which resulted in a final f-measure of 0.829.
We did not participate in other tasks, since BLOOMS was originally intended to yield a high precision, as it is meant to be run in a fully automated fashion as part of an ontology extension system.</p>
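      <p>As a quick consistency check on the figures reported for tasks #1 and #2, the f-measures can be recomputed from the stated precision and recall values, assuming the usual balanced f-measure (the harmonic mean of precision and recall). The function name below is ours.</p>
      <preformat><![CDATA[
```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f_measure(0.954, 0.731), 3))  # task #1 -> 0.828
print(round(f_measure(0.967, 0.725), 3))  # task #2 -> 0.829
```
]]></preformat>
      <p>Both values agree with the reported f-measures to three decimal places.</p>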
    </sec>
    <sec id="sec-9">
      <title>3 General comments</title>
      <p>We find that the SEALS platform is a very valuable tool for improving matching strategies. We find, however, that the 100-minute time limit might be detrimental to strategies that need to process large external resources.
      </p>
    </sec>
    <sec id="sec-10">
      <title>3.1 Comments on the results</title>
      <p>BLOOMS was designed to be as fully automated as possible, so it is geared more towards precision than recall. A comparison of our results for tasks #1 and #2 clearly indicates that our Semantic Broadcast strategy does not contribute heavily to recall, but that we do capture nearly 10% more matches when using both the Exact and Partial Match strategies than with Exact Match alone. Our recall+ is also not very high, again highlighting the need to expand our strategy to improve recall.</p>
      <p>Nevertheless, we find our performance to be comparable to the best systems in 2009, and in 2010 our f-measure in task #1 is 5% lower than that of the best performing system, whereas in task #2 we are the second best system, with a slight difference of 0.1% in precision. These are encouraging results and we fully intend to participate in future events with an improved version of BLOOMS.
      </p>
    </sec>
    <sec id="sec-11">
      <title>3.2 Discussions on the way to improve the proposed system</title>
      <p>We are planning to implement several strategies for improvement in the near future, some of which were already part of our initial strategy but had not yet been implemented at the time of OAEI 2010. To improve the lexical similarity matchers, future versions of BLOOMS will take into account spelling variants and mistakes, and we will also investigate the feasibility of using external resources such as UMLS and WordNet to increase the number of synonyms for both terms and words. We feel this would greatly improve the recall of our strategy. Regarding similarity propagation, we will work extensively on improving our Semantic Broadcast approach, by exploring alternative strategies for the computation of information content independently of an annotation corpus, thus expanding the number of semantic similarity measures that can be used. We will also adapt Semantic Broadcast to propagate dissimilarity, and so decrease the similarity between concepts that have a high lexical similarity but very distinct neighborhoods.
      </p>
      <p>4 Conclusion</p>
      <p>Participating in the anatomy track of OAEI 2010 has given us an opportunity to evaluate a matching algorithm developed with the practical purpose of being used in a semi-automated ontology extension system, Auxesia. Our matching algorithm, BLOOMS, is intended to be as automated as possible, and thus its current implementation favors precision. This was clearly visible in the results we obtained in tasks #1 and #2 of the anatomy track of OAEI 2010, where our precision values ranked within the top 3, but our recall was lower.</p>
      <p>In future versions of BLOOMS we will implement several strategies designed to improve recall, while minimizing precision loss.</p>
      <p>The lessons learned throughout this period will undoubtedly contribute to an improvement of our method.</p>
      <p>The work performed at the University of Lisbon by Catia Pesquita and Francisco M. Couto was supported by the Multiannual Funding Program and the PhD grant SFRH/BD/42481/2007.</p>
      <p>The work performed at UIC by Cosmin Stroe and Isabel F. Cruz has been partially sponsored by the National Science Foundation under Awards IIS-0513553 and IIS-0812258.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Ghazvinian, A., Noy, N., Musen, M. (<year>2009</year>). <article-title>Creating mappings for ontologies in biomedicine: Simple methods work</article-title>. In AMIA Annual Symposium (AMIA 2009).</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Caracciolo, C., Hollink, L., Ichise, R., Meilicke, C., Pane, J., Shvaiko, P. (<year>2008</year>). <article-title>Results of the Ontology Alignment Evaluation Initiative 2008</article-title>. The 7th International Semantic Web Conference.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Ferrara, A., Hollink, L., Isaac, A., Joslyn, C., Meilicke, C., Nikolov, A., Pane, J., Shvaiko, P., Spiliopoulos, V., Wang, S. (<year>2009</year>). <article-title>Results of the Ontology Alignment Evaluation Initiative 2009</article-title>. Fourth International Workshop on Ontology Matching, Washington, DC.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Zhang, S., Bodenreider, O. (<year>2007</year>). <article-title>Lessons Learned from Cross-Validating Alignments between Large Anatomical Ontologies</article-title>. MedInfo, <volume>12</volume>, <fpage>822</fpage>-<lpage>826</lpage>.</mixed-citation>
        <mixed-citation>5. Lambrix, P., Tan, H. (<year>2006</year>). <article-title>SAMBO - system for aligning and merging biomedical ontologies</article-title>. Web Semantics: Science, Services and Agents on the World Wide Web, <volume>4</volume>, <fpage>196</fpage>-<lpage>206</lpage>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>6. Jean-Mary, Y.R., Shironoshita, E.P., Kabuka, M.R. (<year>2009</year>). <article-title>Ontology matching with semantic verification</article-title>. Web Semantics: Science, Services and Agents on the World Wide Web, <volume>7</volume>, <fpage>235</fpage>-<lpage>251</lpage>.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>7. Cruz, I.F., Antonelli, F.P., Stroe, C. (<year>2009</year>). <article-title>AgreementMaker: Efficient Matching for Large Real-World Schemas and Ontologies</article-title>. PVLDB, <volume>2</volume>, <fpage>1586</fpage>-<lpage>1589</lpage>.</mixed-citation>
        <mixed-citation>8. Couto, F., Silva, M. &amp; Coutinho, P. (<year>2005</year>). <article-title>Finding genomic ontology terms in text using evidence content</article-title>. BMC Bioinformatics, <volume>6</volume>, <fpage>S21</fpage>.</mixed-citation>
        <mixed-citation>9. Pesquita, C., Faria, D., Falcão, A.O., Lord, P., Couto, F.M., Bourne, P.E. (<year>2009</year>). <article-title>Semantic Similarity in Biomedical Ontologies</article-title>. PLoS Computational Biology, <volume>5</volume>.</mixed-citation>
        <mixed-citation>10. Resnik, P. (<year>1998</year>). <article-title>Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language</article-title>. <source>Journal of Artificial Intelligence Research</source>.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>11. Lin, D. (<year>1998</year>). <article-title>An information-theoretic definition of similarity</article-title>. <source>Proc. of the 15th International Conference on Machine Learning</source>. San Francisco, CA: Morgan Kaufmann. pp. <fpage>296</fpage>-<lpage>304</lpage>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>