Improving bio-ontologies matching using types
             and adaptive weights

                   Bastien Rance1 and Christine Froidevaux1

                    LRI, UMR 8623, Univ. of Paris-Sud, CNRS
                          F-91405 Orsay CEDEX France
                           firstname.surname@lri.fr

    Functional annotation consists in assigning a biological function to a given
protein. It is a crucial task in biology and has various impacts on many fields,
including understanding cellular processes and drug designing. In order to be able
to share and reuse annotations, biologists and bioinformaticians have developed
structured controlled vocabularies that were first simple classifications and then
more elaborated ontologies such as the Gene Ontology [1].
    In our project, biologists and bioinformaticians collaborators are interested in
proteins annotated with two distinct ontologies, such that no protein is annotated
with both of them. These ontologies are merely functional hierarchies (Subtilist
[2] and FunCat [3]) that share common features: (i) a simple structure with no
explicit relationships (subsumption relationships can be deduced from concepts
identifiers), (ii) high broadness and small depth, and (iii) variable size.
    The system O’Browser [4] we have designed to align functional hierarchies,
is based on a weighted combination of matchers as many ontology matching
systems [5], with two original characteristics. Indeed, we had to face two issues:
(a) a high number of candidates pairs of concepts, and (b) a variable quality of
the results of the matchers with respect to the gold standard built by the expert.
    As the number of candidates pairs of concepts can be unnecessarily huge,
we propose to reduce it by exploiting domain knowledge. For it, we have used
types (groups of concepts sharing the same semantic context). Concepts that
are related to the same field (in our case the same functional genomic field) are
assigned to the same type. As an example, the concepts Utilization of Carbon
and Synthesis of Glucose are related to the type Metabolism. As in [6], concepts
of distinct types will never be mapped (e.g. Germination in the context of plants
and Germination in the context of bacteria). In our approach, an expert man-
ually assigns types to the top concepts of the hierarchies, that represent only a
small part of the whole set of concepts of both hierarchies. Types are then spread
to all concepts using subsumption relationships. In our experiment, the use of
types has allowed to divide the number of candidate pairs by 7. The originality
of our contribution is to propose a machine learning strategy to assign types to
concepts.
    The second issue is about the variable quality of the scores of a given matcher.
It has been shown that the good results of a matcher may be spoiled by the scores
of other matchers [7, 8]. To address this issue, we would like to give a high weight
to a matcher in a combination of matchers only when its results are informative.
We claim that the weight of a matcher in a combination should partially depend
on its scores (adaptive weighting). As an example, let us consider a string-
based matcher that compares concepts from two biological ontologies. If the
labels of the concepts are close, the two concepts are likely to be equivalent.
On the opposite, distant labels do not indicate necessarily that the concepts are
distant. Consequently the weight of the string-based matcher should be high for
high scores and weak for low scores.
For each matcher, we define a weighting function which associates a weight to
each score of the matcher. Let O1 (resp. O2 ) be the set of concepts of the first
(resp. second) ontology and let Mi be a matcher: O1 ×O2 → Domi , the weighting
function Wi is defined on Domi and has [0, 1] as a range. For example, assume
that the range of the string-based matcher is DomString−based = [0, 1]. Then
a weighting function could be the following simple function: WString−based :
[0, 1] → [0, 1], where WString−based (α) = 1 if α > 0.5 and WString−based (α) =
0.25 otherwise. Unlike in [9], we allow to associate a strong confidence (and
thus a high weight) to low results of a matcher in the case where the score
of the matcher is a strong indicator of the absence of equivalence between the
considered concepts.
     We successfully used types and adaptive weighting to align Subtilist and
FunCat and compared the results to the gold standard. O’Browser with adaptive
weighting found 80 % of the actual correspondences, while O’Browser with the
best classical matcher combination found only 70 % of them.


References
1. The Gene Ontology Consortium: Creating the gene ontology resource: design and
   implementation. Genome Res. 11 (2001) 1425–1433 http://www.geneontology.org.
2. Moszer, I., Jones, L., Moreira, S., Fabry, C., Danchin, A.: Subtilist: the reference
   database for the Bacillus subtilis genome. Nucleic Acids Res 30 (2002) 62–5
3. Ruepp, A., Zollner, A., Maier, D., Albermann, K., Hani, J., Mokrejs, M., Tetko, I.,
   Gldener, U., Mannhaupt, G., Mnsterktter, M., Mewes, H.: The FunCat, a functional
   annotation scheme for systematic classification of proteins from whole genomes.
   Nucleic Acids Res. 14((32)18) (2004) 5539–5545
4. Rance, B., Gibrat, J.F., Froidevaux, C.: An adaptive combination of matchers:
   application to the mapping of biological ontologies for genome annotation. In:
   Proc. of the 5th Data Integration in the Life Sciences workshop DILS’09. LNBI
   5647 (2009) 113–126
5. Euzenat, J., Shvaiko, P.: Ontology matching. Springer-Verlag, Heidelberg (DE)
   (2007)
6. Zhang, S., Mork, P., Bodenreider, O., Bernstein, P.A.: Comparing two approaches
   for aligning representations of anatomy. Artificial Intelligence in Medicine 39(3)
   (2007) 227–236
7. Ghazvinian, A., Noy, N.F., Musen, M.A.: Creating mappings for ontologies in
   biomedicine: Simple methods work. Technical report, Stanford Center for Biomed-
   ical Informatics Research (2009)
8. Ontology Alignment Evaluation Initiative: http://www.oaei.ontologymatching.org
9. Mork, P., Seligman, L., Rosenthal, A., Korb, J., Wolf, C.: The harmony integration
   workbench. J. Data Semantics 11 (2008) 65–93