Measuring the importance of annotation granularity to the detection of semantic similarity between phenotype profiles

Prashanti Manda1, James P. Balhoff2 and Todd J. Vision1
1 Department of Biology, University of North Carolina at Chapel Hill, NC, USA
2 RTI International, NC, USA
manda.prashanti@gmail.com, jbalhoff@rti.org, tjv@bio.unc.edu

Abstract—In phenotype annotations curated from the biological and medical literature, considerable human effort must be invested to select ontological classes that capture the expressivity of the original natural language descriptions, and finer annotation granularity can also entail higher computational costs for particular reasoning tasks. Do coarse annotations suffice for certain applications? Here, we measure how annotation granularity affects the statistical behavior of semantic similarity metrics. We use a randomized dataset of phenotype profiles drawn from 57,051 taxon-phenotype annotations in the Phenoscape Knowledgebase. We compared query profiles having variable proportions of matching phenotypes to subject database profiles using both pairwise and groupwise Jaccard (edge-based) and Resnik (node-based) semantic similarity metrics, and compared statistical performance for three different levels of annotation granularity: entities alone, entities plus attributes, and entities plus qualities (with implicit attributes). All four metrics examined showed more extreme values than expected by chance when approximately half the annotations matched between the query and subject profiles, with a more sudden decline for pairwise statistics and a more gradual one for the groupwise statistics. Annotation granularity had a negligible effect on the position of the threshold at which matches could be discriminated from noise. These results suggest that coarse annotations of phenotypes, at the level of entities with or without attributes, may be sufficient to identify phenotype profiles with statistically significant semantic similarity.

Keywords—ontology, phenotype, curation, annotation granularity, semantic similarity

I. INTRODUCTION

To make phenotype descriptions in the biological and medical literature amenable to large-scale discovery and computation, a variety of efforts have been launched to convert such descriptions into logical expressions using ontologies and to integrate them into the larger ecosystem of online, open biological information resources [1]. Typically, this involves curation and annotation of phenotypes in the Entity-Quality (EQ) formalism [2], which is widely used by model organism communities for representation of gene phenotypes [3].

The EQ formalism has more recently been adopted by the Phenoscape project to curate phenotypes from the literature that are reported to vary among evolutionary lineages [4], with the goal of linking them to gene phenotypes and generating hypotheses about the genetic bases of evolutionary transitions [5].

In the EQ approach, an entity represents a biological object, e.g. an anatomical structure, an anatomical space, or a biological process, while a quality represents a trait or property that an entity possesses, e.g. shape, color, or size. Curators often create complex logical expressions called post-compositions by combining ontology terms, relations, and spatial properties from multiple ontologies in different ways to create entities and qualities that adequately represent phenotypic descriptions. For example, "big supraorbital bone" is represented as E: supraorbital bone (UBERON 0004747), Q: enlarged size (PATO 0000586). A more complex description such as "parietal fused with supraoccipital bone" is represented by relating the two affected entities, supraorbital bone (UBERON 0004747) and parietal (UBERON 2001997), using the quality fused with (PATO 0000642).

Annotation of phenotypes at this level of ontological detail is time consuming and expensive [6]. Annotating evolutionary phenotypes at the finest level of granularity often requires curators to create new ontology terms and request that those terms be added to the ontology. Coarse annotation removes the need for ontology development by limiting curators to a small set of attribute-level qualities already present in the ontology. Reducing the effort spent on curatorial tasks such as ontology development and data preparation improves the annotation rate from two characters per hour to 14 characters per hour [6]. Thus coarse annotation can be part of an efficient annotation workflow, and permit larger datasets to be curated for equivalent resources. In addition, reasoning over the combinatorial entity and quality ontology space for EQ annotations poses a serious computational challenge.

Given these competing considerations, what level of annotation granularity is optimal? The answer may depend on the particular application. For Phenoscape, a major goal is to be able to find sets of phenotypes that show greater semantic similarity than would be expected by chance when comparing sets of phenotypes from different biological domains (e.g. those observed in evolutionary lineages versus those induced by genetic manipulations in the laboratory) [5].
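The EQ examples above can be made concrete with a minimal sketch of an annotation record. This is an illustrative data model only, not the Phenoscape data model: real post-compositions are OWL class expressions rather than flat records, and the field names here are ours. The IDs are written in the standard CURIE form.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class EQAnnotation:
    """One Entity-Quality phenotype annotation (illustrative model only)."""
    entity: str                            # ontology ID of the affected entity
    quality: str                           # ontology ID of the quality
    related_entity: Optional[str] = None   # second entity, for relational qualities

# "big supraorbital bone": a simple entity-quality pair
simple = EQAnnotation(entity="UBERON:0004747", quality="PATO:0000586")

# A relational quality ("fused with") linking two entities
relational = EQAnnotation(entity="UBERON:0004747",
                          quality="PATO:0000642",
                          related_entity="UBERON:2001997")
```

Coarsening such an annotation amounts to truncating it: E keeps only `entity`, while EA replaces `quality` with its parent attribute.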
When comparing phenotypes with such different biological origins, we would not expect to see congruence in fine detail, for a variety of reasons. For instance, even if the same or homologous genes have contributed to the two profiles, independent changes to those genes may underpin the phenotypes, they may be in lineages for which the genetic networks have diverged, and there may have been considerable evolutionary modification of the phenotype since its first origin. Even if two biological phenotypes are identical, the way in which the phenotypes are observed and described by independent researchers may lead to natural language descriptions, and thus profiles of annotations, that are quite different. With such weak matches, do finer annotations enable similarities to be detected, or are finer annotations superfluous or even distracting?

To explore this issue, we have conducted experiments to test the statistical sensitivity of semantic similarity at varying annotation granularity. Our approach involves simulating phenotype profiles by sampling from real annotations drawn from the Phenoscape Knowledgebase [5]. We measured similarity between profiles that shared all, some, or none of their annotations, with the remainder drawn randomly from the population of annotations. We assessed the decline of semantic similarity to the point at which it could no longer be discriminated from random chance. This was done for four different semantic similarity statistics, and for three levels of annotation granularity.
II. METHODS

A. Semantic similarity metrics

The four semantic similarity statistics we have chosen represent extremes along two different dimensions by which semantic similarity metrics vary [7-10]. Edge-based semantic similarity metrics use the distance between terms in the ontology as a measure of similarity. Node-based measures use the Information Content of the annotations to the terms being compared and/or their least common subsumer. The similarity metrics we have chosen are based on Jaccard (edge-based) and Resnik (node-based) similarity, which are popular in biological applications (e.g. [11]). For each, we have one version that summarizes the distribution of pairwise similarities between the two sets of annotations, and another that calculates a groupwise score directly.

1) Jaccard similarity: The Jaccard similarity (sJ) of two classes A and B in an ontology is defined as the ratio of the number of classes in the intersection of their subsumers to the number of classes in the union of their subsumers [12]:

    sJ(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|

where S(A) is the set of classes that subsume A.

2) Resnik similarity: The Information Content of ontology class A, denoted I(A), is defined as the negative logarithm of the proportion of profiles annotated to that class, f(A), out of T profiles in total:

    I(A) = −log(f(A)/T)

Since the minimum value of I(A) is zero, at the root of the ontology, while the maximum value is −log(1/T), we can compute a Normalized Information Content (In) with range [0, 1]:

    In(A) = I(A) / (−log(1/T))

The Resnik similarity (sR) of two ontology classes is defined as the Normalized Information Content of the least common subsumer (LCS) of the two classes:

    sR(A, B) = In(LCS(A, B))
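The two class-level measures can be sketched as follows over a toy is-a hierarchy standing in for a real ontology (the function names and the example data are ours, purely illustrative):

```python
import math

# Toy is-a hierarchy: child -> parents
PARENTS = {
    "root": [],
    "bone": ["root"],
    "skull bone": ["bone"],
    "parietal": ["skull bone"],
    "supraorbital": ["skull bone"],
}

# Toy annotation counts: f(A) profiles annotated to class A, out of T total
FREQ = {"root": 10, "bone": 8, "skull bone": 4, "parietal": 2, "supraorbital": 2}
T = 10

def subsumers(cls):
    """S(cls): the class itself plus all of its ancestors."""
    seen, stack = set(), [cls]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(PARENTS[c])
    return seen

def jaccard(a, b):
    """sJ(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|."""
    sa, sb = subsumers(a), subsumers(b)
    return len(sa & sb) / len(sa | sb)

def norm_ic(cls):
    """In(A) = I(A) / -log(1/T), where I(A) = -log(f(A)/T)."""
    return -math.log(FREQ[cls] / T) / -math.log(1 / T)

def resnik(a, b):
    """sR(A, B): normalized IC of the most informative common subsumer."""
    return max(norm_ic(c) for c in subsumers(a) & subsumers(b))
```

Here `jaccard("parietal", "supraorbital")` gives 3/5 = 0.6 (three of the five pooled subsumers are shared), and `resnik` picks out skull bone. Note that the LCS is operationalized as the common subsumer with the highest information content, the usual practice when the hierarchy is a DAG.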
B. Profile similarity

A set of ontology-based phenotype annotations is called a phenotype profile. When comparing two profiles, X and Y, where each has at least one, and potentially many, annotations, we could either summarize all the pairwise combinations of annotations, or we could compute a groupwise similarity measure directly as a function of graph overlap.

1) Best Pairs: Pairwise approaches summarize the distribution of pairwise Jaccard or Resnik similarity scores between annotations in the two profiles. Here we use the Best Pairs score. For each annotation in X, the best scoring match in Y is determined, and the median of the |X| resultant values is taken. Similarly, for each annotation in Y, the best scoring match in X is determined, and the median of the |Y| values is taken. The Best Pairs score pz(X, Y) is the mean of these two medians. The index z denotes whether the pairwise values are Resnik (z = R) or Jaccard (z = J):

    pz(X, Y) = (1/2)[bz(X, Y) + bz(Y, X)]

where

    bz(X, Y) = median over i = 1, ..., |X| of sz(Xi, Yj*),
    with j* = argmax over j = 1, ..., |Y| of sz(Xi, Yj)

Note that, as defined, pz(X, Y) = pz(Y, X).

2) Groupwise: Groupwise approaches compare profiles directly based on set operations or graph overlap. The Groupwise Jaccard similarity of profiles X and Y, gJ(X, Y), is defined as the ratio of the number of classes in the intersection to the number of classes in the union of the two profiles:

    gJ(X, Y) = |C(X) ∩ C(Y)| / |C(X) ∪ C(Y)|

where C(X) is the set of classes belonging to X plus their subsumers.

Similarly, the Groupwise Resnik similarity of profiles X and Y, gR(X, Y), is defined as the ratio of the normalized information content summed over all classes in the intersection of C(X) and C(Y) to the normalized information content summed over all classes in their union:

    gR(X, Y) = Σ_{t ∈ C(X) ∩ C(Y)} In(t) / Σ_{t ∈ C(X) ∪ C(Y)} In(t)

where C(X) is defined as above.
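The four profile-level statistics can be sketched as follows, assuming the caller supplies a pairwise class similarity `sim`, a `subsumers` function mapping an annotation to its class plus ancestors, and a per-class `norm_ic` (the function names are ours, not from a published library):

```python
from statistics import median

def best_pairs(X, Y, sim):
    """pz(X, Y): mean of the two medians of best-match scores."""
    bxy = median(max(sim(x, y) for y in Y) for x in X)
    byx = median(max(sim(y, x) for x in X) for y in Y)
    return 0.5 * (bxy + byx)

def profile_classes(profile, subsumers):
    """C(X): every class in the profile together with its subsumers."""
    return set().union(*(subsumers(a) for a in profile))

def groupwise_jaccard(X, Y, subsumers):
    """gJ(X, Y): class-set overlap between the two profiles."""
    cx, cy = profile_classes(X, subsumers), profile_classes(Y, subsumers)
    return len(cx & cy) / len(cx | cy)

def groupwise_resnik(X, Y, subsumers, norm_ic):
    """gR(X, Y): normalized IC summed over the intersection vs. the union."""
    cx, cy = profile_classes(X, subsumers), profile_classes(Y, subsumers)
    return sum(norm_ic(t) for t in cx & cy) / sum(norm_ic(t) for t in cx | cy)
```

The symmetry pz(X, Y) = pz(Y, X) follows directly from averaging the two directional medians.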
Fig. 1. Profile decay via iterative replacement. Query profiles are selected from the pool of simulated profiles (lower left). Filled circles represent annotations, and annotations within the same profile are enclosed by boxes. Circles of the same color represent the same annotation. At each iteration, one of the remaining original annotations in the query profile is replaced with a randomly selected annotation from the pool. The process continues until each of the annotations in the original query profile has been replaced.

Fig. 2. Pattern of similarity decay with E, EA, and EQ data as profiles are decayed via Random Replacement. [Four panels: Best Pairs and Groupwise columns; Jaccard and Resnik rows; y-axis: similarity score; x-axis: number of annotations replaced (1-10); one series each for E, EA, and EQ.] Solid lines represent the mean best match similarity of the 5 query profiles to the database after each annotation replacement. Error bars show 2 standard errors of the mean. Dotted lines represent the 99.9th percentile of the noise distribution.

C. Source data

The Phenoscape Knowledgebase contains a dataset of 661 taxa with 57,051 evolutionary phenotypes, which are phenotypes that have been inferred to vary among a taxon's immediate descendants [5]. A simulation dataset of subject profiles having the same size distribution of annotations per taxon was created by permutation of the taxon labels.

D. Simulating profile 'decay'

To simulate decay of profile similarity, five query profiles of size ten were randomly selected from the simulated dataset. For each, there is one profile among the set of subjects for which each annotation has a one-to-one perfect match. For each of the five profiles, ten progressively decayed profiles were obtained by iteratively replacing one of the original annotations with an annotation randomly selected from among the 57,051 available (Figure 1). Thus, for each original profile, there is a profile in which one original annotation has been replaced with a random annotation, another in which two have been replaced, and so on, through to a fully decayed profile in which all original annotations have been replaced with a random one. To characterize the noise distribution for each metric in the absence of semantic similarity, we also generated 5,000 profiles of size ten by drawing annotations randomly from among the 57,051 available. These profiles would not be expected to have more than nominal similarity with any of the simulated subject profiles.

E. Adjusting annotation granularity

The evolutionary phenotypes available from Phenoscape have been annotated with both entities and qualities, and the intermediate level of attribute is implicit in the quality annotation due to the structure of the PATO quality ontology [4]. In order to measure semantic similarity at three levels of granularity, entity only (E), entity-attribute (EA), and entity-quality (EQ), we used three different phenotype ontologies, one for each granularity level, containing phenotype concepts combining terms from Uberon (entities) and PATO (attributes and qualities). In each evaluation, annotations in the query profiles and the simulated database will match at the granularity level available in the generated phenotype ontology.
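The decay procedure of Section D can be sketched as below. This is a simplified stand-in, not the authors' code: annotations are opaque strings, and the pool size is shrunk for illustration (the real pool holds 57,051 annotations).

```python
import random

def decay_series(profile, pool, rng):
    """Yield copies of `profile` with 1, 2, ..., len(profile) of the
    original annotations replaced by random draws from `pool`."""
    decayed = list(profile)
    positions = list(range(len(profile)))
    rng.shuffle(positions)          # decide the replacement order up front
    for pos in positions:
        decayed[pos] = rng.choice(pool)
        yield list(decayed)

rng = random.Random(42)
pool = [f"anno_{i}" for i in range(1000)]  # stand-in for the annotation pool
query = rng.sample(pool, 10)               # one query profile of size ten
series = list(decay_series(query, pool, rng))

# Pure-noise profiles characterize the null distribution for each metric
noise = [rng.sample(pool, 10) for _ in range(5000)]
```

Each element of `series` is one step further decayed, and the last has had every original position replaced.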
III. RESULTS AND DISCUSSION

We measured semantic similarity between each of the five query profiles and their decay series to all 661 profiles in the subject database. This was done for each of the four semantic similarity metrics (Best Pairs and Groupwise variants of the Jaccard and Resnik metrics) and for each of the three granularity levels (E: Entity only, EA: Entity-Attribute, and EQ: Entity-Quality). The results are shown in Figure 2. For ease of interpretation, we take the 99.9th percentile of the similarity distribution for random profile matches as an arbitrary threshold for comparing the sensitivity of the different series. All series cross this threshold when approximately half of the annotations have been replaced, with a sudden decline in similarity for the Best Pairs statistics and a more gradual decline for the groupwise statistics. While the differences in sensitivity among the annotation granularity levels are subtle, the annotations of intermediate granularity (EA) have marginally greater sensitivity for all four statistics.

The sharp decline in similarity under the Best Pairs statistics at approximately 50% decay can be understood as a result of summarizing the pairwise distribution with the median.
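The thresholding logic can be made concrete with a small sketch. The scores below are synthetic, not our actual results; in practice the noise scores would come from scoring the 5,000 random profiles against the subject database with one of the four metrics.

```python
import random

rng = random.Random(0)

# Stand-in scores: a noise distribution and a declining decay series
noise_scores = [rng.uniform(0.0, 0.4) for _ in range(5000)]
decay_scores = [1.0 - 0.09 * k for k in range(11)]  # after 0..10 replacements

def percentile(values, q):
    """Empirical q-th percentile by nearest rank."""
    ordered = sorted(values)
    rank = min(len(ordered) - 1, int(q / 100.0 * len(ordered)))
    return ordered[rank]

threshold = percentile(noise_scores, 99.9)
# First decay step at which the series can no longer be
# discriminated from noise
crossing = next(k for k, s in enumerate(decay_scores) if s < threshold)
```

A series is counted as sensitive up to (but not including) the crossing step.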
In future work, we aim to explore how the sensitivity of pairwise statistics might be tuned by using different percentiles. Given the relatively flat performance of the Best Pairs statistics when decay was under 50%, we suggest that groupwise statistics are likely to provide greater discrimination between true matches of varying quality, and are thus better suited to rank ordering the outcomes of semantic similarity searches, e.g. [5]. Our results also illustrate how difficult it can be to statistically discriminate weakly matching profiles from noise, something which has received relatively little consideration in many applications of semantic similarity search to date.

The relatively minor differences in statistical performance with varying annotation granularity, with EA showing marginally greater sensitivity, have implications both for the process of generating annotations and for the implementation of semantic similarity computation. As noted in the Introduction, annotation to EA requires considerably less human curation effort than EQ, and is almost identical in effort to curation to E. Restricting annotation granularity to EA may also ease the challenge of speeding up curation through machine-aided natural language processing, e.g. [13].

Second, the computational expense of measuring semantic similarity can be prohibitive for fine-grained annotations due to an explosion in the number of classes required for reasoning when annotations draw from multiple ontologies [14]. If the inclusion of qualities does not improve sensitivity, that opens up the possibility of conducting fast, scalable, on-the-fly web-based semantic searches at coarser annotation levels.

One of the contributions of this work is in introducing a framework for evaluating the statistical sensitivity of semantic similarity metrics. Nonetheless, the results reported here are specific to one particular model for the decay of similarity between two profiles, in which some portion of the annotations match perfectly while the others do not match at all. We recognize a need to explore other models, especially ones where pairs of annotations may match imperfectly. We also propose that other evaluation criteria should be examined to more fully understand the trade-offs involved in building datasets with a particular level of annotation granularity.

IV. ACKNOWLEDGEMENTS

We thank W. Dahdul, T. A. Dececchi, N. Ibrahim and L. Jackson for curation of the original dataset, along with the larger community of ontology contributors and data providers (http://phenoscape.org/wiki/Acknowledgments#Contributors), and useful feedback from P. Mabee, H. Lapp, W. Dahdul, and other members of the Phenoscape team. This work was funded by the National Science Foundation (DBI-1062542).

REFERENCES

[1] A. R. Deans, S. E. Lewis, E. Huala, S. S. Anzaldo, M. Ashburner, J. P. Balhoff, D. C. Blackburn, J. A. Blake, J. G. Burleigh, B. Chanet, L. D. Cooper, M. Courtot, S. Csősz, H. Cui, W. Dahdul, S. Das, A. T. Dececchi, A. Dettai, R. Diogo, R. E. Druzinsky, M. Dumontier, N. M. Franz, F. Friedrich, G. V. Gkoutos, M. Haendel, L. J. Harmon, T. F. Hayamizu, Y. He, H. M. Hines, N. Ibrahim, L. M. Jackson, P. Jaiswal, C. James-Zorn, S. Köhler, G. Lecointre, H. Lapp, C. J. Lawrence, N. Le Novère, J. G. Lundberg, J. Macklin, A. R. Mast, P. E. Midford, I. Mikó, C. J. Mungall, A. Oellrich, D. Osumi-Sutherland, H. Parkinson, M. J. Ramírez, S. Richter, P. N. Robinson, A. Ruttenberg, K. S. Schulz, E. Segerdell, K. C. Seltmann, M. J. Sharkey, A. D. Smith, B. Smith, C. D. Specht, B. Squires, R. W. Thacker, A. Thessen, J. Fernandez-Triana, M. Vihinen, P. D. Vize, L. Vogt, C. E. Wall, R. L. Walls, M. Westerfield, R. A. Wharton, C. S. Wirkner, J. B. Woolley, M. J. Yoder, A. M. Zorn, and P. Mabee, "Finding our way through phenotypes," PLoS Biology, vol. 13, no. 1, p. e1002033, 2015.
[2] G. V. Gkoutos, E. C. Green, A.-M. Mallon, J. M. Hancock, and D. Davidson, "Using ontologies to describe mouse phenotypes," Genome Biology, vol. 6, no. 1, p. R8, 2004.
[3] C. J. Mungall, G. V. Gkoutos, C. L. Smith, M. A. Haendel, S. E. Lewis, and M. Ashburner, "Integrating phenotype ontologies across multiple species," Genome Biology, vol. 11, no. 1, p. R2, 2010.
[4] W. M. Dahdul, J. P. Balhoff, J. Engeman, T. Grande, E. J. Hilton, C. Kothari, H. Lapp, J. G. Lundberg, P. E. Midford, T. J. Vision, M. Westerfield, and P. M. Mabee, "Evolutionary characters, phenotypes and ontologies: curating data from the systematic biology literature," PLoS One, vol. 5, no. 5, p. e10708, 2010.
[5] P. Manda, J. P. Balhoff, H. Lapp, P. Mabee, and T. J. Vision, "Using the Phenoscape Knowledgebase to relate genetic perturbations to phenotypic evolution," Genesis, vol. 53, no. 8, pp. 561–571, 2015.
[6] W. Dahdul, T. A. Dececchi, N. Ibrahim, H. Lapp, and P. Mabee, "Moving the mountain: analysis of the effort required to transform comparative anatomy into computable anatomy," Database, vol. 2015, p. bav040, 2015.
[7] C. Pesquita, D. Faria, A. O. Falcao, P. Lord, and F. M. Couto, "Semantic similarity in biomedical ontologies," PLoS Computational Biology, vol. 5, no. 7, p. e1000443, 2009.
[8] R. M. Othman, S. Deris, and R. M. Illias, "A genetic similarity algorithm for searching the gene ontology terms and annotating anonymous protein sequences," Journal of Biomedical Informatics, vol. 41, no. 1, pp. 65–81, 2008.
[9] J. Z. Wang, Z. Du, R. Payattakool, S. Y. Philip, and C.-F. Chen, "A new method to measure the semantic similarity of GO terms," Bioinformatics, vol. 23, no. 10, pp. 1274–1281, 2007.
[10] X. Wu, E. Pang, K. Lin, and Z.-M. Pei, "Improving the measurement of semantic similarity between gene ontology terms and gene products: insights from an edge- and IC-based hybrid method," PLoS One, vol. 8, no. 5, p. e66745, 2013.
[11] N. L. Washington, M. A. Haendel, C. J. Mungall, M. Ashburner, M. Westerfield, and S. E. Lewis, "Linking human diseases to animal models using ontology-based phenotype annotation," PLoS Biology, vol. 7, no. 11, p. e1000247, 2009.
[12] M. Mistry and P. Pavlidis, "Gene ontology term overlap as a measure of gene functional similarity," BMC Bioinformatics, vol. 9, no. 1, p. 327, 2008.
[13] H. Cui, W. Dahdul, A. T. Dececchi, N. Ibrahim, P. Mabee, J. P. Balhoff, and H. Gopalakrishnan, "CharaParser+EQ: performance evaluation without gold standard," Proceedings of the Association for Information Science and Technology, vol. 52, pp. 1–10, 2015.
[14] P. Manda, C. Mungall, J. P. Balhoff, H. Lapp, and T. Vision, "Investigating the importance of anatomical homology for cross-species phenotype comparisons using semantic similarity," Biocomputing 2016, pp. 132–143, 2016.