Definition Coverage in the OBO Foundry Ontologies: The Big Picture

Selja Seppälä
Department of Health Outcomes and Policy
University of Florida, FL, USA
Email: sseppala@ufl.edu

Daniel R. Schlegel and Peter L. Elkin
Department of Biomedical Informatics
University at Buffalo, SUNY, Buffalo, NY, USA
Email: @buffalo.edu

I. INTRODUCTION

High quality ontologies have both textual and logical definitions for their terms. Definitions serve many purposes: good textual definitions allow experts and non-experts alike to understand the content of an ontology and use it in the manner the authors intended; logical definitions are necessary for reasoners to verify that an ontology is consistent, and may make application of the ontology easier for users. Ideally, logical and textual definitions would convey the same information, and each can provide an accuracy check on the other [1], [2].

Producing definitions is difficult and time-consuming. Thus, despite the best efforts of ontology developers and the existence of a number of tools and methods to populate ontologies with definitions, it is not uncommon to see missing textual or logical definitions, if not both. This is also the case in the Open Biomedical Ontologies (OBO) Foundry [3] ontologies.

The OBO Foundry contains 9 ‘core’ ontologies and 128 non-core ontologies.¹ These ontologies are developed in a coordinated way according to a set of shared principles.² One of the OBO Foundry principles is about definitions: member ontologies should have “textual definitions ... for a substantial and representative fraction [of terms], plus equivalent formal definitions (for at least a substantial number of terms).”³ The statement of this principle is rather vague and elicits an obvious question: How much is ‘substantial’?

We examine the coverage of textual and logical definitions throughout the OBO Foundry ontologies. In particular, we aim to determine: (1) if the prevalence of definitions is different between the core and non-core ontologies; (2) if there are more textual than logical definitions; (3) if the size of ontologies has an effect on definitional coverage. To conclude, we discuss ways of quantifying the notion of ‘substantial’ definition coverage to determine to what extent the principle of having textual and logical definitions for a substantial number of terms is upheld.

¹ Our study focuses on 119 ontologies out of the 137 present in the OBO Foundry, since 18 non-core ontologies were either unavailable on the web due to broken links, or they failed to load using the OWL API.
² http://obofoundry.org/principles/fp-000-summary.html.
³ http://obofoundry.org/principles/fp-006-textual-definitions.html.

II. METHODS

Textual definitions tell us about the properties of the instances of a class in an ontology. They typically have two parts: (i) a genus that states the type of thing of which they are instances, and (ii) one or more differentia(e) that state the properties of these instances that differentiate them from instances of neighboring types.

To identify textual definitions, we used the IAO annotation property definition, which is used in 103 of the 119 ontologies in this study. We also examined the set of annotation properties used in the OBO Foundry ontologies that contained the string ‘def’ but did not contain the strings ‘editor’, ‘source’, ‘citation’, ‘defines’, or ‘defined’, to try to capture any non-standard annotation properties which might have been used to signal a definition. We also included the IAO annotation property elucidation for ontologies that contain some primitive classes that cannot be, strictly speaking, defined.

One of the main components of ontologies is the class, which is defined by class expressions. Class expressions represent conditions that individuals must satisfy to be members of a class. Some axioms, such as SubClassOf and EquivalentClass, define relationships between class expressions. These two axiom types, specifically, constitute the logical definitions of the ontology terms. We say an axiom contains a genus for the definition of class c1 if the axiom contains some other class, c2, where c2 is not part of an object property restriction; an axiom contains one or more differentiae for the definition of a class if the axiom contains any object property restrictions.

For each ontology, we computed the number of classes that contain: (i) at least one genus; (ii) at least one differentia; and (iii) at least one of both. A class specified by both a genus and one or more differentiae has a complete logical definition.

III. RESULTS

We review our results in light of our goals stated in Section I.

Item (1): Table I shows that the prevalence of definitions is different between the core and non-core ontologies. We found that coverage within the 9 core ontologies was quite high, with 6 having textual definitions for over 90% of their terms. On average, core ontologies have textual definitions for 85.6% of their terms (stdev = 21%); non-core ontologies, 63% (stdev = 38%). Coverage for complete logical definitions among the core ontologies was 53% (stdev = 34%), and only 28% (stdev = 29%) for the non-core ontologies. Over the full set of analyzed OBO Foundry ontologies, textual definition coverage is on average 66% (stdev = 37%) and complete logical definition coverage, 30% (stdev = 30%).

TABLE I
COVERAGE OF TEXTUAL DEFINITIONS, LOGICAL DEFINITIONS, AND PARTS OF LOGICAL DEFINITIONS ACROSS THE CORE, NON-CORE, AND SUM TOTAL OF THE ANALYZED ONTOLOGIES IN THE OBO FOUNDRY.

                                 Core    Non-Core    Total
Textual Definition Coverage      86%     64%         66%
Logical Definition Coverage      53%     28%         30%
Genera Coverage                  91%     86%         86%
Genera Only Coverage             39%     58%         57%
Differentiae Coverage            53%     34%         36%
Differentiae Only Coverage        0%      6%          6%

Item (2): Figure 1 shows that the studied ontologies have more textual than logical definitions and that the trends are nearly opposite. Relatively few ontologies have poor textual definition coverage, while a large number have 90-100% coverage. Conversely, a large number of ontologies have very poor logical definition coverage (0-10%), and few have good logical definition coverage.

Item (3): Figure 2 shows a correlation between ontology size and logical definition coverage. We grouped the ontologies as follows: ‘very small’ (0-99 terms, n=17); ‘small’ (100-999, n=42); ‘medium’ (1,000-9,999, n=44); ‘large’ (10,000-99,999, n=11); and ‘very large’ (100,000+, n=3). We found that nearly all groups had textual definitions for roughly 60-70% of their terms. The ‘large’ category formed the only outlier, with 33% coverage. We examined logical definition coverage in three ways, as the percent of classes: with genera; with differentiae; and with both. The percent coverage of complete logical definitions rose slowly as ontology size grew.

IV. DISCUSSION AND CONCLUSION

Determining if the principle of having textual and logical definitions for a substantial number of terms is upheld requires quantifying the notion of ‘substantial’ definition coverage.

If we consider that ‘substantial’ equates with the average definition coverage measured over the core ontologies, then adequate coverage for inclusion in the OBO Foundry would be at least 86% of the terms specified with a textual definition and 53% with a complete logical definition. Considering instead all of the (analyzed) ontologies in the OBO Foundry, we get 66% and 30%, respectively.

To expect that all ontologies have coverage as complete as the core ontologies is unrealistic. Therefore, we quantify ‘substantial’ at roughly 65% for textual definitions, and propose that logical definitions be held to this standard as well.

Having set a measure for substantial definition coverage in the OBO Foundry ontologies, our results show that on average there is substantial coverage of textual definitions, but not of logical definitions.

Definitions, both logical and textual, are essential components of an ontology. The OBO Foundry has the noble goal of creating a repository for ontologies developed using a shared set of principles, including some (vague) requirements for including definitions. This study is the first one not only to analyze the “big picture” of definition coverage in the OBO Foundry, but also to suggest a numeric value for ‘substantial’ definition coverage.

REFERENCES

[1] S. Seppälä, Y. Schreiber, and A. Ruttenberg, “Textual and logical definitions in ontologies,” in Proceedings of DIKR 2014, IWOOD 2014, and OBIB 2014, R. Boyce et al., Eds., vol. Vol-1309. Houston, TX, USA: CEUR Workshop Proceedings (CEUR-WS.org), October 6-7 2014, pp. 35–41. [Online]. Available: http://ceur-ws.org/Vol-1309/
[2] S. Seppälä, Y. Schreiber, A. Ruttenberg, and B. Smith, “Definitions in ontologies,” Cahiers de lexicologie, vol. 4, no. Numéro thématique “Au coeur de la définition”, forthcoming.
[3] B. Smith, M. Ashburner, C. Rosse et al., “The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration,” Nature biotechnology, vol. 25, no. 11, pp. 1251–1255, 2007.
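
The annotation property filter described in Section II (names containing ‘def’ but none of ‘editor’, ‘source’, ‘citation’, ‘defines’, or ‘defined’) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name, the sample property names, and the case-insensitive matching are all assumptions made for illustration.

```python
# Substrings that disqualify a property name, as listed in Section II.
EXCLUDED_SUBSTRINGS = ("editor", "source", "citation", "defines", "defined")


def looks_like_definition_property(name: str) -> bool:
    """Return True if the name contains 'def' and none of the excluded strings.

    Matching is case-insensitive here; that is an assumption, since the
    paper does not say how case was handled.
    """
    lowered = name.lower()
    if "def" not in lowered:
        return False
    return not any(term in lowered for term in EXCLUDED_SUBSTRINGS)


# Hypothetical property names, chosen only to exercise the filter.
candidates = [
    "definition",
    "definition source",
    "definition editor",
    "elucidation",
    "isDefinedBy",
    "textual_def",
    "label",
]
selected = [name for name in candidates if looks_like_definition_property(name)]
print(selected)
```

Note that ‘elucidation’ does not pass this filter; in the study it is included separately, as a named exception for primitive classes.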
[Figure 1: “Number of Ontologies with % Definition Coverage”. x-axis: Percent Coverage (0-10 through 91-100); y-axis: Number of Ontologies; series: Textual Definitions, Complete Logical Definitions.]
Fig. 1. The number of ontologies with percent coverage of textual and complete logical definitions.

[Figure 2: “Definition Coverage by Ontology Size”. x-axis: Ontology Size (terms): Very Small (0-99), Small (100-999), Medium (1,000-9,999), Large (10,000-99,999), Very Large (100,000+); y-axis: Coverage Percent; series: Textual Definition Coverage, Logical Genera Coverage, Logical Differentia Coverage, Complete Logical Definition.]
Fig. 2. The coverage of textual and logical definitions by ontology size. Both the genus and differentia components of the logical definitions are shown, along with coverage for the complete logical definitions.
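
The genus and differentia counts described in Section II can be sketched on a toy data model. The study itself loaded ontologies with the OWL API; the `Axiom` dataclass, the `classify` and `coverage` functions, and the four-class example below are hypothetical simplifications, and they assume that a class's genus and differentia may come from different axioms.

```python
from dataclasses import dataclass, field


@dataclass
class Axiom:
    """Simplified stand-in for a SubClassOf or EquivalentClass axiom."""
    # Named classes that appear outside any object property restriction.
    named_classes: list[str] = field(default_factory=list)
    # Object property restrictions as (property, filler) pairs.
    restrictions: list[tuple[str, str]] = field(default_factory=list)


def classify(class_name: str, axioms: list[Axiom]) -> tuple[bool, bool]:
    """Return (has_genus, has_differentia) for one class."""
    has_genus = any(
        other != class_name for ax in axioms for other in ax.named_classes
    )
    has_differentia = any(ax.restrictions for ax in axioms)
    return has_genus, has_differentia


def coverage(ontology: dict[str, list[Axiom]]) -> dict[str, float]:
    """Percent of classes with a genus, with differentiae, and with both."""
    counts = {"genera": 0, "differentiae": 0, "complete": 0}
    for name, axioms in ontology.items():
        genus, differentia = classify(name, axioms)
        counts["genera"] += genus
        counts["differentiae"] += differentia
        counts["complete"] += genus and differentia
    total = len(ontology)
    return {k: (100.0 * v / total if total else 0.0) for k, v in counts.items()}


# Toy ontology: one complete definition, one genus-only class,
# one differentia-only class, and one class with no logical definition.
toy = {
    "Nucleus": [Axiom(["Organelle"], [("part_of", "Cell")])],
    "Organelle": [Axiom(["CellularComponent"])],
    "Membrane": [Axiom(restrictions=[("part_of", "Cell")])],
    "Cell": [],
}
result = coverage(toy)
print(result)
```

The ‘Genera Only’ and ‘Differentiae Only’ rows of Table I would follow by subtracting the complete count from the genera and differentiae counts, respectively.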