Modest Use of Ontology Design Patterns in a Repository of Biomedical Ontologies

Jonathan M. Mortensen, Matthew Horridge, Mark A. Musen, and Natalya F. Noy

Stanford Center for Biomedical Informatics Research
Stanford University, Stanford CA 94305, USA

Abstract. Ontology Design Patterns (ODPs) provide a means to capture best practice, to prevent modeling errors, and to encode formally common modeling situations for use during ontology development. Despite the popularity of ODPs and the supposed positive effects of their use, there is scant empirical evidence of their level of adoption in real-world ontologies or of their effectiveness. Given their goals, ODPs may assist in the development of large-scale biomedical ontologies. Before studying ODP effectiveness and applicability, we ask the following questions to better understand the landscape of ODP use: Are ODPs used in biomedical ontologies? Which patterns do the ontology developers use? In which ontologies? How frequently are patterns used? To answer these questions, we determined the adoption of ODPs from two popular ODP libraries among the ontologies in BioPortal, a large ontology repository that contains over 300 biomedical ontologies. We encoded 68 ODPs from two online libraries in the Ontology Pre-Processor Language and, using these encodings, determined ODP prevalence in BioPortal ontologies. We found modest use of ODPs, with 33% of the ontologies containing at least one pattern. Upper Level Ontology, Closure, and Value Partition were the three most commonly used patterns, occurring in 20%, 9%, and 6% of the BioPortal ontologies, respectively. The low prevalence of ODPs may be due to a lack of proper tooling, a lack of user knowledge of and education about them, the age of the ontologies in the repository, or the specificity of some ODPs. We noted that there is a tension between the high expressivity of many ODPs and the goal of maintaining low expressivity in some biomedical ontologies.
Additional tooling is necessary to make ODPs more accessible to domain experts. Furthermore, we suggest that ODPs may be developed in a bottom-up fashion, much like software-design patterns.¹

Keywords: OWL, biomedical ontologies, BioPortal, Ontology Design Pattern, Ontology Pre-Processor Language

¹ Accompanying online resources at http://www.stanford.edu/people/mortensen/odp

1 Ontology Design Patterns

There is a large body of research establishing and creating Ontology Design Patterns (ODPs) [11, 5]. Yet, there is little work to determine their use or effectiveness. In biomedicine, the development and use of ontologies are growing rapidly. This development process can be difficult and/or error-prone. As such, ODPs would likely assist with this development process. In this study, as initial work in evaluating the effectiveness and applicability of ODPs in biomedical ontologies, we examine the prevalence of ODPs in a large corpus of ontologies related to biomedicine.

1.1 ODPs and ODP libraries

Software Design Patterns emerged in the 1990s, capturing recurring design techniques seen in software [10]. Following a similar motivation, the Semantic Web community developed ODPs to alleviate some of the complexities in developing ontologies. ODPs, defined as “a modeling solution to solve a recurrent ontology design problems” [11], capture best practice and common modeling situations. The developers of ODPs suggest that by using the patterns, one can more easily avoid modeling errors and improve ontology quality, maintainability, and reuse [3]. ODPs have become quite popular recently, with multiple workshops held at ISWC, including one during ISWC 2012.

There are two online catalogs of ODPs: the Manchester ODPs Public Catalog for bio-ontologies (MBOP) and OntologyDesignPatterns.org (ODP-Wiki) [9, 1]. These catalogs describe each pattern by the problem that it solves, the proposed solution, and the formal representation by which to instantiate the pattern.
MBOP contains 17 patterns derived from its authors’ experience in modeling ontologies in the biomedical domain and working with OWL-based ontologies in general. ODP-Wiki is a crowd-sourced effort to create an ODP library. The website owners ask for pattern submissions, and a committee then reviews these submissions for approval. Approved patterns are noted as such online. As of this writing, the committee has not approved any patterns, but there are over 150 submissions.

Most of the submissions on ODP-Wiki are “content” ODPs. However, the site categorizes several other types of ODPs. ODP-Wiki includes “structural” (methods to work around language expressivity limitations or define ontology shape/structure), “content” (modeling solutions for a specific domain), “correspondence” (methods to re-engineer an ontology to a different form or map an ontology to another), “reasoning” (patterns that enable one to obtain desired reasoning results), “presentation” (good practices for readability and usability), and “lexico-syntactic” (mapping linguistic structures to ontology entities) patterns. This categorization is based on descriptions by Gangemi and colleagues [11]. MBOP categorizes patterns as “extension” (workarounds for language expressivity limitations), “good practice” (good modeling practice), and “domain modeling” (solutions specific to certain domains). The “structural” classification encompasses the majority of the MBOP patterns.

In this work, the structural and content ODPs are most relevant. Structural patterns are either logical, adding logical expressions not contained directly in the ontology language, or architectural, defining the structure/hierarchy of the ontology itself. Content ODPs model a specific domain situation and are directly re-usable (i.e., they should be directly imported into an ontology and used).
We omit lexico-syntactic, presentation, reasoning, and correspondence patterns from this work, as we cannot test for them using our framework.

Accompanying the MBOP, the Manchester group also developed the Ontology Pre-Processor Language (OPPL), which is both a language based on the Manchester syntax for OWL and a software library that leverages the OWL API [14]. OPPL provides a way to manipulate ontologies, query for ODPs, and instantiate them [16, 15, 2].

1.2 Biomedical Ontologies

In biomedicine, ontology use is rapidly increasing [7, 21]. For example, the National Center for Biomedical Ontology’s BioPortal,² a repository of biomedical ontologies, contains over 300 ontologies and controlled terminologies as of this writing [18]. Biologists use biomedical ontologies to manage large amounts of data. Hospitals and related entities use them when recording information about clinical encounters, during clinical decision support, for billing, and so on.

Because biomedical ontologies are often large and complex, developing them and ensuring that they conform to best practices poses a formidable challenge. Even widely used ontologies frequently contain modeling errors. For instance, Rector and colleagues discovered modeling issues in SNOMED CT, one of the most widely used biomedical ontologies [19]. Researchers have also found modeling errors in the National Cancer Institute Thesaurus [8]. ODPs may be especially important in assisting with the challenge of modeling large and complex biomedical domains while preventing errors. Before assessing the effect of using ODPs on the biomedical ontology modeling process, we first determine the prevalence of ODPs in a large biomedical ontology corpus.

2 Methods

We quantified the use of ODPs from both MBOP and ODP-Wiki in BioPortal using OPPL and the OWL API.
We first encoded ODPs in OPPL and validated their correctness (1) by using expert opinion and (2) by comparing them to the examples in the library, which served as a gold standard. We then obtained the ontologies from BioPortal, excluding those that did not meet predefined filtering criteria (see Section 2.2). We normalized the ontologies to remove any differences in how they were specified, and then checked both the normalized and the original version for each encoded pattern, first filtering out patterns that cannot be represented in the ontology because it lacks the proper relations.

2.1 Pattern Selection

We used the following criteria to select the set of patterns for this study: The pattern must be (1) detectable, (2) non-trivial (that is, not just a template), (3) positively reviewed (if a review is available), and (4) available in a public catalog (in our case, either MBOP or ODP-Wiki). We use these criteria for the following reasons:

1. Using only detectable patterns may seem obvious; however, many patterns, such as n-ary relations or re-engineering patterns, cannot be detected without more information than the ontology alone provides.
2. A template-style pattern may not require the presence of any particular elements. Thus, it would be trivially present even if the ontology contained no elements of the pattern.
3. When available, we considered review information on ODP-Wiki. Poorly reviewed patterns may not yet be refined, making them difficult to encode, especially if they have a logical error.
4. We chose only publicly available patterns, as public availability is a necessary condition for both the reproducibility of this study and the expectation of pattern re-use.

² http://bioportal.bioontology.org

Applying the criteria above to MBOP and ODP-Wiki produced the following results:

– From the 17 patterns in MBOP, we used 15. The remaining 2 were undetectable.
– From the 150 patterns in ODP-Wiki, we used 53. The remaining patterns were either optional or not positively reviewed.
Thus, we selected 68 of the 167 patterns.

2.2 Ontology Selection

From the available ontologies in BioPortal, we selected those ontologies that were publicly available, parseable, locatable (a file was easily obtainable), non-retired, available as a single file, and available in either OWL or OBO format. Applying these criteria to the 312 ontologies that were available in BioPortal as of January 2012 resulted in a set of 256 ontologies.

2.3 Pattern Encoding

OPPL and the OWL API are standard open-source libraries for working with ontology design patterns and ontologies. We encoded the MBOP and ODP-Wiki patterns with OPPL. Some patterns could not be encoded in OPPL; we encoded those patterns directly in Java using the OWL API. An example OPPL encoding of the Value Partition pattern (a way to specify a set of disjoint qualities that describe a concept) follows:

    ?v1:CLASS, ?v2:CLASS, ?param:CLASS
    SELECT ASSERTED ?param EquivalentTo ?v1 or ?v2,
           ASSERTED ?v1 DisjointWith ?v2
    BEGIN
        ADD ?v1 subClassOf Thing
    END;

In order to reduce computational complexity, we pruned pattern–ontology pairs by first checking whether the ontology contains the specific relationships between concepts that a given ODP requires. An ontology without those relationships cannot have the pattern as the catalog specifies it. Furthermore, we did not encode patterns that, based on their required relationships, could not occur in any ontology in our selection. In particular, many content patterns refer to specific relationships in the ontology. For example, according to ODP-Wiki, the pattern Part Of requires the relationship “isPartOf”. Thus, if an ontology does not have the relationship “isPartOf”, we know that it will not have the pattern. When searching, we disregard the namespace of any given pattern, in case the pattern simply uses a different namespace (i.e., we only match on the URI fragment, not including the namespace).
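The pruning check described above can be sketched as follows. This is a minimal illustration in Python, not the implementation used in the study (which relied on OPPL and the OWL API in Java). The function names `uri_fragment` and `may_contain_pattern` and the example URIs are hypothetical, and falling back to the last path segment when a URI has no `#` fragment is an assumption made for this sketch.

```python
from urllib.parse import urldefrag


def uri_fragment(uri: str) -> str:
    """Return the local name of a URI, ignoring its namespace.

    Uses the part after '#'. If there is no fragment, falls back to the
    last path segment (an assumption of this sketch).
    """
    base, fragment = urldefrag(uri)
    if fragment:
        return fragment
    return base.rstrip("/").rsplit("/", 1)[-1]


def may_contain_pattern(ontology_property_uris, required_names):
    """Return True only if every relationship the pattern requires is present.

    Names are compared exactly, with no lexical variants, and namespaces
    are disregarded.
    """
    present = {uri_fragment(uri) for uri in ontology_property_uris}
    return set(required_names) <= present


# Hypothetical ontology with two object properties.
ontology_properties = [
    "http://example.org/anatomy#isPartOf",
    "http://example.org/anatomy#hasLocation",
]

print(may_contain_pattern(ontology_properties, {"isPartOf"}))  # True
print(may_contain_pattern(ontology_properties, {"hasPart"}))   # False
```

Only pattern–ontology pairs for which such a check succeeds would need to go on to the more expensive pattern query.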
One might consider searching with possible lexical variants of this relationship term to ensure that one finds occurrences that capture the intension of the specified relationship. However, the point at which a given string no longer matches the initial string is not well defined. Furthermore, content ODPs are used by directly importing a small module; thus, the relation should not vary across ontologies.

Table 1. Transforms applied exhaustively to an ontology to normalize it.

Axiom                                         Transformation
prop min 1 C                                  prop some C
prop exactly n C                              prop min n C, prop max n C
prop value i                                  prop some i
Property in Anonymous Class                   Simplify Property (Removing inverses) and re-insert
C1 and (C2 and C3)                            C1 and C2 and C3
C1 or (C2 or C3)                              C1 or C2 or C3
C1 EquivalentTo C2                            C1 SubClassOf C2, C2 SubClassOf C1
C1 DisjointUnionOf C2 ... Cn                  DisjointClasses: C2 ... Cn, C1 EquivalentTo (C2 ... Cn)
C1 or ... or Cn SubClassOf D1 and ... and Dn  C1 SubClassOf D1 ... Cn SubClassOf D1 ... C1 SubClassOf Dn ... Cn SubClassOf Dn
DisjointClasses: C1 ... Cn                    Ci DisjointWith Cj for 1 <= i