=Paper=
{{Paper
|id=Vol-1515/regular12
|storemode=property
|title=Scaffolding the mitochondrial Disease Ontology from extant knowledge sources
|pdfUrl=https://ceur-ws.org/Vol-1515/regular12.pdf
|volume=Vol-1515
|dblpUrl=https://dblp.org/rec/conf/icbo/WarrenderL15
}}
==Scaffolding the Mitochondrial Disease Ontology from extant knowledge sources==
Jennifer D. Warrender and Phillip Lord∗, School of Computing Science, Newcastle University, Newcastle-upon-Tyne, UK

∗ To whom correspondence should be addressed: phillip.lord@newcastle.ac.uk

Copyright © 2015 for this paper by its authors. Copying permitted for private and academic purposes.

===ABSTRACT===
Bio-medical ontologies can contain a large number of concepts. Often many of these concepts are very similar to each other, and similar or identical to concepts found in other bio-medical databases. This presents both a challenge and an opportunity: maintaining many similar concepts is tedious and fastidious work, which could be substantially reduced if the data could be derived from pre-existing knowledge sources. In this paper, we describe how we have achieved this for an ontology of the mitochondria using our novel ontology development environment, the Tawny-OWL library.

===1 INTRODUCTION===
Bio-medical ontologies vary in size, with the largest containing millions of concepts. Building ontologies of this size is complex, time-consuming and expensive, and they are just as challenging to maintain and update.

Ontologies are only one of many mechanisms for the computational representation of knowledge. In some cases, ontologies are created where many of the needed concepts will be available elsewhere as terms in different structured representations. Being able to reuse these representations as a scaffold for the rest of an ontology might reduce the cost and work-load of producing ontologies.

This is evidenced by, for instance, SIO (Dumontier et al., 2014), which contains a list of all the chemical elements. Similarly, the Gene Ontology (GO) (Ashburner et al., 2000) contains many terms related to chemical homeostasis, each of which needs to relate to a specific chemical described in ChEBI (Hastings et al., 2013). In addition to being described elsewhere, these concepts are often highly similar to each other. In extreme cases such as the amino acid ontology (Stevens and Lord, 2012), ontologies can consist of only related concepts, and “support” concepts that are used to describe them.

One solution to this is the use of patterns. A pattern is an abstract specification of an ontology axiomatisation with a number of “variables”. The pattern is instantiated by providing values for these variables, which are then expanded into the full axiomatisation providing one or more concepts.

Patterns have been implemented by a number of different tools, which differ in how the patterns are specified, and how and when the values are provided for the variables. For example, termgenie is a website which allows submission to GO (and others) (Dietze et al., 2014). Variable values are entered through a form which then generates axioms, definitions and cross-references. For instance, this is the axiomatisation from termgenie when defining the term “cytosine homeostasis”:

    is_a: GO:0048878 {is_inferred="true"} ! chemical homeostasis
    intersection_of: GO:0048878 ! chemical homeostasis
    intersection_of: regulates_levels_of CHEBI:16040 ! cytosine
    relationship: regulates_levels_of CHEBI:16040 {is_inferred="true"} ! cytosine

As well as the axiomatisation, termgenie also generates a number of different annotations including a definition, submitter information, and status. With termgenie, patterns are specified through the use of JavaScript functions.

In addition to termgenie, other systems also allow patterns. For example, both the desktop and web versions of Protégé contain forms, which grant users the ability to customise the GUI and specify several axioms at once. In this case, patterns are declaratively defined (implicitly, with a GUI design) in XML (Tudorache et al., 2013). Applications like Populous (Jupp et al., 2011) and RightField (Wolstencroft et al., 2011) use spreadsheets or spreadsheet-like interfaces to enter data, which is then transformed into a set of OWL axioms based on a pattern. In the case of these two, the patterns are specified in OPPL, a pattern language for OWL which can also be used independently (Egana Aranguren et al., 2009). Finally, the Brain API allows programmatic construction of ontologies in an easy-to-use manner using Java (Croset et al., 2013).

While these systems are all aimed at somewhat different use-cases, they all address the same problem: how to produce a large number of concepts, all of which are similar, and to do so with a high degree of repeatability. However, the use of this form of patternised ontology tool presents a number of problems. These tools provide a mechanism for adding many axioms at once, but not removing them again (OPPL can remove axioms as well as add them, but this is not automatic). If the knowledge changes, then this is a problem, as the axioms added from a given pattern need to be removed or updated. Furthermore, if the knowledge engineering changes, i.e. the pattern is updated, then all axioms added from any use of the pattern must also be updated.

In this paper, we describe how we have addressed these problems with the Mitochondrial Disease Ontology (MDO), through the use of the Tawny-OWL environment, which is a fully programmatic environment for ontology development. With Tawny-OWL, we can use a pattern-first ontology development process, building with patterns and data from extant knowledge sources from the start. This has allowed us to generate a scaffold which we can then populate further with hand-crafted links between parts of this scaffold where the knowledge exists. As a result, it is possible to update both the knowledge and the patterns by simply regenerating the ontology. This process promises to aid in both the construction and maintenance of ontologies.
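
The idea of a pattern as an axiomatisation with “variables” can be sketched in a few lines of plain Clojure. This is only an illustration under stated assumptions: it uses neither termgenie nor Tawny-OWL, the function name `homeostasis-stanza` and the map keys are hypothetical, and only the identifiers (GO:0048878, CHEBI:16040) are taken from the termgenie example in this paper.

```clojure
(ns pattern-sketch
  (:require [clojure.string :as string]))

;; A pattern with two "variables": a ChEBI identifier and a chemical name.
;; Hypothetical helper for illustration; not termgenie's implementation.
(defn homeostasis-stanza
  [chebi-id chemical]
  {:name  (str chemical " homeostasis")
   :lines ["intersection_of: GO:0048878 ! chemical homeostasis"
           (str "intersection_of: regulates_levels_of "
                chebi-id " ! " chemical)]})

;; Instantiating the pattern means supplying values for its variables.
(def cytosine (homeostasis-stanza "CHEBI:16040" "cytosine"))

(println (:name cytosine))
(println (string/join "\n" (:lines cytosine)))
```

Calling the same function with another chemical would yield another, structurally identical concept, which is the repeatability that pattern-based tools aim for.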
The MDO is available from https://github.com/jaydchan/tawny-mitochondria. Tawny-OWL is available from https://github.com/phillord/tawny-owl.

===2 THE MITOCHONDRIAL DISEASE ONTOLOGY (MDO)===
Mitochondria are complex organelles found in most eukaryotic cells. Their key function is to enable the production of ATP through oxidative phosphorylation, providing usable energy for the rest of the cell. The mitochondria carry their own small genome, containing 37 genes in humans. Many other genes are involved in producing proteins involved in mitochondrial function, but these are encoded in the nuclear genome. A number of mitochondrial genes are associated with diseases; the first identified of these is MELAS (Pavlakis et al., 1984), which is most commonly caused by a point mutation in a tRNA found in the mitochondrial genome.

As with many areas of biology, mitochondrial research is a large, knowledge-rich discipline. Our purpose with the MDO is to attempt to formalise this knowledge, using an incremental or “pay-as-you-go” data integration approach. The ontology here serves as a tool for reasoning and knowledge exploration, rather than as a reference ontology (Stevens and Lord, 2008). This is an approach we have previously found useful in classifying phosphatases (Wolstencroft et al., 2006). The hope is that we can incorporate new knowledge as it is released, checking it for consistency and cross-linking it with existing knowledge.

===3 TAWNY-OWL===
In this section, we give a brief description of Tawny-OWL (Lord, 2013) and how it supports pattern-first development. Tawny-OWL is a library written in Clojure, a dialect of Lisp. It wraps the OWL API (Horridge and Bechhofer, 2011) and allows the fully programmatic construction of ontologies. It has a simple syntax which was modelled on the Manchester Syntax (Horridge and Patel-Schneider, 2012), modified to integrate well with Clojure. It can be used to make simple statements in OWL:

    (defclass A :super (some r B))

which defines a new class A such that A ⊑ ∃ r B.

Although this is similar to the equivalent Manchester Syntax statements, Tawny-OWL provides a feature called “broadcasting” which is, essentially, a form of pattern. So the following statement:

    (some r B C)

is equivalent to the two statements ∃ r B and ∃ r C. We apply the first two arguments (some and r) to the remaining ones consecutively. It also provides simple patterns, such as the covering axiom, so:

    (some-only r B C)

is equivalent to three statements ∃ r B, ∃ r C and ∀ r (B ⊔ C).

While the patterns shown here are provided by Tawny-OWL, end ontology developers are using the same programmatic environment. Patterns are encoded as functions and instantiated with function calls. For instance, we could define some-only as follows:

    (defn some-only [property & classes]
      (list (some property classes)
            (only property
                  (or classes))))

Here defn introduces a new function, property & classes are the arguments, and list packages the return values as a list. some, only and or are defined by Tawny-OWL as the appropriate OWL class constructors.

It is, therefore, possible to build localised patterns, that is, custom patterns for use predominately with the current ontology (Warrender, 2015). Patterns can call each other and can be of arbitrary complexity. The use of Tawny-OWL, therefore, inverts the usual style of ontology development. Non-patternised classes are just the trivial instantiations of patterns.

===4 BUILDING A MITOCHONDRIAL SCAFFOLD===
Following a requirements gathering phase for MDO, it was clear from our competency questions (for example “What are all the genes/proteins that are associated with a specific syndrome?”) that we needed many concepts which were heavily repetitive and, further, for which comprehensive and curated lists are available. We describe these parts of the domain knowledge as the scaffold. For example, there are around 761 genes whose products are involved in mitochondrial function. Classes representing these genes do not, in the first instance, require complex descriptions, and are defined within MDO as follows:

    (defclass Gene)
    (defn gene-class [name]
      (owl-class name :label name :super Gene))

This pattern is then populated using a simple text file, with the 761 gene names present. The gene pattern is an extremely simple pattern, as these concepts are self-standing. Other parts of the ontology are even simpler; for instance, for describing mitochondrial anatomy, the classes have similar complexity to the genes, but there are only 15. In this case, classes are defined with a pattern and a list “hard-coded” into the MDO source code, rather than using an external text file. Other patterns are more complex. For instance, the subclasses of Disease are defined as follows:

    (defn disease-class [name omim lname]
      (let [disease (owl-class name
                               :label name
                               :super Disease)]
        (if-not (nil? omim)
          (refine disease
                  :annotation
                  (see-also (str "OMIMID:" omim))))
        (if-not (nil? lname)
          (refine disease
                  :label (str "Long name:" lname)))))

This function adds two annotations to each disease class, if they are available. This function also demonstrates the use of conditionals (if-not), predicates (nil?) and string concatenation (str); these are not provided by Tawny-OWL, but by Clojure, and demonstrate the value of building Tawny-OWL inside a fully programmatic environment.
===5 FITTING OUT THE SCAFFOLD===
The top-level of the MDO is shown in Figure 1. Of these classes, “Paper” and “Term” are described later.

Fig. 1. The top-level structure of the Mitochondrial Disease Ontology. Classes that are a part of the scaffold are coloured in orange, while classes that are built on top of the scaffold are coloured in green.

The remaining classes define the scaffold, which now has a total of 1357 classes; a break-down of these classes and their sources is shown in Table 1.

{| class="wikitable"
|+ Table 1. The type, number and data source for each generic mitochondrial ontology class
! Class type !! Count !! Data source
|-
| Disease || 41 || UMDF website
|-
| Gene || 761 || The NCBI Gene portal
|-
| Human Anatomy || 61 || The Terminologia Anatomica
|-
| Mitochondrial Anatomy || 15 || Mitochondrial Research Group
|-
| Protein || 479 || UniProt
|}

For the next stage of the process, we are now building on top of this scaffold, using hand-crafted and bespoke knowledge. This is being achieved by manual extraction of knowledge from papers about mitochondria. Our initial process is to find references in papers to the terms that are represented by classes we have built in the scaffold, and draw explicit relationships between these papers and the scaffolded knowledge that they describe. Currently, these classes also use a patternised approach; the raw data is held in a bespoke (but human readable) syntax (in this case EDN, which is a text representation of Clojure data structures; it looks rather like JSON), which is then parsed and used to instantiate patterns. In total, there are now 2174 classes created from this approach from around 30 papers. These terms currently are not defined beyond their name and the source paper from which they were identified. We do not consider them directly as part of the scaffold, as they are not from an extant knowledge source, but one that we have created; they are the first layer built on top of our scaffold. We expect future layers to use the Tawny-OWL syntax directly, as the knowledge increases in complexity and decreases in regularity.

===6 RESILIENCE TO CHANGE===
One key feature of our development process is that the OWL which defines the MDO is no longer source code; rather, it is generated from patterns defined in Tawny-OWL and text files which are used to instantiate these patterns. The in-memory OWL classes and associated OWL files are generated on demand, by evaluating the patterns. Effectively, we regenerate the ontology every time we restart the environment. In this section, we consider the types of changes that can happen, and how these changes impact on MDO.

The scaffold of MDO is sensitive to changes in the knowledge sources on which it depends. First, new terms can be entered into extant sources, which will necessitate the addition of new classes. For the MDO, this simply necessitates re-importing the knowledge. The addition of equivalent new classes will then happen automatically according to the patterns already defined; no other changes should be necessary for the MDO, although we may wish to refer to the new classes in other parts of the ontology.

Second, terms may be removed from dependencies; so, for example, a disease may be redefined by the UMDF. In many cases, for the MDO, this is not problematic: the equivalent classes will simply disappear from the ontology. Tawny-OWL provides two features to help with changes to terms in the scaffold when these terms are also referred to outside of the scaffold. Tawny-OWL uses a “declare-before-use” semantics, so removal of classes from the scaffold will cause fail-fast behaviour when they are used elsewhere. The Brain environment uses the same semantics for similar reasons (Croset et al., 2013). In addition, Tawny-OWL provides a “deprecation” facility which allows the developer to continue to refer to terms from the scaffold which have been removed, but to receive warnings about this use; this is rather like obsolescence, but happens automatically. (Tawny-OWL is implemented in a Lisp and so is homoiconic; this makes it particularly straight-forward to automate code updates if we choose.)

Third, the MDO scaffold can also cope straight-forwardly with changes to patterns. As with the addition or removal of terms from dependencies, pattern changes will simply take place by re-evaluating the ontology.

Finally, the MDO is resilient to changes in ontology engineering conventions. For example, MDO does not use OBO style numeric identifiers, nor provide stable IRIs for integration with linked data sources, since these are not critical at the current time. (Our initial intention was to use PURLs from www.purl.org, but we have found practical problems with generating these.) They could, however, be added easily to all existing (and future) terms in a few lines of code, using an existing facility within Tawny-OWL for minting and persisting numeric identifiers in an automatic, yet managed, way. This change would just alter IRIs and would have no impact on references between concepts inside or outside of the scaffold.

In conclusion, as well as enabling rapid construction of the MDO, we believe that the pattern-first scaffolding approach should also allow easy maintenance of the ontology.
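
The effect of regenerating the scaffold when a term disappears from a source, together with fail-fast behaviour for references to the missing term, can be sketched in plain Clojure. This is an analogy under stated assumptions and not how Tawny-OWL is implemented: the scaffold is modelled as a map, `build-scaffold`, `lookup` and `fails-fast?` are hypothetical helpers, and “Hypothetical syndrome” is an invented term.

```clojure
(ns regenerate-sketch)

(defn build-scaffold
  "Regenerate every class from the current list of source terms."
  [super names]
  (into {} (map (fn [n] [n {:label n :super super}]) names)))

(defn lookup
  "Fail fast when a term has not been declared in the scaffold."
  [scaffold term]
  (or (get scaffold term)
      (throw (ex-info (str "Undeclared term: " term) {:term term}))))

;; Two versions of the same source; the second has lost one term.
(def release-1 (build-scaffold "Disease" ["MELAS" "Hypothetical syndrome"]))
(def release-2 (build-scaffold "Disease" ["MELAS"]))

(defn fails-fast?
  "True when a reference to term cannot be resolved in scaffold."
  [scaffold term]
  (try
    (lookup scaffold term)
    false
    (catch clojure.lang.ExceptionInfo _ true)))

(println (fails-fast? release-1 "Hypothetical syndrome"))
(println (fails-fast? release-2 "Hypothetical syndrome"))
```

Because each scaffold is rebuilt from its source rather than edited in place, the removed class simply does not exist in the second build, and any code that still refers to it is reported immediately rather than silently producing a dangling reference.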
===7 DISCUSSION===
In this paper, we have described how we have used a number of extant knowledge sources, combined with patterns defined using the Tawny-OWL library, to rapidly, reliably and repeatedly construct a scaffold for MDO.

We have previously used a related patternised methodology to construct a complex ontology describing human chromosome rearrangements, i.e. the Karyotype Ontology (KO) (Warrender and Lord, 2013b). However, unlike KO, the mitochondrial knowledge we want to encapsulate is found in numerous independent sources (e.g. published papers and online databases) and in a variety of formats (e.g. “free text” and CSV); the use of several patterns to form a scaffold is unique to MDO. Conversely, the axiomatisation of MDO from these sources is simple; this cannot be said for KO, most of which is generated from a single large pattern (Warrender and Lord, 2013a). In addition, while our knowledge of the karyotype is constrained and is essentially finished, the community’s understanding of mitochondria and mitochondrial disease is incomplete and will grow in response to the demands of changing knowledge.

This methodology is extremely attractive for a number of reasons. First of all, it allows a very rapid way of scaffolding an ontology for a complex area of knowledge. At this stage, most of the classes created are simple and self-standing, although in some cases they do have relationships to other entities in the scaffold. At this point, we have built the ontological equivalent of a data warehouse: terms have been taken from elsewhere and have undergone a form of schema reconciliation into ontological classes.

One key feature of the MDO is that it has been built using tools designed for software development; these tools are relatively advanced and well-maintained (and, usefully, not dependent on academic developers for future maintenance) (Lord, 2013). Moreover, recreating the MDO ontology from our original Tawny-OWL source code is an intrinsic part of the development process; there is no complex release process and any ontology developer can recreate the OWL file with a single command. While the system as it stands has a high degree of replicability, the design decisions implicit in the source code are not necessarily apparent. For the basic scaffold this is, perhaps, not a major issue; however, as MDO is developed outside of its scaffold, we expect to integrate more documentation into the source code itself, using lentic, a recently developed tool for literate programming (Lord, 2015).

We believe that the engineering process that we have used to build the scaffold is resilient to change, as described in Section 6. Despite this resilience, our use of external sources of knowledge does bring with it new dependencies, with all the issues that this entails for change management. We believe that we can manage this by borrowing best practice from software engineering. Importing knowledge into the scaffold can, in many cases, happen entirely automatically from our extant knowledge sources. Considering just the gene lists, we can either import from a local, fixed copy of this list, or take the current version live from the NCBI portal. In software engineering terms, the former is a release dependency and provides stability, while the latter is a snapshot dependency which will fail fast, allowing rapid incorporation of new knowledge. The latter is particularly useful within continuous integration environments, which are used with other ontologies (Mungall et al., 2012) and are also fully supported by Tawny-OWL (Lord, 2013).

Although we have not described its usage here, with the MDO we are not forced to use Tawny-OWL for all development. It would be possible to combine predominately hand-crafted development using Protégé, for instance, with some patternised classes; for example, the OBI uses this approach (Brinkman et al., 2010). For the MDO, in fact, almost all terms other than the top level have been created from other syntaxes, generally a flat file. For larger projects, we envisage that most ontology developers would not need to use the programmatic nature of Tawny-OWL. While we appreciate the value of a single environment, a tool should not force all users into it.

In this paper, we have described our approach to building the MDO using a patternised scaffold based around existing knowledge sources. While the work described in this paper allows us to integrate structured data into an ontology, we are now investigating new ways of integrating unstructured literature-based knowledge into our ontology; while we have started the process of formalising this new knowledge, it is far from finished. As described in this paper, though, a pattern-first, scaffolded approach to ontology development has enabled us to make significant advances with the MDO. We believe that this approach is likely to be applicable to many other domains also.

===REFERENCES===
* Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics, 25(1), 25–29.
* Brinkman, R., Courtot, M., Derom, D., Fostel, J., He, Y., Lord, P., Malone, J., Parkinson, H., Peters, B., Rocca-Serra, P., Ruttenberg, A., Sansone, S.-A., Soldatova, L., Stoeckert, C., Turner, J., Zheng, J., and the OBI consortium (2010). Modeling biomedical experimental processes with OBI. Journal of Biomedical Semantics, 1(Suppl 1), S7.
* Croset, S., Overington, J. P., and Rebholz-Schuhmann, D. (2013). Brain: biomedical knowledge manipulation. Bioinformatics, 29(9), 1238–1239.
* Dietze, H., Berardini, T. Z., Foulger, R. E., Hill, D. P., Lomax, J., Osumi-Sutherland, D., Roncaglia, P., and Mungall, C. J. (2014). Termgenie - a web application for pattern-based ontology class generation. Journal of Biomedical Semantics, 5(1), 48.
* Dumontier, M., Baker, C., Baran, J., Callahan, A., Chepelev, L., Toledo, J. C., Del Rio, N., Duck, G., Furlong, L., Keath, N., Klassen, D., McCusker, J., Rosinach, N. Q., Samwald, M., Rosales, N. V., Wilkinson, M., and Hoehndorf, R. (2014). The Semanticscience Integrated Ontology (SIO) for biomedical research and knowledge discovery. Journal of Biomedical Semantics, 5(1), 14+.
* Egana Aranguren, M., Stevens, R., and Antezana, E. (2009). Transforming the axiomisation of ontologies: The ontology pre-processor language. Nature Precedings.
* Hastings, J., de Matos, P., Dekker, A., Ennis, M., Harsha, B., Kale, N., Muthukrishnan, V., Owen, G., Turner, S., Williams, M., and Steinbeck, C. (2013). The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Research, 41(D1), D456–D463.
* Horridge, M. and Bechhofer, S. (2011). The OWL API: A Java API for OWL ontologies. Semantic Web, 2(1), 11–21.
* Horridge, M. and Patel-Schneider, P. F. (2012). OWL 2 Web Ontology Language Manchester Syntax (second edition). Technical report.
* Jupp, S., Horridge, M., Iannone, L., Klein, J., Owen, S., Schanstra, J., Wolstencroft, K., and Stevens, R. (2011). Populous: a tool for building OWL ontologies from templates. BMC Bioinformatics, 13(Suppl 1), S5.
* Lord, P. (2013). The Semantic Web takes Wing: Programming Ontologies with Tawny-OWL. http://arxiv.org/abs/1303.0213.
* Lord, P. (2015). Lenticular text: Looking at code from different angles. http://www.russet.org.uk/blog/3035.
* Mungall, C., Dietze, H., Carbon, S., Ireland, A., Bauer, S., and Lewis, S. (2012). Continuous integration of open biological ontology libraries. http://bio-ontologies.knowledgeblog.org/405.
* Pavlakis, S. G., Phillips, P. C., DiMauro, S., De Vivo, D. C., and Rowland, L. P. (1984). Mitochondrial myopathy, encephalopathy, lactic acidosis, and strokelike episodes: a distinctive clinical syndrome. Ann. Neurol., 16(4), 481–488.
* Stevens, R. and Lord, P. (2008). Application of ontologies in bioinformatics. In S. Staab and R. Studer, editors, Handbook on Ontologies in Information Systems. Springer, second edition.
* Stevens, R. and Lord, P. (2012). Semantic publishing of knowledge about amino acids. http://ceur-ws.org/Vol-903/paper-06.pdf.
* Tudorache, T., Nyulas, C., Noy, N., and Musen, M. (2013). Using semantic web in ICD-11: Three years down the road. In H. Alani, L. Kagal, A. Fokoue, P. Groth, C. Biemann, J. Parreira, L. Aroyo, N. Noy, C. Welty, and K. Janowicz, editors, The Semantic Web - ISWC 2013, volume 8219 of Lecture Notes in Computer Science, pages 195–211. Springer Berlin Heidelberg.
* Warrender, J. D. (2015). The Consistent Representation of Scientific Knowledge: Investigations into the Ontology of Karyotypes and Mitochondria. Ph.D. thesis, School of Computing Science, Newcastle University.
* Warrender, J. D. and Lord, P. (2013a). A pattern-driven approach to biomedical ontology engineering. SWAT4LS 2013.
* Warrender, J. D. and Lord, P. (2013b). The Karyotype Ontology: a computational representation for human cytogenetic patterns. Bio-Ontologies SIG 2013.
* Wolstencroft, K., Lord, P., Tabernero, L., Brass, A., and Stevens, R. (2006). Protein classification using ontology classification. Bioinformatics, 22(14), e530–538.
* Wolstencroft, K., Owen, S., Horridge, M., Krebs, O., Mueller, W., Snoep, J. L., du Preez, F., and Goble, C. (2011). RightField: embedding ontology annotation in spreadsheets. Bioinformatics, 27(14), 2021–2022.