Ontological interpretation of biomedical database annotations

Filipe Santana da Silva1,*, Ludger Jansen2, Fred Freitas1 and Stefan Schulz3
1 Centro de Informática (CIn), Universidade Federal de Pernambuco (UFPE), Recife, Brazil
2 Institut für Philosophie, Universität Rostock, Germany
3 Institut für Medizinische Informatik, Statistik und Dokumentation, Medizinische Universität Graz, Austria
* Contact: fss3@cin.ufpe.br

ABSTRACT
Motivation: In general, the meaning of biological database records is not sufficiently specified from an ontological point of view. We explore the options for an ontology-based integration and interpretation of database content in terms of individuals, defined classes, dispositions, and a combination of these.
Results: Four interpretation models are created, interpreting annotations in database records as referring to (i) individuals, (ii) defined classes, (iii) disposition universals, and (iv) a combination of these. Evaluation is done by using competency questions to test the retrieval capacities.
Availability: Interpretation models and sample data are available at http://www.cin.ufpe.br/~integrativo.

1 INTRODUCTION
Biological databases (BIO-DBs) are used to store summarized results of laboratory experiments. Apart from numeric and textual entries, they include semantic annotations. For example, the Universal Protein Resource (UniProt) (The UniProt Consortium, 2015) includes annotations from the Protein Ontology (PR) (Natale et al., 2014) and the Gene Ontology (GO) (The Gene Ontology Consortium, 2014). While these ontologies, in isolation, obey formal principles and convey precise meaning, the meaning of the database record as a whole remains vague and depends on implicit background assumptions. What it means when, e.g., an annotation links the UniProt protein term Methionine synthase to the GO process term Methylation is left to the user. Hence, on the one hand, we have rich and well-curated BIO-DBs with highly structured tabular content, but limited ontological explicitness. On the other hand, large bio-ontologies provide formal descriptions of their content, enabling logic-based reasoning. In order to use these features with BIO-DBs, we want to make explicit what exactly annotations refer to and to express this in a formal, computer-processable way.
It has already been argued that a seamless integration between BIO-DBs and ontologies benefits content retrieval with regard to correctness, completeness, and user-friendliness, and that such systems could accommodate large amounts of data from BIO-DBs (Hoehndorf et al., 2011; Santana et al., 2011). It is, however, still an open question (1) how implicit knowledge about the entities and relationships described in the structure of a BIO-DB can be represented, (2) whether the content denoted by BIO-DBs (i.e. the domain entities represented by the data elements and the way they are connected) is fit to be represented, and, if this is the case, (3) how it can be translated into axioms using appropriate representational patterns, and finally, once database structure and content are expressed by formal-ontological means, (4) how the existing bio-ontologies can be plugged into this structure. Addressing these questions, we demonstrate that there are feasible ways to express implicit and explicit database content by formal-ontological means and combine it with existing domain ontologies. We show how annotation terms used in a typical BIO-DB entry can be interpreted as referring to entities from different ontological categories. Each of these interpretations requires different means, such as the introduction of individuals, the addition of new axioms to existing classes, or the introduction of additional defined classes. The resulting OWL models are tested under three aspects: (i) database content retrieval, using ontologies as query vocabulary for data integration; (ii) information completeness; and (iii) reasoning behaviour in Description Logics (DL).

2 METHODS
For the analyses, we selected a typical example from biomedical databases, generated by joining data from UniProt and Ensembl (Cunningham et al., 2014). Records in BIO-DBs are mainly composed of (i) one protein term (e.g., CBS); (ii) one taxon term (e.g., Rattus norvegicus); (iii) one to many terms from GO for biological processes (e.g., Methylation); (iv) one to many terms from GO for cellular components (e.g., Cytoplasm); (v) zero to many phenotype terms (e.g., Endocrine pancreas increased size); and (vi) one to many small molecules (e.g., Homocysteine). We implemented four different interpretive strategies (IND, SUBC, DISP and HYB) in OWL using the editor Protégé v.5 and the reasoner FaCT++ (Tsarkov & Horrocks, 2006) to check for consistency and taxonomic subsumption. We used BioTopLite2 (BTL2) as an upper-level ontology with highly constrained classes and a small set of relations (Schulz & Boeker, 2013). To test each interpretation model, we created four competency questions (CQs), first in natural language, and then translated them into DL queries.

3 RESULTS

3.1 Individuals as the referents of annotations (IND)
The first interpretation rests on the fact that a database entry is about the outcome of a concrete experiment. Accordingly, the annotations that feature in such an entry can be interpreted as referring to the individual molecules, objects and processes that belonged to that particular experiment. Thus, the entry "Cystathionine gamma-lyase" denotes a molecule or a collection of molecules of the class 'Cystathionine gamma-lyase'. BIO-DB content is therefore represented as a set of Abox-level class-membership assertions and relationships.

3.2 Subclasses as the referents of annotations (SUBC)
Second, database content can be interpreted by means of a number of maximally fine-grained defined classes, introduced by means of equivalence axioms for each universal entity which the annotations refer to. For instance, the annotations of a record combining the protein term Methionine synthase and the species term Rattus norvegicus are represented by a customized defined class: a subclass of Methionine synthase, defined as Methionine synthase that is part of an organism of the type Rattus norvegicus. Using OWL-EL expressiveness, we can formalize this as follows:

    'Methionine synthase_in_Rattus norvegicus' equivalentTo
        'Methionine synthase' and ('is part of' some 'Rattus norvegicus')

3.3 Dispositions as the referents of annotations (DISP)
Real-world entities are often described scientifically in terms of dispositions, i.e. tendencies to behave in a certain way under certain circumstances. Biomedical observations yield statistical results indicating that participants of an experiment (e.g., a protein Methionine synthase) have dispositions to bear certain capabilities (Jansen, 2007), like being able to perform a Methylation process. Interpreting database entries as statements about dispositions means that we represent the database content as being about a disposition of organisms of a certain species, e.g., that all instances of Homo sapiens have the disposition to develop a pathological condition P. For this purpose, we use General Class Inclusion (GCI) axioms that allow for subclass assertions between two complex class expressions, e.g.:

    'Endochondral ossification' and ('is included in' some 'Bos taurus')
        subClassOf 'has participant' some 'Cystathionine beta-synthase'

The output of DISP is an ontology file representing the classes referred to by the annotations together with a small set of GCIs, using DL-SHI expressiveness.

3.4 Hybrid interpretation (HYB)
To avoid the complexity of GCI expressions, we combine SUBC with DISP. HYB uses subclass statements as in SUBC, enriched by axioms about dispositions as in DISP. This combination reduces the number of subclasses to be created. Disposition axioms are limited to material objects like proteins and organisms, asserting that they are capable of participating in specific biological processes. The HYB output needs DL-SHI expressiveness.

3.5 Fitness test
The four ontology models were tested for consistency, and the following queries were used for retrieval evaluation: (Q1) Which biological processes have proteins of the kind Prot1 as participants? (Q2) In which cellular locations is Prot2 active in organisms of the type Org1? (Q3) Which proteins are involved in processes of the type BProc in organisms of the type Org1? (Q4) Which organisms are able to exhibit a specific phenotype Phen1? These queries were translated into DL, which enabled the retrieval of content in interpretations IND, SUBC and HYB. The model HYB was the only one able to retrieve content for Q4. As DISP expresses everything in GCIs, retrieval is not supported at all.

4 DISCUSSION
We proposed four interpretation strategies: IND, SUBC, DISP and HYB. Of these, only IND is completely based on single individuals (Abox entities). Ceusters et al. (2014) use a similar approach for applying relations between individuals in electronic health records.
SUBC is based on generating customized definitions of classes. This approach is not far from the work of Hoehndorf et al. (2011). However, in SUBC an annotation does not refer directly to the class matching the annotation term, but to a defined subclass of it. This requires a non-standard interpretation of DL queries, targeting the existence of subclasses. On the downside, SUBC involves an excessive number of subclasses. However, this does not have severe consequences on performance because of the good scaling behaviour of OWL-EL ontologies. This has also been confirmed by our preliminary experiments.
DISP alone is not helpful for most of the queries. It provides a more compact representation, but it is also incomplete because not all knowledge embedded within a database record can be sensibly expressed by dispositions. Finally, the combination of SUBC and DISP in HYB has the major advantage that it enables querying whether certain biological entities are capable of participating in certain processes, assuming that we agree that part of the underlying knowledge in BIO-DBs is about dispositions.

5 CONCLUSION
We proposed four ontological representations of the structure and content of biological databases. The solutions we presented targeted aspects of ontology-based database retrieval, expressiveness and content retrieval based on DL reasoning. Only part of database content is really of ontological nature in a strict sense, i.e., expressible by axioms that hold universally for all instances of a class. We addressed this limitation in three ways. Firstly, we interpreted the denoted entities as (prototypical) individuals, which requires representation and reasoning on an Abox level. Secondly, we expressed contingent database content by creating defined subclasses for which universally valid statements could then be made. Thirdly, we interpreted part of the database content as reporting dispositions, which was, however, not very helpful for answering our queries, in contrast with the second modelling approach, where DL reasoning was used to check for the existence of subclasses.

Funding: This work was funded by Conselho Nacional de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) 3914/2014-03 and Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) 140698/2012-4.

REFERENCES
Ceusters, W., et al. (2014). Clinical Data Wrangling using Ontological Realism and Referent Tracking. In W. R. Hogan, et al. (Eds.), ICBO 2014 (pp. 27–32).
Cunningham, F., et al. (2014). Ensembl 2015. Nucleic Acids Research, 43(D1), D662–D669.
Hoehndorf, R., et al. (2011). Integrating systems biology models and biomedical ontologies. BMC Systems Biology, 5, 124.
Jansen, L. (2007). Tendencies and other Realizables in Medical Information Sciences. The Monist, 90(4), 1–23.
Natale, D.A., et al. (2014). Protein Ontology: A controlled structured network of protein entities. Nucleic Acids Research, 42(D1), D415–D421.
Santana, F., et al. (2011). Ontology patterns for tabular representations of biomedical knowledge on neglected tropical diseases. Bioinformatics, 27(13), i349–i356.
Schulz, S., & Boeker, M. (2013). BioTopLite: An Upper Level Ontology for the Life Sciences. Evolution, Design and Application. In M. Horbach (Ed.), Informatik (pp. 1889–1899). GI.
The Gene Ontology Consortium. (2014). Gene Ontology Consortium: going forward. Nucleic Acids Research, 43(D1), D1049–D1056.
The UniProt Consortium. (2015). UniProt: a hub for protein information. Nucleic Acids Research, 43(D1), D204–D212.
Tsarkov, D., & Horrocks, I. (2006). FaCT++ Description Logic Reasoner: System Description. In LNCS (pp. 292–297). Springer: Berlin/Heidelberg.
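
Supplementary illustration. The SUBC pattern of Section 3.2 (one defined class per protein/taxon pair, introduced by an equivalence axiom) can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the implementation used for the models above: the `Record` type, the function names, the naming scheme for the generated classes, and the second sample record are all hypothetical, and the "query" is a plain set lookup that merely imitates the idea of testing for the existence of a defined subclass, without any DL reasoner.

```python
from typing import Iterable, List, NamedTuple


class Record(NamedTuple):
    """Hypothetical, simplified BIO-DB record: one protein term, one taxon term."""
    protein: str
    taxon: str


def subc_class_name(rec: Record) -> str:
    # Assumed naming scheme: <protein>_in_<taxon>, with spaces replaced by underscores.
    return f"{rec.protein}_in_{rec.taxon}".replace(" ", "_")


def subc_axiom(rec: Record) -> str:
    # Render the equivalence axiom as text, following the notation of Section 3.2.
    return (
        f"'{subc_class_name(rec)}' equivalentTo "
        f"'{rec.protein}' and ('is part of' some '{rec.taxon}')"
    )


def proteins_in(records: Iterable[Record], taxon: str) -> List[str]:
    # Toy stand-in for a DL query: list the proteins for which a defined
    # subclass restricted to the given taxon would exist.
    return sorted({r.protein for r in records if r.taxon == taxon})


# Illustrative records only; the term pairings are taken from the examples in the text.
records = [
    Record("Methionine synthase", "Rattus norvegicus"),
    Record("Cystathionine beta-synthase", "Bos taurus"),
]

for rec in records:
    print(subc_axiom(rec))
```

In a real setting the generated axioms would be loaded into an OWL editor or reasoner; the text rendering here only serves to show how one record gives rise to one defined class.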