Modelling threshold phenomena in OWL: Metabolite concentrations as evidence for disorders

Janna Hastings1,2,*, Ludger Jansen3,4, Christoph Steinbeck1, and Stefan Schulz5

1 Chemoinformatics and Metabolism, European Bioinformatics Institute, UK
2 Swiss Centre for Affective Science, University of Geneva, Switzerland
3 Department of Philosophy, University of Rostock, Germany
4 Department of Philosophy, RWTH Aachen University, Germany
5 Institute of Medical Informatics, Statistics and Documentation, Medical University of Graz, Austria

Abstract. While genomic and proteomic information describes the overall cellular machinery available to an organism, the metabolic profile of an individual at a given time provides a picture of the current physiological state. Concentration levels of relevant metabolites vary under different conditions, in particular in the presence or absence of different disorders. Metabolite concentrations thus mediate an important link between chemistry and biology, contributing to a systems-wide understanding of biological processes and pathways. However, there are a number of challenges in the ontological representation of such information. Firstly, concentration information is numeric and ranges over continuous values, while ontologies consist of discrete classes. Secondly, ontologies usually model only what is certain, and their logical formalisms are adapted to reasoning from certain axioms to logical deductions; however, the link between chemicals and diseases via concentration levels, like many threshold phenomena, is both uncertain and vague. In this paper we evaluate the representation of this knowledge using a combination of concrete domains and probabilistic reasoning. We parse concentration values from HMDB and create an ontology able to distinguish normal from abnormal concentrations and to evaluate a probabilistic risk category for the presence of an associated disorder.
Introduction

Metabolomics is the study of metabolites, the small-molecule products of metabolic processes present in living organisms. Concentration levels of different metabolites in the fluids of the body provide evidence for which processes have taken place, and thereby can reliably indicate disorders [17], as well as providing additional support for functional genomics expression studies [7].

The ChEBI ontology is an ontology of chemical entities and their roles in biological contexts, presently containing around 25,000 classes. ‘Metabolite’ is included in ChEBI as a role which chemical entities take in biological contexts. ChEBI does not currently provide information on the differing concentration levels of metabolites, nor on their association with disorders. This information is provided by metabolome databases such as the Human Metabolome Database [18], but these resources are not organised into an ontology, so they do not support automated reasoning and semantic computational processing. It is therefore crucial to provide an ontological view on this metabolomics data, especially in the context of the ChEBI project.

* To whom correspondence should be addressed: hastings@ebi.ac.uk

Concentration information has historically been difficult to represent ontologically, for three reasons:

1. Until recently, OWL did not provide support for defining classes based on data value ranges (in Description Logics (DLs) [1], these are known as concrete domains). This functionality is included in OWL 2 DL.
2. The link between concentrations and disorders is not certain, that is, it is information about what may be associated with a given disorder rather than what is always an indicator for the disorder. Logical reasoning cannot directly draw inferences from such associations.
3. The threshold between normal concentration levels and the levels associated with disorders is vague, that is, there is no hard numeric cutoff between normal concentration levels and disordered concentration levels.

In this paper, we present an approach to representing and reasoning with metabolite concentration levels associated with disorders, using OWL 2 data ranges and probabilistic DL reasoning [13] as implemented in Pronto [12], a probabilistic extension of Pellet [15]. We draw the metabolite data from the Human Metabolome Database [18]. Our implementation is guided by the following questions:

1. Can we differentiate normal from abnormal metabolite concentrations?
2. What is the likelihood that a patient has a given disorder, considering specified values for the patient's concentrations of different metabolites in biofluids?
3. Can we accumulate the evidence (i.e. increase the likelihood) for the presence of a given disorder if there are multiple metabolite concentration values pointing towards it?

1 Background

1.1 ChEBI, metabolomics data and the HMDB

ChEBI is an OBO Foundry [16] ontology for the structural features and biological roles of biologically interesting chemicals [5]. Many of the biologically interesting chemicals are metabolites, which are found in ChEBI together with their structural chemical classification and their biological roles, including ‘metabolite’. Roles are associated with chemicals using the ‘has role’ relationship. However, there is currently no information formally captured in the ontology as to the context in which a chemical has a particular role.

The identification and annotation of the metabolites found in the human organism, together with associated contextual information such as the disorders linked to different metabolic profiles, is being undertaken by the Human Metabolome Project [17], from which has arisen the Human Metabolome Database (HMDB) [18].
HMDB contains physicochemical, spectral, clinical, biochemical and genomic information for all known human metabolites. Each metabolite entry contains an extensive collection of information in text fields and images, including measured concentration values taken from human samples of different biofluids (such as blood, urine, cerebrospinal fluid), from persons of different ages and with different underlying conditions.

In this paper, we focus in particular on metabolites for which HMDB contains both a normal and an abnormal (associated with some disease) concentration level for an adult subject. The difference between the normal and abnormal concentration values indicates a threshold between these scenarios, such that we would be able to infer the likelihood of a sample concentration being from a disordered organism by virtue of the numeric value being closer to the known disordered concentration than to the known normal concentration.

1.2 Reasoning with data ranges in OWL 2

OWL properties are divided into those that range over objects (descendants of owl:Thing) and those that range over data values for different types of data, such as integers or strings. In OWL 2 [8], data restrictions can be used to define classes by referring to an operator and a range of values of a data property, such as strings or integers. For example (in Manchester syntax [11]),

    Adult subClassOf Human and hasAgeInYears some int[>= 18]

specifies that every Adult is a Human with an age greater than or equal to 18 years.

In the Description Logics underlying the OWL language, such data ranges are called concrete domains [2]. Concrete domains are defined with respect to a domain over which values can range, and a set of allowed predicates that operate on that domain.
In our example above, the domain over which the hasAgeInYears data property ranges is the domain of non-negative integers, N0, and the predicates which operate on that domain (in OWL, allowed predicates correspond to XSD facets) include ‘≤’, ‘=’ and ‘>’.

For the concentration values being represented in our metabolite concentration ontology, the domain is non-negative real numbers (R), which we represent for the sake of the necessary precision as XSD doubles (i.e. 64-bit floating point numbers), and the predicates we use are ≤ and ≥.

1.3 Probabilistic Description Logics

Standard Description Logics are designed to represent information that is certain; for example, in chemistry it is certain that all members of the class carboxylic acids contain at least one carboxy group. A recent DL extension allows the association of probabilistic uncertainty with DL axioms [13].

Probabilistic DL-based ontologies extend classical DLs with probabilistic knowledge about classes and properties (known as terminological probabilistic knowledge) as well as about individuals (known as assertional probabilistic knowledge). Terminological probabilistic knowledge expresses knowledge about randomly chosen individuals belonging to classes, that is, generic members of the class, while assertional probabilistic knowledge is about specific named individuals in the knowledge base [13].

Probabilistic DLs extend traditional DLs with the ability to quantitatively model and reason with partially overlapping classes (specifying the degree to which two classes overlap), and to associate with each axiom in the ontology a probability value which represents the degree of reliability or certainty of the axiom. It is the latter capability that we will make use of.

Probabilistic knowledge consists of conditional constraints [13].

Definition 1. A conditional constraint is an expression of the form (ψ | φ)[l, u], where φ and ψ are classes in the ontology, and l and u are real numbers in the range [0, 1].
Informally, (ψ | φ)[l, u] encodes that φ is a subclass of ψ with probability between l and u. For example, we may wish to express the knowledge that if a certain patient has a measured metabolite concentration within a certain range (φ), then the probability of that patient having a certain disorder (ψ) is in the range [0.75, 0.85].

2 Creating the ontology

2.1 Data extraction and threshold calculation

The HMDB database was programmatically parsed from the downloadable metabocards export. Metabolites for which there was both a normal and an abnormal concentration in the same biofluid were extracted. The normal and the abnormal concentrations were then used to generate a threshold condition which was half-way between the normal and the abnormal, and which was oriented towards the abnormal (either greater than or less than the threshold, depending on which side the abnormal concentration fell).

For example, a pair of sample values for metabolite D-glucose in blood were 4440 uM for a normal adult and 7000 uM for an adult with the disorder Diabetes Mellitus Type 2. In this case we create a threshold at 5700 uM, having abnormal concentrations greater than the threshold.

Note that the threshold being set half-way between normal and abnormal is an artificially introduced constraint for the purpose of this paper. Identifying true thresholds between normal and abnormal concentration levels is of course a much more complex procedure requiring large numbers of samples and sophisticated techniques for eliminating noise in the underlying data [6, 3]. However, for our purposes in evaluating the representation of such information in OWL, we can safely ignore this additional complexity.

2.2 Populating the OWL ontology with data

The OWL ontology was created using the OWL API [10] and reasoned over with a slightly modified form of Pronto [12] (see footnote 6). The full generated ontology, illustrated in Figure 1, includes data for 48 metabolites associated with 39 different disorders (see footnote 7).
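The half-way threshold rule of Section 2.1 can be sketched in a few lines of Python. This is an illustration only, not the parser used for the paper: the names `Threshold`, `midpoint_threshold` and `is_abnormal` are hypothetical, and the concentrations are assumed to be already extracted as numbers in uM. Note that the exact midpoint of the D-glucose pair quoted above is 5720 uM, whereas the text reports 5700 uM, so some rounding was presumably applied.

```python
from typing import NamedTuple


class Threshold(NamedTuple):
    value: float     # half-way point between normal and abnormal (uM)
    direction: str   # ">=" if abnormal lies above the threshold, "<=" if below


def midpoint_threshold(normal: float, abnormal: float) -> Threshold:
    """Half-way threshold, oriented towards the abnormal concentration."""
    if normal == abnormal:
        raise ValueError("normal and abnormal concentrations must differ")
    value = (normal + abnormal) / 2.0
    direction = ">=" if abnormal > normal else "<="
    return Threshold(value, direction)


def is_abnormal(concentration: float, threshold: Threshold) -> bool:
    """Check a measured concentration against the directed threshold."""
    if threshold.direction == ">=":
        return concentration >= threshold.value
    return concentration <= threshold.value


# D-glucose in blood: 4440 uM (normal) vs 7000 uM (Diabetes Mellitus Type 2).
glucose = midpoint_threshold(normal=4440.0, abnormal=7000.0)
print(glucose)
print(is_abnormal(10000.0, glucose))
```

The direction is stored alongside the value because the same rule has to cover metabolites whose abnormal concentration is lower than the normal one.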
Fig. 1. Metabolite ontology: Fluid samples, of which blood and urine are two examples, are considered part of organisms. Concentrations of various different metabolites inhere in these fluid samples. Concentrations may be normal or abnormal, and if abnormal are associated with a disorder, which inheres in the organism from which the sample was extracted. Unlabelled arrows represent is a (subClassOf) relationships.

Note that the ontology shows fluid samples as part of organisms, although this is a simplification, since fluid samples are in fact typically no longer part of an organism, and concentration values may depend on how the sample was extracted and processed.

Calculated threshold values were added to the ontology as classes defined with data ranges. For example, we fully define the class concentration of D-glucose in Blood associated with Diabetes mellitus type 2 as:

    ‘concentration of D-glucose in Blood associated with Diabetes mellitus type 2’ equivalentTo (
        ‘concentration in blood’
        and (hasMetabolite some ‘portion of D-glucose’)
        and (hasConcentrationValue some double[>= 5700.0])
    )

In addition to the simplification involved in setting the threshold half-way between the normal and abnormal concentrations, there is a deeper underlying problem with this threshold model. Even if we included an accurate threshold between normal and abnormal, this threshold represents at the class level what is generally true across many individuals, but obscures the underlying individual variance in phenotype and metabolism which might affect the actual threshold for each individual. Furthermore, it represents normal and abnormal as a binary phenomenon, whereas in reality there is a continuum between the normal and the abnormal [14]. Thus, we cannot create a straightforward DL relationship between a given metabolite concentration and a disorder, since, according to the current model and the underlying DL semantics, each concentration instance would then be associated with at least one disorder instance. It is to address this gap that we propose the use of probabilistic DL.

Footnote 6: Version 0.2, upgraded to the latest version of the OWL API and Pellet, since data ranges were not available in the implemented OWL 1.1 version.
Footnote 7: The ontology (META.owl) and software (META.zip) are available for download from http://www.ebi.ac.uk/~hastings/concentrations/.

2.3 Adding probabilistic constraints

The challenge is to be able to infer, based on measured metabolite concentration values, the likelihood of presence of a disorder. We will call this the risk of having the disorder, given the concentration value of the metabolite. We create classes for the categories of low, medium and high risk of having the given disorder.

Note that the variation of risk with concentration value can be thought of, as a simplifying assumption, as a continuously valued function ranging over all possible concentration values (see footnote 8). However, Pronto constraints take the form of intervals associated with classes (or instances). To create a finite number of OWL classes and associate probability intervals with them, it is therefore necessary to discretize the probability function into fixed ranges. We will do this as illustrated in Figure 2.

Fig. 2. Discrete approximation: We assume a continuous probability function for the relationship between metabolite concentration and the risk of having the associated disorder. We assign three risk ranges, with medium risk ranging around the threshold value. The diagram represents a scenario where the abnormal concentration of the metabolite is larger than the normal concentration.
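The discretization described above can be sketched as a simple lookup. This is a hedged illustration for the case where the abnormal concentration lies above the normal one. The width of the medium band is not stated in the text: the 20% band used here is an assumption, motivated only by the fact that the high-risk class for D-glucose defined next starts at 6840.0, which is 1.2 times the 5700.0 threshold. The lower bound of the medium band (0.8 times the threshold) and the name `risk_category` are likewise hypothetical.

```python
def risk_category(concentration: float, threshold: float, band: float = 0.2) -> str:
    """Discretise a concentration into a low / medium / high risk category.

    Assumes the abnormal concentration lies above the normal one, and that
    the medium band spans threshold * (1 - band) .. threshold * (1 + band).
    The 20% default band is an assumption made for this sketch.
    """
    lower = threshold * (1.0 - band)   # below this: low risk
    upper = threshold * (1.0 + band)   # at or above this: high risk
    if concentration >= upper:
        return "high"
    if concentration >= lower:
        return "medium"
    return "low"


# Blood glucose values of the three test individuals used in Section 3.
for value in (4000.0, 6000.0, 10000.0):
    print(value, risk_category(value, threshold=5700.0))
```

In the ontology itself each category is a separate OWL class with its own data range, so the reasoner performs this classification; the function only mirrors the intended partition.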
For example, we fully define the class person with high risk of having diabetes based on their blood glucose level as:

    ‘person with high risk of having Diabetes mellitus type 2 based on Blood sample of D-glucose’ equivalentTo (
        organism
        and hasPart some (
            bloodSample
            and bearerOf some (
                concentration
                and hasMetabolite some ‘portion of D-glucose’
                and hasConcentrationValue some double[>= 6840.0]
            )
        )
    )

Footnote 8: In general, associative relationships between symptoms/signs and disorders are more complex, involving two parameters: (1) the probability of the disorder given the sign/symptom; and (2) the probability of the sign/symptom given the disorder [9]. We include only the first.

We create the relevant low, medium and high risk categories for each fluid type and disorder for the metabolites D-glucose and acetoacetic acid. Although it is possible to use our software to create such classes for every metabolite in the ontology, we have selected this subset to reduce overhead for reasoning.

Finally, we create the conditional constraints that associate each risk category with the corresponding disorder at a certain probability. We have arbitrarily selected the following probability ranges for the given risk categories:

Low risk: [0.00;0.24]
Medium risk: [0.25;0.54]
High risk: [0.55;1.00]

As required by Pronto, conditional constraints are added to the ontology as an annotation on a subClassOf axiom; these axioms are then removed from the main ontology and added to the probabilistic knowledge base by the Pronto pre-processor. For example, we add the constraint:

    ‘person with high risk of having Diabetes mellitus type 2 based on Blood sample of D-glucose’
        subClassOf ‘person with Diabetes mellitus type 2’
        pronto:certainty "0.55;1.00"

3 Results of reasoning

We test the reasoning capability of the generated ontology against the three questions listed in the introduction. As a simple probe, we create three individuals with different metabolite concentration measurements.
Table 1 describes the individuals and their blood concentration values.

Individual   Metabolite         Concentration   Expected risk (Diabetes)
Harry        Glucose            4000.0          Low
Sally        Glucose            10000.0         High
Barry        Glucose            6000.0          Medium
Barry        Acetoacetic acid   2000.0          High

Table 1. Individuals with sample metabolite concentrations

3.1 Reasoning with data ranges

The first question tests reasoning with numeric thresholds for inferences about conditional properties. Here, the conditional property is the suspected presence of a disorder in the organism whose fluid sample has been measured for the particular metabolite concentration. Testing this is straightforward: the individual Sally in the ontology is associated with a blood glucose concentration value of 10000 uM. After classifying using Pellet, Sally’s concentration is correctly classified as abnormal.

3.2 Reasoning with probability

Answering the second question involves the use of the probabilistic constraints. To allow interplay between the results of reasoning with the data ranges expressed in the ontology and the probabilistic reasoning, we executed a two-step process: firstly, the classical reasoning was performed, and then the inferred class memberships were asserted back into the probabilistic ontology before performing the probabilistic reasoning. This allows us to ask Pronto whether it is entailed that an individual (e.g. Harry) has the disease Diabetes mellitus Type 2. In response, Pronto provides a probability range and an explanation, which refers to the probabilistic constraints used in generating the conclusion. The results are shown in Table 2.

Individual   Risk of Diabetes [l;u]
Harry        [0.0;0.24]
Sally        [0.55;1.0]
Barry        [0.25;1.0]

Table 2. Individuals with inferred probability results for diabetes

The results for Harry and Sally follow directly from the risk categories associated with the classes for which their membership is inferred.
However, that of Barry is more complex, since he has multiple concentration values implicating the disease.

3.3 Reasoning with multiple probabilistic constraints in combination

Pronto uses linear resolution to determine the probability range entailed by a set of constraints [12]. There are two scenarios: when multiple constraints can be resolved (into a probabilistic interval entailment), and when they conflict. Barry has a blood D-glucose concentration in the medium risk range and a blood acetoacetic acid concentration in the high risk range, and the two ranges do not conflict. The above result for Barry therefore indicates Pronto’s strategy in the absence of a conflict, which resembles a union of the two underlying data ranges.

When multiple constraints conflict, Pronto prefers more specific statements to less specific ones. We evaluated this behaviour by changing the medium risk constraint to overlap with the high risk constraint, setting the upper bound for medium to 0.55 instead of 0.54. In this case, Pronto concludes that the probability for Barry having diabetes is [0.55;0.55], the most specific (narrowest) resolution. If the upper bound for medium risk is raised to 0.6, Pronto entails the range [0.55;0.6] for Barry. Thus, it seems that the behaviour on conflict (at least for the two-axiom scenario we test here) resembles an intersection of the two underlying data ranges.

While it remains a task for future work to examine the reasoning behaviour under more complex scenarios, neither of these results is an optimal representation of the intuitive requirement driven by the use case: it would be better if the probabilistic combination of different types of evidence for the same conclusion increased the certainty of the conclusion. However, Pronto does allow for overriding inherited constraints in more specific subclasses. Thus, we can specify a new risk subclass for Barry’s combined risk categories, and associate this with the disease with a new probability range (e.g. [0.54;0.85]).
However, this approach is in general somewhat cumbersome, as it would require adding many more classes and constraints to the knowledge base, one set for every interesting combination of risk factors.

4 Conclusion

Metabolomics is the field which bridges chemical data and biological data by investigating the chemical markers for biological processes, and therefore for their underlying disorders [18]. Accurately modelling the associations between metabolites and disorders goes beyond traditional OWL modelling constructs. We have evaluated a probabilistic representation strategy using Pronto. While probabilistic ontologies have been used to model, e.g., breast cancer risk factors [12], they have to our knowledge not previously been applied to chemical–disease associations, nor used in combination with concrete domains.

Our prototype has illustrated the general applicability of the approach, but a more intuitive and flexible solution for reasoning with combined probability constraints would be mandatory for a real application based on this scenario. Future work will involve the investigation of alternative probabilistic DL approaches, such as those which use an underlying Bayesian model [4], and ultimately address the extension of this prototype towards a full implementation linking ChEBI metabolites to diseases.

Acknowledgements

This work was partly supported by the Deutsche Forschungsgemeinschaft (DFG) grant JA 1904/2-1, SCHU 2515/1-1 GoodOD (Good Ontology Design) and by the BBSRC, grant agreement number BB/G022747/1.

References

1. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F.: The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, 2nd edn. (Sep 2007)
2. Baader, F., Sattler, U.: Description logics with aggregates and concrete domains. Information Systems 28(8), 979–1004 (Dec 2003), http://www.sciencedirect.com/science/article/B6V0G-481FTVC-1/2/06400b60da99c41bc6e07596ff8950c1
3. van den Berg, R., Hoefsloot, H., Westerhuis, J., Smilde, A., van der Werf, M.: Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics 7(1), 142 (2006)
4. da Costa, P.C.G., Laskey, K.B.: PR-OWL: A framework for probabilistic ontologies. In: International Conference on Formal Ontology in Information Systems. pp. 237–249 (2006)
5. de Matos, P., Alcántara, R., Dekker, A., Ennis, M., Hastings, J., Haug, K., Spiteri, I., Turner, S., Steinbeck, C.: Chemical Entities of Biological Interest: an update. Nucl. Acids Res. 38, D249–D254 (2010)
6. Flöter, A., Nicolas, J., Schaub, T., Selbig, J.: Threshold extraction in metabolite concentration data. Bioinformatics 20(10), 1491–1494 (2004), http://bioinformatics.oxfordjournals.org/content/20/10/1491.abstract
7. Gieger, C., Geistlinger, L., Altmaier, E., Hrabé de Angelis, M., Kronenberg, F., Meitinger, T., Mewes, H.W., Wichmann, H.E., Weinberger, K.M., Adamski, J., Illig, T., Suhre, K.: Genetics meets metabolomics: A genome-wide association study of metabolite profiles in human serum. PLoS Genet 4(11), e1000282 (Nov 2008)
8. Grau, B.C., Horrocks, I., Motik, B., Parsia, B., Patel-Schneider, P., Sattler, U.: OWL 2: The next step for OWL. Web Semant. 6, 309–322 (Nov 2008), http://portal.acm.org/citation.cfm?id=1464505.1464604
9. Hall, G.H.: The clinical application of Bayes’ theorem. The Lancet 290, 555–557 (1967)
10. Horridge, M., Bechhofer, S.: The OWL API: A Java API for working with OWL 2 ontologies. In: Hoekstra, R., Patel-Schneider, P.F. (eds.) Proc. of OWL Experiences and Directions 2009 (OWLED 2009) (2009)
11. Horridge, M., Patel-Schneider, P.F.: OWL 2 Web Ontology Language Manchester Syntax (Oct 2009), http://www.w3.org/TR/2009/NOTE-owl2-manchester-syntax-20091027/
12. Klinov, P.: Pronto: A Non-monotonic Probabilistic Description Logic Reasoner. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) The Semantic Web: Research and Applications, Lecture Notes in Computer Science, vol. 5021, chap. 66, pp. 822–826. Springer Berlin Heidelberg, Berlin, Heidelberg (2008), http://dx.doi.org/10.1007/978-3-540-68234-9_66
13. Lukasiewicz, T.: Probabilistic description logics for the semantic web. TU Vienna infsys research report (2007)
14. Schulz, S., Johansson, I.: Continua in biological systems. The Monist 4, 499–522 (2007)
15. Sirin, E., Parsia, B., Cuenca Grau, B., Kalyanpur, A., Katz, Y.: Pellet: A practical OWL-DL reasoner. Journal of Web Semantics 5, 51–53 (2007)
16. Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L.J., Eilbeck, K., Ireland, A., Mungall, C.J., The OBI Consortium, Leontis, N., Rocca-Serra, P., Ruttenberg, A., Sansone, S.A., Scheuermann, R.H., Shah, N., Whetzel, P.L., Lewis, S.: The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 25(11), 1251–1255 (Nov 2007), http://dx.doi.org/10.1038/nbt1346
17. Wishart, D.S.: Current progress in computational metabolomics. Briefings in Bioinformatics 8(5), 279–293 (2007), http://bib.oxfordjournals.org/content/8/5/279.abstract
18. Wishart, D.S., Knox, C., Guo, A.C.C., Eisner, R., Young, N., Gautam, B., Hau, D.D., Psychogios, N., Dong, E., Bouatra, S., Mandal, R., Sinelnikov, I., Xia, J., Jia, L., Cruz, J.A., Lim, E., Sobsey, C.A., Shrivastava, S., Huang, P., Liu, P., Fang, L., Peng, J., Fradette, R., Cheng, D., Tzur, D., Clements, M., Lewis, A., De Souza, A., Zuniga, A., Dawe, M., Xiong, Y., Clive, D., Greiner, R., Nazyrova, A., Shaykhutdinov, R., Li, L., Vogel, H.J., Forsythe, I.: HMDB: a knowledgebase for the human metabolome. Nucleic Acids Research 37(Database issue), D603–610 (Jan 2009), http://dx.doi.org/10.1093/nar/gkn810