Representing Sampling Distributions In P-SROIQ Pavel Klinov1 and Bijan Parsia2 1 University of Arizona, AZ, USA pklinov@email.arizona.edu 2 The University of Manchester, UK bparsia@cs.man.ac.uk Abstract. We present a design for a (fragment of) Breast Cancer on- tology encoded in the probabilistic description logic P-SROIQ which supports determining the consistency of distinct statistical experimental results which may be described in diverse ways. The key contribution is a method for approximating sampling distributions such that the incon- sistency of the approximation implies the statistical inconsistency of the continuous distributions. 1 Introduction The current amount of knowledge about breast cancer is overwhelming. For example, a meta-study conducted in 2006 by Key et al. [4] covered 98 unique studies focused only on the impact of a single risk factor, alcohol consumption. At the same time there are no common knowledge bases which would combine and formally represent findings produced by the multitude of studies.3 This makes it difficult to have a global view of breast cancer risk factors and, consequently, develop tools like risk assessment calculators. The probabilistic description logic P-SROIQ can be used to represent gen- eral knowledge about breast cancer in the form of a probabilistic ontology (the BRC ontology) [5]. However, a general knowledge ontology need not support risk entailments for various combinations of risk factors — that is, compete (poorly) with narrowly specific risk calculators4 which have a direct implementation of simple equations derived from statistical risk models (such as the Gail model [2]). Instead, its main goal is to formally and unambiguously describe the back- ground theory of breast cancer embracing as many reliable findings as possible and serving as a common knowledge base for more specific tools, such as risk assessment calculators or decision support systems. This sort of task seems to be a better fit for a probabilistic logic. The set of use cases for the general knowledge ontology is wider than for the BRC ontology. In addition to maintaining a birds-eye view of breast cancer, it may be used for finding and analyzing inconsistencies in outcomes of different 3 There are some lower level databases, such as ROCK (http://rock.icr.ac.uk/)—a cancer specific functional genomic database. However, they do not explicitly repre- sent case study findings and do not support such services as risk assessment. 4 Such as http://www.cancer.gov/bcrisktool studies. It can support studying mechanisms of interactions between risk factors, for example, how alcohol consumption affects estrogen level. Finally, it may play a useful role in planning and coordination of future medical studies by helping to identify the most controversial or insufficiently studied risk factors or exposures. In this paper, we present a design of general P-SROIQ ontology about breast cancer (i.e., the BRC ontology) which incorporates a substantial amount of statistical knowledge. While we do not present a fully fleshed out instance of this design, we do tackle a major representational challenge, namely, the representation of the statistical results of experiments. We present a method for approximate representations of different sampling distributions and their use in determining consistency between experimental data. 2 Preliminaries of P-SROIQ P-SROIQ [8] is a probabilistic extension of the DL SROIQ [3]. It provides means for expressing probabilistic relationships between arbitrary SROIQ con- cepts and a certain class of probabilistic relationships between classes and indi- viduals. Any SROIQ, and thus OWL 2 DL (as it can be seen as a notational variant of SROIQ), ontology can be used as a basis for a P-SROIQ ontology, which facilitates transition from classical to probabilistic ontologies. We presume the reader is reasonbly familiar with class/object oriented description logics such as SROIQ, though very little in this paper turns on specific details. The only syntactic construct in P-SROIQ (in addition to all of the SROIQ syntax) is the conditional constraint. Definition 1 (Conditional Constraint). A conditional constraint is an ex- pression of the form (D|C)[l, u], where C and D are concept expressions in SRIQ (i.e., SROIQ without nominals) called evidence and conclusion, re- spectively, and [l, u] ⊆ [0, 1] is a closed real-valued interval. In the case where C is > the constraint is called unconditional. Ontologies in P-SROIQ are separated into a classical and a probabilistic part. It is assumed that the set of individual names NI is partitioned onto two sets: classical individuals NCI and probabilistic individuals NP I . Definition 2 (PTBox, PABox, and Probabilistic Knowledge Base). A probabilistic TBox (PTBox) is a pair P T = (T , P) where T is a classical (finite) SROIQ TBox and P is a finite set of conditional constraints. A proba- bilistic ABox (PABox) is a finite set of conditional constraints associated with a probabilistic individual op ∈ NP I . A probabilistic knowledge base (or a probabilistic ontology) is a triple P O = (T , P, {Pop }op ∈NP I ), where the first two components define a PTBox and the last is a a set of PABoxes. Informally, a PTBox constraint (D|C)[l, u] expresses a conditional statement of the form “if a randomly chosen individual is an instance of C, the probability of it being an instance of D is in [l, u]”. A PABox constraint, which we write as (D|C)o [l, u] where o is a probabilistic individual, states that “if a specific individual (that is, o) is an instance of C, the probability of it being an instance of D is in [l, u]”. For more details we refer the reader to [8]. 3 The Classical Part The classical part of a P-SROIQ ontology (or OWL part) provides a medical vocabulary which can be used on its own in a variety of applications or used in the representation of probabilistic knowledge. In this paper we focus on providing an OWL terminology for probabilistic statements. The ontology contains the following main class hierarchies (taxonomies): Taxonomy of breast cancers Breast cancer is a heterogeneous disease. Some risk factors can be associated with increase in risk of developing one partic- ular type of breast cancer and not the other. Thus it is important to classify types of breast cancer. In particular, our ontology distinguishes breast can- cers by hormone receptor status. Estrogen and progesterone positive breast cancers are modeled using concepts ERPositiveBRC and PRPositiveBRC while their complements are modeled using ERNegativeBRC and PRNegativeBRC (we use shorthands ER+/- and PR+/- with obvious mean- ing.). Another important classification is based on histology. The ontology distinguishes between invasive and non-invasive (e.g. in situ) cancers. Taxonomy of risk factors Dozens of risk factors are known so far. Some are established and strongly associate with increased risks, such as BRCA1(2) gene mutations, while others are controversial. The ontology should provide vocabulary for both to support current and future findings. It includes a taxonomy of concepts rooted at RiskFactor. We distinguish between known risk factors (those which can be reported via a questionnaire, such as alcohol intake) and inferred risk factors which require medical examination. Taxonomy of risks The ontology differentiates absolute and relative risks of developing breast cancer. Absolute risks are further divided into the life- time risk and the short-term risk. Relative risks are divided into increased and reduced risks. Level of increases is a continuous variable which requires discretization (see below). The last two taxonomies induce the corresponding classifications of women, i.e., classes of women w.r.t. risk factors and w.r.t. risk. For example, any risk factors RF gives rise to a class of women Woman u ∃hasRiskFactor.RF. Women having various combinations of risk factors are modeled as conjunctive concept expressions. Analogously, given a certain kind of risk R the expression Woman u ∃hasRisk.R models those women who are in the risk group R, for example, have moderately increased risk of developing ER+ breast cancer. These taxonomies of women may or may not be explicitly present in the ontology. In other words, it is possible, but not essential, to generate a concept name for each interesting class of women since P-SROIQ (and our reasoner Pronto) allows for complex concept expressions in conditional constraints. A future, more complete version of the ontology would certainly make use of existing bio-medical ontologies which cover substantial portions of the domain either by direct reuse or by ontology alignment techniques. 4 The Method for Approximating Distributions in the Probabilistic Part The probabilistic part of the ontology captures statistical background knowl- edge about breast cancer. We distinguish between knowledge which explicitly associates quantifies specific risk factors and more general statistical relation- ships which are not necessarily risk related. The distinction could be useful for importing knowledge from other medical ontologies. We begin with the latter. General statistical knowledge mostly includes relationships between vari- ous risk factors. For example, Ashkenazi Jew women are more likely to de- velop BRCA gene mutations, while early menarche, late first child (or no live births), lack of breastfeeding and alcohol consumption all increase levels of es- trogen in blood.5 Such relationships are important because they can help to infer the presence of some risk factors given the set of known factors. They are typically easy to represent by using conditional constraints of the form (Womanu∃hasRiskFactor.RFY|Womanu∃hasRiskFactor.RFX)[l,u] which says that the chances of having risk factor RF Y given RF X are between l and u. One possible source of complications is continuous variables, e.g. the level of estrogen, which are discussed below. Most of statistical findings available in medical literature quantitatively de- scribe risk increase for categories of women with specific risk factors. Such find- ings are presented by giving estimated parameters of a probability distribution where the random variable represents the relative risk of a random woman in the population. Such parameters include the estimated mean value and the esti- mated variance. Table 1 presents an example of the reported association between alcohol intake and the risk increase among postmenopausal women taken from [10]. There are two main difficulties with representing this kind of data in P- SROIQ. First, the risk increase is a continuous random variable so it needs to be discretized. Second, the available language supports only conditional con- straints so a straightforward encoding of probability distributions is not possible. Table 1. Example of a reported association between alcohol intake and the risk of hormone receptor-specific breast cancer (excerpt from [10]) Alcohol (g) ER+ ER- PR+ PR- RR (95% CI) RR (95% CI) RR (95% CI) RR (95% CI) 0 1.00 (ref) 1.00 (ref) 1.00 (ref) 1.00 (ref) ≤4 1.06 (0.91 - 1.22) 1.40 (1.00 - 1.96) 1.04 (0.89 - 1.23) 1.24 (0.95 - 1.62) ≥4 1.07 (0.90 - 1.26) 1.64 (1.14 - 2.35) 1.12 (0.93 - 1.34) 1.28 (0.96 - 1.71) 5 See http://tinyurl.com/4jpsdvk Discretization of a continuous variable is technically straightforward. We in- troduce a set of disjoint concept names each of which models women in the corresponding group of risk. Specifically, we define concepts WomenAtWeakRisk, WomenAtModerateRisk and WomenAtHighRisk with the obvious meanings de- scribed using OWL 2 datatype support to describe the exact boundaries. We have chosen ranges (1, 1.5], (1.5, 3.0] and (3.0, + inf) respectively.6 The inability to represent distributions is a more severe limitation. It leaves the modeler with the only option of approximating the continuous distribution using a finite set of points. In other words, each distribution, for example, risk in- crease for women consuming a certain amount of alcohol, can be approximated by specifying the probability that a randomly taken woman with the given exposure belongs to a specific group of risk, i.e. WomenAtWeakRisk, WomenAtModerateRisk or WomenAtHighRisk. This is the semantics of P-SROIQ conditional constraints. Assuming that the random variable is real-valued, a standard way of ap- proximating a continuous distribution is to take each interval and compute the probability that the variable takes on a value in that interval. Then the ap- proximation of a distribution P r(x) R w.r.t. a finite set of intervals U is simply a function Pˆr such that Pˆr(Ui ) = P r(x)dx. Ui Unfortunately, this approximation of results of statistical experiments is un- satisfactory because it maps every interval to a single point. The problem is that any arbitrarily small difference between two or more sampling distributions will results in conflicting probabilistic statements for every interval (because the point-valued probabilities will be different) even though the results can confirm each other from a purely statistical point of view. Consequently this approach does not support working with results reported by multiple studies. Our goal is to approximate sampling distributions in P-SROIQ in a sta- tistically coherent way. Informally it means that satisfiability of probabilistic formulas representing two or more sampling distributions must agree with their mutual statistical consistency, i.e., whether they support a common statistical hypothesis. The hypothesis, in this case, is that there exists a distribution (not necessarily a unique one) over G with parameters µ, σ such that it is supported by all sampling distributions with the required level of confidence. We assume a (finite) population G of size NG and a random variable X which is normally distributed across G. We also make the realistic assumption that G is large enough so that evaluating X for all members of G is not feasible. A common approach is to take one or more random samples from G, evaluate X for them and estimate the actual distribution over G based on the sampling distributions. We use µ, σ to denote the mean and the variance of the actual distribution and X (i) , S (i) for the mean and the variance of the sample X (i) . For simplicity we finally assume that the population distribution is normal. The mainstream approach for comparing two or more sampling distributions is based on statistical hypothesis tests. For example, given two normal distribu- 6 The choice of intervals is obviously ambiguous but this issue is orthogonal to the approximation method presented in this paper. tions X (1) , S (1) , X (2) , S (1) it is common to take X (1) − X (2) , which is a normally distributed random variable, and perform a z-test (or a Student’s t-test depend- ing on the sample sizes) to see if the difference can be taken as 0 with the required level of confidence. It amounts to calculating standard errors of the mean (SE) for both distributions and then computing the difference in units of SE. If the probability of observing such difference given the null hypothesis,7 which can be found in standard tables, is low enough, e.g., ≤ 0.05, a statistician would accept the hypothesis that both distributions are consistent. Our approach is slightly different from the outlined above. It is not based on tests but on confidence regions for sampling distributions. The approach, which generalizes confidence intervals and dates back to Mood [9], is to estimate a region Rγ in the parameter space for (µ, σ 2 ) such that it will contain the µ, σ 2 pair of the actual distribution 100(1 − γ)% times as the number of estimations goes to infinity. More formally, a 100(1 − γ)% confidence region Rγ is a random set for parameters (µ, σ 2 ) based on a group of independent normally distributed variables X (i.e., a sample) such that [1]:8 P ((µ, σ 2 ) ∈ Rγ ) = 1 − γ, for all (µ, σ 2 ) (1) Informally, the confidence region specifies how far sampling distributions can deviate from the population distribution while supporting it with 100(1 − γ)% confidence. Following Mood [9] we will show that for the normal distribution the region is a convex set and, therefore can be represented by boundary values of (µ, σ 2 ) such that any sampling distribution inside the boundary will be consistent with the current distribution. Consider the sample X1 , . . . , Xn where all Xi are independent Pn random vari- ables with the normal distribution (N (µ, σ 2 )). Then X = n1 i=1 Xi and S 2 = 1 Pn 2 n−1 i=1 (Xi − X) , i.e., the sample mean and the sample variance, are ran- 2 dom variables. It is well known that X has the normal distribution N (µ, σn ) (or, X−µ √ ∼ N (0, 1)) while (n − 1)S 2 /σ 2 has the chi-square distribution equivalently, σ/ n with n − 1 degrees of freedom [9]. The standard tables for N (0, 1) and χ2n−1 provide numbers a, b, c such that for fixed p1 , p2 the following equalities hold [1]: X −µ P (−a < √ < a) = p1 , σ/ n P (b < (n − 1)S 2 /σ 2 < c) = p2 7 The null hypothesis is a default position which, in this case, could be that the population mean is different from at least one of X (1) , X (2) . 8 We deliberately leave out a precise definition of random set. For the purposes of this paper it is sufficient to think of a random set as of a random variable which takes on subsets of some space. The crucial fact is that the two random variables are independent (see [9] for a proof) which implies that: p1 p2 = 2 X −µ (n − 1)S √ < a, b < P (−a < < c) = σ/ n σ2 σ σ (n − 1)S 2 (n − 1)S 2 P (X − a √ < µ < X + a √ , < σ2 < ) n n c b Thus, the 100(p1 )(p2 )% confidence region for (µ, σ 2 ) takes the following form:  σ σ Rp1 ,p2 (X, S) = (µ, σ 2 ) : X − α √ < µ < X + α √ , (2) n n 2 (n − 1)S 2  (n − 1)S < σ2 < γ β Figure 1 shows the joint confidence region R in the parameter space (µ, σ 2 ). Note that it is possible, although technically messy, to generalize the definition (2) to the case of several independent sampling distributions. The simultane- ous confidence region for k samples X (1) , . . . , X (k) will be a region in the 2k- dimensional parameter space which projections on each plane (µ(i) , (σ (i) )2 ) will look as (2). Then the notion of consistency of sampling distributions can be defined as follows (we limit the attention to two samples for clarity): Fig. 1. Joint confidence region for (µ, σ 2 ) Definition 3. Let P r(X (1) ), P r(X (2) ) be distributions on two samples X (1) , X (2) drawn independently from a population G. They are said to be consistent with confidence 100p% if there exists a point (µ, σ 2 ) which belongs to both Rp (X (1) , S (1) ) and Rp (X (2) , S (2) ). Now we can return to the issue of approximating a continuous sampling distribution by a discrete set of points. Assume that the domain E of a continuous real-valued random variable X is a disjoint union of a finite number of intervals U = {(−∞, r1 ], (r1 , r2 ], . . . , (rl−1 , rl ], (rl , +∞)}. Then the approximation of the sampling distribution P r(X) with mean and variance (X, S 2 ) is the function Pˆr which maps each interval Ui to the following real-valued set: Pˆr(Ui ; X, S) = {g(µ, σ 2 )|(µ, σ 2 ) ∈ Rp1 ,p2 (X, S)} (3) Z 1 (x−µ)2 2 g(µ, σ ) = √ e− 2σ2 dx 2πσ 2 Ui Now we are ready to define the notion of approximate consistency of sampling distribution with respect to a set of intervals U : Definition 4. Two sampling distributions P r(X (1) ), P r(X (2) ) are approximately consistent given a finite set of intervals U if Pˆr(Ui ; X (1) , S (1) ) ∩ Pˆr(Ui ; X (2) , S (2) ) is non-empty for all Ui ∈ U . As with any approximation, the utility of approximations of sampling distri- butions depends on what conclusions they help to draw about the distributions themselves. Given that we are interested in the matter of consistency, it is im- portant to understand the relationships between the notions of consistency and approximate consistency of sampling distributions. Fortunately, consistency im- plies approximate consistency regardless of partitioning of the real line: Theorem 1. If two sampling distributions P r(X (1) ), P r(X (2) ) are consistent, then they are approximately consistent for any choice of real-valued intervals. Proof. For the distribution P r(X (1) ) a confidence region Rp1 ,p2 (X (1) , S (1) ) is connected (see Definition 2). The function g(µ, σ 2 ) (Definition 3) is continuous on it which implies that for any Ui , the set Pˆr(Ui ; X (1) , S (1) ) is a real-valued interval (l1 , u1 ). Now consider a point µ0 , σ02 ∈ Rp1 ,p2 (X (1) , S (1) ) ∩ Rp1 ,p2 (X (2) , S (2) ) which exists since the distributions are consistent. It follows that l1 < g(µ0 , σ02 ) < u1 (and analogously l2 < g(µ0 , σ02 ) < u2 for Pˆr(Ui ; X (2) , S (2) )), so g(µ0 , σ02 ) is a common point for both approximations on Ui . As such the distributions are approximately consistent. The following corollary from the above theorem is at heart of our method. As we demonstrate below, the inconsistency of approximations can be proved by logical reasoning in P-SROIQ (i.e., by solving the probabilistic satisfiabil- ity problem), which means that the result enables approximate reasoning about sampling distributions in a purely logical way. Even though the power of such rea- soning is currently limited to consistency checking, its integration with OWL/DL reasoning and the ability to use common, formally defined terminology for rep- resentation of statistical experiments is promising. Corollary 1. If sampling distributions P r(X (1) ), P r(X (2) ) are approximately inconsistent for some choice of real-valued intervals, then they are inconsistent. 5 Example of Approximate Modelling Now we present an example of approximate representation of sampling distri- butions in P-SROIQ. The task is to take two results of statistical experiments aimed at investigating associations between alcohol consumption and the in- creased risk of breast cancer among postmenopausal women. Unfortunately it is common for medical papers to not explicitly present all parameters that char- acterize results of their statistical analyses. Typically, only the estimated mean and the confidence interval are presented while, for example, the kind of distri- bution is left to the reader to infer from other information. Due to that fact and because the approach above has only been developed for normal distributions, we illustrate it on an artificial example. The information given in the example is analogous to that given in medical literature, e.g. [10, 11], but is complete in the sense that all parameters and the type of sampling distributions are known. Example 1. Consider two hypothetical papers which report results of indepen- dent studies of associations between alcohol consumption among postmenopausal women and their relative risk of developing breast cancer. According to study A the mean relative risk (RR) of ER+ breast cancer for women drinking ≥ 4g of ethanol a day is 1.8 and has variance of 0.5. Study B has reported that the mean RR of ER+ breast cancer for the same level of drinking is 2.2 (variance 0.7). The number of cases in the studies was 230 and 150 respectively. We propose the following four step procedure for an approximate representa- tion of statistical results, similar to those in the example above, in P-SROIQ: 1. Preparing concepts The first step is to define the concepts/roles used to describe the distribution. In our case evidence concepts should describe cat- egories of women with respect to specific risk factors, e.g. alcohol intake, while conclusion concepts describe groups of women stratified by risk increase. For in- stance, the concept expression C ≡ Woman u ∃hasRiskFactor.(Postmenopause u ModerateConsumption) is used to model postmenopausal women with moderate level of alcohol intake.9 On the other hand the expression: D ≡ Woman u ∃hasRisk.(ModeratelyIncreasedRisk u ∃riskOf.ERPositiveBRC) 9 The level of intake is a continuous variable which we also split onto categories LimitedConsumption, ModerateConsumption and HeavyConsumption which corre- spond to ≤ 4, 4 − 9.9 and ≥ 10g of ethanol per day. models women who are at moderately increased risk of developing ER-positive breast cancer. Using these expressions the modeler can specify the probability than a random women the class C also belong the risk group D as (D|C)[l,u]. 2. Determining parameters of sampling distributions (if required) Sometimes parameters of sampling distributions can be determined from other information. For example, knowing the kind of distribution, sample mean, sample size, confidence interval and the methodology of its estimation, one can calculate the sample variance.10 In our case it is not needed as the distributions are normal and the parameters are known. 3. Choosing intervals Choice of intervals for an approximation of a con- tinuous random variable is driven by balancing the quality of the approxima- tion (i.e., how closely it models the continuous distribution) and the number of statements required. The latter has a direct impact on performance. For Ex- ample 1 we use three concepts WomenAtWeakRisk, WomenAtModerateRisk and WomenAtHighRisk which correspond to relative risk intervals of (1,1.5], (1.5, 3.0] and (3.0,+∞) respectively. 4. Computing the approximation The final (and the central) step is to compute probability intervals for the statements that approximate the contin- uous distribution. Each statement specifies the lower and upper probabilities that the continuous random variable X will fall into an interval Ui given that parameters of the distribution can vary within the confidence region (2). More formally, given the interval Ui , e.g. (1,1.5] for WomenAtWeakRisk, and the sam- pling distribution (X, S 2 ) the interval [li , ui ] can be computed by solving the following non-linear optimization problem (): li (resp. ui ) = min (resp. max) g(µ, σ 2 ) s.t. (4) 2 (µ, σ ) ∈ Rp1 ,p2 (X, S) Z 1 (x−µ)2 2 g(µ, σ ) = √ e− 2σ2 dx 2πσ 2 Ui In other words, [li , ui ]=[inf Pˆr(Ui ; X, S), sup Pˆr(Ui ; X, S)]. The last preparatory step is to calculate confidence regions according to (2). The 95% confidence regions for distributions (X (1) , S (1) ), (X (2) , S (2) ) in Example (1) (2) 1 (abbreviated as R0.95 and R0.95 ) are defined by the following inequalities:   (1) 2.241σ 2.241σ R0.95 = (µ, σ 2 ) : 1.8 − √ < µ < 1.8 + √ , 0.409 < σ 2 < 0.623 230 230   (2) 2.241σ 2.241σ R0.95 = (µ, σ 2 ) : 2.2 − √ < µ < 2.2 + √ , 0.548 < σ 2 < 0.923 150 150 10 √ The variable T = (X − µ)/(S/ n) has the t-dustribution with n − 1 degrees of freedom. Confidence interval is standardly computed as [X − a, X + a] where a = t 1−α ,n−1 √Sn (t 1−α ,n−1 is the α−percentile of the Student distribution). If the 2 2 confidence interval and α are known, then S can be calculated. Now the optimization problem (4) can be solved numerically11 to obtain the following approximations for both sampling distributions: inf Pˆr((1, 1.5]; X (1) , S (1) ) = 0.219 sup Pˆr((1, 1.5]; X (1) , S (1) ) = 0.298 inf Pˆr((1.5, 3.0]; X (1) , S (1) ) = 0.655 sup Pˆr((1.5, 3.0]; X (1) , S (1) ) = 0.878 inf Pˆr((3.0, +∞); X (1) , S (1) ) = 0.239 sup Pˆr((3.0, +∞); X (1) , S (1) ) = 0.586 inf Pˆr((1, 1.5]; X (2) , S (2) ) = 0.116 sup Pˆr((1, 1.5]; X (2) , S (2) ) = 0.224 inf Pˆr((1.5, 3.0]; X (2) , S (2) ) = 0.562 sup Pˆr((1.5, 3.0]; X (2) , S (2) ) = 0.769 inf Pˆr((3.0, +∞); X (2) , S (2) ) = 0.189 sup Pˆr((3.0, +∞); X (2) , S (2) ) = 0.568 So, for this example, the sampling distributions are approximately represented in P-SROIQ using the following conditional constraints. {(W u ∃hR.(WeaklyIncreasedRisk u ∃riskOf.ERPositiveBRC)|C)[0.219, 0.298], (W u ∃hR.(ModeratelyIncreasedRisk u ∃riskOf.ERPositiveBRC)|C)[0.655, 0.878], (W u ∃hR.(StronglyIncreasedRisk u ∃riskOf.ERPositiveBRC)|C)[0.239, 0.586]} and {(W u ∃hR.(WeaklyIncreasedRisk u ∃riskOf.ERPositiveBRC)|C)[0.116, 0.224], (W u ∃hR.(ModeratelyIncreasedRisk u ∃riskOf.ERPositiveBRC)|C)[0.562, 0.769], (W u ∃hR.(StronglyIncreasedRisk u ∃riskOf.ERPositiveBRC)|C)[0.189, 0.568]} where W and hR abbreviate Woman and hasRisk, respectively and C ≡ W u ∃hasRiskFactor.(Postmenopause u ModerateConsumption) Probabilistic consistency of the above set of statements can be proved by solving the probabilistic satisfiability problem (PSAT). Modern algorithms can decide PSAT for over a thousand of P-SROIQ statements (in addition to thou- sands of OWL axioms), so the method could be computationally practical [6]. 6 Conclusion Checking consistency of sampling distributions in P-SROIQ may well appear cumbersome and pointless given that the same task can be done in a much simpler way and without any logical reasoning, e.g. via testing or by analyzing confidence regions. However, our aim is not to reduce statistical testing to log- ical reasoning (that aim is indeed pointless). Our aim is to represent results of statistical experiments using common, unambiguously defined logical vocabulary and be able to reason about them. Even though probabilistic reasoning about statistical results is currently limited to approximate consistency checking, the 11 We use Wolfram Mathematica for this purpose. potential benefits are in combining it with reasoning about the classical knowl- edge. For example, the BCRA ontology contains a little taxonomy of breast cancers by hormone receptor status. This enables us to combine results of the studies which are of different levels of granularity. For instance, Sellers et al. [10] report associations between alcohol intake and ER(+/-) breast cancer risk, while Suzuki et al. [11] divide it further to ER(+/-)PR(+/-) risks. In that simple case non-logical reasoning about the reported results becomes much less straightfor- ward, while studies can also distinguish histologic types of breast cancer (see [7]). In such complex situations reasoning about findings does involve reasoning about background knowledge, e.g. the taxonomy of breast cancers, so a combination of OWL and probabilistic reasoning is potentially beneficial. References 1. Arnold, B.C., Shavelle, R.M.: Joint confidence sets for the mean and variance of a normal distribution. The American Statistician 52(2), 133–140 (1998) 2. Gail, M.H., Brinton, L.A., Byar, D.P., Corle, D.K., Green, S.B., Schairer, C., Mul- vihill, J.J.: Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. Journal of the National Cancer Institute 81(25), 1879–1886 (1989) 3. Horrocks, I., Kutz, O., Sattler, U.: The even more irresistible SROIQ. In: Knowl- edge Representation and Reasoning. pp. 57–67 (2006) 4. Key, J., Hodgson, S., Omar, R.Z., Jensen, T.K., Thompson, S.G., Boobis, A.R., Davies, D.S., Elliott, P.: Meta-analysis of studies of alcohol and breast cancer with consideration of the methodological issues. Cancer Causes Control 17, 759–770 (2006) 5. Klinov, P., Parsia, B.: Probabilistic modeling and OWL: A user oriented intro- duction into P-SHIQ(D). In: OWL: Experiences and Directions (2008), http: //www.webont.org/owled/2008/papers/owled2008eu_submission_32.pdf 6. Klinov, P., Parsia, B.: A hybrid method for probabilistic satisfiability. In: CADE. pp. 354–368 (2011) 7. Lew, J.Q., Freedman, N.D., Leitzmann, M.F., Brinton, L.A., Hoover, R.N., Hollen- beck, A.R., Schatzkin, A., Park, Y.: Alcohol and risk of breast cancer by histologic type and hormone receptor status in postmenopausal women the nih-aarp diet and health study. American Journal of Epidemiology 170(3), 308–317 (2009) 8. Lukasiewicz, T.: Expressive probabilistic description logics. Artificial Intelligence 172(6-7), 852–883 (2008) 9. Mood, A.M.: Introduction to the Theory of Statistics. McGraw-Hill (1950) 10. Sellers, T.A., Vierkant, R.A., Cerhan, J.R., Gapstur, S.M., Vachon, C.M., Olson, J.E., Pankratz, V.S., Kushi, L.H.: Interaction of dietary folate intake, alcohol, and risk of hormone receptor-defined breast cancer in a prospective study of post- menopausal women. Cancer Epidemiology, Biomarkers and Prevention 11, 1104– 1107 (2002) 11. Suzuki, R., Ye, W., Rylander-Rudqvist, T., Saji, S., Colditz, G.A., Wolk, A.: Al- cohol and postmenopausal breast cancer risk defined by estrogen and progesterone receptor status: A prospective cohort study. Journal of the National Cancer Insti- tute 97(21), 1601–1608 (2005)