<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Representing Sampling Distributions in P-SROIQ</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pavel Klinov</string-name>
          <email>pklinov@email.arizona.edu</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bijan Parsia</string-name>
          <email>bparsia@cs.man.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The University of Manchester</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present a design for a (fragment of) Breast Cancer ontology encoded in the probabilistic description logic P-SROIQ which supports determining the consistency of distinct statistical experimental results which may be described in diverse ways. The key contribution is a method for approximating sampling distributions such that the inconsistency of the approximation implies the statistical inconsistency of the continuous distributions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The current amount of knowledge about breast cancer is overwhelming. For
example, a meta-study conducted in 2006 by Key et al. [4] covered 98 unique
studies focused only on the impact of a single risk factor, alcohol consumption. At
the same time, there are no common knowledge bases which combine and
formally represent the findings produced by this multitude of studies.<sup>3</sup> This makes
it difficult to have a global view of breast cancer risk factors and, consequently,
to develop tools such as risk assessment calculators.</p>
      <p>The probabilistic description logic P-SROIQ can be used to represent
general knowledge about breast cancer in the form of a probabilistic ontology (the
BRC ontology) [5]. However, a general knowledge ontology need not support risk
entailments for various combinations of risk factors, that is, compete (poorly)
with narrowly specific risk calculators<sup>4</sup> which directly implement
simple equations derived from statistical risk models (such as the Gail model
[2]). Instead, its main goal is to formally and unambiguously describe the
background theory of breast cancer, embracing as many reliable findings as possible
and serving as a common knowledge base for more specific tools, such as risk
assessment calculators or decision support systems. This sort of task seems to
be a better fit for a probabilistic logic.</p>
      <p>The set of use cases for the general knowledge ontology is wider than for the
BRC ontology. In addition to maintaining a bird's-eye view of breast cancer, it
may be used for finding and analyzing inconsistencies in the outcomes of different
studies. It can support studying mechanisms of interaction between risk factors,
for example, how alcohol consumption affects estrogen level. Finally, it may play
a useful role in planning and coordinating future medical studies by helping to
identify the most controversial or insufficiently studied risk factors or exposures.</p>
      <p><sup>3</sup> There are some lower-level databases, such as ROCK (http://rock.icr.ac.uk/), a
cancer-specific functional genomic database. However, they do not explicitly
represent case study findings and do not support such services as risk assessment.</p>
      <p><sup>4</sup> Such as http://www.cancer.gov/bcrisktool</p>
      <p>In this paper, we present a design of a general P-SROIQ ontology about
breast cancer (i.e., the BRC ontology) which incorporates a substantial amount
of statistical knowledge. While we do not present a fully fleshed-out instance
of this design, we do tackle a major representational challenge, namely, the
representation of the statistical results of experiments. We present a method for
the approximate representation of different sampling distributions and its use in
determining consistency between experimental data.</p>
    </sec>
    <sec>
      <title>Preliminaries of P-SROIQ</title>
      <p>P-SROIQ [8] is a probabilistic extension of the DL SROIQ [3]. It provides
means for expressing probabilistic relationships between arbitrary SROIQ
concepts and a certain class of probabilistic relationships between classes and
individuals. Any SROIQ ontology, and thus any OWL 2 DL ontology (as OWL 2 DL can be seen as a notational
variant of SROIQ), can be used as a basis for a P-SROIQ ontology,
which facilitates the transition from classical to probabilistic ontologies. We presume
the reader is reasonably familiar with class/object oriented description logics such
as SROIQ, though very little in this paper turns on specific details.</p>
      <p>The only syntactic construct in P-SROIQ (in addition to all of the SROIQ
syntax) is the conditional constraint.</p>
      <p>Definition 1 (Conditional Constraint). A conditional constraint is an
expression of the form (D|C)[l, u], where C and D are concept expressions in
SRIQ (i.e., SROIQ without nominals) called evidence and conclusion,
respectively, and [l, u] ⊆ [0, 1] is a closed real-valued interval. In the case where C
is ⊤, the constraint is called unconditional.</p>
      <p>Ontologies in P-SROIQ are separated into a classical and a probabilistic
part. It is assumed that the set of individual names N<sub>I</sub> is partitioned into two
sets: classical individuals N<sub>CI</sub> and probabilistic individuals N<sub>PI</sub>.</p>
      <p>Definition 2 (PTBox, PABox, and Probabilistic Knowledge Base). A
probabilistic TBox (PTBox) is a pair PT = (T, P) where T is a classical
(finite) SROIQ TBox and P is a finite set of conditional constraints. A
probabilistic ABox (PABox) is a finite set of conditional constraints associated with
a probabilistic individual o<sub>p</sub> ∈ N<sub>PI</sub>. A probabilistic knowledge base (or a
probabilistic ontology) is a triple PO = (T, P, {P<sub>o<sub>p</sub></sub>}<sub>o<sub>p</sub> ∈ N<sub>PI</sub></sub>), where the first two
components define a PTBox and the last is a set of PABoxes.</p>
      <p>Informally, a PTBox constraint (D|C)[l, u] expresses a conditional statement
of the form "if a randomly chosen individual is an instance of C, the probability
of it being an instance of D is in [l, u]". A PABox constraint, which we write
as (D|C)<sub>o</sub>[l, u] where o is a probabilistic individual, states that "if a specific
individual (that is, o) is an instance of C, the probability of it being an instance
of D is in [l, u]". For more details we refer the reader to [8].</p>
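      <p>As an illustration only, a conditional constraint can be held in a small record type. The sketch below is not part of P-SROIQ or of any reasoner; the class name ConditionalConstraint, its field names and the string "Thing" standing for ⊤ are assumptions made for this example.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import Optional

TOP = "Thing"  # assumed spelling of the top concept in this sketch


@dataclass(frozen=True)
class ConditionalConstraint:
    """Hypothetical container for a constraint (D|C)[l, u]."""
    conclusion: str                   # concept expression D
    evidence: str                     # concept expression C
    lower: float                      # l
    upper: float                      # u
    individual: Optional[str] = None  # probabilistic individual o (PABox only)

    def __post_init__(self):
        # [l, u] must be a closed sub-interval of [0, 1].
        if not 0.0 <= self.lower <= self.upper <= 1.0:
            raise ValueError("expected 0 <= l <= u <= 1")

    @property
    def is_unconditional(self):
        # Unconditional constraints have the top concept as evidence.
        return self.evidence == TOP

    @property
    def is_pabox(self):
        # PABox constraints are attached to a probabilistic individual.
        return self.individual is not None
```
]]></preformat>
      <p>Concept expressions are kept as plain strings here; a real implementation would more plausibly reuse the class expressions of an OWL library.</p>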
    </sec>
    <sec id="sec-2">
      <title>The Classical Part</title>
      <p>The classical part of a P-SROIQ ontology (or OWL part) provides a medical
vocabulary which can be used on its own in a variety of applications or used in the
representation of probabilistic knowledge. In this paper we focus on providing
an OWL terminology for probabilistic statements. The ontology contains the
following main class hierarchies (taxonomies):</p>
      <p>Taxonomy of breast cancers. Breast cancer is a heterogeneous disease. Some
risk factors can be associated with an increase in the risk of developing one
particular type of breast cancer and not another. Thus it is important to classify
types of breast cancer. In particular, our ontology distinguishes breast
cancers by hormone receptor status. Estrogen and progesterone positive breast
cancers are modeled using the concepts ERPositiveBRC and PRPositiveBRC,
while their complements are modeled using ERNegativeBRC and
PRNegativeBRC (we use the shorthands ER+/- and PR+/- with the obvious
meaning). Another important classification is based on histology. The ontology
distinguishes between invasive and non-invasive (e.g., in situ) cancers.</p>
      <p>Taxonomy of risk factors. Dozens of risk factors are known so far. Some are
established and strongly associated with increased risk, such as BRCA1(2)
gene mutations, while others are controversial. The ontology should provide
vocabulary for both to support current and future findings. It includes a
taxonomy of concepts rooted at RiskFactor. We distinguish between known
risk factors (those which can be reported via a questionnaire, such as alcohol
intake) and inferred risk factors, which require medical examination.</p>
      <p>Taxonomy of risks. The ontology differentiates absolute and relative risks of
developing breast cancer. Absolute risks are further divided into the
lifetime risk and the short-term risk. Relative risks are divided into increased
and reduced risks. The level of increase is a continuous variable which requires
discretization (see below).</p>
      <p>The last two taxonomies induce the corresponding classifications of women,
i.e., classes of women w.r.t. risk factors and w.r.t. risk. For example, any risk
factor RF gives rise to a class of women Woman ⊓ ∃hasRiskFactor.RF. Women
having various combinations of risk factors are modeled as conjunctive concept
expressions. Analogously, given a certain kind of risk R, the expression Woman ⊓
∃hasRisk.R models those women who are in the risk group R, for example, those who have a
moderately increased risk of developing ER+ breast cancer. These taxonomies
of women may or may not be explicitly present in the ontology. In other words,
it is possible, but not essential, to generate a concept name for each interesting
class of women since P-SROIQ (and our reasoner Pronto) allows for complex
concept expressions in conditional constraints.</p>
      <p>A future, more complete version of the ontology would certainly make use of
existing bio-medical ontologies which cover substantial portions of the domain,
either by direct reuse or by ontology alignment techniques.</p>
    </sec>
    <sec id="sec-4">
      <title>The Method for Approximating Distributions in the Probabilistic Part</title>
      <p>The probabilistic part of the ontology captures statistical background
knowledge about breast cancer. We distinguish between knowledge which explicitly
quantifies specific risk factors and more general statistical
relationships which are not necessarily risk related. The distinction could be useful for
importing knowledge from other medical ontologies. We begin with the latter.</p>
      <p>General statistical knowledge mostly includes relationships between
various risk factors. For example, Ashkenazi Jewish women are more likely to
develop BRCA gene mutations, while early menarche, late first child (or no live
births), lack of breastfeeding and alcohol consumption all increase levels of
estrogen in blood.<sup>5</sup> Such relationships are important because they can help to
infer the presence of some risk factors given the set of known factors. They
are typically easy to represent by using conditional constraints of the form
(Woman ⊓ ∃hasRiskFactor.RF<sub>Y</sub> | Woman ⊓ ∃hasRiskFactor.RF<sub>X</sub>)[l, u], which says
that the chances of having risk factor RF<sub>Y</sub> given RF<sub>X</sub> are between l and u.
One possible source of complications is continuous variables, e.g., the level of
estrogen, which are discussed below.</p>
      <p>Most statistical findings available in the medical literature quantitatively
describe the risk increase for categories of women with specific risk factors. Such
findings are presented by giving estimated parameters of a probability distribution
where the random variable represents the relative risk of a random woman in
the population. Such parameters include the estimated mean value and the
estimated variance. Table 1 presents an example of the reported association between
alcohol intake and the risk increase among postmenopausal women taken from
[10]. There are two main difficulties with representing this kind of data in
P-SROIQ. First, the risk increase is a continuous random variable, so it needs
to be discretized. Second, the available language supports only conditional
constraints, so a straightforward encoding of probability distributions is not possible.</p>
      <p>Discretization of a continuous variable is technically straightforward. We
introduce a set of disjoint concept names, each of which models women in the
corresponding risk group. Specifically, we define the concepts WomenAtWeakRisk,
WomenAtModerateRisk and WomenAtHighRisk with the obvious meanings,
using OWL 2 datatype support to describe the exact boundaries. We
have chosen the ranges (1, 1.5], (1.5, 3.0] and (3.0, +∞) respectively.<sup>6</sup></p>
      <p>The inability to represent distributions is a more severe limitation. It leaves
the modeler with the only option of approximating the continuous distribution
using a finite set of points. In other words, each distribution, for example, the risk
increase for women consuming a certain amount of alcohol, can be approximated by
specifying the probability that a randomly chosen woman with the given exposure
belongs to a specific risk group, i.e., WomenAtWeakRisk, WomenAtModerateRisk
or WomenAtHighRisk. This is the semantics of P-SROIQ conditional constraints.</p>
      <p>Assuming that the random variable is real-valued, a standard way of
approximating a continuous distribution is to take each interval and compute the
probability that the variable takes on a value in that interval. Then the
approximation of a distribution Pr(x) w.r.t. a finite set of intervals U is simply a
function P̂r such that
<disp-formula><tex-math><![CDATA[\widehat{Pr}(U_i) = \int_{U_i} Pr(x)\,dx.]]></tex-math></disp-formula></p>
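      <p>The point-valued approximation can be sketched in a few lines, assuming a normal distribution. The function name point_approximation and the numeric inputs are hypothetical and chosen only for illustration; the cut points follow the risk ranges introduced earlier.</p>
      <preformat><![CDATA[
```python
from statistics import NormalDist


def point_approximation(mean, variance, cut_points):
    """Return Pr(X in U_i) for (-inf, r1], (r1, r2], ..., (rl, +inf)."""
    dist = NormalDist(mu=mean, sigma=variance ** 0.5)
    # CDF at every boundary, padded with 0 and 1 for the infinite ends.
    cdf = [0.0] + [dist.cdf(r) for r in cut_points] + [1.0]
    # Probability mass of each interval is the difference of adjacent CDFs.
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]


# Hypothetical inputs: two almost identical sampling distributions.
cuts = [1.0, 1.5, 3.0]
probs_a = point_approximation(2.00, 0.6, cuts)
probs_b = point_approximation(2.01, 0.6, cuts)
for pa, pb in zip(probs_a, probs_b):
    print(f"{pa:.4f} vs {pb:.4f}")
```
]]></preformat>
      <p>Each distribution is mapped to exactly one number per interval, so the two lists differ in every position even though the means differ by only 0.01.</p>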
      <p>Unfortunately, this approximation of the results of statistical experiments is
unsatisfactory because it maps every interval to a single point. The problem is that
any arbitrarily small difference between two or more sampling distributions will
result in conflicting probabilistic statements for every interval (because the
point-valued probabilities will be different), even though the results can confirm
each other from a purely statistical point of view. Consequently, this approach
does not support working with results reported by multiple studies.</p>
      <p>Our goal is to approximate sampling distributions in P-SROIQ in a
statistically coherent way. Informally, this means that the satisfiability of probabilistic
formulas representing two or more sampling distributions must agree with their
mutual statistical consistency, i.e., whether they support a common statistical
hypothesis. The hypothesis, in this case, is that there exists a distribution (not
necessarily a unique one) over G with parameters μ, σ² such that it is supported
by all sampling distributions with the required level of confidence.</p>
      <p>We assume a (finite) population G of size N<sub>G</sub> and a random variable X
which is normally distributed across G. We also make the realistic assumption
that G is large enough so that evaluating X for all members of G is not feasible.
A common approach is to take one or more random samples from G, evaluate
X for them and estimate the actual distribution over G based on the sampling
distributions. We use μ, σ² to denote the mean and the variance of the actual
distribution and X̄<sup>(i)</sup>, (S<sup>(i)</sup>)² for the mean and the variance of the sample X<sup>(i)</sup>. For
simplicity we finally assume that the population distribution is normal.</p>
      <p>The mainstream approach for comparing two or more sampling distributions
is based on statistical hypothesis tests. For example, given two normal
distributions X̄<sup>(1)</sup>, S<sup>(1)</sup> and X̄<sup>(2)</sup>, S<sup>(2)</sup>, it is common to take X̄<sup>(1)</sup> − X̄<sup>(2)</sup>, which is a normally
distributed random variable, and perform a z-test (or a Student's t-test,
depending on the sample sizes) to see if the difference can be taken as 0 with the required
level of confidence. This amounts to calculating the standard errors of the mean (SE)
for both distributions and then computing the difference in units of SE. If the
probability of observing such a difference given the null hypothesis,<sup>7</sup> which can be
found in standard tables, is low enough, e.g., 0.05, a statistician would accept
the hypothesis that both distributions are consistent.</p>
      <p><sup>6</sup> The choice of intervals is obviously ambiguous, but this issue is orthogonal to the
approximation method presented in this paper.</p>
      <p>Our approach is slightly different from the one outlined above. It is not based on
tests but on confidence regions for sampling distributions. The approach, which
generalizes confidence intervals and dates back to Mood [9], is to estimate a
region R<sub>α</sub> in the parameter space for (μ, σ²) such that it will contain the (μ, σ²)
pair of the actual distribution 100(1 − α)% of the time as the number of estimations
goes to infinity. More formally, a 100(1 − α)% confidence region R<sub>α</sub> is a random
set for the parameters (μ, σ²) based on a group of independent normally distributed
variables X (i.e., a sample) such that [1]:<sup>8</sup></p>
      <p>
        <disp-formula><label>(1)</label><tex-math><![CDATA[P\big((\mu, \sigma^2) \in R_\alpha\big) = 1 - \alpha, \quad \text{for all } (\mu, \sigma^2)]]></tex-math></disp-formula>
      </p>
      <p>Informally, the confidence region specifies how far sampling distributions can
deviate from the population distribution while supporting it with 100(1 − α)%
confidence. Following Mood [9], we will show that for the normal distribution the
region is a convex set and, therefore, can be represented by boundary values of
(μ, σ²) such that any sampling distribution inside the boundary will be consistent
with the current distribution.</p>
      <p>Consider the sample X<sub>1</sub>, …, X<sub>n</sub> where all X<sub>i</sub> are independent random
variables with the normal distribution N(μ, σ²). Then
<disp-formula><tex-math><![CDATA[\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad \text{and} \quad S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2,]]></tex-math></disp-formula>
i.e., the sample mean and the sample variance, are random variables. It is well known that X̄ has the normal distribution N(μ, σ²/n) (or,
equivalently, (X̄ − μ)/(σ/√n) ~ N(0, 1)) while (n − 1)S²/σ² has the chi-square distribution
with n − 1 degrees of freedom [9].</p>
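      <p>The standard tables for N(0, 1) and χ²<sub>n−1</sub> provide numbers a, b, c such that
for fixed p<sub>1</sub>, p<sub>2</sub> the following equalities hold [1]:
<disp-formula><tex-math><![CDATA[P\left(-a < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < a\right) = p_1, \qquad P\left(b < \frac{(n-1)S^2}{\sigma^2} < c\right) = p_2.]]></tex-math></disp-formula></p>
      <p>The crucial fact is that the two random variables are independent (see [9] for
a proof), which implies that:</p>
      <p>
        <disp-formula><tex-math><![CDATA[P\left(-a < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < a,\; b < \frac{(n-1)S^2}{\sigma^2} < c\right) = P\left(\bar{X} - a\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + a\frac{\sigma}{\sqrt{n}},\; \frac{(n-1)S^2}{c} < \sigma^2 < \frac{(n-1)S^2}{b}\right) = p_1 p_2.]]></tex-math></disp-formula>
Thus, the 100(p<sub>1</sub>)(p<sub>2</sub>)% confidence region for (μ, σ²) takes the following form:
        <disp-formula><label>(2)</label><tex-math><![CDATA[R_{p_1,p_2}(\bar{X}, S) = \left\{ (\mu, \sigma^2) : \bar{X} - a\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + a\frac{\sigma}{\sqrt{n}},\; \frac{(n-1)S^2}{c} < \sigma^2 < \frac{(n-1)S^2}{b} \right\}]]></tex-math></disp-formula>
      </p>
      <p><sup>7</sup> The null hypothesis is a default position which, in this case, could be that the
population mean is different from at least one of X̄<sup>(1)</sup>, X̄<sup>(2)</sup>.</p>
      <p><sup>8</sup> We deliberately leave out a precise definition of a random set. For the purposes of this
paper it is sufficient to think of a random set as a random variable which takes
on subsets of some space.</p>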
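      <p>A minimal sketch of how a region of the form (2) might be computed without printed tables is given below. It makes two assumptions that are not in the text: the chi-square quantiles are obtained with the Wilson-Hilferty approximation, and b, c are chosen equal-tailed. The function names and the sample values are hypothetical.</p>
      <preformat><![CDATA[
```python
from math import sqrt
from statistics import NormalDist


def chi2_quantile(q, dof):
    """Wilson-Hilferty approximation of the chi-square quantile
    (an assumption of this sketch; reasonable for large dof)."""
    z = NormalDist().inv_cdf(q)
    h = 2.0 / (9.0 * dof)
    return dof * (1.0 - h + z * sqrt(h)) ** 3


def confidence_region(sample_var, n, p1, p2):
    """Return (a, var_lo, var_hi) describing a region of the form (2)."""
    a = NormalDist().inv_cdf((1.0 + p1) / 2.0)   # P(-a < Z < a) = p1
    b = chi2_quantile((1.0 - p2) / 2.0, n - 1)   # equal-tailed b and c, so
    c = chi2_quantile((1.0 + p2) / 2.0, n - 1)   # that P(b < chi2 < c) = p2
    return a, (n - 1) * sample_var / c, (n - 1) * sample_var / b


def in_region(mu, var, sample_mean, n, a, var_lo, var_hi):
    """Membership test for a candidate (mu, sigma^2) pair."""
    half_width = a * sqrt(var / n)
    return var_lo < var < var_hi and abs(mu - sample_mean) < half_width


# Hypothetical sample: mean 2.0, variance 0.6, n = 200, overall level 95%.
p = 0.95 ** 0.5
a, var_lo, var_hi = confidence_region(0.6, 200, p, p)
print(in_region(2.0, 0.6, 2.0, 200, a, var_lo, var_hi))
```
]]></preformat>
      <p>Setting p<sub>1</sub> = p<sub>2</sub> = √0.95 is one way of reaching an overall level of 95%; any other pair with the same product would serve equally well.</p>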
      <p>Definition 3. Let Pr(X<sup>(1)</sup>), Pr(X<sup>(2)</sup>) be distributions on two samples X<sup>(1)</sup>, X<sup>(2)</sup>
drawn independently from a population G. They are said to be consistent with
confidence 100p% if there exists a point (μ, σ²) which belongs to both R<sub>p</sub>(X̄<sup>(1)</sup>, S<sup>(1)</sup>)
and R<sub>p</sub>(X̄<sup>(2)</sup>, S<sup>(2)</sup>).</p>
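      <p>Under the assumption that both regions have the form (2) and share the same constant a, Definition 3 can be checked directly. The bands for μ widen as σ grows, so it suffices to test the largest variance admitted by both regions. The sketch below is hypothetical: each region is passed as a tuple (sample mean, n, lower variance bound, upper variance bound), and all numbers in the usage lines are invented.</p>
      <preformat><![CDATA[
```python
from math import sqrt


def regions_consistent(region1, region2, a):
    """Return True if two open regions of the form (2) share a point.

    Each region is (sample_mean, n, var_lo, var_hi); a is the constant
    taken from the normal table and assumed equal for both regions.
    """
    mean1, n1, lo1, hi1 = region1
    mean2, n2, lo2, hi2 = region2
    # The variance intervals must overlap in an interval of positive length.
    lo, hi = max(lo1, lo2), min(hi1, hi2)
    if lo >= hi:
        return False
    # The two bands for mu overlap iff |mean1 - mean2| is smaller than
    # a * sigma * (1/sqrt(n1) + 1/sqrt(n2)); this bound grows with sigma,
    # so it is enough to evaluate it at the supremum of the variances.
    reach = a * sqrt(hi) * (1.0 / sqrt(n1) + 1.0 / sqrt(n2))
    return abs(mean1 - mean2) < reach


# Hypothetical regions for illustration.
print(regions_consistent((2.0, 200, 0.48, 0.76), (2.05, 180, 0.50, 0.80), 2.2365))
```
]]></preformat>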
      <p>Now we can return to the issue of approximating a continuous sampling
distribution by a discrete set of points. Assume that the domain E of a continuous
real-valued random variable X is a disjoint union of a finite number of intervals
U = {(−∞, r<sub>1</sub>], (r<sub>1</sub>, r<sub>2</sub>], …, (r<sub>l−1</sub>, r<sub>l</sub>], (r<sub>l</sub>, +∞)}. Then the approximation of the
sampling distribution Pr(X) with mean and variance (X̄, S²) is the function P̂r
which maps each interval U<sub>i</sub> to the following real-valued set:
<disp-formula><label>(3)</label><tex-math><![CDATA[\widehat{Pr}(U_i, \bar{X}, S) = \left\{ g(\mu, \sigma^2) : (\mu, \sigma^2) \in R_{p_1,p_2}(\bar{X}, S) \right\}, \qquad g(\mu, \sigma^2) = \int_{U_i} f(x; \mu, \sigma^2)\,dx,]]></tex-math></disp-formula>
where f(x; μ, σ²) is the density of N(μ, σ²).</p>
      <p>Definition 4. Two sampling distributions Pr(X<sup>(1)</sup>), Pr(X<sup>(2)</sup>) are approximately
consistent given a finite set of intervals U if P̂r(U<sub>i</sub>, X̄<sup>(1)</sup>, S<sup>(1)</sup>) ∩ P̂r(U<sub>i</sub>, X̄<sup>(2)</sup>, S<sup>(2)</sup>)
is non-empty for all U<sub>i</sub> ∈ U.</p>
      <p>As with any approximation, the utility of approximations of sampling
distributions depends on what conclusions they help to draw about the distributions
themselves. Given that we are interested in the matter of consistency, it is
important to understand the relationship between the notions of consistency and
approximate consistency of sampling distributions. Fortunately, consistency
implies approximate consistency regardless of the partitioning of the real line:</p>
      <p>Theorem 1. If two sampling distributions Pr(X<sup>(1)</sup>), Pr(X<sup>(2)</sup>) are consistent,
then they are approximately consistent for any choice of real-valued intervals.</p>
      <p>Proof. For the distribution Pr(X<sup>(1)</sup>), the confidence region R<sub>p1,p2</sub>(X̄<sup>(1)</sup>, S<sup>(1)</sup>) is
connected (see Eq. (2)). The function g(μ, σ²) (Eq. (3)) is continuous on
it, which implies that for any U<sub>i</sub> the set P̂r(U<sub>i</sub>, X̄<sup>(1)</sup>, S<sup>(1)</sup>) is a real-valued interval
(l<sub>1</sub>, u<sub>1</sub>). Now consider a point (μ<sub>0</sub>, σ<sub>0</sub>²) ∈ R<sub>p1,p2</sub>(X̄<sup>(1)</sup>, S<sup>(1)</sup>) ∩ R<sub>p1,p2</sub>(X̄<sup>(2)</sup>, S<sup>(2)</sup>),
which exists since the distributions are consistent. It follows that l<sub>1</sub> &lt; g(μ<sub>0</sub>, σ<sub>0</sub>²) &lt;
u<sub>1</sub> (and analogously l<sub>2</sub> &lt; g(μ<sub>0</sub>, σ<sub>0</sub>²) &lt; u<sub>2</sub> for P̂r(U<sub>i</sub>, X̄<sup>(2)</sup>, S<sup>(2)</sup>)), so g(μ<sub>0</sub>, σ<sub>0</sub>²) is
a common point of both approximations on U<sub>i</sub>. As such, the distributions are
approximately consistent.</p>
      <p>The following corollary of the above theorem is at the heart of our method.
As we demonstrate below, the inconsistency of approximations can be proved
by logical reasoning in P-SROIQ (i.e., by solving the probabilistic
satisfiability problem), which means that the result enables approximate reasoning about
sampling distributions in a purely logical way. Even though the power of such
reasoning is currently limited to consistency checking, its integration with OWL/DL
reasoning and the ability to use a common, formally defined terminology for
the representation of statistical experiments is promising.</p>
      <p>Corollary 1. If sampling distributions Pr(X<sup>(1)</sup>), Pr(X<sup>(2)</sup>) are approximately
inconsistent for some choice of real-valued intervals, then they are inconsistent.</p>
    </sec>
    <sec id="sec-5">
      <title>Example of Approximate Modelling</title>
      <p>Now we present an example of the approximate representation of sampling
distributions in P-SROIQ. The task is to take two results of statistical experiments
aimed at investigating associations between alcohol consumption and the
increased risk of breast cancer among postmenopausal women. Unfortunately, it is
common for medical papers not to present explicitly all the parameters that
characterize the results of their statistical analyses. Typically, only the estimated mean
and the confidence interval are presented while, for example, the kind of
distribution is left to the reader to infer from other information. For that reason, and
because the approach above has only been developed for normal distributions,
we illustrate it on an artificial example. The information given in the example
is analogous to that given in the medical literature, e.g., [10, 11], but is complete in
the sense that all parameters and the type of the sampling distributions are known.</p>
      <p>Example 1. Consider two hypothetical papers which report results of
independent studies of associations between alcohol consumption among postmenopausal
women and their relative risk of developing breast cancer. According to study
A, the mean relative risk (RR) of ER+ breast cancer for women drinking 4g
of ethanol a day is 1.8 with a variance of 0.5. Study B has reported that the
mean RR of ER+ breast cancer for the same level of drinking is 2.2 (variance
0.7). The number of cases in the studies was 230 and 150 respectively.</p>
      <p>We propose the following four-step procedure for an approximate
representation of statistical results, similar to those in the example above, in P-SROIQ:</p>
      <p>1. Preparing concepts. The first step is to define the concepts/roles used
to describe the distribution. In our case, evidence concepts should describe
categories of women with respect to specific risk factors, e.g., alcohol intake, while
conclusion concepts describe groups of women stratified by risk increase. For
instance, the concept expression C ≡ Woman ⊓ ∃hasRiskFactor.(Postmenopause ⊓
ModerateConsumption) is used to model postmenopausal women with a moderate
level of alcohol intake.<sup>9</sup> On the other hand, the expression
D ≡ Woman ⊓ ∃hasRisk.(ModeratelyIncreasedRisk ⊓ ∃riskOf.ERPositiveBRC)
models women who are at moderately increased risk of developing ER-positive
breast cancer. Using these expressions, the modeler can specify the probability
that a random woman in the class C also belongs to the risk group D as (D|C)[l, u].</p>
      <p><sup>9</sup> The level of intake is a continuous variable which we also split into the categories
LimitedConsumption, ModerateConsumption and HeavyConsumption, which
correspond to &lt;4, 4–9.9 and ≥10 g of ethanol per day.</p>
      <p>2. Determining parameters of sampling distributions (if required).
Sometimes parameters of sampling distributions can be determined from other
information. For example, knowing the kind of distribution, the sample mean, the sample
size, the confidence interval and the methodology of its estimation, one can calculate
the sample variance.<sup>10</sup> In our case this is not needed as the distributions are normal
and the parameters are known.</p>
      <p>3. Choosing intervals. The choice of intervals for an approximation of a
continuous random variable is driven by balancing the quality of the
approximation (i.e., how closely it models the continuous distribution) against the number of
statements required. The latter has a direct impact on performance. For
Example 1 we use three concepts, WomenAtWeakRisk, WomenAtModerateRisk and
WomenAtHighRisk, which correspond to the relative risk intervals (1, 1.5], (1.5,
3.0] and (3.0, +∞) respectively.</p>
      <p>4. Computing the approximation. The final (and central) step is to
compute probability intervals for the statements that approximate the
continuous distribution. Each statement specifies the lower and upper probabilities
that the continuous random variable X will fall into an interval U<sub>i</sub>, given that the
parameters of the distribution can vary within the confidence region (2). More
formally, given the interval U<sub>i</sub>, e.g., (1, 1.5] for WomenAtWeakRisk, and the
sampling distribution (X̄, S²), the interval [l<sub>i</sub>, u<sub>i</sub>] can be computed by solving the
following non-linear optimization problem:
<disp-formula><label>(4)</label><tex-math><![CDATA[l_i \text{ (resp. } u_i) = \min \text{ (resp. } \max)\; g(\mu, \sigma^2) \quad \text{s.t.} \quad (\mu, \sigma^2) \in R_{p_1,p_2}(\bar{X}, S)]]></tex-math></disp-formula>
In Example 1, the confidence regions of the two sampling distributions (abbreviated as R<sup>(1)</sup><sub>0.95</sub> and R<sup>(2)</sup><sub>0.95</sub>) are defined by inequalities of the form (2).</p>
      <p><sup>10</sup> The variable T = (X̄ − μ)/(S/√n) has the t-distribution with n − 1 degrees
of freedom. The confidence interval is standardly computed as [X̄ − a, X̄ + a] where
a = t<sub>1−α/2, n−1</sub> · S/√n (t<sub>1−α/2, n−1</sub> is the percentile of the Student distribution). If the
confidence interval and α are known, then S can be calculated.</p>
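      <p>The optimization problem (4) can be approximated by brute force. The sketch below swaps the numerical solver for a plain grid search over the closure of a region of the form (2); it is meant only to make the structure of the problem concrete. The function names, the grid resolution and the region parameters in the usage lines are hypothetical and are not taken from Example 1.</p>
      <preformat><![CDATA[
```python
from math import inf, sqrt
from statistics import NormalDist


def interval_prob(mu, var, lo, hi):
    """g(mu, sigma^2): probability that N(mu, var) falls into (lo, hi]."""
    dist = NormalDist(mu, sqrt(var))
    upper = 1.0 if hi == inf else dist.cdf(hi)
    lower = 0.0 if lo == -inf else dist.cdf(lo)
    return upper - lower


def probability_bounds(sample_mean, n, a, var_lo, var_hi, lo, hi, steps=200):
    """Grid-search estimate of (l_i, u_i) for the interval (lo, hi]."""
    best_min, best_max = 1.0, 0.0
    for i in range(steps + 1):
        # Sweep the admissible variances.
        var = var_lo + (var_hi - var_lo) * i / steps
        half = a * sqrt(var / n)
        for j in range(steps + 1):
            # Sweep the band of admissible means for this variance.
            mu = sample_mean - half + 2.0 * half * j / steps
            g = interval_prob(mu, var, lo, hi)
            best_min = min(best_min, g)
            best_max = max(best_max, g)
    return best_min, best_max


# Hypothetical region: mean 2.0, n = 200, a = 2.2365, variance in (0.48, 0.76).
l, u = probability_bounds(2.0, 200, 2.2365, 0.48, 0.76, 1.5, 3.0)
print(f"[{l:.3f}, {u:.3f}]")
```
]]></preformat>
      <p>A grid search only brackets the true extrema from the inside, so a proper optimizer is preferable when the bounds are to be used as constraints.</p>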
      <p>Now the optimization problem (4) can be solved numerically<sup>11</sup> to obtain the
following approximations for both sampling distributions:
<disp-formula><tex-math><![CDATA[\begin{aligned}
\inf \widehat{Pr}((1, 1.5], \bar{X}^{(1)}, S^{(1)}) &= 0.219, & \sup \widehat{Pr}((1, 1.5], \bar{X}^{(1)}, S^{(1)}) &= 0.298,\\
\inf \widehat{Pr}((1.5, 3.0], \bar{X}^{(1)}, S^{(1)}) &= 0.655, & \sup \widehat{Pr}((1.5, 3.0], \bar{X}^{(1)}, S^{(1)}) &= 0.878,\\
\inf \widehat{Pr}((3.0, +\infty), \bar{X}^{(1)}, S^{(1)}) &= 0.239, & \sup \widehat{Pr}((3.0, +\infty), \bar{X}^{(1)}, S^{(1)}) &= 0.586,\\
\inf \widehat{Pr}((1, 1.5], \bar{X}^{(2)}, S^{(2)}) &= 0.116, & \sup \widehat{Pr}((1, 1.5], \bar{X}^{(2)}, S^{(2)}) &= 0.224,\\
\inf \widehat{Pr}((1.5, 3.0], \bar{X}^{(2)}, S^{(2)}) &= 0.562, & \sup \widehat{Pr}((1.5, 3.0], \bar{X}^{(2)}, S^{(2)}) &= 0.769,\\
\inf \widehat{Pr}((3.0, +\infty), \bar{X}^{(2)}, S^{(2)}) &= 0.189, & \sup \widehat{Pr}((3.0, +\infty), \bar{X}^{(2)}, S^{(2)}) &= 0.568.
\end{aligned}]]></tex-math></disp-formula>
So, for this example, the sampling distributions are approximately represented
in P-SROIQ using the following conditional constraints.</p>
      <p>{(W ⊓ ∃hR.(WeaklyIncreasedRisk ⊓ ∃riskOf.ERPositiveBRC) | C)[0.219, 0.298],
(W ⊓ ∃hR.(ModeratelyIncreasedRisk ⊓ ∃riskOf.ERPositiveBRC) | C)[0.655, 0.878],
(W ⊓ ∃hR.(StronglyIncreasedRisk ⊓ ∃riskOf.ERPositiveBRC) | C)[0.239, 0.586]}
and
{(W ⊓ ∃hR.(WeaklyIncreasedRisk ⊓ ∃riskOf.ERPositiveBRC) | C)[0.116, 0.224],
(W ⊓ ∃hR.(ModeratelyIncreasedRisk ⊓ ∃riskOf.ERPositiveBRC) | C)[0.562, 0.769],
(W ⊓ ∃hR.(StronglyIncreasedRisk ⊓ ∃riskOf.ERPositiveBRC) | C)[0.189, 0.568]}
where W and hR abbreviate Woman and hasRisk, respectively, and
C ≡ W ⊓ ∃hasRiskFactor.(Postmenopause ⊓ ModerateConsumption).</p>
      <p>Probabilistic consistency of the above set of statements can be proved by
solving the probabilistic satisfiability problem (PSAT). Modern algorithms can
decide PSAT for over a thousand P-SROIQ statements (in addition to
thousands of OWL axioms), so the method could be computationally practical [6].</p>
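      <p>Outside the logic, the condition of Definition 4 reduces to a per-interval overlap test. The sketch below applies it to the probability intervals reported for Example 1. It is a simplification made for illustration: it is not a PSAT procedure and it ignores any other axioms in the ontology. The dictionary keys and the function name are hypothetical.</p>
      <preformat><![CDATA[
```python
# Probability intervals reported for Example 1, keyed by risk group.
study_a = {"weak": (0.219, 0.298), "moderate": (0.655, 0.878), "high": (0.239, 0.586)}
study_b = {"weak": (0.116, 0.224), "moderate": (0.562, 0.769), "high": (0.189, 0.568)}


def approximately_consistent(approx1, approx2):
    """Definition 4: the two intervals must intersect for every risk group."""
    for group in approx1:
        l1, u1 = approx1[group]
        l2, u2 = approx2[group]
        # Closed intervals overlap iff the larger lower bound does not
        # exceed the smaller upper bound.
        if max(l1, l2) > min(u1, u2):
            return False
    return True


print(approximately_consistent(study_a, study_b))  # prints True
```
]]></preformat>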
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>Checking consistency of sampling distributions in P-SROIQ may well appear
cumbersome and pointless given that the same task can be done in a much
simpler way and without any logical reasoning, e.g. via testing or by analyzing
confidence regions. However, our aim is not to reduce statistical testing to
logical reasoning (that aim is indeed pointless). Our aim is to represent results of
statistical experiments using a common, unambiguously defined logical vocabulary
and to be able to reason about them. Even though probabilistic reasoning about
statistical results is currently limited to approximate consistency checking, the
potential benefits lie in combining it with reasoning about classical
knowledge. For example, the BCRA ontology contains a small taxonomy of breast
cancers by hormone receptor status. This enables us to combine results of
studies which are at different levels of granularity. For instance, Sellers et al. [10]
report associations between alcohol intake and ER(+/-) breast cancer risk, while
Suzuki et al. [11] divide it further into ER(+/-)PR(+/-) risks. Even in that simple case
non-logical reasoning about the reported results becomes much less
straightforward, and studies can also distinguish histologic types of breast cancer (see [7]).
In such complex situations reasoning about findings does involve reasoning about
background knowledge, e.g. the taxonomy of breast cancers, so a combination of
OWL and probabilistic reasoning is potentially beneficial.</p>
      <p>Footnote 11: We use Wolfram Mathematica for this purpose.</p>
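      <p>The role of the taxonomy in this section can be sketched as follows, assuming a hypothetical fragment of a receptor-status hierarchy. Only ERPositiveBRC is taken from the text; the other class names, the dictionary encoding and the helper functions are illustrative assumptions and not the BCRA ontology itself. The sketch decides whether a finding stated for a finer class also concerns a coarser one by walking up the hierarchy, which is the kind of subsumption check an OWL reasoner would perform on the real ontology.</p>
      <preformat>
```python
# Hypothetical taxonomy fragment: class -> direct superclass.
# Only ERPositiveBRC appears in the text; all other names are assumed.
SUPERCLASS = {
    "ERPositiveBRC": "BRC",
    "ERNegativeBRC": "BRC",
    "ERPositivePRPositiveBRC": "ERPositiveBRC",
    "ERPositivePRNegativeBRC": "ERPositiveBRC",
}


def ancestors(cls):
    """Return cls followed by all of its superclasses, most specific first."""
    chain = [cls]
    while chain[-1] in SUPERCLASS:
        chain.append(SUPERCLASS[chain[-1]])
    return chain


def is_subsumed_by(sub, sup):
    """True if sub equals sup or is a (possibly indirect) subclass of sup."""
    return sup in ancestors(sub)


# A finding about ER+PR+ cancers is also a finding about ER+ cancers,
# so it can be compared with a study that reports at the ER level only.
print(is_subsumed_by("ERPositivePRPositiveBRC", "ERPositiveBRC"))
print(is_subsumed_by("ERNegativeBRC", "ERPositiveBRC"))
```
      </preformat>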
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Arnold</surname>
            ,
            <given-names>B.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shavelle</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          :
          <article-title>Joint confidence sets for the mean and variance of a normal distribution</article-title>
          .
          <source>The American Statistician</source>
          <volume>52</volume>
          (
          <issue>2</issue>
          ),
          <fpage>133</fpage>
          –
          <lpage>140</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Gail</surname>
            ,
            <given-names>M.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brinton</surname>
            ,
            <given-names>L.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Byar</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corle</surname>
            ,
            <given-names>D.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Green</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schairer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mulvihill</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          :
          <article-title>Projecting individualized probabilities of developing breast cancer for white females who are being examined annually</article-title>
          .
          <source>Journal of the National Cancer Institute</source>
          <volume>81</volume>
          (
          <issue>25</issue>
          ),
          <fpage>1879</fpage>
          –
          <lpage>1886</lpage>
          (
          <year>1989</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kutz</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sattler</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>The even more irresistible SROIQ</article-title>
          .
          <source>In: Knowledge Representation and Reasoning</source>
          . pp.
          <fpage>57</fpage>
          –
          <lpage>67</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Key</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hodgson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Omar</surname>
            ,
            <given-names>R.Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jensen</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thompson</surname>
            ,
            <given-names>S.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boobis</surname>
            ,
            <given-names>A.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davies</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elliott</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Meta-analysis of studies of alcohol and breast cancer with consideration of the methodological issues</article-title>
          .
          <source>Cancer Causes Control</source>
          <volume>17</volume>
          ,
          <fpage>759</fpage>
          –
          <lpage>770</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Klinov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parsia</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Probabilistic modeling and OWL: A user oriented introduction into P-SHIQ(D)</article-title>
          .
          <source>In: OWL: Experiences and Directions</source>
          (
          <year>2008</year>
          ), http://www.webont.org/owled/2008/papers/owled2008eu_submission_32.pdf
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Klinov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parsia</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>A hybrid method for probabilistic satisfiability</article-title>
          .
          <source>In: CADE</source>
          . pp.
          <fpage>354</fpage>
          –
          <lpage>368</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lew</surname>
            ,
            <given-names>J.Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freedman</surname>
            ,
            <given-names>N.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leitzmann</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brinton</surname>
            ,
            <given-names>L.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoover</surname>
            ,
            <given-names>R.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hollenbeck</surname>
            ,
            <given-names>A.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schatzkin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Alcohol and risk of breast cancer by histologic type and hormone receptor status in postmenopausal women: the NIH-AARP Diet and Health Study</article-title>
          .
          <source>American Journal of Epidemiology</source>
          <volume>170</volume>
          (
          <issue>3</issue>
          ),
          <fpage>308</fpage>
          –
          <lpage>317</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lukasiewicz</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Expressive probabilistic description logics</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>172</volume>
          (
          <issue>6-7</issue>
          ),
          <fpage>852</fpage>
          –
          <lpage>883</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Mood</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          :
          <article-title>Introduction to the Theory of Statistics</article-title>
          .
          <publisher-name>McGraw-Hill</publisher-name>
          (
          <year>1950</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Sellers</surname>
            ,
            <given-names>T.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vierkant</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cerhan</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gapstur</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vachon</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olson</surname>
            ,
            <given-names>J.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pankratz</surname>
            ,
            <given-names>V.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kushi</surname>
            ,
            <given-names>L.H.</given-names>
          </string-name>
          :
          <article-title>Interaction of dietary folate intake, alcohol, and risk of hormone receptor-defined breast cancer in a prospective study of postmenopausal women</article-title>
          .
          <source>Cancer Epidemiology, Biomarkers and Prevention</source>
          <volume>11</volume>
          ,
          <fpage>1104</fpage>
          –
          <lpage>1107</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Suzuki</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rylander-Rudqvist</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saji</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colditz</surname>
            ,
            <given-names>G.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolk</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Alcohol and postmenopausal breast cancer risk defined by estrogen and progesterone receptor status: A prospective cohort study</article-title>
          .
          <source>Journal of the National Cancer Institute</source>
          <volume>97</volume>
          (
          <issue>21</issue>
          ),
          <fpage>1601</fpage>
          –
          <lpage>1608</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>