Maximum Entropy in Support of Semantically Annotated Datasets

Paulo Pinheiro da Silva, Vladik Kreinovich, and Christian Servin
Department of Computer Science, University of Texas, El Paso, TX 79968, USA
paulo@utep.edu, vladik@utep.edu

Abstract. One of the important problems of the semantic web is checking whether two datasets describe the same quantity. The existing solution to this problem is to use these datasets' ontologies to deduce that the datasets indeed represent the same quantity. However, even when the ontologies seem to confirm the identity of the two corresponding quantities, it is still possible that in reality, we deal with somewhat different quantities. A natural way to check the identity is to compare the numerical values of the measurement results: if they are close (within measurement errors), then most probably we deal with the same quantity; otherwise, we most probably deal with different ones. In this paper, we show how to perform this checking.

Key words: semantic web, ontology, uncertainty, probabilistic approach, Maximum Entropy approach

Checking whether two datasets represent the same data: formulation of the problem. In the semantic web, data are often encoded in the Resource Description Framework (RDF) [2]. In RDF, every piece of information is represented as a triple consisting of a subject, a predicate, and an object. For example, when we describe the result of measuring the gravitational field, the coordinates at which we perform the measurement form the subject, a predicate is a term indicating that the measured quantity is a gravitational field (e.g., a term hasGravityReading), and the actual measurement result is the object. In general, an RDF-based scientific dataset can be viewed as a (large) graph of RDF triples.

One of the hard-to-solve problems is that triples in two different datasets may use the same predicate hasGravityReading and still not mean the same thing: the predicates merely share a name. One way to check this is to use semantics, i.e., to specify the meanings of the terms used in both datasets by an appropriate ontology, and then use reasoning to verify that the meaning of the terms is indeed the same. In the gravity example, we conclude that the predicate hasGravityReading has the same meaning in both datasets if in both datasets, this meaning coincides with sweet:hasGravityReading, the meaning of this term in one of the Semantic Web for Earth and Environmental Terminology (SWEET) ontologies [3] that deals with gravity.
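To make the triple structure concrete, here is a minimal sketch of such a gravity-reading triple built with the Python rdflib library; the namespace, the site identifier, and the numerical value are made up for illustration, and hasGravityReading echoes the predicate name from the example above.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

# Hypothetical namespace and site identifier, for illustration only.
EX = Namespace("http://example.org/gravity#")

g = Graph()
g.bind("ex", EX)

# subject: the measurement site; predicate: the quantity being reported;
# object: the numerical measurement result.
site = EX["site_31.77N_106.50W"]
g.add((site, EX.hasGravityReading, Literal(979.124, datatype=XSD.double)))

print(g.serialize(format="turtle"))
```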
Need to take uncertainty into account. Even when the ontologies seem to confirm that we are dealing with the same concept, there is still a chance that the two datasets talk about slightly different concepts. To clarify the situation, we can use the fact that often, the two datasets contain the values measured at the same (or almost the same) locations. In such cases, to confirm that we are indeed dealing with the same concept, we can compare the corresponding measurement results $x'_1, \ldots, x'_n$ and $x''_1, \ldots, x''_n$. Due to measurement uncertainty, the measured values $x'_i$ and $x''_i$ are, in general, slightly different. The question is: based on the semantically annotated measurement results and the known information about the measurement uncertainty, how can we use the uncertainty information to either reinforce or question the conclusion that the two datasets represent the same data?

Probabilistic approach to measurement uncertainty. To answer the above question, we must start by analyzing how the measurement uncertainty is represented. In this paper, we consider the traditional probabilistic way of describing measurement uncertainty. In engineering and scientific practice, we usually assume that for each measuring instrument, we know the probability distribution of different values of the measurement error $\Delta x'_i \stackrel{\text{def}}{=} x'_i - x_i$, where $x_i$ is the (unknown) actual value. This assumption is often reasonable, since we can calibrate each measuring instrument by comparing the results of this measuring instrument with the results of a "standard" (much more accurate) one. The differences between the corresponding measurement results form the sample from which we can extract the desired distribution.

Often, after the calibration, it turns out that the tested measuring instrument is somewhat biased, in the sense that the mean value of the measurement error is different from 0. In such cases, the instrument is usually re-calibrated (by subtracting this bias (mean) from all the measurement results) to make sure that the mean is 0. Thus, without losing generality, we can assume that the mean value of the measurement error is 0: $E[\Delta x'_i] = 0$. The degree to which the measured value $x'_i$ differs from the actual value $x_i$ is usually measured by the standard deviation $\sigma'_i \stackrel{\text{def}}{=} \sqrt{E[(\Delta x'_i)^2]}$.

Gaussian distribution: justification. The measurement error is usually caused by a large number of different independent factors. It is known that under certain reasonable conditions, the joint effect of a large number of small independent factors has a probability distribution which is close to Gaussian; the corresponding results (Central Limit Theorems) are the main reason why the Gaussian (normal) distribution is indeed widespread in practice [4]. So, it is reasonable to assume that the distribution of $\Delta x'_i$ is Gaussian.

Towards a solution. We do not know the actual values $x_i$; we only know the measurement results $x'_i$ and $x''_i$ from the two datasets. For each $i$, the difference between these measurement results can be described in terms of the measurement errors:
\[ \Delta x_i \stackrel{\text{def}}{=} x'_i - x''_i = (x'_i - x_i) - (x''_i - x_i) = \Delta x'_i - \Delta x''_i. \]
It is reasonable to assume that this difference is also normally distributed. Since the mean values of $\Delta x'_i$ and $\Delta x''_i$ are zeros, the mean value of their difference $\Delta x_i$ is also 0, so it is sufficient to find the standard deviation $\sigma_i = \sqrt{V_i}$ of $\Delta x_i$. In general, for the difference of two Gaussian variables, we have
\[ \sigma_i^2 = (\sigma'_i)^2 + (\sigma''_i)^2 - 2 r_i \cdot \sigma'_i \cdot \sigma''_i, \]
where $r_i = \dfrac{E[\Delta x'_i \cdot \Delta x''_i]}{\sigma'_i \cdot \sigma''_i}$ is the correlation between the $i$-th measurement errors. It is known that the correlation $r_i$ can take all possible values from the interval $[-1, 1]$: the value $r_i = 1$ corresponds to the maximal possible (perfect) positive correlation, when $\Delta x''_i = a \cdot \Delta x'_i + b$ for some $a > 0$; the value $r_i = 0$ corresponds to the case when the measurement errors are independent; the value $r_i = -1$ corresponds to the maximal possible (perfect) negative correlation, when $\Delta x''_i = a \cdot \Delta x'_i + b$ for some $a < 0$. Other values correspond to imperfect correlation.

The problem is that usually, we have no information about the correlation between measurement errors from different datasets.
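As a quick sanity check on this variance formula, the following Python sketch draws a large sample of correlated Gaussian error pairs and compares the empirical standard deviation of their difference with the predicted value $\sqrt{(\sigma'_i)^2 + (\sigma''_i)^2 - 2 r_i \sigma'_i \sigma''_i}$; the numbers sigma1, sigma2, and r are made-up illustration values.

```python
import numpy as np

# Monte Carlo check of the standard deviation of the difference of two
# correlated Gaussian measurement errors (illustration values only).
rng = np.random.default_rng(0)
sigma1, sigma2, r = 0.8, 0.5, -0.6

cov = [[sigma1**2, r * sigma1 * sigma2],
       [r * sigma1 * sigma2, sigma2**2]]
e1, e2 = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000).T

empirical = np.std(e1 - e2)
predicted = np.sqrt(sigma1**2 + sigma2**2 - 2 * r * sigma1 * sigma2)
print(empirical, predicted)  # the two numbers should agree to ~3 decimals
```

Setting r = -1.0 reproduces the worst case considered below, where the predicted value becomes sigma1 + sigma2.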
First idea: assume independence. A usual practical approach to situations in which we have no information about possible correlations is to assume that the measurement errors are independent. A possible (somewhat informal) justification of this assumption is as follows. Each correlation $r_i$ can take any value from the interval $[-1, 1]$. We would like to choose a single value $r_i$ from this interval. We have no information why some values are more reasonable than others, i.e., whether a non-negative correlation is more probable or a non-positive correlation is more probable. Thus, our information is invariant with respect to the change $r_i \to -r_i$, and hence, the selected correlation value $r_i$ must be invariant with respect to the same transformation. Thus, we must have $r_i = -r_i$, and hence $r_i = 0$. A somewhat more formal justification of this selection can be obtained from the Maximum Entropy approach; see, e.g., [1].

Under the independence assumption, we have $\sigma_i^2 = (\sigma'_i)^2 + (\sigma''_i)^2$.

Once we know these values, we can use the $\chi^2$ criterion (see, e.g., [4]) to check whether, with a given degree of confidence $\alpha$, the observed differences are consistent with the assumption that these differences are normally distributed with standard deviations $\sigma_i$:
\[ \sum_{i=1}^{n} \frac{(\Delta x_i)^2}{\sigma_i^2} \le \chi^2_{n,\alpha}. \]
If this inequality is satisfied, i.e., if
\[ \sum_{i=1}^{n} \frac{(\Delta x_i)^2}{(\sigma'_i)^2 + (\sigma''_i)^2} \le \chi^2_{n,\alpha}, \]
then we conclude that the two datasets indeed describe the same quantity. If this inequality is not satisfied, then most probably, the datasets describe somewhat different quantities. On the other hand, there is another possibility: that the two datasets do describe the same quantity, but the measurement errors are indeed correlated.

An alternative idea: worst-case estimations. If the above inequality holds for some values $\sigma_i$, then it holds for larger values $\sigma_i$ as well. To take into account the possibility of correlations, we should only reject the similarity hypothesis when the above inequality does not hold even for the largest possible values $\sigma_i$. Since $|r_i| \le 1$, we have $\sigma_i^2 \le V_i \stackrel{\text{def}}{=} (\sigma'_i)^2 + (\sigma''_i)^2 + 2 \sigma'_i \cdot \sigma''_i$. The value $V_i$ is attained for $\Delta x''_i = -\dfrac{\sigma''_i}{\sigma'_i} \cdot \Delta x'_i$. So, the largest possible value of $\sigma_i^2$ is equal to $V_i$. One can easily check that $V_i = (\sigma'_i + \sigma''_i)^2$. Thus, in this case, if
\[ \sum_{i=1}^{n} \left( \frac{\Delta x_i}{\sigma'_i + \sigma''_i} \right)^2 \le \chi^2_{n,\alpha}, \]
then we conclude that the two datasets indeed describe the same quantity. If this inequality is not satisfied, then most probably, the datasets describe somewhat different quantities.
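A minimal Python sketch of both checks follows. The function name same_quantity_tests is ours, and we assume that scipy's chi2.ppf(1 - alpha, df=n) is an acceptable reading of the threshold $\chi^2_{n,\alpha}$; for large $n$, this quantile is indeed approximately $n$, as noted below.

```python
import numpy as np
from scipy.stats import chi2

def same_quantity_tests(x1, x2, s1, s2, alpha=0.05):
    """Chi-square checks described in the text.

    x1, x2 -- paired measurement results from the two datasets;
    s1, s2 -- the corresponding standard deviations sigma'_i and sigma''_i.
    Returns (independence_ok, worst_case_ok).
    """
    x1, x2, s1, s2 = map(np.asarray, (x1, x2, s1, s2))
    d = x1 - x2                      # Delta x_i = x'_i - x''_i
    n = d.size
    # We read chi^2_{n,alpha} as the upper (1 - alpha) quantile of the
    # chi-square distribution with n degrees of freedom.
    threshold = chi2.ppf(1.0 - alpha, df=n)

    stat_indep = np.sum(d**2 / (s1**2 + s2**2))  # Maximum Entropy: r_i = 0
    stat_worst = np.sum((d / (s1 + s2))**2)      # worst-case correlation
    return stat_indep <= threshold, stat_worst <= threshold
```

Since $(\sigma'_i + \sigma''_i)^2 \ge (\sigma'_i)^2 + (\sigma''_i)^2$, the worst-case statistic never exceeds the independence statistic, so the second check is always at least as permissive: a pair of datasets that passes the first check automatically passes the second.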
Conclusion. Based on the semantically annotated measurement results and the known information about the measurement uncertainty, how can we use the uncertainty information to either reinforce or question the conclusion that the two datasets represent the same data?

We assume that some values from the two datasets contain the results of measuring the same quantity at the same locations and/or moments of time. Let $n$ denote the total number of such measurements, let $x'_1, \ldots, x'_n$ denote the corresponding results from the first dataset, and let $x''_1, \ldots, x''_n$ denote the measurement results from the second dataset. We assume that we know the standard deviations $\sigma'_i$ and $\sigma''_i$ of these measurements, and that we have no information about the possible correlation between the corresponding measurement errors. In this case, we apply the Maximum Entropy approach and conclude that if
\[ \sum_{i=1}^{n} \frac{(\Delta x_i)^2}{(\sigma'_i)^2 + (\sigma''_i)^2} \le \chi^2_{n,\alpha}, \]
where $\chi^2_{n,\alpha} \approx n$ is the value of the $\chi^2$-criterion for the desired certainty $\alpha$, then this reinforces the original conclusion that the two datasets represent the same data. If the above inequality is not satisfied, then we conclude that either the two datasets represent different data or, alternatively, that the measurement uncertainty values $\sigma'_i$ and $\sigma''_i$ are underestimated.

If we have reasons to suspect that the measurement errors corresponding to the two datasets may be correlated, then we can be more cautious and reinforce the original conclusion even when only a weaker inequality is satisfied:
\[ \sum_{i=1}^{n} \left( \frac{\Delta x_i}{\sigma'_i + \sigma''_i} \right)^2 \le \chi^2_{n,\alpha}. \]

Acknowledgments. This work was partly supported by NSF grant HRD-0734825 and by NIH grant 1 T36 GM078000-01. The authors are thankful to the anonymous referees for valuable suggestions.

References

1. Jaynes, E. T.: Probability Theory: The Logic of Science. Cambridge University Press (2003)
2. Resource Description Framework (RDF), http://www.w3.org/RDF/
3. Semantic Web for Earth and Environmental Terminology (SWEET) ontologies, http://sweet.jpl.nasa.gov/ontology/
4. Sheskin, D.: Handbook of Parametric and Nonparametric Statistical Procedures. Chapman & Hall/CRC, Boca Raton, Florida (2004)