<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Maximum Entropy in Support of Semantically Annotated Datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paulo Pinheiro da Silva</string-name>
          <email>paulo@utep.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vladik Kreinovich</string-name>
          <email>vladik@utep.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Servin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Texas at El Paso</institution>
          ,
          <addr-line>El Paso, TX 79968</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>One of the important problems of the semantic web is checking whether two datasets describe the same quantity. The existing solution to this problem is to use these datasets' ontologies to deduce that these datasets indeed represent the same quantity. However, even when ontologies seem to confirm the identity of the two corresponding quantities, it is still possible that in reality, we deal with somewhat different quantities. A natural way to check the identity is to compare the numerical values of the measurement results: if they are close (within measurement errors), then most probably we deal with the same quantity, else we most probably deal with different ones. In this paper, we show how to perform this checking.</p>
      </abstract>
      <kwd-group>
        <kwd>semantic web</kwd>
        <kwd>ontology</kwd>
        <kwd>uncertainty</kwd>
        <kwd>probabilistic approach</kwd>
        <kwd>Maximum Entropy approach</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title/>
      <p>
        Checking whether two datasets represent the same data: formulation of the problem. In the semantic web, data are often encoded in the Resource Description Framework (RDF) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In RDF, every piece of information is represented as a triple consisting of a subject, a predicate, and an object. For example, when we describe the result of measuring the gravitational field, the coordinates at which we perform the measurement serve as the subject, the predicate is a term indicating that the measured quantity is a gravitational field (e.g., the term hasGravityReading), and the actual measurement result is the object.
      </p>
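      As a toy illustration (all identifiers here are hypothetical, not taken from any actual dataset; real datasets would use full URIs), such subject–predicate–object triples can be modeled in Python as plain tuples:

```python
# A minimal sketch of RDF-style triples: (subject, predicate, object).
# "point_17", "point_18", and "hasGravityReading" are made-up names
# used only to illustrate the subject/predicate/object structure.
triples = [
    ("point_17", "hasGravityReading", 9.80213),
    ("point_18", "hasGravityReading", 9.80199),
]

def readings(dataset, predicate):
    """Collect all objects attached to a given predicate."""
    return [obj for (_, pred, obj) in dataset if pred == predicate]

print(readings(triples, "hasGravityReading"))  # the two gravity values
```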
      <p>
        In general, an RDF-based scientific dataset can be viewed as a (large) graph of RDF triples. One of the hard-to-solve problems is that triples in two different datasets that use the same predicate hasGravityReading may not mean the same thing just because the predicates have the same name. One way to check this is to use semantics, i.e., to specify the meanings of the terms used in both datasets by an appropriate ontology, and then use reasoning to verify that the meaning of the terms is indeed the same. In the gravity example, we conclude that the predicate hasGravityReading has the same meaning in both datasets if, in both datasets, this meaning coincides with sweet:hasGravityReading, the meaning of this term in the one of the Semantic Web for Earth and Environmental Terminology (SWEET) ontologies [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] that deals with gravity.
      </p>
      <p>
        Need to take uncertainty into account. Even when ontologies seem to imply that we are dealing with the same concept, there is still a chance that the two datasets talk about slightly different concepts. To clarify the situation, we can use the fact that often, the two datasets contain values measured at the same (or almost the same) locations. In such cases, to confirm that we are indeed dealing with the same concept, we can compare the corresponding measurement results x̃′_1, …, x̃′_n and x̃″_1, …, x̃″_n. Due to measurement uncertainty, the measured values x̃′_i and x̃″_i are, in general, slightly different.
      </p>
      <p>The question is: based on the semantically annotated measurement results and the known information about the measurement uncertainty, how can we use the uncertainty information to either reinforce or question the conclusion that the two datasets represent the same data?</p>
      <p>Probabilistic approach to measurement uncertainty. To answer the above
question, we must start by analyzing how the measurement uncertainty is
represented. In this paper, we consider the traditional probabilistic way of describing
measurement uncertainty.</p>
      <p>In engineering and scientific practice, we usually assume that for each measuring instrument, we know the probability distribution of different values of the measurement error Δx̃′_i = x̃′_i − x_i. This assumption is often reasonable, since we can calibrate each measuring instrument by comparing the results of this measuring instrument with the results of a “standard” (much more accurate) one. The differences between the corresponding measurement results form a sample from which we can extract the desired distribution.</p>
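      The calibration step described above can be sketched numerically. In this minimal simulation, the "standard" instrument is treated as giving the actual values; the bias (0.05) and noise level (0.02) of the tested instrument are illustrative assumptions, not data from any real device:

```python
import random
import statistics

random.seed(0)

# Simulated calibration run: the tested instrument adds a systematic
# bias and Gaussian noise to the actual (standard-instrument) values.
actual = [9.8 + 0.001 * k for k in range(1000)]
tested = [x + 0.05 + random.gauss(0.0, 0.02) for x in actual]

# The differences form the calibration sample from which we estimate
# the error distribution.
errors = [t - x for t, x in zip(tested, actual)]
bias = statistics.fmean(errors)               # estimate of E[Δx]
sigma = statistics.stdev(errors, xbar=bias)   # estimate of σ

# Re-calibration: subtract the estimated bias so that E[Δx] = 0.
recalibrated = [t - bias for t in tested]
```

The estimates recover the assumed bias and noise level up to sampling error, illustrating why assuming a zero-mean error after re-calibration loses no generality.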
      <p>Often, after the calibration, it turns out that the tested measuring instrument is somewhat biased, in the sense that the mean value of the measurement error is different from 0. In such cases, the instrument is usually re-calibrated – by subtracting this bias (mean) from all the measurement results – to make sure that the mean is 0. Thus, without losing generality, we can assume that the mean value of the measurement error is 0: E[Δx̃′_i] = 0.</p>
      <p>
        The degree to which the measured value x̃′_i differs from the actual value x_i is usually described by the standard deviation σ′_i = √(E[(Δx̃′_i)²]).
      </p>
      <p>
        Gaussian distribution: justification. The measurement error is usually caused by a large number of different independent factors. It is known that under certain reasonable conditions, the joint effect of a large number of small independent factors has a probability distribution which is close to Gaussian; the corresponding results (Central Limit Theorems) are the main reason why the Gaussian (normal) distribution is indeed widespread in practice [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. So, it is reasonable to assume that the distribution of Δx̃′_i is Gaussian.
      </p>
      <p>
        Towards a solution. We do not know the actual values x_i; we only know the measurement results x̃′_i and x̃″_i from the two datasets. For each i, the difference between these measurement results can be described in terms of the measurement errors: Δx_i = x̃′_i − x̃″_i = (x̃′_i − x_i) − (x̃″_i − x_i) = Δx̃′_i − Δx̃″_i. It is reasonable to assume that this difference is also normally distributed. Since the mean values of Δx̃′_i and Δx̃″_i are zeros, the mean value of their difference Δx_i is also 0, so it is sufficient to find the standard deviation σ_i = √V_i of Δx_i. In general, for the difference of two Gaussian variables, we have σ_i² = (σ′_i)² + (σ″_i)² − 2r_i · σ′_i · σ″_i, where r_i = E[Δx̃′_i · Δx̃″_i] / (σ′_i · σ″_i) is the correlation between the i-th measurement errors. It is known that the correlation r_i can take all possible values from the interval [−1, 1]: the value r_i = 1 corresponds to the maximal possible (perfect) positive correlation, when Δx̃″_i = a · Δx̃′_i + b for some a &gt; 0; the value r_i = 0 corresponds to the case when the measurement errors are independent; the value r_i = −1 corresponds to the maximal possible (perfect) negative correlation, when Δx̃″_i = a · Δx̃′_i + b for some a &lt; 0. Other values correspond to imperfect correlation. The problem is that usually, we have no information about the correlation between measurement errors from different datasets.
      </p>
      <p>First idea: assume independence. A usual practical approach to situations in
which we have no information about possible correlations is to assume that the
measurement errors are independent.</p>
      <p>
        A possible (somewhat informal) justification of this assumption is as follows. Each correlation r_i can take any value from the interval [−1, 1]. We would like to choose a single value r_i from this interval.
      </p>
      <p>
        We have no information about why some values would be more reasonable than others – in particular, about whether a non-negative correlation is more probable or a non-positive one. Thus, our information is invariant with respect to the change r_i → −r_i, and hence, the selected correlation value r_i must be invariant w.r.t. the same transformation. Thus, we must have r_i = −r_i, and hence r_i = 0. A somewhat more formal justification of this selection can be obtained from the Maximum Entropy approach; see, e.g., [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Under the independence assumption, we have (σ_i)² = (σ′_i)² + (σ″_i)².
      </p>
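      This additivity of variances under independence can be checked numerically; a small Monte Carlo sketch (the standard deviations 0.3 and 0.4 are arbitrary illustrative choices):

```python
import random
import statistics

random.seed(1)

sigma1, sigma2 = 0.3, 0.4  # illustrative standard deviations σ′ and σ″

# Independent zero-mean measurement errors from the two datasets.
e1 = [random.gauss(0.0, sigma1) for _ in range(200_000)]
e2 = [random.gauss(0.0, sigma2) for _ in range(200_000)]

# Differences Δx_i = Δx̃′_i − Δx̃″_i; under independence, their variance
# should be close to (σ′)² + (σ″)² = 0.09 + 0.16 = 0.25.
diff_var = statistics.pvariance([a - b for a, b in zip(e1, e2)])
```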
      <p>
        Once we know these values, we can use the χ² criterion (see, e.g., [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) to check whether, with a given degree of confidence α, the observed differences are consistent with the assumption that these differences are normally distributed with standard deviations σ_i: Σ_{i=1..n} (Δx_i)² / (σ_i)² ≤ χ²_{n,α}. If this inequality is satisfied, i.e., if Σ_{i=1..n} (Δx_i)² / ((σ′_i)² + (σ″_i)²) ≤ χ²_{n,α}, then we conclude that the two datasets indeed describe the same quantity. If this inequality is not satisfied, then most probably, the datasets describe somewhat different quantities.
      </p>
      <p>On the other hand, there is another possibility: that the two datasets do describe the same quantity, but the measurement errors are indeed correlated.</p>
      <p>An alternative idea: worst-case estimations. If the above inequality holds for some values σ_i, then it holds for larger values σ_i as well. To take into account the possibility of correlations, we should only reject the similarity hypothesis when the above inequality does not hold even for the largest possible values σ_i. Since |r_i| ≤ 1, we have (σ_i)² ≤ V_i = (σ′_i)² + (σ″_i)² + 2σ′_i · σ″_i. The value V_i is attained for Δx̃″_i = −(σ″_i / σ′_i) · Δx̃′_i. So, the largest possible value of σ_i² is equal to V_i. One can easily check that V_i = (σ′_i + σ″_i)². Thus, in this case, if Σ_{i=1..n} (Δx_i / (σ′_i + σ″_i))² ≤ χ²_{n,α}, then we conclude that the two datasets indeed describe the same quantity. If this inequality is not satisfied, then most probably, the datasets describe somewhat different quantities.</p>
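      The worst-case variant replaces the denominator (σ′_i)² + (σ″_i)² by the larger value (σ′_i + σ″_i)², so it rejects less often. A sketch, again using the Wilson–Hilferty approximation for χ²_{n,α} (an assumption of this sketch, not of the text):

```python
import math

def chi2_critical(n, z=1.6449):
    """Wilson-Hilferty approximation to chi^2_{n, alpha};
    z = 1.6449 is the normal quantile for alpha = 0.95."""
    c = 2.0 / (9.0 * n)
    return n * (1.0 - c + z * math.sqrt(c)) ** 3

def same_quantity_worst_case(x1, x2, s1, s2, z=1.6449):
    """Worst-case check: sum((Δx_i / (s'_i + s''_i))^2) <= chi^2_{n,alpha},
    valid for any correlation between the measurement errors."""
    stat = sum(((a - b) / (sa + sb)) ** 2
               for a, b, sa, sb in zip(x1, x2, s1, s2))
    return stat <= chi2_critical(len(x1), z)
```

Because (σ′_i + σ″_i)² ≥ (σ′_i)² + (σ″_i)², the worst-case statistic never exceeds the independence statistic, so any pair of datasets accepted under independence is also accepted here.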
      <p>Conclusion. Based on the semantically annotated measurement results and the known information about the measurement uncertainty, we can use the uncertainty information to either reinforce or question the conclusion that two datasets represent the same data.</p>
      <p>We assume that some values from the two datasets contain the results of measuring the same quantity at the same locations and/or moments of time. Let n denote the total number of such measurements, let x̃′_1, …, x̃′_n denote the corresponding results from the first dataset, and let x̃″_1, …, x̃″_n denote the measurement results from the second dataset. We assume that we know the standard deviations σ′_i and σ″_i of these measurements, and that we have no information about possible correlation between the corresponding measurement errors. In this case, we apply the Maximum Entropy approach, and conclude that if Σ_{i=1..n} (Δx_i)² / ((σ′_i)² + (σ″_i)²) ≤ χ²_{n,α}, where χ²_{n,α} ≈ n is the value of the χ² criterion for the desired certainty α, then this reinforces the original conclusion that the two datasets represent the same data. If the above inequality is not satisfied, then we conclude that either the two datasets represent different data or, alternatively, that the measurement uncertainty values σ′_i and σ″_i are underestimated.</p>
      <p>If we have reasons to suspect that the measurement errors corresponding to the two datasets may be correlated, then we can be more cautious and reinforce the original conclusion even when only a weaker inequality is satisfied: Σ_{i=1..n} (Δx_i / (σ′_i + σ″_i))² ≤ χ²_{n,α}.</p>
      <p>Acknowledgments. This work was partly supported by NSF grant HRD0734825 and by NIH Grant 1 T36 GM078000-01. The authors are thankful to the anonymous referees for valuable suggestions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Jaynes</surname>
            ,
            <given-names>E. T.</given-names>
          </string-name>
          :
          <source>Probability Theory: The Logic of Science</source>
          , Cambridge University Press (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <source>Resource Description Framework (RDF)</source>
          , http://www.w3.org/RDF/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <source>Semantic Web for Earth and Environmental Terminology (SWEET) ontologies</source>
          , http://sweet.jpl.nasa.gov/ontology/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Sheskin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <source>Handbook of Parametric and Nonparametric Statistical Procedures</source>
          , Chapman &amp; Hall/CRC, Boca Raton, Florida (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>