Maximum Entropy in Support of Semantically Annotated Datasets

Paulo Pinheiro da Silva, Vladik Kreinovich, and Christian Servin
Department of Computer Science, University of Texas, El Paso, TX 79968, USA
paulo@utep.edu, vladik@utep.edu

Abstract. One of the important problems of the semantic web is checking whether two datasets describe the same quantity. The existing solution to this problem is to use these datasets' ontologies to deduce that the datasets indeed represent the same quantity. However, even when the ontologies seem to confirm the identity of the two corresponding quantities, it is still possible that in reality, we deal with somewhat different quantities. A natural way to check the identity is to compare the numerical values of the measurement results: if they are close (within measurement errors), then most probably we deal with the same quantity; otherwise, we most probably deal with different ones. In this paper, we show how to perform this checking.

Key words: semantic web, ontology, uncertainty, probabilistic approach, Maximum Entropy approach

Checking whether two datasets represent the same data: formulation of the problem. In the semantic web, data are often encoded in the Resource Description Framework (RDF) [2]. In RDF, every piece of information is represented as a triple consisting of a subject, a predicate, and an object. For example, when we describe the result of measuring the gravitational field, the coordinates at which we perform the measurement form the subject, a predicate is a term indicating that the measured quantity is a gravitational field (e.g., a term hasGravityReading), and the actual measurement result is the object. In general, an RDF-based scientific dataset can be viewed as a (large) graph of RDF triples.

One of the hard-to-solve problems is that triples in two different datasets may use the same predicate hasGravityReading and still not mean the same thing: the predicates merely share a name. One way to check this is to use semantics, i.e., to specify the meanings of the terms used in both datasets by an appropriate ontology, and then use reasoning to verify that the meaning of the terms is indeed the same. In the gravity example, we conclude that the predicate hasGravityReading has the same meaning in both datasets if in both datasets, this meaning coincides with sweet:hasGravityReading, the meaning of this term in one of the Semantic Web for Earth and Environmental Terminology (SWEET) ontologies [3] that deals with gravity.
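To make the triple structure concrete, here is a minimal sketch of such a gravity-reading triple built with the Python rdflib library; the namespace, the site identifier, and the numerical value are made up for illustration, and hasGravityReading echoes the predicate name from the example above.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

# Hypothetical namespace and site identifier, for illustration only.
EX = Namespace("http://example.org/gravity#")

g = Graph()
g.bind("ex", EX)

# subject: the measurement site; predicate: the quantity being reported;
# object: the numerical measurement result.
site = EX["site_31.77N_106.50W"]
g.add((site, EX.hasGravityReading, Literal(979.124, datatype=XSD.double)))

print(g.serialize(format="turtle"))
```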
Need to take uncertainty into account. Even when the ontologies seem to confirm that we are dealing with the same concept, there is still a chance that the two datasets talk about slightly different concepts. To clarify the situation, we can use the fact that often, the two datasets contain the values measured at the same (or almost the same) locations. In such cases, to confirm that we are indeed dealing with the same concept, we can compare the corresponding measurement results $x'_1, \ldots, x'_n$ and $x''_1, \ldots, x''_n$. Due to measurement uncertainty, the measured values $x'_i$ and $x''_i$ are, in general, slightly different. The question is: based on the semantically annotated measurement results and the known information about the measurement uncertainty, how can we use the uncertainty information to either reinforce or question the conclusion that the two datasets represent the same data?

Probabilistic approach to measurement uncertainty. To answer the above question, we must start by analyzing how the measurement uncertainty is represented. In this paper, we consider the traditional probabilistic way of describing measurement uncertainty. In engineering and scientific practice, we usually assume that for each measuring instrument, we know the probability distribution of different values of the measurement error $\Delta x'_i \stackrel{\text{def}}{=} x'_i - x_i$, where $x_i$ is the (unknown) actual value. This assumption is often reasonable, since we can calibrate each measuring instrument by comparing the results of this measuring instrument with the results of a "standard" (much more accurate) one. The differences between the corresponding measurement results form the sample from which we can extract the desired distribution.

Often, after the calibration, it turns out that the tested measuring instrument is somewhat biased, in the sense that the mean value of the measurement error is different from 0. In such cases, the instrument is usually re-calibrated (by subtracting this bias (mean) from all the measurement results) to make sure that the mean is 0. Thus, without losing generality, we can assume that the mean value of the measurement error is 0: $E[\Delta x'_i] = 0$. The degree to which the measured value $x'_i$ differs from the actual value $x_i$ is usually measured by the standard deviation $\sigma'_i \stackrel{\text{def}}{=} \sqrt{E[(\Delta x'_i)^2]}$.

Gaussian distribution: justification. The measurement error is usually caused by a large number of different independent factors. It is known that under certain reasonable conditions, the joint effect of a large number of small independent factors has a probability distribution which is close to Gaussian; the corresponding results (Central Limit Theorems) are the main reason why the Gaussian (normal) distribution is indeed widespread in practice [4]. So, it is reasonable to assume that the distribution of $\Delta x'_i$ is Gaussian.

Towards a solution. We do not know the actual values $x_i$; we only know the measurement results $x'_i$ and $x''_i$ from the two datasets. For each $i$, the difference between these measurement results can be described in terms of the measurement errors:
\[ \Delta x_i \stackrel{\text{def}}{=} x'_i - x''_i = (x'_i - x_i) - (x''_i - x_i) = \Delta x'_i - \Delta x''_i. \]
It is reasonable to assume that this difference is also normally distributed. Since the mean values of $\Delta x'_i$ and $\Delta x''_i$ are zeros, the mean value of their difference $\Delta x_i$ is also 0, so it is sufficient to find the standard deviation $\sigma_i = \sqrt{V_i}$ of $\Delta x_i$. In general, for the difference of two Gaussian variables, we have
\[ \sigma_i^2 = (\sigma'_i)^2 + (\sigma''_i)^2 - 2 r_i \cdot \sigma'_i \cdot \sigma''_i, \]
where $r_i = \dfrac{E[\Delta x'_i \cdot \Delta x''_i]}{\sigma'_i \cdot \sigma''_i}$ is the correlation between the $i$-th measurement errors. It is known that the correlation $r_i$ can take all possible values from the interval $[-1, 1]$: the value $r_i = 1$ corresponds to the maximal possible (perfect) positive correlation, when $\Delta x''_i = a \cdot \Delta x'_i + b$ for some $a > 0$; the value $r_i = 0$ corresponds to the case when the measurement errors are independent; the value $r_i = -1$ corresponds to the maximal possible (perfect) negative correlation, when $\Delta x''_i = a \cdot \Delta x'_i + b$ for some $a < 0$. Other values correspond to imperfect correlation.

The problem is that usually, we have no information about the correlation between measurement errors from different datasets.
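As a quick sanity check on this variance formula, the following Python sketch draws a large sample of correlated Gaussian error pairs and compares the empirical standard deviation of their difference with the predicted value $\sqrt{(\sigma'_i)^2 + (\sigma''_i)^2 - 2 r_i \sigma'_i \sigma''_i}$; the numbers sigma1, sigma2, and r are made-up illustration values.

```python
import numpy as np

# Monte Carlo check of the standard deviation of the difference of two
# correlated Gaussian measurement errors (illustration values only).
rng = np.random.default_rng(0)
sigma1, sigma2, r = 0.8, 0.5, -0.6

cov = [[sigma1**2, r * sigma1 * sigma2],
       [r * sigma1 * sigma2, sigma2**2]]
e1, e2 = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000).T

empirical = np.std(e1 - e2)
predicted = np.sqrt(sigma1**2 + sigma2**2 - 2 * r * sigma1 * sigma2)
print(empirical, predicted)  # the two numbers should agree to ~3 decimals
```

Setting r = -1.0 reproduces the worst case considered below, where the predicted value becomes sigma1 + sigma2.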
First idea: assume independence. A usual practical approach to situations in which we have no information about possible correlations is to assume that the measurement errors are independent. A possible (somewhat informal) justification of this assumption is as follows. Each correlation $r_i$ can take any value from the interval $[-1, 1]$. We would like to choose a single value $r_i$ from this interval. We have no information why some values are more reasonable than others, i.e., whether a non-negative correlation is more probable or a non-positive correlation is more probable. Thus, our information is invariant with respect to the change $r_i \to -r_i$, and hence, the selected correlation value $r_i$ must be invariant with respect to the same transformation. Thus, we must have $r_i = -r_i$, and hence $r_i = 0$. A somewhat more formal justification of this selection can be obtained from the Maximum Entropy approach; see, e.g., [1].

Under the independence assumption, we have $\sigma_i^2 = (\sigma'_i)^2 + (\sigma''_i)^2$.

Once we know these values, we can use the $\chi^2$ criterion (see, e.g., [4]) to check whether, with a given degree of confidence $\alpha$, the observed differences are consistent with the assumption that these differences are normally distributed with standard deviations $\sigma_i$:
\[ \sum_{i=1}^{n} \frac{(\Delta x_i)^2}{\sigma_i^2} \le \chi^2_{n,\alpha}. \]
If this inequality is satisfied, i.e., if
\[ \sum_{i=1}^{n} \frac{(\Delta x_i)^2}{(\sigma'_i)^2 + (\sigma''_i)^2} \le \chi^2_{n,\alpha}, \]
then we conclude that the two datasets indeed describe the same quantity. If this inequality is not satisfied, then most probably, the datasets describe somewhat different quantities. On the other hand, there is another possibility: that the two datasets do describe the same quantity, but the measurement errors are indeed correlated.

An alternative idea: worst-case estimations. If the above inequality holds for some values $\sigma_i$, then it holds for larger values $\sigma_i$ as well. To take into account the possibility of correlations, we should only reject the similarity hypothesis when the above inequality does not hold even for the largest possible values $\sigma_i$. Since $|r_i| \le 1$, we have $\sigma_i^2 \le V_i \stackrel{\text{def}}{=} (\sigma'_i)^2 + (\sigma''_i)^2 + 2 \sigma'_i \cdot \sigma''_i$. The value $V_i$ is attained for $\Delta x''_i = -\dfrac{\sigma''_i}{\sigma'_i} \cdot \Delta x'_i$. So, the largest possible value of $\sigma_i^2$ is equal to $V_i$. One can easily check that $V_i = (\sigma'_i + \sigma''_i)^2$. Thus, in this case, if
\[ \sum_{i=1}^{n} \left( \frac{\Delta x_i}{\sigma'_i + \sigma''_i} \right)^2 \le \chi^2_{n,\alpha}, \]
then we conclude that the two datasets indeed describe the same quantity. If this inequality is not satisfied, then most probably, the datasets describe somewhat different quantities.
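A minimal Python sketch of both checks follows. The function name same_quantity_tests is ours, and we assume that scipy's chi2.ppf(1 - alpha, df=n) is an acceptable reading of the threshold $\chi^2_{n,\alpha}$; for large $n$, this quantile is indeed approximately $n$, as noted below.

```python
import numpy as np
from scipy.stats import chi2

def same_quantity_tests(x1, x2, s1, s2, alpha=0.05):
    """Chi-square checks described in the text.

    x1, x2 -- paired measurement results from the two datasets;
    s1, s2 -- the corresponding standard deviations sigma'_i and sigma''_i.
    Returns (independence_ok, worst_case_ok).
    """
    x1, x2, s1, s2 = map(np.asarray, (x1, x2, s1, s2))
    d = x1 - x2                      # Delta x_i = x'_i - x''_i
    n = d.size
    # We read chi^2_{n,alpha} as the upper (1 - alpha) quantile of the
    # chi-square distribution with n degrees of freedom.
    threshold = chi2.ppf(1.0 - alpha, df=n)

    stat_indep = np.sum(d**2 / (s1**2 + s2**2))  # Maximum Entropy: r_i = 0
    stat_worst = np.sum((d / (s1 + s2))**2)      # worst-case correlation
    return stat_indep <= threshold, stat_worst <= threshold
```

Since $(\sigma'_i + \sigma''_i)^2 \ge (\sigma'_i)^2 + (\sigma''_i)^2$, the worst-case statistic never exceeds the independence statistic, so the second check is always at least as permissive: a pair of datasets that passes the first check automatically passes the second.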
Conclusion. Based on the semantically annotated measurement results and the known information about the measurement uncertainty, how can we use the uncertainty information to either reinforce or question the conclusion that the two datasets represent the same data?

We assume that some values from the two datasets contain the results of measuring the same quantity at the same locations and/or moments of time. Let $n$ denote the total number of such measurements, let $x'_1, \ldots, x'_n$ denote the corresponding results from the first dataset, and let $x''_1, \ldots, x''_n$ denote the measurement results from the second dataset. We assume that we know the standard deviations $\sigma'_i$ and $\sigma''_i$ of these measurements, and that we have no information about the possible correlation between the corresponding measurement errors. In this case, we apply the Maximum Entropy approach and conclude that if
\[ \sum_{i=1}^{n} \frac{(\Delta x_i)^2}{(\sigma'_i)^2 + (\sigma''_i)^2} \le \chi^2_{n,\alpha}, \]
where $\chi^2_{n,\alpha} \approx n$ is the value of the $\chi^2$-criterion for the desired certainty $\alpha$, then this reinforces the original conclusion that the two datasets represent the same data. If the above inequality is not satisfied, then we conclude that either the two datasets represent different data or, alternatively, that the measurement uncertainty values $\sigma'_i$ and $\sigma''_i$ are underestimated.

If we have reasons to suspect that the measurement errors corresponding to the two datasets may be correlated, then we can be more cautious and reinforce the original conclusion even when only a weaker inequality is satisfied:
\[ \sum_{i=1}^{n} \left( \frac{\Delta x_i}{\sigma'_i + \sigma''_i} \right)^2 \le \chi^2_{n,\alpha}. \]

Acknowledgments. This work was partly supported by NSF grant HRD-0734825 and by NIH grant 1 T36 GM078000-01. The authors are thankful to the anonymous referees for valuable suggestions.

References

1. Jaynes, E. T.: Probability Theory: The Logic of Science. Cambridge University Press (2003)
2. Resource Description Framework (RDF), http://www.w3.org/RDF/
3. Semantic Web for Earth and Environmental Terminology (SWEET) ontologies, http://sweet.jpl.nasa.gov/ontology/
4. Sheskin, D.: Handbook of Parametric and Nonparametric Statistical Procedures. Chapman & Hall/CRC, Boca Raton, Florida (2004)