=Paper=
{{Paper
|id=Vol-170/paper-8
|storemode=property
|title=Probabilistic XML in Information Integration
|pdfUrl=https://ceur-ws.org/Vol-170/paper7.pdf
|volume=Vol-170
}}
==Probabilistic XML in Information Integration==
Probabilistic XML in Information Integration
Ander de Keijzer
Faculty of EEMCS, University of Twente
POBox 217, 7500AE Enschede, The Netherlands
a.dekeijzer@utwente.nl
Abstract autonomous and can, in principle, be mobile, connec-
tions will appear and disappear continuously.
Information integration is a difficult research We envision an integration system that is capable
problem. In an ambient environment, where of unattended integration of information sources from
devices can connect and disconnect arbitrar- different devices. The process needs to be unattended,
ily, the problem only increases, because data because of the irregular availability of surrounding de-
sources may become available at any time, vices. It is infeasible for the user to provide informa-
but can also disappear. In such an envi- tion every time a device connects and integration is
ronment, information integration needs to be initiated or resumed.
unattended, because information integration The integration problem can be divided into two
opportunities arise on ad-hoc basis. classes, schema integration and data integration.
There already is a large body of work on schema inte-
We propose to use probabilistic XML to store gration, but less work on data integration. This paper
the integration result and instead of resolving addresses the data integration part of the integration
conflicts at integration time, just store these problem.
conflicts in the integrated information source
Suppose, for example, two devices with address
and resolve them at query time.
books are within connection range. The address books
of these devices are depicted in Table 1. The per-
1 Introduction sons mentioned in the address books have the same
name, but different phone numbers. Therefore, both
The problem of integration of data from heterogeneous elements could refer to the same person. In this case,
information sources has been on the research agenda there may have been a mistake in entering one of the
for many years. With the increasing amount of in- phone numbers, or this person may have two phone
formation that is available, the need for an answer to numbers. Another possibility is, that both elements
this problem increases. However, because of the huge refer to two different person, who have the same name.
amounts of data, manually integrating information is Instead of asking the user, the system should decide
not a viable option. Unfortunately, due to the se- itself if elements refer to the same real world object (in
mantical assumptions not captured in schema or data, this case a person). In this paper we use XML as a
automatic integration solutions often make mistakes. data model. Elements referring to the same real world
Therefore, most of the existing integration systems are object are said to be equal ; the same holds for equality
semi-automatic. of entire XML subtrees. However, in most cases this
The environment we work on is an ambient intel- will be impossible due to the fact that not all seman-
ligent environment. Many autonomous devices, with tics is captured in the data. Instead of resolving these
network capabilities, are equipped with applications conflicts, we propose to simply store the encountered
that use a database. This database resides on the de- uncertainty. Semantical conflicts, i.e. two elements re-
vice itself, but the device should be capable of inte- ferring to the same real world object, can be seen as
grating and synchronizing its database with databases uncertainty. Therefore, after integration, the informa-
on other connected devices. Because all devices are tion source will contain uncertain information. This
uncertainty is stored as probabilistic XML.
c 2006 for the individual paper by the paper’ authors. Copying
permitted for private and scientific purposes. Re-publication
At query time, the end user of the system is already
of material on this page requires permission by the copyright actively involved. This provides an opportunity for the
owners. system to get feedback from the user, without really
Proceedings of the VLDB2006 Ph.D. Workshop bothering the user. The feedback provided by the user
Seoul, Rep of Korea, 2006 can then be used to reduce the uncertainty contained
(a) Address book 1 (b) Address book 2
in the information source.
name phone name phone
Application John 1111 John 2222
Municipalities in the Netherlands are obligated by law
to provide their services to the public through the in-
ternet. The amount of data they have collected is huge Table 1: Two example address books
and divided over different departments. As a result
of this division, the data is also divided over differ- • Feedback
ent databases. However, a user accessing the services Although no human effort should be required at
through the internet is not (always) aware of this divi- integration time, this is not the case for query
sion, therefore the data needs to be integrated. Unfor- time. When querying the document, the user is
tunately, the data in the different sources often con- already actively using the system. Facilities for
flicts on several attributes. Fully automatic integra- giving feedback on query results without much
tion will possibly result in errors in the data, because overhead can easily be incorporated in a user in-
of the conflicts in the original sources. Manual in- terface. Feedback can not only increase accuracy
tegration is to difficult due to personal information of the integrated information source, but also re-
contained in the data that needs to be checked at inte- duce the size of the database.
gration time to ensure correct integration, but even if • Usage in Integration Framework
this would be possible, it would be to labor intensive. The probabilistic model will have to be imple-
Probabilistic integration solves the problem of keep- mented in an integration framework capable of
ing wrong information and throwing away correct in- also supporting schema transformations. A pro-
formation, because no data will be deleted. Checking totype will be developed. We will show feasibility
the data can be postponed to the time when the data of the approach by means of some real-life exper-
is accessed either by the person himself through the iments.
internet, or an employee at the municipality involved So far, we have a good understanding of the
in the particular case file. main principles and semantics of required functional-
ity [KK04, KKA05, KKL06]. We are currently working
on prototype development and performance aspects in
2 Research Questions
order to prove the practical viability of the approach.
For this PhD research the following questions are ad-
dressed. 3 Probabilistic Data
• Probabilistic model and approach
Conflicts between elements at integration time In order to support unattended information integra-
can be seen as uncertainty between data values, or tion, we do not make any decisions about conflicts in
even between entire XML subtrees. A probabilis- the data. Instead, the uncertainty is stored. For exam-
tic model based on the possible world approach ple, if the address books from Table 1 are integrated,
will be developed to support postponing decisions several results are possible.
about conflicts in the data. Instead of resolving 1. John1 (value “John” from address book 1) and
these conflicts, the uncertainty about elements or John2 refer to the same real world person. One of
subtrees will be stored. the phone numbers is correct, either 1111 or 2222,
• Querying the other one is incorrect.
A probabilistic XML document should be queried 2. John1 and John2 refer to the same real world per-
like a regular, non probabilistic XML document. son. Both phone numbers are correct.
Only the answer will be uncertain, i.e. the result 3. John1 and John2 refer to different real world per-
will be a set of possible worlds. Therefore, the sons.
semantics of XPath will have to be redefined for From this list, option 2 can be eliminated if we allow
querying probabilistic XML. schema information to be used in the integration pro-
• Reducing size of integration result cess and if only one phone number is allowed for each
With no additional world knowledge, the size of person. The other two options can not be resolved
the resulting probabilistic XML document after without using external information. This uncertainty
integration tends to be huge. We plan to use a is stored by means of probabilistic data.
component called “The Oracle” that predicts if it We use XML to store information, because we found
is imaginable that two elements refer to the same XML to be more expressive than the relational model
real world object. Using very simple knowledge [KKA05]. We introduced two new kinds of nodes
rules, the number of possibilities in the result- 1. probability nodes (▽) and 2. possibility nodes (◦).
ing information source can be reduced enormously The document root is a probability node. Child nodes
[KKL06], thus removing a major obstacle in the of probability nodes are always possibility nodes and
practical viability of this approach. child nodes of possibility nodes are always regular
▽
◦
• addressbook
α cccccccc▽ [[[[[[[[(1−α)
[[[[[[[[
c cc cccccccc [[[f[ XX
◦ c f ◦ XXXXXX
f ffffff XXXXX
personll• RRR personll•f
Rf
R
f
R
f
ll•Rperson
RRR
llll RRR
llll RRR
llll RR
▽ β ll▽RR(1−β)
R ▽ ▽ ▽ ▽
ll l RRR
◦ ◦ l ◦ ◦ ◦ ◦ ◦
name •phone • phone • name • phone • name • phone •
John 1111 2222 John 1111 John 2222
Figure 1: Integration result
XML nodes, which in turn have only probability nodes
as child nodes. Therefore, on each level of an XML tree
there is only one kind of nodes. Figure 1 contains the
resulting XML document after integrating the address
books from Table 1.
Subtrees of a probability node denote mutually
exclusive possible subtrees. A probability is asso-
ciated with each possibility node. Since possibility
nodes with a common parent (probability) exclude
each other, their probabilities should sum up to 1. The
probability associated with possibility nodes indicate
the level of confidence in the subtree. A probability
value of 1 means that it is certain that the subtree
will be part of the XML document. If all probabilities
in the XML document are 1, the document is certain
and can therefore be considered to be a normal XML
document.
In our probabilistic XML model, we use the possi-
ble world approach. This means that any uncertainty
in the data produces possible worlds, or views on the Figure 2: Information Cycle
real world. Possibilities of an element can be seen as
approach.
a possible representation of the real world object. If
On the whole we envision it to work like the infor-
there are two elements in the information source and
mation cycle shown in Figure 2. The database of a
both have three possible representations, then there
device is integrated or synchronized with other (exter-
are nine possible worlds represented by the probabilis-
nal) information sources. As a result, possible worlds
tic XML tree. In Figure 1 three possible worlds can be
are created. This is caused by conflicts in the data
distinguished. Two worlds where there is only one per-
between different information sources. At all times,
son “John”, in the first world (with probability α × β)
the database reflects possibly conflicting observations
this person has phone number “1111” and in the sec-
of the real world (RW in the figure). The user of the
ond world (with probability α × (1 − β)) this person
system poses queries to the database and also observes
has phone number “2222”. In the third world (with
the real world. He can provide feedback on query re-
probability (1−α)) there are two persons, both named
sults based on his own observations of the real world.
“John”, one of them has phone number “1111” and the
All possible worlds represented in the database that
other has phone number “2222”.
contradict the feedback can be deleted from the set of
The advantage of the possible world approach is possible worlds.
that it provides for a natural way to reason about the
data. First of all, the semantics of a query is simply
the set of answers obtained from posing the query to
4 Integration Framework
each possible world individually. The probability of We plan to use the probabilistic model in an inte-
each possible answer is the probability of the corre- gration framework. An overview of this framework
sponding possible world. An implementation deriving is shown in Figure 3. The integration framework
such a query result will, of course, use a more efficient acts as a DBMS. The user can pose queries to the
Application • Different naming conventions for data values
In our research, we attempt to deal with these chal-
Mediator lenges by explicitly handling the inherent uncertain-
Oracle ties occurring in the data integration process using a
rule engine integrator
probabilistic database approach. Suciu’s tutorial at
SIGMOD 2005 comes with an extensive bibliography
on the topic of probabilistic data management [SD05].
wrapper wrapper Many results from the logic programming and artificial
intelligence communities are combined in [ELLS01]
which proposes a probabilistic object database. The
object-oriented data model is more expressive, but also
less flexible, than the XML data model. Nevertheless,
many things carry over to the XML world.
Figure 3: Integration Framework A probabilistic database is not a new idea, see for
example [FKL97, BGMP90, ELW01, DS96, KKA05],
entire DBMS, while the rule engine distributes the but in recent years attention grew considerably. Orig-
queries over autonomous underlying databases. We inally, work concentrated on relational databases, but
use a rather standard rule-based approach to deal with in [KKA05] we argue that XML can be made to express
schema integration issues. uncertainty in a more natural way. Other probabilistic
The schema integration problem itself will not be XML databases are, for example, PXML [HGS03] and
the focus of this PhD research, instead we will develop ProTDB [NJ02].
a probabilistic XML data model, investigate how to Although schema and data level matching and in-
query this probabilistic XML and show how proba- tegration can be clearly separated, schema matching
bilistic XML can be used in an integration setting. techniques (see [MBR01] for a nice taxonomy) can of-
This part is captured in the data integrator of the ar- ten be used or adapted to be applicable on data level.
chitecture. For example, [BN05] presents a technique to search
The Oracle in the architecture is a component that for duplicate records and to use these duplicates for
can indicate if, and to what degree, two elements are schema matching. As we will see in the sequel, an
referring to the same real world object. This oracle important problem in data integration is how to de-
will be responsible for associating initial probabilities cide whether or not two data items refer to the same
to uncertain elements. real-world object. Duplicate finding techniques can be
applied to (partly) solve this problem. Also in many
5 Validation other areas duplicate finding techniques can be found,
such as data warehousing [ACG02].
In order to validate the research, we design and im-
plement a prototype. First a Haskell prototype will A problem in using probabilistic databases for data
be used to validate the research on probabilistic XML. integration is how to determine the probabilities.
We use Haskell, because this resembles the formal def- Many schema matching techniques suitable for data
initions of probabilistic XML and operations thereon, integration, however, quantify the degree of matching.
so we can easily check our ideas on main principles and For example, instance-based matchers use classifica-
semantics of functionality. tion techniques [DDH01]. If two data items from dif-
Next, we will design a prototype on top of Mon- ferent information sources referring to the same real-
etDB/XQuery [BGvK+ 06]. This prototype is mainly world object conflict on some attribute value, and one
meant to show feasibility and performance of the al- of those values is classified with less certainty that the
gorithms. The probabilistic XML implementation will other in the class corresponding to the attribute, then
become a module in MonetDB/XQuery that attribute value is less likely to be correct and
should receive a smaller probability. The same holds
for techniques where dictionaries or thesauri are used:
6 Related Research if a possible data value is not present in the corre-
The amount of research on information integration is sponding dictionary, it should receive a smaller prob-
huge. [DH05] provides a nice survey. As we mentioned, ability. Schema matching techniques can also be used
we distinguish schema matching and integration from for data conversion as [MZ98] demonstrated.
data integration. [RB01] is a good survey on schema Finally, an important source of schema and data
matching techniques. integration techniques can be drawn from the Seman-
The challenges for data integration of [Lev99] have tic Web community. As we argued in the introduc-
received much attention in recent years: tion, world knowledge is required for making decisions
• Overlapping and contradictory data in the integration process. In theory, annotating the
• Semantic mismatches among sources data with sufficient world knowledge may also over-
come the problem. The question remains if it is prac- survey. AI Magazine, Sp.Issue on Semantic
tical to demand from all information sources to be Integration, 2005.
sufficiently annotated to resolve all uncertainty. Fur- [DS96] Debabrata Dey and Sumit Sarkar. A prob-
thermore, it is an open problem how to determine be- abilistic relational model and algebra. ACM
forehand when annotations suffice to resolve all un- Trans. Database Syst., 21(3):339–369, 1996.
certainty. We, therefore, approach the problem from [ELLS01] T. Eiter, J.J. Lu, T. Lukasiewicz, and V.S.
the other end. The probabilistic data integration ap- Subrahmanian. Probabilistic object bases.
proach as such is independent of any world knowledge. ACM Trans. Database Syst., 26(3):264–312,
Adding world knowledge can then be used to restrict 2001.
uncertainty [KKL06]. [ELW01] Thomas Eiter, Thomas Lukasiewicz, and
Michael Walter. A data model and algebra
7 Summary for probabilistic complex values. Annals of
Mathematics and Artificial Intelligence, 33(2-
The number of autonomous devices capable of commu- 4):205–252, 2001.
nication and containing information, is still increasing. [FKL97] D. Florescu, D. Koller, and A. Levy. Using
Although information integration has been an impor- probabilistic information in data integration.
tant area of research, due to the increase in available In Proc. VLDB Conf., Athens, Greece, pages
information, its importance is still growing. In this 216–225, Aug. 1997.
paper, we have given an overview of using probabilis- [HGS03] E. Hung, L. Getoor, and V.S. Subrahmanian.
tic XML in data integration. We propose to use this PXML: A probabilistic semistructured data
probabilistic XML to facilitate unattended integration. model and algebra. In Proc. ICDE Conf.,
Instead of asking the user in case of conflict, these con- Bangalore, India, pages 467–, Mar. 2003.
flicts are simply stored. At query time, when the user [KK04] A. de Keijzer and M. van Keulen. A possible
is already actively using the system, feedback to query world approach to uncertain relational data.
results can be used to reduce the uncertainty in the In Proc. SIUFDB Workshop, Zaragoza, Spain,
database. Sept. 2004.
[KKA05] M. van Keulen, A. de Keijzer, and W. Alink.
References A probabilistic xml approach to data integra-
tion. In Proc. ICDE Conf., Tokyo, Japan,
[ACG02] R. Ananthakrishna, S. Chaudhuri, and pages 459–470, 2005.
V. Ganti. Eliminating fuzzy duplicates in [KKL06] A. de Keijzer, M. van Keulen, and Y. Li. Tam-
data warehouses. In Proc. VLDB Conf., Hong ing Data Explosion in Probabilistic Informa-
Kong, China, pages 586–597, Aug. 2002. tion Integration. In Proceedings of the Inter-
[BGMP90] Daniel Barbará, Hector Garcia-Molina, and national Workshop on Inconsistency and In-
Daryl Porter. A probabilistic relational completeness in Databases (IIDB), March 26,
data model. In François Bancilhon, 2006, Munich, Germany, March 2006.
Costantino Thanos, and Dennis Tsichritzis, [Lev99] A.Y. Levy. Combining artificial intelligence
editors, Advances in Database Technology - and databases for data integration. In Ar-
EDBT’90. International Conference on Ex- tificial Intelligence Today, LNCS 1600, pages
tending Database Technology, Venice, Italy, 249–268. 1999.
March 26-30, 1990, Proceedings, volume 416
[MBR01] J. Madhavan, P.A. Bernstein, and E. Rahm.
of Lecture Notes in Computer Science, pages
Generic schema matching with Cupid. In
60–74. Springer, 1990.
Proc. VLDB Conf., Roma, Italy, pages 49–58,
[BGvK+ 06] P. Boncz, T. Grust, M. van Keulen, S. Mane- Sept. 2001.
gold, J. Rittinger, and J. Teubner. MonetD- [MZ98] T. Milo and S. Zohar. Using schema matching
B/XQuery - Fast and Scalable XQuery Pro- to simplify heterogeneous data translation. In
cessor Powered by a Relational Engine. In Proc. VLDB Conf., New York, USA, pages
ACM SIGMOD International Conference on 122–133, Aug. 1998.
Management of Data, Chicago, USA, June
26-29, 2006, June 2006. [NJ02] A. Nierman and H.V. Jagadish. ProTDB:
Probabilistic data in XML. In Proc. VLDB
[BN05] A. Bilke and F. Naumann. Schema matching Conf., Hong Kong, China, 2002.
using duplicates. In Proc. ICDE Conf., Tokyo,
Japan, pages 69–80, Apr. 2005. [RB01] E. Rahm and P.A. Bernstein. A survey of
approaches to automatic schema matching.
[DDH01] A. Doan, P. Domingos, and A. Halevy. Rec- VLDB Journal, 10(4):334–350, Dec. 2001.
onciling schemas of disparate data sources:
[SD05] D. Suciu and N.N. Dalvi. Foundations
a machine-learning approach. In Proc. SIG-
of probabilistic answers to queries. In
MOD Conf., Santa Barbara, USA, pages 509–
Proc. SIGMOD Conf., page 963, 2005.
520, 2001.
Bibliographic notes to this tutorial at
[DH05] A. Doan and A. Halevy. Semantic integration http://www.cs.washington.edu/homes/suciu/
research in the database community: Brief .
tutorial-sigmod2005-bib.pdf