=Paper=
{{Paper
|id=Vol-170/paper-8
|storemode=property
|title=Probabilistic XML in Information Integration
|pdfUrl=https://ceur-ws.org/Vol-170/paper7.pdf
|volume=Vol-170
}}
==Probabilistic XML in Information Integration==
<pdf width="1500px">https://ceur-ws.org/Vol-170/paper7.pdf</pdf>
<pre>
             Probabilistic XML in Information Integration

                                                     Ander de Keijzer
                                      Faculty of EEMCS, University of Twente
                                    POBox 217, 7500AE Enschede, The Netherlands
                                               a.dekeijzer@utwente.nl


                        Abstract                                  autonomous and can, in principle, be mobile, connec-
                                                                  tions will appear and disappear continuously.
    Information integration is a difficult research                  We envision an integration system that is capable
    problem. In an ambient environment, where                     of unattended integration of information sources from
    devices can connect and disconnect arbitrar-                  different devices. The process needs to be unattended,
    ily, the problem only increases, because data                 because of the irregular availability of surrounding de-
    sources may become available at any time,                     vices. It is infeasible for the user to provide informa-
    but can also disappear. In such an envi-                      tion every time a device connects and integration is
    ronment, information integration needs to be                  initiated or resumed.
    unattended, because information integration                      The integration problem can be divided into two
    opportunities arise on ad-hoc basis.                          classes, schema integration and data integration.
                                                                  There already is a large body of work on schema inte-
    We propose to use probabilistic XML to store                  gration, but less work on data integration. This paper
    the integration result and instead of resolving               addresses the data integration part of the integration
    conflicts at integration time, just store these               problem.
    conflicts in the integrated information source
                                                                     Suppose, for example, two devices with address
    and resolve them at query time.
                                                                  books are within connection range. The address books
                                                                  of these devices are depicted in Table 1. The per-
1    Introduction                                                 sons mentioned in the address books have the same
                                                                  name, but different phone numbers. Therefore, both
The problem of integration of data from heterogeneous             elements could refer to the same person. In this case,
information sources has been on the research agenda               there may have been a mistake in entering one of the
for many years. With the increasing amount of in-                 phone numbers, or this person may have two phone
formation that is available, the need for an answer to            numbers. Another possibility is, that both elements
this problem increases. However, because of the huge              refer to two different person, who have the same name.
amounts of data, manually integrating information is                 Instead of asking the user, the system should decide
not a viable option. Unfortunately, due to the se-                itself if elements refer to the same real world object (in
mantical assumptions not captured in schema or data,              this case a person). In this paper we use XML as a
automatic integration solutions often make mistakes.              data model. Elements referring to the same real world
Therefore, most of the existing integration systems are           object are said to be equal ; the same holds for equality
semi-automatic.                                                   of entire XML subtrees. However, in most cases this
   The environment we work on is an ambient intel-                will be impossible due to the fact that not all seman-
ligent environment. Many autonomous devices, with                 tics is captured in the data. Instead of resolving these
network capabilities, are equipped with applications              conflicts, we propose to simply store the encountered
that use a database. This database resides on the de-             uncertainty. Semantical conflicts, i.e. two elements re-
vice itself, but the device should be capable of inte-            ferring to the same real world object, can be seen as
grating and synchronizing its database with databases             uncertainty. Therefore, after integration, the informa-
on other connected devices. Because all devices are               tion source will contain uncertain information. This
                                                                  uncertainty is stored as probabilistic XML.
 c 2006 for the individual paper by the paper’ authors. Copying
permitted for private and scientific purposes. Re-publication
                                                                     At query time, the end user of the system is already
of material on this page requires permission by the copyright     actively involved. This provides an opportunity for the
owners.                                                           system to get feedback from the user, without really
Proceedings of the VLDB2006 Ph.D. Workshop                        bothering the user. The feedback provided by the user
Seoul, Rep of Korea, 2006                                         can then be used to reduce the uncertainty contained
                                                                  (a) Address book 1   (b) Address book 2
in the information source.
                                                                    name      phone       name     phone
Application                                                         John      1111        John     2222
Municipalities in the Netherlands are obligated by law
to provide their services to the public through the in-
ternet. The amount of data they have collected is huge              Table 1: Two example address books
and divided over different departments. As a result
of this division, the data is also divided over differ-        • Feedback
ent databases. However, a user accessing the services            Although no human effort should be required at
through the internet is not (always) aware of this divi-         integration time, this is not the case for query
sion, therefore the data needs to be integrated. Unfor-          time. When querying the document, the user is
tunately, the data in the different sources often con-           already actively using the system. Facilities for
flicts on several attributes. Fully automatic integra-           giving feedback on query results without much
tion will possibly result in errors in the data, because         overhead can easily be incorporated in a user in-
of the conflicts in the original sources. Manual in-             terface. Feedback can not only increase accuracy
tegration is to difficult due to personal information            of the integrated information source, but also re-
contained in the data that needs to be checked at inte-          duce the size of the database.
gration time to ensure correct integration, but even if        • Usage in Integration Framework
this would be possible, it would be to labor intensive.          The probabilistic model will have to be imple-
    Probabilistic integration solves the problem of keep-        mented in an integration framework capable of
ing wrong information and throwing away correct in-              also supporting schema transformations. A pro-
formation, because no data will be deleted. Checking             totype will be developed. We will show feasibility
the data can be postponed to the time when the data              of the approach by means of some real-life exper-
is accessed either by the person himself through the             iments.
internet, or an employee at the municipality involved          So far, we have a good understanding of the
in the particular case file.                                main principles and semantics of required functional-
                                                            ity [KK04, KKA05, KKL06]. We are currently working
                                                            on prototype development and performance aspects in
2   Research Questions
                                                            order to prove the practical viability of the approach.
For this PhD research the following questions are ad-
dressed.                                                    3   Probabilistic Data
  • Probabilistic model and approach
     Conflicts between elements at integration time         In order to support unattended information integra-
     can be seen as uncertainty between data values, or     tion, we do not make any decisions about conflicts in
     even between entire XML subtrees. A probabilis-        the data. Instead, the uncertainty is stored. For exam-
     tic model based on the possible world approach         ple, if the address books from Table 1 are integrated,
     will be developed to support postponing decisions      several results are possible.
     about conflicts in the data. Instead of resolving        1. John1 (value “John” from address book 1) and
     these conflicts, the uncertainty about elements or          John2 refer to the same real world person. One of
     subtrees will be stored.                                    the phone numbers is correct, either 1111 or 2222,
  • Querying                                                     the other one is incorrect.
     A probabilistic XML document should be queried           2. John1 and John2 refer to the same real world per-
     like a regular, non probabilistic XML document.             son. Both phone numbers are correct.
     Only the answer will be uncertain, i.e. the result       3. John1 and John2 refer to different real world per-
     will be a set of possible worlds. Therefore, the            sons.
     semantics of XPath will have to be redefined for       From this list, option 2 can be eliminated if we allow
     querying probabilistic XML.                            schema information to be used in the integration pro-
  • Reducing size of integration result                     cess and if only one phone number is allowed for each
     With no additional world knowledge, the size of        person. The other two options can not be resolved
     the resulting probabilistic XML document after         without using external information. This uncertainty
     integration tends to be huge. We plan to use a         is stored by means of probabilistic data.
     component called “The Oracle” that predicts if it          We use XML to store information, because we found
     is imaginable that two elements refer to the same      XML to be more expressive than the relational model
     real world object. Using very simple knowledge         [KKA05]. We introduced two new kinds of nodes
     rules, the number of possibilities in the result-      1. probability nodes (▽) and 2. possibility nodes (◦).
     ing information source can be reduced enormously       The document root is a probability node. Child nodes
     [KKL06], thus removing a major obstacle in the         of probability nodes are always possibility nodes and
     practical viability of this approach.                  child nodes of possibility nodes are always regular
                                      ▽
                                       ◦
                                       • addressbook
                            α cccccccc▽  [[[[[[[[(1−α)
                                                  [[[[[[[[
                 c cc cccccccc                              [[[f[ XX
              ◦ c                                             f ◦ XXXXXX
                                                      f ffffff            XXXXX
      personll• RRR                    personll•f
                                                Rf
                                                 R
                                                   f
                                                   R
                                                     f
                                                                             ll•Rperson
                                                                                 RRR
        llll        RRR
                                         llll       RRR
                                                                         llll       RR
     ▽             β ll▽RR(1−β)
                           R          ▽                  ▽            ▽               ▽
                 ll l        RRR
      ◦       ◦ l               ◦      ◦                 ◦             ◦               ◦
 name •phone •           phone • name •         phone •           name •        phone •
    John 1111                  2222 John               1111          John            2222


                                            Figure 1: Integration result
XML nodes, which in turn have only probability nodes
as child nodes. Therefore, on each level of an XML tree
there is only one kind of nodes. Figure 1 contains the
resulting XML document after integrating the address
books from Table 1.
   Subtrees of a probability node denote mutually
exclusive possible subtrees. A probability is asso-
ciated with each possibility node. Since possibility
nodes with a common parent (probability) exclude
each other, their probabilities should sum up to 1. The
probability associated with possibility nodes indicate
the level of confidence in the subtree. A probability
value of 1 means that it is certain that the subtree
will be part of the XML document. If all probabilities
in the XML document are 1, the document is certain
and can therefore be considered to be a normal XML
document.
   In our probabilistic XML model, we use the possi-
ble world approach. This means that any uncertainty
in the data produces possible worlds, or views on the                      Figure 2: Information Cycle
real world. Possibilities of an element can be seen as
                                                            approach.
a possible representation of the real world object. If
                                                               On the whole we envision it to work like the infor-
there are two elements in the information source and
                                                            mation cycle shown in Figure 2. The database of a
both have three possible representations, then there
                                                            device is integrated or synchronized with other (exter-
are nine possible worlds represented by the probabilis-
                                                            nal) information sources. As a result, possible worlds
tic XML tree. In Figure 1 three possible worlds can be
                                                            are created. This is caused by conflicts in the data
distinguished. Two worlds where there is only one per-
                                                            between different information sources. At all times,
son “John”, in the first world (with probability α × β)
                                                            the database reflects possibly conflicting observations
this person has phone number “1111” and in the sec-
                                                            of the real world (RW in the figure). The user of the
ond world (with probability α × (1 − β)) this person
                                                            system poses queries to the database and also observes
has phone number “2222”. In the third world (with
                                                            the real world. He can provide feedback on query re-
probability (1−α)) there are two persons, both named
                                                            sults based on his own observations of the real world.
“John”, one of them has phone number “1111” and the
                                                            All possible worlds represented in the database that
other has phone number “2222”.
                                                            contradict the feedback can be deleted from the set of
   The advantage of the possible world approach is          possible worlds.
that it provides for a natural way to reason about the
data. First of all, the semantics of a query is simply
the set of answers obtained from posing the query to
                                                            4   Integration Framework
each possible world individually. The probability of        We plan to use the probabilistic model in an inte-
each possible answer is the probability of the corre-       gration framework. An overview of this framework
sponding possible world. An implementation deriving         is shown in Figure 3. The integration framework
such a query result will, of course, use a more efficient   acts as a DBMS. The user can pose queries to the
                       Application                             • Different naming conventions for data values
                                                                In our research, we attempt to deal with these chal-
      Mediator                                              lenges by explicitly handling the inherent uncertain-
                                         Oracle             ties occurring in the data integration process using a
        rule engine      integrator
                                                            probabilistic database approach. Suciu’s tutorial at
                                                            SIGMOD 2005 comes with an extensive bibliography
                                                            on the topic of probabilistic data management [SD05].
      wrapper                wrapper                        Many results from the logic programming and artificial
                                                            intelligence communities are combined in [ELLS01]
                                                            which proposes a probabilistic object database. The
                                                            object-oriented data model is more expressive, but also
                                                            less flexible, than the XML data model. Nevertheless,
                                                            many things carry over to the XML world.
           Figure 3: Integration Framework                      A probabilistic database is not a new idea, see for
                                                            example [FKL97, BGMP90, ELW01, DS96, KKA05],
entire DBMS, while the rule engine distributes the          but in recent years attention grew considerably. Orig-
queries over autonomous underlying databases. We            inally, work concentrated on relational databases, but
use a rather standard rule-based approach to deal with      in [KKA05] we argue that XML can be made to express
schema integration issues.                                  uncertainty in a more natural way. Other probabilistic
   The schema integration problem itself will not be        XML databases are, for example, PXML [HGS03] and
the focus of this PhD research, instead we will develop     ProTDB [NJ02].
a probabilistic XML data model, investigate how to              Although schema and data level matching and in-
query this probabilistic XML and show how proba-            tegration can be clearly separated, schema matching
bilistic XML can be used in an integration setting.         techniques (see [MBR01] for a nice taxonomy) can of-
This part is captured in the data integrator of the ar-     ten be used or adapted to be applicable on data level.
chitecture.                                                 For example, [BN05] presents a technique to search
   The Oracle in the architecture is a component that       for duplicate records and to use these duplicates for
can indicate if, and to what degree, two elements are       schema matching. As we will see in the sequel, an
referring to the same real world object. This oracle        important problem in data integration is how to de-
will be responsible for associating initial probabilities   cide whether or not two data items refer to the same
to uncertain elements.                                      real-world object. Duplicate finding techniques can be
                                                            applied to (partly) solve this problem. Also in many
5   Validation                                              other areas duplicate finding techniques can be found,
                                                            such as data warehousing [ACG02].
In order to validate the research, we design and im-
plement a prototype. First a Haskell prototype will             A problem in using probabilistic databases for data
be used to validate the research on probabilistic XML.      integration is how to determine the probabilities.
We use Haskell, because this resembles the formal def-      Many schema matching techniques suitable for data
initions of probabilistic XML and operations thereon,       integration, however, quantify the degree of matching.
so we can easily check our ideas on main principles and     For example, instance-based matchers use classifica-
semantics of functionality.                                 tion techniques [DDH01]. If two data items from dif-
   Next, we will design a prototype on top of Mon-          ferent information sources referring to the same real-
etDB/XQuery [BGvK+ 06]. This prototype is mainly            world object conflict on some attribute value, and one
meant to show feasibility and performance of the al-        of those values is classified with less certainty that the
gorithms. The probabilistic XML implementation will         other in the class corresponding to the attribute, then
become a module in MonetDB/XQuery                           that attribute value is less likely to be correct and
                                                            should receive a smaller probability. The same holds
                                                            for techniques where dictionaries or thesauri are used:
6   Related Research                                        if a possible data value is not present in the corre-
The amount of research on information integration is        sponding dictionary, it should receive a smaller prob-
huge. [DH05] provides a nice survey. As we mentioned,       ability. Schema matching techniques can also be used
we distinguish schema matching and integration from         for data conversion as [MZ98] demonstrated.
data integration. [RB01] is a good survey on schema             Finally, an important source of schema and data
matching techniques.                                        integration techniques can be drawn from the Seman-
   The challenges for data integration of [Lev99] have      tic Web community. As we argued in the introduc-
received much attention in recent years:                    tion, world knowledge is required for making decisions
  • Overlapping and contradictory data                      in the integration process. In theory, annotating the
  • Semantic mismatches among sources                       data with sufficient world knowledge may also over-
come the problem. The question remains if it is prac-                   survey. AI Magazine, Sp.Issue on Semantic
tical to demand from all information sources to be                      Integration, 2005.
sufficiently annotated to resolve all uncertainty. Fur-      [DS96]     Debabrata Dey and Sumit Sarkar. A prob-
thermore, it is an open problem how to determine be-                    abilistic relational model and algebra. ACM
forehand when annotations suffice to resolve all un-                    Trans. Database Syst., 21(3):339–369, 1996.
certainty. We, therefore, approach the problem from          [ELLS01]   T. Eiter, J.J. Lu, T. Lukasiewicz, and V.S.
the other end. The probabilistic data integration ap-                   Subrahmanian. Probabilistic object bases.
proach as such is independent of any world knowledge.                   ACM Trans. Database Syst., 26(3):264–312,
Adding world knowledge can then be used to restrict                     2001.
uncertainty [KKL06].                                         [ELW01]    Thomas Eiter, Thomas Lukasiewicz, and
                                                                        Michael Walter. A data model and algebra
7   Summary                                                             for probabilistic complex values. Annals of
                                                                        Mathematics and Artificial Intelligence, 33(2-
The number of autonomous devices capable of commu-                      4):205–252, 2001.
nication and containing information, is still increasing.    [FKL97]    D. Florescu, D. Koller, and A. Levy. Using
Although information integration has been an impor-                     probabilistic information in data integration.
tant area of research, due to the increase in available                 In Proc. VLDB Conf., Athens, Greece, pages
information, its importance is still growing. In this                   216–225, Aug. 1997.
paper, we have given an overview of using probabilis-        [HGS03]    E. Hung, L. Getoor, and V.S. Subrahmanian.
tic XML in data integration. We propose to use this                     PXML: A probabilistic semistructured data
probabilistic XML to facilitate unattended integration.                 model and algebra. In Proc. ICDE Conf.,
Instead of asking the user in case of conflict, these con-              Bangalore, India, pages 467–, Mar. 2003.
flicts are simply stored. At query time, when the user       [KK04]     A. de Keijzer and M. van Keulen. A possible
is already actively using the system, feedback to query                 world approach to uncertain relational data.
results can be used to reduce the uncertainty in the                    In Proc. SIUFDB Workshop, Zaragoza, Spain,
database.                                                               Sept. 2004.
                                                             [KKA05]    M. van Keulen, A. de Keijzer, and W. Alink.
References                                                              A probabilistic xml approach to data integra-
                                                                        tion. In Proc. ICDE Conf., Tokyo, Japan,
[ACG02]     R. Ananthakrishna, S. Chaudhuri, and                        pages 459–470, 2005.
            V. Ganti. Eliminating fuzzy duplicates in        [KKL06]    A. de Keijzer, M. van Keulen, and Y. Li. Tam-
            data warehouses. In Proc. VLDB Conf., Hong                  ing Data Explosion in Probabilistic Informa-
            Kong, China, pages 586–597, Aug. 2002.                      tion Integration. In Proceedings of the Inter-
[BGMP90] Daniel Barbará, Hector Garcia-Molina, and                     national Workshop on Inconsistency and In-
         Daryl Porter.     A probabilistic relational                   completeness in Databases (IIDB), March 26,
         data model.        In François Bancilhon,                     2006, Munich, Germany, March 2006.
         Costantino Thanos, and Dennis Tsichritzis,          [Lev99]    A.Y. Levy. Combining artificial intelligence
         editors, Advances in Database Technology -                     and databases for data integration. In Ar-
         EDBT’90. International Conference on Ex-                       tificial Intelligence Today, LNCS 1600, pages
         tending Database Technology, Venice, Italy,                    249–268. 1999.
         March 26-30, 1990, Proceedings, volume 416
                                                             [MBR01]    J. Madhavan, P.A. Bernstein, and E. Rahm.
         of Lecture Notes in Computer Science, pages
                                                                        Generic schema matching with Cupid. In
         60–74. Springer, 1990.
                                                                        Proc. VLDB Conf., Roma, Italy, pages 49–58,
[BGvK+ 06] P. Boncz, T. Grust, M. van Keulen, S. Mane-                  Sept. 2001.
           gold, J. Rittinger, and J. Teubner. MonetD-       [MZ98]     T. Milo and S. Zohar. Using schema matching
           B/XQuery - Fast and Scalable XQuery Pro-                     to simplify heterogeneous data translation. In
           cessor Powered by a Relational Engine. In                    Proc. VLDB Conf., New York, USA, pages
           ACM SIGMOD International Conference on                       122–133, Aug. 1998.
           Management of Data, Chicago, USA, June
           26-29, 2006, June 2006.                           [NJ02]     A. Nierman and H.V. Jagadish. ProTDB:
                                                                        Probabilistic data in XML. In Proc. VLDB
[BN05]      A. Bilke and F. Naumann. Schema matching                    Conf., Hong Kong, China, 2002.
            using duplicates. In Proc. ICDE Conf., Tokyo,
            Japan, pages 69–80, Apr. 2005.                   [RB01]     E. Rahm and P.A. Bernstein. A survey of
                                                                        approaches to automatic schema matching.
[DDH01]     A. Doan, P. Domingos, and A. Halevy. Rec-                   VLDB Journal, 10(4):334–350, Dec. 2001.
            onciling schemas of disparate data sources:
                                                             [SD05]     D. Suciu and N.N. Dalvi.            Foundations
            a machine-learning approach. In Proc. SIG-
                                                                        of probabilistic answers to queries.          In
            MOD Conf., Santa Barbara, USA, pages 509–
                                                                        Proc. SIGMOD Conf., page 963, 2005.
            520, 2001.
                                                                        Bibliographic notes to this tutorial at
[DH05]      A. Doan and A. Halevy. Semantic integration                   http://www.cs.washington.edu/homes/suciu/
            research in the database community: Brief                                                               .
                                                                          tutorial-sigmod2005-bib.pdf

</pre>