<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>T. Eiter, J.J. Lu, T. Lukasiewicz, and V.S.
Subrahmanian. Probabilistic object bases.
ACM Trans. Database Syst.</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Probabilistic XML in Information Integration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ander de Keijzer</string-name>
          <email>a.dekeijzer@utwente.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of EEMCS, University of Twente POBox 217</institution>
          ,
          <addr-line>7500AE Enschede</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2006</year>
      </pub-date>
      <volume>26</volume>
      <issue>3</issue>
      <fpage>216</fpage>
      <lpage>225</lpage>
      <abstract>
        <p>Information integration is a difficult research problem. In an ambient environment, where devices can connect and disconnect arbitrarily, the problem only increases, because data sources may become available at any time, but can also disappear. In such an environment, information integration needs to be unattended, because information integration opportunities arise on ad-hoc basis. We propose to use probabilistic XML to store the integration result and instead of resolving conflicts at integration time, just store these conflicts in the integrated information source and resolve them at query time.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>The problem of integration of data from heterogeneous
information sources has been on the research agenda
for many years. With the increasing amount of
information that is available, the need for an answer to
this problem increases. However, because of the huge
amounts of data, manually integrating information is
not a viable option. Unfortunately, due to the
semantical assumptions not captured in schema or data,
automatic integration solutions often make mistakes.
Therefore, most of the existing integration systems are
semi-automatic.</p>
      <p>The environment we work on is an ambient
intelligent environment. Many autonomous devices, with
network capabilities, are equipped with applications
that use a database. This database resides on the
device itself, but the device should be capable of
integrating and synchronizing its database with databases
on other connected devices. Because all devices are
c 2006 for the individual paper by the paper’ authors. Copying
permitted for private and scientific purposes. Re-publication
of material on this page requires permission by the copyright
owners.
autonomous and can, in principle, be mobile,
connections will appear and disappear continuously.</p>
      <p>We envision an integration system that is capable
of unattended integration of information sources from
different devices. The process needs to be unattended,
because of the irregular availability of surrounding
devices. It is infeasible for the user to provide
information every time a device connects and integration is
initiated or resumed.</p>
      <p>The integration problem can be divided into two
classes, schema integration and data integration.
There already is a large body of work on schema
integration, but less work on data integration. This paper
addresses the data integration part of the integration
problem.</p>
      <p>Suppose, for example, two devices with address
books are within connection range. The address books
of these devices are depicted in Table 1. The
persons mentioned in the address books have the same
name, but different phone numbers. Therefore, both
elements could refer to the same person. In this case,
there may have been a mistake in entering one of the
phone numbers, or this person may have two phone
numbers. Another possibility is, that both elements
refer to two different person, who have the same name.</p>
      <p>Instead of asking the user, the system should decide
itself if elements refer to the same real world object (in
this case a person). In this paper we use XML as a
data model. Elements referring to the same real world
object are said to be equal ; the same holds for equality
of entire XML subtrees. However, in most cases this
will be impossible due to the fact that not all
semantics is captured in the data. Instead of resolving these
conflicts, we propose to simply store the encountered
uncertainty. Semantical conflicts, i.e. two elements
referring to the same real world object, can be seen as
uncertainty. Therefore, after integration, the
information source will contain uncertain information. This
uncertainty is stored as probabilistic XML.</p>
      <p>At query time, the end user of the system is already
actively involved. This provides an opportunity for the
system to get feedback from the user, without really
bothering the user. The feedback provided by the user
can then be used to reduce the uncertainty contained
in the information source.
Municipalities in the Netherlands are obligated by law
to provide their services to the public through the
internet. The amount of data they have collected is huge
and divided over different departments. As a result
of this division, the data is also divided over
different databases. However, a user accessing the services
through the internet is not (always) aware of this
division, therefore the data needs to be integrated.
Unfortunately, the data in the different sources often
conflicts on several attributes. Fully automatic
integration will possibly result in errors in the data, because
of the conflicts in the original sources. Manual
integration is to difficult due to personal information
contained in the data that needs to be checked at
integration time to ensure correct integration, but even if
this would be possible, it would be to labor intensive.</p>
      <p>Probabilistic integration solves the problem of
keeping wrong information and throwing away correct
information, because no data will be deleted. Checking
the data can be postponed to the time when the data
is accessed either by the person himself through the
internet, or an employee at the municipality involved
in the particular case file.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Research Questions</title>
      <p>For this PhD research the following questions are
addressed.</p>
      <p>• Probabilistic model and approach</p>
      <p>Conflicts between elements at integration time
can be seen as uncertainty between data values, or
even between entire XML subtrees. A
probabilistic model based on the possible world approach
will be developed to support postponing decisions
about conflicts in the data. Instead of resolving
these conflicts, the uncertainty about elements or
subtrees will be stored.
• Querying</p>
      <p>A probabilistic XML document should be queried
like a regular, non probabilistic XML document.
Only the answer will be uncertain, i.e. the result
will be a set of possible worlds. Therefore, the
semantics of XPath will have to be redefined for
querying probabilistic XML.
• Reducing size of integration result</p>
      <p>With no additional world knowledge, the size of
the resulting probabilistic XML document after
integration tends to be huge. We plan to use a
component called “The Oracle” that predicts if it
is imaginable that two elements refer to the same
real world object. Using very simple knowledge
rules, the number of possibilities in the
resulting information source can be reduced enormously
[KKL06], thus removing a major obstacle in the
practical viability of this approach.
name
John
phone
1111
name
John
• Feedback</p>
      <p>Although no human effort should be required at
integration time, this is not the case for query
time. When querying the document, the user is
already actively using the system. Facilities for
giving feedback on query results without much
overhead can easily be incorporated in a user
interface. Feedback can not only increase accuracy
of the integrated information source, but also
reduce the size of the database.
• Usage in Integration Framework</p>
      <p>The probabilistic model will have to be
implemented in an integration framework capable of
also supporting schema transformations. A
prototype will be developed. We will show feasibility
of the approach by means of some real-life
experiments.</p>
      <p>So far, we have a good understanding of the
main principles and semantics of required
functionality [KK04, KKA05, KKL06]. We are currently working
on prototype development and performance aspects in
order to prove the practical viability of the approach.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Probabilistic Data</title>
      <p>In order to support unattended information
integration, we do not make any decisions about conflicts in
the data. Instead, the uncertainty is stored. For
example, if the address books from Table 1 are integrated,
several results are possible.</p>
      <p>1. John1 (value “John” from address book 1) and</p>
      <p>John2 refer to the same real world person. One of
the phone numbers is correct, either 1111 or 2222,
the other one is incorrect.
2. John1 and John2 refer to the same real world
per</p>
      <p>son. Both phone numbers are correct.
3. John1 and John2 refer to different real world
per</p>
      <p>sons.</p>
      <p>From this list, option 2 can be eliminated if we allow
schema information to be used in the integration
process and if only one phone number is allowed for each
person. The other two options can not be resolved
without using external information. This uncertainty
is stored by means of probabilistic data.</p>
      <p>We use XML to store information, because we found
XML to be more expressive than the relational model
[KKA05]. We introduced two new kinds of nodes
1. probability nodes (▽) and 2. possibility nodes (◦).</p>
      <p>The document root is a probability node. Child nodes
of probability nodes are always possibility nodes and
child nodes of possibility nodes are always regular
▽
◦
• addressbook
▽plelrlsollnl•RRRRRR
◦ccccccccccαcccccccccc▽▽pl[el[rls[ol[ln[l[•[Rf[Rf([1Rf[−Rf[Rαf[R)f[f[f[f[f[f[f[f[◦XXXXXXXXXXXXX•RpRReRrsRoRn
▽llllll
◦lllβlll▽RR(R1R−RβR)◦
◦
name •phone •</p>
      <p>John 1111</p>
      <p>◦
phone • name •
2222 John
▽
◦
phone •
1111</p>
      <p>◦
name •</p>
      <p>John
▽
◦
phone •
2222
XML nodes, which in turn have only probability nodes
as child nodes. Therefore, on each level of an XML tree
there is only one kind of nodes. Figure 1 contains the
resulting XML document after integrating the address
books from Table 1.</p>
      <p>Subtrees of a probability node denote mutually
exclusive possible subtrees. A probability is
associated with each possibility node. Since possibility
nodes with a common parent (probability) exclude
each other, their probabilities should sum up to 1. The
probability associated with possibility nodes indicate
the level of confidence in the subtree. A probability
value of 1 means that it is certain that the subtree
will be part of the XML document. If all probabilities
in the XML document are 1, the document is certain
and can therefore be considered to be a normal XML
document.</p>
      <p>In our probabilistic XML model, we use the
possible world approach. This means that any uncertainty
in the data produces possible worlds, or views on the
real world. Possibilities of an element can be seen as
a possible representation of the real world object. If
there are two elements in the information source and
both have three possible representations, then there
are nine possible worlds represented by the
probabilistic XML tree. In Figure 1 three possible worlds can be
distinguished. Two worlds where there is only one
person “John”, in the first world (with probability α × β)
this person has phone number “1111” and in the
second world (with probability α × (1 − β)) this person
has phone number “2222”. In the third world (with
probability (1−α)) there are two persons, both named
“John”, one of them has phone number “1111” and the
other has phone number “2222”.</p>
      <p>The advantage of the possible world approach is
that it provides for a natural way to reason about the
data. First of all, the semantics of a query is simply
the set of answers obtained from posing the query to
each possible world individually. The probability of
each possible answer is the probability of the
corresponding possible world. An implementation deriving
such a query result will, of course, use a more efficient
approach.</p>
      <p>On the whole we envision it to work like the
information cycle shown in Figure 2. The database of a
device is integrated or synchronized with other
(external) information sources. As a result, possible worlds
are created. This is caused by conflicts in the data
between different information sources. At all times,
the database reflects possibly conflicting observations
of the real world (RW in the figure). The user of the
system poses queries to the database and also observes
the real world. He can provide feedback on query
results based on his own observations of the real world.</p>
      <p>All possible worlds represented in the database that
contradict the feedback can be deleted from the set of
possible worlds.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Integration Framework</title>
      <p>We plan to use the probabilistic model in an
integration framework. An overview of this framework
is shown in Figure 3. The integration framework
acts as a DBMS. The user can pose queries to the</p>
      <sec id="sec-4-1">
        <title>Mediator</title>
        <p>rule engine
integrator</p>
      </sec>
      <sec id="sec-4-2">
        <title>Oracle wrapper wrapper</title>
        <p>entire DBMS, while the rule engine distributes the
queries over autonomous underlying databases. We
use a rather standard rule-based approach to deal with
schema integration issues.</p>
        <p>The schema integration problem itself will not be
the focus of this PhD research, instead we will develop
a probabilistic XML data model, investigate how to
query this probabilistic XML and show how
probabilistic XML can be used in an integration setting.
This part is captured in the data integrator of the
architecture.</p>
        <p>The Oracle in the architecture is a component that
can indicate if, and to what degree, two elements are
referring to the same real world object. This oracle
will be responsible for associating initial probabilities
to uncertain elements.
5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Validation</title>
      <p>In order to validate the research, we design and
implement a prototype. First a Haskell prototype will
be used to validate the research on probabilistic XML.
We use Haskell, because this resembles the formal
definitions of probabilistic XML and operations thereon,
so we can easily check our ideas on main principles and
semantics of functionality.</p>
      <p>Next, we will design a prototype on top of
MonetDB/XQuery [BGvK+06]. This prototype is mainly
meant to show feasibility and performance of the
algorithms. The probabilistic XML implementation will
become a module in MonetDB/XQuery
6</p>
    </sec>
    <sec id="sec-6">
      <title>Related Research</title>
      <p>The amount of research on information integration is
huge. [DH05] provides a nice survey. As we mentioned,
we distinguish schema matching and integration from
data integration. [RB01] is a good survey on schema
matching techniques.</p>
      <p>The challenges for data integration of [Lev99] have
received much attention in recent years:
• Overlapping and contradictory data
• Semantic mismatches among sources
• Different naming conventions for data values
In our research, we attempt to deal with these
challenges by explicitly handling the inherent
uncertainties occurring in the data integration process using a
probabilistic database approach. Suciu’s tutorial at
SIGMOD 2005 comes with an extensive bibliography
on the topic of probabilistic data management [SD05].
Many results from the logic programming and artificial
intelligence communities are combined in [ELLS01]
which proposes a probabilistic object database. The
object-oriented data model is more expressive, but also
less flexible, than the XML data model. Nevertheless,
many things carry over to the XML world.</p>
      <p>A probabilistic database is not a new idea, see for
example [FKL97, BGMP90, ELW01, DS96, KKA05],
but in recent years attention grew considerably.
Originally, work concentrated on relational databases, but
in [KKA05] we argue that XML can be made to express
uncertainty in a more natural way. Other probabilistic
XML databases are, for example, PXML [HGS03] and
ProTDB [NJ02].</p>
      <p>Although schema and data level matching and
integration can be clearly separated, schema matching
techniques (see [MBR01] for a nice taxonomy) can
often be used or adapted to be applicable on data level.
For example, [BN05] presents a technique to search
for duplicate records and to use these duplicates for
schema matching. As we will see in the sequel, an
important problem in data integration is how to
decide whether or not two data items refer to the same
real-world object. Duplicate finding techniques can be
applied to (partly) solve this problem. Also in many
other areas duplicate finding techniques can be found,
such as data warehousing [ACG02].</p>
      <p>A problem in using probabilistic databases for data
integration is how to determine the probabilities.
Many schema matching techniques suitable for data
integration, however, quantify the degree of matching.
For example, instance-based matchers use
classification techniques [DDH01]. If two data items from
different information sources referring to the same
realworld object conflict on some attribute value, and one
of those values is classified with less certainty that the
other in the class corresponding to the attribute, then
that attribute value is less likely to be correct and
should receive a smaller probability. The same holds
for techniques where dictionaries or thesauri are used:
if a possible data value is not present in the
corresponding dictionary, it should receive a smaller
probability. Schema matching techniques can also be used
for data conversion as [MZ98] demonstrated.</p>
      <p>Finally, an important source of schema and data
integration techniques can be drawn from the
Semantic Web community. As we argued in the
introduction, world knowledge is required for making decisions
in the integration process. In theory, annotating the
data with sufficient world knowledge may also
overcome the problem. The question remains if it is
practical to demand from all information sources to be
sufficiently annotated to resolve all uncertainty.
Furthermore, it is an open problem how to determine
beforehand when annotations suffice to resolve all
uncertainty. We, therefore, approach the problem from
the other end. The probabilistic data integration
approach as such is independent of any world knowledge.
Adding world knowledge can then be used to restrict
uncertainty [KKL06].
7</p>
    </sec>
    <sec id="sec-7">
      <title>Summary</title>
      <p>The number of autonomous devices capable of
communication and containing information, is still increasing.
Although information integration has been an
important area of research, due to the increase in available
information, its importance is still growing. In this
paper, we have given an overview of using
probabilistic XML in data integration. We propose to use this
probabilistic XML to facilitate unattended integration.
Instead of asking the user in case of conflict, these
conflicts are simply stored. At query time, when the user
is already actively using the system, feedback to query
results can be used to reduce the uncertainty in the
database.
[ACG02]
[ELLS01]
[ELW01]
[FKL97]
[HGS03]
[KK04]
[KKA05]
[KKL06]
[Lev99]
[MBR01]
[MZ98]
[NJ02]
[RB01]
[SD05]</p>
      <p>Debabrata Dey and Sumit Sarkar. A
probabilistic relational model and algebra. ACM
Trans. Database Syst., 21(3):339–369, 1996.</p>
      <p>Thomas Eiter, Thomas Lukasiewicz, and
Michael Walter. A data model and algebra
for probabilistic complex values. Annals of
Mathematics and Artificial Intelligence,
33(24):205–252, 2001.</p>
      <p>A. de Keijzer and M. van Keulen. A possible
world approach to uncertain relational data.</p>
      <p>In Proc. SIUFDB Workshop, Zaragoza, Spain,
Sept. 2004.</p>
      <p>A.Y. Levy. Combining artificial intelligence
and databases for data integration. In
Artificial Intelligence Today, LNCS 1600, pages
249–268. 1999.</p>
      <p>T. Milo and S. Zohar. Using schema matching
to simplify heterogeneous data translation. In
Proc. VLDB Conf., New York, USA, pages
122–133, Aug. 1998.</p>
      <p>A. Nierman and H.V. Jagadish. ProTDB:
Probabilistic data in XML. In Proc. VLDB
Conf., Hong Kong, China, 2002.</p>
      <p>E. Rahm and P.A. Bernstein. A survey of
approaches to automatic schema matching.</p>
      <p>VLDB Journal, 10(4):334–350, Dec. 2001.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Ananthakrishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Ganti</surname>
          </string-name>
          .
          <article-title>Eliminating fuzzy duplicates in data warehouses</article-title>
          .
          <source>In Proc. VLDB Conf</source>
          .,
          <string-name>
            <surname>Hong</surname>
            <given-names>Kong</given-names>
          </string-name>
          , China, pages
          <fpage>586</fpage>
          -
          <lpage>597</lpage>
          , Aug.
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [BGMP90]
          <article-title>Daniel Barbar´a, Hector Garcia-Molina, and</article-title>
          <string-name>
            <given-names>Daryl</given-names>
            <surname>Porter</surname>
          </string-name>
          .
          <article-title>A probabilistic relational data model</article-title>
          .
          <source>In Fran¸cois Bancilhon</source>
          , Costantino Thanos, and Dennis Tsichritzis, editors,
          <source>Advances in Database Technology - EDBT'90. International Conference on Extending Database Technology, Venice, Italy, March</source>
          <volume>26</volume>
          -30,
          <year>1990</year>
          , Proceedings, volume
          <volume>416</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>60</fpage>
          -
          <lpage>74</lpage>
          . Springer,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [BGvK+06]
          <string-name>
            <given-names>P.</given-names>
            <surname>Boncz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Grust</surname>
          </string-name>
          , M. van
          <string-name>
            <surname>Keulen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Manegold</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Rittinger</surname>
            , and
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Teubner</surname>
          </string-name>
          . MonetDB/XQuery - Fast and
          <article-title>Scalable XQuery Processor Powered by a Relational Engine</article-title>
          .
          <source>In ACM SIGMOD International Conference on Management of Data</source>
          , Chicago, USA, June 26-29,
          <year>2006</year>
          ,
          <year>June 2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [BN05] [DDH01] [DH05]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bilke</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <article-title>Schema matching using duplicates</article-title>
          .
          <source>In Proc. ICDE Conf</source>
          ., Tokyo, Japan, pages
          <fpage>69</fpage>
          -
          <lpage>80</lpage>
          , Apr.
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Domingos</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Halevy</surname>
          </string-name>
          .
          <article-title>Reconciling schemas of disparate data sources: a machine-learning approach</article-title>
          .
          <source>In Proc. SIGMOD Conf</source>
          .,
          <string-name>
            <surname>Santa</surname>
            <given-names>Barbara</given-names>
          </string-name>
          , USA, pages
          <fpage>509</fpage>
          -
          <lpage>520</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Halevy</surname>
          </string-name>
          .
          <article-title>Semantic integration research in the database community: Brief [DS96]</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>