<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data Integration Using DataPile Structure</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Bednárek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Obdržálek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jakub Yaghob</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Filip Zavoral</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Software Engineering Faculty of Mathematics and Physics, Charles University Prague</institution>
        </aff>
      </contrib-group>
      <fpage>178</fpage>
      <lpage>188</lpage>
      <abstract>
        <p>One of the areas of data integration covers systems that maintain coherence among a heterogeneous set of databases. Such a system repeatedly collects data from the local databases, synchronizes them, and pushes the updates back. One of the key problems in this architecture is the conflict resolution. When data in a less relevant data source changes, it should not cause any data change in a store with higher relevancy. To meet such requirements, we propose a DataPile structure with following main advantages: effective storage of historical versions of data, straightforward adaptation to global schema changes, separation of data conversion and replication logic, simple implementation of data relevance. Key usage of such mechanisms is in projects with following traits or requirements: integration of heterogeneous data from sources with different reliability, data coherence of databases whose schema differs, data changes are performed on local databases and minimal load on the central database.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>(built upon a standard relational database system) with system architecture that
separates the replication mechanisms from the schema-matching and data-conversion
logic. Since the approach is inspired by XML techniques rather than relational
databases, its combination with modern XML-based technologies is straightforward.
Nevertheless, the system is created over relational database system and direct
integration with traditional database systems is also possible.</p>
      <p>One of the most difficult problems in the area of data integration is handling of
duplicate and inconsistent information. The key issue in this problem is entity
identification, i.e. determining the correspondence between different records in different
data sources [11, 14]. The reality requires that the system administrators understand
the principles of the entity matching algorithm; thus, various difficult formalisms
presented in the theory [7] are not applicable. Our approach uses a simplified entity
matching system which allows the users to specify matching parameters that are easy
to understand. Some researchers [6] advice that successful entity identification
requires additional semantics information. Since this information cannot be generally
given in advance, the integrated system should be able to defer decision to the user.
The system should detect inconsistencies and either resolve them, or allow users to
resolve them manually. The need for user-assisted conflict resolution induces a new
class of problems: The repository should be able to store data before final resolution
while their relationship to the real world entities is not consistent. Consequently, the
system should be able to merge entities whenever the users discover that the entities
describe the same real-world entity, and, conversely, to split an entity whenever the
previous merge is found invalid. Under the presence of integrity constraints and
history records, this requirement needs special attention.</p>
      <p>The relationship between the global system and local database is usually expressed
using the global-as-view and local-as-view approaches [5]. In our system, a mixture
of these methods is used depending on the degree of integration required.</p>
      <p>Maintenance of history records falls in the area of temporal databases and queries,
where many successful solutions are known [1, 4, 12]. The theory usually
distinguishes between the valid time, for which the data element is valid in the real world,
and the transaction time, recording the moments when the data entry was inserted,
updated, or deleted. In our approach, the central system automatically assigns and
stores the transaction time, while the local systems are responsible for maintaining the
valid time where appropriate. Queries based on transaction time are processed by
special algorithms implemented in the central system; queries related to valid time are
processed in the same manner as queries to normal attributes.</p>
      <p>The rest of the paper is organized as follows: The second chapter describes the
principles of the DataPile technology used to flexibly store structured data in a
relational database system. The next chapter focuses on entity identification using data
matching and relevance weighing. The fourth chapter shows the overall architecture
of the integrated system. The fifth chapter presents an evaluation based on a
commercial data-integration project where the DataPile approach was used.</p>
    </sec>
    <sec id="sec-2">
      <title>2 The DataPile</title>
      <p>2.1</p>
      <p>Terminology
We have used an own terminology, which is partly derived from the XML
terminology. The first term is entity, which represents a type of the traditional database row.
An entity consists of attributes, which are analogous to the traditional database
columns. An entity instance is an instance of entity and directly equals to traditional
database row contents. An attribute value is an instance of attribute and forms a value
of one column in one row. A metatable is a conventional database table used by the
DataPile to store schema information and other system data.
2.2</p>
      <p>Data Verticalization
Usual information systems consist of some nontrivial number of conventional
database tables; huge information systems have huge number of such tables. Moreover,
the requirement for preserving all changes in data usually leads to the scheme, where
changing one value of one column in one row causes inserting a new changed row
(possibly very large) and updating the old row with some state changing column (e.g.
validity termination timestamp). Another problem in conventional information
systems is extensibility; adding some new columns or new tables may cause large
application code rewriting.</p>
      <p>MT_ATTR_TYPE
attr_id NUMBER(18)
tent_id NUMBER(18)
name VARCHAR2(256)
type INTEGER
mod INTEGER
classifier INTEGER
wtent_id NUMBER(18)
optional INTEGER
multiple INTEGER
&lt;pk&gt;
&lt;fk2&gt;
&lt;fk1&gt;</p>
      <p>MT_ENTITY_TYPE
tent_id NUMBER(18) &lt;pk&gt;
name VARCHAR2(256)</p>
      <p>DT_PILE
stoh_id NUMBER(18)
attr_id NUMBER(18)
ent_id NUMBER(18)
state INTEGER
relevance FLOAT
val_num NUMBER(18,0)
val_str NVARCHAR2(2000)
val_dt TIMESTAMP
val_id NUMBER(18)
lobh_id NUMBER(18)
ts_create TIMESTAMP
ts_state TIMESTAMP
iapp_id NUMBER(18)
ts_valid_from TIMESTAMP
ts_valid_to TIMESTAMP
&lt;pk&gt;
&lt;fk3&gt;
&lt;fk2&gt;
&lt;fk1&gt;</p>
      <p>DT_ENTITY
ent_id NUMBER(18) &lt;
tent_id NUMBER(18) &lt;
ts_created TIMESTAMP</p>
      <p>DT_LOB_HASH
lobh_id NUMBER(18) &lt;pk&gt;
lob_hash RAW(32)
val_lob BLOB
All these problems are addressed by the proposed method of storing data in different
way than in traditional approaches but using standard relational databases – the
DataPile. All real applications data are stored in two relational tables: one less
important table DT_LOB_HASH is dedicated for storing LOBs (for performance
purposes), and the second one, the most important, DT_PILE stores data of all other
datatypes. This particular table is called the Pile, because all data is stored in one
table without any “well-formed” internal structure or hierarchy. Each row in the pile
represents one attribute, whose value is/was valid during certain interval of
transaction time.</p>
      <p>The Fig. 1 represents slightly simplified schema of the heart of DataPile-based
information system. Tables with prefix DT_ hold real data; all other tables (with prefix
MT_) are metatables. The table DT_ENTITY holds valid “global” ID for an entity
instance stored in the pile together with information about the entity in form of a
reference to the metatable MT_ENTITY_TYPE which stores entities. Entities consist
of attributes, and this is modeled by the metatable MT_ATTR_TYPE.</p>
      <p>Real values are stored in columns val_xxx of the main table DT_PILE , where xxx
represents logical type of the attribute (number, string, datetime, ID – foreign key).
Besides the actual data, other additional data is stored in the DT_PILE table:
Transaction time aspect of any attribute value is represented by two columns ts_valid_xxx.
The type of given attribute value can be found by reference attr_id to the
MT_ATTR_TYPE. The ent_id value compounds all attribute values into one entity
instance. Other columns not mentioned here serve the system for proper
implementation of the functionality needed.</p>
      <p>Such a structure easily avoids all the problems mentioned at the beginning of this
paper: The number of relational tables used does not grow with an expansion of an
information system; it is constant regardless on how huge the system is. Data changes
are preserved with minimal overhead – one attribute value change is represented by
inserting a new value into the pile – one new row is inserted into DT_PILE table not
touching the rest of attribute values related to the same entity instance. Extensibility
of the system is reached by the possibility to insert some new rows into metatables
and therefore the possibility of defining new entities, attributes, or both.</p>
      <p>From the above described layout we can see that this data structure fulfils two
requirements put on the information system as a whole: easy extensibility of the data
scheme and full information about data changes on the timeline.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Data Matching and Weighing</title>
      <p>The requirement on data unification is solved by two algorithms: data matching
and data weighing.
3.1</p>
      <p>Data Matching
Let us show an example, which represents usual situation we meet while processing
the same data in different applications. Let application A1 have a record about a
person with the name “Jana”, surname “Teskova” with some personal identification
number “806010/7000” and an address “Mother’s home No. 10”. The same
information is stored in the application A2 as well. After Jana Teskova got married, she took
her husband’s surname (as it is quite usual over here). So her surname changes to
“Stanclova”. She also moved to live with her new husband on the address “New
home 20”. Our person notifies about her marriage and the accompanying changes
only the office using application A1, and does not notify other office with application
A2 - at first she might not even know A2 does not share the data with A1 as they both
are used to keep data about people in one organization, and at second she may expect
that A1 and A2 are integrated together, so changes in A1 are automatically
redistributed to A2 as well (but this is a so called distribution problem, which is discussed
later). So the result is A1 and A2 store different data about one entity instance. What
happens when we try to merge data from A1 and A2 into a common data storage?</p>
      <p>As our example shows, nearly all attributes have changed. But some of them are
constant, especially personal identification number, which should by truly unique in
our country. The association of words “should be” unfortunately means that cases
exist, when different persons have the same personal identification number. On the
other side, these cases are rare. Having two personal records with the same personal
identification number means they belong in fact to a single person with probability of
roughly 0,999999.</p>
      <p>In this example, other attributes have changed, but a combination of some
attributes can have significant meaning: e.g. name and surname together form a whole
name. Even name and surname aren’t commonly unique in a state, equality of such
attributes means some nontrivial probability these two records describe a single
person.</p>
      <p>This example leads us to attribute classification. Every attribute is assigned one of
these classes: determinant, relevant, uninteresting.
− Determinant – identifies an entity instance with very high probability (e.g. personal
identification number, passport number etc.).
− Relevant – significant attribute, which helps identify unambiguously equality of
entities (e.g. attribute types “name” and “surname” for entity type “person”).
− Uninteresting – has no impact on entity matching.</p>
      <p>Following algorithm describes entity matching for two entity instances (one is already
stored in the database, the second one is a newly integrated/created entity):
1. All determinant and relevant attribute values are equal – quite clear match with
very high probability.
2. A nonempty subset of determinant and nonempty subset of relevant attribute
values are equal, remaining determinant and relevant attribute values have no
counterpart in the complimentary entity instance – very good match with quite
high probability yet (example: let us extend our example with another attribute
“passport number”. The first entity instance has attributes “personal
identification number” and “passport number” filled. The second entity instance has only
“personal identification number” filled and “passport number” is missing.).</p>
      <p>A nonempty subset of determinant attribute values is equal, remaining
determinant attribute values has no counterpart in the complimentary entity instance, but
some nonempty subset of relevant attribute values differ – this case seems to be
clear as well, because the probability of match for determinant attribute values
outweighs probability of different relevant attribute values, but some uncertainty
remains as the probability of determinant attribute values is always &lt;1.0. We
must not loose any data and their history, so the system solves such a case by
considering these two entity instances as different with notification to the system
administrator. The administrator can investigate this case more precisely and can
merge these two entities together using an administrator application. This case
directly describes the situation during entity matching from our first example –
the personal identification number as a determinant attribute value is the same,
but surname as a subset of relevant attribute values differs.</p>
      <p>A nonempty subset of determinant attribute values differs, remaining determinant
attribute values with counterpart are equal, some nonempty subset of relevant
attribute values is equal, remaining relevant attribute values have no counterpart –
this case usually arises out of misspelling one determinant attribute value. This
case is solved as above – entity instances are considered to be different and the
system administrator is notified.</p>
      <p>All other cases – input entities are different entity instances with very high
probability.
3.2</p>
      <p>Data Weighing
Let us show another example: An employee record is usually kept in different
applications in different departments, e.g. human resources department,
payroll/accountants department, library, etc. Some applications and departments
themselves emphasize some entities and usually some subset of attributes from entities
used, e.g. staff department knows with high probability that given person has a
certain name, surname, home address, etc., whereas payroll department knows with high
probability his/her account number, etc.</p>
      <p>It should be beneficial for data integration to have possibility measure somehow
the probability, that an application has entity instances (or more precisely on
individual attribute values) filled with correct values. Therefore, every attribute value in the
DataPile keeps a number which measures probability this given value is correct. This
probability is stored in the DT_PILE column “relevance”.</p>
      <p>During processing of incoming data all incoming attribute values are somehow
evaluated (this will be explained later) and the computed relevance is compared to
current relevance of attribute value stored in the DataPile. If the new value has greater
or equal computed relevance than current value has, the new value “wins”, becomes
the current value, and the old value is marked as archive. Otherwise (when the
relevance is lower than current relevance) the new value is stored as well, but only as a
remark saying the application has ineffectually tried to change this attribute value.</p>
      <p>But there is a problem: when an unimportant application with low relevance keeps
correct data (replicated from the central repository) and wants to change some
attribute values (because a user has made some changes), these changes will be always
ignored. This problem is solved by “approving” the data. Every application must
confirm to central repository it agrees with current data replicated from central
repository to this application. This confirmation is stored in the DataPile and the system
knows the given application has accepted the current attribute value. When such an
application (which approved a current attribute value) changes the value, the rule
about weighing relevance is ignored and the attribute value is changed.</p>
      <p>For example, a large company has usually some branch offices, where department
branches can be located as well. Such branches usually show different credibility,
which should be reflected by the relevance computation as well.</p>
      <p>For computing relevancy of an attribute value following equation is therefore used:
Ra = Rap ⋅ Riap ⋅ Ret ⋅ Rec ⋅ Rat
(1)
where Ra means attribute value relevancy, Rap is static application relevancy (e.g.
application used by staff department), Riap is static instance of application relevancy
(e.g. branch of department), Ret is a static entity relevancy, Rec is a computed entity
instance relevancy, and Rat is a static attribute relevancy. All these static values are
stored in metatables as floating-point values. Rec represents computed relevancy for
given entity instance and its value is computed as follows:</p>
      <p>Rec = ∏ Rac (i)
i
(2)
where Rac(i) is either a value stored in metatables (when attribute i from given entity
has some particular value stored in metatables as well), or it has value of 1. This
computed relevancy supports changing of the relevancy based on the presence of
selected attributes in the entity.</p>
    </sec>
    <sec id="sec-4">
      <title>4 Implementation</title>
      <p>4.1 The DataPile Architecture
The whole DataPile-based system is in general shown on the following picture.</p>
      <p>Central
database</p>
      <sec id="sec-4-1">
        <title>DataPile</title>
      </sec>
      <sec id="sec-4-2">
        <title>Application server</title>
        <p>App1
App2</p>
      </sec>
      <sec id="sec-4-3">
        <title>Local</title>
        <p>DB1</p>
      </sec>
      <sec id="sec-4-4">
        <title>Local</title>
        <p>DB2</p>
      </sec>
      <sec id="sec-4-5">
        <title>Temporal database</title>
      </sec>
      <sec id="sec-4-6">
        <title>Appn</title>
      </sec>
      <sec id="sec-4-7">
        <title>Local</title>
        <p>DBn
Every application (marked in the picture as App1..n) has its own local database (Local
DB1..n). The DataPile machinery is constituted by the Trinity of Central database,
Temporal database, and Application server: the Central DB contains the DataPile as
data structure for collected data storage, the Temporal DB serves only as
communication medium between the DataPile machinery and applications (data is revealed here
only during replication and immediately deleted when replication ends). The
Application server, which gives life to the whole system, is discussed in following section.
4.2</p>
        <p>Application Server
The whole DataPile architecture utilizes the request/reply paradigm. The application
server behaves to the rest of world passively; it waits for requests inserted into the
temporal DB, fulfills them using central DB data, and writes a reply back into the
temporal DB.</p>
        <p>Application server is by intention implemented so that it doesn’t understand any
data semantic. Everything is controlled by the content of metatables and nothing is
hardcoded. The application server is primarily responsible for replication, computing
algorithms like all the above mentioned data matching and weighing, and can handle
other requests too (e.g. perform specialized queries upon the data stored in the
DataPile).</p>
        <p>Replication channel</p>
        <p>The communication channel between an application and the DataPile machinery is
not as simple as it may appear on the first look. In reality there are inserted two
filters: export and import filter. They are traditional adapters, which are responsible for
adapting different database schemas used by the DataPile machinery and local
database. A watchful reader may note that the direction from application to the DataPile is
marked as “export”, opposite direction as “import”; the marking is taken from the
application point of view.</p>
        <p>App
server</p>
      </sec>
      <sec id="sec-4-8">
        <title>Temporal database</title>
      </sec>
      <sec id="sec-4-9">
        <title>Export filter</title>
      </sec>
      <sec id="sec-4-10">
        <title>Import filter</title>
      </sec>
      <sec id="sec-4-11">
        <title>Appi</title>
      </sec>
      <sec id="sec-4-12">
        <title>Local</title>
        <p>DBi</p>
        <p>New applications may advantageously use collected data in central repository.
Unfortunately the DataPile structure is not very well suited for direct access (e.g.
searching is not very effective). Almost all applications in fact need to know only current
attribute value, ignoring all history stored in the DataPile. To support such
applications, the application server actively builds and maintains caches, where only current
attribute values are stored and these caches are presented to applications in the form
of traditional relational tables. The applications can easily search and use all other
RDBMS functions on the caches.
5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Evaluation and Conclusions</title>
      <p>The architecture described in this paper brings an alternative to traditional techniques.
It brings several advantages but also some disadvantages.</p>
      <p>All the concepts described in this paper were used in a real project – design and
development of an information system based on data replication and synchronization
of data coming from bigger number of different data sources (approx. 30 local
information systems and 20 other applications producing 30 millions entries per year for
60 000 users, in this project). The project lasts from the fall 2003 and now is in the
phase of finishing the pilot phase and starting roll-out.</p>
      <p>Basic and commonly usable advantage of the DataPile structure is its
maintainability, easy extensibility and ability to keep the track of the whole data history. All
current applications used at all branches remain preserved and functional without any
change according to a strongly desired requirement. The central data repository
integrates data from all data sources including all their history and sources of their
changes. This enables recovering of any historical snapshot of any data. All data
changes are redistributed to all other applications that contain these data, even if the
source and destination schemas are different. The global schema changes affect
neither data in the central repository nor local applications.</p>
      <p>During the development of the project, we have discovered several disadvantages
of our approach:</p>
      <p>Efficiency, especially during export and matching, is low. During the initial export
of one certain local system, about 500 000 entries had to be processed. This took
more than 24 hours. This time complexity is caused by a relatively complex matching
algorithm. Fortunately, this time complexity is not very important in everyday life
because number of data changes is smaller in magnitude in comparison to the initial
migration data volume.</p>
      <p>The second disadvantage is the fact, that the structure of the central repository
makes constructing direct queries difficult. Therefore the concept of caches was
introduced and all the queries to non-historical data are performed on the caches instead
of the DataPile itself.</p>
      <p>The project showed that the DataPile approach is suitable for certain class of large
applications, where data warehousing is coupled with maintaining consistency of
local databases. In this class of applications, the drawbacks mentioned above are
outweighed by integration of data warehousing features with the support for data
replication, synchronization, and cleaning using back-propagation.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>