<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ISO</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An ISO 9001:2000 Compliant Quality Management System for Data Integration in Data Warehouse Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Holger Hinrichs</string-name>
          <email>holger.hinrichs@offis.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Aden</string-name>
          <email>thomas.aden@informatik.uni-oldenburg.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>OFFIS</institution>
          ,
          <addr-line>Escherweg 2, 26121 Oldenburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Oldenburg</institution>
          ,
          <addr-line>26111 Oldenburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2001</year>
      </pub-date>
      <fpage>1</fpage>
      <lpage>1</lpage>
      <abstract>
        <p>In modern information systems, the topic of data quality is becoming more and more important due to increasing analysis demands. This holds especially for data warehouse systems and similar systems that provide data integrated from heterogeneous sources. Although a large variety of extraction-transformation-loading tools supporting data integration is available, there is still no process model defining which integration steps should be done in which order to best fulfil the users' needs. Our research project CLIQ aims to close this gap. In CLIQ, the integration process is viewed as a kind of production process. This view enables us to apply concepts of quality management known from the manufacturing/service domain. More precisely, we developed a quality management system for data integration that meets the requirements of the recent ISO 9001:2000 standard. This paper presents our approach and describes its added value compared to traditional approaches to data integration.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>[Proceedings: (D. Theodoratos, J. Hammer, M. Jeusfeld, M. Staudt, eds.), http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-39/]</p>
      <p>The increasing popularity of data warehouse systems [Inm92] reflects the rising need to make strategic use of data integrated from heterogeneous sources. The integration of heterogeneous data – an integral part of data warehousing – raises the problem of data deficiencies like inconsistent, missing, or obsolete values, unintentional redundancy, etc. If data that do not suffice given user requirements are used for decision support, the so-called "garbage in, garbage out" effect may occur, possibly with serious consequences. To prevent this, the data integration process has to ensure that subsequent analysis tasks work on approved data exclusively. Data that do not fulfil the requirements have to be either improved or singled out. The entire integration process must align with the fulfilment of users' needs. Our research project CLIQ (Data Cleansing with Intelligent Quality Management) is dedicated to this topic.</p>
      <p>The scenario we just described can be viewed as a data
quality problem. In [ISO00a], the term ”quality” is defined
as the ”degree to which a set of inherent characteristics
fulfils requirements”. Applied to the context of data
warehousing, the subject matter under consideration is ”data”,
more precisely a fragment of a database, below termed
”data product”. So which characteristics (with regard to
the above definition) does such a data product possess?
In the literature, several classifications of data
characteristics (often called data quality dimensions) can be found
[Wan98, JJQV98, NLF99, Sha99]. Unfortunately, there is
no commonly accepted classification. In CLIQ, we adopted
the classification depicted in Fig. 1, which we consider to
be a ”best of breeds” concept (see [Hin01]).</p>
      <p>Quality characteristics form the backbone of quality
management (QM), defined in [ISO00a] as ”coordinated
activities to direct and control an organization with regard
to quality”. In accordance with [ISO00a], QM consists of
the following activities:</p>
    </sec>
    <sec id="sec-2">
      <title>Quality policy</title>
      <p>Establishing overall quality-related intentions and
goals of an organization.</p>
      <p>[Fig. 1 quality dimension categories: Credibility, Usability, Interpretability.]</p>
    </sec>
    <sec id="sec-3">
      <title>Quality planning</title>
      <p>Setting quality objectives and specifying processes
and related resources necessary to fulfil these
objectives.</p>
    </sec>
    <sec id="sec-4">
      <title>Quality control</title>
      <p>Executing processes to fulfil quality requirements.</p>
    </sec>
    <sec id="sec-5">
      <title>Quality assurance</title>
      <p>Providing confidence that quality requirements will be
fulfilled.</p>
    </sec>
    <sec id="sec-6">
      <title>Quality improvement</title>
      <p>Increasing the ability to fulfil quality requirements.</p>
      <p>The system within which QM is done is called a quality management system (QMS). As Fig. 2 shows, the data integration process in a data warehouse system can be viewed as a kind of production process, with the heterogeneous source data corresponding to raw materials provided by suppliers (in this case: data sources), the integration process corresponding to the production process, and the consolidated data corresponding to products. Like material products, data products can be used by customers (in this case: data analysts) whose expectations and needs have to be met.</p>
      <p>Following this analogy, we can adapt the
wellestablished QM concepts known from the
manufacturing/service domain to the context of data integration,
hereafter called data quality management (DQM).</p>
      <p>One important aspect of DQM is data quality measurement. As DeMarco stated, "You cannot control what you cannot measure" [DeM82], we need metrics at hand to calculate the degree of conformance with given requirements. In the manufacturing domain, we have to measure characteristics like lengths, weights, speeds, etc. In data warehousing, on the other hand, we have to measure characteristics like consistency, timeliness, and completeness of data products (cf. Fig. 1). Unfortunately, metrics suitable for these characteristics are still a matter of research. Most of the data quality metrics proposed in literature, e. g. [BT99, NLF99, Sha99], are either trivial or cannot be automated. However, since modern data warehouses store gigabytes up to terabytes of data, a manual quality control is not feasible at all.1 In [Hin01], we propose a well-defined set of metrics along with automated measuring methods for a subset of the data quality characteristics mentioned in Fig. 1. These are tailored to relational databases as they prevail in the back-end area of current data warehouse systems.</p>
      <p>[Fig. 2 labels: Customers, Products, Raw Materials, Production Process, Integration Process.]</p>
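      <p>To illustrate what an automatable data quality metric can look like, the following sketch computes a simple completeness ratio over a relational table, modeled as a list of dictionaries. The function name and the sample data are our illustration, not the metric set of [Hin01].</p>

```python
# Illustrative completeness metric: share of non-missing attribute values
# in a relation (rows modeled as dictionaries, NULL modeled as None).

def completeness(rows, attributes):
    """Return the fraction of (row, attribute) cells that are filled."""
    total = len(rows) * len(attributes)
    if total == 0:
        return 1.0  # an empty relation is trivially complete
    filled = sum(
        1 for row in rows for attr in attributes
        if row.get(attr) is not None
    )
    return filled / total

customers = [
    {"name": "Meyer", "city": "Oldenburg", "phone": None},
    {"name": "Schmidt", "city": None, "phone": "0441-987"},
]
print(completeness(customers, ["name", "city", "phone"]))  # 4 of 6 cells filled
```

      <p>A metric of this shape can run unattended over large relations, which is the precondition for automated quality control mentioned above.</p>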
      <p>Another major question is how data quality can be
improved. Two basic approaches are possible:</p>
    </sec>
    <sec id="sec-7">
      <title>Symptom-oriented</title>
      <p>In this case, data cleansing methods are applied to
deficient data to improve their quality as far as
possible. If data cannot be made conformable with given
requirements, they have to be flagged and then
either be sorted out or be released for restricted use. In
[ISO00b], these aspects are summarized as ”control of
nonconforming product”.</p>
    </sec>
    <sec id="sec-8">
      <title>Cause-oriented</title>
      <p>In this approach, the business processes which
”produce” the data (especially data acquisition,
transformation, and consolidation processes) are tuned
continually in such a way that the quality of produced
data increases more and more. Causes of
nonconformities should be eliminated appropriately (in respect
of their effects) in order to prevent recurrence.
Furthermore, the QMS itself should be the subject of
permanent efforts to improve its effectiveness, efficiency,
and/or traceability. In [ISO00b], the entirety of these
actions is called ”(quality) improvement”.</p>
      <p>1 A fully automated DQM, however, is also unrealistic. Human interaction will be inevitable when conflicts cannot be solved automatically.</p>
      <p>Of course it is always better to strike at the root of a problem, which means in this case to optimize the business processes and/or the QMS. Unfortunately, cause-oriented DQM is often hindered by the fact that the processes in question are outside the optimizer's sphere of influence. Legacy systems, for example, which often deliver low-quality data (due to a lack of input controls, etc.), usually cannot be retrofitted with appropriate integrity rules. Besides, many data deficiencies cannot be detected until (resp. are raised by) data integration, e. g. duplicate records from different sources. As a consequence, symptom- and cause-oriented steps should be implemented in combination, complementing each other.</p>
      <p>The remainder of this paper is structured as follows: In Sect. 2, an overview of the ISO 9001:2000 standard is given. A QMS for data integration which meets the requirements of this standard is presented in Sect. 3. Subsequent to a discussion of related work in Sect. 4, the paper concludes with a rating of our approach and an overview of future work in Sect. 5.</p>
      <sec id="sec-8-1">
        <title>The ISO 9001:2000 Standard</title>
        <p>The ISO 9000 family of standards has been developed by
the International Organization for Standardization (ISO) to
assist organizations in implementing and operating
effective quality management systems. First introduced in 1987
and revised in 1994, the current revision was published in
December 2000, called ISO 9000:2000. It comprises the
following standards:</p>
        <p>ISO 9000: Fundamentals and vocabulary
ISO 9001: Requirements
ISO 9004: Guidelines for performance improvements
ISO 19011: Guidelines on quality and environmental
management systems auditing.</p>
        <p>Some excerpts of ISO 9000:2000 have already been
cited in the previous section.</p>
        <p>In this section, we give an introduction to ISO
9001:2000 (the postfix ”:2000” will be omitted below),
since it is the standard most relevant to CLIQ. ISO 9004
and 19011 are beyond the scope of this paper.</p>
        <p>ISO 9001 promotes the adoption of a process approach
when developing, implementing, and improving a QMS
to enhance customer satisfaction by meeting customer
requirements [ISO00b]. An organization has to identify
various activities, link them together and assign resources to
them, thus building up a system of communicating
processes. Figure 3 illustrates such a process-based QMS.
Customers play an important role in this model since their
requirements are used as input to the product realization
process and their satisfaction is continually analyzed.
</p>
        <p>[Fig. 3 labels: Customer Requirements (input), Customer Satisfaction (output), Product, Management Responsibility (Section 5), Resource Management (Section 6), Product Realization (Section 7), Measurement, Analysis &amp; Improvement (Section 8), Continual Improvement of the Quality Management System, Section 4, Value-Adding Activities, Information Flow.]</p>
        <p>ISO 9001 is made up of eight sections, the first three of which contain general information about the scope of the standard, normative references, and terms. Sections 4 to 8 describe requirements for a QMS and its components, as indicated in Fig. 3. The structure of these sections is shown in Fig. 4. We will refer to the individual sections when we discuss the ISO 9001 compliance of our QMS in the subsequent section.</p>
        <p>A Quality Management System for Data Integration</p>
        <p>In this section, we present a process model for data integration that defines which integration steps have to be executed in which order to achieve optimal data quality. This process model is enriched with DQM steps to ensure that customer requirements are fulfilled. Integration steps plus DQM steps, along with organizational DQM activities, form a QMS for data integration, called data quality management system (DQMS). In the context of data warehousing, such a DQMS has to be located at the extraction-transformation-loading (ETL) stage (see Fig. 5), with the data warehouse as target database.</p>
        <p>The following two subsections describe the main concepts of our DQMS and its implementation as a software system.</p>
        <sec id="sec-8-1-12">
          <title>DQMS Concepts</title>
          <p>[Fig. 4 outlines the structure of ISO 9001 sections 4 to 8: 4 Quality Management System (4.1 General Requirements, 4.2 Documentation Requirements); 5 Management Responsibility (5.1 Management Commitment, 5.2 Customer Focus, 5.3 Quality Policy, 5.4 Planning, 5.5 Responsibility, Authority &amp; Communication, 5.6 Management Review); 6 Resource Management (6.1 Provision of Resources, 6.2 Human Resources, 6.3 Infrastructure, 6.4 Work Environment); 7 Product Realization (7.1 Planning of Product Realization, 7.2 Customer-Related Processes, 7.3 Design &amp; Development, 7.4 Purchasing, 7.5 Production &amp; Service Provision, 7.6 Control of Monitoring &amp; Measuring Devices); 8 Measurement, Analysis &amp; Improvement (8.1 General, 8.2 Monitoring &amp; Measurement, 8.3 Control of Nonconforming Product, 8.4 Analysis of Data, 8.5 Improvement).]</p>
          <p>As Fig. 6 shows, the DQMS should be viewed as an integral part of an organization. It is tightly coupled to the organization's management, its human and technical resources, and – of course – its (data) suppliers and (data) customers.</p>
          <p>Although the latter two are modelled as external entities
in Fig. 6 (to keep the ISO 9001 view), they could also be
viewed as internal ones in the data warehousing context
(as indicated by the dashed lines): an organization’s
operational database would be an internal supplier, an
organization’s data analyst an internal customer. Customers specify
quality-related requirements and give feedback concerning
their satisfaction with the delivered product. An analogous
relationship exists between the organization and its
suppliers.</p>
          <p>The requirements of ISO 9001 sections 4, 5, and 6 (see
Fig. 4) cannot be achieved by the data integration process
itself. Instead, they must be met by the surrounding
environment as follows:</p>
          <p>The fulfillment of the general requirements of ISO
section 4.1 (identification of processes and their
interaction, determination of control criteria and methods
etc.) has been a basic presupposition of building the
DQMS. The same holds for ISO sections 7.1
(planning of product realization) and 8.1 (general
measurement, analysis, and improvement requirements).</p>
          <p>We assume that the organization has established and
maintained documented statements of the quality
policy and quality objectives as well as documentation of
procedures used within the DQMS (plus their
configurations, where applicable). As in every other ISO
9001 compliant QMS, a quality manual must exist.</p>
          <p>We further assume that all these documents are stored
in a central document management system
(electronically), so that – provided that access rights have been
granted appropriately – authorized stakeholders may
always access the most current (meta) information.</p>
          <p>Additionally, the document management system is
supposed to support version control to meet the
requirements of ISO section 4.2.</p>
          <p>Any recordings resulting from integration steps
(especially measurement results) are stored in a central
metadata repository to meet the requirement to keep
recordings ”legible, readily identifiable, and
retrievable” (ISO section 4.2).</p>
          <p>We assume that the top management of an
organization takes steps to meet the requirements of ISO
section 5 (management responsibility).</p>
          <p>We assume that the human resources, technical
infrastructure, and work environment necessary to operate
the DQM processes are provided by the organization
as claimed by ISO section 6. Low level resources like
CPU time slots, memory allocation, and database
access are assumed to be provided by the underlying
operating system and DBMS.</p>
          <p>In the following, we take a closer look at the integration process itself with regard to its compliance to the remaining ISO 9001 sections 7 (product realization) and 8 (measurement, analysis, and improvement). Since we assume that the organization does not design or develop new (data) products in terms of ISO 9001, we can exclude ISO section 7.3 (design and development) from our discussion.</p>
          <p>[Fig. 5 labels: "Garbage In, Garbage Out", Data Quality Management, Transformation, Analysis (OLAP etc.).]</p>
          <p>Step I: Assessment and Review of Requirements</p>
          <p>In this step, requirements are determined and maintained (cf. Fig. 6). These include customer requirements in terms of desired values of data quality characteristics, additional implicit requirements as well as statutory and regulatory requirements (e. g. protection of data privacy). The organization has to ensure that it has the ability to meet the specified requirements. The communication between organization and customer is supposed to be handled electronically by means of window dialogs, emails, etc. (ISO section 7.2).</p>
          <p>Step II: Purchasing Raw Data</p>
          <p>According to the requirements specified in the previous step, the organization has to evaluate and select data suppliers. First, a so-called purchasing information document has to be set up which specifies requirements the raw data have to meet (cf. Fig. 6), especially concerning data quality characteristics like relevance and timeliness, but also supplier-oriented characteristics like reputation, charges, data acquisition methods, QMS features, etc. This document must then be sent to potential suppliers. Based on an evaluation of the suppliers' responses, suppliers have to be selected, provided their products pass appropriate inspections (ISO section 7.4).</p>
          <p>Compared to the purchasing process in manufacturing,
there are some peculiarities in the data warehousing
context: in many cases, there will be just one data supplier
offering a certain contribution to the warehouse data (e. g.
the operational databases of the organization itself), i. e. the
organization cannot choose between alternative suppliers.</p>
          <p>Furthermore, in contrast to raw materials in
manufacturing, raw data from one supplier may show significant
deficiencies in the first place (e. g. missing values), but when
being combined with (complementary) raw data from
another supplier, these deficiencies may disappear. The
organization has to be aware of these special characteristics.</p>
          <p>Steps I and II rather belong to the modelling phase than
to the operational phase of the data warehouse system.</p>
          <p>Therefore, they are arranged orthogonally to the
operational steps described below.</p>
          <p>Assuming that steps I and II have been completed
successfully, the data warehouse system may start operations.</p>
          <p>In CLIQ, we concentrate on the transformation phase of
the ETL model, assuming that extraction and loading
functionality is provided by third party tools (see also Sect. 4).</p>
          <p>Each integration process which is to be executed within the data warehouse system should pursue the following phase model (see Fig. 7) in order to meet the ISO 9001 requirements.</p>
          <p>Step 1: Unification of Representation</p>
          <p>We assume that previously extracted source data are being stored in so-called source data buffers (SDB, one per source). SDBs correspond to source-specific "raw material stores", from which heterogeneous data can be fetched and then processed. Each SDB implements a relational schema and thus hides the data model of the respective source (flat file, relational, object-oriented, etc.), simplifying the subsequent integration steps.</p>
          <p>In this step, source data are moved from their SDBs into a temporary storage called transformation area (also called staging area [Kim98]). The transformation area is assumed to have a global relational schema that is covered (in terms of content) by the local SDB schemata. The main task of this step is to map the heterogeneous source data onto the uniform data structures of the transformation area. The following mapping tasks are especially important:</p>
          <p>Generating global IDs and linking them to local keys
Unifying character strings syntactically (with regard to umlauts, white spaces, etc.)
Unifying character strings semantically (in case of synonyms)
Decomposing complex attribute values into atomic ones (e. g. addresses)
Aggregating atomic attribute values into complex ones (e. g. date values)
Converting codings (e. g. gender values m/f to 1/2)
Converting scales (e. g. inch to cm).</p>
          <p>To specify the required transformations, conventional ETL tools (cf. Sect. 4) can be used. After executing a transformation, data are written to the transformation area and deleted from their respective SDB.</p>
          <p>After this step, the newly extracted records can be identified via their global IDs. Traceability (according to ISO section 7.5) is ensured by a look-up table linking global IDs to local keys (plus their data source ID).</p>
          <p>Step 2: Statistical Process Control</p>
          <p>After unification, a statistical process control (SPC) is done on the transformation area data, based on classical SPC theory [She31]. The idea is to compute attribute-specific statistical key figures (mean, variance, etc.) out of the transformation area data and log them over time. The newly computed key figures are then compared to the formerly logged key figures (used as "expected values"). By doing so, cardinal data deficiencies (e. g. transfer errors) can be detected at a very early stage. If required, i. e. if a priori defined attribute-specific limits are exceeded, an interceding action has to be executed in order to reestablish statistical control, e. g. deleting data from the transformation area and initiating a new data transfer after the problem has been solved.</p>
          <p>Step 3: Domain-Specific Consistency Checking</p>
          <p>In this step, the transformation area records are being checked with regard to consistency by means of domain-specific knowledge. The latter should be represented in such a way that it can be processed automatically. Different types of representation are possible, especially:</p>
          <p>Rules, e. g. "IF DeliveryDate &lt; PurchaseOrderDate THEN Error (SeverityCode 0.7)"
Look-up tables, e. g. bank code registers
Regular expressions, e. g. for phone numbers or article codes
Arbitrary domain-specific functions.</p>
          <p>In case of a detected inconsistency, an appropriate action has to be executed, e. g. by generating an error message or warning. However, the checking process – usually done as a batch run – should not be aborted.</p>
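          <p>Steps 2 and 3 can be sketched as follows; the function names, the three-sigma control rule, the example consistency rule, and the sample records are our illustrative assumptions, not the CLIQ implementation.</p>

```python
import statistics

# Sketch of Step 2 (SPC) and Step 3 (rule-based consistency checking).

def spc_check(values, expected_mean, expected_stdev, tolerance=3.0):
    """Flag an attribute whose new mean drifts beyond the control limits."""
    mean = statistics.mean(values)
    deviation = abs(mean - expected_mean)
    return deviation > tolerance * expected_stdev  # True means out of control

def check_order(record):
    """Rule: delivery must not precede the purchase order (severity 0.7)."""
    if record["delivery_date"] >= record["order_date"]:
        return None
    return ("Error", 0.7)

records = [
    {"order_date": "2001-03-01", "delivery_date": "2001-03-10"},
    {"order_date": "2001-03-05", "delivery_date": "2001-02-28"},
]
# Batch check: collect violations instead of aborting the run.
violations = [r for r in records if check_order(r) is not None]
print(len(violations))  # one record violates the rule
```

      <p>As in the text, the checker only records violations; the batch run itself continues.</p>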
          <p>Step 4: Inquiries and Postprocessing</p>
          <p>Provided that appropriate domain knowledge is available, the major proportion of inconsistencies can be detected automatically, as described above. However, very few inconsistencies can be corrected automatically. Consequently, if an inconsistency is not tolerable, an inquiry has to be sent to the affected data source (generated automatically out of an error message, if possible). Depending on the business process implemented, the source may send corrected records to its SDB (then continuing with step 1) or directly to the transformation area. In the latter case, a (preferably automated) postprocessing of corrected records has to be done in the transformation area, followed iteratively by step 3.</p>
          <p>Step 5: Record Linkage</p>
          <p>The goal of this step is to detect duplicate records, i. e. records that describe the same real world entity, both within the transformation area and between the transformation area and the so-called target area (also called operational data store [Kim98]) in which the consolidated data will be stored in the end.</p>
          <p>Two types of record linkage processes should be distinguished: If a record in the transformation area is just an incremental update of another record already stored in the target area, the linkage can be done via their common (global) keys. Due to the heterogeneity of data sources, this does not work when the extensions of data sources overlap and, consequently, several source records (with different keys) have to be mapped on the same target area record.</p>
          <p>[Fig. 7 labels: Source Data Buffer 1 ... Source Data Buffer n, Data Flow, Control Flow, 1. Unification of Representation, Transformation Area, 2. Statistical Process Control, 3. Domain-Specific Consistency Checking, 4. Inquiries and Postprocessing, 5. Record Linkage, 6. Merging, Consolidation Area, 7. Quality Measurement and Analysis, 8. Control of Nonconforming Data Products and Quality Improvement, 9. Data Product Release and Target Area Update, 10. Analysis of Customer Feedback and Retraction of Data Products, Target Area, Loading, outside of CLIQ.]</p>
          <p>The same holds for the case when there are previously
undetected duplicates within one data source. In these cases,
the linkage has to be based on other attributes like name,
city, gender, delivery date, etc. To automate this task,
several methods have been proposed in literature. The most
prominent ones are:</p>
          <p>Probabilistic Record Linkage [Jar89]
Basic resp. Duplicate Elimination Sorted-Neighborhood Method [HS95]
Neighborhood Hash Joins [GFSS00].</p>
          <p>All these methods result in a set of record pairs
potentially belonging together. These pairs, more precisely their
transitive closures, now have to be analyzed with respect to
whether the linkage is correct or not. Especially in marginal
cases, an interactive review will always be inevitable.</p>
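          <p>As an illustration of one of these methods, the following is a minimal Sorted-Neighborhood sketch in the spirit of [HS95]; the key construction, window size, and the toy matcher are our simplifying assumptions.</p>

```python
# Minimal Sorted-Neighborhood sketch: sort records by a constructed key,
# then compare only records inside a sliding window.

def sorting_key(rec):
    # Build a key from discriminating attribute fragments (assumed scheme).
    return (rec["name"][:3].lower(), rec["city"][:3].lower())

def similar(a, b):
    # Toy matcher: identical keys count as a candidate duplicate pair.
    return sorting_key(a) == sorting_key(b)

def sorted_neighborhood(records, window=3):
    """Slide a fixed window over key-sorted records, pairing neighbors."""
    ordered = sorted(records, key=sorting_key)
    pairs = []
    for i, rec in enumerate(ordered):
        for j in range(i + 1, min(i + window, len(ordered))):
            if similar(rec, ordered[j]):
                pairs.append((rec["id"], ordered[j]["id"]))
    return pairs

people = [
    {"id": 1, "name": "Meyer", "city": "Oldenburg"},
    {"id": 2, "name": "Meier", "city": "Hamburg"},
    {"id": 3, "name": "Meyer", "city": "Oldenburg"},
]
print(sorted_neighborhood(people))  # [(1, 3)]
```

      <p>The candidate pairs produced this way would then be closed transitively and reviewed, as described above.</p>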
          <p>Step 6: Merging</p>
          <p>Records whose correlations have been validated must now be merged to one single record in order to avoid unintentional redundancy within the data warehouse. Applying certain criteria (information content, attribute-specific priorities on data sources, timeliness, etc.), the best pieces of information have to be extracted from the involved records and written into a target record. In our process model, this target record is not written to the target area (as one could expect), but into another temporary storage, the so-called consolidation area. The target area will not be updated until the very last step of the process model, thus ensuring that only data that have passed all conformance tests (some of which will follow in the subsequent steps) will be written to the target area and thus be made available for analysis tasks.</p>
          <p>Those records that were merged into a consolidation area record are then deleted from the transformation area. The remaining transformation area records are moved to the consolidation area without any modification. The consolidation area will now serve as the starting point of the following DQM activities.</p>
          <p>Step 7: Quality Measurement and Analysis</p>
          <p>In this step, it must be checked whether the data in the target area will meet the specified customer (user) requirements (in compliance with ISO section 8.2), supposing that it is updated with the current consolidation area data. To do this, the actual quality of data must be measured, using appropriate metrics and measuring software according to ISO section 7.6 (control of monitoring and measuring devices).2 These measurements3 must span both the (previous) target area data and the consolidation area data (the target area has not been updated yet!).</p>
          <p>In a subsequent analysis phase, the results of quality measurements have to be compared to a priori specified quality requirements (resulting from step I). If data do not meet a given requirement, an appropriate action has to be taken (see step 8). Conflicts due to contradicting requirements (e. g. high timeliness vs. high consistency) have to be resolved, e. g. by data replication and different treatment of the replicas.</p>
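          <p>The comparison of measured quality values against the requirements from step I can be illustrated as follows; the characteristic names and thresholds are invented.</p>

```python
# Illustrative analysis phase: which quality characteristics fall short
# of the a priori specified requirements?

requirements = {"completeness": 0.95, "consistency": 0.98}
measured = {"completeness": 0.97, "consistency": 0.91}

failed = [name for name, threshold in requirements.items()
          if not measured[name] >= threshold]
print(failed)  # ['consistency']
```

      <p>Each failed characteristic would then trigger one of the actions of step 8.</p>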
          <p>Furthermore, ISO section 8.2 postulates the measurement of the effectiveness of integration and (data quality) measuring processes.4 In our context, this can be done by means of reference data, for which both the intended integration results (characteristics of the resulting data product) and the contained quality deficiencies must be known.</p>
          <p>Measurement results concerning the effectiveness of products and processes have to be recorded and analyzed (ISO section 8.4), resulting – if necessary – in process-improving activities as described below.</p>
          <p>Step 8: Control of Nonconforming Data Products and Quality Improvement</p>
          <p>In this final step before the target area update, data products that do not conform to given requirements must be treated appropriately, in accordance with ISO section 8.3. The following options may be taken:</p>
          <p>Sort out and re-request data
Restrict the use of data to specific analysis scenarios, e. g. by flagging
Eliminate detected nonconformities5 and then continue at step 7 (subject to ISO section 8.3).</p>
          <p>2We propose to use the set of metrics and measuring methods
developed in CLIQ (see [Hin01]).</p>
          <p>[3] Including consistency checks as in step 3 (and, if required,
postprocessing as in step 4), since a merging of records may introduce new
combinations of attribute values and thus new inconsistencies.</p>
          <p>[4] Strictly speaking, these process-oriented measurements should be
assigned to a dedicated phase, running parallel to the integration steps.</p>
          <p>[5] In CLIQ, we developed a method that uses data mining techniques to
eliminate inconsistencies and complete previously missing values
semi-automatically (see [HW00] for details).</p>
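          <p>The three handling options of step 8 can be sketched as a simple dispatch (illustrative Python only; the routing criteria, field names, and option labels are hypothetical):</p>

```python
# Route a nonconforming record to one of the three options of step 8.
def handle_nonconforming(record):
    if record["severity"] == "fatal":
        return "sort out and re-request"       # option 1
    if record["repairable"]:
        return "eliminate and re-measure"      # option 3: fix, then back to step 7
    return "flag for restricted use"           # option 2

records = [
    {"id": 1, "severity": "fatal", "repairable": False},
    {"id": 2, "severity": "minor", "repairable": True},
    {"id": 3, "severity": "minor", "repairable": False},
]
print([handle_nonconforming(r) for r in records])
```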
          <p>While all these activities merely tackle the symptoms of
a problem, further (cause-oriented) measures may be taken
to increase the ability to fulfil quality requirements in the
future according to ISO section 8.5:</p>
          <p>Improve the integration process, especially by tuning
process parameters (e. g. attribute mappings, SPC
parameters, consistency rules, record linkage
parameters, merging criteria).</p>
          <p>Improve quality planning and quality control
processes, e. g. by finding better means to capture user
requirements, by optimizing measurement methods, or
by improving feedback methods.</p>
          <p>Step 9: Data Product Release and Target Area Update
Depending on the analysis results of step 7, the approved
proportion of data is now released (ISO section 8.2),
i. e. the affected consolidation area records are flagged as
"passed". The passed proportion of data is then moved
from the consolidation area to the target area, replacing
obsolete target area data, if necessary. With this step, the
newly integrated data have been made available for
customer use. Given a typical data warehouse system, they
may now be loaded into the analysis-oriented (e. g.
multidimensional) data structures of the data warehouse and
attached data marts.</p>
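          <p>Step 9 can be sketched as follows (a toy Python model of the two areas; record layout and flag values are hypothetical):</p>

```python
# Flag approved consolidation-area records as "passed" and move them to
# the target area, replacing obsolete versions with the same key.
consolidation_area = [
    {"key": "A", "value": 10, "status": "approved"},
    {"key": "B", "value": 20, "status": "rejected"},
]
target_area = {"A": {"key": "A", "value": 9}}  # obsolete version of A

for rec in list(consolidation_area):
    if rec["status"] == "approved":
        rec["status"] = "passed"               # release (ISO section 8.2)
        target_area[rec["key"]] = {"key": rec["key"], "value": rec["value"]}
        consolidation_area.remove(rec)

# A's obsolete version is replaced; the rejected record B is held back.
print(target_area)
```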
          <p>Step 10: Analysis of Customer Feedback and
Retraction of Data Products
The organization has to record customer feedback and
evaluate it as a measure of the DQMS performance (ISO
section 8.2). If a deficiency of a released data product is
detected during current use (i. e. by a customer), and this
deficiency impairs the usability of the data product
significantly, the organization has to retract the product from the
target area (and the data warehouse) and "repair" it if
possible (see step 8) before re-releasing it. If necessary,
cause-oriented measures should be taken into account (see
step 8).</p>
          <p>DQMS Implementation
The CLIQ DQMS has been implemented as a software
workbench that offers continuously quality-controlled,
largely automated support of the data integration
process. It employs both commercial and self-developed
components. Figure 8 illustrates the different layers of the
DQMS architecture.</p>
          <p>Layer 1 contains the data storage modules (based on
Microsoft SQL Server) for business and meta data. Layer 2
offers interfaces to these storage modules. Business data
are accessed by way of ODBC, metadata – represented
in accordance with the Open Information Model (OIM)
[MDC01] – by way of the COM-based Microsoft
Repository [Ber97].</p>
          <p>[Figure 8: Layered DQMS architecture. Layer 1: business data and metadata storage (Microsoft SQL Server, MS Repository); layer 2: access interfaces (ODBC); layer 3: external components (Microsoft DTS, Vality Integrity, MLC++, ILOG Rules); layers 4 and 5: Data Migrator, Data Linker, Data Auditor, Rule Editor, Measurement Engine, SPC Engine; layer 6: Domain-Specific Measurement Engine. Components are marked as external, implemented, or partially implemented.]</p>
          <p>Layers 4 and 5 form the DQMS core, their submodules
being tailored to single steps of the data integration process
(Data Migrator: step 1, Data Linker: steps 5 and 6, Data
Auditor: step 8, Rule Editor/Measurement Engine: steps
3 and 7, SPC Engine: step 2). In turn, they make
use of the commercial components of layer 3, including the
Microsoft Data Transformation Services [AL99], the ETL
tool Integrity [Val01], the data mining class library MLC++
[KSD96], and the rule engine ILOG Rules [Ilo01].</p>
          <p>To be able to integrate domain-specific quality control
methods, a so-called Domain-Specific Measurement
Engine was introduced in layer 6. Data quality planning,
control, and improvement are coordinated by the DQM
controller in the top layer of the software system.</p>
          <p>Those software components that have been developed
within CLIQ have been implemented using Microsoft
Visual C++ 6.0.</p>
        </sec>
      </sec>
      <sec id="sec-8-2">
        <title>Related Work</title>
        <p>The importance of data quality for organizational success
has been underestimated for a long time. For this reason,
quality management in information processing is not nearly
as established as in other disciplines, e. g. manufacturing.
Surveying the related work on DQM, a clear
distinction can be made between commercial approaches
on the one hand and research activities on the other.</p>
        <p>Commercial Approaches
In recent years, a large number of ETL tools (cf. Sect. 3)
have come onto the market, claiming to simplify the process of
populating a data warehouse significantly. Although some of
these tools provide sophisticated graphical interfaces and
comprehensive libraries of transformation functions, they
are insufficient for DQM for the following reasons:
Their functionality is usually limited to the quality
characteristic of uniformity (cf. Fig. 1) by providing
typical data migration functions, like mapping of
attributes, standardization of addresses, and value
conversions.</p>
        <p>They do not support any DQM functions, like quality
planning, control, assurance, and improvement.</p>
        <p>Nevertheless, an ETL tool can be particularly useful as a
single module of an overall DQMS (cf. step 1 in Sect. 3.1).
Some prominent examples of commercial ETL tools are:
DataStage Suite (Ardent Software)
DQ Plus (Group 1 Software)
Genio Suite (Hummingbird)
i. D. Centric Data Quality Suite (Firstlogic)
InfoRefiner/InfoTransport/InfoPump (Computer
Associates, formerly Platinum)
Integrity (Vality)
Pure*Integrate (Oracle)
Trillium Software System (Trillium)</p>
        <p>Warehouse Workbench (Systemfabrik).</p>
        <p>A quite extensive list of ETL tools can be found in
[Eng99].</p>
        <p>Research Approaches
In 1992, DeLone and McLean [DM92] set up a model of
information systems success which included the quality of
data as a critical success factor, along with system quality.</p>
        <p>In the following years, two major research projects
emerged, namely MIT’s Total Data Quality Management
(TDQM) program [Wan98], active since 1992, and the
ESPRIT project Foundations of Data Warehouse Quality
(DWQ) [JJQV98], running from 1996 to 1999.</p>
        <p>TDQM claims to establish a theoretical foundation of
data quality. Based on the enterprise philosophy of
Total Quality Management [Dem86], a so-called TDQM
cycle is derived from Deming’s classical PDCA cycle (plan,
do, check, act), comprising phases of data quality planning,
measurement, analysis, and improvement. This concept was
later extended by [Eng99] and [Hel00].
Additionally, TDQM identifies a set of data quality
characteristics (called quality dimensions) based on empirical studies
and proposes some simple metrics for data quality
measurement. Finally, TDQM provides an approach to enrich
the relational data model with data quality information by
introducing meta relations. Although TDQM offers some
interesting ideas, most of its concepts remain on a rather
superficial level, oriented more toward business
economics than toward information technology.</p>
        <p>The DWQ project, on the other hand, is tailored to the
data warehousing world. Not only aspects of data
quality are considered, but also aspects of schema and software
quality. There is a clear distinction between conceptual,
logical, and physical data models, the semantics of which
are explicitly described as metadata. Integration of source
schemata is supported by so-called interschema knowledge
networks which specify relationships between schemata.
Quality assessment is done using the goal-question-metric
(GQM) approach by [BW84]. With GQM, quality goals
are specified and then mapped to "quality questions" (in
this case: queries on a meta database). To get a response to
a quality question, (simple) metrics are used. Since DWQ
covers a wide range of aspects, it inevitably shows some
shortcomings relating to specific questions of detail, e. g.
concerning sophisticated data quality metrics.</p>
        <p>Apart from TDQM and DWQ, several minor research
activities have been launched during the last two to three
years, reflecting the rising awareness of the importance of
DQM: In the CARAVEL project [GFSS00], a
metadata-based cleansing system called AJAX has been developed,
including modules for the specification, optimization,
execution, and explanation of cleansing tasks. The main focus
of CARAVEL lies on the elimination of duplicates (quality
characteristic "absence of redundancy", cf. Fig. 1). This
approach is similar to the IntelliClean project [LLL00],
the Sorted-Neighborhood Method by [HS95], and related
statistical methods [Jar89]. The project HiQiQ [NLF99]
deals with the inclusion of data quality information into
query processing in distributed databases, based on a
wrapper/mediator architecture. [RCH99] describe an interactive
framework for data cleansing, called Potter’s Wheel
A-B-C, that tightly integrates data transformation and error
detection. Potter’s Wheel A-B-C allows the user to
gradually build transformations by adding or undoing transforms
through a spreadsheet-like interface. In the background,
data deficiencies are flagged as they are found. [MM00]
and [DJ99] use data mining methods to detect errors
automatically, comparable with the commercial tool WizRule
[Wiz01].</p>
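        <p>The duplicate-detection idea behind the Sorted-Neighborhood Method [HS95] can be sketched as follows (a deliberately simplified Python illustration; the sorting key and the match test are toy versions of the real, domain-tuned ones):</p>

```python
# Sorted-Neighborhood Method, simplified: sort records on a sorting key,
# slide a fixed-size window over the sorted list, and compare only
# records that fall into the same window.
def make_key(name):
    # toy key: lowercase, drop vowels, spaces and punctuation
    return "".join(c for c in name.lower() if c not in "aeiou ,.")

def sorted_neighborhood(records, window=3):
    ordered = sorted(records, key=make_key)
    candidates = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            if make_key(rec) == make_key(other):  # toy match test
                candidates.append((rec, other))
    return candidates

people = ["Meyer, Hans", "Schmidt, Eva", "Meyr, Hans", "Meier, Hans"]
# The spelling variants "Meyer" and "Meyr" share a key, end up adjacent
# after sorting, and are reported as a candidate duplicate pair.
print(sorted_neighborhood(people))
```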
        <p>All in all, even though there are quite a few projects
dealing with data quality aspects, there is still a serious
lack of formally founded methods to measure data quality.
Furthermore, there is no process model supporting
quality-driven data integration.</p>
      </sec>
      <sec id="sec-8-3">
        <title>Conclusion</title>
        <p>In this paper, we have proposed a quality management
system for the integration of data from heterogeneous sources.
Following a rigorous analogy between the integration
process and a conventional manufacturing process, we were
able to map the requirements of the ISO 9001:2000
standard to our model and thus implement a standard-compliant
data quality management system for data warehouse
systems.</p>
        <p>In the coming months, we will evaluate the DQMS
by means of the epidemiological cancer registry of Lower
Saxony [HP99], which features a typical data warehouse
architecture. Further future work will include the adaptation
of the DQMS to the metadata standard CWM (Common
Warehouse Metamodel) [OMG00].
</p>
        <p>References</p>
        <p>[AL99] Awalt, D., Lawton, B. K.: Introduction to Data Transformation Services. In: SQL Server Magazine, 1 (1): 34–36, 1999.</p>
        <p>[Ber97] Bernstein, P. A. et al.: The Microsoft Repository. In: Proc. of the 23rd Intl. Conf. on Very Large Databases (VLDB ’97), Athens, Greece, 1997.</p>
        <p>[BT99] Ballou, D. P., Tayi, G. K.: Enhancing Data Quality in Data Warehouse Environments. In: Communications of the ACM, 42 (1): 73–78, 1999.</p>
        <p>[BW84] Basili, V. R., Weiss, D. M.: A Method for Collecting Valid Software Engineering Data. In: IEEE Transactions on Software Engineering, 10 (6): 728–738, 1984.</p>
        <p>[DeM82] DeMarco, T.: Controlling Software Projects. Yourdon Press, New York, 1982.</p>
        <p>[Dem86] Deming, W. E.: Out of the Crisis. MIT Center for Advanced Engineering Study, 1986.</p>
        <p>[DJ99] Dasu, T., Johnson, T.: Hunting of the Snark: Finding Data Glitches with Data Mining Methods. In: Proc. of the Conf. on Information Quality IQ1999, MIT, Boston, MA, 1999.</p>
        <p>[DM92] DeLone, W. H., McLean, E. R.: Information Systems Success: The Quest for the Dependent Variable. In: Information Systems Research, 3 (1): 60–95, 1992.</p>
        <p>[Eng99] English, L. P.: Improving Data Warehouse and Business Information Quality. Wiley, New York, 1999.</p>
        <p>[GFSS00] Galhardas, H., Florescu, D., Shasha, D., Simon, E.: Declaratively Cleaning your Data using AJAX. In: Journ. Bases de Données Avancées, Oct. 2000.</p>
        <p>[Hel00] Helfert, M.: Massnahmen und Konzepte zur Sicherung der Datenqualitaet. In: Data Warehousing Strategie – Erfahrungen, Methoden, Visionen, pp. 61–77, Springer, Berlin, 2000 (in German).</p>
        <p>[Hin01] Hinrichs, H.: Datenqualitaetsmanagement in Data Warehouse-Umgebungen. In: Datenbanksysteme in Buero, Technik und Wissenschaft, 9. GI-Fachtagung BTW 2001, Oldenburg, pp. 187–206, Springer, Berlin, 2001 (in German).</p>
        <p>[HP99] Hinrichs, H., Panienski, K.: Experiences with Knowledge-Based Data Cleansing at the Epidemiological Cancer Registry of Lower Saxony. In: XPS-99: Knowledge-Based Systems – Survey and Future Directions, pp. 218–226, LNAI 1570, Springer, Berlin, 1999.</p>
        <p>[HS95] Hernandez, M. A., Stolfo, S. J.: The Merge/Purge Problem for Large Databases. In: Proc. of the 1995 ACM SIGMOD Conference, 1995.</p>
        <p>[HW00] Hinrichs, H., Wilkens, T.: Metadata-Based Data Auditing. In: Data Mining II (Proc. of the 2nd Intl. Conf. on Data Mining, Cambridge, UK), pp. 141–150, WIT Press, Southampton, 2000.</p>
        <p>[Ilo01] Ilog Inc.: http://www.ilog.com/products/rules, 2001.</p>
        <p>[Inm92] Inmon, W. H.: Building the Data Warehouse. Wiley, New York, 1992.</p>
        <p>[ISO00a] International Organization for Standardization: ISO 9000:2000: Quality Management Systems – Fundamentals and Vocabulary. Beuth, Berlin, 2000.</p>
        <p>[ISO00b] International Organization for Standardization: ISO 9001:2000: Quality Management Systems – Requirements. Beuth, Berlin, 2000.</p>
        <p>[Jar89] Jaro, M. A.: Advances in Record Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida. In: Journal of the American Statistical Association, 84: 414–420, 1989.</p>
        <p>[JJQV98] Jarke, M., Jeusfeld, M. A., Quix, C., Vassiliadis, P.: Architecture and Quality in Data Warehouses. In: Proc. of the 10th Intl. Conf. CAiSE*98, Pisa, Italy, pp. 93–113, Springer, Berlin, 1998.</p>
        <p>[Kim98] Kimball, R.: The Data Warehouse Lifecycle Toolkit. Wiley, New York, 1998.</p>
        <p>[KSD96] Kohavi, R., Sommerfield, D., Dougherty, J.: Data Mining using MLC++ – A Machine Learning Library. In: Tools with AI 1996, pp. 234–245, 1996.</p>
        <p>[LLL00] Lee, M. L., Ling, T. W., Low, W. L.: IntelliClean – A Knowledge-Based Intelligent Data Cleaner. In: Proc. of the 6th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, Boston, MA, 2000.</p>
        <p>[MDC01] Meta Data Coalition: http://www.MDCinfo.com, 2001.</p>
        <p>[MM00] Maletic, J. I., Marcus, A.: Data Cleansing – Beyond Integrity Analysis. In: Proc. of the Conf. on Information Quality IQ2000, MIT, Boston, MA, pp. 200–209, 2000.</p>
        <p>[NLF99] Naumann, F., Leser, U., Freytag, J. C.: Quality-Driven Integration of Heterogeneous Information Sources. In: Proc. of the 1999 Intl. Conf. on Very Large Databases (VLDB ’99), Edinburgh, UK, 1999.</p>
        <p>[OMG00] Object Management Group: Common Warehouse Metamodel (CWM) Specification, 2000.</p>
        <p>[RCH99] Raman, V., Chou, A., Hellerstein, J. M.: Scalable Spreadsheets for Interactive Data Analysis. In: Proc. of the ACM-SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), Philadelphia, 1999.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Shewhart</surname>
            ,
            <given-names>W. A.</given-names>
          </string-name>
          :
          <article-title>Economic Control of Quality of Manufactured Product</article-title>
          . D. Van Nostrand, New York,
          <year>1931</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Vality Technology Inc</surname>
          </string-name>
          .: http://www.vality.com,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Wang</surname>
          </string-name>
          , R. Y.:
          <article-title>A Product Perspective on Total Data Quality Management</article-title>
          .
          <source>In: Communications of the ACM</source>
          ,
          <volume>41</volume>
          (
          <issue>2</issue>
          ):
          <fpage>58</fpage>
          -
          <lpage>65</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>WizSoft</given-names>
            <surname>Inc</surname>
          </string-name>
          .: http://www.wizsoft.com,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>