Detection and Resolution of Data Inconsistencies,
 and Data Integration using Data Quality Criteria
Pilar Angeles, Lachlan M. MacKinnon
School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, U.K. EH14 4AS
E-mail: pilar@macs.hw.ac.uk, lachlan@macs.hw.ac.uk

      Abstract — We argue that user quality priorities, data inconsistencies and data quality differences among the participating sources have not been fully addressed in the processes and optimizations of information integration, such as query processing, query planning and the hierarchical structuring of results to the user. We propose the development of a Data Quality Manager (DQM) to establish communication between the information integration process, the user and the application, in order to deal with semantic heterogeneity and data quality. The DQM will contain a Reference Model, a Measurement Model and an Assessment Model to define the quality criteria, the metrics and the assessment methods. The DQM will also help in query planning, by considering data quality estimations to find the best combination for the execution plan. After query execution and the detection of inconsistent data, data quality might also be used to perform data inconsistency resolution. Integration and ranking of query results using quality criteria defined by the user will be an outcome of this process.

      Index Terms — Data Quality, Heterogeneous Databases, Information Integration, Information Quality, Semantic Integration.


1 INTRODUCTION

The problems of data inconsistency in data integration have been widely discussed and researched for a number of years, and a large number of them have been resolved, as described in our own work [35], [36]. However, the combination of these solutions, and the resolution of the remaining issues, remains an open problem. This has been exacerbated as the development of Information Systems, network communications and the World Wide Web has permitted widespread access to autonomous, distributed and heterogeneous data sources. An increasing number of databases, especially those published on the Web, are becoming available to external users. User requests are converted to queries over several data sources with different data quality, but the quality of the data sources utilised is not a feature of the process.

Integration of the schemas of existing databases into a global unified schema is an approach developed over 20 years ago [4]. However, information quality cannot be guaranteed after integration, because data quality is dependent on the design of the data and its provenance [31], [5]. Even greater levels of inconsistency exist when data is retrieved from different data sources.

On the other hand, different expectations exist on the quality of the information, depending on the user. A casual user on the Web does not expect complete and precise information [21], only information close to his selection condition. A professional user expects accuracy and completeness of the information retrieved in order to make a decision, irrespective of the time it could take to retrieve the data, although speed is still likely to be a lesser priority.

User priorities, data inconsistencies and data quality differences among the participating sources have not been fully addressed in the processes and optimizations of information integration, such as query processing, query planning and the hierarchical structuring of results to the user.

The aim of this paper is to establish the context and background on data quality for information retrieval, and to propose a Data Quality Manager to deal with data integration and data inconsistencies through the use of data quality properties.

This paper is organized as follows: in Section 2 the background on the establishment of data quality criteria, models and assessment is discussed. In Section 3 some issues that help in measuring data quality in heterogeneous databases are presented. In Section 4 the elements of the Data Quality Manager are presented, together with how it interacts with the data integration and data fusion processes. Finally, Section 5 concludes the paper by identifying its main points.

2 BACKGROUND

2.1 Data Integration in Heterogeneous Database Systems

Data integration is the process of extracting and merging data from multiple heterogeneous sources to be loaded into an integrated information resource [4]. Solving structural, syntactical and semantic heterogeneities between source and target data has been a complex problem for data integration for a number of years [28], [4], [35], [36].

One solution to this problem has been developed through the use of a single global database schema that represents the integrated information, with mappings from the global schema to the local schemas, where each query to the global schema is translated to queries to the local databases using these mappings [4]. The use of domain ontologies, metadata, transformation rules, and user and system constraints has resolved the majority of the problems of domain mismatch associated with schematic integration and global schematic approaches. However, even when all the mappings and all semantic and structural heterogeneity are resolved in the global schema, consistency may not have been achieved, because the data provided by the sources may be mutually inconsistent. This problem has remained because it is impossible to capture all the information and avoid null values. At the same time, each autonomous component database deals with its own properties or domain constraints on data, such as accuracy, reliability, availability, timeliness and
cost of data access.
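As a minimal illustration of this global-schema mechanism, the sketch below (in Python) rewrites a projection over a global relation into one query per local source; the mapping structure and all names in it are our own illustrative assumptions, not the notation of [4].

    # Hypothetical mapping: global relation -> {source: (local relation,
    # {global attribute: local attribute})}. Illustrative values only.
    MAPPINGS = {
        "customer": {
            "src_a": ("cust", {"id": "cust_id", "name": "cust_name"}),
            "src_b": ("clients", {"id": "client_no", "name": "full_name"}),
        },
    }

    def translate(global_relation, attributes):
        """Rewrite a global projection query into one SQL query per source."""
        local_queries = {}
        for source, (local_rel, attr_map) in MAPPINGS[global_relation].items():
            cols = ", ".join(attr_map[a] for a in attributes)
            local_queries[source] = "SELECT " + cols + " FROM " + local_rel
        return local_queries

    print(translate("customer", ["id", "name"]))
    # {'src_a': 'SELECT cust_id, cust_name FROM cust',
    #  'src_b': 'SELECT client_no, full_name FROM clients'}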
Several approaches to solving inconsistency between databases have been implemented:
1. By reconciliation of data, also known as data fusion: different values become a single value through a fusion function (e.g. average, highest or majority), depending on the semantics of the data [16]; a sketch of such fusion functions is given after this list.
2. On the basis of individual data properties associated with each data source (e.g. the cost of retrieving the data, how recent the data is, the level of authority associated with the source, or the accuracy and completeness of the data). These properties can be specified at different levels: the global schema design level, the query itself, or the profile of the user [2].
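As a minimal sketch of the first approach, the fusion functions below reconcile conflicting values of a single attribute reported by several sources; the function names and data are our own illustrations, not taken from [16].

    from collections import Counter
    from statistics import mean

    # Which fusion function applies depends on the semantics of the data:
    # averaging suits noisy measurements, the highest value suits
    # timestamps, and majority voting suits categorical facts.
    def fuse_average(values):
        return mean(values)

    def fuse_highest(values):
        return max(values)

    def fuse_majority(values):
        return Counter(values).most_common(1)[0][0]

    print(fuse_average([9.8, 10.0, 10.2]))             # 10.0
    print(fuse_majority(["Smith", "Smith", "Smyth"]))  # Smith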
                                                                                                      tance, sufficiency, usableness, useful-
   Some definitions of data quality criteria, metrics and meas-
                                                                                                      ness, clarity, conciseness, freedom of
   urement methods are presented in the following sections.
                                                                                                      bias, informativeness, level of detail,
2.2 Data Quality (DQ) vs. Information Quality (IQ)                                                    quantitativeness, scope, interpretabil-
                                                                                                      ity, understandability
“High data quality has been defined as data that is fit for use
                                                                                                      System-related:
by data consumers and is treated independent of the context in
                                                                                                      Timeliness, flexibility, format, efficiency
which data is produced and used” [29].
Data quality has been characterized by quality criteria or dimensions such as accuracy, completeness, consistency and timeliness [31], [16], [8], [22], [29], [25], [20]. However, there is no general agreement on data quality dimensions [32], [14].

There has been no specific differentiation between IQ and DQ, because the terms data and information are often used synonymously. However, data quality relates to accuracy and integrity, while information quality concerns data quality in context, and relates to how the information is produced and interpreted.

2.3 Data Quality Classifications

A definition of quality dimensions and a framework for the analysis of data quality as a research area were first proposed by Richard Wang et al. [32]. An ontologically based approach was developed by Yair Wand et al. [31]; this model analyzed data quality based on discrepancies in the representation mapping from the real world (RW) to the information system (IS) and vice versa, through the design and operation activities involved in the construction of an information system, as an internal view. A real world system is said to be properly represented if there exists an exhaustive mapping and no two states in the RW are mapped into the same state in the IS. Four intrinsic data quality dimensions were identified: complete, unambiguous, meaningful and correct. Additionally, mapping problems and data deficiency repairs were suggested. The analysis produced a classification of data quality dimensions as related to the internal or external view; a data quality measurement method was not addressed (see Table 1).

TABLE 1
CLASSIFICATION BASED ON INTERNAL OR EXTERNAL VIEW [31]

Internal view (design, operation)
  Data-related: accuracy, reliability, timeliness, completeness, currency, consistency, precision
  System-related: reliability

External view (use, value)
  Data-related: timeliness, relevance, content, importance, sufficiency, usableness, usefulness, clarity, conciseness, freedom from bias, informativeness, level of detail, quantitativeness, scope, interpretability, understandability
  System-related: timeliness, flexibility, format, efficiency

A different classification of data quality dimensions, developed by Diane Strong et al. [29], is based on a data-consumer perspective. Data quality categories were identified as intrinsic, accessibility, contextual and representational; again, a data quality measurement method was not addressed. Each category was directly related to different data quality dimensions (see Table 2).

TABLE 2
CLASSIFICATION BASED ON DATA-CONSUMER PERSPECTIVE [29]

Intrinsic DQ
  Concerns: mismatches among sources of the same data are a common cause of intrinsic DQ concerns.
  Causes: multiple sources of the same data; judgment involved in data production.
  Dimensions: accuracy, objectivity, believability, reputation.

Accessibility DQ
  Concerns: lack of computing resources; problems of privacy and confidentiality (interpretability, understandability, data representation).
  Causes: systems difficult to access; the need to protect confidentiality; representational DQ dimensions as causes of inaccessibility.
  Dimensions: accessibility, access security.

Contextual DQ
  Concerns: operational data production problems; changing data consumers' needs; distributed computing.
  Causes: incomplete data; inconsistent representation; inadequately defined or measured data; data results not properly aggregated.
  Dimensions: relevancy, value added, timeliness, completeness, amount of data.

Representational DQ
  Concerns: computerizing and data analyzing.
  Causes: data inaccessible because of multiple interpretations across multiple specialities and limited capacities to summarize across images.
  Dimensions: interpretability, ease of understanding, concise and consistent representation, timeliness, amount of data.

In Total Data Quality Management (TDQM) [33] the concepts, principles and procedures are presented as a methodology that defines the following life cycle: define, measure, analyze and improve data, as essential activities to ensure high quality, managing data as a product. There is no focus on multi-database integration, data inconsistency detection
or database retrieval solutions; there are just definitions and, in the best cases, measurements of data quality aspects. Table 3 presents the different quality dimension definitions, with the relevant factors for each dimension and the metric proposed, by author.

TABLE 3
QUALITY DIMENSION DEFINITIONS, DETERMINANT FACTORS AND METRICS BY AUTHOR [9], [10], [16], [25], [31]

Accuracy
  Wand/Wang (factors: RW/IS states): "Inaccuracy implies that the Information System (IS) represents a Real World (RW) state different from the one that should have been represented."
  Motro/Rakov (factors: data values): "Whether the data available are the true values (correctness, precision, accuracy or validity)."
  Gertz: "The degree of correctness and precision with which real world data of interest to an application domain are represented in an information system."

Precision
  Wand/Wang (factors: RW/IS states): ambiguity; improper representation, with multiple RW states mapped to the same IS state.

Completeness
  Wand/Wang (factors: RW/IS states): "Ability of an IS to represent every meaningful state of the represented real world system"; thus it is not tied to data-related concepts such as attributes, variables or values.
  Pipino/Wang (factors: data model, i.e. table, row, attribute, classes; metric: 1 − #incomplete items / #total items): "The extent to which data is not missing and is of sufficient breadth and depth for the task at hand."
  Ballou (factor: schema): "All values for a certain variable are recorded."
  Motro (factor: column): "Whether all the data are available."
  Gertz (factor: population): "The degree to which all data relevant to an application domain have been recorded in an information system."

Correctness
  Wand/Wang (factors: RW/IS states): "The IS state may be mapped back into a meaningful state, the correct one."
  Pipino/Wang (metric: 1 − #errors / #total): "The extent to which data is correct and reliable."

Timeliness
  Wand/Wang (factors: currency, volatility): "Whether the data is out of date; an availability of output on time."
  Pipino/Wang (metric: max(0, 1 − currency / volatility)): "The extent to which data is sufficiently up to date for the task at hand."
  Gertz: "The degree to which the recorded data are up-to-date."

Currency
  Wand/Wang: "How fast the IS state is updated after the real world system changes."
  Pipino/Wang (factors: age of the data when first received by the system; delivery time, when the data is delivered to the user; input time, when the data is received by the system; metric: age + delivery time − input time).
  Motro: "Whether the data are up to date, reflecting the most recent values."

Volatility
  Wand/Wang: "The rate of change of the real world."
  Pipino/Wang (factor: time; metric: time data becomes invalid − time data became valid): "Refers to the length of time data remains valid."

Consistency
  Wand/Wang (factors: RW/IS states): "Refers to several aspects of data; in particular, to values of data. Inconsistency would mean that the representation mapping is one to many. This is not considered a deficiency."
  Pipino/Wang (factors: physical representation of data, values of data on integrity constraints; metric: 1 − #inconsistent / #total consistency checks): "The extent to which data is presented in the same format", i.e. consistent representation.
  Motro: "Often referred to as integrity constraints, stating the proper relationships among different data elements."
  Gertz: "The degree to which the data managed in an information system satisfy specified constraints and business rules."

Believability
  Pipino/Wang (factors: source of data S, accepted standard A, previous experience P; metric: min(A, S, P)): "The extent to which data is regarded as true and credible."

Accessibility
  Pipino/Wang (factors: time of request TR, time of delivery TD, time after which the data is no longer useful TN; metric: max(0, 1 − (TD − TR) / (TN − TR)); alternatively, factors data path A, structure B, path lengths C; metric: min(A, B, C)): "The extent to which data is available, or easily and quickly retrievable."
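To make the ratio-style metrics of Table 3 concrete, the sketch below computes the Pipino/Wang completeness, correctness, currency and timeliness ratings; the example values are our own illustrative assumptions, not data from [25].

    from datetime import datetime, timedelta

    def completeness(incomplete_items, total_items):
        # Table 3: 1 - (#incomplete items / #total items)
        return 1 - incomplete_items / total_items

    def correctness(errors, total):
        # Table 3: 1 - (#errors / #total)
        return 1 - errors / total

    def currency(age, delivery_time, input_time):
        # Table 3: age + delivery time - input time
        return age + (delivery_time - input_time)

    def timeliness(curr, volatility):
        # Table 3: max(0, 1 - currency / volatility)
        return max(0.0, 1 - curr / volatility)

    received = datetime(2004, 6, 1)
    delivered = datetime(2004, 6, 3)
    c = currency(timedelta(days=1), delivered, received)   # 3 days
    print(completeness(25, 1000))                          # 0.975
    print(round(timeliness(c, timedelta(days=30)), 2))     # 0.9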
2.4 Assessment Methods for Information Quality Criteria

Information Quality (IQ) criteria have been classified in an assessment-oriented model by F. Naumann in [20], where an assessment method is identified for each criterion.
In this classification the user, the data and the query process are themselves considered sources of information quality (see Table 4).

TABLE 4
AN ASSESSMENT-ORIENTED CLASSIFICATION [20]

Subject Criteria (source of IQ metadata: the user)
  Believability: user experience
  Concise representation: user sampling
  Interpretability: user sampling
  Relevancy: continuous assessment
  Reputation: user experience
  Understandability: user sampling
  Value-added: continuous assessment

Object Criteria (source of IQ metadata: the information/data)
  Completeness: continuous assessment
  Customer support: parsing, sampling
  Documentation: parsing
  Objectivity: expert input
  Price: contract
  Reliability: continuous assessment
  Security: parsing
  Timeliness: parsing
  Verifiability: expert input

Process Criteria (source of IQ metadata: the query process)
  Accuracy: sampling, cleansing
  Amount of data: continuous assessment
  Availability: continuous assessment
  Consistent representation: parsing
  Latency: continuous assessment
  Response time: continuous assessment

The AIM Quality Methodology (AIMQ) [34] is a practical tool for assessing and benchmarking IQ in organizations, with three components. The PSP/IQ Model presents a classification of quality dimensions by product quality and service quality from the information consumer's perspective, and consolidates the dimensions into four quadrants (sound, dependable, useful and usable information) that are relevant to IQ improvement decisions. The IQA instrument measures IQ for each IQ dimension; in a pilot study, using questionnaires answered by information collectors, information consumers and IS professionals in six companies, these measures were averaged over the four quadrants, with each item assessed on a scale from 0 ("not at all") to 10 ("completely"). The IQ Gap Analysis Techniques assess the information quality for each of the four quadrants, and these gap assessments are the basis for focusing IQ improvement efforts. This methodology uses questionnaires as its main measurement method, taking a very pragmatic approach to IQ.

In the following section we present some approaches demonstrating how a data quality model, assessment methods and user priorities, based on the work discussed above, can help in the process of data integration.

3 MEASURING DATA QUALITY IN HETEROGENEOUS DATABASES

Database integration is divided by Motro and Rakov [16] into two main problems: intensional and extensional inconsistencies. Intensional inconsistencies are related to resolving the schematic differences between the component databases, an issue also known as semantic heterogeneity. Extensional inconsistencies are related to reconciling the data differences among the participating databases [16]. Information integration is the process of merging multiple query results into a single response to the user. There are several important areas of related work to consider in the following approaches.

3.1 Data Integration Techniques Based on Data Quality Aspects

Data integration techniques based on data quality aspects have been developed by Gertz [8], [9], within an object oriented data model and with data quality information stored in metadata. Quality aspects such as timeliness, accuracy and completeness were considered in the process of database integration. The main aspect was the assumption that the quality of the data stored at different sites can differ and that quality varies over time. Query language extensions were necessary to support the specification of data quality goals for global queries and thus for data integration. In the case of data conflicts between semantically equivalent objects, the object with the best data quality must be chosen. If no conflicts exist between objects but their quality levels differ, the integrated objects need to be grouped to allow the ranking of the results.

3.2 Multiplex

The MULTIPLEX project, directed by Motro and Rakov [16], addressed the problem of extensional inconsistencies with a data quality model for relational databases. MULTIPLEX was based on accuracy and completeness as quality criteria; the model assigned a quality specification to each instance of a relation, and the quality of the answers to arbitrary queries was calculated from the overall quality specification of the database by extending the relational algebra [16]. In the case of multiple sets of records as possible answers to one query, each set of records has an individual quality specification. A voting scheme, using probabilistic arguments, identifies the best set of records to provide a complete and sound answer and a ranking of the tuples in the answer space. The conflict resolution strategy and the quality estimates are addressed by the multidatabase designer.

3.3 Fusionplex

FUSIONPLEX [2], [3], an enhancement of the Multiplex system, stores information features, or quality criteria scores, in metadata; the quality dimensions considered are timestamp, accuracy, availability, clearance and cost of retrieval. Inconsistencies are resolved by data fusion, allowing the user to define the data quality estimation through a vector of feature weights, performance thresholds and a fusion function at the attribute level, as required. This approach reconciles the conflicting values at the attribute level using an intermediate result named a polyinstance, which contains the inconsistencies. First, the polyinstance is divided into polytuples and, using the feature weights and the threshold, members of each polytuple are discarded. Second, each polytuple is separated into mono-attribute polytuples using the primary key, assuming that the same value of the primary key in different databases refers to the same object but with different data values, and attribute values are
discarded based on their corresponding feature values. Finally, the mono-attribute tuples are joined back together, resulting in single tuples.
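A minimal sketch of this attribute-level fusion, under our own simplified representation of a polytuple (the feature names, weights and threshold are illustrative assumptions, not Fusionplex's actual interface):

    # Each candidate value for one attribute carries per-source feature
    # scores; a weighted utility discards weak candidates (the threshold
    # step), and a fusion function resolves the survivors.
    WEIGHTS = {"accuracy": 0.7, "recency": 0.3}   # user-defined feature weights
    THRESHOLD = 0.5                               # user-defined performance threshold

    def utility(features):
        return sum(WEIGHTS[f] * v for f, v in features.items())

    def resolve(candidates, fusion=max):
        """candidates: list of (value, features) pairs from different sources."""
        survivors = [v for v, f in candidates if utility(f) >= THRESHOLD]
        return fusion(survivors) if survivors else None

    salary = [(52000, {"accuracy": 0.9, "recency": 0.8}),   # source A
              (48000, {"accuracy": 0.3, "recency": 0.2})]   # source B: discarded
    print(resolve(salary))                                  # 52000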
3.4 Information Quality Reasoning

Information quality reasoning is defined by F. Naumann in [21] as the integration of information quality aspects into the process of planning and optimizing queries against databases and information systems. Such aspects are related through the establishment of information quality criteria, assessment methods and measures.

The selection of data sources, and the optimization of query planning by considering user priorities, has also been addressed in [21] through the definition of a quality model and a quality assessment method under the following assumptions:
1. Query processing is concerned with efficiently answering a user query over a single database or a multi-database. In this context efficiency means speed.
2. Query planning is concerned with finding the best possible answer given some cost or time constraint. Query planning involves considering many query execution plans across different, autonomous sources that together form the complete result.

In this approach, information sources were selected using the Data Envelopment Analysis (DEA) method [6] and the following quality dimensions: understandability, extent, availability, time and price, discarding sources of poor quality before executing the query.

However, different sources have different quality scores, and these must be fused to determine the best quality result. The quality fusion can be done in two ways: 1) applying a fusion function for each quality criterion and finding the best combination to query [17], or 2) computing an information quality score from different quality criteria, such as availability, price, accuracy, completeness, amount of data and response time, for each plan, and then ranking the plans using the Simple Additive Weighting (SAW) method explained in [11]; a sketch of such a ranking follows.
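As a sketch of the second option, the ranking below scales each criterion score to [0, 1], inverts cost-type criteria so that higher is always better, and orders the plans by their weighted sums; the plans, criteria and weights are our own illustrative assumptions, not values from [11] or [21].

    PLANS = {
        "planA": {"availability": 0.99, "accuracy": 0.90, "price": 8.0},
        "planB": {"availability": 0.80, "accuracy": 0.95, "price": 2.0},
    }
    WEIGHTS = {"availability": 0.3, "accuracy": 0.5, "price": 0.2}
    COST_CRITERIA = {"price"}   # lower is better, so invert after scaling

    def saw_ranking(plans):
        scores = {}
        for name, crit in plans.items():
            total = 0.0
            for c, w in WEIGHTS.items():
                lo = min(p[c] for p in plans.values())
                hi = max(p[c] for p in plans.values())
                norm = (crit[c] - lo) / (hi - lo) if hi > lo else 1.0
                if c in COST_CRITERIA:
                    norm = 1.0 - norm
                total += w * norm
            scores[name] = total
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    print(saw_ranking(PLANS))   # best plan first: planB (0.7), planA (0.3)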
The completeness of a query result derived from different sources is approached in [24] by considering the number of results (coverage) and the number of attribute values in the result (density). Completeness is calculated as the product of the density and the coverage of the corresponding set of information sources.
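Reading coverage and density as ratios, the calculation reduces to the following sketch (the counts are our own illustrative assumptions, not data from [24]):

    def result_completeness(objects_returned, objects_in_world,
                            values_filled, values_requested):
        coverage = objects_returned / objects_in_world   # fraction of objects
        density = values_filled / values_requested       # fraction of values
        return coverage * density

    # 800 of 1000 relevant objects, with 90% of their attribute values filled:
    print(result_completeness(800, 1000, 2880, 3200))    # 0.8 * 0.9 = 0.72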
3.5 Data Quality on the Web

At the Dagstuhl seminar on Data Quality on the Web, it was established that it is essential to first concentrate on developing expressive data quality models and, once such models are in place, to develop tools that help users and IT managers to capture and analyze the state of data quality in an information system [10].

4 DATA QUALITY MANAGER

Databases have traditionally been considered to be sources of information that are precise and complete. However, the design and implementation of such systems is carried out by human beings, who are imperfect, so during the whole software life cycle errors occur that are reflected in the quality of both the software and the information. Furthermore, when these sources of data come from different applications, distributed both physically and logically, these errors multiply. In the field of Information Systems this shortcoming has been recognized, and frameworks and reference models have been developed as standards, such as ISO 15504 [12] and CMMI [1], [7].

Here, the general objective is to establish good practices for software engineering and to be able to talk the same language during software processes, no matter what the architecture or implementation methodology. The same challenge needs to be taken up in the Data Quality area, based on the following:

1. It is essential to identify a framework that establishes the models corresponding to the quality criteria and the methods of measurement, assessment and improvement, and that considers the data quality life cycle. This framework can be used as good practice during information system development, integration, and the capture and tracking of changes in data. Tracking changes should offer quality improvement and data cleaning based on feedback provided by the information system itself, or a set of recommendations to the information manager, and will help to achieve self-regulating systems.

2. This framework might be considered in heterogeneous systems before, during and after the integration of information.

3. We propose a Data Quality Manager as the mechanism to establish communication between the user, the application and the information integration process, to deal with semantic heterogeneity problems, as part of the framework mentioned above (see Figure 1).

Fig. 1. Data Quality Manager in the process of information integration (diagram: the DQM's Reference Model, Measurement Model, Quality Metadata and Assessment Model support the selection of data sources, query planning, detection and fusion of data inconsistencies, query integration, and the ranking of query results).

4. The Data Quality Manager will contain the following elements:
• Reference Model: in this model the data quality criteria will be defined, depending on the data sources, the users and the application domain.
• Measurement Model: this will contain the definition of the metrics to be used to measure data quality, the definition of a quality metadata store (QMD), and the specification of data quality requirements such as user profiles and query language; a minimal sketch of a QMD entry is given after this list.
• Assessment Model: the definition of the quality scores is essential to establish how the quality indicators are going to be represented and interpreted.
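As a minimal sketch of what a quality metadata (QMD) entry and a user profile might record, under our own illustrative field choices (the DQM does not yet fix a concrete schema):

    from dataclasses import dataclass, field

    @dataclass
    class QMDEntry:
        source: str            # data source identifier
        accuracy: float        # criterion scores in [0, 1]
        completeness: float
        timeliness: float
        credibility: float
        cost_per_query: float  # retrieval cost, arbitrary units

    @dataclass
    class UserProfile:
        # per-criterion priorities, e.g. a user who favours current data
        weights: dict = field(default_factory=lambda: {"timeliness": 0.6,
                                                       "credibility": 0.4})

    qmd = [QMDEntry("src_a", 0.9, 0.8, 0.95, 0.6, 1.2),
           QMDEntry("src_b", 0.7, 0.9, 0.40, 0.9, 0.3)]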
5. The Data Quality Manager will establish the basis for taking decisions during the identification of data sources in heterogeneous systems, such that it will:

• Classify the sources of data based on certain quality criteria, depending on the application domain. The scores must be stored in quality metadata for every source of data (see Figure 2).

Fig. 2. Data Quality Manager components definition (diagram: the definition of quality criteria and the definition of metrics and indicators feed the quality metadata definition; the QMD is then populated and used for the assessment of data sources).

• Use the quality aspects previously stored in the metadata, together with the user priorities, to select the best sources of information before the execution of the queries; for example, a user may prefer the sources of information that are more current over those of greater credibility (see Figure 3).

Fig. 3. Selection of best data sources (diagram: a user query is mapped through the global/local schemas to the data sources involved in the query; the QMD and the quality user priorities yield a ranking of the best data sources).

• Help the query planning process by considering data quality estimations to find the best combination for the execution plan (see Figure 4).

Fig. 4. Query planning (diagram: the user global query is partitioned into sub-queries QueryA, QueryB and QueryC; the QMD and the quality user priorities yield the top-ranking query plan).

• After query execution and the detection of inconsistent data, use data quality to perform data fusion (see Figure 5).

Fig. 5. Detection and resolution of data inconsistencies (diagram: executing the query plan yields ResultX, ResultY and ResultZ; inconsistency detection separates consistent from inconsistent query results, and data fusion, guided by the QMD and the quality user priorities, resolves the inconsistent ones).

• Integrate the ranking of the information sources with the quality criteria estimated by the user (see Figure 6).

Fig. 6. Ranking of the query result (diagram: data fusion of ResultJ, ResultK and ResultL produces a consistent query result; query integration, guided by the quality user priorities and the QMD, produces the final query result ranking).

5 CONCLUSION

We have shown that, although there has been considerable past work on the resolution of semantic heterogeneity in multi data source systems over a number of years, expressive data quality models and tools to utilise them remain to be developed [10]. The approach developed for information quality reasoning [21] provides some mechanisms for data source selection, but does not address many of the data quality factors identified in Table 3. Accordingly, we propose a Data Quality Manager as a framework to deal with data inconsistencies and the lack of quality due to different sources, presenting a continuous process of data validation: definition of quality criteria, selection of the best data sources, ranking of query plans, detection and fusion of data inconsistencies, and ranking of the query result considering the quality of the data sources and user expectations. This work is already under way, and performance reporting of the tools developed will appear in the next twelve months.
ACKNOWLEDGEMENT

This work was supported by financial funding from the Consejo Nacional de Ciencia y Tecnologia (CONACYT), Mexico.

REFERENCES

[1] D.M. Ahern, A. Clouse, and R. Turner, "CMMI Distilled: A Practical Introduction to Integrated Process Improvement", The SEI Series in Software Engineering, Addison Wesley Professional, 2003.
[2] P. Anokhin and A. Motro, "Data Integration: Inconsistency Detection and Resolution Based on Source Properties", Proc. of FMII 2001, 10th International Workshop on Foundations of Models for Information Integration, Viterbo, Italy, 2001.
[3] P. Anokhin and A. Motro, "Fusionplex: Resolution of Data Inconsistencies in the Integration of Heterogeneous Information Sources", Technical Report ISE-TR-03-06, Information and Software Engineering Dept., George Mason Univ., Fairfax, Virginia, 2003.
[4] C. Batini, M. Lenzerini and S.B. Navathe, "A Comparative Analysis of Methodologies for Database Schema Integration", ACM Computing Surveys, vol. 18, no. 4, pp. 323-364, 1986.
[5] P. Buneman, M. Liberman, C.J. Overton and V. Tannen, "Data Provenance", http://www.cis.upenn.edu/~wctan/DataProvenance.
[6] A. Charnes, W. Cooper, and E. Rhodes, "Measuring the Efficiency of Decision Making Units", European Journal of Operational Research, pp. 429-444, 1978.
[7] M.B. Chrissis, M. Konrad and S. Shrum, "CMMI: Guidelines for Process Integration and Product Improvement", The SEI Series in Software Engineering, Addison Wesley Professional, 2003.
[8] M. Gertz and I. Schmitt, "Data Integration Techniques Based on Data Quality Aspects", 3rd National Workshop on Federal Databases, Magdeburg, Germany, 1998.
[9] M. Gertz, "Managing Data Quality and Integrity in Federated Databases", Second Annual IFIP TC-11 WG 11.5 Working Conference on Integrity and Internal Control in Information Systems, Warrenton, Virginia, Kluwer Academic Publishers, 1998.
[10] M. Gertz, "Report on the Dagstuhl Seminar: Data Quality on the Web", SIGMOD Record, vol. 33, no. 1, Mar. 2004.
[11] C.L. Hwang and K. Yoon, "Multiple Attribute Decision Making: Methods and Applications: A State-of-the-Art Survey", Springer-Verlag, Berlin, 1981.
[12] ISO/IEC JTC1/SC7/WG10, ISO/IEC TR 15504 (nine parts), 1998.
[13] H. Kon, S. Madnick, and M. Siegel, "Good Answers from Bad Data", Sloan WP #3868, 1995.
[14] G. Tayi and D. Ballou (guest editors), "Examining Data Quality", Communications of the ACM, vol. 41, no. 2, pp. 54-57, 1998.
[15] U. Leser and F. Naumann, "Query Planning with Information Quality Bounds", Proceedings of the 4th International Conference on Flexible Query Answering Systems (FQAS 2000), Warsaw, Poland, 2000.
[16] A. Motro and I. Rakov, "Estimating the Quality of Databases", Proceedings of FQAS 98: Third International Conference on Flexible Query Answering Systems, T. Andreasen, H. Christiansen, and H.L. Larsen, eds., pp. 298-307, Roskilde, Denmark, Springer-Verlag, Berlin, Germany, 1998.
[17] F. Naumann, "Data Fusion and Data Quality", Proceedings of the New Techniques & Technologies for Statistics Seminar (NTTS), Sorrento, Italy, 1998.
[18] F. Naumann, "Quality-Driven Integration of Heterogeneous Information Systems", Proceedings of the 25th Very Large Data Bases Conference (VLDB 99), Edinburgh, Scotland, 1999.
[19] F. Naumann and C. Rolker, "Do Metadata Models Meet IQ Requirements", Proceedings of the International Conference on Information Quality, MIT, Cambridge, 1999.
[20] F. Naumann and C. Rolker, "Assessment Methods for Information Quality Criteria", Proceedings of the International Conference on Information Quality (IQ 2000), Cambridge, Mass., 2000.
[21] F. Naumann, "From Databases to Information Systems: Information Quality Makes the Difference", Proceedings of the International Conference on Information Quality (IQ 2001), Cambridge, Mass., 2001.
[22] F. Naumann, "Quality-Driven Query Answering for Integrated Information Systems", Lecture Notes in Computer Science, LNCS 2261, Springer Verlag, Heidelberg, 2002.
[23] F. Naumann and M. Haeussler, "Declarative Data Merging with Conflict Resolution", Proceedings of the International Conference on Information Quality (IQ 2002), Cambridge, Mass., 2002.
[24] F. Naumann, J. Freytag and U. Leser, "Completeness of Information Sources", Workshop on Data Quality in Cooperative Information Systems (DQCIS 2003), 2003.
[25] L. Pipino, Y.W. Lee and R. Wang, "Data Quality Assessment", Communications of the ACM, vol. 45, no. 4, pp. 211-218, 2002.
[26] A. Parssian, S. Sarkar and V. Jacob, "Assessing Data Quality for Information Products", Proceedings of the 20th International Conference on Information Systems (ICIS 1999), Charlotte, North Carolina, USA, pp. 428-433, 1999.
[27] E. Pierce, "Assessing Data Quality with Control Matrices", Communications of the ACM, vol. 47, no. 2, pp. 82-86, 2004.
[28] A. Sheth and J. Larson, "Federated Database Systems for Managing Distributed, Heterogeneous and Autonomous Databases", ACM Computing Surveys, vol. 22, no. 3, pp. 183-236, 1990.
[29] D.M. Strong, Y.W. Lee and R.Y. Wang, "Data Quality in Context", Communications of the ACM, vol. 40, no. 5, pp. 103-110, 1997.
[30] D.M. Strong, Y.W. Lee and R.Y. Wang, "10 Potholes in the Road to Information Quality", IEEE Computer, vol. 30, no. 8, pp. 38-46, 1997.
[31] Y. Wand and R. Wang, "Anchoring Data Quality Dimensions in Ontological Foundations", Communications of the ACM, vol. 39, no. 11, pp. 86-95, 1996.
[32] R.Y. Wang, V.C. Storey, and C.P. Firth, "A Framework for Analysis of Data Quality Research", IEEE Trans. Knowledge and Data Eng., 1995.
[33] R. Wang, "A Product Perspective on Total Data Quality Management", Communications of the ACM, vol. 41, no. 2, pp. 58-65, 1998.
[34] Y.W. Lee, D.M. Strong, B.K. Kahn and R.Y. Wang, "AIMQ: A Methodology for Information Quality Assessment", Information and Management, vol. 40, no. 2, pp. 133-146, 2002.
[35] L.M. MacKinnon, D.H. Marwick and H. Williams, "A Model for Query Decomposition and Answer Construction in Heterogeneous Database Systems", Journal of Intelligent Information Systems, 1998.
[36] H. Williams, H.T. El-Khatib and L.M. MacKinnon, "A Framework and Test-Suite for Assessing Approaches to Resolving Heterogeneity", Information and Software Technology, 2000.

Pilar Angeles obtained her first degree in computer engineering from the Universidad Nacional Autonoma de Mexico (UNAM) in 1993, a Diploma in Expert Systems from the Instituto Tecnologico Autonomo de Mexico (ITAM) in 1994, a Diploma in Telematic Systems from ITAM in 1995, and an M.Sc. in Computer Science, on quality in software engineering, from UNAM in 2000. Since 1989 she has worked on technical support for databases at Casa de Bolsa Probursa, Nissan Mexicana, Software AG, Sybase de Mexico and e-Strategy Mexico. Her recent research interests include data quality and heterogeneous databases. She is a founder member of the Quality in Software Engineering Mexican Association (AMCIS).

Lachlan M. MacKinnon is Reader in Computer Science, and Director of Postgraduate Study in Computer Science, at Heriot-Watt University. He has a first degree in Computer Science and a PhD in Intelligent Querying for Heterogeneous Databases. He researches and consults widely in data, information and knowledge technologies. He is a member of the IEEE, the British Computer Society, the ACM and AACE, immediate past Chair of the British National Conference on Databases, and upcoming Chair of the British HCI Conference. He has over 50 conference and journal publications in this area.