Detection and Resolution of Data Inconsistencies, and Data Integration using Data Quality Criteria

Pilar Angeles, Lachlan M. MacKinnon

Abstract — In the processes and optimization of information integration, such as query processing, query planning and hierarchical structuring of results to the user, we argue that user quality priorities, data inconsistencies and data quality differences among the participating sources have not been fully addressed. We propose the development of a Data Quality Manager (DQM) to establish communication between the process of integration of information, the user and the application, to deal with semantic heterogeneity and data quality. DQM will contain a Reference Model, a Measurement Model, and an Assessment Model to define the quality criteria, the metrics and the assessment methods. DQM will also help in query planning by considering data quality estimations to find the best combination for the execution plan. After query execution, and detection of inconsistent data, data quality might also be used to perform data inconsistency resolution. Integration and ranking of query results using quality criteria defined by the user will be an outcome of this process.

Index Terms — Data Quality, Heterogeneous Databases, Information Integration, Information Quality, Semantic Integration.

• P. Angeles, School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, U.K., EH14 4AS. E-mail: pilar@macs.hw.ac.uk.
• L.M. MacKinnon, School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, U.K., EH14 4AS. E-mail: lachlan@macs.hw.ac.uk.

1 INTRODUCTION

The problems of data inconsistency in data integration have been widely discussed and researched for a number of years, and a large number of them have been resolved, as described in our own work [35], [36]. However, the combination of these solutions, and the resolution of the remaining issues, remains an open problem. The problem has been exacerbated as the development of information systems, network communications and the World Wide Web has permitted widespread access to autonomous, distributed and heterogeneous data sources. An increasing number of databases, especially those published on the Web, are becoming available to external users. User requests are converted to queries over several data sources with different data quality, but the quality of the data sources utilised is not a feature of the process.

Integration of schemas on existing databases into a global unified schema is an approach developed over 20 years ago [4]. However, information quality cannot be guaranteed after integration, because data quality depends on the design of the data and its provenance [31], [5]. Even greater levels of inconsistency exist when data is retrieved from different data sources.

On the other hand, different expectations exist regarding the quality of the information, depending on the user. A casual user on the Web does not expect complete and precise information [21], only information close to his selection condition. A professional user expects accuracy and completeness of the information retrieved in order to make a decision, irrespective of the time it could take to retrieve the data, although speed is still likely to be a lesser priority.

User priorities, data inconsistencies and data quality differences among the participating sources have not been fully addressed in the processes and optimizations of information integration, such as query processing, query planning and hierarchical structuring of results to the user.

The aim of this paper is to establish the context and background on data quality for information retrieval and to propose a Data Quality Manager to deal with data integration and data inconsistencies through the use of data quality properties.

This paper is organized as follows: Section 2 discusses the background on the establishment of data quality criteria, models and assessment. Section 3 presents some issues that help in measuring data quality in heterogeneous databases. Section 4 presents the elements of the Data Quality Manager and how it interacts with the data integration and data fusion processes. Finally, Section 5 concludes the paper by identifying its main points.

2 BACKGROUND

2.1 Data Integration in Heterogeneous Database Systems

Data integration is the process of extracting and merging data from multiple heterogeneous sources to be loaded into an integrated information resource [4]. Solving structural, syntactical and semantic heterogeneities between source and target data has been a complex problem for data integration for a number of years [28], [4], [35], [36].

One solution to this problem has been developed through the use of a single global database schema that represents the integrated information, with mappings from the global schema to the local schemas, where each query to the global schema is translated into queries to the local databases using these mappings [4]. The use of domain ontologies, metadata, transformation rules, and user and system constraints has resolved the majority of the problems of domain mismatch associated with schematic integration and global schematic approaches. However, even when all the mappings and the semantic and structural heterogeneities are resolved in the global schema, consistency may not have been achieved, because the data provided by the sources may be mutually inconsistent. This problem has remained because it is impossible to capture all the information and avoid null values. At the same time, each autonomous component database deals with its own properties or domain constraints on data, such as accuracy, reliability, availability, timeliness and cost of data access.

Several approaches to solving inconsistency between databases have been implemented:
1. By reconciliation of data, also known as data fusion: different values become just one using a fusion function (i.e. average, highest, or majority), depending on the semantics of the data [16]; a minimal sketch of such fusion functions is given below.
2. On the basis of individual data properties associated with each data source (i.e. cost of retrieving data, how recent the data is, the level of authority associated with the source, or the accuracy and completeness of the data). These properties can be specified at different levels: the global schema design level, the query itself, or the profile of the user [2].
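As an illustration of the first approach, the sketch below (our own example, not code taken from [16]) applies average, highest and majority fusion functions to a set of conflicting attribute values; the function name and the strategy labels are assumptions made for this example.

```python
# Minimal sketch of the fusion functions mentioned above: conflicting values
# for the same real-world attribute are reconciled into a single value by an
# average, highest, or majority-vote rule, chosen according to the semantics
# of the data.
from collections import Counter
from statistics import mean

def fuse(values, strategy="majority"):
    """Reconcile conflicting attribute values drawn from several sources."""
    values = [v for v in values if v is not None]   # ignore missing values
    if not values:
        return None
    if strategy == "average":                        # e.g. sensor readings
        return mean(values)
    if strategy == "highest":                        # e.g. most recent balance
        return max(values)
    if strategy == "majority":                       # e.g. categorical facts
        return Counter(values).most_common(1)[0][0]
    raise ValueError(f"unknown fusion strategy: {strategy}")

# Example: three sources disagree on a customer's city.
print(fuse(["Edinburgh", "Edinburgh", "Glasgow"]))   # -> "Edinburgh"
print(fuse([9.8, 10.1, 10.0], strategy="average"))   # -> 9.96...
```

In practice the strategy would be chosen per attribute, according to the semantics of the data, as noted above.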
Some definitions of data quality criteria, metrics and measurement methods are presented in the following sections.

2.2 Data Quality (DQ) vs. Information Quality (IQ)

"High data quality has been defined as data that is fit for use by data consumers and is treated independent of the context in which data is produced and used" [29]. Data quality has been characterized by quality criteria or dimensions such as accuracy, completeness, consistency and timeliness [31], [16], [8], [22], [29], [25], [20]. However, there is no general agreement on data quality dimensions [32], [14].

There has not been a specific differentiation between IQ and DQ, because the terms data and information are often used synonymously. However, data quality is related to accuracy and integrity, whereas information quality is concerned with data quality in context, and is related to how the information is produced and interpreted.

2.3 Data Quality Classifications

A definition of quality dimensions and a framework for the analysis of data quality as a research area was first proposed by Richard Wang et al. [32]. An ontologically based approach was developed by Yair Wand et al. [31]; this model analyzed data quality based on discrepancies between the representation mapping from the real world (RW) to the information system (IS) and vice versa, through the design and operation activities involved in the construction of an information system, as an internal view. A real world system is said to be properly represented if there exists an exhaustive mapping and no two states in the RW are mapped into the same state in the IS. Four intrinsic data quality dimensions were identified: complete, unambiguous, meaningful and correct. Additionally, mapping problems and data deficiency repairs were suggested. The analysis produced a classification of data quality dimensions as related to the internal or external views (see Table 1); a data quality measurement method was not addressed.

TABLE 1
CLASSIFICATION BASED ON INTERNAL OR EXTERNAL VIEW [31]

Internal view (design, operation)
  Data-related: accuracy, reliability, timeliness, completeness, currency, consistency, precision.
  System-related: reliability.
External view (use, value)
  Data-related: timeliness, relevance, content, importance, sufficiency, usableness, usefulness, clarity, conciseness, freedom from bias, informativeness, level of detail, quantitativeness, scope, interpretability, understandability.
  System-related: timeliness, flexibility, format, efficiency.
A different classification of data quality dimensions, developed by Diane Strong et al. [29], is based on a data-consumer perspective. Data quality categories were identified as intrinsic, accessibility, contextual and representational, and each category was mapped to different data quality dimensions (see Table 2). A data quality measurement method was not addressed.

TABLE 2
CLASSIFICATION BASED ON DATA-CONSUMER PERSPECTIVE [29]

Intrinsic
  DQ concerns: mismatches among sources of the same data are a common cause of intrinsic DQ concerns.
  Causes: multiple sources of the same data; judgment involved in data production.
  DQ dimensions: accuracy, objectivity, believability, reputation.
Accessibility
  DQ concerns: lack of computing resources; problems of privacy and confidentiality.
  Causes: systems difficult to access; the need to protect confidentiality; representational DQ dimensions (interpretability, understandability, data representation) are causes of inaccessibility.
  DQ dimensions: accessibility, access security.
Contextual
  DQ concerns: operational data production problems; changing data consumers' needs; distributed computing.
  Causes: incomplete data; inconsistent representation; inadequately defined or measured data; data results not properly aggregated.
  DQ dimensions: relevancy, value-added, timeliness, completeness, amount of data.
Representational
  DQ concerns: computerizing and analyzing data.
  Causes: data inaccessible because of multiple interpretations across multiple specialities and limited capacities to summarize across image data.
  DQ dimensions: interpretability, ease of understanding, concise and consistent representation, timeliness, amount of data.
In Total Data Quality Management (TDQM) [33], concepts, principles and procedures are presented as a methodology that defines the following life cycle: define, measure, analyze and improve data as essential activities to ensure high quality, managing data as a product. There is no focus on multi-database integration, data inconsistency detection or database retrieval solutions.

In these approaches there are, at best, definitions and measurements of data quality aspects. Table 3 presents the different quality dimension definitions, together with the relevant factors of each dimension and the metric proposed, by author.

TABLE 3
QUALITY DIMENSION DEFINITIONS, DETERMINANT FACTORS AND METRICS BY AUTHOR [9], [10], [16], [25], [31]

Accuracy
  Wand/Wang (factors: RW/IS states): "Inaccuracy implies that the Information System (IS) represents a Real World (RW) state different from the one that should have been represented."
  Motro/Rakov (factors: data values): "Whether the data available are the true values (correctness, precision, accuracy or validity)."
  Gertz: "The degree of correctness and precision with which real world data of interest to an application domain are represented in an information system."
Precision
  Wand/Wang (factors: RW/IS states): ambiguity, i.e. improper representation: multiple RW states mapped to the same IS state.
Completeness
  Wand/Wang (factors: RW/IS states): "Ability of an IS to represent every meaningful state of the represented real world system"; thus it is not tied to data-related concepts such as attributes, variables, or values.
  Pipino/Wang (factors: data model, i.e. table, row, attribute, classes): "The extent to which data is not missing and is of sufficient breadth and depth for the task at hand." Metric: 1 - (#incomplete items / #total items).
  Ballou (factors: schema): "All values for a certain variable are recorded."
  Motro (factors: column): "Whether all the data are available."
  Gertz (factors: population): "The degree to which all data relevant to an application domain have been recorded in an information system."
Correctness
  Wand/Wang (factors: RW/IS states): "The IS state may be mapped back into a meaningful state, the correct one."
  Pipino/Wang: "The extent to which data is correct and reliable." Metric: 1 - (#errors / #total items).
Timeliness
  Wand/Wang (factors: currency, volatility): "Whether the data is out of date; the availability of output on time."
  Pipino/Wang: "The extent to which data is sufficiently up to date for the task at hand." Metric: max(0, 1 - currency / volatility).
  Gertz: "The degree to which the recorded data are up-to-date."
Currency
  Wand/Wang: "How fast the IS state is updated after the real world system changes."
  Pipino/Wang (factors: age of the data when first received by the system; delivery time, when the data is delivered to the user; input time, when the data is received by the system). Metric: age + (delivery time - input time).
  Motro: "Whether the data are up to date, reflecting the most recent values."
Volatility
  Wand/Wang: "The rate of change of the real world."
  Pipino/Wang: "Refers to the length of time data remains valid." Metric: time at which the data becomes invalid - time at which the data becomes valid.
Consistency
  Wand/Wang (factors: RW/IS states, values of data, integrity constraints, data representation): "Refers to several aspects of data, in particular to values of data. Inconsistency would mean that the representation mapping is one to many. This is not considered a deficiency."
  Pipino/Wang (factors: physical representation of data, values of data on integrity constraints): "The extent to which data is presented in the same format", i.e. consistent representation. Metric: 1 - (#inconsistent / #total consistency checks).
  Motro: "Often referred to as integrity constraints, which state the proper relationships among different data elements."
  Gertz: "The degree to which the data managed in an information system satisfy specified constraints and business rules."
Believability
  Pipino/Wang (factors: source of data S, accepted standard A, previous experience P): "The extent to which data is regarded as true and credible." Metric: min(A, S, P).
Accessibility
  Pipino/Wang (factors: time of request TR, time of delivery TD, time when data is no longer useful TN): "The extent to which data is available, or easily and quickly retrievable." Metric: max(0, 1 - (TR - TD) / (TR - TN)); alternatively min(A, B, C) over data path A, structure B and path lengths C.
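The ratio-style metrics collected in Table 3 are straightforward to compute once the underlying counts are available. The sketch below is our own illustration of the Pipino/Wang-style formulas for completeness, correctness, currency and timeliness; the function names and the example counts are assumptions made for the illustration, not part of the cited works.

```python
# Illustrative computation of the ratio-style metrics in Table 3
# (Pipino/Wang); results are normalised to [0, 1], 1 being best.

def completeness(incomplete_items: int, total_items: int) -> float:
    """1 - (#incomplete items / #total items)."""
    return 1.0 - incomplete_items / total_items if total_items else 0.0

def correctness(errors: int, total_items: int) -> float:
    """1 - (#errors / #total items)."""
    return 1.0 - errors / total_items if total_items else 0.0

def currency(age: float, delivery_time: float, input_time: float) -> float:
    """Currency = age + (delivery time - input time), in one time unit."""
    return age + (delivery_time - input_time)

def timeliness(currency_value: float, volatility: float) -> float:
    """max(0, 1 - currency / volatility): 1 is fully up to date, 0 is stale."""
    return max(0.0, 1.0 - currency_value / volatility) if volatility else 0.0

# Example: 40 of 1000 tuples have missing attributes, 12 are known to be
# wrong, and the data is 2 days old in a domain that changes about every
# 30 days.
print(completeness(40, 1000))              # 0.96
print(correctness(12, 1000))               # 0.988
print(timeliness(currency(2, 0, 0), 30))   # ~0.933
```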
2.4 Assessment Methods for Information Quality Criteria

Information Quality (IQ) criteria have been classified in an assessment-oriented model by F. Naumann in [20], where an assessment method is identified for each criterion. In this classification the user, the data and the query process are themselves considered as sources of information quality (see Table 4).

TABLE 4
AN ASSESSMENT-ORIENTED CLASSIFICATION [20]

Subject criteria (source of IQ scores: the user)
  Believability: user experience.
  Concise representation: user sampling.
  Interpretability: user sampling.
  Relevancy: continuous assessment.
  Reputation: user experience.
  Understandability: user sampling.
  Value-added: continuous assessment.
Object criteria (source of IQ scores: the information/data)
  Completeness: continuous assessment.
  Customer support: parsing, sampling.
  Documentation: parsing.
  Objectivity: expert input.
  Price: contract.
  Reliability: continuous assessment.
  Security: parsing.
  Timeliness: parsing.
  Verifiability: expert input.
Process criteria (source of IQ scores: the query process)
  Accuracy: sampling, cleansing.
  Amount of data: continuous assessment.
  Availability: continuous assessment.
  Consistent representation: parsing.
  Latency: continuous assessment.
  Response time: continuous assessment.

The AIM Quality methodology (AIMQ) [34] is a practical tool for assessing and benchmarking IQ in organizations, with three components. The PSP/IQ model presents a quality dimension classification by product quality and service quality from an information-consumer perspective, and consolidates the dimensions into four quadrants: sound, dependable, useful and usable information; these quadrants are relevant to IQ improvement decisions. The IQA instrument measures IQ for each IQ dimension; in a pilot study, using questionnaires answered by information collectors, information consumers and IS professionals in six companies, these measures were averaged for the four quadrants, with each item assessed on a scale from 0 ("not at all") to 10 ("completely"). The IQ gap analysis techniques assess the information quality for each of the four quadrants, and these gap assessments are the basis for focusing IQ improvement efforts. This methodology uses questionnaires as its main measurement method, taking a very pragmatic approach to IQ.

In the following sections we present some approaches demonstrating how a data quality model, assessment methods and user priorities, based on the work discussed above, can help in the process of data integration.

3 MEASURING DATA QUALITY IN HETEROGENEOUS DATABASES

Database integration is divided by Motro and Rakov [16] into two main problems: intensional and extensional inconsistencies. Intensional inconsistencies are related to resolving the schematic differences between the component databases, an issue also known as semantic heterogeneity. Extensional inconsistencies are related to reconciling the data differences among the participating databases [16]. Information integration is the process of merging multiple query results into a single response to the user. There are several important areas of related work to consider in the following approaches.

3.1 Data Integration Techniques Based on Data Quality Aspects

Data integration techniques based on data quality aspects have been developed by Gertz [8], [9] within an object-oriented data model, with data quality information stored in metadata. Quality aspects such as timeliness, accuracy and completeness were considered in the process of database integration. The main aspect was the assumption that the quality of the data stored at different sites can differ and that quality varies over time. Query language extensions were necessary to support the specification of data quality goals for global queries and thus for data integration. In the case of data conflicts between semantically equivalent objects, the object with the best data quality must be chosen. If no conflicts exist between objects but their quality levels differ, the integrated objects need to be grouped to allow the ranking of the results.
3.2 Multiplex

The MULTIPLEX project, directed by Motro and Rakov [16], addressed the problem of extensional inconsistencies and defined a data quality model for relational databases. MULTIPLEX was based on accuracy and completeness as quality criteria; the model assigned a quality specification to each instance of a relation, and these quality specifications were calculated by extending the relational algebra. The quality of the answer to an arbitrary query was calculated from the overall quality specification of the database [16]. In the case of multiple sets of records as possible answers to one query, each set of records has an individual quality specification. A voting scheme, using probabilistic arguments, identifies the best set of records to provide a complete and sound answer and a ranking of the tuples in the answer space. The conflict resolution strategy and the quality estimates are addressed by the multidatabase designer.

3.3 Fusionplex

FUSIONPLEX [2], [3], an enhancement of the Multiplex system, stores information features, or quality criteria scores, in metadata; the quality dimensions considered are timestamp, accuracy, availability, clearance and cost of retrieval. Inconsistencies are resolved by data fusion, allowing the user to define data quality estimations through a vector of feature weights, performance thresholds and a fusion function at attribute level, as required. This approach reconciles conflicting values at attribute level using an intermediate result named a polyinstance, which contains the inconsistencies. First, the polyinstance is divided into polytuples and, using the feature weights and the threshold, members of each polytuple are discarded. Second, each polytuple is separated into mono-attribute polytuples using the primary key, assuming that the same primary key value in different databases refers to the same object but with different data values, and attribute values are discarded based on the corresponding feature values. Finally, the mono-attribute tuples are joined back together, resulting in single tuples.
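The following sketch illustrates the attribute-level resolution strategy just described. It is a simplified construction of our own, inspired by Fusionplex but not its actual interface: candidate tuples sharing a primary key form a polytuple, members whose weighted feature score falls below a threshold are discarded, and the best-scoring survivor supplies each attribute value.

```python
# Simplified, hypothetical sketch of attribute-level conflict resolution
# with feature weights and a threshold (inspired by the Fusionplex idea).

def weighted_score(features, weights):
    """Combine per-source feature scores (e.g. accuracy, timestamp)."""
    return sum(weights[f] * features.get(f, 0.0) for f in weights)

def resolve_polytuple(candidates, weights, threshold):
    """candidates: list of (attribute_values: dict, features: dict) for one key."""
    scored = [(weighted_score(feat, weights), vals) for vals, feat in candidates]
    survivors = [(s, vals) for s, vals in scored if s >= threshold]  # discard weak members
    if not survivors:
        return None
    fused = {}
    for attr in {a for _, vals in survivors for a in vals}:
        # the best-scoring surviving source supplies each attribute value
        best_score, best_vals = max(
            ((s, vals) for s, vals in survivors if attr in vals),
            key=lambda sv: sv[0])
        fused[attr] = best_vals[attr]
    return fused

# Two sources describe the same customer (same primary key) with conflicting values.
weights = {"accuracy": 0.7, "timestamp": 0.3}
candidates = [
    ({"city": "Edinburgh", "phone": "555-1212"}, {"accuracy": 0.9, "timestamp": 0.4}),
    ({"city": "Glasgow"},                        {"accuracy": 0.5, "timestamp": 0.9}),
]
print(resolve_polytuple(candidates, weights, threshold=0.5))
# city is taken from the higher-scoring source; phone from the only source that has it
```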
3.4 Information Quality Reasoning

Information quality reasoning is defined by F. Naumann in [21] as the integration of information quality aspects into the process of planning and optimizing queries against databases and information systems. These aspects are related through the establishment of information quality criteria, assessment methods and measures.

The selection of data sources, and the optimization of query planning by considering user priorities, has also been addressed in [21] through the definition of a quality model and a quality assessment method under the following assumptions:
1. Query processing is concerned with efficiently answering a user query against a single database or a multidatabase; in this context, efficiency means speed.
2. Query planning is concerned with finding the best possible answer given some cost or time constraint. Query planning involves considering many query execution plans across different, autonomous sources that together form the complete result.

In this approach, information sources were selected using the Data Envelopment Analysis (DEA) method [6] and the quality dimensions understandability, extent, availability, time and price, discarding sources with poor quality before executing the query.

However, different sources have different quality scores, and these must be fused to determine the best quality result. The quality fusion can be done in two ways: 1) applying a fusion function to each quality criterion and finding the best combination for the query [17]; or 2) computing an information quality score from different quality criteria, such as availability, price, accuracy, completeness, amount of data and response time, for each plan, and thus ranking the plans using the Simple Additive Weighting (SAW) method explained in [11]; a small sketch of this option is given below.

The completeness of a query result derived from different sources is approached in [24] by considering the number of results (coverage) and the number of attribute values in the result (density). Completeness is calculated as the product of the density and the coverage of the corresponding set of information sources.
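The sketch below illustrates the second fusion option: candidate plans are scored with Simple Additive Weighting over user-weighted, normalised criteria, with completeness taken as coverage times density as in [24]. The plan names, criteria values and weights are invented for the example; this is our own illustration, not the implementation in [11] or [21].

```python
# Illustrative SAW ranking of candidate execution plans: each criterion is
# scaled to [0, 1] across the plans, multiplied by a user weight, and summed.

def completeness(coverage: float, density: float) -> float:
    """Completeness of a plan's result, as coverage x density [24]."""
    return coverage * density

def saw_rank(plans, weights):
    """plans: {name: {criterion: raw score}}; higher raw scores are better."""
    criteria = list(weights)
    scaled = {}
    for c in criteria:
        values = [p[c] for p in plans.values()]
        lo, hi = min(values), max(values)
        scaled[c] = {name: (p[c] - lo) / (hi - lo) if hi > lo else 1.0
                     for name, p in plans.items()}
    scores = {name: sum(weights[c] * scaled[c][name] for c in criteria)
              for name in plans}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

plans = {
    "planA": {"availability": 0.99, "completeness": completeness(0.8, 0.9), "response": 0.6},
    "planB": {"availability": 0.90, "completeness": completeness(0.9, 0.7), "response": 0.9},
}
weights = {"availability": 0.3, "completeness": 0.5, "response": 0.2}
print(saw_rank(plans, weights))   # highest-scoring plan first
```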
3.5 Data Quality on the Web

At the Dagstuhl seminar on data quality on the Web [10], it was established that it is essential to first concentrate on developing expressive data quality models and, once such models are in place, to develop tools that help users and IT managers to capture and analyze the state of data quality in an information system.

4 DATA QUALITY MANAGER

Databases have traditionally been considered to be sources of information that are precise and complete. However, the design and implementation of such systems is carried out by human beings, who are imperfect, so throughout the software life cycle errors occur that are reflected in the quality of both the software and the information. Furthermore, when these sources of data come from different applications, distributed both physically and logically, these errors multiply. In the field of information systems this shortcoming has been recognized, and frameworks and models of reference have been developed as standards, such as ISO 15504 [12] and CMMI [1], [7]. Here, the general objective is to establish good practices for software engineering and to be able to speak the same language during software processes, regardless of the architecture or implementation methodology. The same challenge needs to be taken up in the data quality area, based on the following:

1. It is essential to identify a framework that establishes the models corresponding to the quality criteria, the methods of measurement, assessment and improvement, and that considers the data quality life cycle. This framework can be used as good practice during information system development, integration, and the capture and tracking of changes in data. Tracking changes should offer quality improvement and data cleaning based on feedback provided by the information system itself, or a set of recommendations to the information manager, and will help to achieve self-regulating systems.

2. This framework might be considered in heterogeneous systems before, during and after the integration of information.

3. We propose a Data Quality Manager as the mechanism to establish communication between the user, the application and the process of integration of information, to deal with semantic heterogeneity problems, as part of the framework mentioned above (see Figure 1).

Fig. 1. Data Quality Manager in the process of information integration. (The DQM's Reference Model, Measurement Model, Assessment Model and quality metadata support selection of data sources, query planning, detection and fusion of data inconsistencies, query integration and ranking of query results.)

4. The Data Quality Manager will contain the following elements:
  • Reference Model: defines the data quality criteria, depending on the data sources, the users and the application domain.
  • Measurement Model: contains the definition of the metrics used to measure data quality, the definition of a quality metadata store (QMD), and the specification of data quality requirements such as user profiles and the query language (a possible shape for the QMD is sketched after this list).
  • Assessment Model: defines the quality scores, which is essential to establish how the quality indicators are going to be represented and interpreted.
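Point 4 names a quality metadata store (QMD) without fixing its representation. One possible shape, under our own assumptions rather than a specification from the paper, is a record of criterion scores per data source together with a helper that orders sources by a user's priority weights:

```python
# Hypothetical sketch of a QMD record and user-priority ranking of sources.
from dataclasses import dataclass, field

@dataclass
class SourceQuality:
    """One QMD record: quality criterion scores in [0, 1] for a data source."""
    source: str
    scores: dict = field(default_factory=dict)

def rank_sources(qmd, priorities):
    """Order sources by the user-weighted sum of their stored quality scores."""
    def weighted(entry):
        return sum(w * entry.scores.get(criterion, 0.0)
                   for criterion, w in priorities.items())
    return sorted(qmd, key=weighted, reverse=True)

qmd = [
    SourceQuality("source_A", {"currency": 0.95, "believability": 0.60}),
    SourceQuality("source_B", {"currency": 0.70, "believability": 0.90}),
]
# A user who values currency over credibility (cf. the Figure 3 bullet below):
ranked = rank_sources(qmd, {"currency": 0.8, "believability": 0.2})
print([s.source for s in ranked])   # ['source_A', 'source_B']
```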
5. The Data Quality Manager will establish the basis for taking decisions during the identification of data sources in heterogeneous systems, such that:
  • The sources of data are classified on the basis of certain quality criteria, depending on the application domain; the scores must be stored in metadata for every source of data (see Figure 2).

Fig. 2. Data Quality Manager components definition. (Definition of quality criteria, metrics and indicators; quality metadata definition; quality assessment of data sources; population of the QMD.)

  • The quality aspects previously stored in the metadata are used, together with the user priorities, to select the best sources of information before the execution of the queries; for example, the user may prefer sources of information that are more current over those of greater credibility (see Figure 3).

Fig. 3. Selection of best data sources. (User quality priorities, the global/local schema mappings and the QMD are used to rank the data sources involved in the query.)

  • The query planning process is helped by considering data quality estimations to find the best combination for the execution plan (see Figure 4).

Fig. 4. Query planning. (The global query is partitioned into sub-queries and the candidate plans are ranked using the QMD and the user quality priorities.)

  • After query execution and the detection of inconsistent data, data quality might be used to perform data fusion (see Figure 5).

Fig. 5. Detection and resolution of data inconsistencies. (Per-source results of the executed query plan are checked for inconsistencies; inconsistent results are fused into a consistent query result.)

  • The ranking of the information sources is integrated with the quality criteria estimated by the user (see Figure 6); a sketch combining these last steps is given below.

Fig. 6. Ranking of query results. (Query integration and ranking of the consistent result using the QMD and the user quality priorities.)
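The sketch below, our own construction following the spirit of Figures 5 and 6 rather than a component of the DQM, detects extensional inconsistencies among per-source results that share a key, resolves each conflict from the higher-quality source, and ranks the merged answer by the user-weighted source quality; the source names, criteria and scores are invented for the example.

```python
# Hypothetical sketch: detect disagreement among per-source results, resolve
# from the best source, and rank the merged answer by source quality.

def integrate(results, source_quality, priorities):
    """results: {source: {key: value}}; returns (key, value, score, inconsistent) rows."""
    def source_score(source):
        return sum(w * source_quality[source].get(c, 0.0)
                   for c, w in priorities.items())

    merged = {}
    for source, rows in results.items():
        for key, value in rows.items():
            merged.setdefault(key, []).append((source_score(source), source, value))

    answer = []
    for key, candidates in merged.items():
        inconsistent = len({v for _, _, v in candidates}) > 1   # sources disagree on this key
        score, _, value = max(candidates, key=lambda c: c[0])   # resolve from the best source
        answer.append((key, value, score, inconsistent))
    return sorted(answer, key=lambda row: row[2], reverse=True)  # rank by source quality

results = {"src1": {42: "open"}, "src2": {42: "closed", 43: "open"}}
quality = {"src1": {"accuracy": 0.9}, "src2": {"accuracy": 0.6}}
print(integrate(results, quality, {"accuracy": 1.0}))
# key 42 is flagged as inconsistent and resolved from the higher-quality source
```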
CONCLUSION

We have shown that, although there has been considerable past work over a number of years on the resolution of semantic heterogeneity in multi-data-source systems, expressive data quality models and tools to utilise them remain to be developed [10]. The approach developed for information quality reasoning [21] provides some mechanisms for data source selection, but does not address many of the data quality factors identified in Table 3. Accordingly, we propose a Data Quality Manager as a framework to deal with data inconsistencies and lack of quality due to different sources, presenting a continuous process of data validation: definition of quality criteria, selection of the best data sources, ranking of query plans, detection and fusion of data inconsistencies, and ranking of query results considering the quality of the data sources and the user expectations. This work is already under way, and performance reporting of the tools developed will appear in the next twelve months.

ACKNOWLEDGEMENT

This work was supported by financial funding from Consejo Nacional de Ciencia y Tecnologia (CONACYT), Mexico.

REFERENCES

[1] D.M. Ahern, A. Clouse, and R. Turner, "CMMI Distilled: A Practical Introduction to Integrated Process Improvement", The SEI Series in Software Engineering, Addison Wesley Professional, 2003.
[2] P. Anokhin and A. Motro, "Data Integration: Inconsistency Detection and Resolution Based on Source Properties", Proc. of FMII 2001, 10th International Workshop on Foundations of Models for Information Integration, Viterbo, Italy, 2001.
[3] P. Anokhin and A. Motro, "Fusionplex: Resolution of Data Inconsistencies in the Integration of Heterogeneous Information Sources", Technical Report ISE-TR-03-06, Information and Software Engineering Dept., George Mason Univ., Fairfax, Virginia, 2003.
[4] C. Batini, M. Lenzerini and S.B. Navathe, "A Comparative Analysis of Methodologies for Database Schema Integration", ACM Computing Surveys, vol. 18, no. 4, pp. 323-364, 1986.
[5] P. Buneman, M. Liberman, C.J. Overton and V. Tannen, "Data Provenance", http://www.cis.upenn.edu/~wctan/DataProvenance.
[6] A. Charnes, W. Cooper, and E. Rhodes, "Measuring the Efficiency of Decision Making Units", European Journal of Operational Research, pp. 429-444, 1978.
[7] M.B. Chrissis, M. Konrad and S. Shrum, "CMMI: Guidelines for Process Integration and Product Improvement", The SEI Series in Software Engineering, Addison Wesley Professional, 2003.
[8] M. Gertz and I. Schmitt, "Data Integration Techniques Based on Data Quality Aspects", 3rd National Workshop on Federal Databases, Magdeburg, Germany, 1998.
[9] M. Gertz, "Managing Data Quality and Integrity in Federated Databases", Second Annual IFIP TC-11 WG 11.5 Working Conference on Integrity and Internal Control in Information Systems, Warrenton, Virginia, Kluwer Academic Publishers, 1998.
[10] M. Gertz, "Report on the Dagstuhl Seminar: Data Quality on the Web", SIGMOD Record, vol. 33, no. 1, Mar. 2004.
[11] C.L. Hwang and K. Yoon, "Multiple Attribute Decision Making: Methods and Applications, A State-of-the-Art Survey", Springer-Verlag, Berlin.
[12] ISO/IEC Joint Technical Committee 1 (JTC1), Subcommittee 7 (SC7), Working Group 10 (WG10); ISO 15504 consists of nine parts, 1998.
[13] H. Kon, E. Madnick, and M. Siegel, "Good Answers from Bad Data", Sloan WP#3868, 1995.
[14] G. Tayi and D. Ballou (guest editors), "Examining Data Quality", Communications of the ACM, vol. 41, no. 2, pp. 54-57, 1998.
[15] U. Leser and F. Naumann, "Query Planning with Information Quality Bounds", Proceedings of the 4th International Conference on Flexible Query Answering Systems (FQAS 2000), Warsaw, Poland, 2000.
[16] A. Motro and I. Rakov, "Estimating the Quality of Databases", Proceedings of FQAS 98: Third International Conference on Flexible Query Answering Systems, T. Andreasen, H. Christiansen, and H.L. Larsen, eds., pp. 298-307, Roskilde, Denmark, Springer-Verlag, Berlin, Germany, 1998.
[17] F. Naumann, "Data Fusion and Data Quality", Proceedings of the New Techniques & Technologies for Statistics Seminar, Sorrento, Italy, 1998.
[18] F. Naumann, "Quality-Driven Integration of Heterogeneous Information Systems", Proceedings of the 25th Very Large Data Bases Conference (VLDB99), Edinburgh, Scotland, 1999.
[19] F. Naumann and C. Rolker, "Do Metadata Models Meet IQ Requirements?", Proceedings of the International Conference on Information Quality, MIT, Cambridge, 1999.
[20] F. Naumann and C. Rolker, "Assessment Methods for Information Quality Criteria", Proceedings of the International Conference on Information Quality (IQ2000), Cambridge, Mass., 2000.
[21] F. Naumann, "From Databases to Information Systems: Information Quality Makes the Difference", Proceedings of the International Conference on Information Quality (IQ2001), Cambridge, Mass., 2001.
[22] F. Naumann, "Quality-Driven Query Answering for Integrated Information Systems", Lecture Notes in Computer Science LNCS 2261, Springer Verlag, Heidelberg, 2002.
[23] F. Naumann and M. Haeussler, "Declarative Data Merging with Conflict Resolution", Proceedings of the International Conference on Information Quality (IQ2002), Cambridge, Mass., 2002.
[24] F. Naumann, J. Freytag and U. Leser, "Completeness of Information Sources", Workshop on Data Quality in Cooperative Information Systems (DQCIS2003), 2003.
[25] L. Pipino, Y.W. Lee and R. Wang, "Data Quality Assessment", Communications of the ACM, vol. 45, no. 4, pp. 211-218, 2002.
[26] A. Parssian, S. Sarkar and V. Jacob, "Assessing Data Quality for Information Products", Proceedings of the 20th International Conference on Information Systems (ICIS1999), Charlotte, North Carolina, USA, pp. 428-433, 1999.
[27] E. Pierce, "Assessing Data Quality with Control Matrices", Communications of the ACM, vol. 47, no. 2, pp. 82-86, 2004.
[28] A. Sheth and J. Larson, "Federated Database Systems for Managing Distributed, Heterogeneous and Autonomous Databases", ACM Computing Surveys, vol. 22, no. 3, pp. 184-236, 1990.
[29] D.M. Strong, Y.W. Lee and R.Y. Wang, "Data Quality in Context", Communications of the ACM, vol. 40, no. 5, pp. 103-110, 1997.
[30] D.M. Strong, Y.W. Lee and R.Y. Wang, "10 Potholes in the Road to Information Quality", IEEE Computer, vol. 30, no. 8, pp. 38-46, 1997.
[31] Y. Wand and R. Wang, "Anchoring Data Quality Dimensions in Ontological Foundations", Communications of the ACM, vol. 39, no. 11, pp. 86-95, 1996.
[32] R.Y. Wang, V.C. Storey and C.P. Firth, "A Framework for Analysis of Data Quality Research", IEEE Transactions on Knowledge and Data Engineering, 1995.
[33] R. Wang, "A Product Perspective on Total Data Quality Management", Communications of the ACM, vol. 41, no. 2, pp. 58-65, 1998.
[34] Y.W. Lee, D.M. Strong and R.Y. Wang, "AIMQ: A Methodology for Information Quality Assessment", Information and Management, vol. 40, no. 2, pp. 133-146, 2002.
[35] L.M. MacKinnon, D.H. Marwick and H. Williams, "A Model for Query Decomposition and Answer Construction in Heterogeneous Database Systems", Journal of Intelligent Information Systems, 1998.
[36] H. Williams, H.T. El-Khatib and L.M. MacKinnon, "A Framework and Test-Suite for Assessing Approaches to Resolving Heterogeneity", Information and Software Technology, 2000.

Pilar Angeles obtained her first degree in computer engineering from the Universidad Nacional Autonoma de Mexico (UNAM) in 1993, a Diploma in Expert Systems from the Instituto Tecnologico Autonomo de Mexico (ITAM) in 1994, a Diploma in Telematic Systems from ITAM in 1995, and an M.Sc. in Computer Science, on quality in software engineering, from UNAM in 2000. Since 1989 she has worked on technical support for databases at Casa de Bolsa Probursa, Nissan Mexicana, Software AG, Sybase de Mexico and e-Strategy Mexico. Her recent research interests include data quality and heterogeneous databases. She is a founder member of the Quality in Software Engineering Mexican Association (AMCIS).

Lachlan M. MacKinnon is Reader in Computer Science, and Director of Postgraduate Study in Computer Science, at Heriot-Watt University. He has a first degree in Computer Science and a PhD in Intelligent Querying for Heterogeneous Databases. He researches and consults widely in data, information and knowledge technologies. He is a member of the IEEE, the British Computer Society, the ACM and AACE, immediate past Chair of the British National Conference on Databases, and upcoming Chair of the British HCI Conference. He has over 50 conference and journal publications in this area.