=Paper=
{{Paper
|id=None
|storemode=property
|title=A Semantic Framework for Data Quality Assurance in Medical Research
|pdfUrl=https://ceur-ws.org/Vol-1054/paper-14.pdf
|volume=Vol-1054
|dblpUrl=https://dblp.org/rec/conf/csws/ZhuQC13
}}
==A Semantic Framework for Data Quality Assurance in Medical Research==
1 Lingkai Zhu, 2 Kevin Quach, 3 Helen Chen*

1,3 School of Public Health and Health Systems, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada
2 Multi Organ Transplant Institute, University Health Network, Toronto, Ontario, M5G 2C4, Canada
{1l49zhu, 3helen.chen}@uwaterloo.ca, Kevin.Quach@uhn.ca
* Corresponding author

Abstract — The large amount of patient data amassed in Electronic Patient Record systems is of great value for medical research. Aggregating research-grade data from these systems is a laborious, often manual process. We present a semantic framework that incorporates a data semantic model and validation rules to accelerate the cleansing process for data in Electronic Patient Record systems. We demonstrate the advantages of this semantic approach in assuring data quality over traditional data analysis methods.

Keywords — data quality assurance, data quality measurement, ontology modelling, semantic framework, semantic web standards

I. INTRODUCTION

Patient care is a highly complex process that involves multiple services and care providers in the continuum of care. Patient data may be incorrectly recorded or missing during busy clinical encounters. Thus, it is often very difficult to use patient data aggregated from a hospital's Electronic Patient Record (EPR) directly in health research, which requires high-quality data. Traditionally, data quality checking is performed by manual inspection and information processing, with the assistance of pre-defined data entry forms that impose data validation rules. The "cleaned" data are then stored in a research database. However, such activities must be customized to the registry platform, such as Microsoft Excel or Access. These proprietary rules are hardly interoperable with other systems and are limited in function. We propose a semantic framework that explicitly describes the validation rules governing data quality. The semantic framework can also perform complex cross-reference checks that traditional error-checking mechanisms would have difficulty incorporating, especially when the list of conditions changes over time or across application domains. Therefore, a semantic framework can accelerate the production of high-quality research data compared with traditional techniques.

II. LITERATURE REVIEW

A. Categorizing Data Quality Problems

The quality of data is measured in multiple dimensions, i.e. "aspects or features of quality" [1]. We refer to three notable summaries of data quality dimensions [2][3][4]. Although there is no general agreement on the classification and definition of dimensions, we identified three that are most suitable in our context: completeness, consistency and interoperability.

B. Improving Data Quality via a Semantic Framework

Brueggemann and Gruening presented three examples that demonstrate how a domain ontology can help improve data quality management [5]. According to the authors, applying semantic techniques brings advantages such as suggesting candidate consistent values, using XML namespaces to keep track of data origins, and flexible annotation of results. We apply their three-phase methodology (construction, annotation and appliance) and demonstrate further benefits, e.g. rules expressed as semantic restrictions are more explicit than external algorithms.

Fürber and Hepp pursued a semantic approach to handling missing-value, false-value, and functional-dependency data quality problems [6]. They chose SPARQL queries to implement rules that detect data deficiencies, and noted, when handling missing values, that constraints such as cardinality are difficult to model in RDFS or OWL. However, OWL features such as owl:allValuesFrom and owl:oneOf are sufficient to model the constraints arising from the database schema we use. We will express our semantic framework in OWL DL and SWRL. OWL DL provides the class and property restrictions we need while remaining decidable. DL-Safe SWRL rules are sufficiently expressive for our data quality rules, while making it easy to reuse already-defined OWL classes and properties. This combination receives reasoning support from the Pellet reasoner (http://clarkparsia.com/pellet/).

III. METHODOLOGY

A. Architecture of Data Quality Assurance Framework

The data quality assurance framework is illustrated in Fig. 1 (rectangles represent data repositories/ontologies and circles represent software modules). The whole framework revolves around a transplant EPR ontology, which is built with the openEHR reference model ontology (http://trajano.us.es/~isabel/EHR/) as the core framework and refers to an ICD-10 ontology (https://dkm.fbk.eu/index.php/ICD-10_Ontology) for proper diagnosis definitions. Construction of the EPR ontology starts with a script that converts the database schema of an anonymized test medical database into an EPR taxonomy. The attributes in the database are captured in a class hierarchy and mapped into the openEHR ontology, and patients with data are imported as instances. Class restrictions and data quality validation rules are written in OWL and SWRL, respectively, and the Pellet reasoner handles reasoning for both. Through reasoning, data quality issues within the patient instances are recognized and annotated, which enables the data exporter module to clean the data and provide the cleaned data to researchers for analysis.

Fig. 1. Data Quality Assurance Framework Architecture

B. Data Quality Assessment by Dimensions

To assess EPR data, three data quality dimensions are summarized for reference:

1. Completeness

Completeness refers to the proportion of data available in the EPR relative to an expected complete dataset. This dimension can be used to examine the whole dataset as well as a single attribute.

Example: for all required attributes, instances that have at least one valid value (defined by owl:someValuesFrom restrictions) are annotated as complete.

2. Consistency

The consistency dimension refers to the logical coherence of relationships between data from different attributes, which frequently appear in the EPR domain. SWRL rules are employed to translate medical knowledge into logical connections.

Example: a post-transplant diagnosis cannot have a date earlier than the transplant date; otherwise, it is a pre-transplant diagnosis and needs to be recorded as an error. A SWRL rule using the date built-in is able to identify such temporal inconsistencies and annotate them.

3. Interoperability

The interoperability dimension refers to the compatibility of a data element with other information systems. When importing diagnosis data, our data aggregator looks up each value in an external, standardized taxonomy, such as ICD-10. If the value is found, an owl:sameAs statement is made to map the value to the standard diagnosis definition, and the data element is marked interoperable.

IV. PRELIMINARY RESULTS

Restrictions and rules are implemented reflecting the identified data quality dimensions. Annotation sub-classes, such as "patient with complete demographic info", are created under the patient class. A reasoner is applied to classify all patient instances into these sub-classes. For each instance, we detect how many criteria it meets; for each sub-class, we know how many patients fall into it. Custom filters, such as "patients who satisfy all rules", are also constructed. The results were manually reviewed and found correct.

V. DISCUSSION AND FUTURE WORK

Traditionally, data restrictions are enforced in an E-R database, but its limited functionality can only ensure the completeness and value range of data. Our semantic framework performs these functions and can additionally check data consistency and interoperability, which brings greater benefit to medical research data quality. The next step of our work is to repeat our methodology on a real, uncleaned EPR dataset. A research proposal has been submitted to a Toronto hospital with a transplant program for access to their dataset of 2000 patients. We will apply our semantic framework and identify any errors for review by researchers in the program. Once the framework's robustness and accuracy are established, EPR data in production can be checked regularly to ensure the quality of health data.

REFERENCES

[1] D. McGilvray, Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information. Morgan Kaufmann, 2010.
[2] C. Batini, C. Cappiello, C. Francalanci, and A. Maurino, "Methodologies for data quality assessment and improvement," ACM Computing Surveys (CSUR), vol. 41, no. 3, p. 16, 2009.
[3] C. Fürber and M. Hepp, "Towards a vocabulary for data quality management in semantic web architectures," in Proceedings of the 1st International Workshop on Linked Web Data Management, 2011, pp. 1–8.
[4] P. Oliveira, F. Rodrigues, and P. Henriques, "A formal definition of data quality problems," in International Conference on Information Quality, 2005.
[5] S. Brüggemann and F. Gruening, "Using domain knowledge provided by ontologies for improving data quality management," in Proceedings of I-Know, 2008, pp. 251–258.
[6] C. Fürber and M. Hepp, "Using semantic web resources for data quality management," in Knowledge Engineering and Management by the Masses, Springer, 2010, pp. 211–225.
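The completeness example above (annotating instances that have at least one valid value for every required attribute) could be sketched in OWL as follows. This is a minimal illustration in Turtle, not the authors' actual ontology: the namespace, class names (Patient, PatientWithCompleteDemographics, Gender) and property names (hasDateOfBirth, hasGender) are all hypothetical.

```turtle
@prefix :    <http://example.org/epr#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Hypothetical annotation sub-class for demographic completeness.
# A reasoner classifies a patient instance here only when each
# required attribute has at least one value (owl:someValuesFrom).
:PatientWithCompleteDemographics a owl:Class ;
    owl:equivalentClass [
        owl:intersectionOf (
            :Patient
            [ a owl:Restriction ;
              owl:onProperty :hasDateOfBirth ;
              owl:someValuesFrom xsd:date ]
            [ a owl:Restriction ;
              owl:onProperty :hasGender ;
              owl:someValuesFrom :Gender ]
        )
    ] .
```

Defining the sub-class with owl:equivalentClass (rather than rdfs:subClassOf) is what lets the reasoner classify instances into it automatically, which matches the classification-based counting described in the preliminary results.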
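The temporal-consistency example (a post-transplant diagnosis dated before the transplant date is an error) could be sketched as a SWRL rule in the usual human-readable syntax. The class and property names are hypothetical, and the sketch assumes the reasoner's swrlb:lessThan built-in supports comparison of xsd:date literals, as Pellet does for the standard comparison built-ins; the paper only states that a date built-in is used.

```
Patient(?p) ^ hasTransplantDate(?p, ?td) ^
hasDiagnosis(?p, ?d) ^ PostTransplantDiagnosis(?d) ^
hasDiagnosisDate(?d, ?dd) ^ swrlb:lessThan(?dd, ?td)
-> TemporalInconsistencyError(?d)
```

Because every variable in the head also occurs in the body bound to named individuals, the rule is DL-Safe, keeping the OWL DL + SWRL combination decidable as described in Section II.B.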
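The interoperability mapping could be sketched in Turtle as below. The icd10 namespace IRI, the individual names, and the :interoperable annotation property are hypothetical illustrations; the paper specifies only that an owl:sameAs statement links the local value to the standard ICD-10 definition.

```turtle
@prefix :      <http://example.org/epr#> .
@prefix owl:   <http://www.w3.org/2002/07/owl#> .
@prefix icd10: <http://example.org/icd10#> .

# Hypothetical result of the aggregator's lookup: the imported
# diagnosis value was found in the ICD-10 taxonomy, so it is
# mapped to the standard definition and marked interoperable.
:diagnosis_042 owl:sameAs icd10:I21 ;   # I21: acute myocardial infarction
    :interoperable true .
```

A value not found in the external taxonomy would simply receive no owl:sameAs statement, leaving it flagged for manual review rather than marked interoperable.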