=Paper=
{{Paper
|id=None
|storemode=property
|title=A Semantic Framework for Data Quality Assurance in Medical Research
|pdfUrl=https://ceur-ws.org/Vol-1054/paper-14.pdf
|volume=Vol-1054
|dblpUrl=https://dblp.org/rec/conf/csws/ZhuQC13
}}
==A Semantic Framework for Data Quality Assurance in Medical Research==
1 Lingkai Zhu, 2 Kevin Quach, 3 Helen Chen*

1,3 School of Public Health and Health Systems, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada
2 Multi Organ Transplant Institute, University Health Network, Toronto, Ontario, M5G 2C4, Canada
{1l49zhu, 3helen.chen}@uwaterloo.ca, Kevin.Quach@uhn.ca
* Corresponding author

Abstract — The large amount of patient data amassed in Electronic Patient Record systems is of great value for medical research. Aggregating research-grade data from these systems is a laborious, often manual process. We present a semantic framework that incorporates a data semantic model and validation rules to accelerate the cleansing process for data in Electronic Patient Record systems. We demonstrate the advantages of this semantic approach in assuring data quality over traditional data analysis methods.

Keywords — data quality assurance, data quality measurement, ontology modelling, semantic framework, semantic web standards

I. INTRODUCTION

Patient care is a highly complex process that involves multiple services and care providers in the continuum of care. Patient data may be incorrectly recorded or missing during busy clinical encounters. Thus, it is often very difficult to use patient data aggregated from a hospital's Electronic Patient Record (EPR) directly in health research, which requires high-quality data. Traditionally, data quality checking is performed by manual inspection and information processing, with the assistance of pre-defined data entry forms that impose data validation rules. The "cleaned" data are then stored in a research database. However, such activities must be customized to the registry platform, such as Microsoft Excel or Access. These proprietary rules are hardly interoperable with other systems and are limited in function. We propose a semantic framework that explicitly describes the validation rules governing data quality. The semantic framework can also perform complex cross-reference checks that traditional error-checking mechanisms would have difficulty incorporating, especially when the list of conditions changes over time or across application domains. Therefore, a semantic framework can accelerate the production of high-quality research data compared with traditional techniques.

II. LITERATURE REVIEW

A. Categorizing Data Quality Problems

The quality of data is measured in multiple dimensions, i.e. "aspects or features of quality" [1]. We refer to three notable summaries of data quality dimensions [2][3][4]. Although there is no general agreement on the classification and definition of dimensions, we identified three that are most suitable in our context: completeness, consistency and interoperability.

B. Improving Data Quality via a Semantic Framework

Brueggemann and Gruening presented three examples that demonstrate how a domain ontology can help improve data quality management [5]. According to the authors, applying semantic techniques brings advantages such as suggesting candidate consistent values, using XML namespaces to keep track of data origins, and flexible annotation of results. We apply their three-phase methodology (construction, annotation and appliance) and demonstrate further benefits, e.g. rules expressed as semantic restrictions are more explicit than external algorithms.

Fürber and Hepp pursued a semantic approach to handling missing-value, false-value, and functional-dependency data quality problems [6]. They chose SPARQL queries to implement rules that detect data deficiencies, and noted, when handling missing values, that constraints such as cardinality are difficult to model in RDFS or OWL. However, OWL features such as owl:allValuesFrom and owl:oneOf are sufficient to model the constraints arising from the database schema we use. We will express our semantic framework in OWL DL and SWRL. OWL DL provides the class and property restrictions we need while remaining decidable. DL-Safe SWRL rules are sufficiently expressive for our data quality rules, while making it easy to reuse already-defined OWL classes and properties. This combination receives reasoning support from the Pellet reasoner (http://clarkparsia.com/pellet/).

III. METHODOLOGY

A. Architecture of Data Quality Assurance Framework

The data quality assurance framework is illustrated in Fig. 1 (rectangles represent data repositories/ontologies and circles represent software modules). The whole framework revolves around a transplant EPR ontology, which is built with the openEHR reference model ontology (http://trajano.us.es/~isabel/EHR/) as the core framework and refers to an ICD-10 ontology (https://dkm.fbk.eu/index.php/ICD-10_Ontology) for proper diagnosis definitions. Construction of the EPR ontology starts with a script that converts the database schema of an anonymized test medical database into an EPR taxonomy. The attributes in the database are captured in a class hierarchy and mapped into the openEHR ontology, and patients with data are imported as instances. Class restrictions and data quality validation rules are written in OWL and SWRL, respectively, and the Pellet reasoner handles reasoning for both. Through reasoning, data quality issues within the patient instances are recognized and annotated, which enables the data exporter module to clean the data and provide the cleaned data to researchers for analysis.

Fig. 1. Data Quality Assurance Framework Architecture

B. Data Quality Assessment by Dimensions

To assess EPR data, three data quality dimensions are summarized for reference:

1. Completeness

Completeness refers to the proportion of data available in the EPR relative to an expected complete dataset. This dimension can be used to examine the whole dataset as well as a single attribute.

Example: for all required attributes, instances that have at least one valid value (defined by owl:someValuesFrom restrictions) are annotated as complete.

2. Consistency

The consistency dimension refers to the logical coherence of relationships between data from different attributes, which frequently appear in the EPR domain. SWRL rules are employed to translate medical knowledge into logical connections.

Example: a post-transplant diagnosis cannot have a date earlier than the transplant date; otherwise, it is a pre-transplant diagnosis and needs to be recorded as an error. A SWRL rule using the date built-in is able to identify such temporal inconsistencies and annotate them.

3. Interoperability

The interoperability dimension refers to the compatibility of a data element with other information systems. When importing diagnosis data, our data aggregator looks up each value in an external, standardized taxonomy, such as ICD-10. If the value is found, an owl:sameAs statement is made to map the value to the standard diagnosis definition, and the data element is marked interoperable.

IV. PRELIMINARY RESULTS

Restrictions and rules are implemented reflecting the identified data quality dimensions. Annotation sub-classes, such as "patient with complete demographic info", are created under the patient class. A reasoner is applied to classify all patient instances into these sub-classes. For each instance, we detect how many criteria it meets; for each sub-class, we know how many patients fall into it. Custom filters, such as "patients who satisfy all rules", are also constructed. The results were manually reviewed and found correct.

V. DISCUSSION AND FUTURE WORK

Traditionally, data restrictions are enforced in an E-R database, but its limited functionality can only ensure the completeness and value range of data. Our semantic framework performs these functions and can additionally check data consistency and interoperability, which brings greater benefit to medical research data quality. The next step of our work is to repeat our methodology on a real, uncleaned EPR dataset. A research proposal has been submitted to a Toronto hospital with a transplant program for access to their dataset of 2000 patients. We will apply our semantic framework and identify any errors for review by researchers in the program. Once the framework's robustness and accuracy are established, EPR data in production can be checked regularly to ensure the quality of health data.

REFERENCES

[1] D. McGilvray, Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information. Morgan Kaufmann, 2010.
[2] C. Batini, C. Cappiello, C. Francalanci, and A. Maurino, "Methodologies for data quality assessment and improvement," ACM Computing Surveys (CSUR), vol. 41, no. 3, p. 16, 2009.
[3] C. Fürber and M. Hepp, "Towards a vocabulary for data quality management in semantic web architectures," in Proceedings of the 1st International Workshop on Linked Web Data Management, 2011, pp. 1–8.
[4] P. Oliveira, F. Rodrigues, and P. Henriques, "A formal definition of data quality problems," in International Conference on Information Quality, 2005.
[5] S. Brüggemann and F. Gruening, "Using domain knowledge provided by ontologies for improving data quality management," in Proceedings of I-Know, 2008, pp. 251–258.
[6] C. Fürber and M. Hepp, "Using semantic web resources for data quality management," in Knowledge Engineering and Management by the Masses, Springer, 2010, pp. 211–225.
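The completeness example above (annotating instances that have at least one valid value for every required attribute) could be sketched in OWL as follows. This is a minimal illustration in Turtle, not the authors' actual ontology: the namespace, class names (Patient, PatientWithCompleteDemographics, Gender) and property names (hasDateOfBirth, hasGender) are all hypothetical.

```turtle
@prefix :    <http://example.org/epr#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Hypothetical annotation sub-class for demographic completeness.
# A reasoner classifies a patient instance here only when each
# required attribute has at least one value (owl:someValuesFrom).
:PatientWithCompleteDemographics a owl:Class ;
    owl:equivalentClass [
        owl:intersectionOf (
            :Patient
            [ a owl:Restriction ;
              owl:onProperty :hasDateOfBirth ;
              owl:someValuesFrom xsd:date ]
            [ a owl:Restriction ;
              owl:onProperty :hasGender ;
              owl:someValuesFrom :Gender ]
        )
    ] .
```

Defining the sub-class with owl:equivalentClass (rather than rdfs:subClassOf) is what lets the reasoner classify instances into it automatically, which matches the classification-based counting described in the preliminary results.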
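The temporal-consistency example (a post-transplant diagnosis dated before the transplant date is an error) could be sketched as a SWRL rule in the usual human-readable syntax. The class and property names are hypothetical, and the sketch assumes the reasoner's swrlb:lessThan built-in supports comparison of xsd:date literals, as Pellet does for the standard comparison built-ins; the paper only states that a date built-in is used.

```
Patient(?p) ^ hasTransplantDate(?p, ?td) ^
hasDiagnosis(?p, ?d) ^ PostTransplantDiagnosis(?d) ^
hasDiagnosisDate(?d, ?dd) ^ swrlb:lessThan(?dd, ?td)
-> TemporalInconsistencyError(?d)
```

Because every variable in the head also occurs in the body bound to named individuals, the rule is DL-Safe, keeping the OWL DL + SWRL combination decidable as described in Section II.B.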
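The interoperability mapping could be sketched in Turtle as below. The icd10 namespace IRI, the individual names, and the :interoperable annotation property are hypothetical illustrations; the paper specifies only that an owl:sameAs statement links the local value to the standard ICD-10 definition.

```turtle
@prefix :      <http://example.org/epr#> .
@prefix owl:   <http://www.w3.org/2002/07/owl#> .
@prefix icd10: <http://example.org/icd10#> .

# Hypothetical result of the aggregator's lookup: the imported
# diagnosis value was found in the ICD-10 taxonomy, so it is
# mapped to the standard definition and marked interoperable.
:diagnosis_042 owl:sameAs icd10:I21 ;   # I21: acute myocardial infarction
    :interoperable true .
```

A value not found in the external taxonomy would simply receive no owl:sameAs statement, leaving it flagged for manual review rather than marked interoperable.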