<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Semantic Framework for Data Quality Assurance in Medical Research</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lingkai Zhu</string-name>
          <email>l49zhu@uwaterloo.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author" corresp="yes">
          <string-name>Helen Chen</string-name>
          <email>helen.chen@uwaterloo.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kevin Quach</string-name>
          <email>Kevin.Quach@uhn.ca</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Public Health and Health Systems, University of Waterloo</institution>
          ,
          <addr-line>Waterloo, Ontario, N2L 3G1</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Multi Organ Transplant Institute, University Health Network</institution>
          ,
          <addr-line>Toronto, Ontario, M5G 2C4</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The large amounts of patient data amassed in Electronic Patient Record systems are of great value for medical research, yet aggregating research-grade data from these systems is a laborious, often manual process. We present a semantic framework that incorporates a data semantic model and validation rules to accelerate the cleansing of data in Electronic Patient Record systems, and we demonstrate the advantages of this semantic approach to data quality assurance over traditional data analysis methods.</p>
      </abstract>
      <kwd-group>
        <kwd>data quality assurance</kwd>
        <kwd>data quality measurement</kwd>
        <kwd>ontology modelling</kwd>
        <kwd>semantic framework</kwd>
        <kwd>semantic web standards</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Patient care is a highly complex process that involves
multiple services and care providers across the continuum of care.
Patient data may be incorrectly recorded or missed entirely
during busy clinical encounters. It is therefore often difficult
to use patient data aggregated from a hospital’s Electronic
Patient Record (EPR) directly in health research, which
requires high-quality data. Traditionally, data quality checking
is performed by manual inspection and information processing,
assisted by pre-defined data entry forms that impose
data validation rules; the “cleaned” data are then stored in a
research database. However, such activities must be
customized to the registry platform, such as Microsoft Excel
or Access, and the resulting proprietary rules are hardly
interoperable with other systems and limited in function. We
propose a semantic framework that explicitly describes the
validation rules governing data quality. The semantic framework
can also perform complex cross-reference checks that traditional
error-checking mechanisms have difficulty incorporating,
especially when the list of conditions changes over time or
across application domains. The use of a semantic framework
can therefore accelerate the production of high-quality research
data compared with traditional techniques.</p>
    </sec>
    <sec id="sec-5">
      <title>LITERATURE REVIEW</title>
      <p>A. Data Quality Dimensions</p>
      <p>
        The quality of data is measured in multiple dimensions,
which means “aspects or features of quality” [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. We refer to
three notable summaries of data quality dimensions [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ][
        <xref ref-type="bibr" rid="ref3">3</xref>
        ][
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Although there is no general agreement on the classification and
definition of these dimensions, we identified the three most
relevant to our context: completeness, consistency
and interoperability.
      </p>
      <p>B. Improving Data Quality via a Semantic Framework</p>
      <p>
        Brueggemann and Gruening presented three examples that
demonstrate how a domain ontology can help improve data
quality management [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. According to the authors, applying
semantic techniques brings advantages such as suggesting
candidate consistent values, using XML namespaces to track
data origins, and annotating results flexibly. We
apply their three-phase methodology (construction, annotation
and appliance) and demonstrate further benefits, e.g. rules
expressed as semantic restrictions are more explicit than
external algorithms.
      </p>
      <p>
        Fürber and Hepp pursued a semantic approach to handling
missing-value, false-value, and functional-dependency data
quality problems [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. They chose SPARQL queries to
implement rules that detect data deficiencies, noting that
some constraints, such as cardinality, are difficult to model
in RDFS or OWL. However, OWL features such as
owl:allValuesFrom and owl:oneOf are sufficient to model the
constraints arising from the database schema we use. We express
our semantic framework in OWL DL and SWRL: OWL DL provides the
class and property restrictions we need while remaining
decidable, and DL-safe SWRL rules are sufficiently expressive
for our data quality rules while making it easy to reuse already
defined OWL classes and properties. This combination receives
reasoning support from the Pellet reasoner
(http://clarkparsia.com/pellet/).
      </p>
    </sec>
    <sec id="sec-2">
      <title>METHODOLOGY</title>
      <p>A. Architecture of Data Quality Assurance Framework</p>
      <p>The data quality assurance framework is illustrated in Fig.
1 (rectangles and circles represent data repositories/ontologies
and software modules, respectively). The whole framework
revolves around a transplant EPR ontology, which is built with
the openEHR reference model ontology (http://trajano.us.es/~isabel/EHR/)
as the core framework and refers to an ICD-10 ontology
(https://dkm.fbk.eu/index.php/ICD-10_Ontology) for proper
diagnosis definitions. The construction of the EPR ontology starts
with a script that converts the database schema of an
anonymized test medical database into an EPR taxonomy. The
attributes in the database are captured in a class hierarchy and
mapped into the openEHR ontology, and patients with data
are imported as instances. Class restrictions and data quality
validation rules are written in OWL and SWRL, respectively,
and the Pellet reasoner handles reasoning for both. Through
reasoning, data quality issues within the patient instances are
recognized and annotated, which enables the data exporter
module to clean the data and provide the cleaned data to
researchers for analysis.</p>
      <p>B. Data Quality Assessment by Dimensions</p>
      <p>To assess EPR data, three data quality dimensions are
summarized for reference:</p>
      <sec id="sec-2-1">
        <title>1. Completeness</title>
        <p>Completeness refers to the proportion of data that is
available in the EPR relative to an expected complete dataset.
This dimension can be applied to the whole dataset as
well as to a single attribute.</p>
        <p>Example: for every required attribute, instances that have at
least one valid value (enforced by owl:someValuesFrom restrictions)
are annotated as complete.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2. Consistency</title>
        <p>The consistency dimension refers to the logical coherence
of relationships between data from different attributes, which
appear frequently in the EPR domain. SWRL rules are
employed to translate medical knowledge into these logical
connections.</p>
        <p>Example: a post-transplant diagnosis cannot have a date
earlier than the transplant date; otherwise, it is a pre-transplant
diagnosis and must be recorded as an error. A SWRL rule
using the date built-ins can identify such temporal
inconsistencies and annotate them.</p>
      </sec>
      <sec id="sec-2-3">
        <title>3. Interoperability</title>
        <p>The interoperability dimension refers to the compatibility
of a data element with other information systems. When
importing diagnosis data, our data aggregator looks up
each value in an external, standardized taxonomy, such as
ICD-10. If the value is found, an owl:sameAs statement is
added to map the value to the standard diagnosis definition,
and the data element is marked as interoperable.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>PRELIMINARY RESULTS</title>
      <p>Restrictions and rules were implemented to reflect the
identified data quality dimensions. Annotation sub-classes,
such as "patient with complete demographic info", were created
under the patient class, and a reasoner was applied to classify all
patient instances into these sub-classes. For each instance, we
can determine how many criteria it meets; for each sub-class, we
know how many patients fall into it. Custom filters, such as
"patients who satisfy all rules", were also constructed. The
results were manually reviewed and found to be correct.</p>
    </sec>
    <sec id="sec-4">
      <title>DISCUSSION AND FUTURE WORK</title>
      <p>Traditionally, data restrictions are enforced in an
entity-relationship database, but their limited functionality can
only ensure the completeness and value range of data. Our semantic
framework performs these functions and can additionally check
data consistency and interoperability, which brings greater
benefit to medical research data quality.</p>
      <p>The next step of our work is to repeat our methodology on
a real, uncleaned EPR dataset. A research proposal has
been submitted to a Toronto-based hospital with a transplant
program for access to its dataset of 2000 patients. We will
apply our semantic framework and identify any errors for
review by researchers in the program. Once the framework’s
robustness and accuracy are established, EPR data in production
can be checked regularly to ensure the quality of health data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>McGilvray</surname>
          </string-name>
          ,
          <source>Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information™</source>
          . Morgan Kaufmann,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Batini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cappiello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Francalanci</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Maurino</surname>
          </string-name>
          , “
          <article-title>Methodologies for data quality assessment and improvement</article-title>
          ,
          <source>” ACM Computing Surveys (CSUR)</source>
          , vol.
          <volume>41</volume>
          , no.
          <issue>3</issue>
          , p.
          <fpage>16</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Fürber</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Hepp</surname>
          </string-name>
          , “
          <article-title>Towards a vocabulary for data quality management in semantic web architectures</article-title>
          ,” in
          <source>Proceedings of the 1st International Workshop on Linked Web Data Management</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rodrigues</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Henriques</surname>
          </string-name>
          , “
          <article-title>A formal definition of data quality problems</article-title>
          ,” in
          <source>International Conference on Information Quality</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Brüggemann</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Gruening</surname>
          </string-name>
          , “
          <article-title>Using domain knowledge provided by ontologies for improving data quality management</article-title>
          ,” in
          <source>Proceedings of I-Know</source>
          , pp.
          <fpage>251</fpage>
          -
          <lpage>258</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Fürber</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Hepp</surname>
          </string-name>
          , “
          <article-title>Using semantic web resources for data quality management</article-title>
          ,” in
          <source>Knowledge Engineering and Management by the Masses</source>
          , Springer,
          <year>2010</year>
          , pp.
          <fpage>211</fpage>
          -
          <lpage>225</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>