Addressing Quality Issues in Secondary Use of
Health Data
Kalinka Kaloyanova 1, 2, Ina Naydenova 2 and Zlatinka Kovacheva 2, 3
1
  Faculty of Mathematics and Informatics – Sofia University St. Kliment Ohridski, 5
James Bourchier Blvd., Sofia 1164, Bulgaria Sofia University, St. Kliment Ohridski, 15
Tsar Osvoboditel Blvd., Sofia, 1504, Bulgaria
2
  Institute of Mathematics and Informatics – Bulgarian Academy of Science, 8 Acad.
Georgi Bonchev Str., Sofia, 1113, Bulgaria
3
  University of Mining and Geology “St. Ivan Rilski”, Sofia, 1700, Bulgaria


             Abstract
             During the last two decades, medical data digitalization has grown
             constantly. This process raises a lot of challenges regarding data privacy,
             data interoperability, and data quality. Despite the variety of systems that
             manage and analyze medical data, in many cases, data is not properly
             collected and used. A significant part of these problems can be identified and
             overcome when the collected data is reused. Recent European initiatives to
             establish a common space for health data also create opportunities for more
             efficient secondary use of data. The paper discusses basic quality issues in
             the secondary use of data and how they could be addressed.

             Keywords
             Data quality, quality attributes, health data, secondary use of data, European
             Health Data Space (EHDS)

1. Introduction
      Many innovations during the last decades influence the health sector. In ad-
dition to new drugs and methods of treatment, new devices and software applica-
tions were used and large amounts of medical data were generated. Unfortunate-
ly, there are many cases where this data is not properly collected and documented.
Most frequently mentioned flaws concern health data interoperability, missing
data, and low data quality. The secondary use of already obtained data can be
applied not only as a mechanism for gaining more value from data but also as a
mechanism that reveals and solves many problems with data quality.
Information Systems & Grid Technologies: Fifteenth International Conference ISGT’2022, May 27–28, 2022, Sofia, Bulgaria
EMAIL: kkaloyanova@fmi.uni-sofia.bg (K. Kaloyanova); naydenova@gmail.com (I. Naydenova); zkovacheva@hotmail.
com (Z. Kovacheva)
ORCID: 0000-0003-0222-7607 (K. Kaloyanova); 0000-0002-9995-8299 (I. Naydenova); 0000-0001-7401-3072 (Z.
Kovacheva)

            © 2022 Copyright for this paper by its authors.
            Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
            CEUR Workshop Proceedings (CEUR-WS.org)
     The secondary use of medical and health data can be explored in different
directions. Data can be used for improving health care for patients, as well as for
optimizing health systems services at different levels – “personal care planning,
medicines development, safety monitoring, research, and policymaking” [11].
These optimizations cannot be achieved if the information collected does not
meet certain quality criteria.
     The secondary use of data is different from the primary use of data in many
aspects. In the case of health data, the primary data use is connected mainly with
individual care for patients. For example, clinical data is accumulated from di-
agnoses, treatment recommendations, prescribed medicine, etc. Personal data of
patients, as well as health insurance data, is also included. However, the health
data could include much more details – for example, data coming from different
medical devices or smartphone applications. In addition, the secondary use of
health (medical) data is connected with the use of aggregated data, coming from
different sources “…such as electronic health records, health insurance claims
and health insurance data” [4]. This data can be reprocessed for new purposes –
different types of research on the data, seeking cost-effectiveness for products and
services, resolving problems, etc.
     Most of the research discusses data quality in the case of the primary use of
health data. The reuse of data, on the other side, may set new requirements for the
data to change the criteria for their quality.
     In this paper we outline basic quality issues, concerning health data second-
ary use and propose useful recommendations for data processing with a focus on
data quality. We also briefly discuss some aspects related to the confidentiality of
the medical data and the legal basis for their processing for purposes other than
the original ones.

2. Secondary use of health data
    Secondary use of health data is related to the use of medical data for purposes
other than the reasons they were collected and stored initially. Medical data reuse
has many advantages over primary data use:
    • significant volumes of medical data are available as they are stored and
    processed in a variety of applications;
    • data is structured, in many cases even summarized and generalized;
    • data are collected in a certain period of time;
    • there is no need for physical interventions or other ways of collecting
    data.
    Apart from all considerations that traditionally are important when process-
ing data, several other aspects, such as legal and ethical ones, are of big impor-
tance with regard to health-related information.

                                        359
2.1. Privacy, legal and ethical considerations
     All European countries and institutions are seriously considering data pri-
vacy issues. The EU “General Data Protection Regulation (GDPR)” presents the
rules for the use of personal data that must be followed by all organizations.
The GDPR aims to ensure secure methods for data processing. In addition, this
regulation requires rules to be defined and implemented to achieve this goal. It
introduces six main principles that need to be followed when personal data is
processing: (1) lawfulness, fairness, and transparency; (2) purpose limitation; (3)
data minimization; (4) accuracy; (5) storage limitation; and (6) integrity and con-
fidentiality [6].
     The General Data Protection Regulation is focused on the protection of in-
dividual data. However, medical data has the potential to be used for purposes
that affect a large part of society, even in a form that does not contain personal
information. It is therefore essential that the ethical aspects of the use of medical
data be regulated, too.
     As for data reuse, Recital 50 of GDPR indicates that the secondary use of
personal data should be compatible with the reasons for the initial collection and
use of data [6]. Furthermore, according to Article 9 health-related personal data
is considered as “sensitive” and it is differentiated as a “special category” of data.
The special categories require extra attention and need more protection because of
their sensitivity. Ten conditions for processing special category data are presented
in Article 9. For lawfully processing of special category data, both a lawful basis
under Article 6 and a separate condition for processing under Article 9 should be
identified [6]. The two justifications should not be linked. To avoid unacceptable
distribution of sensitive information, two main techniques are used: anonymisa-
tion, where personal information is deleted (or permanently replaced by unrelated
characters), and pseudonymisation, where sensitive data is encrypted in a way
that allows it to be re-identified with the help of additional information.
     Further, different countries could provide at the national level specific initia-
tives, procedures, and rules. In 2007, the American Medical Informatics Associa-
tion provided a broad discussion on the issues, related to the secondary use of
data [14]. The Finnish model for Secure use of data presents a detailed view of
the ethical aspects of national health data policy [1].

2.2. EU regulations focused on health data space
     European countries had made great efforts to create common principles for
the processing of medical data [2], [5].
     In May 2022, the European Commission (EC) published a proposal for a
REGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL
on the European Health Data Space [5]. This proposal aimed at the establishment

                                         360
of a common framework for health data sharing. “The general objective of the
intervention is to establish the rules governing the European Health Data Space to
ensure natural persons’ access and control over their own health data, to improve
the functioning of the single market for the development and use of innovative
health products and services based on health data, and to ensure that research-
ers, innovators, policy-makers and regulators can make the most of the available
health data for their work while preserving trust and security” [13].
     The scope of health data is expanded to include health records, social data,
administrative data, genetic and genomic data, public registries, clinical studies,
research questionnaires, and biomedical data such as biobanks [7].
     The document presents the new rights of patients regarding their personal
electronic health data records such as the right to have free of charge access to
a readable and accessible form of their personal electronic health data, “for ex-
ample through the personal health data access service” [13]. It also explains when
secondary use is allowed: “Data users are allowed to re-use health data only after
receiving a data permit from a competent authority” [10].
     It is expected the new regulation to encourage scientific research, as well
as the development of advanced products and services in the health area. In ad-
dition, it will strengthen the cross-border exchange of health data between the
different Member States.

3. Health data quality aspects
     Data quality dimensions represent measurable data quality characteristics.
The international standard ISO/IEC 25012 introduced a Data Quality Model with
fifteen major data quality characteristics – accuracy, completeness, consistency,
currentness, accessibility, creditability, compliance, efficiency, confidentiality,
availability, recoverability, portability, as well as precision, traceability, and un-
derstandability [8]. The importance of quality characteristics of health data is
broadly discussed in many publications and a lot of efforts are invested in achiev-
ing them, but the results are still not satisfactory [15], [9], [16].
     The standard presents a common understanding of the importance of data
characteristics, but the particular domain, where data is used, also has a strong
influence on data characteristics and their prioritization [9]. In the table below,
the quality dimensions, that are most relevant to health data, are listed and briefly
described.


                                         361
Table 1
Priority Health Data Quality Dimensions
 Dimensions         Description
 Accuracy           Degree of correct representation of the object
 Completeness       Reflects the presence of values of all required attributes
 Relevance          Presents how usable is data
 Timeliness         The time expectation for accessibility and availability of information
 Consistency        Data is presented in the same format
 Security           Security access to data
 Accessibility      Presents the degree of retrievability of the data

     The prioritization of these quality characteristics can further differ depending
on the type of records. A number of sources reported major challenges to the data
quality of electronic health records [3], [9],[15]:
     • Incompleteness – missing important details (attributes) of information;
     • Inconsistency – incompatible, conflicting information between different
     data sources or even in the same EHR record;
     • Inaccuracy – partially or completely incorrectly entered values.
     For secondary use, data quality is no less important [12]. But in this case,
quality dimensions can be viewed in a different way, compared to their use for
primary purposes.
     The incompleteness usually is reported as a leading data quality issue in the
cases of the primary use of data but it could be overcome in some cases of reuse.
When massive data sets are processed, missing or wrong values in some parts of
them will not have a significant impact on the conclusions. Inconsistency also
could be dismissed, if the detected cases are not too many and can be ignored.
Nevertheless, the data of a particular individual is not of significant importance,
the final results of the processing may have a significant influence because of the
potential to touch much more people.
     Data accuracy is a quality attribute that could be closely related to the con-
text of use, so the new views on data may insist on new levels of accuracy. Data
completeness is also quite sensitive to the specific objectives of the processing
and should be assessed again.
     In the primary use of data, where the focus is on individual care for patients,
the data is validated by a physician, respectively the inaccuracy and incomplete-
ness of the data are compensated by the expertise of the therapist. In addition, the
human factor can easily deal with inconsistencies in data obtained through dif-
ferent channels (consistency and integrity issues). The secondary use of medical
data relies much more on algorithmic and machine processing, where the results
of the analysis are much more sensitive to the quality of data. In the secondary

                                           362
use scenarios the problems related to the integration of data from various sources,
as well as the validity of data across relationships, emerge in full force and hinder
the effective use of data. The lack of sufficient details on the context in which
the data was collected is a major obstacle to identifying the reasons for the in-
consistency in the information and how to use it reliably. Without this context,
in the presence of contradictions, even the human factor would find it difficult to
determine which information is reliable.
     Data time characteristic is important, too, as some data may be outdated
because of the requirements of the new data processing. On the other side, the
research on data can be provided over different time intervals and in these cases,
the accumulated data with time characteristics could be extremely useful.

4. Supporting data quality in the main activities of the
secondary use of health data
     Evaluating and improving the quality of data through its secondary use is a
multi-component task. The major factor here is that it depends on the data set, ac-
cumulated for the purposes of its primary use. The secondary use is based on the
data volume extracted from existing applications, where data is presented in data
structures, corresponding to the goals of the primary use. The quality of collected
data also depends on the requirements of the primary use.
     To reach appropriate levels of data quality in secondary use, new require-
ments are enforced. This leads to a transformation of organizational processes
and changes the groups of data consumers and data providers and the relation-
ships between the stakeholders. New procedures and rules may be enforced. As
data will be used for new purposes, new competencies and skills may be required
from the participants in data processing, for example, related to data analytics.
     Figure 1 summarizes the main components that need to be considered and
reorganized in case of secondary use of data.


Figure 1: Secondary use of data – areas of changes


                                        363
4.1.    Data reformatting and quality criteria
     Evaluating existing data volumes
     The sources of health data are clinical trials, electronic health records, wear-
able technologies, health-insurance claims data, health registry data, etc. that are
accumulated gradually. In the case of secondary use mainly sets of summarized
health data are processed. This data should be consistent, trustable, and shared
across different organizations. It should also be considered that data must be
clean and compatible after being processed or coming from other systems but
these quality aspects should be reviewed in the context of new uses of data.
     This raises two main questions about data interoperability and the use of
standards.
     When data from different sources is collected, the main obstacle is data com-
patibility. The use of common data models is a big challenge even on a national
level in most domains. The efforts of many organizations and committees in Eu-
rope are now focused on these issues and many initiatives are recently presented,
especially for health data. Unfortunately, not all existing software applications
follow these standards. The latest EC initiatives could foster the European coun-
tries to resolve this problem, both on technical and legislative levels.
     Discovering new use cases
     The goals of secondary use usually differ from the purposes of the initial
data collection and use. The initial data collecting purposes could make a strong
influence on data entities and their characteristics. Data that are extracted from
operational systems and other software applications for routine activities should
be carefully checked and validated again. The level of granularity is essential in
determining new use cases and it could be different for the data reuse.
     Urgency, usefulness, and relevancy could be considered not only as impor-
tant quality characteristics but also to initiate new use cases, particularly for clini-
cians.
     Collecting the right (quality) data for reuse
     Not all gathered data will be valuable for the new use cases. Therefore, not
all data will be used in the new environment. The adequacy, regarding the scope
of the new data processing, should lead to the criteria for data extraction.
     Consistency is a high-level quality dimension and should be addressed in
any particular case. Here, consistency can also be considered in the terms of how
the extracted data sets are logically compatible with the new scenarios.
     Modeling data in new structures
     After the data extraction, the new volume of data should be organized into a
new structure and processed with new tools.


                                          364
    When data is used for specific research and analysis, in many cases the vol-
ume of data will be significantly smaller, so the quality attributes could be sup-
ported easily. Traditional relational databases usually fit these purposes.
    In other cases, data from different health data sets can be combined for larger
studies – statistics or descriptive analysis. Then other, non-relational decisions
could be applied.

4.2. Work organization restructuring
     The group of stakeholders in the health data processing usually includes pa-
tients, healthcare professionals, healthcare regulators, healthcare service provid-
ers, policy and lawmakers, information regulators, health system administrators,
and others. Among them, data producers and data consumers are most closely
involved in data quality aspects. The full engagement of the stakeholders in data
entry and data processing is important for reaching high levels of data quality.
     Among healthcare workers, clinicians, and managers of health organizations
are the most active users of the health software applications. But the list of the
stakeholders, as well as their prioritization, may change during the reuse of health
data, as new, revised or specific requirements will be set. Particular barriers to
data sharing may arise. This can also affect some work procedures and change
the roles and responsibilities of the participants. New rules need to be considered.
     Relevance, usefulness, completeness, comparability, and conciseness should
be considered as key quality attributes, related to the reorganization of data.

4.3. Application reengineering
    Building the new infrastructure that corresponds to the new use cases and
goals and the adequate technologies are critical for the efficiency of data rework.
Several considerations can be helpful here:
    • In some cases, only a part of the data needs to be used. This reflects on the
    size of the applications and the technologies used.
    • When new applications are developed, a part of the functionality that sup-
    ports daily operations on data or different user management could be avoided
    as research purposes do not require complete administration.
    However, the newly developed applications need to provide an appropriate
level of usability, a clear understandable interface, and good visualization of the
results.

5. Conclusion
    The paper highlights the importance of data quality for the secondary use of
health data, as successfully resolving data quality issues is a key prerequisite for


                                        365
significant results in many directions. The secondary use of health data would
lead to positive results not only in improving patient treatment but also in op-
timizing health system organization and spreading out innovations. Recent EC
initiatives will help in the establishment of a common environment for health
data sharing and will support the reuse of health data among the Member States.

6. Acknowledgments
      This research is supported by Project BG05M2P001-1.001-0004 “Universi-
ties for Science, Informatics and Technologies in the e-Society (UNITe)” financed
by Operational Program “Science and Education for Smart Growth”, co-financed
by the European Regional Development Fund and National Scientific Program
“eHealth” in Bulgaria.

7. References
[1]  Act on the Secondary Use of Social Welfare and Health Care Data, URL:
     https://stm.fi/en/secondary-use-of-health-and-social-data.
[2] COM(2018) 232 final, Towards a common European data space,
     Brussels,      2018,    URL:      https://eur-lex.europa.eu/legal-content/en/
     TXT/?uri=CELEX:52018DC0232.
[3] D. R Schlegel and G. Ficheur, Secondary use of patient data: review of the
     literature in Yearbook of medical informatics, 26(01), 2016, pp. 68–71.
[4] EU policy on secondary use of health data, Open Data Institute, July 2021,
     URL: https://theodi.org/article/white-paper-eu-policy-on-secondary-use-
     of-health-data.
[5] European Health Data Space, URL: https://ec.europa.eu/health/ehealth-
     digital-health-and-care/european-health-data-space_en.
[6] GDPR, General Data Protection Regulation – Official Legal Text,
     URL:https://gdpr-info.eu.
[7] G. Fortuna and L.a Bertuzzi, LEAK: The EU Commission’s data space for
     unleashing health data, URL: https://www.euractiv.com/section/digital/
     news/leak-the-eu-commissions-data-space-for-unleashing-health-data.
[8] ISO/IEC 25012 Software and Data Quality, URL: https://iso25000.com/
     index.php/en/iso-25000-standards/iso-25012.
[9] K. Kaloyanova, I. Naydenova, Z. Kovacheva, Addressing Data Quality in
     Healthcare, Proc. of the 14-th conference on Information Systems and Grid
     Technologies, ISGT 2021, Sofia, Bulgaria, May 28–29, 2021, CEUR-WS.
     org, vol-2933, pp. 155–164.
[10] K. Van Quathem, S. Choi & A. de Meneses, Leaked: Draft Version of the
     European Health Data Space Regulation, URL: https://www.insideprivacy.


                                       366
     com/international/european-union/leaked-draft-version-of-the-european-
     health-data-space-regulation.
[11] Open Data Institute, “Discover which European countries are ready for
     the secondary use of health data”, 2021, URL: https://theodi.org/project/
     discover-how-ready-your-country-is-for-the-secondary-use-of-health-data.
[12] P. R. Burton et al., Policies and strategies to facilitate secondary use of re-
     search data in the health sciences, International Journal of Epidemiology,
     2017, pp. 1729–1733.
[13] Proposal for a REGULATION OF THE EUROPEAN PARLIAMENT AND
     OF THE COUNCIL on the European Health Data Space URL: https://eur-
     lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52022PC0197.
[14] S. Fox, Tang PC, et al. Toward a National Framework for the Secondary
     Use of Health Data: An American Medical Informatics Association White
     Paper. J Am Med Inform Assoc., 2007 Jan 1; 14(1): 1–9.
[15] T. Botsis, G.Hartvigsen, F. Chen, C. Weng, (2010). Secondary use of EHR:
     Data quality issues and informatics opportunities. Summit on Translational
     Bioinformatics, 2010: 1–5.
[16] World Health Organization, 2020: Overview of the Data Quality Review
     (DQR) Framework and Methodology, URL: https://cdn.who.int/media/
     docs/default-source/data-quality-pages/who-dqrframework-v1-0-over-
     view.pdf.


                                        367