=Paper= {{Paper |id=Vol-3741/paper47 |storemode=property |title=A Minimum Metadataset for Data Lakes Supporting Healthcare Research |pdfUrl=https://ceur-ws.org/Vol-3741/paper47.pdf |volume=Vol-3741 |authors=Davide Piantella,Pierluigi Reali,Priyansh Kumar,Letizia Tanca |dblpUrl=https://dblp.org/rec/conf/sebd/PiantellaRKT24 }} ==A Minimum Metadataset for Data Lakes Supporting Healthcare Research== https://ceur-ws.org/Vol-3741/paper47.pdf
                                A Minimum Metadataset for Data Lakes Supporting
                                Healthcare Research
                                (Discussion paper)

                                Davide Piantella* , Pierluigi Reali, Priyansh Kumar and Letizia Tanca†
                                Politecnico di Milano - Department of Electronics, Information, and Bioengineering
                                Via G. Ponzio 34/5, 20133 Milano, Italy


                                           Abstract
                                           Data lakes have emerged as a solution for storing vast amounts of heterogeneous and often
                                           unstructured data, responding to the growing need for flexible data storage, integration, and analytics in
                                           different domains. Meanwhile, the digital transformation of healthcare processes has led to an exponential
                                           increase in various types of health records, necessitating efficient data management solutions and making
                                           this domain an ideal arena for experimenting with data lakes. In data lakes, effective metadata extraction and
                                           management are crucial for describing raw data, establishing connections, and ensuring interoperability
                                           among datasets ingested into the lake. To address this, we propose a minimum set of metadata tailored
                                           for clinical research, which includes relevant information common to significant branches of healthcare.
                                           Our metadataset not only streamlines data ingestion processes but also enhances the accessibility and
                                           usability of healthcare datasets for research purposes. By standardizing the collected metadata within
                                           the clinical research domain, we also facilitate data integration, analysis, and exploration, supporting
                                           comprehensive data description and management within the data lake environment.

                                           Keywords
                                           metadata, healthcare, data lakes, interoperability




                                1. Introduction
                                Responding to the pressing demand for flexible and easily accessible data analytics [1], an
                                emerging trend involves data lakes as repositories for vast amounts of data and documents in
                                the big-data context [2]. Notably, data lakes operate without a predefined schema, enabling the
                                ingestion of raw data in various formats (including relational data, images, text, data streams,
                                and logs) without the need for prior preprocessing [3]. This adaptability empowers users and
                                organizations to seamlessly store and access their data, facilitating data analytics, data-driven
                                applications, and machine learning tasks.
                                   In the field of medicine, the transition to digital healthcare processes and services has led to
                                an exponential increase in medical data. Within hospitals, daily operations generate a multitude
                                of (often unstructured) digital documents, including medical images, nursing notes, discharge

                                SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy
                                * Corresponding author.
                                † PNRR - M4 C2, Invest 1.3 - D.D. 1551.11-10-2022, PE00000004). CUP MICS D43C22003120001.
                                Email: davide.piantella@polimi.it (D. Piantella); pierluigi.reali@polimi.it (P. Reali); priyansh.kumar@mail.polimi.it
                                (P. Kumar); letizia.tanca@polimi.it (L. Tanca)
                                ORCID: 0000-0003-1542-0326 (D. Piantella); 0000-0003-3041-4004 (P. Reali); 0000-0003-2607-3171 (L. Tanca)
                                         © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




letters, and laboratory results. Moreover, advancements in medical devices, applications, and
monitoring technologies have digitized patient data, resulting in the collection, analysis, and
storage of vast amounts of heterogeneous information. According to estimates, by 2025 the annual
growth rate of healthcare data will reach 36%, surpassing the circa 27% of data in general [4].
These challenges make medicine an ideal domain for evaluating the effectiveness of
metadata.
   With the increasing availability of Electronic Health Records facilitating real-world-evidence
clinical trials [5], a significant application of healthcare data management is medical research.
In this context, the ability to collect and analyze data from heterogeneous sources is crucial [6],
and, given the diverse formats of healthcare data and its sheer volume, a data lake is a
promising solution. However, since the datasets ingested by a data lake are extremely heterogeneous,
accessing and manipulating the stored raw data can be very expensive in terms of computation
and time. Effective metadata extraction and management, which also establish
connections among the ingested datasets [7], are therefore essential for describing raw data:
metadata provide valuable information about the data without the need to directly analyze
the datasets.
   To achieve this, we propose a minimum set of metadata (i.e., a minimum metadataset) for
the context of clinical research, which encloses the relevant information common to the main
branches of healthcare. Data feeders can then specify additional metadata that further describe
the datasets.


2. Methodology and Related Work
We consider metadata models and tools specifically tailored for the healthcare context.
   Our primary objective is to construct a minimum metadata model that not only offers essential
information relevant to associated healthcare data but also creates a distinctive framework
that facilitates the sharing of clinical data coming from diverse formats and sources. This
improves interoperability, which, in turn, supports seamless data exchange and collaboration
across different healthcare organizations. Our proposed minimum metadata model serves as a
foundation that can be further enhanced and specialized for each specific scope of use.
   We now briefly describe the existing clinical metadata models and management tools we
analyzed, which contributed to the design of our minimum metadataset.

2.1. Genosurf
Genosurf [8] is a metadata integration and search system designed to efficiently analyze
genomics datasets from various sources in biological and clinical research settings. It leverages
a Genomics Conceptual Model (GCM) and implements a multi-ontology semantic search system.
The metadata repository includes millions of metadata entries from multiple datasets, focusing
on significant genomics data. The system offers a web-based interface that allows users to
perform targeted searches based on specific metadata attributes and values. In this way, users
can inspect descriptions of matching datasets, explore the related metadata, and obtain the link
to the original datasets. Moreover, Genosurf facilitates free-text searches and offers query
preparation functionalities for further data processing.

2.2. PDXFinder
Patient-Derived tumor Xenograft (PDX) models are essential tools to study the effects of
chemotherapy on tumors. PDXFinder [9] provides centralized access to an extensive collection
of PDX models. It supports advanced search functionalities that enable users to filter and refine
their searches based on specific criteria such as cancer type, molecular characteristics, and
treatment history. As a result, researchers can access detailed information about each PDX
model, including clinical annotations, molecular profiling data, histopathological features, and
associated research publications.

2.3. HL7 - FHIR
The Health Level Seven International (HL7) Fast Healthcare Interoperability Resources
(FHIR) [10] is widely acknowledged as a fundamental metadata model for achieving healthcare
data interoperability, presenting a standardized approach to data representation and exchange.
FHIR provides extensive information, encompassing various aspects of healthcare data, such
as patient demographics, clinical observations, medications, and procedures. These resources
are designed to be easily accessible through RESTful APIs [11], which further enhances the appeal
and ease of implementation of the standard. The standardized nature of FHIR and its support for RESTful APIs
can enable data exchange and sharing between diverse healthcare systems, regardless of their
underlying technology and platforms.
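
To make the RESTful access pattern concrete, the following minimal sketch builds the URL of a FHIR read interaction (GET [base]/[type]/[id]) and extracts two demographic fields from a Patient resource. The base URL, the resource identifier, and the sample JSON are hypothetical, and no network request is made; it is an illustration, not a client implementation.

```python
import json


def fhir_read_url(base_url: str, resource_type: str, resource_id: str) -> str:
    """Return the URL of a FHIR 'read' interaction: [base]/[type]/[id]."""
    return f"{base_url.rstrip('/')}/{resource_type}/{resource_id}"


# Hypothetical Patient resource, shaped like the JSON a FHIR server could return.
SAMPLE_PATIENT = json.loads("""
{
  "resourceType": "Patient",
  "id": "example-123",
  "gender": "female",
  "birthDate": "1980-05-17"
}
""")


def patient_demographics(resource: dict) -> dict:
    """Extract the demographic fields used in this sketch from a Patient resource."""
    if resource.get("resourceType") != "Patient":
        raise ValueError("not a Patient resource")
    return {
        "gender": resource.get("gender"),
        "birth_date": resource.get("birthDate"),
    }


# Hypothetical server base URL (assumption, not a real endpoint).
url = fhir_read_url("https://fhir.example.org/base/", "Patient", "example-123")
demographics = patient_demographics(SAMPLE_PATIENT)
```

Because every resource type is addressed with the same URL pattern, the same helper would serve Observation or Procedure resources by changing only the type argument.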

2.4. Datacite
Datacite [12, 13] is an internationally recognized organization that provides persistent identi-
fiers, known as DOIs (Digital Object Identifiers), for research data. Although not specifically a
metadata model, Datacite significantly contributes to data discoverability, access, and reference.
By assigning DOIs to research datasets, Datacite ensures their long-term accessibility and
establishes a standardized approach for referencing and linking data. The metadata offered
by Datacite includes essential details about the dataset, such as its title, authors, publisher,
publication date, version, and any related resources. This metadata is usually presented in a
standardized format based on a metadata schema that defines specific data elements and their
required or recommended attributes.
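
As an illustration of the details listed above, the sketch below models a citation record with the fields named in the text and derives the resolver URL of its DOI. The class name, the field names, and the sample DOI are hypothetical and do not reproduce the official Datacite schema; the syntax check is deliberately loose.

```python
from dataclasses import dataclass, field


@dataclass
class CitationRecord:
    """Illustrative citation metadata for a dataset (not the official schema)."""
    doi: str
    title: str
    authors: list[str]
    publisher: str
    publication_date: str
    version: str = "1.0"
    related_resources: list[str] = field(default_factory=list)

    def resolver_url(self) -> str:
        """Return the https://doi.org/ URL that resolves this DOI."""
        # DOI names start with the directory indicator "10." and contain a "/".
        if not self.doi.startswith("10.") or "/" not in self.doi:
            raise ValueError(f"unexpected DOI syntax: {self.doi!r}")
        return f"https://doi.org/{self.doi}"


# Hypothetical record; the DOI below is a placeholder, not a registered one.
record = CitationRecord(
    doi="10.1234/example-dataset",
    title="Example clinical imaging dataset",
    authors=["A. Researcher", "B. Clinician"],
    publisher="Example University",
    publication_date="2024-06-23",
)
resolver = record.resolver_url()
```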

2.5. EOSC FAIR principles
The European Open Science Cloud (EOSC) initiative¹ aims to create a seamless and open
research environment by providing access to research data, services, and infrastructures across
Europe. EOSC recently published its guidelines and recommendations to promote the Findability,
Accessibility, Interoperability, and Reusability (FAIR) of research data and services [14]. The
EOSC FAIR principles emphasize the importance of making research data and related resources
easily discoverable, accessible, and interoperable. By adhering to these principles, researchers and
data providers adopt standardized metadata (categorized as mandatory, recommended, and optional),
data formats, and interoperability standards. This enables efficient data discovery, access, integration,
and reuse, promoting collaboration, knowledge sharing, and interdisciplinary research within
the EOSC ecosystem.

¹ https://digital-strategy.ec.europa.eu/en/policies/open-science-cloud. These principles align with the broader FAIR
  data movement, which seeks to maximize the value and impact of research data by ensuring its usability and
  long-term preservation.

2.6. Standardized terminologies and coding systems
Standardized terminologies and coding systems are vital for achieving healthcare data
interoperability [15]: they provide a common vocabulary and coding structure, ensuring that clinical
concepts and metadata are represented consistently across different
healthcare systems and datasets. For example, SNOMED-CT [16] is a comprehensive clinical
vocabulary widely used in healthcare. It allows for the precise and uniform encoding of clinical
observations, diagnoses, procedures, and other medical concepts. Similarly, LOINC [17] is a
standardized coding system specifically designed for clinical laboratory observations and results.
It provides a unified representation of laboratory tests, measurements, and observations.
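
The benefit of a shared coding structure can be sketched as follows: different local labels are normalized to one coded concept, so that datasets from different sources become comparable. The lookup table and the function name are hypothetical, and the two codes are given for illustration only; they should be checked against the official SNOMED-CT and LOINC releases before any real use.

```python
from typing import NamedTuple, Optional


class CodedConcept(NamedTuple):
    system: str   # terminology the code belongs to
    code: str     # identifier of the concept within that terminology
    display: str  # human-readable name


# Illustrative concepts (assumption: verify against the official releases).
HEMOGLOBIN = CodedConcept("LOINC", "718-7", "Hemoglobin [Mass/volume] in Blood")
HYPERTENSION = CodedConcept("SNOMED-CT", "38341003", "Hypertensive disorder")

# Hypothetical local labels, as different hospitals might write them.
LOCAL_TO_STANDARD = {
    "hb": HEMOGLOBIN,
    "hgb": HEMOGLOBIN,
    "hemoglobin": HEMOGLOBIN,
    "high blood pressure": HYPERTENSION,
    "hypertension": HYPERTENSION,
}


def normalize(local_label: str) -> Optional[CodedConcept]:
    """Map a free-text local label to a coded concept, or None if unknown."""
    return LOCAL_TO_STANDARD.get(local_label.strip().lower())
```

With such a table, `normalize(" Hb ")` and `normalize("hemoglobin")` return the same concept, while an unmapped label returns `None` and can be flagged for manual curation.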

2.7. Clinical Document Architecture
The Clinical Document Architecture (CDA) [18], developed by HL7, serves as a minimum
metadata model for exchanging clinical documents. CDA defines the structure and semantics
of clinical records, enabling the standardized sharing of healthcare information. It enables
interoperability across different healthcare organizations by providing a common framework for
representing patient clinical summaries, discharge letters, progress notes, and other healthcare
documents.


3. Minimum metadataset
We report in Table 1 our proposed minimum metadataset for healthcare. Following several
research works [19, 20, 21], we organize our model into three main categories: (i)
administrative metadata, (ii) data provenance metadata, and (iii) descriptive metadata.

 Category           Attributes
 -----------------  ------------------------------------------------------------
 Administrative     GUID, Creator, Owner, Rights, Terms of access
 Data provenance    Publication year, Upload date, Acquisition method,
                    Acquisition tools, Download URL, Checksum, Encryption
                    algorithm, File version, Update/modification date,
                    Update frequency
 Descriptive        File description, File format, Min age, Max age, Ethnicity,
                    Patient sex, Blood group, Primary site, Collection site,
                    Disease names, Disease types, Disease variants

Table 1
Proposed minimum metadataset for healthcare

Administrative metadata refer to the administrative aspects of data management, facilitat-
ing effective data governance, management, and administration. They include the following
metadata, regarding ownership, access authorizations, and policies:
    • GUID (Globally Unique Identifier): a unique identifier assigned to each dataset for identifi-
      cation and referencing purposes.
    • Creator: the entity responsible for creating or generating the dataset.
    • Owner: the entity that owns the dataset and holds responsibility for its management.
    • Rights: the permissions or restrictions associated with accessing and using the dataset.
    • Terms of access: the terms and conditions that govern the access and usage of the dataset.
Data provenance metadata serve as a detailed record of the data lifecycle. They offer valuable
insights on reliability and quality by capturing information about collection methods, processing
steps, and modifications:
    • Publication year: the year in which the dataset was officially published or made available.
    • Upload date: the date when the dataset was uploaded into the repository.
    • Acquisition method: a description of the acquisition process employed to collect the data.
    • Acquisition tools: the software and hardware used to collect the data, with the related
      version details.
    • Download URL: the web address from which users can download the dataset to their
      local systems.
    • Checksum: a hash value that acts as a verification mechanism for data integrity.
    • Encryption algorithm: if, for privacy reasons, the dataset is encrypted, this reports the
      algorithm used to protect sensitive information.
    • File version: an identifier or label that denotes the version or the revision of the dataset.
      It allows healthcare professionals, researchers, and stakeholders to track and manage
      different instances of the dataset, ensuring proper documentation and version control.
    • Update/modification date: the date when the dataset was last updated or modified.
      This provides valuable information about the currency and freshness of the data, allowing
      users to ascertain the relevance and applicability of the dataset for their specific needs.
    • Update frequency: the regularity with which the dataset is updated.
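
As a minimal sketch of how the Checksum attribute could be produced and later verified, the code below hashes a file in chunks and compares the result with the recorded value. The choice of SHA-256, the function names, and the sample file are assumptions; the model itself does not prescribe a hash algorithm.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large datasets need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected: str, algorithm: str = "sha256") -> bool:
    """Return True if the file still matches the checksum stored in its metadata."""
    return file_checksum(path, algorithm) == expected.lower()


# Usage on a throwaway file: record the checksum, then detect a modification.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "dataset.csv"
    sample.write_bytes(b"patient_sex,min_age\nF,42\n")
    recorded = file_checksum(sample)          # value stored as metadata
    intact = verify_checksum(sample, recorded)
    sample.write_bytes(b"patient_sex,min_age\nF,43\n")
    modified_passes = verify_checksum(sample, recorded)
```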
Descriptive metadata refer to the content and characteristics of a dataset, freeing the users
from the need to examine the resource itself in detail. This category is essential for classifying
and organizing datasets, enabling efficient search and retrieval, and facilitating decision-making
about which resources better fit the needs of the users:
    • File description: a brief description or summary of the dataset, providing an overview of
      its purpose, scope, and data content.
    • File format: the specific file format in which the dataset is stored (e.g., CSV, XML, or
      DICOM).
    • Min age: the minimum age of the patients represented in the dataset.
    • Max age: the maximum age of the patients represented in the dataset.
    • Ethnicity: the ethnic background of the patients represented in the dataset.
    • Patient sex: the sex of the patients included in the dataset.
    • Blood group: the blood type of the patients included in the dataset.
    • Primary site: the primary anatomical site or organ associated with the data collected.
    • Collection site: the location or institution where the data was collected or originated.
    • Disease names: names of the diseases or medical conditions primarily represented in the
      dataset.
    • Disease types: the classification or type of diseases or medical conditions.
    • Disease variants: specific variants or subtypes of diseases or medical conditions.
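
The three categories above can be sketched as a single record type. The attribute names follow Table 1, whereas the Python types, the use of lists for multi-valued attributes, the optional encryption field, and all sample values are assumptions made for illustration only.

```python
import uuid
from dataclasses import asdict, dataclass
from datetime import date
from typing import Optional


@dataclass
class MinimumMetadata:
    # Administrative metadata
    guid: str
    creator: str
    owner: str
    rights: str
    terms_of_access: str
    # Data provenance metadata
    publication_year: int
    upload_date: date
    acquisition_method: str
    acquisition_tools: list[str]
    download_url: str
    checksum: str
    file_version: str
    update_date: date
    update_frequency: str
    # Descriptive metadata
    file_description: str
    file_format: str
    min_age: int
    max_age: int
    ethnicity: list[str]
    patient_sex: list[str]
    blood_group: list[str]
    primary_site: str
    collection_site: str
    disease_names: list[str]
    disease_types: list[str]
    disease_variants: list[str]
    # Provenance attribute that only applies to encrypted datasets
    encryption_algorithm: Optional[str] = None

    def __post_init__(self) -> None:
        if self.min_age > self.max_age:
            raise ValueError("min_age must not exceed max_age")


# Hypothetical record for a fictitious imaging dataset.
record = MinimumMetadata(
    guid=str(uuid.uuid4()),
    creator="Example Hospital, Radiology Unit",
    owner="Example Hospital",
    rights="Research use only",
    terms_of_access="Access upon approval of a data use agreement",
    publication_year=2024,
    upload_date=date(2024, 3, 1),
    acquisition_method="Dermoscopic imaging during routine visits",
    acquisition_tools=["Dermatoscope X (firmware 2.1)"],
    download_url="https://datalake.example.org/datasets/skin-lesions",
    checksum="<sha256 of the file>",
    file_version="1.0",
    update_date=date(2024, 3, 1),
    update_frequency="yearly",
    file_description="Dermoscopic images of pigmented skin lesions",
    file_format="DICOM",
    min_age=18,
    max_age=85,
    ethnicity=["not reported"],
    patient_sex=["female", "male"],
    blood_group=["not reported"],
    primary_site="Skin",
    collection_site="Milano, Italy",
    disease_names=["Melanoma"],
    disease_types=["Neoplasm"],
    disease_variants=["Superficial spreading melanoma"],
)
as_dict = asdict(record)  # plain dictionary, ready to be stored in a catalog
```

Additional metadata specified by data feeders could be accommodated by subclassing this record or by attaching a free-form dictionary to it.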


4. Adequacy of the proposed model
In Table 2 we compare our minimum metadataset with the models described in Section 2,
leveraging some of the main metadata commonly employed for both general-purpose [19, 20,
21, 22, 23] and healthcare-related [24, 25] domains. Moreover, we evaluated the quality of our
model by studying how it addresses some of the most common challenges encountered in –
but not limited to – clinical data science and integration, which are acknowledged as critical
barriers also in the National Institutes of Health (NIH) strategic plan for data science research [26].

                               Our model     Genosurf  PDXFinder  CDA/FHIR   Datacite  EOSC
                                             [8]       [9]        [18, 10]   [12, 13]  [14]
 Domain                        Healthcare    Genomic   Cancer     Health     General   General
                               (in general)                       documents  purpose   purpose
 Terms&Conditions              ✓             ✗         ∼⋆         ✗          ✓         ✓
 File details                  ✓             ✗         ✗          ✗          ✓         ✓
 Content description           ✓             ✗         ✗          ✗          ✓         ✓
 File format and structure     ✓             ✓         ✓          ✓          ✓         ✓
 Provenance                    ✓             ∼*        ✓          ✗          ✗         ∼†
 Publication date              ✓             ✗         ✗          ✓          ✓         ✓
 Reference to vocabularies     ✗             ✗         ✗          ✓          ✗         ✗
 Access rights                 ✓             ✗         ✗          ✗          ✗         ✓
 Integrity information         ✓             ∼+        ✗          ✗          ✗         ✓
 Encryption information        ✓             ✗         ✗          ✗          ✗         ✗
 Patient: blood group          ✓             ✗         ✗          ✓          ✗         ✗
 Patient: age                  ✓             ✓         ∼‡         ✓          ✗         ✗
 Patient: gender               ✓             ✓         ✗          ✓          ✗         ✗
 Patient: ethnicity            ✓             ✓         ✗          ✗          ✗         ✗
 Observation: disease          ✓             ✓         ✓          ✓          ✗         ✗
 Observation: collection site  ✓             ✓         ✓          ✓          ✗         ✗

 ⋆ License type only.
 * Techniques only.
 † Software used only.
 + Claimed to be stored, although not displayed.
 ‡ Predefined ranges only.

Table 2
Comparison of our minimum metadataset for healthcare with other models


4.1. Lack of standard structure and policies
The lack of consistent data standards and formats across different healthcare systems poses
a significant challenge in achieving effective interoperability. Healthcare organizations often
employ diverse coding schemes, data structures, and terminologies, leading to inconsistencies
and incompatibilities when exchanging health information. This inconsistency may result in
errors and misinterpretations and possibly make data integration more complex.
   Proposed Solution. The minimum metadataset we suggest in Table 1 provides a
standardized model for organizing and describing essential information about healthcare data for
research purposes. By adopting this model, healthcare organizations can establish a common
structure for data representation, promoting consistency and compatibility in data exchange.
To ensure high interoperability, this solution is in line with state-of-the-art standards and
frameworks, such as HL7 FHIR [10] and DICOM (Digital Imaging and Communications in
Medicine) [27], as well as the others mentioned in Section 2. Moreover, leveraging the proposed
minimum metadata model, healthcare systems could map their local data elements and
terminologies to the standardized model, facilitating accurate interpretation and integration of
health information. Finally, the standardized metadata elements can also help in designing a
data catalog for a data lake that can easily accommodate different types of healthcare data.
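
The mapping of local data elements to the standardized model could look like the following sketch. The local field names and the snake_case target names are hypothetical; a real deployment would maintain one such mapping per source system.

```python
# Hypothetical mapping from local field names to attributes of the minimum
# metadataset (target names are snake_case renderings of Table 1).
LOCAL_TO_STANDARD_FIELDS = {
    "sex": "patient_sex",
    "gender": "patient_sex",
    "hospital": "collection_site",
    "organ": "primary_site",
    "diagnosis": "disease_names",
    "fmt": "file_format",
}


def to_standard(local_record: dict) -> tuple[dict, dict]:
    """Split a local record into mapped attributes and leftovers needing review."""
    standard, unmapped = {}, {}
    for key, value in local_record.items():
        target = LOCAL_TO_STANDARD_FIELDS.get(key.strip().lower())
        if target is None:
            unmapped[key] = value
        else:
            standard[target] = value
    return standard, unmapped


standard, unmapped = to_standard(
    {"Gender": ["female"], "Hospital": "Example Clinic", "ward": "B2"}
)
```

Returning the unmapped fields separately, rather than dropping them, keeps the mapping auditable: they can either be added to the table or stored as additional, feeder-specific metadata.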

4.2. Privacy concerns
Healthcare data is inherently sensitive and requires robust protection to maintain confidentiality
and ensure the secure and reliable exchange of health information. Besides the already stringent
privacy regulations in force in the US (HIPAA [28]) and Europe (GDPR [29]), data exchange must
also comply with other privacy regulations specific to each country or region.
   Proposed Solution. Our model prioritizes data protection and privacy by excluding
sensitive information such as patient names, dates of birth, and unique identifiers that could
potentially disclose patient or clinician identities. The model minimizes the risk of privacy
breaches by carefully selecting and including only non-identifying attributes. This approach
aligns with privacy and security policies, safeguarding the confidentiality of healthcare data
and promoting a secure environment for data exchange and interoperability.
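
A data lake could enforce this exclusion at ingestion time with a check like the one sketched below. The set of identifying keys is a hypothetical example and is not meant to be an exhaustive list of identifiers under any regulation.

```python
# Hypothetical names of fields that would identify a patient or a clinician.
IDENTIFYING_KEYS = frozenset(
    {"patient_name", "date_of_birth", "patient_id", "clinician_name"}
)


def identifying_fields(record: dict) -> set[str]:
    """Return the keys of a metadata record that must not be ingested."""
    return {key for key in record if key.strip().lower() in IDENTIFYING_KEYS}


def without_identifying_fields(record: dict) -> dict:
    """Return a copy of the record with the identifying keys removed."""
    return {
        key: value
        for key, value in record.items()
        if key.strip().lower() not in IDENTIFYING_KEYS
    }


submitted = {
    "patient_name": "Jane Doe",
    "min_age": 42,
    "max_age": 42,
    "collection_site": "Example Clinic",
}
rejected = identifying_fields(submitted)
cleaned = without_identifying_fields(submitted)
```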

4.3. Incomplete and inaccurate data
The quality and usefulness of healthcare datasets can be compromised by inconsistent data
capture and incomplete or inaccurate data entry practices. These issues significantly impact the
integrity and reliability of the exchanged data. Inconsistent data capture refers to variations in
how data is collected and recorded across different healthcare systems or organizations. This
can be due to discrepancies in terminology, coding systems, and acquisition processes, making
it challenging to compare and integrate information accurately. Incomplete or inaccurate data
entry practices further compound these challenges by introducing errors or missing information
into the exchanged data.
   Proposed Solution. The model incorporates essential attributes describing data provenance,
ensuring both the elicitation of details regarding the acquisition process and the tracking and
management of different versions of datasets over time. These attributes allow healthcare
professionals and researchers to clearly understand the acquisition methods and identify the
most up-to-date version of a dataset, reducing the risk of utilizing outdated or incomplete data.
By clearly indicating the dataset version, our model promotes data integrity and ensures that
users work with the most accurate and complete information.
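
Identifying the most up-to-date version from the provenance attributes can be sketched as follows. The sketch assumes that all records describe versions of the same dataset and that version labels are dot-separated integers; neither convention is mandated by the model.

```python
from datetime import date


def version_key(label: str) -> tuple[int, ...]:
    """Turn a label such as '1.10' into (1, 10) so that 1.10 sorts after 1.9."""
    return tuple(int(part) for part in label.split("."))


def latest_version(records: list[dict]) -> dict:
    """Pick the record with the most recent update date, then the highest version."""
    return max(
        records,
        key=lambda r: (r["update_date"], version_key(r["file_version"])),
    )


def is_outdated(record: dict, records: list[dict]) -> bool:
    """Return True if a newer version of the same dataset exists."""
    return latest_version(records) is not record


# Hypothetical versions of one dataset.
versions = [
    {"file_version": "1.9", "update_date": date(2023, 3, 1)},
    {"file_version": "1.10", "update_date": date(2023, 3, 1)},
    {"file_version": "1.2", "update_date": date(2022, 11, 5)},
]
newest = latest_version(versions)
```

Comparing version labels as integer tuples avoids the classic pitfall of string comparison, under which "1.9" would wrongly sort after "1.10".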

4.4. Data bias
Data bias is a significant concern in healthcare research [30, 31], as it can lead to unequal
treatment, inaccurate research findings, and disparities in patient outcomes. Bias can arise
from several aspects, e.g., the demographics of the population sampled, the methods used to
collect and analyze data, and other intrinsic biases. For example, if a dataset primarily includes
information from individuals of a certain age or ethnicity, the findings and conclusions drawn
from that data may not be applicable or representative of the broader population. Similarly,
biases can occur when selecting variables to be measured, leading to incomplete or skewed
representations of health conditions. Addressing data bias is crucial to ensure fair and reliable
research insights.
   Proposed Solution. Addressing data bias is a complex task that requires a multifaceted
approach. While our proposal focuses on a minimum metadata model, this does not solve
the problem completely. Therefore, scientists, researchers, and medical professionals must
employ various methodologies to tackle this issue comprehensively [32, 33]. We recognize
the significance of including attributes such as ethnicity, sex, and collection site, which can
help researchers and professionals analyze the demographic and geographical scope of the
datasets, thus assessing potential biases and accounting for them in their analyses. By combining
the strengths of the minimum metadata model, which addresses the identification of possible
data bias through attribute inclusion, with other approaches [34, 35, 36], researchers can work
towards mitigating and minimizing data bias, ultimately enhancing the quality and fairness of
their research outcomes.
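
As a rough illustration of how these attributes can expose potential bias, the sketch below measures how often each value of a demographic attribute appears across the metadata of a collection of datasets. The attribute values, the 20% threshold, and the function names are assumptions; the shares are computed over dataset-level metadata, not over individual patients, so they are only a first screening signal.

```python
from collections import Counter


def coverage(records: list[dict], attribute: str) -> dict[str, float]:
    """Share of mentions of each value of a multi-valued attribute."""
    counts = Counter(value for r in records for value in r.get(attribute, []))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: n / total for value, n in counts.items()}


def underrepresented(
    records: list[dict], attribute: str, threshold: float = 0.2
) -> list[str]:
    """Values whose share falls below the (arbitrary) threshold."""
    shares = coverage(records, attribute)
    return sorted(value for value, share in shares.items() if share < threshold)


# Hypothetical metadata of five datasets; group names are placeholders.
datasets = [
    {"ethnicity": ["group_a"]},
    {"ethnicity": ["group_a"]},
    {"ethnicity": ["group_a", "group_b"]},
    {"ethnicity": ["group_a"]},
    {"ethnicity": ["group_a", "group_c"]},
]
shares = coverage(datasets, "ethnicity")
flagged = underrepresented(datasets, "ethnicity")
```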

4.5. Data discovery
The rapid generation of healthcare data brings the difficulty of finding relevant datasets for
specific research or clinical purposes, in terms of required variables, population demographics,
or specific clinical parameters. This issue is further compounded by the lack of standardized
data formats, inconsistent data labeling, and varying data storage practices across different
healthcare systems and organizations.
   Proposed Solution. Researchers can leverage the proposed metadata model to establish a
standardized framework for organizing and describing essential information about healthcare
datasets. This includes attributes such as data structure, variables, demographics, diseases, and
anatomical sites. Data consumers can then utilize this standardized metadata to efficiently
search and filter through the vast amount of available datasets.
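
Searching and filtering over the standardized metadata can be sketched as a simple predicate applied to a catalog. The catalog entries, the function names, and the three supported criteria are hypothetical; a real data lake would typically delegate such queries to its metadata store.

```python
from typing import Optional


def matches(
    record: dict,
    disease: Optional[str] = None,
    file_format: Optional[str] = None,
    age_range: Optional[tuple[int, int]] = None,
) -> bool:
    """Return True if a metadata record satisfies all the given criteria."""
    if disease is not None:
        names = [name.lower() for name in record["disease_names"]]
        if disease.lower() not in names:
            return False
    if file_format is not None and record["file_format"].lower() != file_format.lower():
        return False
    if age_range is not None:
        low, high = age_range
        # Keep datasets whose [min_age, max_age] interval overlaps the request.
        if record["max_age"] < low or record["min_age"] > high:
            return False
    return True


def search(catalog: list[dict], **criteria) -> list[dict]:
    """Filter a catalog of metadata records without touching the raw data."""
    return [record for record in catalog if matches(record, **criteria)]


# Hypothetical catalog of three datasets.
catalog = [
    {"name": "A", "disease_names": ["Melanoma"], "file_format": "DICOM",
     "min_age": 30, "max_age": 80},
    {"name": "B", "disease_names": ["Melanoma", "Basal cell carcinoma"],
     "file_format": "CSV", "min_age": 50, "max_age": 90},
    {"name": "C", "disease_names": ["Asthma"], "file_format": "CSV",
     "min_age": 5, "max_age": 17},
]
young_melanoma = search(catalog, disease="melanoma", age_range=(18, 40))
csv_datasets = search(catalog, file_format="csv")
```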


5. Conclusions and future works
Metadata can be used to support the storage, retrieval, and analysis of complex datasets without
the need to directly access raw data. In this paper, we demonstrated this possibility using
the example of clinical metadata, showing the essential information that data feeders
should attach to each dataset before ingesting it into a data lake. We have also shown
that the use of metadata enhances data findability across multiple datasets, helping researchers
acquire suitable data for their studies. Future extensions of this work will include constraining the
values of the metadata fields to specific vocabularies to reduce representation ambiguities. In
the clinical domain, an attractive solution could be exploiting the Unified Medical Language
System (UMLS) [37], a controlled compendium of medical vocabularies including, among others,
SNOMED-CT [16] and LOINC [17]. The multi-language support of UMLS could
facilitate the adoption and usage of our minimum metadataset by clinicians and researchers.
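
Restricting field values to a vocabulary could work along the lines of the sketch below. The two small value sets are hypothetical stand-ins chosen for illustration; they are not taken from UMLS, and an actual extension would load the allowed values from the selected vocabularies.

```python
# Hypothetical controlled value sets (placeholders, not UMLS content).
ALLOWED_VALUES = {
    "patient_sex": {"female", "male", "other", "unknown"},
    "blood_group": {"A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"},
}


def validate(record: dict) -> dict[str, list[str]]:
    """Return, per constrained field, the values that are not in its vocabulary."""
    errors = {}
    for field_name, allowed in ALLOWED_VALUES.items():
        invalid = [v for v in record.get(field_name, []) if v not in allowed]
        if invalid:
            errors[field_name] = invalid
    return errors


problems = validate({"patient_sex": ["female", "F"], "blood_group": ["0+", "AB-"]})
clean = validate({"patient_sex": ["male"], "blood_group": ["O+"]})
```

In the first call the ambiguous abbreviation "F" and the mistyped "0+" (digit zero instead of the letter O) are reported, which is exactly the kind of representation ambiguity a controlled vocabulary is meant to prevent.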

Acknowledgments
This work was carried out within the MICS (Made in Italy - Circular and Sustainable) Extended
Partnership and received funding from Next-Generation EU (Italian PNRR - M4 C2, Invest 1.3 -
D.D. 1551.11-10-2022, PE00000004). CUP MICS D43C22003120001.


References
 [1] H. Fang, Managing data lakes in big data era: What’s a data lake and why has it became
     popular in data management ecosystem, in: 2015 IEEE International Conference on Cyber
     Technology in Automation, Control, and Intelligent Systems (CYBER), IEEE, 2015, pp.
     820–824.
 [2] D. Piantella, A research on data lakes and their integration challenges, in: Proceedings of
     the 30th Italian Symposium on Advanced Database Systems, SEBD, volume 3194 of CEUR
     Workshop Proceedings, 2022, pp. 616–621.
 [3] R. Hai, C. Koutras, C. Quix, M. Jarke, Data lakes: A survey of functions and systems, IEEE
     Transactions on Knowledge and Data Engineering (2023).
 [4] D. Reinsel, J. Gantz, J. Rydning, The digitization of the world from edge to core,
     Framingham: International Data Corporation 16 (2018).
 [5] U. FDA, Framework for FDA’s real-world evidence program, Silver Spring, MD: US
     Department of Health and Human Services Food and Drug Administration (2018).
 [6] H. Kondylakis, L. Koumakis, M. Tsiknakis, K. Marias, Implementing a data management
     infrastructure for big healthcare data, in: 2018 IEEE EMBS International Conference on
     Biomedical & Health Informatics (BHI), IEEE, 2018, pp. 361–364.
 [7] F. Ravat, Y. Zhao, Metadata management for data lakes, in: New Trends in Databases and
     Information Systems: ADBIS 2019 Short Papers, Workshops BBIGAP, QAUCA, SemBDM,
     SIMPDA, M2P, MADEISD, and Doctoral Consortium, Bled, Slovenia, September 8–11, 2019,
     Proceedings 23, Springer, 2019, pp. 37–44.
 [8] A. Canakoglu, A. Bernasconi, A. Colombo, M. Masseroli, S. Ceri, GenoSurf: metadata
     driven semantic search system for integrated genomic datasets, Database 2019 (2019) 132.
 [9] N. Conte, J. C. Mason, C. Halmagyi, S. Neuhauser, A. Mosaku, G. Yordanova, A. Chatzipli,
     D. A. Begley, D. M. Krupke, H. Parkinson, T. F. Meehan, C. C. Bult, PDX Finder: A portal
     for patient-derived tumor xenograft model discovery, Nucleic acids research 47 (2019)
     D1073–D1079.
[10] R. H. Dolin, L. Alschuler, Approaching semantic interoperability in health level seven,
     Journal of the American Medical Informatics Association 18 (2011) 99–103.
[11] A. Ehsan, M. A. M. Abuhaliqa, C. Catal, D. Mishra, RESTful API testing methodologies:
     Rationale, challenges, and solution directions, Applied Sciences 12 (2022) 4369.
[12] J. Brase, DataCite: a global registration agency for research data, in: 2009 fourth interna-
     tional conference on cooperation and promotion of information resources in science and
     technology, IEEE, 2009, pp. 257–261.
[13] P. Scott, R. Worden, Semantic mapping to simplify deployment of HL7 v3 clinical document
     architecture, Journal of biomedical informatics 45 (2012) 697–702.
[14] O. Corcho, M. Eriksson, K. Kurowski, M. Ojsteršek, C. Choirat, M. Van de Sanden, F. Cop-
     pens, EOSC interoperability framework, Report from the EOSC Executive Board Working
     Groups FAIR and Architecture, 2021.
[15] O. Bodenreider, R. Cornet, D. J. Vreeman, Recent developments in clinical terminologies
     SNOMED-CT, LOINC, and RxNorm, Yearbook of medical informatics 27 (2018) 129–139.
[16] K. Donnelly, SNOMED-CT: The advanced terminology and coding system for ehealth,
     Studies in health technology and informatics 121 (2006) 279.
[17] C. J. McDonald, S. M. Huff, J. G. Suico, G. Hill, D. Leavelle, R. Aller, A. Forrey, K. Mercer,
     G. DeMoor, J. Hook, W. Williams, J. Case, P. Maloney, LOINC, a universal standard for
     identifying laboratory observations: a 5-year update, Clinical chemistry 49 (2003) 624–633.
[18] R. H. Dolin, L. Alschuler, S. Boyer, C. Beebe, F. M. Behlen, P. V. Biron, A. Shabo, HL7
     clinical document architecture, release 2, Journal of the American Medical Informatics
     Association 13 (2006) 30–39.
[19] C. Lagoze, C. A. Lynch, R. Daniel Jr, The Warwick Framework: A Container Architecture
     for Aggregating Sets of Metadata, Technical Report, Cornell University, 1996.
[20] A. J. Gilliland, Setting the stage, Introduction to metadata 2 (2008) 7.
[21] U.S. National Archives, Metadata in electronic records management,
     https://records-express.blogs.archives.gov/2016/11/21/metadata-in-electronic-records-management/,
     2016. Online; accessed April-2024.
[22] R. Gabriel, T. Hoppe, A. Pastwa, Classification of metadata categories in data warehousing
     - A generic approach, in: Sustainable IT Collaboration Around the Globe. 16th Americas
     Conference on Information Systems, AMCIS 2010, Lima, Peru, August 12-15, 2010,
     Association for Information Systems, 2010, p. 133.
[23] J. Greenberg, A quantitative categorical analysis of metadata elements in image-applicable
     metadata schemas, Journal of the American Society for Information Science and Technology
     52 (2001) 917–924.
[24] J. Pierson, L. Seitz, H. Duque, J. Montagnat, Metadata for efficient, secure and extensible
     access to data in a medical grid, in: Proc. 15th International Workshop on Database and
     Expert Systems Applications, 2004., IEEE Computer Society, 2004, pp. 562–566.
[25] R. Badawy, F. Hameed, L. Bataille, M. A. Little, K. Claes, S. Saria, J. M. Cedarbaum,
     D. Stephenson, J. Neville, W. Maetzler, A. J. Espay, B. R. Bloem, T. Simuni, D. R. Karlin,
     Metadata concepts for advancing the use of digital health technologies in clinical
     research, Digital biomarkers 3 (2020) 116–132.
[26] U.S. National Institutes of Health, NIH strategic plan for data science,
     https://datascience.nih.gov/nih-strategic-plan-data-science, 2018. Online; accessed April-2024.
[27] M. Mustra, K. Delac, M. Grgic, Overview of the DICOM standard, in: 2008 50th International
     Symposium ELMAR, volume 1, IEEE, 2008, pp. 39–44.
[28] I. G. Cohen, M. M. Mello, HIPAA and protecting health information in the 21st century,
     JAMA 320 (2018) 231–232.
[29] C. J. Hoofnagle, B. Van Der Sloot, F. Z. Borgesius, The European Union general data
     protection regulation: what it is and what it means, Information & Communications
     Technology Law 28 (2019) 65–98.
[30] I. G. Cohen, R. Amarasingham, A. Shah, B. Xie, B. Lo, The legal and ethical concerns
     that arise from using complex predictive analytics in health care, Health affairs 33 (2014)
     1139–1147.
[31] A. Rajkomar, M. Hardt, M. D. Howell, G. Corrado, M. H. Chin, Ensuring fairness in machine
     learning to advance health equity, Annals of internal medicine 169 (2018) 866–872.
[32] N. Norori, Q. Hu, F. M. Aellen, F. D. Faraci, A. Tzovara, Addressing bias in big data and AI
     for health care: A call for open science, Patterns 2 (2021).
[33] C. Criscuolo, T. Dolci, M. Salnitri, Towards assessing data bias in clinical trials, in: VLDB
     Workshop on Data Management and Analytics for Medicine and Healthcare, Springer,
     2022, pp. 57–74.
[34] J. R. Marcelin, D. S. Siraj, R. Victor, S. Kotadia, Y. A. Maldonado, The impact of unconscious
     bias in healthcare: how to recognize and mitigate it, The Journal of infectious diseases 220
     (2019) S62–S73.
[35] C. FitzGerald, S. Hurst, Implicit bias in healthcare professionals: a systematic review, BMC
     medical ethics 18 (2017) 1–18.
[36] J. Odgaard-Jensen, G. E. Vist, A. Timmer, R. Kunz, E. A. Akl, H. Schünemann, M. Briel, A. J.
     Nordmann, S. Pregno, A. D. Oxman, Randomisation to protect against selection bias in
     healthcare trials, Cochrane database of systematic reviews (2011).
[37] O. Bodenreider, The unified medical language system (UMLS): integrating biomedical
     terminology, Nucleic acids research 32 (2004) D267–D270.