<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chiara Criscuolo</string-name>
          <email>chiara.criscuolo@polimi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Piantella</string-name>
          <email>davide.piantella@polimi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierluigi Reali</string-name>
          <email>pierluigi.reali@polimi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Gabriella Signorini</string-name>
          <email>mariagabriella.signorini@polimi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Politecnico di Milano, Department of Electronics, Information, and Bioengineering</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <addr-line>Via G. Ponzio 34/5, 20133 Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>In medicine, the digitization of healthcare processes and health services is generating a vast amount of medical data. However, the sheer data volume and the variety of formats significantly hinder the efficient sharing of data collected across different hospitals. This could compromise the quality of multicentric studies and hamper the potential of modern medical research based on AI systems and machine learning analysis. In this context, being able to extract and manage good-quality metadata is paramount: especially when dealing with heterogeneous and unstructured datasets, metadata provides valuable ready-to-use information about a dataset without the need to analyze its content directly. Several data models exist that are specific for storing and conveniently organizing clinical metadata, such as the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM), which provides a flexible solution for multiple types of healthcare data. Furthermore, compliance with the EU AI Act is a necessary requirement for medical AI systems, and metadata can also support ethical data science. The role of metadata in clinical contexts has been studied and analyzed in the Health Big Data project, which involves 51 Italian research hospitals (IRCCS) with the goal of maximizing the interoperability of healthcare datasets and enhancing clinical research. In this discussion paper, we describe how effective management of metadata in clinical datasets is crucial for ensuring data usability, harmonization, and ethics in AI-driven healthcare applications.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The digitization of healthcare processes has led to an explosion of medical data, with the availability
of large datasets allowing researchers to leverage federated architectures and apply innovative
analytical techniques. However, this transformation is occurring in an environment of uncertainty and
rapid change, where current decisions will shape the future of healthcare data management and analysis.
In addition, the diversity of data formats and the huge data volume pose significant challenges for efficient data
sharing across hospitals. These barriers can compromise the quality of multicentric studies and limit
the potential of AI-driven medical research. Moreover, as automation in data collection and analysis
increases, along with the capability of identifying large-scale patterns in biomedical data, it becomes
crucial to question which systems govern these processes and how they are regulated.</p>
      <p>
        In this context, metadata plays a key role in managing heterogeneous and unstructured datasets,
offering valuable, ready-to-use information without requiring direct content analysis. In addition,
ensuring compliance with regulations such as the European AI Act [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is essential for promoting ethical
and trustworthy AI-driven applications in healthcare. Addressing these challenges, the Health Big Data
project, involving 51 Italian research hospitals, aims to enhance healthcare data interoperability and
improve clinical research through effective metadata management.
      </p>
      <p>CEUR Workshop Proceedings, ISSN 1613-0073</p>
      <sec id="sec-1-1">
        <title>1.1. Contributions and Paper Structure</title>
        <p>In this paper, we provide three key contributions that highlight metadata as a fundamental driver in
healthcare data analysis:</p>
        <p>C1 We describe the role of metadata in managing complex, heterogeneous, and unstructured
healthcare data.</p>
        <p>
          C2 Considering a real-world use case, we apply a well-established metadata framework (i.e., the
Observational Medical Outcomes Partnership Common Data Model, OMOP CDM [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]) to enhance
electrocardiogram (ECG) data collection and analysis by identifying relevant metadata and
proposing a workflow for their automatic extraction from raw data.
        </p>
        <p>C3 We illustrate how metadata supports ethical data science in the medical domain, ensuring that
healthcare datasets and systems remain trustworthy and regulation-compliant.</p>
        <p>The rest of this paper is organized as follows. Section 2 introduces some preliminary concepts such
as Big Data, Electronic Health Records, and metadata classifications. Section 3 presents our case
study: considering ECG data, we identify relevant metadata and present a workflow for automatically
extracting them from raw datasets. Section 4 describes the role of metadata in advancing ethical data
science in the healthcare domain. Section 5 concludes the paper and suggests novel directions for future
research.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Preliminaries</title>
      <p>
        For the scope of this paper, some preliminary concepts must be defined. We regard Big Data as both
a technological and a cultural phenomenon since (i) it necessitates considerable computational power
and algorithmic precision to collect, analyze, link, and compare vast datasets [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and (ii) it allows
for uncovering patterns that inform decisions in various fields, including economics, social sciences,
technology, and, more specifically, healthcare and medicine.
      </p>
      <p>
        The “4 Vs” of Big Data – Velocity, Volume, Variety, and Veracity – represent the core characteristics
of this phenomenon. Many researchers extended the traditional “Vs” with additional features [
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7">4, 5, 6, 7</xref>
        ],
such as Complexity, Value, and Variability. Although not traditionally included in the definition of Big
Data, these concepts emphasize even more the importance of effective data management for maximizing
the intrinsic value of data. In healthcare, this factor is particularly critical, as the variability of healthcare
data can impact the quality and reliability of analyses. In fact, Big Data can be viewed as a combination
of structured (e.g., relational or tabular), semi-structured (e.g., XML or JSON), and unstructured data
(e.g., natural language texts, raw images), or any combination thereof, collected by organizations [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        In this context, the OMOP CDM offers powerful tools to standardize the structure and representation
of data from different sources [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Its main components comprise a set of standardized tables (with
associated predefined fields) and interconnected vocabularies. The tables store the clinical data and
related metadata of interest; the vocabularies provide coded concepts to identify the described measures,
diagnoses, procedures, etc.
      </p>
      <p>
        Once datasets are collected, they typically require preparation to align with the specific research goals.
This preparation includes cleaning and preprocessing the datasets, making them ready for analysis and
suitable for machine learning or AI-based applications. These steps constitute the so-called Data Science
Pipeline [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In healthcare research, this pipeline presents unique challenges, such as the variety of
datasets and their formats, prompting additional steps for the integration or harmonization of different
data sources. Moreover, in the pipeline, it is essential to not only consider data quality, but also to
address ethical issues, including the identification of potential biases related to labels, representation,
and sampling [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and, possibly, their mitigation [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>2.1. Managing Electronic Health Records (EHRs)</title>
        <p>
          The increasing digitization of healthcare processes results in a huge amount of medical data, such as
clinical images, laboratory results, and discharge letters. It has been estimated that by 2025 the annual
growth rate of healthcare data will reach 36%, significantly higher than the general data growth rate,
estimated at 27% [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>
          The term Electronic Health Records (EHRs) defines a comprehensive, cross-institutional, and
longitudinal collection of healthcare data to encompass the entire clinical history of a patient [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. EHRs
store both structured (e.g., date of birth, diagnosis codes, and laboratory results) and unstructured (e.g.,
clinical notes and medical images) data. While unstructured data are more challenging to manage and
extract information from, they offer a richer description of patient conditions and valuable contextual
information that structured formats struggle to capture (e.g., social history, anamnesis).
        </p>
        <p>One possible solution to enhance data analysis capabilities when dealing with unstructured
heterogeneous data is to leverage metadata, i.e., information describing the different characteristics of the
data itself. For instance, metadata for natural language texts may specify the language used and the
topics discussed. In the case of a medical image, metadata should include the scanned body region and the
type and configuration of the imaging device. Metadata can be associated with different levels of
information, from a single data point (e.g., the patient’s age in an ECG signal) to entire datasets (e.g.,
the age range and ethnicity of patients in a multicentric clinical trial) or even a whole data provider
(e.g., its trustworthiness and other quality metrics).</p>
        <p>
          The increasing availability of EHRs has enabled (i) real-world-evidence clinical trials [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], (ii) the use
of deep learning algorithms to advance the analysis of healthcare data, particularly medical images [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ],
and (iii) the application of generative AI to natural language processing tasks (i.e., reading comprehension,
summarization, translation, and question answering) [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], as well as real-world risk forecasting and
clinical research to study disease progression, simulate interventions, and support medical education [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
Thus, the ability to collect, harmonize, and integrate data from multiple heterogeneous datasets becomes
paramount [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. A common approach for storing unstructured datasets in heterogeneous formats is the
use of data lakes: schema-less data repositories capable of ingesting raw data without preprocessing [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
However, to exploit the full potential of data lakes and EHRs, we must properly describe and catalog
datasets in a structured and harmonized manner: metadata are crucial for outlining raw datasets and
establishing meaningful connections among them [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Metadata classifications</title>
        <p>
          Analyzing some of the most relevant general-purpose metadata classifications [
          <xref ref-type="bibr" rid="ref21">21, 22, 23, 24</xref>
          ], we can
identify three main clusters of metadata categories: (i) administrative, supporting data governance,
management and administration; (ii) data provenance, providing a detailed record of the dataset’s lineage
and evolution throughout its lifecycle; (iii) descriptive, allowing the users to understand the content,
purpose and relevance of the datasets.
        </p>
        <p>Additionally, healthcare data management requires specialized metadata categories to ensure accurate
description, privacy protection, and interoperability. We now present two healthcare-specific metadata
models that align with the general-purpose classifications while addressing domain-specific
requirements. Pierson et al. [25] introduce a classification designed for handling medical data, identifying six
primary metadata categories:
          <list list-type="bullet">
            <list-item><p>Patient-related: these metadata include simple information regarding the patient (e.g., sex, age).</p></list-item>
            <list-item><p>Image-related: medical images are often characterized by dimensions, voxel size, and encoding.</p></list-item>
            <list-item><p>Acquisition-related: decisions taken during the acquisition process significantly impact the
resulting medical images. For this reason, we store information such as the acquisition device, the set
of parameters for the acquisition process, and the acquisition date.</p></list-item>
            <list-item><p>Hospital-related: this category includes the department responsible for the acquisition and hospital
information in general.</p></list-item>
            <list-item><p>Medical record: medical history is fundamental for interpreting medical exam results, which are
often compared with previous exams of the same patient.</p></list-item>
            <list-item><p>Security-related: sensitive information must be kept private. Therefore, we must store information
regarding authorization and encryption.</p></list-item>
          </list>
        </p>
        <p>Similarly, Badawy et al. [26] report the following classification of healthcare metadata, endorsed by a
consortium of 33 domain experts:
          <list list-type="bullet">
            <list-item><p>Person-related: this category includes relevant information regarding the subjects of the study (i.e.,
the patients), such as age, sex, medical history, and concomitant medications. When applicable, it
also includes information regarding care providers (e.g., clinicians or relatives) to avoid biases.</p></list-item>
            <list-item><p>Observation-related: this category includes data collected during a study or analyzed in post-study
evaluations, integrating information from both digital health technologies and human participants.
Examples are devices used for acquisition, software names and versions used for the analysis,
and sensor precision.</p></list-item>
            <list-item><p>Context of collection: this category includes details of the clinical study, its conduct, eligibility criteria, processes,
and procedures.</p></list-item>
            <list-item><p>Time-related: this category provides temporal information regarding the data collected and reported in the
datasets, such as start time, end time, time precision, time format, and time zone.</p></list-item>
          </list>
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Case study: metadata for ECG datasets</title>
      <p>As a case study, we focus on electrocardiograms (ECGs), biosignals that record the heart’s electrical
activity over time. We identify a set of relevant metadata to describe this specific type of dataset and
propose a workflow to automatically extract these metadata from raw ECG data.</p>
      <sec id="sec-3-1">
        <title>3.1. Minimum set of metadata</title>
        <p>When there are multiple data providers (e.g., in a multicentric study), a best practice is to identify
a minimum set of metadata that each source must attach to the dataset before sharing it with other
partners. This ensures that essential metadata are present for each dataset, while leaving data feeders
free to specify additional metadata that further describe their datasets.</p>
        <p>To properly represent ECG data, we extend the healthcare minimum metadataset illustrated in [27]
with additional metadata, specifically tailored to describe an ECG, shown in bold font in Table 1.</p>
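        <table-wrap>
          <label>Table 1</label>
          <table>
            <thead>
              <tr><th>Category</th><th>Attributes</th></tr>
            </thead>
            <tbody>
              <tr><td>Administrative</td><td>GUID, Creator, Owner, Rights, Terms of access</td></tr>
              <tr><td>Data provenance</td><td>Publication year, Upload date, Acquisition method, Acquisition tools, Number of leads, Sampling frequency, Bandwidth, Download URL, Checksum, Encryption algorithm, File version, Update/modification date, Update frequency</td></tr>
              <tr><td>Descriptive</td><td></td></tr>
            </tbody>
          </table>
        </table-wrap>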
        <p>Specifically, the Number of leads defines the number of ECG channels, each capturing the electrical
activity along a different direction and enriching the information content of the signal. The Sampling
frequency describes the temporal resolution of the ECG, while the Bandwidth informs about the lowest
and highest frequencies represented in the signal; these metadata are fundamental because certain
analyses necessitate high-resolution ECGs [28]. Heart Rate Variability (HRV) indices are time-domain,
frequency-domain, and non-linear metrics derived from ECG signals through appropriate
processing [29]; together with automatically detected Arrhythmias, they can be queried to identify signals
presenting specific characteristics of interest for a clinical study.</p>
        <p>Furthermore, for ECG signals, the Acquisition method is used to differentiate 2-minute
or shorter recordings (e.g., diagnostic ECGs) from 24-hour or longer ones (Holter ECGs), providing
essential context to interpret the extracted HRV indices correctly [30].</p>
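        <p>As a minimal sketch of how a data consumer could check that a shared dataset carries the minimum metadata set, the snippet below lists a few required keys and reports which ones are missing. The key names are hypothetical snake_case renderings of some Table 1 attributes, the function name is our own, and the record values are invented for illustration; none of them reflects an actual Health Big Data schema.</p>
        <preformat>
```python
# Hypothetical required keys: snake_case renderings of some Table 1 attributes.
REQUIRED_FIELDS = (
    "guid", "creator", "owner", "rights", "terms_of_access",
    "acquisition_method", "number_of_leads", "sampling_frequency", "bandwidth",
)


def missing_metadata(metadata):
    """Return the required fields that are absent or empty, in declaration order."""
    return [
        name for name in REQUIRED_FIELDS
        if metadata.get(name) in (None, "")
    ]


# Invented example of the metadata a provider might attach to an ECG dataset.
record = {
    "guid": "example-guid-0001",
    "creator": "Hospital A",
    "owner": "Hospital A",
    "rights": "restricted",
    "terms_of_access": "on request",
    "acquisition_method": "Holter",
    "number_of_leads": 12,
    "sampling_frequency": 500,  # Hz
}

print(missing_metadata(record))  # the record above lacks "bandwidth"
```
        </preformat>
        <p>Providers remain free to attach further keys; the check only enforces the agreed minimum.</p>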
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Extraction and representation of ECG-specific metadata</title>
        <p>Common ECG data formats (e.g., EDF [31], BIDS [32], CSV + header file or JSON) typically store, in
addition to raw signals, several provenance and descriptive metadata in a structured manner,
facilitating their identification and retrieval. However, HRV indices and Arrhythmias must be derived from
the ECG traces through specific processing. To enhance interpretability and semantic interoperability
with datasets from other sources, we insert these features into OMOP CDM structures and map them
to unique concepts, leveraging OMOP vocabularies. This is achieved through a data processing and
standardization pipeline we developed [33] that takes raw ECGs as input, extracts HRV features and
the detected arrhythmias, and structures the output according to the OMOP CDM. Suitable concepts
to describe the metadata of interest in the OMOP Vocabulary are identified through the Athena web
interface<sup>1</sup>. In the following, we summarize the main steps of this pipeline.</p>
        <p>An ECG signal is made up of characteristic waveforms representing different phases of the cardiac
cycle, as shown in Figure 1. Both the time distance (specifically, its variability over time) between such
waveforms and their morphology can provide critical information on patient health. After traditional
preprocessing, established algorithms (e.g., Pan-Tompkins [34]) are applied to detect the position of
the R peaks in the ECG traces, obtain the RR-interval time series, and calculate typical HRV indices,
including the RR-interval mean, standard deviation, and root mean square of successive differences
(RMSSD) [29]. In addition, waveform classification models are employed to automatically identify the
occurrence of arrhythmias of particular interest for clinical research, including atrioventricular blocks,
bundle branch blocks, atrial fibrillation, bradycardia, and tachycardia.</p>
        <p>Open-source tools, such as the NeuroKit2 toolbox [35] and specialized Deep Neural Networks
(DNN) [36], can be leveraged for this purpose, with the additional benefit of making it easier to track the
software version, code, and characteristics of the applied feature extraction methods.</p>
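        <p>The time-domain HRV indices named above can be illustrated with a short, dependency-free sketch. It assumes the R-peak positions have already been detected (e.g., by a Pan-Tompkins implementation) and are given as sample indices; the function names and the example values are ours, and the standard deviation is computed as a sample standard deviation, which is an assumption. It is not the pipeline of [33] and does not use NeuroKit2.</p>
        <preformat>
```python
import math
import statistics


def rr_intervals_ms(r_peak_samples, sampling_frequency_hz):
    """Convert R-peak sample positions into successive RR intervals (milliseconds)."""
    return [
        (later - earlier) * 1000.0 / sampling_frequency_hz
        for earlier, later in zip(r_peak_samples, r_peak_samples[1:])
    ]


def hrv_time_domain(rr_ms):
    """Compute mean RR, standard deviation (SDNN), and RMSSD from RR intervals."""
    squared_diffs = [(b - a) ** 2 for a, b in zip(rr_ms, rr_ms[1:])]
    return {
        "mean_rr_ms": statistics.mean(rr_ms),
        "sdnn_ms": statistics.stdev(rr_ms),  # sample standard deviation (assumption)
        "rmssd_ms": math.sqrt(statistics.mean(squared_diffs)),
    }


# Invented R-peak positions for a signal sampled at 500 Hz.
r_peaks = [0, 400, 810, 1200, 1610]
rr = rr_intervals_ms(r_peaks, sampling_frequency_hz=500)
indices = hrv_time_domain(rr)
print(rr)
print(indices)
```
        </preformat>
        <p>The conversion step shows why the Sampling frequency metadata matters: without it, R-peak positions cannot be turned into RR intervals expressed in time units.</p>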
        <p>Once feature extraction is completed, HRV indices and automated diagnoses are mapped into a suitable
OMOP CDM structure. Given the patient-centric approach of the CDM, the Person table is the first
one to be populated; it can conveniently accommodate the Descriptive metadata related to patient
demographics (i.e., Age, Ethnicity, Sex) by means of dedicated fields<sup>2</sup>. Then, a Procedure_occurrence
table is initialized that stores the basic properties of the collected ECG signals. For example, a specific
field (procedure_type_concept_id) stores the previously described Acquisition method metadata,
making it possible to differentiate between diagnostic and Holter recordings. The other ECG-tailored Data
provenance metadata, namely Number of leads, Sampling frequency, and Bandwidth, can be allocated
in distinct instances of the Observation table, each one mapped to a specific OMOP Vocabulary
concept ensuring unambiguous representation (e.g., Sampling frequency maps to “Digital Sampling
Rate”, OMOP ID: 37533243).</p>
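        <p>The allocation of provenance metadata to Observation instances can be sketched as follows. The rows are partial, dictionary-based stand-ins for Observation records, not a database schema. Only the concept ID for Sampling frequency comes from the text; the other attributes use 0, the OMOP convention for "no matching concept", as a placeholder. Using the observation_event_id field to point at the Procedure_occurrence record is our assumption, based on our reading of CDM v5.4, and the example values are invented.</p>
        <preformat>
```python
from datetime import date

# Concept ID quoted in the text for Sampling frequency ("Digital Sampling Rate").
SAMPLING_RATE_CONCEPT_ID = 37533243
# Placeholder: 0 means "no matching concept" in OMOP; the text gives no other IDs.
UNMAPPED_CONCEPT_ID = 0


def provenance_observations(person_id, procedure_occurrence_id, recorded_on, attributes):
    """Build one partial Observation-like row per ECG provenance attribute."""
    rows = []
    for concept_id, source_name, value in attributes:
        is_number = isinstance(value, (int, float))
        rows.append({
            "person_id": person_id,
            "observation_concept_id": concept_id,
            "observation_date": recorded_on.isoformat(),
            "value_as_number": value if is_number else None,
            "value_as_string": None if is_number else str(value),
            "observation_source_value": source_name,
            # Assumed link to the ECG recorded in Procedure_occurrence.
            "observation_event_id": procedure_occurrence_id,
        })
    return rows


rows = provenance_observations(
    person_id=1,
    procedure_occurrence_id=7,
    recorded_on=date(2025, 1, 15),
    attributes=[
        (UNMAPPED_CONCEPT_ID, "Number of leads", 12),
        (SAMPLING_RATE_CONCEPT_ID, "Sampling frequency", 500),
        (UNMAPPED_CONCEPT_ID, "Bandwidth", "0.05-150 Hz"),
    ],
)
for row in rows:
    print(row["observation_source_value"], row["observation_concept_id"])
```
        </preformat>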
        <p>The Observation table is also used to store the automatically detected arrhythmias, which are
represented through the observation_concept_id “ECG automated diagnosis” (OMOP ID: 35810893),
defining the specific metadata reported in that instance, and the value_as_concept_id field populated
with the arrhythmia (if any) identified in the ECG (e.g., “ECG: atrial fibrillation”, OMOP ID: 4064452).
With a similar strategy, the calculated HRV indices are allocated in separate instances of the
Measurement table, which is preferred for storing numerical values. All the instances of the Observation and
Measurement tables are connected to the previous Procedure_occurrence table, so the association
between the mapped metadata and the ECG of origin is clearly defined.</p>
        <p><sup>1</sup> https://athena.ohdsi.org/search-terms/start</p>
        <p><sup>2</sup> https://ohdsi.github.io/CommonDataModel/cdm54.html</p>
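        <p>A hedged sketch of this last step is given below: one Observation-like row for the automated diagnosis and one Measurement-like row per HRV index, all pointing back to the same ECG procedure. The two diagnosis concept IDs are those quoted in the text; the HRV concept ID is a placeholder (0), since the text does not report one. The use of observation_event_id and measurement_event_id as the link, the function names, and the values are assumptions made for illustration.</p>
        <preformat>
```python
ECG_AUTOMATED_DIAGNOSIS = 35810893  # observation_concept_id quoted in the text
ECG_ATRIAL_FIBRILLATION = 4064452   # value_as_concept_id quoted in the text
UNMAPPED_CONCEPT_ID = 0             # placeholder: the text gives no HRV concept IDs


def arrhythmia_observation(person_id, procedure_occurrence_id, arrhythmia_concept_id=None):
    """Observation-like row for the automated diagnosis (None if no arrhythmia)."""
    return {
        "person_id": person_id,
        "observation_concept_id": ECG_AUTOMATED_DIAGNOSIS,
        "value_as_concept_id": arrhythmia_concept_id,
        "observation_event_id": procedure_occurrence_id,  # assumed link field
    }


def hrv_measurements(person_id, procedure_occurrence_id, hrv_indices):
    """One Measurement-like row per HRV index, sorted by index name."""
    return [
        {
            "person_id": person_id,
            "measurement_concept_id": UNMAPPED_CONCEPT_ID,
            "measurement_source_value": name,
            "value_as_number": value,
            "measurement_event_id": procedure_occurrence_id,  # assumed link field
        }
        for name, value in sorted(hrv_indices.items())
    ]


observation = arrhythmia_observation(1, 7, ECG_ATRIAL_FIBRILLATION)
measurements = hrv_measurements(1, 7, {"rmssd_ms": 34.6, "mean_rr_ms": 805.0})

# Every derived row points back to the same ECG procedure.
linked_ids = {observation["observation_event_id"]}
linked_ids.update(m["measurement_event_id"] for m in measurements)
print(linked_ids)
```
        </preformat>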
        <p>Finally, the remaining patient-related Descriptive metadata, i.e., Blood group and Disease name, can
be represented as records of the Condition_occurrence table, which is directly linked to the Person
table and primarily stores patient diagnoses (also associated with a specific period of time), as well as
generally immutable facts (e.g., blood group). Certain patient characteristics included in the designed
OMOP CDM structure, such as age, sex, and ethnicity, demand careful consideration from an ethical
perspective.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Metadata supporting Ethical Data Science</title>
      <p>
        Healthcare data, such as clinical trials, are well-suited for analysis through mining methods and
machine learning techniques. However, the European AI Act [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] classifies the application of
Data Science to healthcare data for medical diagnosis as a high-risk application due to its significant
impact on patients’ lives. Consequently, such systems must undergo rigorous compliance
checks and adhere to the principles outlined in the document. Among the principles listed in the AI Act,
we first report the three pillars required to achieve trustworthy AI systems<sup>3</sup>:
        <list list-type="bullet">
          <list-item><p>Lawful: systems must respect all the applicable laws and regulations and, in the medical domain,
should also comply with domain-specific rules.</p></list-item>
          <list-item><p>Ethical: systems should incorporate ethical principles and values.</p></list-item>
          <list-item><p>Robust: systems must be designed and used to prevent any unintentional harm from both
technical and social perspectives.</p></list-item>
        </list>
      </p>
      <p>Ethical considerations are fundamental for ensuring that a system is trustworthy and, thus, applicable
in real-world contexts. Ethical Data Science is a broad field that focuses on minimizing harm, ensuring
moral rights, and evaluating practices involved in the generation, collection, analysis, and dissemination
of data that could potentially affect people and society adversely. By identifying specific risks and
challenges, Figure 2 categorizes ethical issues and concerns in the following areas: fairness and diversity,
privacy, transparency and explainability, accountability and governance. We believe that metadata can
play a crucial role in enhancing many of these ethical aspects.</p>
      <p><sup>3</sup> https://digital-strategy.ec.europa.eu/en/library/ethics-guidelines-trustworthy-ai</p>
      <p>Fairness and Diversity. In the context of healthcare and Machine Learning (ML) tasks, fairness refers
to the equitable treatment of individuals or groups by ML models, irrespective of sensitive or protected
characteristics. To identify and measure fairness, analysts require access to protected characteristics, such
as sex, ethnicity, or age, which should not serve as discriminatory factors in predictions. As previously
shown, these attributes can be recorded as metadata within the dataset. Additionally, metadata can
capture the distribution of these characteristics by reporting the percentage of occurrences for each
value. This enables diversity analysis, helping to identify representation inequities that could lead
to systemic biases in data-driven decisions. Such biases may result in greater prediction errors for
underrepresented groups or minorities, ultimately exacerbating disparities in healthcare outcomes.</p>
      <p>Privacy. Privacy must be guaranteed through the entire AI system lifecycle, ensuring that personal
information is protected from unauthorized users, and that data usage is traceable at every stage of
the data science pipeline (i.e., monitor data flow, track changes in the data transformation/processing,
identify who can access or modify the data). This is particularly critical in the healthcare sector, where
informed consent plays a central role in clinical trials. Informed consent ensures that participants are
fully aware of the experiment, voluntarily agree to participate, and retain the right to withdraw their
consent at any time. Metadata can significantly enhance privacy in AI systems by indicating whether a
dataset contains personally identifiable information and by tracing the history of privacy-preserving
transformations applied to the records [37, 38]. An additional metadata usage in this area could be a
log tracking consent status, ensuring researchers only use data from patients who have given informed
consent.</p>
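      <p>The consent log mentioned above could, for instance, be an append-only list of consent events that is replayed to find which patients have valid consent on a given date. The sketch below is one possible reading: the event layout, the status labels ("granted", "withdrawn"), and the function name are hypothetical, and the events are invented.</p>
      <preformat>
```python
from datetime import date


def consented_patient_ids(consent_log, on_date):
    """IDs of patients whose latest consent event up to `on_date` is a grant."""
    latest_status = {}
    # sorted() is stable, so same-day events keep their order in the log.
    for event in sorted(consent_log, key=lambda e: e["date"]):
        if event["date"] > on_date:
            continue  # ignore events recorded after the reference date
        latest_status[event["patient_id"]] = event["status"]
    return sorted(
        pid for pid, status in latest_status.items() if status == "granted"
    )


# Invented consent events; a withdrawal overrides an earlier grant.
consent_log = [
    {"patient_id": 1, "date": date(2024, 1, 10), "status": "granted"},
    {"patient_id": 2, "date": date(2024, 2, 1), "status": "granted"},
    {"patient_id": 1, "date": date(2024, 3, 5), "status": "withdrawn"},
    {"patient_id": 3, "date": date(2024, 4, 1), "status": "granted"},
]

print(consented_patient_ids(consent_log, date(2024, 3, 31)))
```
      </preformat>
      <p>Keeping the full event history, rather than a single current flag, also preserves the traceability that the privacy requirements above call for.</p>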
      <p>Transparency and Explainability. Transparency and explainability are essential for fostering trust
in AI systems. When users understand how algorithmic decisions are made, they are more likely to
trust and accept the outcomes. Additionally, regulations such as GDPR and the AI Act emphasize the
need for transparency in AI applications. Transparency and explainability ensure that the appropriate
information reaches the relevant stakeholders [39]. These principles are particularly relevant in medical
diagnosis. For instance, if an AI system is used to screen patients at high risk for cancer, a medical
researcher needs to understand the factors contributing to the diagnosis. Similarly, when an AI system
predicts a particular medical condition, it is crucial that its data sources, analytical processes, and
decision-making logic are well-documented and accessible. Metadata can play a key role in enhancing
transparency and explainability by providing detailed records of AI system functionalities, data sources,
and decision rationales. Metadata could contain model interpretability metrics, ensuring that predictions
are explainable to medical experts. Furthermore, metadata can be tailored to different audiences,
ensuring that explanations are adapted to various levels of expertise and backgrounds, thereby making
AI systems more accessible and interpretable.</p>
      <p>Accountability and Governance. Can an AI system be held accountable for its actions? Accountability
refers to the responsibility of individuals and organizations to ensure that their data processes and
algorithms operate ethically and fairly, preventing harm and addressing ethical lapses when they arise.
Through effective governance, accountability should be maintained at every stage of the AI system
lifecycle or data science pipeline. However, assigning responsibility to specific actors within an AI
system is inherently challenging. In the healthcare sector, ethical review boards typically oversee data
science practices to ensure responsible management and adherence to ethical standards. Metadata
can also play a crucial role in strengthening governance. Metadata can document the version history
of an ML model, tracking changes in training data, algorithm updates, and human interventions. By
recording assessments of algorithms, data, and design processes, metadata facilitates auditability and
enhances oversight. Additionally, it can document potential redress strategies for addressing issues in
data science methodologies, further supporting responsible AI development.</p>
      <p>To summarize, the metadata contributions to ethical data science in healthcare on the four areas are:
• Identification of bias and supporting fairness analysis : (i) in bias detection metadata can
record demographic attributes (e.g., Age, Ethnicity, Sex) to assess and mitigate bias in AI models;
(ii) in data distribution insights they can store statistical summaries of dataset diversity to prevent
representation inequities; (iii) metadata can be used to report fairness metrics (i.e. fairness-related
performance indicators).
• Tracking privacy: (i) metadata can identify personal, anonymized, or sensitive health data to
ensure proper handling; (ii) regarding consent management, metadata can log patient consent
status and usage permissions, ensuring compliance with regulations; (iii) metadata can record
privacy-preserving actions (e.g., encryption, de-identification) applied to data.
• Documenting transparency: (i) metadata can document how an AI system arrives at a specific
diagnosis or prediction; (ii) regarding model interpretability, metadata can store explanations of
model behavior and reasoning for end-users; (iii) metadata can adapt explanations based on the
user’s expertise (e.g., doctors vs. patients).
• Enabling accountability: (i) in audit trails, metadata can capture version histories, documenting
changes in data, models, and decisions; (ii) metadata can support compliance by flagging datasets
and models that meet ethical and legal standards; (iii) metadata can log incorrect predictions and
the associated corrective actions, improving error and redress tracking.</p>
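      <p>To illustrate the first area of the summary, the following Python sketch derives two fairness-related metadata values from records annotated with a demographic attribute: the share of each group in the dataset and the demographic parity difference of a model's predictions. The choice of demographic parity difference, the function names, and the key names are assumptions made for this example; the records are toy values, not patient data, and the functions assume a non-empty input.</p>
      <preformat>
```python
from collections import Counter


def group_shares(records, attribute):
    """Share of records per value of a demographic attribute."""
    counts = Counter(r[attribute] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def demographic_parity_difference(records, attribute,
                                  prediction_key="predicted_positive"):
    """Largest gap in positive-prediction rate between any two groups."""
    positives = Counter()
    totals = Counter()
    for r in records:
        totals[r[attribute]] += 1
        positives[r[attribute]] += int(r[prediction_key])
    rates = [positives[g] / totals[g] for g in totals]
    return max(rates) - min(rates)


# Toy, made-up records; no real patient data.
records = [
    {"sex": "F", "predicted_positive": True},
    {"sex": "F", "predicted_positive": False},
    {"sex": "M", "predicted_positive": True},
    {"sex": "M", "predicted_positive": True},
]

# Hypothetical metadata entry that could be stored alongside the dataset.
fairness_metadata = {
    "attribute": "sex",
    "group_shares": group_shares(records, "sex"),
    "demographic_parity_difference":
        demographic_parity_difference(records, "sex"),
}
```
      </preformat>
      <p>Storing such values as metadata lets a later audit check representation and prediction gaps without re-reading the underlying sensitive records.</p>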
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>In this work, we analyzed the role of metadata as a key driver in healthcare data management and
analysis. We demonstrated how metadata facilitates the handling of complex, heterogeneous, and
unstructured healthcare data by providing a precise description of metadata in clinical contexts, including
a dedicated metadata set for ECG data. Additionally, we applied a well-established metadata framework
(OMOP CDM) to enhance real-world use cases, specifically focusing on ECG data collection and analysis.
Finally, we explored the role of metadata in ethical data science within the medical domain, particularly
in bias detection and fairness analysis, privacy tracking, transparency augmentation, and accountability
enablement. Future work will include:
• Studying how to reduce representation ambiguities in metadata values. This could be achieved by
enriching the OMOP CDM vocabulary with a multi-language compendium of medical ontologies
such as the Unified Medical Language System [40].
• Validating our ECG metadata representation in a real-world multicentric study by applying it to
healthcare datasets from multiple sources and assessing its impact on data interoperability and
AI-driven analysis.
• Expanding state-of-the-art metadata frameworks to encompass additional aspects of ethical data
science and validating the approach in real-world scenarios.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work has been supported by the Health Big Data project, funded by the Italian Ministry of Economy
and Finance and coordinated by the Ministry of Health. We also thank Davide Martinenghi for his
support and advice.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4 for grammar and spelling checks. After
using this tool, the authors reviewed and edited the content as needed and take full responsibility for
the publication’s content.</p>
      <p>[22] R. Gabriel, T. Hoppe, A. Pastwa, Classification of metadata categories in data warehousing - A
generic approach, in: Sustainable IT Collaboration Around the Globe. 16th Americas Conference
on Information Systems, AMCIS 2010, Lima, Peru, August 12-15, 2010, Association for Information
Systems, 2010, p. 133.
[23] A. J. Gilliland, Setting the stage, Introduction to metadata 2 (2008) 1–19.
[24] J. Greenberg, A quantitative categorical analysis of metadata elements in image-applicable
metadata schemas, Journal of the American Society for Information Science and Technology 52
(2001) 917–924.
[25] J. Pierson, L. Seitz, H. Duque, J. Montagnat, Metadata for efficient, secure and extensible access
to data in a medical grid, in: 15th International Workshop on Database and Expert Systems
Applications (DEXA 2004), with CD-ROM, 30 August - 3 September 2004, Zaragoza, Spain, IEEE
Computer Society, 2004, pp. 562–566.
[26] R. Badawy, F. Hameed, L. Bataille, M. A. Little, K. Claes, S. Saria, J. M. Cedarbaum, D. Stephenson,
J. Neville, W. Maetzler, et al., Metadata concepts for advancing the use of digital health technologies
in clinical research, Digital biomarkers 3 (2020) 116–132.
[27] D. Piantella, P. Reali, P. Kumar, L. Tanca, A minimum metadataset for data lakes supporting
healthcare research, in: Proceedings of the 32nd Symposium of Advanced Database Systems, 2024,
volume 3741 of CEUR Workshop Proceedings, CEUR-WS.org, 2024, pp. 681–691.
[28] L. G. Tereshchenko, M. E. Josephson, Frequency content and characteristics of ventricular
conduction, Journal of Electrocardiology 48 (2015) 933–937. doi:10.1016/j.jelectrocard.2015.08.034.
[29] F. Shaffer, J. P. Ginsberg, An overview of heart rate variability metrics and norms, Frontiers in
Public Health 5 (2017) 1–17. doi:10.3389/fpubh.2017.00258.
[30] M. Malik, J. T. Bigger, A. J. Camm, R. E. Kleiger, A. Malliani, A. J. Moss, P. J. Schwartz, Heart rate
variability: Standards of measurement, physiological interpretation, and clinical use, European
Heart Journal 17 (1996) 354–381. doi:10.1093/oxfordjournals.eurheartj.a014868.
[31] B. Kemp, J. Olivan, European data format ’plus’ (edf+), an edf alike standard format for the
exchange of physiological data, Clinical Neurophysiology 114 (2003) 1755–1761. doi:10.1016/S1388-2457(03)00123-8.
[32] K. J. Gorgolewski, T. Auer, V. D. Calhoun, R. C. Craddock, S. Das, E. P. Duff, G. Flandin, S. S.
Ghosh, T. Glatard, Y. O. Halchenko, D. A. Handwerker, M. Hanke, D. Keator, X. Li, Z. Michael,
C. Maumet, B. N. Nichols, T. E. Nichols, J. Pellman, J.-B. Poline, A. Rokem, G. Schaefer, V. Sochat,
W. Triplett, J. A. Turner, G. Varoquaux, R. A. Poldrack, The brain imaging data structure, a format
for organizing and describing outputs of neuroimaging experiments, Scientific Data 3 (2016)
160044. doi:10.1038/sdata.2016.44.
[33] P. Reali, A. Carotenuto, D. Piantella, L. Tanca, P. Plebani, M. G. Signorini, Development of data
ingestion pipelines for the federated use of biomedical data in research: The health big data
project, in: 2024 IEEE 22nd Mediterranean Electrotechnical Conference (MELECON), IEEE, 2024,
pp. 678–683. doi:10.1109/MELECON56669.2024.10608617.
[34] J. Pan, W. J. Tompkins, A real-time qrs detection algorithm, IEEE Transactions on Biomedical
Engineering 32 (1985) 230–236. doi:10.1109/TBME.1985.325532.
[35] D. Makowski, T. Pham, Z. J. Lau, J. C. Brammer, F. Lespinasse, H. Pham, C. Schölzel, S. H. A.
Chen, Neurokit2: A python toolbox for neurophysiological signal processing, Behavior Research
Methods 53 (2021) 1689–1696. doi:10.3758/s13428-020-01516-y.
[36] A. H. Ribeiro, M. H. Ribeiro, G. M. M. Paixão, D. M. Oliveira, P. R. Gomes, J. A. Canazart, M. P. S.
Ferreira, C. R. Andersson, P. W. Macfarlane, W. Meira, T. B. Schön, A. L. P. Ribeiro, Automatic
diagnosis of the 12-lead ecg using a deep neural network, Nature Communications 11 (2020) 1760.
doi:10.1038/s41467-020-15432-4.
[37] A. Pika, M. T. Wynn, S. Budiono, A. H. Ter Hofstede, W. M. van der Aalst, H. A. Reijers,
Privacy-preserving process mining in healthcare, International journal of environmental research and
public health 17 (2020) 1612.
[38] S. A. Sohail, F. A. Bukhsh, M. van Keulen, Multilevel privacy assurance evaluation of healthcare
metadata, Applied Sciences 11 (2021) 10686.
[39] R. Mariani, F. Rossi, R. Cucchiara, M. Pavone, B. Simkin, A. Koene, J. Papenbrock, Trustworthy
ai—part 1, Computer 56 (2023) 14–18.
[40] O. Bodenreider, The unified medical language system (umls): integrating biomedical terminology,
Nucleic acids research 32 (2004) D267–D270.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <collab>The European Parliament</collab>
          ,
          <source>Artificial Intelligence Act</source>
          ,
          <source>Official Journal of the European Union</source>
          ,
          <year>2024</year>
          . URL: https://artificialintelligenceact.eu/the-act/, last accessed on January 30, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Overhage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. B.</given-names>
            <surname>Ryan</surname>
          </string-name>
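          ,
          <string-name>
            <given-names>C. G.</given-names>
            <surname>Reich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Hartzema</surname>
          </string-name>
          ,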
          <string-name>
            <given-names>P. E.</given-names>
            <surname>Stang</surname>
          </string-name>
          ,
          <article-title>Validation of a common data model for active safety surveillance research</article-title>
          ,
          <source>Journal of the American Medical Informatics Association</source>
          <volume>19</volume>
          (
          <year>2012</year>
          )
          <fpage>54</fpage>
          -
          <lpage>60</lpage>
          . doi:10.1136/amiajnl-2011-000376.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Boyd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Crawford</surname>
          </string-name>
          ,
          <article-title>Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon</article-title>
          ,
          <source>Information, communication &amp; society</source>
          <volume>15</volume>
          (
          <year>2012</year>
          )
          <fpage>662</fpage>
          -
          <lpage>679</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gegov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cocea</surname>
          </string-name>
          ,
          <article-title>Unified framework for control of machine learning tasks towards effective and efficient processing of big data</article-title>
          ,
          <source>Data science and big data: An environment of computational intelligence</source>
          (
          <year>2017</year>
          )
          <fpage>123</fpage>
          -
          <lpage>140</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>De Mauro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Greco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grimaldi</surname>
          </string-name>
          ,
          <article-title>What is big data? a consensual definition and a review of key research topics</article-title>
          ,
          <source>AIP Conference Proceedings</source>
          <volume>1644</volume>
          (
          <year>2015</year>
          )
          <fpage>97</fpage>
          -
          <lpage>104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Suthaharan</surname>
          </string-name>
          ,
          <article-title>Big data classification: Problems and challenges in network intrusion prediction with machine learning</article-title>
          ,
          <source>ACM SIGMETRICS Performance Evaluation Review</source>
          <volume>41</volume>
          (
          <year>2014</year>
          )
          <fpage>70</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Katal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wazid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. H.</given-names>
            <surname>Goudar</surname>
          </string-name>
          ,
          <article-title>Big data: issues, challenges, tools and good practices</article-title>
          ,
          <source>in: 2013 Sixth international conference on contemporary computing (IC3)</source>
          , IEEE,
          <year>2013</year>
          , pp.
          <fpage>404</fpage>
          -
          <lpage>409</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W.</given-names>
            <surname>Pedrycz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-M.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <source>Information granularity, big data, and computational intelligence</source>
          , Springer,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. K.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Ooi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reyad</surname>
          </string-name>
          ,
          <article-title>Rafiki: Machine learning as an analytics service system</article-title>
          ,
          <source>Proceedings of the VLDB Endowment</source>
          <volume>12</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Criscuolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dolci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Salnitri</surname>
          </string-name>
          ,
          <article-title>Towards assessing data bias in clinical trials</article-title>
          ,
          <source>in: VLDB Workshop on Data Management and Analytics for Medicine and Healthcare</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>57</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Criscuolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dolci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Salnitri</surname>
          </string-name>
          ,
          <article-title>Mitigating unfairness in machine learning: A taxonomy and an evaluation pipeline</article-title>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Reinsel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gantz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rydning</surname>
          </string-name>
          ,
          <article-title>The digitization of the world from edge to core</article-title>
          ,
          <source>Framingham: International Data Corporation</source>
          <volume>16</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>Shickel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Tighe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bihorac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rashidi</surname>
          </string-name>
          ,
          <article-title>Deep ehr: a survey of recent advances in deep learning techniques for electronic health record (ehr) analysis</article-title>
          ,
          <source>IEEE journal of biomedical and health informatics</source>
          <volume>22</volume>
          (
          <year>2017</year>
          )
          <fpage>1589</fpage>
          -
          <lpage>1604</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <collab>U.S. Food and Drug Administration</collab>
          ,
          <article-title>Framework for FDA's real-world evidence program</article-title>
          , Silver Spring, MD: US Department of Health and Human Services, Food and Drug Administration
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G.</given-names>
            <surname>Litjens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kooi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. E.</given-names>
            <surname>Bejnordi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A. A.</given-names>
            <surname>Setio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ciompi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghafoorian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Van Der Laak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Van Ginneken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. I.</given-names>
            <surname>Sánchez</surname>
          </string-name>
          ,
          <article-title>A survey on deep learning in medical image analysis</article-title>
          ,
          <source>Medical image analysis</source>
          <volume>42</volume>
          (
          <year>2017</year>
          )
          <fpage>60</fpage>
          -
          <lpage>88</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Thirunavukarasu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S. J.</given-names>
            <surname>Ting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Elangovan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. F.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S. W.</given-names>
            <surname>Ting</surname>
          </string-name>
          ,
          <article-title>Large language models in medicine</article-title>
          ,
          <source>Nature medicine</source>
          <volume>29</volume>
          (
          <year>2023</year>
          )
          <fpage>1930</fpage>
          -
          <lpage>1940</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kraljevic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bendayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hemingway</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Yeung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Baston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Idowu</surname>
          </string-name>
          , et al.,
          <article-title>Foresight-generative pretrained transformer (gpt) for modelling of patient timelines using ehrs</article-title>
          ,
          <source>arXiv preprint arXiv:2212.08072</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kondylakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Koumakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tsiknakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Marias</surname>
          </string-name>
          ,
          <article-title>Implementing a data management infrastructure for big healthcare data</article-title>
          ,
          <source>in: 2018 IEEE EMBS International Conference on Biomedical &amp; Health Informatics (BHI)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>361</fpage>
          -
          <lpage>364</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>N.</given-names>
            <surname>Miloslavskaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tolstoy</surname>
          </string-name>
          ,
          <article-title>Big data, fast data and data lake concepts</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>88</volume>
          (
          <year>2016</year>
          )
          <fpage>300</fpage>
          -
          <lpage>305</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ravat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Metadata management for data lakes</article-title>
          ,
          <source>in: New Trends in Databases and Information Systems, ADBIS 2019 Short Papers, Workshops BBIGAP, QAUCA, SemBDM, SIMPDA, M2P, MADEISD, and Doctoral Consortium, Bled, Slovenia, September 8-11, 2019, Proceedings</source>
          , volume
          <volume>1064</volume>
          of Communications in Computer and Information Science, Springer,
          <year>2019</year>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lagoze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Lynch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Daniel</surname>
            <suffix>Jr</suffix>
          </string-name>
          ,
          <article-title>The Warwick Framework: A Container Architecture for Aggregating Sets of Metadata</article-title>
          ,
          <source>Technical Report</source>
          , Cornell University,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>