<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Survey on Metadata for Machine Learning Models and Datasets: Standards, Practices, and Harmonization Challenges</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Genet-Asefa Gesese</string-name>
          <email>genet-asefa.gesese@fiz-karlsruhe.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zongxiong Chen</string-name>
          <email>zongxiong.chen@fokus.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oussama Zoubia</string-name>
          <email>oussama.zoubia@uk-koeln.de</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fidan Limani</string-name>
          <email>f.limani@zbw.eu</email>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kanishka Silva</string-name>
          <email>kanishka.silva@gesis.org</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muhammad Asif Suryani</string-name>
          <email>asif.suryani@gesis.org</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Zapilko</string-name>
          <email>benjamin.zapilko@gesis.org</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leyla Jael Castro</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ekaterina Kutafina</string-name>
          <email>ekaterina.kutafina@uni-koeln.de</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dhwani Solanki</string-name>
          <email>solanki@zbmed.de</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Heike Fliegl</string-name>
          <email>Heike.Fliegl@fiz-Karlsruhe.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sonja Schimmler</string-name>
          <email>sonja.schimmler@fokus.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zeyd Boukhers</string-name>
          <email>zeyd.boukhers@fit.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harald Sack</string-name>
          <email>harald.sack@fiz-karlsruhe.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FIZ Karlsruhe - Leibniz Institute for Information Infrastructure</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fraunhofer FOKUS</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>GESIS - Leibniz Institute for the Social Sciences</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Institute for Biomedical Informatics, Medical Faculty, University of Cologne</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Karlsruhe Institute of Technology, Institute AIFB</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>TU Berlin</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff6">
          <label>6</label>
          <institution>ZBW - Leibniz Information Centre for Economics</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>9</fpage>
      <lpage>23</lpage>
      <abstract>
        <p>The growing availability of machine learning (ML) models, datasets, and related artifacts across platforms, such as Hugging Face, GitHub, and Zenodo, has amplified the need for structured and standardized metadata. However, metadata practices remain highly heterogeneous, differing in schema design, vocabulary usage, and semantic expressiveness, posing significant challenges for tasks such as representation, extraction, alignment, and integration. This fragmentation impedes the development of infrastructures that depend on machine-actionable metadata to support discovery, provenance tracking, or cross-platform interoperability. While metadata is also foundational to enabling FAIR (Findable, Accessible, Interoperable, and Reusable) principles in ML, there is a lack of consolidated understanding of how existing standards support interoperability and alignment across platforms. In this survey, we review and compare a range of general-purpose and ML-specific metadata standards, evaluating their suitability for cross-platform alignment, discoverability, extensibility, and interoperability. We assess these standards based on defined criteria and analyze their potential to support unified, FAIR-compliant metadata infrastructures for ML, laying the groundwork for scalable and interoperable tooling in future ML ecosystems.</p>
      </abstract>
      <kwd-group>
        <kwd>Metadata</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Datasets</kwd>
        <kwd>FAIR</kwd>
        <kwd>Research Artifacts Harmonization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
      <p>
        The rapid growth of machine learning (ML) research has led to an explosion in the availability of ML
artifacts, such as models, datasets, and training code, which are now shared across a wide range of
platforms, including GitHub1, Hugging Face2, Zenodo3, or OpenML4. These platforms have become
an essential infrastructure for disseminating pre-trained models, experimental results, datasets, and
reproducible workflows. However, the scale and diversity of these artifacts have outpaced any consistent
metadata practices, resulting in fragmented, incompatible, and semantically shallow descriptions across
platforms [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ].
      </p>
      <p>
        Metadata provides structured descriptions of digital objects such as datasets, software, and models,
supporting both human understanding and machine interoperability. It also carries critical
supplementary information, including provenance, quality, licensing, versioning, and usage constraints that bring
additional context to a resource. However, significant heterogeneity exists in how metadata is designed
and applied across platforms. This includes diferences in schema design, vocabulary usage,
expressiveness, and machine readability. This lack of metadata standardization limits machine-actionability
and afects automated workflows, making it dificult, for instance, to discover models, link them to
related publications, or incorporate them into knowledge-driven systems [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Figure 1 illustrates this
progression from fragmentation toward semantic integration and FAIR (Findable, Accessible,
Interoperable, and Reusable)5 infrastructures. These dificulties become more evident in workflows that
rely on structured knowledge representations, such as Data Science (DS) and Artificial Intelligence (AI)
pipelines or Knowledge Graph (KG)-powered discovery tools.
      </p>
      <p>
        In ML contexts, metadata encompasses descriptive, administrative, structural, provenance, evaluation,
and ethical dimensions, each crucial for the reuse and interpretability of ML artifacts. Without such
contextual detail, those artifacts cannot be reliably discovered, interpreted, or reused. In this sense,
metadata is a cornerstone of FAIR infrastructures: in line with the FAIR principles, metadata alongside
data and infrastructure plays an important role, and its relevance extends beyond ML artifacts
themselves [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In scientific research, particularly in the fields of DS &amp; AI, metadata serves as the connecting
layer that describes and contextualizes digital artifacts. In order to address these foundational needs,
eforts to improve metadata quality and standardization in ML have emerged, often linked with FAIR
research, responsible AI, and data-centric workflows. In this paper, we use the term "standards" in
a broad sense, encompassing formal specifications, vocabularies and conceptual models, as well as
community practices. Notable initiatives include model cards6, dataset documentation frameworks,
benchmark metadata formats such as the MLCommons Model Index7, and the adoption of
general5https://www.go-fair.org/fair-principles/
6https://huggingface.co/docs/hub/en/model-cards
7https://github.com/mlcommons
purpose metadata standards, such as Schema.org8, DCAT9, or DataCite10. Despite these advances, these
initiatives remain largely siloed, often designed for specific platforms. These eforts, while valuable,
tend to address specific phases or artifact types and often neglect or have a limited focus on alignment
and integration challenges across the full ML life cycle. As a result, these platform-specific metadata
silos limit their interoperability in multi-repository or multi-domain applications, which is an essential
requirement for scalable scientific infrastructures.
      </p>
      <p>Adding to the issue, the wide range of metadata standards and schemas, both general-purpose and
domain-specific, as well as the conceptual models, vocabulary terms, and representation formats they
adopt, creates significant barriers to semantic interoperability, complicating integration efforts across
repositories. This leads to (terminology) incompatibility and fragmentation between different platforms.
Without systematic alignment and mappings between these heterogeneous metadata standards, it
becomes difficult to construct unified metadata layers that can support scenarios like reasoning, querying,
or KG construction across platforms. Despite isolated efforts to bridge metadata silos, the field still
lacks a shared framework or a consolidated understanding of how existing standards compare in terms
of alignment potential, extensibility, and machine-actionability.</p>
      <p>
        Multiple lines of research have focused on formalizing ML metadata. Li et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], for example,
proposed a unified representation to query model repositories. Other works survey scientific metadata
standards [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], data provenance in computational workflows [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], while Samuel, Löfler &amp; König-Ries
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and Limani et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] focus on FAIRification of ML pipelines and that of ML models, respectively.
However, there is a need to address the ML artifacts from a metadata perspective, such as the suitability
of metadata frameworks for cross-platform alignment, semantic interoperability, and integration based
on KGs, linked to practical challenges such as dataset search [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and metadata inconsistencies [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>The main contributions of this paper are summarized as follows:
• A review and comparative analysis of metadata practices in major ML platforms.
• A review of the existing ecosystem of metadata standards for ML artifacts and their suitability
for semantic integration.
• Identification and a detailed discussion of the challenges inherent in mapping, aligning, and
integrating heterogeneous ML metadata.
• Identification of key gaps, limitations, and research opportunities in the field of ML metadata
management and semantic integration.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Metadata Practices in Prominent ML Platforms</title>
      <p>This section reviews metadata practices in major ML platforms, focusing on their structure, granularity,
and machine-actionability.</p>
      <sec id="sec-2-1">
        <title>2.1. Criteria for Selecting Platforms (CSP)</title>
        <p>To ensure a representative and practical comparison, the following six criteria were used to select ML
platforms.
• CSP1: Popularity, Adoption, and Influence. The platform is widely used in the ML community,
as demonstrated by active contributors, hosted artifacts, GitHub metrics, citations in academic
literature, or integration into major workflows in academia or industry.
• CSP2: Metadata Accessibility and Machine-Actionability. The platform exposes metadata in
structured or semi-structured formats (e.g., JSON, YAML, XML, or RDF), and provides programmatic
access via APIs for retrieving or exporting metadata. Preference is given to platforms whose metadata
supports parsing, automated extraction, and reuse without manual intervention.
• CSP3: ML Artifact Coverage. The platform supports multiple ML-specific artifact types, including
models, datasets, training code, and optionally notebooks or experiment traces, as well-described and
retrievable entities.
• CSP4: Open Access and Licensing Transparency. Metadata and artifacts are publicly accessible
without restrictive authentication or institutional barriers. Clear licensing supports metadata reuse,
redistribution, and integration into downstream applications.
• CSP5: Interoperability Potential. The platform is relevant to metadata alignment efforts and
provides metadata that can be mapped to standard schemas with minimal transformation effort.
• CSP6: Participation in Standards and Community-Driven Practices. The platform is involved
in or adopts emerging metadata standards, such as the MLCommons Model Index or Hugging Face’s
model-index.yaml, or contributes to open science infrastructure through community initiatives.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Relevant Machine Learning Platforms</title>
        <p>
          The Hugging Face Model &amp; Dataset Hubs are widely adopted for sharing pretrained models and datasets,
particularly in natural language processing (NLP) and computer vision (CSP1, CSP3). Metadata is
primarily provided through semi-structured README.md files, often containing structured YAML headers and
markdown descriptions, alongside auxiliary files such as config.json and dataset_infos.json
(CSP2). Additional effort has been undertaken to describe datasets using the Croissant ML extension to
Schema.org [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] (CSP2). These elements support common fields like license, task, and language, and
often include links to publications such as arXiv papers (CSP5). However, schema adherence varies
across entries, and content inconsistencies hinder machine-actionability (CSP2). While the Hugging
Face Hub API supports metadata access, it does not enforce schema constraints (CSP6). Nevertheless,
community-driven practices such as the use of model-index.yaml and cross-publishing on
platforms like Zenodo reflect emerging support for structured metadata (CSP4, CSP6). FAIR-oriented tools
like MLentory [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] demonstrate ongoing efforts to extract structured metadata for integration into
knowledge infrastructures (CSP5).
        </p>
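To make the semi-structured README layout described above concrete, the sketch below extracts a YAML-style front-matter header using plain Python string handling. The README content and field names are invented for illustration and are not taken from any real model card; a production tool would use a full YAML parser.

```python
# Sketch: splitting a Hugging Face style README.md into its '---' delimited
# front-matter header and markdown body, then reading simple top-level fields.
# All content below is a hypothetical example.

def split_front_matter(readme: str):
    """Return (header_lines, body) for a README with a '---' delimited header."""
    lines = readme.splitlines()
    if not lines or lines[0].strip() != "---":
        return [], readme
    try:
        end = lines[1:].index("---") + 1  # position of the closing '---'
    except ValueError:
        return [], readme
    return lines[1:end], "\n".join(lines[end + 1:])

def parse_flat_fields(header_lines):
    """Very small subset of YAML: top-level 'key: value' pairs only."""
    fields = {}
    for line in header_lines:
        if ":" in line and not line.startswith((" ", "-")):
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

readme = """---
license: apache-2.0
language: en
pipeline_tag: text-classification
---
# My model
Model description goes here.
"""

header, body = split_front_matter(readme)
meta = parse_flat_fields(header)
print(meta["license"])        # apache-2.0
print(meta["pipeline_tag"])   # text-classification
```

This kind of lightweight extraction is roughly what harvesting tools must do before the metadata can be aligned with any formal schema.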
        <p>Zenodo is a general-purpose repository used to archive research artifacts, including datasets and
ML models (CSP1, CSP3). It adopts the well-established DataCite schema and assigns persistent DOIs,
ensuring long-term preservation and citation capabilities (CSP2, CSP6). Metadata is accessible in XML
and JSON formats, with public APIs and license declarations (CSP4). Although metadata for general
scientific artifacts is rich, Zenodo lacks expressiveness for ML-specific features, like model architecture
or training metrics (CSP5). Thus, integration with ML-specific standards requires supplementary
metadata or schema extensions. Despite this, its stability, openness, and alignment with FAIR principles
make it a valuable component in cross-platform metadata workflows involving Hugging Face and other
platforms (CSP5).</p>
        <p>GitHub is a ubiquitous platform for hosting ML-related content, such as serialized models, datasets,
and training code, often serving as the origin point for Hugging Face model repositories and Zenodo
archives (CSP1, CSP3). Metadata on GitHub is primarily unstructured, embedded within README.md,
LICENSE, or commit messages, without adherence to any formal schema (CSP2). While GitHub provides
version control and open access features supportive of reproducibility (CSP4), extracting structured
metadata often requires NLP-based or code-based analysis pipelines, limiting machine-actionability
and integration.</p>
      <p>Kaggle (https://kaggle.com/) is a platform for hosting ML datasets and notebooks, widely used for competitions and
educational purposes (CSP1, CSP3). Metadata for datasets includes column descriptions, file formats,
and data types. For notebooks, the execution environment, associated datasets, and runtime information
are captured in a structured form (CSP2). It provides a public API for accessing metadata, which is
generally well-structured and machine-actionable within its ecosystem (CSP2, CSP5). Despite its strong
internal schema, metadata remains tightly platform-specific and lacks standardized vocabulary reuse,
limiting external interoperability (CSP5). Nevertheless, there is an ongoing effort to align metadata for
datasets to Croissant ML (CSP2).</p>
        <p>MLCommons is a collaborative initiative focused on improving reproducibility, benchmarking, and
metadata standardization in ML research and engineering (CSP1, CSP6). It provides a YAML-based
schema (model_index.yaml) describing model domains, evaluation metrics, and usage context in a
structured, machine-readable form (CSP2). Though the scope is currently limited to benchmarking,
the quality of structured metadata and community involvement make MLCommons a key actor in ML
metadata standardization (CSP5, CSP6). Artifacts are publicly available through repositories or GitHub
(CSP4). MLCommons is also the main driver behind Croissant ML (CSP2).</p>
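For orientation, the fragment below sketches the general shape of such a model-index entry, loosely modeled on the model-index conventions used on the Hugging Face Hub. The model name, dataset, and metric value are invented, and the exact field set should be checked against the current specification.

```yaml
# Hypothetical model-index entry (illustrative values only)
model-index:
- name: example-sentiment-model
  results:
  - task:
      type: text-classification
    dataset:
      name: example-corpus
      type: example-corpus
    metrics:
    - type: accuracy
      value: 0.91
```

Entries of this kind make evaluation claims machine-readable, which is precisely the property most free-text model descriptions lack.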
      <p>Hugging Face Trending Papers succeeds the now-retired Papers with Code (https://paperswithcode.com/), continuing the
mission of bridging academic publications and code repositories. It is a discovery interface that
highlights recent and popular ML research papers, ranked based on community engagement and
GitHub star activity (CSP1, CSP3). While the interface supports paper-code linkage and improves
research visibility, the metadata is minimally structured and lacks alignment with formal schemas or
standardized vocabularies (CSP2). Consequently, its integration into structured metadata pipelines
remains limited. The feature functions primarily as a community-curated signal layer for research
discovery rather than a source of machine-actionable metadata (CSP6).</p>
        <p>OpenML is a collaborative platform designed for sharing datasets, ML tasks, and experiment results,
with a strong focus on traceability and reproducibility (CSP1, CSP3). OpenML provides a REST API and
a Python client library for programmatic access, offering excellent machine-actionability and semantic
transparency (CSP2, CSP4). It aligns closely with FAIR principles and supports metadata standards,
including the use of standardized vocabularies and experiment tracking formats (CSP5). It also integrates
with platforms like scikit-learn and supports schema extensions such as Croissant ML, reinforcing its
role in interoperable ML metadata ecosystems (CSP6).</p>
      <p>Summary. ML platforms vary widely in their metadata practices, reflecting differing goals,
communities, and technical architectures. OpenML and MLCommons support structured, FAIR-aligned
metadata, while GitHub and Hugging Face rely largely on unstructured or loosely structured formats, limiting interoperability and
automation without additional enrichment. Zenodo offers openness and persistent identifiers but lacks
ML-specific schema support. Kaggle provides structured metadata within its ecosystem, though with
limited external integration. Hugging Face Trending Papers serves as a lightweight discovery interface
for recent ML research, but lacks the metadata depth and structure needed for integration into
interoperable systems. Overall, while foundational infrastructure exists, metadata practices across platforms
remain fragmented. Increased adoption of common schemas, shared vocabularies, and standardized
metadata pipelines is essential to enable reproducibility, discoverability, and cross-platform alignment
in the ML research ecosystem.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Existing Metadata Standards Relevant to ML</title>
      <p>Having examined how metadata is currently structured and exposed across major ML platforms, we
now turn to the metadata standards that underpin or could enhance these practices. In this section,
both general-purpose and ML-specific metadata standards are reviewed, assessing their applicability to
describing ML models and datasets, especially in terms of semantic interoperability and cross-platform
alignment. This dual focus matters in reconciling the maturity of general-purpose standards with the
specific needs of ML metadata. The selection of standards and initiatives presented here followed a
structured but pragmatic approach: we considered those (i) explicitly referenced or adopted by major ML
platforms, (ii) widely recognized in the broader research data management ecosystem, or (iii) discussed
in the existing literature and community initiatives on FAIR and reproducible ML.</p>
      <sec id="sec-3-1">
        <title>3.1. Criteria for Selecting Standards (CSS)</title>
        <p>To evaluate metadata standards for their suitability to describe ML artifacts, five criteria are defined.
These criteria reflect key technical and conceptual requirements for supporting structured, interoperable,
and FAIR-compliant metadata infrastructures in ML.
• CSS1: Relevance to ML Artifacts refers to whether the standard is directly applicable to describing
ML models, datasets, or experimental workflows.
• CSS2: Adoption in Research Platforms considers whether the standard is integrated into widely
used repositories, infrastructures, or policy frameworks.
• CSS3: Semantic Expressiveness reflects the degree to which the standard supports formal semantics
such as RDF, OWL, or Linked Data principles.
• CSS4: FAIR Alignment evaluates the extent to which the standard contributes to the achievement
of FAIRness, through persistent identifiers, license fields, access protocols, or reuse of vocabularies.
• CSS5: Maturity and Stability examines whether the standard is well-specified, actively maintained,
and supported by an established community. Considerations include specification completeness,
release frequency, and ecosystem support.</p>
        <p>These criteria are consistently applied when analyzing standards grouped under general-purpose and
ML-specific metadata categories.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. General-Purpose Metadata Standards</title>
        <p>These standards are widely used across disciplines and provide essential scaffolding for metadata
representation.</p>
        <p>
          Schema.org is widely adopted for annotating datasets, software, and publications on the web (CSS2,
CSS5). The use of JSON-LD enables integration with the Semantic Web (CSS3), and its popularity
supports interoperability across systems (CSS4) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. While Schema.org does not provide native support
for ML-specific entities, e.g., model architectures, training configurations, or evaluation results, it
is partially applicable to ML contexts due to its flexible class structure and extensibility (≈CSS1; cf.
Figure 2). To address its ML limitations, there are extensions that aim to improve the coverage of
software (e.g., CodeMeta [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and maSMPs [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]), datasets (Croissant ML [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]), and ML models (FAIR4ML [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]).
        </p>
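As a minimal sketch of what such generic annotation looks like in practice, the snippet below builds a Schema.org JSON-LD description of a dataset. Since Schema.org has no native class for ML models (as noted above), the example falls back on the generic Dataset type; all names and values are illustrative assumptions.

```python
import json

# Sketch: describing an ML artifact with generic Schema.org terms in JSON-LD.
# The dataset name, creator, and license are hypothetical placeholders.
record = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "example-sentiment-corpus",
    "description": "A small illustrative text-classification dataset.",
    "license": "https://spdx.org/licenses/CC-BY-4.0",
    "creator": {"@type": "Organization", "name": "Example Lab"},
    "keywords": ["machine learning", "text classification"],
}

jsonld = json.dumps(record, indent=2)
print(jsonld)
```

Note how the ML-specific aspects (task, architecture, metrics) have no obvious home here, which is exactly the gap the extensions above target.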
        <p>DataCite is a prominent metadata schema designed for the citation and identification of research
artifacts, including datasets and software (CSS2, CSS5). It captures administrative and provenance
metadata such as creator, publisher, DOI and publication date (CSS4), but lacks technical or
ML-aware descriptors (¬CSS1). Its semantic depth is limited as it relies on key-value pairs (¬CSS3),
and its rigid schema complicates extensions for ML purposes. However, its integration with persistent
identifier infrastructures makes it essential for ensuring the citability and long-term preservation of
research artifacts.
(Notation: CSSi indicates full support for criterion i; ≈CSSi indicates partial support; ¬CSSi indicates lack of support.)
DCAT is a widely implemented vocabulary for describing digital resources and dataset catalogs (CSS2,
CSS5). It defines core classes such as dcat:Dataset, dcat:Catalog, and dcat:Distribution,
supporting discoverability and interoperability across data catalogs (CSS3, CSS4). However, it is not
designed for ML artifacts (¬CSS1) and does not accommodate extensions for tasks, models, or evaluation.
It therefore functions better as a bridging vocabulary than as a standalone solution for ML-specific
metadata integration.</p>
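To illustrate the core classes just mentioned, the following sketch assembles a minimal DCAT description of a dataset with one distribution, serialized as JSON-LD. The IRIs, titles, and file details are invented placeholders.

```python
import json

# Sketch: a minimal DCAT dataset with one distribution, as JSON-LD.
# All identifiers and values below are hypothetical.
doc = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@id": "https://example.org/dataset/1",
    "@type": "dcat:Dataset",
    "dct:title": "Example training corpus",
    "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dcat:downloadURL": "https://example.org/dataset/1/data.csv",
        "dct:format": "text/csv",
    },
}

print(json.dumps(doc, indent=2))
```

The catalog-level vocabulary covers access and format well, but nothing here can express a training task or an evaluation metric, which is why DCAT works best as a bridge.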
      </sec>
      <sec id="sec-3-3">
        <title>3.3. ML-Specific Metadata Standards</title>
        <p>The limitations of general-purpose standards in describing ML-specific artifacts have led to the
emergence of dedicated metadata standards designed to capture the unique semantics of ML, offering greater
granularity, semantic expressiveness, and task-specific coverage.</p>
        <p>
          Croissant ML [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] is a JSON-LD metadata specification developed by Google to describe ML
datasets (CSS1). It provides detailed structural descriptions for dataset components, including
features, files, licenses, and schema types, enabling machine-actionable metadata and alignment with FAIR
principles (CSS3, CSS4). Though still emerging (CSS2), it demonstrates strong extensibility and is built
upon mature vocabularies and schemas like Schema.org (≈CSS5; see Figure 2).
        </p>
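The sketch below shows the rough shape of a Croissant-style dataset description. It is loosely modeled on published Croissant examples; the exact context, conformance URL, and property names should be checked against the specification, and all dataset-specific values are invented.

```python
import json

# Sketch: approximate structure of a Croissant ML dataset record.
# Property names are assumptions based on published examples, not a
# verified instance of the specification.
croissant = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "example-corpus",
    "conformsTo": "http://mlcommons.org/croissant/1.0",
    "distribution": [
        {"@type": "FileObject", "name": "data.csv", "encodingFormat": "text/csv"}
    ],
    "recordSet": [
        {
            "name": "records",
            "field": [
                {"name": "text", "dataType": "Text"},
                {"name": "label", "dataType": "Text"},
            ],
        }
    ],
}

print(json.dumps(croissant, indent=2))
```

The record-set and field structure is what distinguishes Croissant from plain Schema.org: it describes the internal shape of the data, not just its catalog-level properties.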
        <p>
          Model Cards [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] are semi-structured documents designed to communicate the usage scenarios,
limitations, evaluation metrics, and ethical considerations of ML models (CSS1, CSS4). They have
moderate adoption in applied ML communities (CSS2), especially in platforms like Hugging Face.
However, Model Cards are primarily designed for human interpretation and lack a formalized schema or
machine-actionable structure, limiting their semantic expressiveness and integration (¬CSS3, ¬CSS5).
        </p>
        <p>
          FAIR4ML (https://w3id.org/fair4ml) [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] is an ontology-based extension of Schema.org designed to enhance the FAIRness of
ML model documentation (CSS1, CSS3, CSS4). It introduces semantically precise terms for modeling
evaluation metrics, intended applications, and tasks, thereby enabling machine-actionable metadata
across ML workflows. Despite its rich semantic expressiveness, it is relatively new with limited adoption
in mainstream ML platforms (¬CSS2), and its tooling ecosystem is still developing (¬CSS5).
        </p>
        <p>
          ML Schema (https://ml-schema.github.io) is a lightweight, extensible vocabulary for describing ML experiments, models,
algorithms, and metrics (CSS1, CSS3). It enables semantic annotation of ML workflows and supports
integration with linked data infrastructures, aligning well with FAIR principles (CSS4). While
semantically expressive and conceptually accessible, ML Schema has seen limited adoption in major ML
platforms (¬CSS2), and its tooling ecosystem is still underdeveloped, with minimal support (¬CSS5).
        </p>
        <p>
          OntoDM-core [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] is a foundational ontology for the data mining domain that provides a formal
representation of core concepts such as data, tasks, algorithms, models, and results (CSS1, CSS3, CSS5).
It supports logical inference and reuse across domains. However, it has limited uptake in mainstream
ML platforms (¬CSS2) and only partial alignment with FAIR practices (≈CSS4).
        </p>
        <p>
          Expose Ontology [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] focuses on modeling the experimental design and execution of data analysis
processes, with a particular emphasis on provenance (CSS1, CSS3). It is semantically rich and
complements PROV-O (https://www.w3.org/TR/prov-o/), but has seen little adoption outside specific projects (¬CSS2) and lacks broader
tooling support (¬CSS5). Its alignment with FAIR is partial (≈CSS4).
        </p>
        <p>
          DMOP [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] addresses the optimization dimension of data mining and ML workflows, especially in
relation to algorithm selection, hyperparameter tuning, and performance evaluation (CSS1, CSS3, CSS5).
DMOP is semantically expressive and extensible, making it suitable for modeling complex ML pipelines
and adaptations in domains like AutoML or meta-learning. However, its adoption is niche (¬CSS2),
and its alignment with FAIR principles lacks sufficient support (¬CSS4).
        </p>
        <p>
          MEX [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] offers a lightweight vocabulary for describing ML experiments, including datasets, algorithms,
hyperparameters, and results (CSS1, CSS3, CSS4). MEX has not seen significant adoption in mainstream
ML repositories or tools (¬CSS2). Its formal specification is available, but the ecosystem around
implementation, tooling, and maintenance remains limited (¬CSS5).
        </p>
        <p>MEX
Summary. This section outlines the distinction between general-purpose and ML-specific metadata
standards. Mature and widely adopted standards such as Schema.org, DataCite, and DCAT support
discoverability and citation but lack the semantic richness and extensibility required to describe
ML-specific artifacts, including models, training configurations, and evaluations. In contrast, ML-specific
standards such as Croissant, FAIR4ML (https://w3id.org/fair4ml), ML Schema (https://ml-schema.github.io),
and Model Cards address these gaps but remain in early stages of adoption, with limited tooling.
Ontology-driven approaches like OntoDM-core, Expose, DMOP, and MEX provide formal representations
and reasoning capabilities but vary in scope and maturity. As expected, no single standard fully satisfies
all requirements. Figure 2 presents a comparative evaluation based on the criteria defined in Section 3.1,
showing that general-purpose standards score well on adoption and maturity (CSS2, CSS5) but poorly on
ML relevance (CSS1), while ML-specific and ontology-driven approaches provide higher semantic
expressiveness (CSS3) yet limited adoption (CSS2). Moreover, a summary of these standards, covering
primary focus, semantic expressiveness, machine-actionability, support for provenance, support for
ethical/bias information, and extensibility, is provided in Table 1 in the Appendix.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Metadata Harmonization</title>
      <p>The heterogeneity and fragmentation of metadata practices across ML platforms and standards introduce
substantial barriers to interoperability. Even when structured metadata is used, it is often implemented
with divergent schemas, inconsistent vocabularies, and varying levels of detail. For example, platforms
such as GitHub, Zenodo, Hugging Face, and OpenML differ in how they represent authorship, tasks,
licensing, and performance metrics, making direct alignment difficult without additional normalization
or transformation. These inconsistencies hinder the seamless integration and reuse of metadata across
systems, complicate downstream applications that rely on unified metadata, and ultimately reduce
the discoverability, traceability, and reproducibility of ML artifacts. Addressing these issues requires
effective strategies for metadata extraction, mapping, and harmonization across heterogeneous sources.</p>
      <sec id="sec-4-1">
        <title>4.1. Challenges in Metadata Harmonization</title>
        <p>Harmonizing metadata from heterogeneous platforms introduces several challenges:
• Schema Heterogeneity: Different platforms adopt different data models, property labels, and
data types to describe the same concepts. For instance, the property referring to the creator of a
model may be labeled author in Schema.org and creator in DataCite.
• Vocabulary Inconsistencies: Even when schemas are conceptually aligned or share a common
vocabulary, communities may interpret the same terms differently, leading to semantic drift. For
example, the same ML task may be labeled classification, categorization, or label prediction; conversely,
the term accuracy may refer to different evaluation protocols depending on context.</p>
        <p>These inconsistencies complicate cross-platform querying and semantic alignment.
• Granularity Mismatch: Metadata varies in depth and detail across platforms. Some platforms
provide high-level descriptors (e.g., model family and task domain), while others offer
fine-grained specifications such as hyperparameters, environment settings, or evaluation protocols.
For example, a large language model may be referred to simply as LLaMA, or more precisely as
LLaMA-7B or LLaMA-13B, each with distinct architectures and training settings. Aligning across
these granularity levels requires careful abstraction or enrichment strategies.
• Semantic Ambiguity: Some commonly used terms are themselves semantically vague or
overloaded. For example, accuracy may refer to different evaluation metrics (e.g., Top-1 vs. Top-5),
data splits (e.g., test vs. validation), or output settings (e.g., single-label vs. multi-label), unless
explicitly defined.
• Unstructured Metadata: Crucial metadata is often embedded in free-text sources such as
README.md files, model cards, or publications. While information extraction (IE) techniques can
be applied, the process does not always yield structured outputs that are complete or reliable
enough for downstream integration tasks.
• Provenance Gaps: Many metadata records lack information about the origin and transformation
history of datasets and models, limiting trust and traceability.</p>
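        <p>To make the vocabulary problem concrete, the simplest mitigation is a controlled synonym table mapping platform-specific task labels to a canonical term. The following Python sketch is illustrative only; the synonym entries are hypothetical and not drawn from any actual platform vocabulary:</p>
        <preformat>
```python
# Illustrative synonym table mapping platform-specific task labels to a
# canonical term; the entries are hypothetical examples, not an actual
# platform vocabulary.
CANONICAL_TASKS = {
    "classification": "classification",
    "categorization": "classification",
    "label prediction": "classification",
}

def normalize_task(label):
    """Return the canonical task name, or None if the label is unknown."""
    return CANONICAL_TASKS.get(label.strip().lower())

print(normalize_task("Categorization"))    # prints: classification
print(normalize_task("Label Prediction"))  # prints: classification
```
        </preformat>
        <p>Such tables address only exact synonymy; resolving granularity mismatches (e.g., LLaMA vs. LLaMA-7B) additionally requires hierarchical abstraction or enrichment, as noted above.</p>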
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Techniques for Metadata Harmonization</title>
        <p>
          Metadata Mapping and Crosswalks. A common approach to dealing with metadata heterogeneity
is to map concepts from one standard to another. Metadata mapping applies to a broad
range of cases, from less semantic approaches (e.g., DataCite), to non-semantic ones (e.g., model cards
or schemas used internally by a specific platform), to fully semantic ones (ontologies). In line with this
idea, metadata crosswalks [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] are commonly used to map metadata. Usually manually defined, they
map properties across different schemas and are often managed and structured as spreadsheets. Metadata
crosswalks provide interpretability and flexibility, but they are labor-intensive, error-prone, and difficult
to scale. Moreover, they lack formal semantics, which limits their utility in Linked Data and automated
environments.
        </p>
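        <p>A crosswalk of the kind described above can be sketched as a plain lookup table applied record by record. The creator/author property pair follows the DataCite/Schema.org example in the text; the remaining entries and the sample record are invented for illustration:</p>
        <preformat>
```python
# A minimal crosswalk: a lookup table renaming source properties to
# target properties. Unmapped properties keep their original name.
# The creator/author pair follows the text; the other entries and the
# sample record are illustrative.
CROSSWALK = {  # DataCite property -> Schema.org property
    "creator": "author",
    "title": "name",
    "publicationYear": "datePublished",
}

def apply_crosswalk(record, crosswalk):
    """Rename known properties; pass unmapped ones through unchanged."""
    return {crosswalk.get(key, key): value for key, value in record.items()}

datacite_record = {"creator": "J. Doe", "title": "Example dataset", "rights": "CC-BY-4.0"}
print(apply_crosswalk(datacite_record, CROSSWALK))
# {'author': 'J. Doe', 'name': 'Example dataset', 'rights': 'CC-BY-4.0'}
```
        </preformat>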
        <p>
          To improve traditional methods, structured mapping frameworks like the Simple Standard for Sharing
Ontology Mappings (SSSOM) [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] have been developed. SSSOM enables formal, machine-readable
mappings with metadata for provenance, alignment type (e.g., exact, broader), and confidence scores.
For instance, the concept DataCite:creator can be mapped as skos:exactMatch to the concept
schema:author. SSSOM-style mappings are being piloted in tools like the NFDI4DS QA service [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]
to enable structured integration across metadata schemes. A modified version is being used to
update the CodeMeta crosswalks.
        </p>
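        <p>For illustration, the creator/author mapping above can be expressed as an SSSOM-style TSV row. The sketch below uses only the Python standard library; the columns are a subset of those defined by the SSSOM specification, and the confidence value is illustrative:</p>
        <preformat>
```python
import csv
import io

# One SSSOM-style mapping record: DataCite's creator is an exact match
# for Schema.org's author. The column names follow the SSSOM spec; the
# confidence value is illustrative.
mappings = [
    {
        "subject_id": "datacite:creator",
        "predicate_id": "skos:exactMatch",
        "object_id": "schema:author",
        "mapping_justification": "semapv:ManualMappingCuration",
        "confidence": "1.0",
    },
]

# Serialize to tab-separated values, the exchange format used by SSSOM.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(mappings[0]), delimiter="\t")
writer.writeheader()
writer.writerows(mappings)
tsv = buffer.getvalue()
print(tsv)
```
        </preformat>
        <p>Because each mapping carries an explicit predicate, justification, and confidence, downstream tools can consume it automatically rather than interpreting a free-form spreadsheet.</p>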
        <p>Despite these advances, large-scale adoption of semantic crosswalks remains limited. The creation of
high-quality semantically precise mappings still relies heavily on expert curation, which is a bottleneck
for scalability.</p>
        <p>
          Automated Extraction and Harmonization. Automated techniques enhance scalability and
consistency in metadata harmonization by reducing manual effort. NLP methods, such as named entity
recognition, relation extraction, and classification, are widely used to extract structured metadata
from sources like README files, model cards, and research papers [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. These approaches identify
entities (e.g., models, datasets), relationships (e.g., "trained on"), and attributes (e.g., license, metrics).
Schema matching and ontology alignment aim to reconcile concepts across metadata schemas using
lexical, structural, or instance-based similarity. Recent approaches incorporate embeddings and LLMs to
improve semantic alignment [
          <xref ref-type="bibr" rid="ref27 ref28">27, 28</xref>
          ]. Following extraction and alignment, validation and normalization
ensure semantic consistency across metadata sources [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], using rule-based constraints or
ontology-aware checks. While automation improves efficiency, expert oversight remains essential for accuracy
and explainability in integrated metadata pipelines.
        </p>
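        <p>As a toy example of rule-based extraction, the sketch below pulls a license and a training dataset out of a README-like snippet. Both the snippet and the patterns are hypothetical; production pipelines rely on trained NER and relation-extraction models rather than regular expressions alone:</p>
        <preformat>
```python
import re

# Hypothetical README snippet; the text and patterns are illustrative,
# not taken from a real repository.
readme = """
my-model
License: Apache-2.0
This model was trained on the SQuAD dataset.
"""

license_match = re.search(r"License:\s*([\w.\-]+)", readme)
dataset_match = re.search(r"trained on the (\w+) dataset", readme)

# Fall back to None when a pattern does not match, so downstream
# consumers can detect missing fields.
metadata = {
    "license": license_match.group(1) if license_match else None,
    "trained_on": dataset_match.group(1) if dataset_match else None,
}
print(metadata)  # {'license': 'Apache-2.0', 'trained_on': 'SQuAD'}
```
        </preformat>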
        <p>Summary. Metadata harmonization is a persistent challenge in ML due to fragmented schemas,
inconsistent vocabularies, and missing provenance. Manual approaches like crosswalks and SSSOM
offer structured mappings but require expert effort. Automated methods, such as NLP-based extraction,
schema matching, and validation, enhance scalability but still face issues of accuracy and explainability.
A hybrid approach that combines semantic precision, automation, and expert oversight is essential to
build interoperable and reusable metadata infrastructures.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Gaps, Limitations, and Research Opportunities</title>
      <p>Despite ongoing efforts to formalize metadata practices in ML, the current landscape remains fragmented
and underdeveloped in several critical areas, opening compelling directions for future research.</p>
      <sec id="sec-5-1">
        <title>5.1. Current Gaps and Limitations</title>
        <p>A major limitation is the absence of a unified, comprehensive metadata standard tailored to the full
lifecycle of ML artifacts. While efforts such as FAIR4ML and ML Schema have made notable progress,
no widely adopted standard exists yet that captures the full range of ML entities, including models,
datasets, evaluation metrics, workflows, and ethical dimensions, in an integrated and expressive way.
Another persistent challenge lies in the representation of dynamic metadata. ML models and datasets
are inherently mutable, often updated through processes such as fine-tuning, retraining, or automated
CI/CD pipelines. Existing metadata standards are largely static in design and tend to focus on fixed
snapshots of artifacts. As a result, they provide limited support for describing evolving provenance,
behavioral shifts, or version histories in a machine-actionable and consistent manner.</p>
        <p>Scalability presents an additional concern. Current integration strategies often rely on manual
curation or semi-automated tools that do not scale effectively to the growing volume and diversity of
ML artifacts across platforms. Furthermore, automated and robust metadata integration pipelines remain
in an early stage of development. Finally, while bias documentation is becoming more standardized,
broader aspects of ethical metadata, including privacy, safety, explainability, and societal impact, are
not yet consistently represented across standards or platforms. At this point, a general-purpose ethical
metadata vocabulary for ML is still lacking.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Future Research Directions</title>
        <p>Addressing these limitations calls for several targeted lines of investigation. First, there is a growing
need for methods to automate metadata extraction and generation. Future research should leverage
advances in NLP, code analysis, and LLMs to infer structured metadata from documentation, code
repositories, and execution traces, thereby reducing reliance on manual input. Second, the development
of semantic interoperability frameworks for ML is crucial. Such frameworks should combine extensible
terminology solutions (ontologies, vocabularies, etc.), possibly with Linked Data principles for machine
readability, to enable automated alignment and querying across heterogeneous platforms. Research is
particularly needed in automated ontology matching and mapping tailored to the ML domain, which
remains underexplored. Third, standardization efforts around ethical AI metadata should be expanded.
This includes formalizing descriptors for fairness, transparency, accountability, and explainability,
potentially as extensions to existing efforts like Model Cards. These standards should aim to be both
human-interpretable and machine-actionable. Lastly, future work must address the management of
dynamic and versioned metadata. Novel models and infrastructures are required to track the evolution
of ML artifacts over time, capturing temporal and contextual changes in training data, hyperparameters,
and performance outcomes.</p>
        <p>Together, these research directions represent a roadmap toward more robust, scalable, and ethically
grounded metadata ecosystems for ML.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>The growing availability of ML models and datasets highlights the compelling need for structured,
standardized metadata supporting research artifacts across repositories. This survey analyzed metadata
practices and standards for ML artifacts, identifying challenges such as schema heterogeneity,
inconsistent granularity, and unstructured documentation that impede integration and semantic interoperability.
Strategies such as schema alignments, crosswalks, and the use of shared conceptual models enable
integration and semantic interoperability across platforms, supporting consistent interpretation of
metadata despite underlying heterogeneity. While initiatives like Model Cards and FAIR4ML show
promise, gaps remain in unified ontologies, dynamic metadata management, and automated tooling.
Addressing these limitations requires sustained community effort, not only in developing and adopting
robust metadata standards, but in establishing a uniform conceptualization of the ML life cycle. Such a
shared foundation would facilitate consistent mappings between standards and improve best practices
in applying them. This collective efort is essential for building scalable, FAIR-compliant metadata
infrastructures that support discovery, traceability, and reuse across the ML ecosystem.</p>
      <p>As a continuation of this survey, future work will focus on exploring the practical integration of
standardized metadata into downstream ML applications, such as KGs for semantic search and discovery.
This step will further enhance the insights provided by this survey by expanding the exploration of
how FAIR metadata supports automation, reproducibility, and knowledge discovery across the ML
ecosystem. This work will be addressed in a forthcoming version of the study with an extended scope.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgment</title>
      <p>This work has been partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research
Foundation), NFDI4DS (Grant number 460234259). Authors acknowledge the data sources and also
thank the individuals involved in this research.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT for grammar and spelling
checking. After using this tool/service, the authors reviewed and edited the content as needed and take
full responsibility for the publication’s content.</p>
      <p>A. A Summary of Existing Standards</p>
      <p>[Table 1: Summary of the surveyed standards, covering primary focus, semantic expressiveness, machine-actionability, support for provenance, support for ethical/bias information, and extensibility.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schlegel</surname>
          </string-name>
          , K.-U. Sattler,
          <article-title>Management of machine learning lifecycle artifacts: A survey</article-title>
          ,
          <source>ACM SIGMOD Record</source>
          <volume>51</volume>
          (
          <year>2023</year>
          )
          <fpage>18</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bozzon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Katsifodimos</surname>
          </string-name>
          ,
          <article-title>Metadata representations for queryable ml model zoos</article-title>
          ,
          <source>arXiv preprint arXiv:2207.09315</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S. G.</given-names>
            <surname>Labou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J. S.</given-names>
            <surname>Yoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Baluja</surname>
          </string-name>
          ,
          <article-title>Toward enhanced reusability: A comparative analysis of metadata for machine learning objects and their characteristics in generalist and specialist repositories</article-title>
          ,
          <source>Journal of eScience Librarianship</source>
          <volume>13</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Katsifodimos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brambilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bozzon</surname>
          </string-name>
          ,
          <article-title>Metadata representations for queryable repositories of machine learning models</article-title>
          ,
          <source>IEEE Access 11</source>
          (
          <year>2023</year>
          )
          <fpage>125616</fpage>
          -
          <lpage>125630</lpage>
          . doi:10.1109/ACCESS.2023.3330647.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Wilkinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dumontier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. J.</given-names>
            <surname>Aalbersberg</surname>
          </string-name>
          , G. Appleton,
          <string-name>
            <given-names>M.</given-names>
            <surname>Axton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Baak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Blomberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-W.</given-names>
            <surname>Boiten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. B. da Silva</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. E.</given-names>
            <surname>Bourne</surname>
          </string-name>
          , et al.,
          <article-title>The fair guiding principles for scientific data management and stewardship</article-title>
          ,
          <source>Scientific data 3</source>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. I.</given-names>
            <surname>Hillmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Marker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Brady</surname>
          </string-name>
          ,
          <article-title>Metadata standards and applications</article-title>
          ,
          <source>The Serials Librarian</source>
          <volume>54</volume>
          (
          <year>2008</year>
          )
          <fpage>7</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Freire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Koop</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <article-title>Provenance for computational tasks: A survey</article-title>
          ,
          <source>Computing in Science &amp; Engineering</source>
          <volume>10</volume>
          (
          <year>2008</year>
          )
          <fpage>11</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Samuel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Löfler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>König-Ries</surname>
          </string-name>
          ,
          <article-title>Machine learning pipelines: Provenance, reproducibility and fair data principles</article-title>
          , in: B.
          <string-name>
            <surname>Glavic</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Braganholo</surname>
          </string-name>
          , D. Koop (Eds.),
          <source>Provenance and Annotation of Data and Processes</source>
          , Springer International Publishing, Cham,
          <year>2021</year>
          , pp.
          <fpage>226</fpage>
          -
          <lpage>230</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Limani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tofik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Latif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tochtermann</surname>
          </string-name>
          ,
          <article-title>Fair for machine learning model: Principles and assessment metrics</article-title>
          ,
          <year>2024</year>
          . URL: https://doi.org/10.5281/zenodo.13835105. doi:10.5281/zenodo.13835105.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hulsebos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parameswaran</surname>
          </string-name>
          ,
          <article-title>It took longer than i was expecting: Why is dataset search still so hard?</article-title>
          ,
          <source>in: Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics, HILDA</source>
          <volume>24</volume>
          ,
          Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , p.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          . URL: https://doi.org/10.1145/3665939.3665959. doi:10.1145/3665939.3665959.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Matsubara</surname>
          </string-name>
          ,
          <article-title>Estimating metadata of research artifacts to enhance their findability</article-title>
          ,
          <source>in: 2024 IEEE 20th International Conference on e-Science (e-Science)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>2</lpage>
          . doi:10.1109/e-Science62913.2024.10678684.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Akhtar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Benjelloun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Conforti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Foschini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Giner-Miguelez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gijsbers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goswami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Karamousadakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kuchnik</surname>
          </string-name>
          , et al.,
          <article-title>Croissant: A metadata format for ml-ready datasets</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>37</volume>
          (
          <year>2024</year>
          )
          <fpage>82133</fpage>
          -
          <lpage>82148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Solanki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Quiñones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rebholz-Schuhmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jael</surname>
          </string-name>
          ,
          <article-title>MLentory, an FDO registry for machine learning models</article-title>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Payne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Verhey</surname>
          </string-name>
          ,
          <article-title>Schema.org for research data managers: a primer</article-title>
          ,
          <source>International Journal of Big Data Management</source>
          <volume>2</volume>
          (
          <year>2022</year>
          )
          <fpage>95</fpage>
          -
          <lpage>116</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Boettiger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Mayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Slaughter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Niemeyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fenner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nowak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hahnel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Coy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Allen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Crosas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sands</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. C.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cruse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Katz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Goble</surname>
          </string-name>
          ,
          <article-title>CodeMeta: an exchange schema for software metadata</article-title>
          .
          <source>KNB Data Repository</source>
          ,
          <year>2016</year>
          . URL: https://raw.githubusercontent.com/codemeta/codemeta/1.0/codemeta.jsonld. doi:10.5063/SCHEMA/CODEMETA-1.0, medium: application/ld+json.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Giraldo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Geist</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Quiñones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Solanki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rebholz-Schuhmann</surname>
          </string-name>
          ,
          <source>machine-actionable Software Management Plan Ontology (maSMP Ontology)</source>
          ,
          <year>2024</year>
          . URL: https://zenodo.org/records/10582073. doi:10.5281/zenodo.10582073, publisher: Zenodo.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garijo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rebholz-Schuhmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Solanki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Ciuciu-Kiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Katz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Eklund</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Bharathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. D. A. F.</given-names>
            <surname>Task</surname>
          </string-name>
          ,
          <source>FAIR4ML-schema</source>
          ,
          <year>2024</year>
          . URL: https://zenodo.org/records/14002310. doi:10.5281/zenodo.14002310, publisher: Zenodo.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaldivar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barnes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vasserman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hutchinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Spitzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. D.</given-names>
            <surname>Raji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gebru</surname>
          </string-name>
          ,
          <article-title>Model cards for model reporting</article-title>
          ,
          <source>in: Proceedings of the conference on fairness, accountability, and transparency</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>220</fpage>
          -
          <lpage>229</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>P.</given-names>
            <surname>Panov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Soldatova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Džeroski</surname>
          </string-name>
          ,
          <article-title>Ontology of core data mining entities</article-title>
          ,
          <source>Data Mining and Knowledge Discovery</source>
          <volume>28</volume>
          (
          <year>2014</year>
          )
          <fpage>1222</fpage>
          -
          <lpage>1265</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanschoren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Soldatova</surname>
          </string-name>
          ,
          <article-title>Exposé: An ontology for data mining experiments</article-title>
          ,
          <source>in: International workshop on third generation data mining: Towards service-oriented knowledge discovery (SoKD-2010)</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>31</fpage>
          -
          <lpage>46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Keet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ławrynowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>d'Amato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kalousis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Palma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stevens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hilario</surname>
          </string-name>
          ,
          <article-title>The data mining optimization ontology</article-title>
          ,
          <source>Journal of Web Semantics</source>
          <volume>32</volume>
          (
          <year>2015</year>
          )
          <fpage>43</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>D.</given-names>
            <surname>Esteves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moussallem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. B.</given-names>
            <surname>Neto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Soru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Usbeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ackermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <article-title>MEX vocabulary: a lightweight interchange format for machine learning experiments</article-title>
          ,
          <source>in: Proceedings of the 11th International Conference on Semantic Systems</source>
          , ACM,
          <year>2015</year>
          , pp.
          <fpage>169</fpage>
          -
          <lpage>176</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Martínková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Juty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gonzalez-Beltran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Goble</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Le Franc</surname>
          </string-name>
          ,
          <article-title>Moving towards FAIR mappings and crosswalks</article-title>
          ,
          <source>in: FAIR principles for Ontologies and Metadata in Knowledge Management</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>N.</given-names>
            <surname>Matentzoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Balhof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Bello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brush</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Callahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. G.</given-names>
            <surname>Chute</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. D.</given-names>
            <surname>Duncan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Evelo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gabriel</surname>
          </string-name>
          , et al.,
          <article-title>A simple standard for sharing ontological mappings (SSSOM)</article-title>
          ,
          <source>Database</source>
          <volume>2022</volume>
          (
          <year>2022</year>
          )
          baac035
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>H. B.</given-names>
            <surname>Giglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Tafa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Abdullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Usmanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Usbeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>D'Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <article-title>Scholarly question answering using large language models in the NFDI4DataScience gateway</article-title>
          ,
          <source>Natural Scientific Language Processing and Research Knowledge Graphs</source>
          (
          <year>2024</year>
          )
          <fpage>3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Martinez-Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Lopez-Arevalo</surname>
          </string-name>
          ,
          <article-title>Information extraction meets the semantic web: a survey</article-title>
          ,
          <source>Semantic Web</source>
          <volume>11</volume>
          (
          <year>2020</year>
          )
          <fpage>255</fpage>
          -
          <lpage>335</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <article-title>OntoEA: Ontology-guided entity alignment via joint knowledge graph embedding</article-title>
          ,
          <source>arXiv preprint arXiv:2105.07688</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>K.</given-names>
            <surname>Higashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Nakagawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yamada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mori</surname>
          </string-name>
          ,
          <article-title>Automated harmonization and large-scale integration of heterogeneous biomedical sample metadata using large language models</article-title>
          ,
          <source>bioRxiv</source>
          (
          <year>2024</year>
          )
          <fpage>2024</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Horbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ross-Hellauer</surname>
          </string-name>
          ,
          <article-title>Open science at the generative ai turn: An exploratory analysis of challenges and opportunities</article-title>
          ,
          <source>Quantitative Science Studies</source>
          <volume>6</volume>
          (
          <year>2025</year>
          )
          <fpage>22</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>