<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>DOLAP</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Quality in Data Spaces</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Claudia P. Ayala</string-name>
          <email>claudia.ayala@upc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Besim Bilalli</string-name>
          <email>besim.bilalli@upc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristina Gómez</string-name>
          <email>cristina.gomez@upc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jose-Norberto Mazón</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oscar Romero</string-name>
          <email>oscar.romero@upc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Data Spaces, Data Quality, Data Validation, Federated Data Management, Data Sharing</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universitat Politècnica de Catalunya</institution>
          ,
          <addr-line>BarcelonaTech</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>27</volume>
      <abstract>
        <p>Data Spaces must preserve sovereignty and privacy while ensuring FAIR (Findable, Accessible, Interoperable and Reusable) principles. To do so, policy-based strategies have to be developed to describe the agreements reached in the Data Space. In this context, two open questions arise: how to define the right Data Space policies, and how to enforce (and monitor) them. Despite the efforts towards defining and enforcing data access and usage policies, there is no solution to operationalize the enforcement of policies that consider data quality dimensions. Yet data quality is becoming a hot topic due to the surge of federated learning and alternative analytical techniques, which require all providers to guarantee a data quality threshold in order to learn robust models. Currently, we have means to describe policies related to data quality rules (e.g., by combining standards such as ODRL with standard vocabularies), but we lack means to elicit these policies from data providers and enforce them while preserving data sovereignty. In this paper, we discuss the challenges and open questions that must be addressed in order to operationalize (and eventually automate) data quality in Data Spaces, which span from requirements elicitation to data validation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Data Spaces are federated ecosystems in which data
providers and consumers share data while preserving data
sovereignty and privacy. The Data Mesh
architecture [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is at the core of current technological solutions,
since it provides a domain-decentralized paradigm that suits
the Data Space requirements [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Notably, the Data Mesh
defines the Data Product concept, which provides a
product-oriented view of the providers’ data assets. In short, the data
product is a node that encapsulates three structural
components required to function: code for enforcing policies (i.e.,
the Data Space agreements), data (and its metadata) and
infrastructure [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. By definition, the providers’ data assets
can be heterogeneous both in the infrastructure used and
the data provided (in format and semantics).
      </p>
      <p>
        Behind the idea of Data Spaces is the objective of
extracting value from data sharing. This can be achieved in many
ways, but data analysis stands out as a prominent means of doing
so, either through descriptive analysis (e.g.,
dashboarding and OLAP) or predictive analysis (e.g., learning models).
However, how to achieve data analysis in federated
environments is an open challenge, and federated learning [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
is currently the most widespread privacy-aware data
analysis technique. Many efforts have been devoted to developing
robust federated learning, but little attention has been paid
to the role of data. Yet, the data quality (DQ) of each provider
has a major impact on the federated models learned [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ].
      </p>
      <p>DOLAP: International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, co-located with EDBT/ICDT
∗Corresponding author.</p>
      <p>https://futur.upc.edu/ClaudiaPatriciaAyalaMartinez (C. P. Ayala);
https://futur.upc.edu/BesimBilalli (B. Bilalli);
https://futur.upc.edu/CristinaGomezSeoane (C. Gómez);
https://s.ua.es/_MuH (J. Mazón);
https://futur.upc.edu/OscarRomeroMoral (O. Romero)</p>
      <p>ORCID: 0000-0002-6262-3698 (C. P. Ayala); 0000-0002-0575-2389 (B. Bilalli);
0000-0002-3872-0439 (C. Gómez); 0000-0001-7924-0880 (J. Mazón);
0000-0001-6350-8328 (O. Romero)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License</p>
      <p>
        One of the biggest open problems in Data Spaces that has not been
properly tackled is how the agreements reached (e.g., on
DQ) at the Data Space federated layer (i.e., at the federated,
unique view of the data ecosystem) can be enforced on
the providers’ data assets regardless of their heterogeneity
while preserving data ownership and privacy. Note that this
problem has been easily tackled in centralized environments
by having a central authority extracting, transforming and
preparing data for analysis. However, this is not possible in
settings where data is not meant to be shared raw. For
example, the minimum number of instances and the variances
of key attributes might be set as DQ criteria for all data
providers and should be automatically and locally validated
by executing a software service (specific for the provider
infrastructure) provided by the Data Space services catalog.
The result of the service execution should be communicated
to the Data Space. To our knowledge, there is no
architecture, framework or solution tackling this problem, despite
the myriad of standards and definitions blooming around
the Data Space concept (e.g., [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]).
      </p>
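      <p>The example above can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not part of the proposed framework: the names DQCriteria and validate_locally, the report layout, and the thresholds are all hypothetical. The point it illustrates is that the check runs at the provider and only an aggregate pass/fail report is communicated to the Data Space, never the raw rows.</p>
      <preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
from dataclasses import dataclass
from statistics import pvariance


@dataclass(frozen=True)
class DQCriteria:
    """Hypothetical DQ agreement set at the Data Space level."""
    min_instances: int
    min_variance: dict  # attribute name mapped to its minimum acceptable variance


def validate_locally(rows, criteria):
    """Run at the provider; returns only aggregate pass/fail flags."""
    checks = {"min_instances": len(rows) >= criteria.min_instances}
    for attr, threshold in criteria.min_variance.items():
        values = [row[attr] for row in rows]
        # Variance is undefined for an empty column; treat that as a failure.
        ok = len(values) > 0 and pvariance(values) >= threshold
        checks["variance:" + attr] = ok
    return {"passed": all(checks.values()), "checks": checks}


# Illustrative thresholds and data (assumptions, not taken from the paper).
criteria = DQCriteria(min_instances=3, min_variance={"age": 1.0})
rows = [{"age": 30}, {"age": 35}, {"age": 41}]
report = validate_locally(rows, criteria)
print(report)
```
]]></preformat>
      <p>In a real deployment the function body would be specific to the provider infrastructure, while the criteria and the report format would be fixed by the Data Space.</p>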
      <p>We focus on how to validate DQ agreements in the Data
Space and discuss the open challenges to make DQ happen
in Data Spaces to enact trustworthy federated learning.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Challenges and Vision</title>
      <p>
        Data Spaces require a governance model for specifying DQ
agreements that stakeholders must adhere to in order to
participate. Importantly, this governance model must also
specify the DQ needs agreed among data consumers and providers
when developing specific use cases. Therefore, our view
is that the governance model for Data Spaces should
distinguish two levels: 1) a Data Space level, for agreements
among stakeholders of the Data Space authority that stem from data
regulations and strategic issues, and 2) a use case level, for
agreements among data providers and consumers to build
specific Data Products. Based on this view, and to facilitate
the discussion, we propose a visionary framework with a
process for each of the Data Space and use case levels (see Fig. 1).
Our framework follows the Open Data Product
specification [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], thus splitting each process into two parts: one
declarative, at a higher level of abstraction, specifying what
(analysis phase), and another executable, at a lower level, specifying
how (design and implementation phases). The declarative
part defines the DQ dimensions and intended level. The
executable part contains the machine-readable “as code” rules,
provided as a service, to validate DQ dimensions. Next, we
describe both processes and their main challenges.</p>
      <p>DQ Requirements Engineering for Data Spaces.</p>
      <p>
        Requirements engineering (RE) for complex systems in
open and dynamic environments that extend beyond a
single organization is widely recognized as a challenging
endeavor [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. This is particularly true in the context of
Data Spaces, where the elicitation and management of
requirements must reconcile diverse perspectives, including
the strategic business vision, governance, compliance with
laws and regulations, infrastructure, scalability demands,
and DQ considerations. Our visionary framework proposes
applying RE practices to elicit, specify, and manage the Data
Space requirements. We advocate for the development and
use of a Catalogue of DQ Requirements at two levels: the
Data Space level and the use case level. These catalogues
promote knowledge sharing and requirements reuse, building a
robust repository of experiences and best practices. The
proposed process aims to: 1) ensure a common
understanding of DQ dimensions by considering established standards;
2) facilitate the elicitation of DQ requirements from
diverse stakeholders to enable effective data sharing; 3)
support the structured specification and management of DQ
requirements to ensure compliance and alignment between
the Data Space and use case levels for their subsequent
operationalization; and 4) address trade-offs between conflicting
DQ requirements. This approach aims to bridge the gap
between diverse stakeholder perspectives and the technical
requirements for robust DQ management in Data Spaces.
      </p>
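      <p>To make the alignment between the two levels more concrete, the following is a minimal sketch in Python. It assumes, purely for illustration, that each catalogue entry pairs a DQ dimension with a minimum level on a 0 to 1 scale and that a use case may tighten but not weaken a Data Space requirement. The names DQRequirement and effective_requirements are hypothetical, and real catalogues would carry far richer information.</p>
      <preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DQRequirement:
    """Hypothetical catalogue entry: a DQ dimension and its intended level."""
    dimension: str    # e.g. "completeness" (assumed to follow a shared standard)
    min_level: float  # assumed scale: 0.0 (worst) to 1.0 (best)


def effective_requirements(space_level, use_case_level):
    """Merge both catalogues; report use case entries that weaken the Data Space."""
    merged = {r.dimension: r.min_level for r in space_level}
    conflicts = []
    for req in use_case_level:
        baseline = merged.get(req.dimension)
        if baseline is not None and req.min_level < baseline:
            conflicts.append(req.dimension)  # needs negotiation, not a silent override
        else:
            merged[req.dimension] = req.min_level
    return merged, conflicts


space = [DQRequirement("completeness", 0.90)]
use_case = [DQRequirement("completeness", 0.95), DQRequirement("timeliness", 0.80)]
merged, conflicts = effective_requirements(space, use_case)
print(merged, conflicts)
```
]]></preformat>
      <p>Conflicts are returned rather than resolved automatically, since addressing trade-offs between conflicting requirements is a stakeholder activity.</p>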
      <p>Extraction and Customization of DQ Rules. The
complexity of DQ requirements and their textual or
semi-structured formalization make their direct
operationalization challenging. To make DQ requirements
executable in an operational environment, our visionary
framework proposes to transform DQ requirements (at the Data Space and use case levels)
into formalized DQ rules that can be easily implemented. This transformation is
semi-automated and supported by specific catalogues.
We propose using a rule language with well-defined
semantics (e.g., ODRL) to formalize DQ rules. Several challenges
need to be tackled when performing this transformation: 1)
the identification of relevant and suitable stakeholders with
the specific knowledge for performing this activity at both
levels; 2) the definition of specific catalogues with reusable
transformation patterns for translating DQ requirements
into rules while preserving their semantics; and 3) the definition of the
artifacts needed (e.g., specialized metamodels or new ODRL
profiles) for automating the extraction and customization
of DQ rules to the specific domain and level.</p>
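      <p>As an illustration of what a formalized DQ rule might look like, the sketch below builds a simplified ODRL-style policy in Python and checks measured values against it. It is one possible encoding, not the one prescribed by the framework: the "dq:" left operands are hypothetical and would have to be defined in an ODRL profile, the identifiers under example.com are placeholders, and only a subset of ODRL operators is handled.</p>
      <preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
import json
import operator

# Subset of ODRL operator names mapped to Python comparisons.
ODRL_OPERATORS = {"gteq": operator.ge, "lteq": operator.le, "eq": operator.eq}


def to_policy(target, rules):
    """Build an ODRL-style policy; the "dq:" left operands are hypothetical."""
    constraints = [
        {"leftOperand": left, "operator": op, "rightOperand": right}
        for (left, op, right) in rules
    ]
    return {
        "@context": "http://www.w3.org/ns/odrl.jsonld",
        "@type": "Set",
        "uid": "http://example.com/policy/dq-1",  # placeholder identifier
        "permission": [
            {"target": target, "action": "use", "constraint": constraints}
        ],
    }


def satisfied(policy, measured):
    """Check measured DQ values against every constraint in the policy."""
    return all(
        ODRL_OPERATORS[c["operator"]](measured[c["leftOperand"]], c["rightOperand"])
        for perm in policy["permission"]
        for c in perm["constraint"]
    )


policy = to_policy(
    "http://example.com/data-product/1",
    [("dq:instanceCount", "gteq", 1000), ("dq:missingRatio", "lteq", 0.05)],
)
print(json.dumps(policy, indent=2))
print(satisfied(policy, {"dq:instanceCount": 1200, "dq:missingRatio": 0.01}))
```
]]></preformat>
      <p>A reusable transformation pattern, in this reading, would be a function such as to_policy that maps a class of DQ requirements to constraints while keeping their meaning.</p>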
      <p>Implementation available as a Service of DQ Rules.</p>
      <p>The inherent heterogeneity of providers in
Data Spaces makes translating formal
DQ rules into executable services a significant challenge.
The main goal of this activity is to avoid building and
maintaining custom solutions that are tightly coupled to specific
execution environments or platforms. To address this, we
propose a platform-agnostic solution that leverages best practices
from software engineering, such as containerization,
to ensure portability, scalability, and interoperability.
However, the intrinsic characteristics of Data Spaces introduce
several challenges that must be addressed: 1) dealing with
heterogeneity at the infrastructure level by abstracting the
differences while ensuring consistent performance and
security across environments; and 2) allowing for dynamic and
federated execution across multiple distributed nodes, ensuring
real-time validation without requiring data centralization.</p>
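      <p>One way to abstract infrastructure differences is an adapter layer between the DQ rule and the provider's storage. The following Python sketch is only an assumption about how this could look: ProviderBackend, ListBackend, SqliteBackend and completeness are hypothetical names, and packaging the rule in a container is left out. The same rule runs unchanged over an in-memory list and over a SQLite table.</p>
      <preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
import sqlite3
from typing import Protocol


class ProviderBackend(Protocol):
    """Hypothetical adapter: the only surface a DQ rule sees."""
    def count_rows(self) -> int: ...
    def count_missing(self, column: str) -> int: ...


class ListBackend:
    """Provider whose data lives in an in-memory list of dicts."""
    def __init__(self, rows):
        self.rows = rows

    def count_rows(self):
        return len(self.rows)

    def count_missing(self, column):
        return sum(1 for row in self.rows if row.get(column) is None)


class SqliteBackend:
    """Provider whose data lives in a SQLite table."""
    def __init__(self, conn, table):
        self.conn, self.table = conn, table  # identifiers are trusted in this sketch

    def count_rows(self):
        return self.conn.execute(f"SELECT COUNT(*) FROM {self.table}").fetchone()[0]

    def count_missing(self, column):
        query = f"SELECT COUNT(*) FROM {self.table} WHERE {column} IS NULL"
        return self.conn.execute(query).fetchone()[0]


def completeness(backend: ProviderBackend, column: str) -> float:
    """Same rule for every provider: share of non-missing values."""
    total = backend.count_rows()
    return 1.0 if total == 0 else 1 - backend.count_missing(column) / total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (age INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(30,), (None,), (41,), (35,)])
rows = [{"age": 30}, {"age": None}, {"age": 41}, {"age": 35}]
from_list = completeness(ListBackend(rows), "age")
from_sqlite = completeness(SqliteBackend(conn, "t"), "age")
print(from_list, from_sqlite)
```
]]></preformat>
      <p>Only the backend classes depend on the provider; the rule itself stays the same, which is what keeps the solution from being coupled to a specific platform.</p>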
      <p>In conclusion, further research is needed to
enact DQ in Data Spaces, a prerequisite for high-quality federated
data analysis. To this end, we have discussed a visionary
framework, its main phases, and the challenges to be tackled.</p>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <p>This work has been partially supported by the
EU-HORIZON program under GA.101135513 (CYCLOPS) and
by the CIAICO/2022/019 project from Generalitat Valenciana.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Goedegebuure</surname>
          </string-name>
          , I. Kumara,
          <string-name>
            <given-names>S.</given-names>
            <surname>Driessen</surname>
          </string-name>
          , W.-J. Van Den Heuvel, G. Monsieur,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Tamburri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Nucci</surname>
          </string-name>
          ,
          <article-title>Data mesh: a systematic gray literature review</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>57</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bacco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kocian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chessa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Crivello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barsocchi</surname>
          </string-name>
          ,
          <article-title>What are data spaces? Systematic survey and future outlook</article-title>
          ,
          <source>Data in Brief</source>
          <volume>57</volume>
          (
          <year>2024</year>
          )
          <fpage>110969</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <source>Data Mesh: Delivering Data-driven Value at Scale</source>
          ,
          <publisher-name>O'Reilly</publisher-name>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>McMahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hampson</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. A. y Arcas</surname>
          </string-name>
          ,
          <article-title>Communication-efficient learning of deep networks from decentralized data</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 20th International Conference on Artificial Intelligence and Statistics</source>
          , AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, volume
          <volume>54</volume>
          of
          <series>Proceedings of Machine Learning Research</series>
          , PMLR,
          <year>2017</year>
          , pp.
          <fpage>1273</fpage>
          -
          <lpage>1282</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Sahu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaheer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanjabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Talwalkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <article-title>Federated optimization in heterogeneous networks</article-title>
          , in: I. S. Dhillon,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Papailiopoulos</surname>
          </string-name>
          , V. Sze (Eds.),
          <source>Proceedings of the Third Conference on Machine Learning and Systems</source>
          , MLSys 2020, Austin, TX, USA, March 2-4, 2020, mlsys.org,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Nagalapatti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guttula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mujumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Afzal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Sharma</given-names>
            <surname>Mittal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Munigala</surname>
          </string-name>
          ,
          <article-title>Overview and importance of data quality for machine learning tasks</article-title>
          , in:
          <source>Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</source>
          , Association for Computing Machinery,
          <year>2020</year>
          , pp.
          <fpage>3561</fpage>
          -
          <lpage>3562</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <article-title>FIWARE for data spaces: Position paper</article-title>
          , https://www.fiware.org/wp-content/uploads/FF_PositionPaper_FIWARE4DataSpaces.pdf,
          <year>2024</year>
          . Accessed: 2024-12-20.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <article-title>International data spaces association</article-title>
          , https://internationaldataspaces.org/why/international-standards/,
          <year>2024</year>
          . Accessed: 2024-12-20.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <article-title>Data product specification</article-title>
          , https://opendataproducts.org/v3.1/#optional-attributes-and-elements,
          <year>2024</year>
          . Accessed: 2024-12-20.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Malcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Viana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. P.</given-names>
            <surname>dos Santos</surname>
          </string-name>
          ,
          <article-title>What do we know about requirements management in software ecosystems?</article-title>
          ,
          <source>Requir. Eng</source>
          .
          <volume>28</volume>
          (
          <year>2023</year>
          )
          <fpage>567</fpage>
          -
          <lpage>593</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Hagenhoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Biehs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Möller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Otto</surname>
          </string-name>
          ,
          <article-title>Designing a reference architecture for collaborative condition monitoring data spaces: Design requirements and views</article-title>
          , in:
          <string-name>
            <given-names>M.</given-names>
            <surname>Mandviwalla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Söllner</surname>
          </string-name>
          , T. Tuunanen (Eds.),
          <source>Design Science Research for a Resilient Future - 19th International Conference on Design Science Research in Information Systems and Technology, DESRIST</source>
          <year>2024</year>
          , Trollhättan, Sweden, June 3-5,
          <year>2024</year>
          , Proceedings, volume
          <volume>14621</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2024</year>
          , pp.
          <fpage>355</fpage>
          -
          <lpage>369</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>