=Paper=
{{Paper
|id=Vol-3931/panel1
|storemode=property
|title=Challenges to Enforce Data Quality in Data Spaces
|pdfUrl=https://ceur-ws.org/Vol-3931/panel1.pdf
|volume=Vol-3931
|authors=Claudia Ayala,Besim Bilalli,Cristina Gómez,Jose-Norberto Mazón,Oscar Romero
|dblpUrl=https://dblp.org/rec/conf/dolap/AyalaB0M025
}}
==Challenges to Enforce Data Quality in Data Spaces==
<pdf width="1500px">https://ceur-ws.org/Vol-3931/panel1.pdf</pdf>
<pre>
Challenges to Enforce Data Quality in Data Spaces
Claudia P. Ayala (1), Besim Bilalli (1), Cristina Gómez (1), Jose-Norberto Mazón (2,∗) and Oscar Romero (1)
(1) Universitat Politècnica de Catalunya, BarcelonaTech
(2) Universitat d’Alacant / Universidad de Alicante


Abstract
Data Spaces must preserve sovereignty and privacy while upholding the FAIR (Findable, Accessible,
Interoperable and Reusable) principles. To do so, policy-based strategies have to be developed to
describe the agreements reached in the Data Space. In this context, two open questions arise: how
to define the right Data Space policies, and how to enforce (and monitor) them. Despite the efforts
towards defining and enforcing data access and usage policies, there is no solution to
operationalize the enforcement of policies that consider data quality dimensions. Yet data quality
is becoming a hot topic due to the surge of federated learning and alternative analytical
techniques, which require all providers to guarantee a data quality threshold in order to learn
robust models. Currently, we have means to describe policies related to data quality rules (e.g.,
by combining standards such as ODRL with standard vocabularies), but we lack means to elicit these
policies from data providers and to enforce them while preserving data sovereignty. In this paper,
we discuss the challenges and open questions that must be addressed in order to operationalize
(and eventually automate) data quality in Data Spaces, which span from requirements elicitation to
data validation.

Keywords
Data Spaces, Data Quality, Data Validation, Federated Data Management, Data Sharing


DOLAP 2025: 27th International Workshop on Design, Optimization, Languages and Analytical
Processing of Big Data, co-located with EDBT/ICDT 2025, March 25, 2025, Barcelona, Spain
∗ Corresponding author.
Email: claudia.ayala@upc.edu (C. P. Ayala); besim.bilalli@upc.edu (B. Bilalli);
cristina.gomez@upc.edu (C. Gómez); jnmazon@ua.es (J. Mazón); oscar.romero@upc.edu (O. Romero)
Web: https://futur.upc.edu/ClaudiaPatriciaAyalaMartinez (C. P. Ayala);
https://futur.upc.edu/BesimBilalli (B. Bilalli);
https://futur.upc.edu/CristinaGomezSeoane (C. Gómez);
https://s.ua.es/_MuH (J. Mazón);
https://futur.upc.edu/OscarRomeroMoral (O. Romero)
ORCID: 0000-0002-6262-3698 (C. P. Ayala); 0000-0002-0575-2389 (B. Bilalli);
0000-0002-3872-0439 (C. Gómez); 0000-0001-7924-0880 (J. Mazón); 0000-0001-6350-8328 (O. Romero)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073


1. Introduction

Data Spaces are federated ecosystems in which data providers and consumers share data while
preserving data sovereignty and privacy. The Data Mesh architecture [1] is at the core of current
technological solutions, since it provides a domain-decentralized paradigm that suits the Data
Space requirements [2]. Relevantly, the Data Mesh defines the Data Product concept, which provides
a product-oriented view of the providers’ data assets. In short, the data product is a node that
encapsulates the three structural components it requires to function: code for enforcing policies
(i.e., the Data Space agreements), data (and its metadata) and infrastructure [3]. By definition,
the providers’ data assets can be heterogeneous both in the infrastructure used and in the data
provided (in format and semantics).

Behind the idea of Data Spaces is the objective of extracting value from data sharing. This can be
achieved in many ways, but data analysis stands out as a prominent means of doing so, either
through descriptive analysis (e.g., dashboarding and OLAP) or predictive analysis (e.g., learning
models). However, how to achieve data analysis in federated environments is an open challenge, and
federated learning [4] is currently the most widespread privacy-aware data analysis technique.
Many efforts have been devoted to developing robust federated learning, but little attention has
been paid to the role of data. Yet, the impact of the data quality (DQ) of each provider on the
federated models learnt is huge [5, 6].

One of the biggest open problems in Data Spaces, not yet properly tackled, is how the agreements
reached (e.g., on DQ) at the Data Space federated layer (i.e., at the federated, unique view of
the data ecosystem) can be enforced on the providers’ data assets regardless of their
heterogeneity, while preserving data ownership and privacy. Note that this problem has been easily
tackled in centralized environments by having a central authority extracting, transforming and
preparing data for analysis. However, this is not possible in settings where data is not meant to
be shared raw. For example, the minimum number of instances and the variances of key attributes
might be set as DQ criteria for all data providers and should be automatically and locally
validated by executing a software service (specific to the provider infrastructure) provided by
the Data Space services catalogue. The result of the service execution should be communicated to
the Data Space. To our knowledge, there is no architecture, framework or solution tackling this
problem, despite the myriad of standards and definitions blooming around the Data Space concept
(e.g., [7, 8]).

We focus on how to validate DQ agreements in the Data Space and discuss the open challenges to
make DQ happen in Data Spaces to enact trustworthy federated learning.


2. Challenges and Vision

Data Spaces require a governance model for specifying DQ agreements that stakeholders must adhere
to in order to participate. Importantly, this governance model must also specify the DQ needs
agreed among data consumers and providers when developing specific use cases. Therefore, our view
is that the governance model for Data Spaces should distinguish two levels: 1) a Data Space level
for agreements among stakeholders of the Data Space authority derived from data regulations and
strategic issues, and 2) a use case level for agreements among data providers and consumers to
build specific Data Products. Based on this view and to facilitate the discussion, we propose a
visionary framework with a process for the Data Space and use case levels (see Fig. 1).

[Figure 1: Visionary framework for considering DQ requirements in Data Spaces.]

Our framework follows the Open Data Product specification [9], thus splitting each process into
two parts: one declarative, at a higher level of abstraction, specifying what (analysis phase),
and another one at a lower level, specifying how (design and implementation phases). The
declarative part defines the DQ dimensions and intended level. The executable part contains the
machine-readable “as code” rules, provided as a service, to validate DQ dimensions. Next, we
describe both processes and their main challenges.

DQ Requirements Engineering for Data Spaces. Requirements engineering (RE) for complex systems in
open and dynamic environments that extend beyond a single organization is widely recognized as a
challenging endeavor [10, 11]. This is particularly true in the context of Data Spaces, where the
elicitation and management of requirements must reconcile diverse perspectives, including the
strategic business vision, governance, compliance with laws and regulations, infrastructure,
scalability demands, and DQ considerations. Our visionary framework proposes applying RE practices
to elicit, specify, and manage the Data Space requirements. We advocate for the development and
use of a Catalogue of DQ Requirements at two levels: the Data Space level and the use case level.
These catalogues promote knowledge sharing and requirements reuse, building a robust repository of
experiences and best practices. The proposed process aims to: 1) ensure a common understanding of
DQ dimensions by considering established standards; 2) facilitate the elicitation of DQ
requirements from diverse stakeholders to enable effective data sharing; 3) support the structured
specification and management of DQ requirements to ensure compliance and alignment between the
Data Space and use case levels for their subsequent operationalization; and 4) address trade-offs
between conflicting DQ requirements. This approach aims to bridge the gap between diverse
stakeholder perspectives and the technical requirements for robust DQ management in Data Spaces.

Extraction and Customization of DQ Rules. The complexity of DQ requirements and their textual or
semi-structured formalization make their direct operationalization challenging. With the aim of
making DQ requirements executable in an operational environment, our visionary framework proposes
to transform, in a semi-automated way and using specific catalogues for supporting this
transformation, DQ requirements (at Data Space and use case levels) into formalized DQ rules that
may be easily implemented. We propose to use a rule language with well-defined semantics (e.g.,
ODRL) to formalize DQ rules. Several challenges need to be tackled when performing this
transformation: 1) the identification of relevant and suitable stakeholders with the specific
knowledge for performing this activity at both levels; 2) the definition of specific catalogues
with reusable transformation patterns for translating DQ requirements into rules, preserving their
semantics; 3) the definition of the artifacts needed (e.g., specialized metamodels or new ODRL
profiles) for automating the extraction and customization of DQ rules to the specific domain and
level.

Implementation available as a Service of DQ Rules. The inherent heterogeneity of providers in the
context of Data Spaces renders the process of translating formal DQ rules into executable services
a significant challenge. The main goal of this activity is to avoid building and maintaining
custom solutions that are tightly coupled to specific execution environments or platforms. To
address this, we propose an agnostic solution that leverages best practices from software
engineering, such as containerized solutions, ensuring portability, scalability, and
interoperability. However, the intrinsic characteristics of Data Spaces introduce several
challenges that must be addressed: 1) dealing with heterogeneity at the infrastructure level by
abstracting the differences while ensuring consistent performance and security across
environments; 2) allowing for dynamic and federated execution across multiple distributed nodes,
ensuring real-time validation without requiring data centralization.

In conclusion, there is a need for further research to enact DQ in Data Spaces, a must for
high-quality federated data analysis. In this sense, we have discussed a visionary framework, its
main phases and the challenges to be tackled.


Acknowledgments

This work has been partially supported by the EU-HORIZON program under GA.101135513 (CYCLOPS) and
by CIAICO/2022/019 project from Generalitat Valenciana.

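
Illustrative sketch. The DQ criteria example given in the Introduction (a minimum number of
instances and the variances of key attributes, validated locally at each provider, with only the
result communicated to the Data Space) could be sketched roughly as follows. This is not part of
the proposed framework: the names DQRule and validate_locally, the report layout and the use of
sample variance are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from statistics import variance
from typing import Mapping, Sequence


@dataclass(frozen=True)
class DQRule:
    """Hypothetical declarative DQ rule agreed at the Data Space level."""
    min_instances: int
    min_variance: Mapping[str, float]  # key attribute -> minimum sample variance


def validate_locally(rows: Sequence[Mapping[str, float]], rule: DQRule) -> dict:
    """Evaluate the rule on the provider's own data; only pass/fail flags leave the node."""
    checks = {"min_instances": len(rows) >= rule.min_instances}
    for attribute, threshold in rule.min_variance.items():
        values = [row[attribute] for row in rows if attribute in row]
        # Sample variance needs at least two values; fewer counts as a failed check.
        checks[f"variance:{attribute}"] = (
            len(values) >= 2 and variance(values) >= threshold
        )
    return {"compliant": all(checks.values()), "checks": checks}


# Toy provider data (made up for the example); it never leaves this function call.
rows = [
    {"age": 30.0, "income": 1000.0},
    {"age": 40.0, "income": 1000.0},
    {"age": 50.0, "income": 1000.0},
]
rule = DQRule(min_instances=3, min_variance={"age": 50.0, "income": 1.0})
report = validate_locally(rows, rule)
print(report)
```

Only the aggregate report would be sent to the Data Space, so the raw rows stay with the
provider. In a setting like the one described above, such a check would presumably be packaged as
a containerized service and generated from a formal rule (e.g., an ODRL policy) rather than
written by hand.
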
References
 [1] A. Goedegebuure, I. Kumara, S. Driessen, W.-J. Van
     Den Heuvel, G. Monsieur, D. A. Tamburri, D. D. Nucci,
     Data mesh: a systematic gray literature review, ACM
     Computing Surveys 57 (2024) 1–36.
 [2] M. Bacco, A. Kocian, S. Chessa, A. Crivello, P. Barsoc-
     chi, What are data spaces? systematic survey and
     future outlook, Data in Brief 57 (2024) 110969.
 [3] Z. Dehghani, Data Mesh: Delivering Data-driven
     Value at Scale, O’Reilly, 2022.
 [4] B. McMahan, E. Moore, D. Ramage, S. Hampson, B. A.
     y Arcas, Communication-efficient learning of deep net-
     works from decentralized data, in: A. Singh, X. J. Zhu
     (Eds.), Proceedings of the 20th International Confer-
     ence on Artificial Intelligence and Statistics, AISTATS
     2017, 20-22 April 2017, Fort Lauderdale, FL, USA, vol-
     ume 54 of Proceedings of Machine Learning Research,
     PMLR, 2017, pp. 1273–1282.
 [5] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar,
     V. Smith, Federated optimization in heterogeneous
     networks, in: I. S. Dhillon, D. S. Papailiopoulos, V. Sze
     (Eds.), Proceedings of the Third Conference on Ma-
     chine Learning and Systems, MLSys 2020, Austin, TX,
     USA, March 2-4, 2020, mlsys.org, 2020.
 [6] A. Jain, H. Patel, L. Nagalapatti, N. Gupta, S. Mehta,
     S. Guttula, S. Mujumdar, S. Afzal, R. Sharma Mittal,
     V. Munigala, Overview and importance of data qual-
     ity for machine learning tasks, in: Proceedings of
     the 26th ACM SIGKDD International Conference on
     Knowledge Discovery & Data Mining, Association for
     Computing Machinery, 2020, pp. 3561–3562.
 [7] Fiware for data spaces: Position paper,
     https://www.fiware.org/wp-content/uploads/FF_PositionPaper_FIWARE4DataSpaces.pdf,
     2024. Accessed: 2024-12-20.
 [8] International data spaces association,
     https://internationaldataspaces.org/why/international-standards/,
     2024. Accessed: 2024-12-20.
 [9] Data product specification,
     https://opendataproducts.org/v3.1/#optional-attributes-and-elements,
     2024. Accessed: 2024-12-20.
[10] P. Malcher, E. Silva, D. Viana, R. P. dos Santos, What do
     we know about requirements management in software
     ecosystems?, Requir. Eng. 28 (2023) 567–593.
[11] P. Hagenhoff, S. Biehs, F. Möller, B. Otto, Design-
     ing a reference architecture for collaborative condi-
     tion monitoring data spaces: Design requirements
     and views, in: M. Mandviwalla, M. Söllner, T. Tuu-
     nanen (Eds.), Design Science Research for a Resilient
     Future - 19th International Conference on Design Sci-
     ence Research in Information Systems and Technology,
     DESRIST 2024, Trollhättan, Sweden, June 3-5, 2024,
     Proceedings, volume 14621 of Lecture Notes in Com-
     puter Science, Springer, 2024, pp. 355–369.

</pre>