<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Challenges to Enforce Data Quality in Data Spaces</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Claudia</forename><forename type="middle">P</forename><surname>Ayala</surname></persName>
							<email>claudia.ayala@upc.edu</email>
							<affiliation key="aff0">
								<orgName type="institution">Universitat Politècnica de Catalunya</orgName>
								<address>
									<settlement>BarcelonaTech</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Besim</forename><surname>Bilalli</surname></persName>
							<email>besim.bilalli@upc.edu</email>
							<affiliation key="aff0">
								<orgName type="institution">Universitat Politècnica de Catalunya</orgName>
								<address>
									<settlement>BarcelonaTech</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Cristina</forename><surname>Gómez</surname></persName>
							<email>cristina.gomez@upc.edu</email>
							<affiliation key="aff0">
								<orgName type="institution">Universitat Politècnica de Catalunya</orgName>
								<address>
									<settlement>BarcelonaTech</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jose-Norberto</forename><surname>Mazón</surname></persName>
							<email>jnmazon@ua.es</email>
							<affiliation key="aff1">
								<orgName type="institution" key="instit1">Universitat d&apos;Alacant</orgName>
								<orgName type="institution" key="instit2">Universidad de Alicante</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Oscar</forename><surname>Romero</surname></persName>
							<email>oscar.romero@upc.edu</email>
							<affiliation key="aff0">
								<orgName type="institution">Universitat Politècnica de Catalunya</orgName>
								<address>
									<settlement>BarcelonaTech</settlement>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Challenges to Enforce Data Quality in Data Spaces</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">E5BDFE8363FFB807FF6C80A231CD75E8</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:09+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Data Spaces</term>
					<term>Data Quality</term>
					<term>Data Validation</term>
					<term>Federated Data Management</term>
					<term>Data Sharing</term>
					<term>ORCID: 0000-0002-6262-3698 (C. P. Ayala)</term>
					<term>ORCID: 0000-0002-0575-2389 (B. Bilalli)</term>
					<term>ORCID: 0000-0002-3872-0439 (C. Gómez)</term>
					<term>ORCID: 0000-0001-7924-0880 (J. Mazón)</term>
					<term>ORCID: 0000-0001-6350-8328 (O. Romero)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Data Spaces must preserve sovereignty and privacy while ensuring FAIR (Findable, Accessible, Interoperable and Reusable) principles. To do so, policy-based strategies have to be developed in order to describe the agreements reached in the Data Space. In this context, two open questions arise: how to define the right Data Space policies, as well as, how to enforce (and monitor) them. Despite the efforts towards defining and enforcing data access and usage policies, there is no solution to operationalize the enforcement of those considering data quality dimensions. However, data quality is becoming a hot topic due to the surge of federated learning and alternative analytical techniques, which require all providers to guarantee a data quality threshold in order to learn robust models. Currently, we have means to describe policies related to data quality rules (e.g., by combining standards such as ODRL and standard vocabularies) but we are missing means to elicit these policies from data providers and enforce them while preserving the data sovereignty. In this paper, we discuss the challenges and open questions that must be addressed in order to operationalize (and eventually, automate) data quality in Data Spaces, which span from requirements elicitation to data validation.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Data Spaces are federated ecosystems in which data providers and consumers share data while preserving data sovereignty and privacy. The Data Mesh architecture <ref type="bibr" target="#b0">[1]</ref> is at the core of current technological solutions, since it provides a domain-decentralized paradigm that suits the Data Space requirements <ref type="bibr" target="#b1">[2]</ref>. Notably, the Data Mesh defines the Data Product concept, which provides a product-oriented view of the providers' data assets. In short, a data product is a node that encapsulates the three structural components it requires to function: code for enforcing policies (i.e., the Data Space agreements), data (and its metadata), and infrastructure <ref type="bibr" target="#b2">[3]</ref>. By definition, the providers' data assets can be heterogeneous in both the infrastructure used and the data provided (in format and semantics).</p><p>The idea behind Data Spaces is to extract value from data sharing. This can be achieved in many ways, but data analysis is a prominent means of doing so, through either descriptive analysis (e.g., dashboarding and OLAP) or predictive analysis (e.g., learning models). However, how to perform data analysis in federated environments remains an open challenge, and federated learning <ref type="bibr" target="#b3">[4]</ref> is currently the most widespread privacy-aware data analysis technique. Many efforts have been devoted to developing robust federated learning, but little attention has been paid to the role of data. 
Yet, the data quality (DQ) of each provider has a major impact on the federated models learnt <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref>.</p><p>One of the biggest open problems in Data Spaces, not yet properly tackled, is how the agreements reached (e.g., on DQ) at the Data Space federated layer (i.e., at the federated, unique view of the data ecosystem) can be enforced on the providers' data assets regardless of their heterogeneity, while preserving data ownership and privacy. Note that this problem is easily tackled in centralized environments by having a central authority extract, transform and prepare data for analysis. However, this is not possible in settings where raw data is not meant to be shared. For example, the minimum number of instances and the variances of key attributes might be set as DQ criteria for all data providers; these criteria should then be validated automatically and locally by executing a software service (specific to the provider infrastructure) provided by the Data Space services catalogue. The result of the service execution should be communicated to the Data Space. To our knowledge, there is no architecture, framework or solution tackling this problem, despite the myriad of standards and definitions blooming around the Data Space concept (e.g., <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8]</ref>).</p><p>We focus on how to validate DQ agreements in the Data Space and discuss the open challenges that must be addressed to make DQ happen in Data Spaces and thereby enable trustworthy federated learning.</p></div>
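<div xmlns="http://www.tei-c.org/ns/1.0"><p>The local validation example given in the Introduction (a minimum number of instances and minimum variances for key attributes) can be sketched as follows. This is only an illustrative Python sketch under stated assumptions, not an existing Data Space service: the names DQCriteria and validate_locally, the layout of the report, and the choice of population variance are all hypothetical. It illustrates that the check runs next to the provider's data and that only pass/fail flags, never raw values, are returned for communication to the Data Space.</p><p>
```python
from dataclasses import dataclass
from statistics import pvariance


@dataclass(frozen=True)
class DQCriteria:
    """Hypothetical DQ agreement set at the Data Space level."""
    min_instances: int
    min_variance: dict  # attribute name mapped to its minimum acceptable variance


def validate_locally(rows, criteria):
    """Run the DQ checks next to the data and return only an aggregate report.

    `rows` is a list of dicts held by the provider. The returned report
    contains boolean flags only, so no raw value leaves this function.
    """
    checks = {"min_instances": len(rows) >= criteria.min_instances}
    for attribute, threshold in criteria.min_variance.items():
        values = [row[attribute] for row in rows]
        # Variance is undefined for an empty dataset, so that case fails the check.
        variance = pvariance(values) if values else None
        checks["variance:" + attribute] = variance is not None and variance >= threshold
    return {"passed": all(checks.values()), "checks": checks}
```
</p><p>For instance, validate_locally([{"age": 30}, {"age": 35}, {"age": 40}], DQCriteria(3, {"age": 1.0})) reports both checks as passed, whereas a provider holding a single row fails both the instance-count check and the variance check.</p></div>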
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Challenges and Vision</head><p>Data Spaces require a governance model for specifying the DQ agreements that stakeholders must adhere to in order to participate. Importantly, this governance model must also specify the DQ needs agreed among data consumers and providers when developing specific use cases. Therefore, our view is that the governance model for Data Spaces should distinguish two levels: 1) a Data Space level, for agreements among stakeholders of the Data Space authority derived from data regulations and strategic issues, and 2) a use case level, for agreements among data providers and consumers to build specific Data Products. Based on this view and to facilitate the discussion, we propose a visionary framework with a process for the Data Space and use case levels (see Fig. <ref type="figure" target="#fig_0">1</ref>). Our framework follows the Open Data Product specification <ref type="bibr" target="#b8">[9]</ref>, thus splitting each process into two parts: a declarative part, at a higher level of abstraction, specifying what (analysis phase), and an executable part, at a lower level, specifying how (design and implementation phases). The declarative part defines the DQ dimensions and their intended levels. The executable part contains the machine-readable "as code" rules, provided as a service, to validate the DQ dimensions. Next, we describe both processes and their main challenges.</p><p>DQ Requirements Engineering for Data Spaces.</p><p>Requirements engineering (RE) for complex systems in open and dynamic environments that extend beyond a single organization is widely recognized as a challenging endeavor <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11]</ref>. 
This is particularly true in the context of Data Spaces, where the elicitation and management of requirements must reconcile diverse perspectives, including the strategic business vision, governance, compliance with laws and regulations, infrastructure, scalability demands, and DQ considerations. Our visionary framework proposes applying RE practices to elicit, specify, and manage the Data Space requirements. We advocate the development and use of a Catalogue of DQ Requirements at two levels: the Data Space level and the use case level. These catalogues promote knowledge sharing and requirements reuse, building a robust repository of experiences and best practices. The proposed process aims to: 1) ensure a common understanding of DQ dimensions by considering established standards; 2) facilitate the elicitation of DQ requirements from diverse stakeholders to enable effective data sharing; 3) support the structured specification and management of DQ requirements to ensure compliance and alignment between the Data Space and use case levels for their subsequent operationalization; and 4) address trade-offs between conflicting DQ requirements. This approach aims to bridge the gap between diverse stakeholder perspectives and the technical requirements for robust DQ management in Data Spaces.</p><p>Extraction and Customization of DQ Rules. The complexity of DQ requirements and their textual or semi-structured formalization make their direct operationalization challenging. To make DQ requirements executable in an operational environment, our visionary framework proposes to transform DQ requirements (at the Data Space and use case levels) into formalized DQ rules that can be easily implemented. This transformation is semi-automated and supported by specific catalogues.</p><p>We propose to use a rule language with well-defined semantics (e.g., ODRL) to formalize DQ rules. 
Several challenges need to be tackled when performing this transformation: 1) the identification of relevant and suitable stakeholders with the specific knowledge required to perform this activity at both levels; 2) the definition of specific catalogues with reusable transformation patterns for translating DQ requirements into rules while preserving their semantics; and 3) the definition of the artifacts needed (e.g., specialized metamodels or new ODRL profiles) for automating the extraction and customization of DQ rules to the specific domain and level.</p><p>Implementation of DQ Rules Available as a Service.</p><p>The inherent heterogeneity of providers in Data Spaces makes translating formal DQ rules into executable services a significant challenge. The main goal of this activity is to avoid building and maintaining custom solutions that are tightly coupled to specific execution environments or platforms. To address this, we propose a platform-agnostic solution that leverages best practices from software engineering, such as containerization, to ensure portability, scalability, and interoperability. However, the intrinsic characteristics of Data Spaces introduce several challenges that must be addressed: 1) dealing with heterogeneity at the infrastructure level by abstracting the differences while ensuring consistent performance and security across environments; and 2) allowing for dynamic and federated execution across multiple distributed nodes, ensuring real-time validation without requiring data centralization.</p><p>In conclusion, further research is needed to enact DQ in Data Spaces, which is a prerequisite for high-quality federated data analysis. 
To this end, we have discussed a visionary framework, its main phases, and the challenges to be tackled.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Visionary framework for considering DQ requirements in Data Spaces.</figDesc><graphic coords="2,112.61,65.60,370.05,234.48" type="bitmap" /></figure>
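<div xmlns="http://www.tei-c.org/ns/1.0"><p>The step from a formalized DQ rule to an executable check, as discussed in Section 2, might look roughly like the following Python sketch. It is an assumption-laden illustration, not the framework's implementation: the policy fragment borrows the constraint layout (leftOperand, operator, rightOperand) and the comparison operator names of ODRL, but the "dq:" left operands, the identifiers under "urn:example:", and the function evaluate_policy are hypothetical, since ODRL itself defines no DQ dimensions and a dedicated profile would be needed for them.</p><p>
```python
import operator

# Comparison operators, named as in the ODRL vocabulary.
ODRL_OPERATORS = {
    "eq": operator.eq,
    "neq": operator.ne,
    "gt": operator.gt,
    "gteq": operator.ge,
    "lt": operator.lt,
    "lteq": operator.le,
}

# Illustrative policy fragment. The "dq:" left operands are hypothetical
# terms that a DQ-specific ODRL profile would have to define.
POLICY = {
    "@type": "Set",
    "uid": "urn:example:policy:dq-1",
    "permission": [{
        "action": "use",
        "target": "urn:example:data-product",
        "constraint": [
            {"leftOperand": "dq:completeness", "operator": "gteq", "rightOperand": 0.95},
            {"leftOperand": "dq:instanceCount", "operator": "gteq", "rightOperand": 1000},
        ],
    }],
}


def evaluate_policy(policy, measurements):
    """Check every constraint against locally computed DQ measurements.

    Returns a list of (left operand, satisfied) pairs. A constraint whose
    measurement is missing is reported as not satisfied.
    """
    results = []
    for rule in policy.get("permission", []):
        for constraint in rule.get("constraint", []):
            name = constraint["leftOperand"]
            compare = ODRL_OPERATORS[constraint["operator"]]
            satisfied = name in measurements and compare(
                measurements[name], constraint["rightOperand"]
            )
            results.append((name, satisfied))
    return results
```
</p><p>With the measurements {"dq:completeness": 0.97, "dq:instanceCount": 800}, the completeness constraint is satisfied and the instance-count constraint is not. Keeping the rule as data and the evaluator as generic code is one way to package the same evaluator as a service for different provider infrastructures.</p></div>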
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work has been partially supported by the EU-HORIZON program under GA.101135513 (CYCLOPS) and by CIAICO/2022/019 project from Generalitat Valenciana.</p></div>
			</div>


			<div type="availability">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Author homepages: https://futur.upc.edu/ClaudiaPatriciaAyalaMartinez (C. P. Ayala); https://futur.upc.edu/BesimBilalli (B. Bilalli); https://futur.upc.edu/CristinaGomezSeoane (C. Gómez); https://s.ua.es/_MuH (J. Mazón); https://futur.upc.edu/OscarRomeroMoral (O. Romero)</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Data mesh: a systematic gray literature review</title>
		<author>
			<persName><forename type="first">A</forename><surname>Goedegebuure</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Kumara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Driessen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-J</forename><surname>Van Den Heuvel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Monsieur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">A</forename><surname>Tamburri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Nucci</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys</title>
		<imprint>
			<biblScope unit="volume">57</biblScope>
			<biblScope unit="page" from="1" to="36" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">What are data spaces? systematic survey and future outlook</title>
		<author>
			<persName><forename type="first">M</forename><surname>Bacco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kocian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chessa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Crivello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Barsocchi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Data in Brief</title>
		<imprint>
			<biblScope unit="volume">57</biblScope>
			<biblScope unit="page">110969</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Dehghani</surname></persName>
		</author>
		<title level="m">Data Mesh: Delivering Data-driven Value at Scale</title>
				<imprint>
			<publisher>O&apos;Reilly</publisher>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Communication-efficient learning of deep networks from decentralized data</title>
		<author>
			<persName><forename type="first">B</forename><surname>McMahan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Moore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ramage</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hampson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">A</forename><surname>Arcas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Singh</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><forename type="middle">J</forename><surname>Zhu</surname></persName>
		</editor>
		<meeting>the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017<address><addrLine>Fort Lauderdale, FL, USA</addrLine></address></meeting>
		<imprint>
			<publisher>PMLR</publisher>
			<date type="published" when="2017-04-22">20-22 April 2017</date>
			<biblScope unit="volume">54</biblScope>
			<biblScope unit="page" from="1273" to="1282" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Federated optimization in heterogeneous networks</title>
		<author>
			<persName><forename type="first">T</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Sahu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zaheer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sanjabi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Talwalkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Smith</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Third Conference on Machine Learning and Systems, MLSys 2020</title>
				<editor>
			<persName><forename type="first">I</forename><forename type="middle">S</forename><surname>Dhillon</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Papailiopoulos</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Sze</surname></persName>
		</editor>
		<meeting>the Third Conference on Machine Learning and Systems, MLSys 2020<address><addrLine>Austin, TX, USA</addrLine></address></meeting>
		<imprint>
			<publisher>mlsys.org</publisher>
			<date type="published" when="2020">March 2-4, 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Overview and importance of data quality for machine learning tasks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Patel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Nagalapatti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mehta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Guttula</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mujumdar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Afzal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sharma Mittal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Munigala</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</title>
				<meeting>the 26th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="3561" to="3562" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<ptr target="https://www.fiware.org/wp-content/uploads/FF_PositionPaper_FIWARE4DataSpaces.pdf" />
		<title level="m">Fiware for data spaces</title>
				<imprint>
			<date type="published" when="2024-12-20">2024. 2024-12-20</date>
		</imprint>
	</monogr>
	<note>Position paper</note>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<ptr target="https://internationaldataspaces.org/why/international-standards/" />
		<title level="m">International data spaces association</title>
				<imprint>
			<date type="published" when="2024-12-20">2024. 2024-12-20</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m">Data product specification</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note>Accessed: 2024-12-20</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">What do we know about requirements management in software ecosystems?</title>
		<author>
			<persName><forename type="first">P</forename><surname>Malcher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Silva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Viana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">P</forename><surname>Santos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Requir. Eng</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page" from="567" to="593" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Designing a reference architecture for collaborative condition monitoring data spaces: Design requirements and views</title>
		<author>
			<persName><forename type="first">P</forename><surname>Hagenhoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Biehs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Möller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Otto</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Design Science Research for a Resilient Future -19th International Conference on Design Science Research in Information Systems and Technology, DESRIST 2024</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">M</forename><surname>Mandviwalla</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Söllner</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Tuunanen</surname></persName>
		</editor>
		<meeting><address><addrLine>Trollhättan, Sweden</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2024">June 3-5, 2024</date>
			<biblScope unit="volume">14621</biblScope>
			<biblScope unit="page" from="355" to="369" />
		</imprint>
	</monogr>
	<note>Proceedings</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
