Toward Advanced Query Processing in Dataspaces

Christoph Quix 1,2
1 Hochschule Niederrhein University of Applied Sciences, Krefeld, Germany
2 Fraunhofer Institute for Applied Information Technology FIT, Sankt Augustin, Germany

Abstract
Dataspaces aim at enabling inter-organizational data exchange, emphasizing interoperability and data sovereignty of data assets. While current implementations focus on providing a foundational framework to enable secure, standards-based data sharing and sovereignty, they lack the robust query processing features needed to address emerging demands in distributed and federated data ecosystems. We present a vision for advancing dataspace technology by incorporating sophisticated query processing mechanisms and integrating features that ensure data sovereignty within traditional data management platforms such as data lakes.

Keywords
dataspaces, data integration, federated query processing, data sovereignty

DOLAP 2025: 27th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, co-located with EDBT/ICDT 2025, March 25, 2025, Barcelona, Spain
Email: christoph.quix@hs-niederrhein.de (C. Quix)
URL: https://www.hs-niederrhein.de/elektrotechnik-informatik/personen/quix (C. Quix)
ORCID: 0000-0002-1698-4345 (C. Quix)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. The Need for Advanced Query Processing in Dataspaces

The original idea of dataspaces as envisioned by Franklin et al. [1] emphasized lightweight data integration and incremental development of an integrated, linked personal dataspace. Dataspaces should provide basic data access and interoperability between heterogeneous data sources while progressively enhancing integration through user-driven refinement and automated techniques. This approach features flexibility and usability, allowing users to interact with partially integrated data while supporting iterative improvements in data organization and querying capabilities.

In 2015, Fraunhofer in Germany started the Industrial Dataspace (IDS) initiative [2], which created a new view on dataspaces. Dataspaces were envisioned as a multi-sided platform for secure and trusted data exchange, guaranteeing data sovereignty with a decentralized architecture [3]. The development is governed by an institutionalized alliance of diverse stakeholders, i.e., the International Data Spaces Association (IDSA, https://internationaldataspaces.org/). Early ideas of the IDS outlined a dataspace as a platform or market for data and services, in which data is described semantically in (central) metadata repositories. Data consumers can search the metadata for relevant datasets, invoke data services to integrate, transform or enrich data as desired, and finally use the data according to defined usage policies [4]. However, the work in the IDS project focused on the deployment of a trusted and secure connector framework [5].

Gaia-X has evolved from the IDS concept by extending its focus on data sharing and sovereignty into a broader framework that integrates cloud services, edge computing, and data ecosystems through standardized frameworks and governance mechanisms [6]. Gaia-X is a European initiative aimed at creating a secure, federated, and interoperable data infrastructure. It builds on IDS principles, such as trust and compliance, while expanding the ecosystem to include decentralized, federated infrastructures and a strong emphasis on transparency, openness, and digital sovereignty. While the priority given to trust and data sovereignty is a significant strength, it also imposes limitations on the ability to support data processing across a distributed data ecosystem [7]. These limitations become particularly evident in use cases requiring:

(a) Federated Query Processing: The capability to process queries across multiple, independently managed datasets without compromising performance or accuracy.
(b) Semantic Enrichment: Leveraging metadata and domain-specific ontologies to enable more precise and meaningful query results.
(c) Granular Data Sovereignty: Enforcing fine-grained access control policies that align with legal and organizational requirements.

A lack of these features constrains the practical utility of dataspaces in scenarios where data-driven decision-making depends on seamless and secure integration of data in a distributed data ecosystem.

2. Integrating Dataspace Features into Modern Data Architectures

The evolution from data lakes to data meshes and data fabrics reflects a significant transformation in how organizations approach data management to address issues of scalability, governance, and accessibility [8]. Data lakes, originally designed to store large volumes of structured and unstructured data in a centralized repository [9], often faced challenges related to governance and usability. Without robust management and accessibility frameworks, many data lakes devolved into 'data swamps', where finding meaningful insights became increasingly difficult [10].

Data meshes emerged to solve these issues by decentralizing data governance. This paradigm treats data as a product, where responsibility for the quality, usability, and governance of data lies with domain-specific teams. This domain-driven ownership model ensures scalability while addressing the shortcomings of centralized approaches, such as those found in traditional data lakes.

In parallel, data fabrics focus on creating an interconnected layer that integrates metadata across disparate systems. By employing technologies such as AI, automation, and knowledge graphs, data fabrics enable seamless data discovery, improved lineage tracking, and enhanced integration across an organization's diverse data landscape. This approach prioritizes metadata-driven governance and context-aware connectivity, providing a more dynamic and agile data ecosystem that supports advanced analytics and decision-making processes.

The concept of modern dataspaces aligns with these paradigms. Similar to the 'data as a product' philosophy in data meshes, dataspaces emphasize treating data assets as shared, governed resources designed for collaboration and interoperability. In both cases, the focus is on ensuring the quality, contextual relevance, and accessibility of data for specific stakeholders or applications. This shared emphasis underlines the mutual goal of enabling robust, domain-aware data collaboration across organizational boundaries.

However, sharing data products in a dataspace does not imply centralized governance or uniform data quality control. Similar to data meshes, data governance should be organized in a decentralized manner. Therefore, each participant should manage their own policies through self-governing data products. This approach aligns with the dataspace architecture, where data owners define and enforce usage policies for their data assets [4].

Additionally, dataspaces as well as data fabrics rely on semantic models to support semantic interoperability. In data fabrics, knowledge graphs serve as a foundational tool for modeling relationships and enriching metadata, allowing for enhanced data discovery, integration, and query capabilities. Similarly, dataspaces employ semantic models to achieve interoperability among heterogeneous data sources and domains (e.g., the IDS information model [11]). These models provide a shared understanding of data structures and relationships, which is essential for enabling meaningful cross-domain analytics and collaboration.

The interplay between these paradigms suggests a path toward convergence, where dataspaces could incorporate the principles of both data meshes and data fabrics. By blending the domain-centric ownership and product-oriented data management of data meshes with the semantic and automation-driven integration of data fabrics, dataspaces could emerge as a comprehensive framework for addressing modern data challenges. This evolution reflects a growing recognition of the need for distributed, interoperable, and semantically enriched data ecosystems capable of supporting diverse organizational and cross-domain needs.

The challenge lies in finding the optimal balance between unified semantic models for describing data assets and decentralized governance. In many dataspace projects, we have observed that a centralized approach to defining the core information model significantly slows down the bootstrapping process. A decentralized approach, as envisioned in data meshes, could accelerate this process but comes with the risk of diverging semantics. To mitigate these issues, collaborative ontology engineering methodologies need to be applied [12].

Enhancing data lakes with dataspace-inspired features can bridge the gap between centralized data repositories and the decentralized nature of dataspaces. Specifically, integrating features for data sovereignty and advanced query processing can yield transformative capabilities. By incorporating mechanisms like usage policies [13], a data lake can enforce access control, data provenance, and compliance policies. Databricks has proposed Delta Sharing (https://github.com/delta-io/delta-sharing/), a protocol for sharing datasets between data lakes, or even between organizations. Although the protocol supports fine-grained access control, it does not yet cover usage policies that support data sovereignty as in dataspaces.

On the other hand, enhancing dataspace frameworks, such as the Eclipse Dataspace Components (https://projects.eclipse.org/projects/technology.edc), with more sophisticated federated query processing for heterogeneous datasets could improve the usability of dataspaces. Data scientists require easy solutions for creating a Pandas data frame over heterogeneous data: an API as provided in Apache Spark, combined with Delta Sharing, and enriched with usage policies, could facilitate a true sovereign data science framework that integrates heterogeneous data access, data integration, and machine learning. Although it might still be challenging to merge all these features into one platform, we can leverage large language models to support users in executing their tasks in such a platform [14, 15]. SEDAR, as an open-source data lake platform, offers a concrete foundation for these integrations [16]. Enhancing SEDAR with dataspace features could demonstrate the feasibility of such extensions and provide insights into performance trade-offs and usability.

3. Future Research Directions

We advocate for a paradigm shift in dataspace technology by prioritizing advanced query processing and seamless integration with traditional data management platforms. Leveraging existing innovations, such as the Delta Sharing Protocol, and extending platforms like SEDAR, can help realize the vision of a unified, sovereignty-aware data management ecosystem. However, several research challenges must still be addressed to fully enable this vision.

Policy-aware query execution requires embedding data sovereignty rules directly into query execution plans. Queries should be executed in compliance with access restrictions, data-sharing agreements, and computational constraints defined by data owners. User-centric interfaces should allow non-expert users to interact effectively with the dataspace. Many existing dataspaces suffer from poor usability, limiting their adoption and accessibility. Furthermore, usage policies must be extended beyond basic access control to include restrictions on query processing itself. Finally, data quality management remains a significant challenge in many dataspaces. A decentralized data quality framework, incorporating objective and standardized quality metrics, could help assess and improve data reliability while allowing participants to retain autonomy over their data assets.

Acknowledgments

This work has been sponsored by the German Federal Ministry of Education and Research in the funding program "Forschung an Fachhochschulen", project I2 DACH (grant no. 13FH557KX0) and in the funding program "KI-Anwendungshub Kunststoffverpackungen – nachhaltige Kreislaufwirtschaft durch Künstliche Intelligenz", project KIOptiPack (grant no. 033KI111).

AI Disclosure Statement

During the preparation of this work, the author used ChatGPT 4o in order to improve writing style, check grammar, and spelling. After using this tool, the author reviewed and edited the content as needed and takes full responsibility for the content of the publication.

References

[1] M. J. Franklin, A. Y. Halevy, D. Maier, From databases to dataspaces: a new abstraction for information management, SIGMOD Rec. 34 (2005) 27–33. URL: https://doi.org/10.1145/1107499.1107502. doi:10.1145/1107499.1107502.
[2] B. Otto, J. Jürjens, J. Schon, S. Auer, N. Menz, S. Wenzel, J. Cirullies, Industrial Data Space - Digital Sovereignity over Data, Whitepaper, Fraunhofer-Gesellschaft, 2016. URL: https://www.fraunhofer.de/content/dam/zv/de/Forschungsfelder/industrial-data-space/Industrial-Data-Space_whitepaper.pdf.
[3] B. Otto, M. Jarke, Designing a multi-sided data platform: findings from the international data spaces case, Electron. Mark. 29 (2019) 561–580. URL: https://doi.org/10.1007/s12525-019-00362-x. doi:10.1007/S12525-019-00362-X.
[4] C. Quix, A. Chakrabarti, S. Kleff, J. Pullmann, Business process modelling for a data exchange platform, in: Proceedings of the Forum and Doctoral Consortium Papers Presented at the 29th International Conference on Advanced Information Systems Engineering, CAiSE 2017, Essen, Germany, June 12-16, 2017, volume 1848 of CEUR Workshop Proceedings, CEUR-WS.org, 2017, pp. 153–160. URL: https://ceur-ws.org/Vol-1848/CAiSE2017_Forum_Paper20.pdf.
[5] H. Pettenpohl, M. Spiekermann, J. R. Both, International data spaces in a nutshell, in: Designing Data Spaces: The Ecosystem Approach to Competitive Advantage, Springer, 2022, pp. 29–40. URL: https://doi.org/10.1007/978-3-030-93975-5_3. doi:10.1007/978-3-030-93975-5_3.
[6] H. Tardieu, Role of Gaia-X in the European data space ecosystem, in: Designing Data Spaces: The Ecosystem Approach to Competitive Advantage, Springer, 2022, pp. 41–59. URL: https://doi.org/10.1007/978-3-030-93975-5_4. doi:10.1007/978-3-030-93975-5_4.
[7] S. Geisler, M. Vidal, C. Cappiello, B. F. Lóscio, A. Gal, M. Jarke, M. Lenzerini, P. Missier, B. Otto, E. Paja, B. Pernici, J. Rehof, Knowledge-driven data ecosystems toward data transparency, ACM J. Data Inf. Qual. 14 (2022) 3:1–3:12. URL: https://doi.org/10.1145/3467022. doi:10.1145/3467022.
[8] I. Blohm, F. Wortmann, C. Legner, F. Köbler, Data products, data mesh, and data fabric, Bus. Inf. Syst. Eng. 66 (2024) 643–652. URL: https://doi.org/10.1007/s12599-024-00876-5. doi:10.1007/S12599-024-00876-5.
[9] R. Hai, C. Koutras, C. Quix, M. Jarke, Data lakes: A survey of functions and systems, IEEE Trans. Knowl. Data Eng. 35 (2023) 12571–12590. URL: https://doi.org/10.1109/TKDE.2023.3270101. doi:10.1109/TKDE.2023.3270101.
[10] R. Hai, S. Geisler, C. Quix, Constance: An intelligent data lake system, in: Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, ACM, 2016, pp. 2097–2100. URL: https://doi.org/10.1145/2882903.2899389. doi:10.1145/2882903.2899389.
[11] S. R. Bader, J. Pullmann, C. Mader, S. Tramp, C. Quix, A. W. Müller, H. Akyürek, M. Böckmann, B. T. Imbusch, J. Lipp, S. Geisler, C. Lange, The international data spaces information model - an ontology for sovereign exchange of digital content, in: The Semantic Web - ISWC 2020 - 19th International Semantic Web Conference, Athens, Greece, November 2-6, 2020, Proceedings, Part II, volume 12507 of Lecture Notes in Computer Science, Springer, 2020, pp. 176–192. URL: https://doi.org/10.1007/978-3-030-62466-8_12. doi:10.1007/978-3-030-62466-8_12.
[12] K. I. Kotis, G. A. Vouros, D. Spiliotopoulos, Ontology engineering methodologies for the evolution of living and reused ontologies: status, trends, findings and recommendations, Knowl. Eng. Rev. 35 (2020) e4. URL: https://doi.org/10.1017/S0269888920000065. doi:10.1017/S0269888920000065.
[13] D. M. Mustafa, A. Nadgeri, D. Collarana, B. T. Arnold, C. Quix, C. Lange, S. Decker, From instructions to ODRL usage policies: An ontology guided approach, in: Proceedings of Workshops at the 50th International Conference on Very Large Data Bases, VLDB 2024, Guangzhou, China, August 26-30, 2024, VLDB.org, 2024. URL: https://vldb.org/workshops/2024/proceedings/LLM+KG/LLM+KG-15.pdf.
[14] S. Hoseini, A. Burgdorf, A. Paulus, T. Meisen, C. Quix, A. Pomp, Towards LLM-augmented creation of semantic models for dataspaces, in: Proceedings of the Second International Workshop on Semantics in Dataspaces (SDS 2024) co-located with the 21st Extended Semantic Web Conference (ESWC 2024), Hersonissos, Greece, May 26, 2024, volume 3705 of CEUR Workshop Proceedings, CEUR-WS.org, 2024. URL: https://ceur-ws.org/Vol-3705/paper03.pdf.
[15] S. Hoseini, M. Ibbels, C. Quix, Enhancing machine learning capabilities in data lakes with AutoML and LLMs, in: Advances in Databases and Information Systems - 28th European Conference, ADBIS 2024, Bayonne, France, August 28-31, 2024, Proceedings, volume 14918 of Lecture Notes in Computer Science, Springer, 2024, pp. 184–198. URL: https://doi.org/10.1007/978-3-031-70626-4_13. doi:10.1007/978-3-031-70626-4_13.
[16] S. Hoseini, A. Ali, H. Shaker, C. Quix, SEDAR: A semantic data reservoir for heterogeneous datasets, in: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023, Birmingham, United Kingdom, October 21-25, 2023, ACM, 2023, pp. 5056–5060. URL: https://doi.org/10.1145/3583780.3614753. doi:10.1145/3583780.3614753.