Toward Advanced Query Processing in Dataspaces
Christoph Quix 1,2
1 Hochschule Niederrhein University of Applied Sciences, Krefeld, Germany
2 Fraunhofer Institute for Applied Information Technology FIT, Sankt Augustin, Germany


Abstract
Dataspaces aim to enable inter-organizational data exchange, emphasizing interoperability and sovereignty over data assets. While current implementations focus on providing a foundational framework for secure, standards-based, and sovereign data sharing, they lack the robust query processing features needed to address emerging demands in distributed and federated data ecosystems. We present a vision for advancing dataspace technology by incorporating sophisticated query processing mechanisms and by integrating features that ensure data sovereignty into traditional data management platforms such as data lakes.

Keywords
dataspaces, data integration, federated query processing, data sovereignty


DOLAP 2025: 27th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, co-located with EDBT/ICDT 2025, March 25, 2025, Barcelona, Spain
Email: christoph.quix@hs-niederrhein.de (C. Quix)
URL: https://www.hs-niederrhein.de/elektrotechnik-informatik/personen/quix (C. Quix)
ORCID: 0000-0002-1698-4345 (C. Quix)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. The Need for Advanced Query Processing in Dataspaces

The original idea of dataspaces as envisioned by Franklin et al. [1] emphasized lightweight data integration and incremental development of an integrated, linked personal dataspace. Dataspaces should provide basic data access and interoperability between heterogeneous data sources while progressively enhancing integration through user-driven refinement and automated techniques. This approach features flexibility and usability, allowing users to interact with partially integrated data while supporting iterative improvements in data organization and querying capabilities.

In 2015, Fraunhofer in Germany started the Industrial Data Space (IDS) initiative [2], which established a new view of dataspaces. Dataspaces were envisioned as a multi-sided platform for secure and trusted data exchange, guaranteeing data sovereignty with a decentralized architecture [3]. The development is governed by an institutionalized alliance of diverse stakeholders, i.e., the International Data Spaces Association (IDSA; https://internationaldataspaces.org/). Early IDS concepts outline a dataspace as a platform or market for data and services, in which data is described semantically in (central) metadata repositories. Data consumers can search the metadata for relevant datasets, invoke data services to integrate, transform, or enrich data as desired, and finally use the data according to defined usage policies [4]. However, the work in the IDS project focused on the deployment of a trusted and secure connector framework [5].

Gaia-X has evolved from the IDS concept by extending its focus on data sharing and sovereignty into a broader framework that integrates cloud services, edge computing, and data ecosystems through standardized frameworks and governance mechanisms [6]. Gaia-X is a European initiative aimed at creating a secure, federated, and interoperable data infrastructure. It builds on IDS principles, such as trust and compliance, while expanding the ecosystem to include decentralized, federated infrastructures and a strong emphasis on transparency, openness, and digital sovereignty. While the priority given to trust and data sovereignty is a significant strength, it also imposes limitations on the ability to support data processing across a distributed data ecosystem [7]. These limitations become particularly evident in use cases requiring:
(a) Federated Query Processing: The capability to process queries across multiple, independently managed datasets without compromising performance or accuracy.
(b) Semantic Enrichment: Leveraging metadata and domain-specific ontologies to enable more precise and meaningful query results.
(c) Granular Data Sovereignty: Enforcing fine-grained access control policies that align with legal and organizational requirements.

A lack of these features constrains the practical utility of dataspaces in scenarios where data-driven decision-making depends on seamless and secure integration of data in a distributed data ecosystem.

2. Integrating Dataspace Features into Modern Data Architectures

The evolution from data lakes to data meshes and data fabrics reflects a significant transformation in how organizations approach data management to address issues of scalability, governance, and accessibility [8]. Data lakes, originally designed to store large volumes of structured and unstructured data in a centralized repository [9], often faced challenges related to governance and usability. Without robust management and accessibility frameworks, many data lakes devolved into ‘data swamps’, where finding meaningful insights became increasingly difficult [10].

Data meshes emerged to solve these issues by decentralizing data governance. This paradigm treats data as a product, where responsibility for the quality, usability, and governance of data lies with domain-specific teams. This domain-driven ownership model ensures scalability while addressing the shortcomings of centralized approaches, such as those found in traditional data lakes.

In parallel, data fabrics focus on creating an interconnected layer that integrates metadata across disparate systems. By employing technologies such as AI, automation, and knowledge graphs, data fabrics enable seamless data discovery, improved lineage tracking, and enhanced integration across an organization’s diverse data landscape. This approach prioritizes metadata-driven governance and
context-aware connectivity, providing a more dynamic and agile data ecosystem that supports advanced analytics and decision-making processes.

The concept of modern dataspaces aligns with these paradigms. Similar to the ‘data as a product’ philosophy in data meshes, dataspaces emphasize treating data assets as shared, governed resources designed for collaboration and interoperability. In both cases, the focus is on ensuring the quality, contextual relevance, and accessibility of data for specific stakeholders or applications. This shared emphasis underlines the mutual goal of enabling robust, domain-aware data collaboration across organizational boundaries.

However, sharing data products in a dataspace does not imply centralized governance or uniform data quality control. Similar to data meshes, data governance should be organized in a decentralized manner. Therefore, each participant should manage their own policies through self-governing data products. This approach aligns with the dataspace architecture, where data owners define and enforce usage policies for their data assets [4].

Additionally, dataspaces as well as data fabrics rely on semantic models to support semantic interoperability. In data fabrics, knowledge graphs serve as a foundational tool for modeling relationships and enriching metadata, allowing for enhanced data discovery, integration, and query capabilities. Similarly, dataspaces employ semantic models to achieve interoperability among heterogeneous data sources and domains (e.g., the IDS information model [11]). These models provide a shared understanding of data structures and relationships, which is essential for enabling meaningful cross-domain analytics and collaboration.

The interplay between these paradigms suggests a path toward convergence, where dataspaces could incorporate the principles of both data meshes and data fabrics. By blending the domain-centric ownership and product-oriented data management of data meshes with the semantic and automation-driven integration of data fabrics, dataspaces could emerge as a comprehensive framework for addressing modern data challenges. This evolution reflects a growing recognition of the need for distributed, interoperable, and semantically enriched data ecosystems capable of supporting diverse organizational and cross-domain needs.

The challenge lies in finding the optimal balance between unified semantic models for describing data assets and decentralized governance. In many dataspace projects, we have observed that a centralized approach to defining the core information model significantly slows down the bootstrapping process. A decentralized approach, as envisioned in data meshes, could accelerate this process but comes with the risk of diverging semantics. To mitigate these issues, collaborative ontology engineering methodologies need to be applied [12].

Enhancing data lakes with dataspace-inspired features can bridge the gap between centralized data repositories and the decentralized nature of dataspaces. Specifically, integrating features for data sovereignty and advanced query processing can yield transformative capabilities. By incorporating mechanisms like usage policies [13], a data lake can enforce access control, data provenance, and compliance policies. Databricks has proposed Delta Sharing (https://github.com/delta-io/delta-sharing/), a protocol for sharing datasets between data lakes, or even between organizations. Although the protocol supports fine-grained access control, it does not yet cover usage policies that support data sovereignty as in dataspaces.

On the other hand, enhancing dataspace frameworks, such as the Eclipse Dataspace Components (https://projects.eclipse.org/projects/technology.edc), with more sophisticated federated query processing for heterogeneous datasets could improve the usability of dataspaces. Data scientists require easy solutions for creating a Pandas data frame over heterogeneous data: an API as provided in Apache Spark, combined with Delta Sharing, and enriched with usage policies, could facilitate a truly sovereign data science framework that integrates heterogeneous data access, data integration, and machine learning. Although it might still be challenging to merge all these features into one platform, we can leverage large language models to support users in executing their tasks in such a platform [14, 15]. SEDAR, as an open-source data lake platform, offers a concrete foundation for these integrations [16]. Enhancing SEDAR with dataspace features could demonstrate the feasibility of such extensions and provide insights into performance trade-offs and usability.

3. Future Research Directions

We advocate for a paradigm shift in dataspace technology by prioritizing advanced query processing and seamless integration with traditional data management platforms. Leveraging existing innovations, such as the Delta Sharing Protocol, and extending platforms like SEDAR, can help realize the vision of a unified, sovereignty-aware data management ecosystem. However, several research challenges must still be addressed to fully enable this vision.

Policy-aware query execution requires embedding data sovereignty rules directly into query execution plans. Queries should be executed in compliance with access restrictions, data-sharing agreements, and computational constraints defined by data owners. User-centric interfaces should allow non-expert users to interact effectively with the dataspace. Many existing dataspaces suffer from poor usability, limiting their adoption and accessibility. Furthermore, usage policies must be extended beyond basic access control to include restrictions on query processing itself. Finally, data quality management remains a significant challenge in many dataspaces. A decentralized data quality framework, incorporating objective and standardized quality metrics, could help assess and improve data reliability while allowing participants to retain autonomy over their data assets.

Acknowledgments

This work has been sponsored by the German Federal Ministry of Education and Research in the funding program “Forschung an Fachhochschulen”, project I2 DACH (grant no. 13FH557KX0) and in the funding program “KI-Anwendungshub Kunststoffverpackungen – nachhaltige Kreislaufwirtschaft durch Künstliche Intelligenz”, project KIOptiPack (grant no. 033KI111).

AI Disclosure Statement. During the preparation of this work, the author used ChatGPT 4o in order to improve writing style and to check grammar and spelling. After using this tool, the author reviewed and edited the content as needed and takes full responsibility for the content of the publication.
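Illustrative sketch. The idea of policy-aware query execution raised in Section 3 (checking a query against the owner's usage policy before and during evaluation, not only at the access-control boundary) can be made more concrete with a small example. The following Python sketch is only an assumption-laden illustration and not part of any existing dataspace framework: the names UsagePolicy, Query, PolicyViolation, and execute_with_policy are hypothetical, the policy model (permitted purposes, permitted columns, an owner-defined row filter) is deliberately simplistic, and a real system would more likely derive such checks from a declarative policy language and push them into the query plan of a federated engine.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

Row = dict  # one record of a data asset, e.g. {"plant": "A", "output": 10}


class PolicyViolation(Exception):
    """Raised when a query is not covered by the owner's usage policy."""


@dataclass(frozen=True)
class UsagePolicy:
    """Hypothetical usage policy that a data owner attaches to a data asset."""
    allowed_purposes: frozenset
    allowed_columns: frozenset
    # Owner-defined row-level restriction; by default every row is visible.
    row_filter: Callable[[Row], bool] = lambda row: True


@dataclass(frozen=True)
class Query:
    """Hypothetical consumer query: a projection plus equality filters."""
    purpose: str
    columns: tuple
    filters: tuple = ()  # pairs of (column, required value)


def execute_with_policy(rows: Iterable[Row], query: Query, policy: UsagePolicy) -> list:
    """Evaluate the query only if, and only as far as, the policy permits it."""
    # 1. The declared purpose must be one the owner has agreed to.
    if query.purpose not in policy.allowed_purposes:
        raise PolicyViolation(f"purpose {query.purpose!r} is not permitted")

    # 2. Restrict query processing itself: filter columns count as used columns,
    #    so a consumer cannot probe hidden attributes through predicates.
    used = set(query.columns) | {column for column, _ in query.filters}
    forbidden = used - policy.allowed_columns
    if forbidden:
        raise PolicyViolation(f"columns not permitted: {sorted(forbidden)}")

    # 3. Apply the owner's row filter before the consumer's own filters,
    #    then project onto the requested columns.
    result = []
    for row in rows:
        if not policy.row_filter(row):
            continue
        if all(row[column] == value for column, value in query.filters):
            result.append({column: row[column] for column in query.columns})
    return result
```

In this sketch, a consumer whose query declares a purpose outside allowed_purposes, or that touches a column outside allowed_columns (in the projection or in a filter), is rejected with PolicyViolation before any row is read, while permitted queries only ever see rows that pass the owner's row_filter. Rejecting filter columns as well as projected columns is one simple way to reflect the requirement that usage policies restrict query processing and not just access.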
References

[1] M. J. Franklin, A. Y. Halevy, D. Maier, From databases to dataspaces: a new abstraction for information management, SIGMOD Rec. 34 (2005) 27–33. URL: https://doi.org/10.1145/1107499.1107502. doi:10.1145/1107499.1107502.
[2] B. Otto, J. Jürjens, J. Schon, S. Auer, N. Menz, S. Wenzel, J. Cirullies, Industrial Data Space - Digital Sovereignity over Data, Whitepaper, Fraunhofer-Gesellschaft, 2016. URL: https://www.fraunhofer.de/content/dam/zv/de/Forschungsfelder/industrial-data-space/Industrial-Data-Space_whitepaper.pdf.
[3] B. Otto, M. Jarke, Designing a multi-sided data platform: findings from the international data spaces case, Electron. Mark. 29 (2019) 561–580. URL: https://doi.org/10.1007/s12525-019-00362-x. doi:10.1007/s12525-019-00362-x.
[4] C. Quix, A. Chakrabarti, S. Kleff, J. Pullmann, Business process modelling for a data exchange platform, in: Proceedings of the Forum and Doctoral Consortium Papers Presented at the 29th International Conference on Advanced Information Systems Engineering, CAiSE 2017, Essen, Germany, June 12-16, 2017, volume 1848 of CEUR Workshop Proceedings, CEUR-WS.org, 2017, pp. 153–160. URL: https://ceur-ws.org/Vol-1848/CAiSE2017_Forum_Paper20.pdf.
[5] H. Pettenpohl, M. Spiekermann, J. R. Both, International data spaces in a nutshell, in: Designing Data Spaces: The Ecosystem Approach to Competitive Advantage, Springer, 2022, pp. 29–40. URL: https://doi.org/10.1007/978-3-030-93975-5_3. doi:10.1007/978-3-030-93975-5_3.
[6] H. Tardieu, Role of Gaia-X in the European data space ecosystem, in: Designing Data Spaces: The Ecosystem Approach to Competitive Advantage, Springer, 2022, pp. 41–59. URL: https://doi.org/10.1007/978-3-030-93975-5_4. doi:10.1007/978-3-030-93975-5_4.
[7] S. Geisler, M. Vidal, C. Cappiello, B. F. Lóscio, A. Gal, M. Jarke, M. Lenzerini, P. Missier, B. Otto, E. Paja, B. Pernici, J. Rehof, Knowledge-driven data ecosystems toward data transparency, ACM J. Data Inf. Qual. 14 (2022) 3:1–3:12. URL: https://doi.org/10.1145/3467022. doi:10.1145/3467022.
[8] I. Blohm, F. Wortmann, C. Legner, F. Köbler, Data products, data mesh, and data fabric, Bus. Inf. Syst. Eng. 66 (2024) 643–652. URL: https://doi.org/10.1007/s12599-024-00876-5. doi:10.1007/s12599-024-00876-5.
[9] R. Hai, C. Koutras, C. Quix, M. Jarke, Data lakes: A survey of functions and systems, IEEE Trans. Knowl. Data Eng. 35 (2023) 12571–12590. URL: https://doi.org/10.1109/TKDE.2023.3270101. doi:10.1109/TKDE.2023.3270101.
[10] R. Hai, S. Geisler, C. Quix, Constance: An intelligent data lake system, in: Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, ACM, 2016, pp. 2097–2100. URL: https://doi.org/10.1145/2882903.2899389. doi:10.1145/2882903.2899389.
[11] S. R. Bader, J. Pullmann, C. Mader, S. Tramp, C. Quix, A. W. Müller, H. Akyürek, M. Böckmann, B. T. Imbusch, J. Lipp, S. Geisler, C. Lange, The international data spaces information model - an ontology for sovereign exchange of digital content, in: The Semantic Web - ISWC 2020 - 19th International Semantic Web Conference, Athens, Greece, November 2-6, 2020, Proceedings, Part II, volume 12507 of Lecture Notes in Computer Science, Springer, 2020, pp. 176–192. URL: https://doi.org/10.1007/978-3-030-62466-8_12. doi:10.1007/978-3-030-62466-8_12.
[12] K. I. Kotis, G. A. Vouros, D. Spiliotopoulos, Ontology engineering methodologies for the evolution of living and reused ontologies: status, trends, findings and recommendations, Knowl. Eng. Rev. 35 (2020) e4. URL: https://doi.org/10.1017/S0269888920000065. doi:10.1017/S0269888920000065.
[13] D. M. Mustafa, A. Nadgeri, D. Collarana, B. T. Arnold, C. Quix, C. Lange, S. Decker, From instructions to ODRL usage policies: An ontology guided approach, in: Proceedings of Workshops at the 50th International Conference on Very Large Data Bases, VLDB 2024, Guangzhou, China, August 26-30, 2024, VLDB.org, 2024. URL: https://vldb.org/workshops/2024/proceedings/LLM+KG/LLM+KG-15.pdf.
[14] S. Hoseini, A. Burgdorf, A. Paulus, T. Meisen, C. Quix, A. Pomp, Towards LLM-augmented creation of semantic models for dataspaces, in: Proceedings of the Second International Workshop on Semantics in Dataspaces (SDS 2024) co-located with the 21st Extended Semantic Web Conference (ESWC 2024), Hersonissos, Greece, May 26, 2024, volume 3705 of CEUR Workshop Proceedings, CEUR-WS.org, 2024. URL: https://ceur-ws.org/Vol-3705/paper03.pdf.
[15] S. Hoseini, M. Ibbels, C. Quix, Enhancing machine learning capabilities in data lakes with AutoML and LLMs, in: Advances in Databases and Information Systems - 28th European Conference, ADBIS 2024, Bayonne, France, August 28-31, 2024, Proceedings, volume 14918 of Lecture Notes in Computer Science, Springer, 2024, pp. 184–198. URL: https://doi.org/10.1007/978-3-031-70626-4_13. doi:10.1007/978-3-031-70626-4_13.
[16] S. Hoseini, A. Ali, H. Shaker, C. Quix, SEDAR: A semantic data reservoir for heterogeneous datasets, in: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023, Birmingham, United Kingdom, October 21-25, 2023, ACM, 2023, pp. 5056–5060. URL: https://doi.org/10.1145/3583780.3614753. doi:10.1145/3583780.3614753.