<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Data Fabric Technologies, Modeling and Applications - A Review</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Radha Krishna Pisipati</string-name>
          <email>prkrishna@nitw.ac.in</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kamalakar Karlapalem</string-name>
          <email>kamal@iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Satyanarayana R Valluri</string-name>
          <email>satya.valluri@databricks.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data Science and Analytics Centre</institution>
          ,
          <addr-line>IIIT-Hyderabad</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Databricks Inc.</institution>
          ,
          <addr-line>California</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Computer Science and Engineering, National Institute of Technology Warangal</institution>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Data Fabric is an amalgamation of various database system technologies, offering extensive research opportunities for deploying end-to-end data management platform-based solutions. These platforms have seen advancements in middleware, advanced and powerful ETL pipelines, and generative AI-supported data pipelines with unified storage and compute to establish compliance and governance and reduce latency. Deployed systems using technologies such as data mesh, data lakes, data warehouses, and cloud databases serve as data sources, and the data fabric solution manages data, query, and analytics pipelines by leveraging distributed computing capabilities and dynamically routing queries for optimal performance without centralizing data storage. Understanding the interconnections (technology and applications) among source systems, data fabric, domain, and application is crucial for establishing correct and complete data fabric solutions for user applications. This paper presents a holistic view of data fabric technologies and addresses the importance of understanding the interconnections among source data systems, data fabric, domain, and application, focusing on metadata and application development. For metadata, we envisage an ER model solution to provide an overall conceptual data landscape for the underlying data systems for a data fabric.</p>
      </abstract>
      <kwd-group>
        <kwd>Data management</kwd>
        <kwd>distributed data sources</kwd>
        <kwd>data fabric technologies</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Researchers and practitioners developed various data platforms and storage technologies [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], such as
data mesh, data lakes, data warehouses, and cloud and federated databases to manage flows of large
amounts of data for queries and analytics. Each technology serves a distinct but complementary role
within an organization’s data ecosystem. Though these technologies are often treated as silos, the data
they manage is interrelated and serves the same organization’s multiple needs. Thus, from the query
and data analytics point of view, it is important to have a seamless and unified view of the data and
the technologies (different perspectives) used to query and manage it. Data fabric, a recent development
in data management, offers a comprehensive architecture to integrate disparate
data pipelines [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] seamlessly, reducing latency while ensuring robust data governance and compliance.
      </p>
      <p>Databases became necessary because traditional file systems coupled with applications created data
redundancy, hid the schema, and forced repeated implementation of the same functionality. A distributed
database integrates multiple interrelated databases residing on different systems. The key idea behind
the technology is to support various levels of transparency in accessing the distributed database. A
distributed database system can comprise homogeneous or heterogeneous database systems and can be
realized as an integrated system or a multi-database solution, as the application requires. A data
warehouse is a specialized database that provides a subject-oriented, integrated, time-variant,
non-volatile store for online analytical processing (mostly business intelligence reports and data mining). The
core technology for a data warehouse is the data cube, a multidimensional relational table that
efficiently supports aggregates (data cube operations) across trillions of rows. A data lake is a central
repository that allows ingesting, storing, processing, and analyzing large volumes of multi-modal data
in real time. This includes structured data (such as relational databases),
semi-structured data (like log files, CSV, XML, JSON), and unstructured data (such as text, images,
audio, and video). Unlike traditional data storage systems, data lakes store raw data in its native
format and structure and thus avoid (expensive) preprocessing or modeling before storing or
analyzing. Data lakes are often considered architecture-less, as they do not impose a specific structure or
schema on stored data. Creating and maintaining a functional data lake requires integrating
multiple technologies for data ingestion, processing, storage, and exploration tasks. A data lakehouse, a
large-scale data storage management system, blends the flexibility of data lakes with the structured
data management capabilities of data warehouses. Data mesh emphasizes decentralization of data
ownership and governance, advocating for domain-oriented teams responsible for managing data
within their respective domains. Data lakes provide flexibility and agility for storing diverse data
types and formats. Data mesh and data lake are decentralized approaches to managing and organizing
data within an organization. Data warehouse adopts a centralized and schema-on-write approach.
A data fabric abstracts the intricate technical processes involved in data movement, transformation,
and integration, ensuring universal data accessibility throughout the enterprise. (Figure 1: A three-layer
view of Data Fabric. Source Layer: data lakes, distributed databases, real-time data streams, cloud data.
Fabric Layer: data transformations, data lineage, metadata management, APIs &amp; tools, data workflows.
User Layer: data fabric administrator, application designer, application developer, solution architect,
data analyst.) The key idea of data fabric is to design data pipelines with the principle of loosely
coupling data across platforms and applications,
facilitating seamless access to data
present across distributed and heterogeneous environments, including on-premises and cloud-based
systems. A data pipeline is supported by metadata to route the query to the appropriate data sources.
Once the metadata is in place, this routing can be done automatically without the application
programmer explicitly coding it. Data fabric architectures manage query and analytics pipelines by leveraging
distributed computing capabilities without moving data to a centralized location.</p>
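      <p>The metadata-driven query routing described above can be sketched as follows. This is a minimal illustration; the catalog entries, dataset names, and source labels are hypothetical, not part of any specific data fabric product.</p>
      <preformat>
```python
# Minimal sketch of metadata-driven query routing in a data fabric.
# The catalog maps logical datasets to the physical source that holds them,
# so application code never hard-codes source locations.

CATALOG = {
    "orders":    {"source": "warehouse", "format": "table"},
    "clicklogs": {"source": "data_lake", "format": "json"},
    "inventory": {"source": "cloud_db",  "format": "table"},
}

def route_query(dataset):
    """Return the source system a query over `dataset` should run on."""
    entry = CATALOG.get(dataset)
    if entry is None:
        raise KeyError(f"no metadata registered for dataset {dataset!r}")
    return entry["source"]

print(route_query("clicklogs"))  # data_lake
```
      </preformat>
      <p>Once such a catalog is populated, routing happens automatically; adding a source is a metadata change, not a code change.</p>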
      <p>Figure 1 shows a three-layered view of a data fabric. At its base lies the source layer, composed of
data lakes, distributed databases, real-time data feeds, cloud data, and other data sources. The User
layer consists of various stakeholders, including application designers, developers, solution architects,
and data analysts. The job of users includes: (i) understanding the data required by the application
using domain knowledge, (ii) creating a schema that developers can use to write queries against, and
(iii) defining ETL (Extract, Transform, Load) pipelines to extract the necessary data and populate the
tables defined by the schema. All these tasks necessitate metadata information, which enables them to
comprehend how to connect the actual data sources with the schema’s tables and write efficient ETL
pipelines to extract the required data.</p>
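      <p>As a toy illustration of task (iii), a single ETL step might look like the following sketch; the raw field names and the target schema are invented for the example.</p>
      <preformat>
```python
# Minimal ETL sketch: extract raw records, transform them to match the
# application schema, and load them into a target table (a plain list here).

raw_records = [
    {"cust": "alice", "amt": "120.50"},
    {"cust": "bob",   "amt": "75.00"},
]

def transform(record):
    # Rename fields and cast types to fit the target schema (customer, amount).
    return {"customer": record["cust"], "amount": float(record["amt"])}

# "Load" step: populate the table defined by the schema.
target_table = [transform(r) for r in raw_records]
```
      </preformat>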
    </sec>
    <sec id="sec-2">
      <title>2. Data Modeling</title>
      <p>
        Modeling approaches can be broadly categorized into data-driven and query-driven methodologies
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Data-driven approaches initiate from a detailed analysis of the data sources, whereas query-driven
methodologies start from users’ requirements. Schema-on-read (e.g., data lakes) and schema-on-write
(e.g., data warehouse) concepts are used in handling data schema in data storage and management
systems. In a data lake, data is stored in its raw form without any predefined schema or structure
imposed upon it at the time of ingestion. Instead, the data is ingested into the data lake in its native
format, whether structured, semi-structured, or unstructured. The schema-on-read paradigm means that
the data schema is applied when it is accessed or queried, rather than when the data is initially stored.
This allows for flexibility in handling diverse data types and formats, as the data can be interpreted and
processed according to the requirements of specific analytical or processing tasks. The schema-on-read
approach enables organizations to store large volumes of raw data without needing upfront schema
design, promoting agility and adaptability in data analytics and exploration. Data modeling in a data
fabric involves creating a unified, flexible representation of diverse and distributed data sources. It
must support varying data types, evolving schemas, and real-time updates while ensuring governance
and integration across platforms. Further, the modeling must support analytics while maintaining
scalability and adaptability within the data fabric architecture.
      </p>
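      <p>The schema-on-read paradigm can be sketched in a few lines: raw records are stored untouched, and a schema is projected onto them only at query time. The event fields and the requested schema below are illustrative assumptions.</p>
      <preformat>
```python
import json

# Schema-on-read sketch: raw events are ingested in native form; a schema is
# applied only when the data is accessed or queried.

raw_store = [
    '{"user": "alice", "ts": "2024-01-01", "action": "login"}',
    '{"user": "bob", "action": "view", "page": "home"}',
]

def read_with_schema(raw, schema):
    """Project each raw record onto `schema`, filling missing fields with None."""
    out = []
    for line in raw:
        rec = json.loads(line)
        out.append({field: rec.get(field) for field in schema})
    return out

rows = read_with_schema(raw_store, ["user", "action"])
```
      </preformat>
      <p>Note that different analytical tasks can apply different schemas to the same raw store, which is exactly the flexibility the paragraph above describes.</p>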
    </sec>
    <sec id="sec-3">
      <title>3. Technologies and Applications</title>
      <p>
        Data fabric technologies facilitate the acquisition, access, transformation, management, intelligence,
orchestration, discovery, and governance of data without requiring explicit knowledge of data format.
Categorized into technical, operational, business, social, passive, and active types, Data Fabric metadata
plays a crucial role in understanding and profiling an organization’s data assets and facilitating seamless
integration across various platforms and technologies [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Thus, metadata management is essential for
providing context and insights into the data stored within the fabric. It involves processes and tools for
capturing, storing, and managing metadata, including information about data lineage, quality, usage,
and relationships. Scalability and resilience are essential characteristics of data fabric that enable it
to adapt to evolving application needs. Further, understanding and applying conceptual relationships
such as keys, composite attributes, n-ary relationships, generalization, aggregation, and constraints
are essential for designing a robust, efficient, and effective data fabric that aligns with the underlying
business logic and meets the desired functionality requirements of various applications over data fabric.
      </p>
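      <p>A toy representation of such conceptual metadata, as a fabric catalog might maintain it, is sketched below; the entity, attribute, and relationship names are invented for illustration.</p>
      <preformat>
```python
# Toy conceptual-metadata catalog: entities with declared keys and attributes,
# plus a relationship with its cardinality constraint.

entities = {
    "Customer": {"key": ["cust_id"],  "attributes": ["cust_id", "name", "region"]},
    "Order":    {"key": ["order_id"], "attributes": ["order_id", "cust_id", "total"]},
}

relationships = [
    {"name": "places", "between": ("Customer", "Order"), "cardinality": "1:N"},
]

def key_of(entity):
    """Look up the declared key of an entity in the catalog."""
    return entities[entity]["key"]
```
      </preformat>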
      <p>
        Data fabric components are built using various technologies designed to handle the complexity and
scale of modern data environments. The widely used file formats for storing large data sources include
CSV, Parquet, ORC, Avro, and Feather. Many of these formats include metadata within their storage [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Parquet and ORC are columnar storage formats optimized for read-heavy workloads, offering efficient
compression and encoding schemes. Avro is a row-based format that supports schema evolution,
making it suitable for data serialization. Feather, built on Apache Arrow, provides fast read and
write operations, ideal for data exchange across platforms. Metadata file formats (such as Hive, Iceberg,
Delta, and Hudi) manage large-scale data on distributed storage systems [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Hive provides a data
warehousing solution with SQL-like querying. Iceberg offers table format capabilities for handling
high-performance data lakes with ACID transactions. Delta Lake provides similar functionality with
additional support for scalable metadata handling and schema enforcement. Apache Hudi supports
incremental data processing and stream ingestion, making it suitable for real-time data lake architectures.
      </p>
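      <p>Schema evolution of the kind Avro supports can be sketched with plain dictionaries: a record written under an old schema is resolved against a newer reader schema, with defaults filling the added fields. This mimics the idea only; it does not use the actual Avro library, and the record and field names are invented.</p>
      <preformat>
```python
# Schema-evolution sketch in the spirit of Avro's reader/writer resolution.

old_record = {"id": 7, "name": "widget"}   # written before "price" existed

reader_schema = [
    ("id", None),
    ("name", None),
    ("price", 0.0),   # field added later, with a default for old records
]

def resolve(record, schema):
    """Fill fields absent from an old record with the reader schema's defaults."""
    return {field: record.get(field, default) for field, default in schema}

resolved = resolve(old_record, reader_schema)
```
      </preformat>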
      <p>
        Data integration tools use mechanisms such as connectors, data ingestion pipelines, and ETL processes
to ensure data quality and consistency during integration. Schema matching [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] algorithms identify
semantic correspondences between the elements of two schemas using their structural and syntactic
pattern. Metadata extraction, classification, and tagging mechanisms are used to automate the creation
and updating of metadata [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Data governance and security mechanisms provide access control
policies (e.g., Role-Based Access Control), encryption techniques (e.g., the Advanced Encryption Standard),
and compliance management systems to secure data and comply with regulations. MapReduce, machine
learning algorithms, and real-time stream processing algorithms are used to process and analyze
data by leveraging distributed computing frameworks such as Apache Hadoop and Spark for efficient
parallel processing of large datasets. Data orchestration tools (e.g., [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]) use workflows and event-driven
architectures to automate data movement through the pipeline, ensuring that data flows smoothly and
efficiently. Architecting a data fabric involves designing data workflows and potentially restructuring
data to cater to varied user needs [
        <xref ref-type="bibr" rid="ref2 ref9">9, 2</xref>
        ]. Data fabric architectural frameworks integrate domain models
and metadata structures tailored for diverse applications, including additive manufacturing [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and
road transport [
        <xref ref-type="bibr" rid="ref11">11</xref>
] systems. A data fabric operates in a complex landscape characterized by numerous
interconnected technologies, solutions, and diverse data and control flows for robust data management.
Within this environment, varying levels of abstraction often leave end users struggling to identify the
pertinent data and formulate relevant queries to meet application requirements—especially as these
needs evolve dynamically over time. Automated data management trends in the industry are moving
towards zero-ETL, No-Code data orchestration, Machine Learning pipelines, metadata discovery,
cross-domain schema matching, etc., to reduce latency and improve governance in the data fabric environment.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Open Problems</title>
      <p>Incorporating data governance, lineage, security, and privacy within a data fabric framework remains
an ongoing research area, particularly in developing standardized metadata management and data
cataloging practices. This is because comprehensively understanding the metadata for one or more data
lakes and their interrelationships is challenging. Achieving seamless consolidation across heterogeneous
and distributed data sources requires advanced modeling techniques to handle varying data types,
schemas, and formats and establish connections among them. Further, the dynamic nature of data
sources and pipelines necessitates models that can adapt in real time to changes in data structure and
volume while maintaining performance, scalability, and adaptability.</p>
      <p>Ensuring secure and non-harmful operations within a data fabric is crucial, as it involves managing
complex data pipelines, which in turn access multiple data sources. Moreover, it is essential to guarantee
that the data fabric is always performing its intended functions, for instance, ensuring users access
the right and complete data without getting lost among multiple data pipelines, whose complexity can
obscure the end-user abstraction. Metadata reasoning is an open problem for supporting correct, on-the-fly
data integration and governance across varied platforms and applications and for executing multiple data
pipelines in parallel. These challenges hinder fully realizing the data fabric’s potential, especially in
large-scale, enterprise-level applications where performance, reliability, and resilience are critical.</p>
      <p>Though LLMs are used to generate data pipelines and determine the metadata relevant to user
queries, comprehending that metadata requires mapping it to a conceptual ER model, which can itself be done using LLMs.
Techniques to validate data pipelines across conceptual data models need to be developed. A toolkit to
support conceptual model-driven data pipeline establishment is required. The evaluation of whether
the data pipeline provides the complete result for a user query needs to be formulated along with
techniques to establish the completeness of the data pipeline result.</p>
      <p>Declaration on Generative AI: During the preparation of this work, the author(s) used ChatGPT and
Grammarly in order to: check grammar and spelling, paraphrase, and reword. After using these tools/services,
the author(s) reviewed and edited the content as needed and take(s) full responsibility for the
publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Hechler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Weihrauch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Data fabric and data mesh approaches with AI</article-title>
          , Berkeley, CA, USA: Apress Berkeley (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <article-title>Data-fabric: A data fabric system based on metadata</article-title>
          ,
          <source>in: 2022 IEEE 5th International Conference on BDAI</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>57</fpage>
          -
          <lpage>62</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Hai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Koutras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Quix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jarke</surname>
          </string-name>
          ,
          <article-title>Data lakes: A survey of functions and systems</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>35</volume>
          (
          <year>2023</year>
          )
          <fpage>12571</fpage>
          -
          <lpage>12590</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Barik</surname>
          </string-name>
          , Data fabric primer,
          <year>2022</year>
          . URL: https://www.globallogic.com/in/wp-content/uploads/sites/21/2023/12/Paper-Data-Fabric-primer.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pavlo</surname>
          </string-name>
          ,
          <article-title>What goes around comes around</article-title>
          ... and around...,
          <source>ACM Sigmod Record</source>
          <volume>53</volume>
          (
          <year>2024</year>
          )
          <fpage>21</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Peukert</surname>
          </string-name>
          ,
          <article-title>Large-scale schema matching</article-title>
          ,
          <source>in: Encyclopedia of Big Data Technologies</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>1105</fpage>
          -
          <lpage>1110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cherradi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El Haddadi</surname>
          </string-name>
          ,
          <article-title>Ememodl: Extensible metadata model for big data lakes</article-title>
          ,
          <source>International Journal of Intelligent Engineering and Systems</source>
          <volume>16</volume>
          (
          <year>2023</year>
          )
          <fpage>231</fpage>
          -
          <lpage>243</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          K2View,
          <article-title>Data transf</article-title>
          . &amp; orchestration,
          <year>2023</year>
          . URL: https://www.k2view.com/platform/data-orchestration-tools/, accessed: 2025-09-02.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>McSweeney</surname>
          </string-name>
          ,
          <source>Designing an enterprise data fabric</source>
          ,
          <year>2019</year>
          . URL: https://www.researchgate.net/publication/333485699_Designing_An_Enterprise_Data_Fabric.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.-O.</given-names>
            <surname>Östberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Vyhmeister</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. G.</given-names>
            <surname>Castañé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Meyers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Van Noten</surname>
          </string-name>
          ,
          <article-title>Domain models and data modeling as drivers for data management: The assistant data fabric approach</article-title>
          ,
          <source>IFAC-PapersOnLine</source>
          <volume>55</volume>
          (
          <year>2022</year>
          )
          <fpage>19</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Rieyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R. K.</given-names>
            <surname>News</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T. J.</given-names>
            <surname>Zaarif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G. R.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ianni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Fortino</surname>
          </string-name>
          ,
          <article-title>An advanced data fabric architecture leveraging homomorphic encryption and federated learning</article-title>
          ,
          <source>Information Fusion</source>
          <volume>102</volume>
          (
          <year>2024</year>
          )
          <fpage>102004</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>