<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Rise of Open Table Formats: Diving into the Next Decade of Data Lakes (Panel Discussion)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Walaa Eldin Moustafa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Senior Staff Software Engineer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>LinkedIn</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Panel Members: Ryan Blue, Co-creator of Apache Iceberg. Rahul Potharaju, Senior Engineering Manager, Databricks. Nishith Agarwal, Co-creator of Apache Hudi. Dain Sundstrom, Co-creator of Trino. Justin Levandoski, Engineering Director, Google BigQuery. Jesus Camacho Rodriguez, Principal Research Engineering Manager</institution>
          ,
          <addr-line>Microsoft</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The evolution of data lakes and their constituent technologies has been marked by remarkable milestones that have reshaped the landscape of data management and processing. Fifteen years ago, Hadoop won the terabyte sort benchmark [1], an achievement that signified a historic shift in the data processing domain. Following this victory, numerous initiatives were undertaken to formulate and standardize data exchange protocols. File formats like Avro [2], ORC [3], and Parquet [4] emerged as the de facto standards, embodying efficient data serialization and deserialization protocols for large-scale processing. Building on the concept of metadata as a foundation, the Hive engine [5] was introduced, coupled with the Hive metastore and the Hive table format, which abstracted these intricate file formats. In doing so, Hive granted big data users the powerful capability of leveraging a SQL abstraction atop vast data lakes, simplifying complex operations and making them more accessible to a broader audience. This marked a transformative period, setting the stage for the subsequent rise of further modern big data compute engines, such as Spark [6], Presto [7], Trino [8], Flink [9], BigQuery [10], Azure Synapse [11], and Redshift [12], along with an emerging set of open table formats like Iceberg [13], Delta Lake [14], and Hudi [15]. Modern open table formats represent pivotal advancements in the continuous evolution of data lakes. While early data lake technologies laid the foundation, these open table formats brought about a renaissance in how data practitioners interacted with massive datasets. One of the most transformative features introduced by these systems is their native support for ACID transactions, ensuring atomicity, consistency, isolation, and durability even at massive scales. This capability revamped the trustworthiness and reliability of big data operations, enabling consistent reads and writes.
Furthermore, they integrated versioning at the core of their design, allowing users to maintain, access, and revert to historical data states, a crucial feature for auditability and reproducibility. Change Data Capture (CDC), another seminal feature, provided the means to capture and process granular data modifications, eliminating the cumbersome and inefficient batch-reload paradigms of the past. Additionally, the introduction of incremental compute facilitated more frequent and efficient updates by processing only the altered data segments, rather than entire datasets. Finally, modern table formats enhanced performance by utilizing fine-grained statistics and advanced indexing. These innovations allow for more efficient data access and querying, drastically reducing processing times and resource consumption. Collectively, these advancements not only bridged many of the traditional gaps in data lake architectures but also drastically elevated the user experience, making data processing more robust, performant, agile, and user-friendly. As we sail into the horizon of the next decade of data lakes, a myriad of compelling topics emerge, reflecting both the challenges and opportunities in this rapidly evolving domain. (i) A debate arises over the future of open versus proprietary table formats, with proponents vouching for community-driven innovation and enterprise-driven robustness, respectively. (ii) A central theme of data lake evolution is the direction of influence in API design between compute engines and table formats. The question at hand revolves around interdependence and interplay: Do the evolving needs and architectures of compute engines primarily shape the API design of table format specifications? Or is it the innovations and constraints in table format specifications that predominantly drive the API designs of engine connectors?
Unpacking this interaction becomes pivotal in understanding the trajectory of future advancements and in harnessing the combined potential of compute engines and table formats.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>(iii) Interoperability between table formats, along with SQL interoperability between compute engines
and table format specifications, is paramount. Such interoperability ensures that data practitioners aren’t
locked into a particular ecosystem but can seamlessly transition and integrate diverse systems, fostering
a richer, more collaborative data environment.</p>
      <p>(iv) Security, audit, and lineage features are more critical than ever, as the world becomes more
data-centric and regulations grow more stringent. Traceability of data transformations, access
controls, and secure data handling mechanisms are at the forefront of discussions.</p>
      <p>(v) The intricacies of bookkeeping data lakes and the meticulous task of maintaining data lake
tables present challenges that require novel solutions, blending both technological innovation and best
practices.</p>
      <p>(vi) Lastly, the recent pervasiveness of AI introduces an array of new requirements for data lakes.
As machine learning and deep learning models become integral to business processes, data lakes must
adapt to cater to the specific needs of AI workflows. In essence, the proliferation of AI will surely define
a new chapter for data lakes, demanding architectural and functional enhancements to meet the unique
challenges and potentials of AI-driven analytics.</p>
      <p>Looking back at the last decade, it’s evident that the data community has made significant progress,
turning challenges into meaningful innovations. As we look forward to the next ten years, many
questions remain. However, one clear fact stands out: data will continue to be of ever-growing
importance in our society, serving as a key catalyst in advancing education, healthcare, and economic
equity for wider communities and future generations.</p>
      <p>[12] Amazon Redshift, https://aws.amazon.com/redshift/, 2023.</p>
      <p>[13] Apache Iceberg, https://iceberg.apache.org/, 2023.</p>
      <p>[14] Delta Lake, https://delta.io/, 2023.</p>
      <p>[15] Apache Hudi, https://hudi.apache.org/, 2021.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Hadoop</given-names>
            <surname>Wins Terabyte Sort Benchmark</surname>
          </string-name>
          , https://hadoop.apache.org/news/2008-07-xxterabyte-sort.html,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Apache</given-names>
            <surname>Avro</surname>
          </string-name>
          , https://avro.apache.org/,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Apache</surname>
            <given-names>ORC</given-names>
          </string-name>
          , https://orc.apache.org/,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Apache</given-names>
            <surname>Parquet</surname>
          </string-name>
          , https://parquet.apache.org/,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Apache</given-names>
            <surname>Hive</surname>
          </string-name>
          , https://hive.apache.org/,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Apache</given-names>
            <surname>Spark</surname>
          </string-name>
          , https://spark.apache.org/,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Presto</surname>
          </string-name>
          , https://prestodb.io/,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Trino</surname>
          </string-name>
          , https://trino.io/,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Apache</given-names>
            <surname>Flink</surname>
          </string-name>
          , https://flink.apache.org/,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Google</surname>
            <given-names>BigQuery</given-names>
          </string-name>
          , https://cloud.google.com/bigquery,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Azure</surname>
            <given-names>Synapse</given-names>
          </string-name>
          , https://azure.microsoft.com/en-us/products/synapse-analytics,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>Joint Workshops at 49th International Conference on Very Large Data Bases (VLDBW'23) - Second International Workshop on Composable Data Management Systems (CDMS'23)</source>
          ,
          <source>August 28 - September 1</source>
          ,
          <year>2023</year>
          , Vancouver, Canada ©
          <year>2023</year>
          <article-title>Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>CEUR Workshop Proceedings (CEUR-WS.org)</source>
          , http://ceur-ws.org, ISSN 1613-0073
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>