<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jacopo Tagliabue</string-name>
          <email>jacopo.tagliabue@nyu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ciro Greco</string-name>
          <email>ciro.greco@bauplanlabs.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Bigon</string-name>
          <email>luca.bigon@bauplanlabs.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Vancouver, Canada</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bauplan</institution>
          ,
          <addr-line>New York City</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>In this paper we describe how we designed Bauplan</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Tandon School of Engineering</institution>
          ,
          <addr-line>NYU, New York City</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <issue>2023</issue>
      <abstract>
        <p>The recently proposed Data Lakehouse architecture is built on open file formats, performance, and first-class support for data transformation, BI and data science: while the vision stresses the importance of lowering the barrier for data work, existing implementations often struggle to live up to user expectations. At Bauplan, we set out to fulfill the Lakehouse vision. Since building from scratch is a challenge unfit for a startup, we started by re-using (sometimes unconventionally) existing projects, and then invested in improving the areas that would give us the highest marginal gains for the developer experience. In this work, we review user experience, high-level architecture and tooling decisions, and conclude by sharing plans for future development.</p>
      </abstract>
      <kwd-group>
        <kwd>data lakehouse</kwd>
        <kwd>data pipelines</kwd>
        <kwd>serverless</kwd>
        <kwd>reasonable scale</kwd>
        <kwd>containerized execution</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>[2] argues that the popular data warehouse architecture
will soon be replaced by a new architectural pattern,
the Data Lakehouse (DLH). A DLH is built on open file
formats (e.g. Parquet), exceptional performance, and
first-class support for engineering (data transformation),
analytics (BI) and inferential (data science) use cases. The
vision of such an architecture is first and foremost about
flexibility, making it possible for organizations to choose
different ways to operationalize data depending on data
[...] constraints.</p>
      <p>
        This is particularly valuable for large organizations
where data democratization is crucial to achieve agility
[
        <xref ref-type="bibr" rid="ref49">3</xref>
        ]: enabling easier access to and understanding of data
is the prerequisite for organizations to best leverage their
data. The heterogeneity of use cases is reflected in the
complexity of the underlying infrastructure (Fig. 2), with
some pieces coming from databases (query engines,
tables, data catalogs etc.), some from distributed systems
[
        <xref ref-type="bibr" rid="ref31 ref50 ref7">1</xref>
        ]. While our goals, timeline and methodology are different, our
work shares the underlying philosophy.
[...] beyond traditional Big Data frameworks.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. A Practitioner Perspective</title>
      <p>Aside from security and compliance, the biggest
argument in favor of the DLH is flexibility: different teams
can use different tools to process data for different use
cases. Practically, this implies that any DLH needs to
support two different use cases:
• Query and Wrangle (QW), referring to the
scenario where users need to explore data and
ask specific questions (e.g. counting how many
marketing emails were opened in the previous
month). Querying predominantly involves SQL,
while Wrangling is often performed in Python.
• Transform and Deploy (TD), referring to the
scenario where users need to construct
code-driven, reproducible data pipelines (DAGs) that
generate new artifacts for downstream
utilization, for instance building a dashboard exposing
the performance of marketing emails across
different user demographics. Due to the distinct
strengths and weaknesses of SQL and Python,
the combination of both is often optimal.</p>
      <p>Importantly, depending on the phase of the development cycle in which
developers find themselves, their way of
interacting with the data can be either Synchronous or
Asynchronous. While QW is de facto always synchronous, TD
tends to be more nuanced and needs to support both.
• Synchronous is when a user issues a command
(a SQL query, or a DAG run) and waits for the
results to come back. In this scenario, simplicity
and a fast feedback loop are the key goals [4];
• Asynchronous is when a command is issued
(often by another system, such as an orchestrator)
and the user is involved in monitoring the
outcome at a later time. In this scenario, reliability,
resilience and infrastructure ergonomics are the
key goals.</p>
      <p>The interplay between use cases and modalities is summarized in Table 1. A DLH needs to provide a coherent developer experience across the different phases of the development cycle (Dev vs. Prod) while supporting the
possibility of interacting with data (QW vs. TD) in both
synchronous and asynchronous ways.</p>
      <p>To achieve this, we designed Bauplan with the
following general design principles in mind:
• Serverless experience: to fully leverage the
separation of storage and compute, developers
should deal with as little infrastructure as possible.
We propose to decouple data logic from
execution to enable a “serverless” experience based on
a declarative approach; furthermore, since data
pipelines are functional in nature (the output of
parent nodes is the input for children), a
function-as-a-service deployment is prima facie a natural fit.<sup>1</sup>
• Software development patterns: it is often the
case that the only developers who can bring data
applications to production are those who possess
a special data engineering skill set. Empowering
developers with more general coding skills to do
impactful work on data is a fundamental piece of
the DLH vision. Systems should allow users to
use only familiar tools like SQL, standard Python,
the CLI and Git.
• Reproducibility and versioning: because
building data products primarily involves
reproducible and versioned code pipelines, we
embrace the idea that code provides a (mostly
declarative) way to build data. Likewise, data is
treated as code, adopting a life cycle including
branching, committing, and merging.
• Full Auditability: cloud clusters with long
startup times and complex configurations
encourage developers to resort to local development to
expedite the feedback loop. However, this pattern
exacerbates the challenges of software
development (e.g. dependency management) while
introducing potential security issues. We advocate
for a cloud-first approach, ensuring that all work
and access are centralized, auditable, and aligned
with security and governance policies.</p>
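      <p>The remark that pipelines are functional in nature can be illustrated with a minimal sketch, in which each node is a plain function and the output of a parent is passed as the input of its child. The node names trips and pickups echo the running example of the paper, but the function bodies, the sample rows and the run_pipeline helper are assumptions made for illustration, not Bauplan code.</p>
      <preformat>
```python
from typing import Callable, Dict, List

Row = Dict[str, int]
Table = List[Row]


def trips(taxi_table: Table) -> Table:
    """Parent node: keep only the columns needed downstream."""
    return [
        {"pickup_location_id": r["pickup_location_id"], "count": r["passenger_count"]}
        for r in taxi_table
    ]


def pickups(trips_table: Table) -> Table:
    """Child node (hypothetical logic): total passengers per pickup location."""
    totals: Dict[int, int] = {}
    for r in trips_table:
        key = r["pickup_location_id"]
        totals[key] = totals.get(key, 0) + r["count"]
    return [{"pickup_location_id": k, "count": v} for k, v in sorted(totals.items())]


def run_pipeline(source: Table, steps: List[Callable[[Table], Table]]) -> Table:
    """Feed the output of each node into the next one."""
    result = source
    for step in steps:
        result = step(result)
    return result


# Made-up sample rows standing in for the raw table.
raw = [
    {"pickup_location_id": 1, "passenger_count": 2, "dropoff_location_id": 7},
    {"pickup_location_id": 1, "passenger_count": 1, "dropoff_location_id": 3},
    {"pickup_location_id": 4, "passenger_count": 3, "dropoff_location_id": 1},
]
result = run_pipeline(raw, [trips, pickups])
print(result)
```
      </preformat>
      <p>Because every node only reads its argument and returns a new table, each one could in principle be shipped to its own function-as-a-service invocation without changing the pipeline code.</p>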
    </sec>
    <sec id="sec-3">
      <title>3. Departing from Spark</title>
      <sec id="sec-3-1">
        <p>Before delving into the specifics of our design for
Bauplan, we wish to explain the rationale behind
departing from Spark, which is widely regarded as the industry
standard for analytics at scale and holds a significant
position in numerous DLH implementations.</p>
        <p>Given our discussion about the ideal DLH developer
experience, we believe Spark falls short for several
structural reasons. For instance, slow startup and execution
make Spark sub-optimal for synchronous operations,
such as QW. At the same time, the system has a notoriously
steep learning curve [5, 6], both from an API and a
debugging perspective: when thinking about TD, it is often
hard to reason about it [7, 8]. If the DLH vision is truly
about enabling a broader set of practitioners to perform
data transformations, these systems are not necessarily
the best design choice.</p>
        <p>• the data lake: while not obvious from the code
itself, there is an object storage layer containing
the raw data we are starting from: from the
developer perspective, users would only interact with
logical constructs, such as taxi_table; from an
implementation standpoint, handling persistent
objects transparently is a huge component of the
DLH (Sections 4.2 and 4.3 below).
• declarative data assets: we subscribe to the
one-query, one-artifact pattern popularized by
dbt-style transformations<sup>5</sup>: users define artifacts one
by one as SQL queries, and the platform builds up
the DAG based on parsing and naming
convention (Section 4.4). Importantly, no
imperative-style DAG construction is needed: insofar as
users implicitly link together parent and children
nodes through their code, functions “are all you
need”<sup>6</sup>;
• data expectations: it is best practice to test the
tables produced by a DAG for statistical anomalies.
This provides the foundation of the
transform-audit-write pattern for data development (Section
4.3): just as in software we can debug, test and
even run different versions of an application in
parallel against production, automated testing
and versioning become the foundation of the
same approach for data pipelines.<sup>7</sup></p>
        <p><sup>1</sup>Note that we purposely use the term with some flexibility (Section 4.5).</p>
        <p><sup>2</sup>To fully anonymize the dataset, we used the powerlaw package [14]
for distribution fitting: final data are then generated by sampling
from the distribution.</p>
        <p><sup>3</sup>https://ourworldindata.org/grapher/historical-cost-of-computer-memory-and-storage?time=2010..latest&amp;facet=metric.</p>
        <p><sup>4</sup>https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page</p>
      </sec>
      <sec id="sec-3-2">
        <sec id="sec-3-2-1">
          <title>4.2. Table Format</title>
          <p>While a data lake is ultimately made of files, we wish
to provide table-like abstractions to our users: by
decoupling the actual storage of the data (the file
s3://mybucket/taxifile.parquet) from their logical function (the
list of taxi trips in NYC), we can reuse the same code
across data versions: every command that points to
taxi_table can be executed over different versions of
the table with just a configuration change (Section 4.3).</p>
          <p>After considering Delta Lake, Hudi<sup>8</sup>, and Iceberg<sup>9</sup> as
possible formats to give table-like semantics to the object
storage, we chose Iceberg mainly for three reasons: larger
community support, full support for time-travel and
versioning semantics, and limited but increasing compatibility
with Python<sup>10</sup>.</p>
          <p>At the time of writing, major formats have full read /
write support only for JVM engines (e.g. Spark, Presto,
Dremio). Considering our working hypothesis about the
Reasonable Scale and Bauplan's focus on a serverless
experience, two major tasks had to be completed to overcome
these constraints. First, when running a query over an
Iceberg table, our code intelligence module needs to
parse SQL into a table scan to obtain a dataframe-like
object (Section 4.4.2); second, when materializing a data
asset from the DAG back to the data catalog, a Spark
session is created to handle the Iceberg INSERT:
following our no infrastructure principle, we created custom
containers (Section 4.5) optimized for starting a Spark
command with 300 milliseconds latency. As a result, the
materialization step looks no slower than running any
other Python function (as opposed to waiting for a Spark
cluster to launch).</p>
          <p><bold>4.3. Data catalog and versioning</bold></p>
          <p>Software development best practices and tooling allow
developers to work on code (new feature, bug fixing,
debugging etc.) in a consistent and sandboxed way:
production code can be cloned, run, modified by developers,
but running development code won’t leak into a
production environment. Bauplan provides the same best
practices for data pipelines, enforcing a
transform-audit-write pattern for all transformations. In particular, we
picked Nessie<sup>11</sup> to provide Git-like semantics: Nessie
versions an entire catalog at a time, so it is ideal for
transformation use cases when multiple artifacts are affected
at each run. Fig. 4 depicts the basic versioning mechanism
in the platform:
1. the user checks out a new Git branch in
their project (feat_1), to develop a new pipeline;
2. in the context of a bauplan run command,
Bauplan detects the Git context and creates a
Nessie branch with the same name, feat_1,
starting from the current production data in the lake
main branch (grey node); now both the code
(through Git) and the data artifacts (through
Nessie) are production-like and sandboxed;
3. Bauplan executes the DAG into an ephemeral
branch (run_12): by executing each run
“atomically” we can avoid persisting dirty DAGs: only
when all steps and tests are executed successfully
are we allowed to merge the data into the current
branch, making the artifacts 1 and 3 visible to
any user with branch access (the obvious analogy
here is the concept of transaction in databases);
4. when the merge on feat_1 is committed, the
ephemeral branch run_12 is deleted.</p>
          <p><sup>5</sup>https://github.com/dbt-labs/dbt-core</p>
          <p><sup>6</sup>See also the Appendix for the full code example.</p>
          <p><sup>7</sup>Following the software analogy further, expectations are akin to
integration tests, where a new component is embedded in an
existing system, and checks are made to ensure the desired output is
achieved. A related but different concept is unit tests, which instead
work on manually fabricated input-output pairs, to test edge cases
or important scenarios irrespective of the system actually seeing
this input. Given our abstractions, Bauplan can easily accommodate
both types of tests, especially considering that Python primitives
for creating unit tests for tables are better than SQL.</p>
          <p><sup>8</sup>https://hudi.apache.org/</p>
          <p><sup>9</sup>https://github.com/apache/iceberg</p>
          <p><sup>10</sup>https://py.iceberg.apache.org/</p>
          <p><sup>11</sup>https://projectnessie.org/</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
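        <p>The four steps above can be sketched with a toy, in-memory catalog. This is only an illustration of the transform-audit-write idea under stated assumptions: the BranchStore class and the run_dag helper are hypothetical names, branches are plain dictionaries, and nothing here uses or mirrors the Nessie API.</p>
        <preformat>
```python
import copy
from typing import Callable, Dict, List

Catalog = Dict[str, List[dict]]  # table name -> rows


class BranchStore:
    """Toy stand-in for a catalog with Git-like branches (not the Nessie API)."""

    def __init__(self, main: Catalog) -> None:
        self.branches: Dict[str, Catalog] = {"main": main}

    def create(self, name: str, source: str) -> None:
        # A new branch starts as a copy of the source branch state.
        self.branches[name] = copy.deepcopy(self.branches[source])

    def merge(self, source: str, target: str) -> None:
        # Make every table of the source branch visible on the target.
        self.branches[target].update(copy.deepcopy(self.branches[source]))

    def delete(self, name: str) -> None:
        del self.branches[name]


def run_dag(
    store: BranchStore,
    branch: str,
    run_id: int,
    steps: List[Callable[[Catalog], None]],
    tests: List[Callable[[Catalog], bool]],
) -> bool:
    """Run all steps in an ephemeral branch; merge only if every test passes."""
    temp = f"run_{run_id}"
    store.create(temp, branch)
    try:
        for step in steps:
            step(store.branches[temp])
        ok = all(test(store.branches[temp]) for test in tests)
        if ok:
            store.merge(temp, branch)
        return ok
    finally:
        # The ephemeral branch is always removed, whether or not we merged.
        store.delete(temp)


def build_trips(catalog: Catalog) -> None:
    catalog["trips"] = [r for r in catalog["taxi_table"] if r["passenger_count"] > 0]


def build_pickups(catalog: Catalog) -> None:
    catalog["pickups"] = list(catalog["trips"])


def trips_not_empty(catalog: Catalog) -> bool:
    return len(catalog["trips"]) > 0


def always_fails(catalog: Catalog) -> bool:
    return False


store = BranchStore({"taxi_table": [{"passenger_count": 2}, {"passenger_count": 0}]})
store.create("feat_1", "main")
committed = run_dag(store, "feat_1", 12, [build_trips], [trips_not_empty])
rejected = run_dag(store, "feat_1", 13, [build_pickups], [always_fails])
```
        </preformat>
        <p>In this sketch the first run makes trips visible on feat_1, the second run leaves no trace because its test fails, and main is never touched, which is the transaction-like behavior described in step 3.</p>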
        <p>Once again, we remark that we chose to base the
developer experience only on Git and the CLI. While we
expect users to be familiar with Git, all the data
versioning is handled behind the scenes transparently. The user
is not expected to master Nessie or any of the
technologies involved. Instead, they are provided with a
sandbox environment for data development with familiar
software-like semantics.</p>
        <p><bold>4.4. Code intelligence</bold></p>
        <p>There is a natural tension in modularity between code
and compute: modular code is easier to test, re-use and
reason about; on the other side, monolith compute is easier
to spin up, manage and orchestrate. The no infrastructure
principle provides guidance on how to navigate the
trade-offs: on the code side, we subscribe to full modularity (e.g.
dbt-style transformations), so that each node in the DAG
corresponds to one file that is runnable and testable in
isolation; on the compute side, we let the system opt
for modularity or monolith depending on the
circumstances. In other words, the user is exposed directly only
to the top layer in Fig. 3: it is the job of the code
intelligence module (Fig. 2) to take as input the queries and
functions defining a pipeline, together with parameters
from the CLI, and produce as output first a logical plan of
operations, and finally a physical plan to run the desired
transformations.</p>
        <p><bold>4.4.1. From code to the logical plan</bold></p>
        <p>After the pipeline code is ingested (e.g. Section A), the
full project is snapshotted in an object storage and
fingerprinted in a Postgres database, not dissimilarly from
what happens for runs in Metaflow [15]: by assigning
an id and immutable artifacts to each run, we guarantee
reproducibility for auditing and debugging purposes:
following the code is data principle, the same code on the
same data version will produce identical results. After
versioning, SQL and Python files are parsed: first,
logical dependencies are extracted from implicit references
(in our example, pickups is built out of another
table, SELECT .. FROM trips, so we need to materialize
nodes in the right order); second, environment details for
Python functions are extracted: in our purely functional
implementation, a decorator such as @requirements can
be used to pin down the needed packages. Because of
our serverless setup (Section 4.5), the OS, container, and
environment layers are handled by the system, leaving
packages as the only degree of freedom left to control to
ensure full reproducibility.</p>
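        <p>A minimal sketch of the first parsing step, extracting parent tables from implicit references and ordering the nodes, is given below. It assumes one SQL query per artifact, keyed by artifact name, and uses a regular expression as a deliberately simplified stand-in for a real SQL parser; the query texts and the helper names parents and execution_order are illustrative assumptions.</p>
        <preformat>
```python
import re
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical project: one SQL query per artifact, keyed by artifact name.
QUERIES: Dict[str, str] = {
    "pickups": "SELECT pickup_location_id, SUM(count) AS total FROM trips GROUP BY 1",
    "trips": "SELECT pickup_location_id, passenger_count AS count FROM taxi_table",
}

# Simplification: a real implementation would rely on a SQL parser.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)


def parents(sql: str) -> Set[str]:
    """Tables referenced by a query, i.e. the implicit parents of the node."""
    return set(TABLE_REF.findall(sql))


def execution_order(queries: Dict[str, str]) -> List[str]:
    """Order artifacts so that every parent is materialized before its children."""
    graph = {name: parents(sql) for name, sql in queries.items()}
    order = TopologicalSorter(graph).static_order()
    # Source tables such as taxi_table already exist, so only artifacts are kept.
    return [name for name in order if name in queries]


ORDER = execution_order(QUERIES)
print(ORDER)
```
        </preformat>
        <p>Note that the user never declares the edge between trips and pickups: it is recovered from the FROM clause, which is the sense in which no imperative DAG construction is needed.</p>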
        <p>Finally, in our example Python is used only to run an
expectation. There is no reason why Python could not
also be used to declare new tables starting from existing
ones. In essence, transformations are functional mappers
from sets of tuples (rows in the “parent table”) to sets of
tuples (rows in the “child table”): as long as two languages
can speak a common dialect over those tuples, they can
operate together.</p>
        <p><bold>4.4.2. The execution plan</bold></p>
        <p>The output of the parsing step is a logical plan (Fig. 3),
so that the system knows which artifacts depend on
existing Iceberg tables, which tests need to pass to
consider the pipeline healthy, and what needs to be written
back into the catalog as a result of running the DAG. The
first Bauplan version for executing such a plan was the
simplest possible idea, i.e. just mapping the plan to an
isomorphic execution, in which each node is executed
by one (serverless and stateless) function. However, this
naive implementation doesn’t optimize around an
important feature of data workloads: at Reasonable Scale (RS), computing artifacts
is pretty fast, and the bottleneck is often moving data
around. To make a concrete example, consider again
our sample pipeline: there, the Python expectation is a
Pandas function taking a DataFrame as input (the data
artifact we are testing), and returning a boolean. Instead
of running an Iceberg command first, then a SQL query and
then a Python function as three separate executions, we
pushed down WHERE filters to obtain a smaller in-memory
table, then ran the SQL logic and the Python
expectation in place. This optimization results in a 5x faster feedback
loop even with small datasets, and avoids unnecessary
spillover to object storage: notably, the user is not
required to know any of the underlying implementation
details.</p>
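        <p>The fused execution described above can be sketched as follows. To keep the example dependency-free, it swaps in the standard-library sqlite3 module for the query engine and plain tuples for the DataFrame, so it illustrates only the shape of the optimization (filter at scan time, then SQL and expectation in the same process), not the actual Bauplan implementation. The sample rows, column names and function names are assumptions.</p>
        <preformat>
```python
import sqlite3
from typing import List, Tuple

# Made-up rows standing in for a scan of the source table:
# (pickup_location_id, passenger_count, pickup_date)
TAXI_ROWS = [
    (1, 2, "2023-01-01"),
    (1, 1, "2023-01-02"),
    (4, 3, "2022-12-31"),
]


def scan_with_filter(rows: List[tuple], min_date: str) -> List[tuple]:
    """Apply the WHERE filter at scan time, so that less data is loaded."""
    return [r for r in rows if r[2] >= min_date]


def trips_expectation(rows: List[tuple]) -> bool:
    """Expectation (hypothetical): every trip has at least one passenger."""
    return all(count > 0 for _, count in rows)


def run_fused(min_date: str) -> Tuple[List[tuple], bool]:
    """Scan, SQL logic and expectation in one process, with no intermediate storage."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE taxi_table (pickup_location_id, passenger_count, pickup_date)"
    )
    con.executemany(
        "INSERT INTO taxi_table VALUES (?, ?, ?)", scan_with_filter(TAXI_ROWS, min_date)
    )
    trips = con.execute(
        "SELECT pickup_location_id, passenger_count AS count "
        "FROM taxi_table ORDER BY pickup_date"
    ).fetchall()
    con.close()
    return trips, trips_expectation(trips)


trips, healthy = run_fused("2023-01-01")
print(trips, healthy)
```
        </preformat>
        <p>The three logical steps still exist, but they share one in-memory table instead of handing data over through object storage between separate executions.</p>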
        <sec id="sec-3-3-1">
          <title>4.5. Serverless runtimes</title>
          <p>
            When the execution plan is finalized, the computation
needs to happen in a fast, reliable, scalable way.
Following the functional definitions of pipelines, a serverless
runtime is the natural choice in terms of abstraction:
the user specifies what needs to happen, the Bauplan
platform runs the code in an optimized environment
where OS, container, and runtime are under its control
[16]. In recent years, serverless has become an overloaded
term, used to vaguely denote a cluster of features not
necessarily related [17, 18] and not necessarily important
for (or even, at odds with) data pipelines: scale-to-zero,
price-per-second, “infinite” and instantaneous
concurrency, stateless execution model [19]. We identified a few
essential properties for our serverless platform:
• multi-language support with flexible
dependencies (Fig. 2): considering SQL code can be run in
a Python interpreter connected to object storage
(see duckdb below), the requirement can be
satisfied by a Python runtime allowing an arbitrary
combination of interpreter version and
dependencies<sup>12</sup>;
• runtime hardware allocation: the same
transformation logic should run with 10GB or 20GB of
memory depending on the underlying artifacts;
• data locality: given that data pipelines are first
and foremost about moving data, we need to
maintain function isolation at the runtime level but
allow for shared resources at the artifacts level
- moving data is slow and expensive, and object
storage should be treated as a last resort [
            <xref ref-type="bibr" rid="ref48">20</xref>
            ];
• pausing functions: since a fresh Spark context
takes a while to be created, it is typically re-used
in a stateful manner. However, since “freezing” a
container after initialization would make startup
time negligible, we could run stateless commands
over ephemeral containers.
          </p>
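          <p>As an illustration of the first two properties, the sketch below shows how a decorator could attach package pins and a memory request to a single function, so that a planner can size the container per node. The paper only mentions that a decorator such as @requirements exists: the signature used here, the memory_gb parameter, the runtime_spec attribute and the plan_container helper are all assumptions made for the example.</p>
          <preformat>
```python
from typing import Callable, Dict, List, Optional


def requirements(packages: Dict[str, str], memory_gb: Optional[int] = None) -> Callable:
    """Hypothetical decorator: record the environment a function needs."""

    def decorator(func: Callable) -> Callable:
        # The spec is stored on the function object; the body is left untouched.
        func.runtime_spec = {"packages": dict(packages), "memory_gb": memory_gb}
        return func

    return decorator


@requirements({"pandas": "2.0.3"}, memory_gb=10)
def trips_expectation(rows: List[dict]) -> bool:
    return all(r["count"] > 0 for r in rows)


def plan_container(func: Callable) -> str:
    """Hypothetical planner step: summarize what the runtime has to provision."""
    spec = getattr(func, "runtime_spec", {"packages": {}, "memory_gb": None})
    pins = ",".join(
        f"{name}=={version}" for name, version in sorted(spec["packages"].items())
    )
    return f"{func.__name__}: packages=[{pins}] memory_gb={spec['memory_gb']}"


summary = plan_container(trips_expectation)
print(summary)
```
          </preformat>
          <p>Since the declaration lives next to the function, two nodes of the same DAG can request different package versions or memory sizes without any shared cluster configuration.</p>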
          <p>We evaluated AWS Lambda<sup>13</sup>, OpenWhisk<sup>14</sup> and
OpenLambda<sup>15</sup> as off-the-shelf frameworks, but none
of them fully satisfied the desiderata above: as typical
use cases for serverless are micro-services and glue
code in cloud infrastructure, it is not surprising that
existing tools would be sub-optimal for our scenarios.
Steps in data DAGs have almost opposite requirements
when compared to typical functions-as-a-service: startup
time is somewhat important, but since the bottleneck is
data reading and processing, we play in the 200-1000
ms regime, not 0-200 ms; on the other hand, resources
required to compute aggregations require more
fine-grained tuning. For these reasons, we invested, as a
differentiating feature, in building an orchestration and
memory management layer to support workloads in which
horizontal scalability is less important than vertical
elasticity and efficient data processing.</p>
          <p>
            To support SQL, we leverage duckdb [
            <xref ref-type="bibr" rid="ref34">21</xref>
            ] as our
query engine, given its performance, flexibility and full
compatibility with our formats<sup>16</sup>; to support Python, we
built custom containerized runtimes and a container
manager: furthermore, we were able to exploit the power-law
in package utilization [22] to limit overall download times
with an efficient local, disk-based cache.<sup>17</sup> Our solution
allows for fast startup time (300ms), complete runtime
isolation at the function level, and customizable sharing
policies within the functions in a single DAG execution:
as our target deployment model is initially “Bring Your
Own Cloud”, the usual security concerns of multi-tenant
virtualization do not apply [23].
          </p>
          <p>Finally, we wish to stress that containerization is an
active area of research, with exciting possibilities offered
by new frameworks such as WASM [24]: through an
ongoing collaboration with the research group behind
SOCK [22], we are actively iterating on this component.</p>
          <p><sup>12</sup>Note how the function-first approach provides a level of control
(i.e. specifying packages per function) that is impossible in
conventional Spark applications.</p>
          <p><sup>13</sup>https://aws.amazon.com/lambda/</p>
          <p><sup>14</sup>https://openwhisk.apache.org/</p>
          <p><sup>15</sup>https://github.com/open-lambda/open-lambda</p>
          <p><sup>16</sup>An example of running serverless queries has been open-sourced
at https://github.com/BauplanLabs/quack-reduce.</p>
          <p><sup>17</sup>We plan to release a Lambda-based generic runtime for Python
functions that leverages object storage for caching.</p>
          <p><bold>4.6. Interacting with the platform</bold></p>
          <p>Similar to other popular data tools, interactions between
Bauplan users and the platform happen through the CLI,
as pipelines get written in the IDE of choice. With the
intention of satisfying first the semantics implied by the
scenarios in Table 1, the CLI experience is centered around
two main commands, query and run:
• bauplan query -q "SELECT * FROM trips":
synchronous, point-wise interactions with
pre-built artifacts are handled through query. As
discussed, time-travel is a first-class abstraction,
so the same command takes an additional
argument to specify the intended branch (if not
current): -b feat_1.
• bauplan run: asynchronous, DAG-long
interactions are handled through run; starting from
the pipeline code in the IDE, issuing run starts
the intelligence and execution processes depicted
in Fig. 3. As DAGs are modular and
snapshotted at each execution, additional arguments
allow replaying an arbitrary DAG for debugging
and inspection: for example, -run-id 12 -m
pickups+ will re-execute in a sandboxed way the
same code over the same data as the run with
id = 12, starting from the pickups artifacts and
running all its children.</p>
          <p>With the goal of truly lowering the bar for data work,
the CLI-first approach is easy to learn and easy to
extend: in fact, the semantics of run mirrors tools that are
popular in our user base (dbt and Metaflow). Moreover,
CLI commands are easy for machines to execute as well:
since querying and visualizing data in the terminal is not
ideal with large datasets, it is trivial to wrap commands
in an application layer users are comfortable with, e.g. a
dashboard or a Python notebook.</p>
          <p><bold>5. Conclusion and Future work</bold></p>
          <p>
            We started our journey designing Bauplan by
considering, and dismissing, two ways to build towards the
DLH vision: re-purposing existing Big Data tools, or
building a new platform from scratch. Mirroring the
Firebolt experience [
            <xref ref-type="bibr" rid="ref31 ref50 ref7">1</xref>
            ], we found that re-using existing
open source components as initial “Lego bricks” can be a
powerful third way to get closer to the goal, without
necessarily breaking the bank. While the “lean startup”
playbook [25] of rapid market-driven pivots is not
readily applicable to data platforms, re-using components
allowed the team to converge quicker to a working
end-to-end system, test its strengths and weaknesses with
early adopters, and place more informed bets on which
features are responsible for the greater marginal value.
There are obviously many other interesting areas that
remain to be addressed, e.g. securing data through
seamless, yet secure authentication, parallelizing SQL
execution, using logs and machine learning to further optimize
the experience behind the scenes. Moreover, truly
manifesting the DLH vision in the product is a long journey:
starting from open source tools was the right choice, but
as the platform progresses it is likely we will wander far
more into the unknowns to better meet market demands.
As Rome was indeed not linted, tested, built nor deployed
in a day, we look forward to sharing with the community
the next steps of our adventure in future publications.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>A. Sample data pipeline</title>
      <p>We report the full code for the running example of
this paper (Section 4.1), as schematically depicted in
Fig. 3. Please note that steps are transformed into a
DAG thanks to a simple naming convention: children
tables refer to parents (Step 3 below referring to Step 1
table), while Python testing functions comply with the
table_expectation syntax.</p>
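      <p>The naming convention for tests can be illustrated with a short sketch. It assumes that the table_expectation syntax means a function named after the table it audits, followed by the suffix _expectation; this reading, together with the expectations_by_table helper and the sample functions, is an assumption for illustration rather than the documented Bauplan behavior.</p>
      <preformat>
```python
from typing import Callable, Dict, List

SUFFIX = "_expectation"  # assumed convention: table name + "_expectation"


def expectations_by_table(functions: List[Callable]) -> Dict[str, Callable]:
    """Map each audited table to its test, using only the function name."""
    mapping: Dict[str, Callable] = {}
    for func in functions:
        name = func.__name__
        if name.endswith(SUFFIX):
            mapping[name[: -len(SUFFIX)]] = func
    return mapping


def trips_expectation(rows: List[dict]) -> bool:
    """Hypothetical check: the trips table is not empty."""
    return len(rows) > 0


def helper(rows: List[dict]) -> List[dict]:
    """Not a test: ignored because the name lacks the suffix."""
    return rows


mapping = expectations_by_table([trips_expectation, helper])
print(sorted(mapping))
```
      </preformat>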
      <p>Step 1 (trips): read raw data (as stored under an
Iceberg table taxi_table) for a target time window, and
extract important columns into a new trips table.</p>
      <preformat>SELECT
    pickup_location_id,
    passenger_count as count,
    dropoff_location_id
FROM</preformat>
    </sec>
  </body>
  <back>
    <ack>
      <p>We are immensely grateful to the open source and data community, and we plan to continue our contributions to open source and open science in this new venture as well. In particular, we wish to thank the PyIceberg, Open Lambda and Nessie teams, with whom we have been collaborating in the past few months while starting Bauplan. Finally, we wish to thank Tyler Caraza-Harter and Ryan Vilim for precious feedback on a previous version of this work.</p>
    </ack>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[...] 1089/big.2013.0037. doi:10.1089/big.2013.0037. arXiv:https://doi.org/10.1089/big.2013.0037, PMID: 27447254.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[10] J. Tagliabue, C. Greco, J.-F. Roy, F. Bianchi, G. Cassani, B. Yu, P. J. Chia, SIGIR 2021 e-commerce workshop data challenge, in: SIGIR eCom 2021, 2021.</mixed-citation>
        <mixed-citation>[11] J. Tagliabue, F. Bianchi, T. Schnabel, G. Attanasio, C. Greco, G. d. S. P. Moreira, P. J. Chia, EvalRS: a rounded evaluation of recommender systems, 2022. URL: https://arxiv.org/abs/2207.05772. doi:10.48550/ARXIV.2207.05772.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tagliabue</surname>
          </string-name>
          ,
          <article-title>You do not need a bigger boat: Recommendations at reasonable scale in a (mostly)</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          serverless and open stack,
          <source>RecSys '21</source>
          , Association
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>???? URL: https://doi.org/10.1145/3460231.3474604.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          doi:10.1145/3460231.3474604. [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pasumansky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wagner</surname>
          </string-name>
          , Assembling a query
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>(Eds.)</source>
          , 1st International Workshop on Composable
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          Data Management Systems,
          <source>CDMS@VLDB</source>
          <year>2022</year>
          , Sydney, Australia, September 9,
          <year>2022</year>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>F.</given-names>
            <surname>McSherry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Isard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Murray</surname>
          </string-name>
          ,
          <article-title>Scalability! but at what cost?</article-title>
          ,
          <source>in: USENIX Workshop on Hot Topics in Operating Systems</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          URL: https://cdmsworkshop.github.io/2022/Proceedings/
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          ShortPapers/Paper1_MoshaPasumansky.pdf.
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Alstott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. T.</given-names>
            <surname>Bullmore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Plenz</surname>
          </string-name>
          ,
          <article-title>powerlaw: A python package for analysis of heavy-tailed distributions</article-title>
          ,
          <source>PLoS ONE 9</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghodsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Armbrust</surname>
          </string-name>
          ,
          <article-title>Lakehouse: A new generation of open platforms</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>that unify data warehousing and advanced analyt-</article-title>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tagliabue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bowne-Anderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tuulos</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          Research,
          <year>2021</year>
          .
          <article-title>chine learning with open-source metaflow</article-title>
          ,
          <source>ArXiv abs/2303.11761</source>
          (
          <year>2023</year>
          ).
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <source>Data Mesh</source>
          , O'Reilly Media, Inc.,
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          2022. URL: https://www.oreilly.com/library/view/data-mesh/9781492092384/.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Hellerstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Parameswaran</surname>
          </string-name>
          ,
          <article-title>Operationalizing machine learning: An interview study</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2209.09125. doi:10.48550/ARXIV.2209.09125.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hendrickson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sturdevant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Harter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Venkataramani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Arpaci-Dusseau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. H.</given-names>
            <surname>Arpaci-Dusseau</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <article-title>Serverless computation with openlambda</article-title>
          ,
          <source>in: Proceedings of the 8th USENIX Conference on Hot Topics in Cloud Computing</source>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          HotCloud'16, USENIX Association, USA,
          <year>2016</year>
          , p.
          <fpage>33</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chambers</surname>
          </string-name>
          , Spark: the definitive [17]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Guo</surname>
          </string-name>
          , J. Cheng,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guo</surname>
          </string-name>
          , The
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          Media, Inc.,
          <year>2018</year>
          . URL: https://www.oreilly.com/library/view/spark-the-definitive/9781491912201/.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <article-title>for design architecture</article-title>
          ,
          <source>ACM Comput. Surv. 54</source>
          (
          <year>2022</year>
          ). URL: https://doi.org/10.1145/3508360. doi:10.1145/3508360.
          [6]
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Damji</surname>
          </string-name>
          <string-name>
            <surname>Jules</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Brooke</given-names>
            <surname>Wenig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          , Learning
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <source>Spark: Lightning-Fast Data Analytics</source>
          , O'Reilly Media, Inc.,
          <year>2020</year>
          .
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Understanding the challenges and as-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schleier-Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sreekanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Yadwadkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Popa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Stoica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Patterson</surname>
          </string-name>
          , What server-
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          tions,
          <source>2021 IEEE/ACM 43rd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion)</source>
          (
          <year>2021</year>
          )
          <fpage>132</fpage>
          -
          <lpage>134</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          phase of cloud computing,
          <source>Commun. ACM 64</source>
          (
          <year>2021</year>
          )
          <fpage>76</fpage>
          -
          <lpage>84</lpage>
          . URL: https://doi.org/10.1145/3406011. doi:10.1145/3406011.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          , A survey on spark [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jangda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pinckney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Brun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guha</surname>
          </string-name>
          , Formal
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <source>Program. Lang.</source>
          <volume>3</volume>
          (
          <year>2019</year>
          ). URL: https://doi.org/10.1145/3360575. doi:10.1145/3360575.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <article-title>chine learning, and applications</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>34</volume>
          (
          <year>2022</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <fpage>71</fpage>
          -
          <lpage>91</lpage>
          . doi:10.1109/TKDE.2020.2975652.
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mahgoub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Klimovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaterji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bagchi</surname>
          </string-name>
          ,
          <article-title>SONIC: Application-aware</article-title>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Junqué de Fortuny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Martens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Provost</surname>
          </string-name>
          , Predic-
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <source>Big Data</source>
          <volume>1</volume>
          (
          <year>2013</year>
          )
          <fpage>215</fpage>
          -
          <lpage>226</lpage>
          . URL: https://doi.org/10.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <source>in: 2021 USENIX Annual Technical Conference (USENIX ATC 21)</source>
          , USENIX Association,
          <year>2021</year>
          , pp.
          <fpage>285</fpage>
          -
          <lpage>301</lpage>
          . URL: https://www.usenix.org/conference/atc21/presentation/mahgoub.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          WHERE pickup_at &gt;= '2019-04-01'
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Raasveldt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mühleisen</surname>
          </string-name>
          ,
          <article-title>Duckdb: An embeddable analytical database</article-title>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <source>in: Proceedings of the 2019 International Conference on Management of Data</source>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          SIGMOD '19, Association for Computing Machinery, New York, NY, USA,
          <year>2019</year>
          , p.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <fpage>1981</fpage>
          -
          <lpage>1984</lpage>
          . URL: https://doi.org/10.1145/3299869.3320212. doi:10.1145/3299869.3320212.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          Step 2 (trips_expectation): we take Step 1 output - a table named trips -, convert it to a DataFrame and run a statistical check using Python.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          Similar to declarative data science frameworks such as Metaflow [15], Python decorators are used to express directly in code constraints on the target runtime.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          @requirements({'pandas': '2.0.0'})
          [22]
          <string-name>
            <given-names>E.</given-names>
            <surname>Oakes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Houck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Harter</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          containers,
          <source>in: Proceedings of the 2018 USENIX</source>
          return m &gt; 10
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          ence,
          <source>USENIX ATC '18</source>
          , USENIX Association, USA,
          <year>2018</year>
          , p.
          <fpage>57</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          Step 3 (pickups): we take Step 1 output - a table named trips -, and produce a new table pickups by aggregating and sorting trip data.
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Agache</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brooker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Iordache</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liguori</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          tions,
          <source>in: 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20)</source>
          , USENIX Association, Santa Clara, CA,
          <year>2020</year>
          , pp.
          <fpage>419</fpage>
          -
          <lpage>434</lpage>
          . URL: https://www.usenix.org/conference/nsdi20/presentation/agache.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rossberg</surname>
          </string-name>
          ,
          <article-title>WebAssembly Core Specification</article-title>
          ,
          <source>W3C</source>
          (
          <year>2019</year>
          ). URL: https://www.w3.org/TR/wasm-core-1/.
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          dropoff_location_id, COUNT(*) AS counts
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          FROM trips
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          GROUP BY pickup_location_id, dropoff_location_id
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          ORDER BY counts DESC
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>E.</given-names>
            <surname>Ries</surname>
          </string-name>
          ,
          <article-title>The lean startup: how constant</article-title>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          York,
          <year>2011</year>
          . URL: http://www.amazon.de/
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          dp/0670921602/ref=sr_1_2?ie=UTF8&amp;qid=
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          1396199893&amp;sr=8-2&amp;keywords=eric+ries.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>