
September

Jacopo Tagliabue

jacopo.tagliabue@nyu.edu 0 1 2

Ciro Greco

ciro.greco@bauplanlabs.com 0 1

Luca Bigon

luca.bigon@bauplanlabs.com 0 1

Vancouver, Canada

0 Bauplan, New York City, United States

1 In this paper we describe how we designed Bauplan

2 Tandon School of Engineering, NYU, New York City, United States

2023


The recently proposed Data Lakehouse architecture is built on open file formats, performance, and first-class support for data transformation, BI and data science: while the vision stresses the importance of lowering the barrier for data work, existing implementations often struggle to live up to user expectations. At [...] fulfill the Lakehouse vision. Since building from scratch is a challenge unfit for a startup, we started by re-using (sometimes unconventionally) existing projects, and then invested in improving the areas that would give us the highest marginal gains for the developer experience. In this work, we review user experience, high-level architecture and tooling decisions, and conclude by sharing plans for future development.

data lakehouse, data pipelines, serverless, reasonable scale, containerized execution

1. Introduction

[2] argues that the popular data warehouse architecture will soon be replaced by a new architectural pattern, the Data Lakehouse (DLH). A DLH is built on open file formats (e.g. Parquet), exceptional performance, and first-class support for engineering (data transformation), analytics (BI) and inferential (data science) use cases. The vision of such an architecture is first and foremost about flexibility, making it possible for organizations to choose different ways to operationalize data depending on data constraints.

This is particularly valuable for large organizations where data democratization is crucial to achieve agility [3]: enabling easier access to and understanding of data is the prerequisite for organizations to best leverage their data. The heterogeneity of use cases is reflected in the complexity of the underlying infrastructure (Fig. 2), with some pieces coming from databases (query engines, tables, data catalogs etc.), some from distributed systems [...] [1]. While our goals, timeline and methodology are different, our work shares the underlying philosophy. [...] beyond traditional Big Data frameworks.

2. A Practitioner Perspective

Aside from security and compliance, the biggest argument in favor of the DLH is flexibility: different teams can use different tools to process data for different use cases. Practically, this implies that any DLH needs to support two different use cases:

• Query and Wrangle (QW), referring to the scenario where users need to explore data and ask specific questions (e.g. counting how many marketing emails were opened in the previous month). Querying predominantly involves SQL, while Wrangling is often performed in Python.

• Transform and Deploy (TD), referring to the scenario where users need to construct code-driven, reproducible data pipelines (DAGs) that generate new artifacts for downstream utilization. For instance, building a dashboard exposing the performance of marketing emails across different user demographics. Due to the distinct strengths and weaknesses of SQL and Python, the combination of both is often optimal.

Importantly, depending on the phase in which developers find themselves in the development cycle, their way of interacting with the data can be either Synchronous or Asynchronous. While QW is de facto always synchronous, TD tends to be more nuanced and needs to support both.

• Synchronous is when a user issues a command (a SQL query, or a DAG run) and waits for the results to come back. In this scenario, simplicity and a fast feedback loop are the key goals [4];

• Asynchronous is when a command is issued (often by another system, such as an orchestrator) and the user is involved in monitoring the outcome at a later time. In this scenario, reliability, resilience and infrastructure ergonomics are the key goals.

The interplay between use cases and modalities is summarized in Table 1. A DLH needs to provide a coherent developer experience across the different phases of the development cycle (Dev vs. Prod) while supporting the possibility of interacting with data (QW vs. TD) in both synchronous and asynchronous ways.

To achieve this, we designed Bauplan with the following general design principles in mind:

• Serverless experience: to fully leverage the separation of storage and compute, developers should deal with as little infrastructure as possible. We propose to decouple data logic from execution to enable a “serverless” experience based on a declarative approach; furthermore, since data pipelines are functional in nature (output of parent nodes is input for children), a function-as-a-service deployment is prima facie a natural fit.¹

• Software development patterns: it is often the case that the only developers who can bring data applications to production are those who possess a special data engineering skill set. Empowering developers with more general coding skills to do impactful work on data is a fundamental piece of the DLH vision. Systems should allow users to use only familiar tools like SQL, standard Python, the CLI and Git.

• Reproducibility and versioning: because the primary factor for building data products involves reproducible and versioned code pipelines, we embrace the idea that code provides a (mostly declarative) way to build data. Likewise, data is treated as code, adopting a life cycle including branching, committing, and merging.

• Full Auditability: cloud clusters with long startup time and complex configurations encourage developers to resort to local development to expedite the feedback loop. However, this pattern exacerbates the challenges of software development (e.g. dependency management) while introducing potential security issues. We advocate for a cloud-first approach, ensuring that all work and access are centralized, auditable, and aligned with security and governance policies.

3. Departing from Spark

Before delving into the specifics of our design for Bauplan, we wish to explain the rationale behind departing from Spark, which is widely regarded as the industry standard for analytics at scale and holds a significant position in numerous DLH implementations.

Given our discussion about the ideal DLH developer experience, we believe Spark falls short for several structural reasons. For instance, slow startup and execution make Spark sub-optimal for synchronous operations, such as QW. At the same time, the system has a notoriously steep learning curve [5, 6], both from an API and a debugging perspective: when thinking about TD, it is often hard to reason about it [7, 8]. If the DLH vision is truly about enabling a broader set of practitioners to perform data transformations, these systems are not necessarily the best design choice.

1 Note that we purposely use the term with some flexibility (Section 4.5).

[...]

• the data lake: while not obvious from the code itself, there is an object storage layer containing the raw data we are starting from: from the developer perspective, users would only interact with logical constructs, such as taxi_table; from an implementation standpoint, handling persistent objects transparently is a huge component of the DLH (Sections 4.2 and 4.3 below).

• declarative data assets: we subscribe to the one-query, one-artifact pattern popularized by dbt-style transformations⁵: users define artifacts one by one as SQL queries, and the platform builds up the DAG based on parsing and naming convention (Section 4.4). Importantly, no imperative-style DAG construction is needed: insofar as users implicitly link together parent and children nodes through their code, functions “are all you need”⁶;

• data expectations: it is best practice to test the tables produced by a DAG for statistical anomalies. This provides the foundation of the transform-audit-write pattern for data development (Section 4.3): just as in software we can debug, test and even run different versions of an application in parallel against production, automated testing and versioning become the foundation of the same approach for data pipelines.⁷

2 To fully anonymize the dataset, we used the powerlaw package [14] for distribution fitting: final data are then generated by sampling from the distribution.

3 https://ourworldindata.org/grapher/historical-cost-of-computer-memory-and-storage?time=2010..latest&facet=metric

4 https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

4.2. Table Format

While a data lake is ultimately made of files, we wish to provide table-like abstractions to our users: by decoupling the actual storage of the data (the file s3://mybucket/taxifile.parquet) from their logical function (the list of taxi trips in NYC), we can reuse the same code across data versions: every command that points to taxi_table can be executed over different versions of the table with just a configuration change (Section 4.3).

After considering Delta Lake, Hudi⁸, and Iceberg⁹ as possible formats to give table-like semantics to the object storage, we chose Iceberg mainly for three reasons: larger community support, full support for time-travel and versioning semantics, and limited but increasing compatibility with Python¹⁰.

At the time of writing, major formats have full read / write support only for JVM engines (e.g. Spark, Presto, Dremio). Considering our working hypothesis about the Reasonable Scale and Bauplan's focus on a serverless experience, two major tasks had to be completed to overcome these constraints. First, when running a query over an Iceberg table, our code intelligence module needs to first parse SQL into a table scan to obtain a dataframe-like object (Section 4.4.2); second, when materializing a data asset from the DAG back to the data catalog, a Spark session is created to handle the Iceberg INSERT: following our no infrastructure principle, we created custom containers (Section 4.5) optimized for starting a Spark command with 300 milliseconds latency – as a result, the materialization step looks no slower than running any other Python function (as opposed to waiting for a Spark cluster to launch).

4.3. Data catalog and versioning

Software development best practices and tooling allow developers to work on code (new feature, bug fixing, debugging etc.) in a consistent and sandboxed way: production code can be cloned, run and modified by developers, but running development code won’t leak into a production environment. Bauplan provides the same best practices for data pipelines, enforcing a transform-audit-write pattern for all transformations. In particular, we picked Nessie¹¹ to provide git-like semantics: Nessie versions an entire catalog at a time, so it is ideal for transformation use cases when multiple artifacts are affected at each run. Fig. 4 depicts the basic versioning mechanism in the platform:

1. the user checks out a new Git branch in their project (feat_1), to develop a new pipeline;

2. in the context of a bauplan run command, Bauplan detects the Git context and creates a Nessie branch with the same name, feat_1, starting from the current production data in the lake main branch (grey node); now both the code (through Git) and the data artifacts (through Nessie) are production-like and sandboxed;

3. Bauplan executes the DAG into an ephemeral branch (run_12): by executing each run “atomically” we can avoid persisting dirty DAGs – only when all steps and tests are executed successfully are we allowed to merge the data into the current branch, making the artifacts 1 and 3 visible to any user with branch access (the obvious analogy here is the concept of transaction in databases);

4. when the merge on feat_1 is committed, the ephemeral branch run_12 is deleted.

5 https://github.com/dbt-labs/dbt-core

6 See also the Appendix for the full code example.

7 Following the software analogy further, expectations are akin to integration tests, where a new component is embedded in an existing system, and checks are made to ensure the desired output is achieved. A related but different concept is unit tests, which instead work on manually fabricated input-output pairs, to test edge cases or important scenarios irrespective of the system actually seeing this input. Given our abstractions, Bauplan can easily accommodate both types of tests, especially considering that Python primitives for creating unit tests for tables are better than SQL.

8 https://hudi.apache.org/

9 https://github.com/apache/iceberg

10 https://py.iceberg.apache.org/

11 https://projectnessie.org/
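The four-step versioning flow described above can be illustrated with a toy, in-memory catalog. This is only a sketch under stated assumptions: `ToyCatalog`, `run_dag` and their signatures are hypothetical names introduced here, and a plain dictionary stands in for Nessie and the object store. It is not the Bauplan implementation, but it shows the transaction-like behaviour: a run writes to an ephemeral branch, which is merged only if every expectation passes and is deleted in all cases.

```python
import copy


class ToyCatalog:
    """Hypothetical in-memory stand-in for a branch-aware data catalog."""

    def __init__(self, main_tables):
        self.branches = {"main": dict(main_tables)}

    def create_branch(self, name, source):
        # A new branch starts as a snapshot of its source branch.
        self.branches[name] = copy.deepcopy(self.branches[source])

    def merge(self, source, target):
        # Make every table present on `source` visible on `target`.
        self.branches[target].update(self.branches[source])

    def delete_branch(self, name):
        del self.branches[name]


def run_dag(catalog, git_branch, run_id, steps, expectations):
    """Execute `steps` on an ephemeral branch; merge only if all checks pass."""
    if git_branch not in catalog.branches:
        # Step 2: mirror the Git branch with a data branch cut from main.
        catalog.create_branch(git_branch, source="main")
    run_branch = f"run_{run_id}"
    # Step 3: every run gets its own ephemeral branch.
    catalog.create_branch(run_branch, source=git_branch)
    try:
        tables = catalog.branches[run_branch]
        for name, build in steps:
            tables[name] = build(tables)
        if not all(check(tables) for check in expectations):
            return False  # nothing is merged, so no dirty artifacts persist
        catalog.merge(run_branch, git_branch)
        return True
    finally:
        # Step 4: the ephemeral branch never outlives the run.
        catalog.delete_branch(run_branch)


# Usage: one successful run and one failing run on the same feature branch.
catalog = ToyCatalog({"trips": [3, 12, 25]})
ok = run_dag(
    catalog, "feat_1", 12,
    steps=[("pickups", lambda t: sorted(t["trips"], reverse=True))],
    expectations=[lambda t: len(t["pickups"]) > 0],
)
failed = run_dag(
    catalog, "feat_1", 13,
    steps=[("bad", lambda t: [])],
    expectations=[lambda t: len(t["bad"]) > 0],
)
```

After the first run, `pickups` is visible on `feat_1` but not on `main`; after the second, the failed artifact `bad` is not visible anywhere.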

Once again, we remark that we chose to base the developer experience only on Git and the CLI. While we expect users to be familiar with Git, all the data versioning is handled behind the scenes transparently. The user is not expected to master Nessie or any of the technologies involved. Instead, they are provided with a sandbox environment for data development with familiar software-like semantics.

4.4. Code intelligence

There is a natural tension in modularity between code and compute: modular code is easier to test, re-use and reason about; on the other side, monolithic compute is easier to spin up, manage and orchestrate. The no infrastructure principle provides guidance on how to navigate the trade-offs: on the code side, we subscribe to full modularity (e.g. dbt-style transformations), so that each node in the DAG corresponds to one file that is runnable and testable in isolation; on the compute side, we let the system opt for modularity or monolith depending on the circumstances. In other words, the user is exposed directly only to the top layer in Fig. 3: it is the job of the code intelligence module (Fig. 2) to take as input the queries and functions defining a pipeline, together with parameters from the CLI, and produce as output first a logical plan of operations, and finally a physical plan to run the desired transformations.

4.4.1. From code to the logical plan

After the pipeline code is ingested (e.g. Section A), the full project is snapshotted in an object storage and fingerprinted in a Postgres database, not dissimilarly from what happens for runs in Metaflow [15]: by assigning an id and immutable artifacts to each run, we guarantee reproducibility for auditing and debugging purposes – following the code is data principle, the same code on the same data version will produce identical results. After versioning, SQL and Python files are parsed: first, logical dependencies are extracted from implicit references – in our example, pickups is built out of another table (SELECT .. FROM trips), so we need to materialize nodes in the right order; second, environment details for Python functions are extracted – in our purely functional implementation, a decorator such as @requirements can be used to pin down the needed packages: because of our serverless setup (Section 4.5), the OS, container, and environment layers are handled by the system, leaving packages as the only degree of freedom left to control to ensure full reproducibility.
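As an illustration of how implicit references can yield a materialization order, the sketch below extracts parent tables from SQL text and orders the nodes topologically. It rests on loud assumptions: the regular expression is a deliberately naive stand-in for a real SQL parser (it would miss CTEs, subqueries and quoted identifiers), the `QUERIES` and `EXPECTATIONS` structures and the function names are hypothetical, and only the `<table>_expectation` naming convention comes from the text.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical project: one SQL query per artifact, keyed by artifact name.
QUERIES = {
    "pickups": (
        "SELECT pickup_location_id, COUNT(*) AS counts "
        "FROM trips GROUP BY pickup_location_id"
    ),
    "trips": (
        "SELECT pickup_location_id FROM taxi_table "
        "WHERE pickup_at >= '2019-04-01'"
    ),
}
# Python checks follow the <table>_expectation naming convention.
EXPECTATIONS = ["trips_expectation"]

# Naive pattern: the identifier right after FROM or JOIN is a parent table.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)
SUFFIX = "_expectation"


def parents_of(sql):
    """Return the set of table names referenced by a query."""
    return set(TABLE_REF.findall(sql))


def logical_plan(queries, expectations):
    """Order nodes so that every parent is materialized before its children."""
    graph = {}
    for name, sql in queries.items():
        # Tables not defined in the project (e.g. taxi_table) already live
        # in the lake, so they are sources rather than nodes to build.
        graph[name] = {p for p in parents_of(sql) if p in queries}
    for name in expectations:
        # An expectation depends on the table it is named after.
        graph[name] = {name[: -len(SUFFIX)]}
    return list(TopologicalSorter(graph).static_order())


plan = logical_plan(QUERIES, EXPECTATIONS)
```

Here `trips` is always scheduled first, while the relative order of `pickups` and `trips_expectation` is left to the sorter, since neither depends on the other.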

Finally, in our example Python is used only to run an expectation. There is no reason why Python could not also be used to declare new tables starting from existing ones. In essence, transformations are functional mappers from sets of tuples (rows in the “parent table”) to sets of tuples (rows in the “child table”): as long as two languages can speak a common dialect over those tuples, they can operate together.

4.4.2. The execution plan

The output of the parsing step is a logical plan (Fig. 3), so that the system knows which artifacts depend on existing Iceberg tables, which tests need to pass to consider the pipeline healthy, and what needs to be written back into the catalog as a result of running the DAG. The first Bauplan version for executing such a plan was the simplest possible idea, i.e. just mapping the plan to an isomorphic execution, in which each node is executed by one (serverless and stateless) function. However, this naive implementation doesn’t optimize around an important feature of data workloads: at Reasonable Scale (RS), computing artifacts is pretty fast, and the bottleneck is often moving data around. To make a concrete example, consider again our sample pipeline: there, the Python expectation is a Pandas function taking a DataFrame as input (the data artifact we are testing), and returning a boolean. Instead of running an Iceberg command first, then a SQL query and then a Python function as three separate executions, we pushed down WHERE filters to obtain a smaller in-memory table, then ran the SQL logic and the Python expectation in place. This optimization results in a 5x faster feedback loop even with small datasets, and avoids unnecessary spillover to object storage: notably, the user is not required to know any of the underlying implementation details.
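The fused execution described above can be sketched in a few lines. This is an assumption-laden toy rather than the platform's code: the standard-library `sqlite3` module stands in for the actual query engine, a Python list stands in for the Iceberg table scan, the expectation works on plain tuples instead of a Pandas DataFrame, and `scan`, `run_fused` and `mean_passengers_ok` are hypothetical names. The point is only that the filter is applied while loading, and that SQL and the Python check then share one in-memory table in one process.

```python
import sqlite3

# Hypothetical raw rows: (pickup_location_id, passenger_count, pickup_at).
RAW_ROWS = [
    (1, 2, "2019-03-30"),
    (1, 1, "2019-04-02"),
    (2, 4, "2019-04-05"),
    (1, 3, "2019-04-07"),
]

PICKUPS_SQL = (
    "SELECT pickup_location_id, COUNT(*) AS counts FROM trips "
    "GROUP BY pickup_location_id ORDER BY counts DESC"
)


def scan(rows, since):
    """Table scan with the WHERE filter pushed down: only matching rows load."""
    return [row for row in rows if row[2] >= since]


def mean_passengers_ok(rows):
    """Toy expectation: the average passenger count must exceed 1."""
    counts = [row[0] for row in rows]
    return sum(counts) / len(counts) > 1


def run_fused(rows, since, sql, expectation):
    """Run scan, SQL and expectation in one process over one in-memory table."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute(
            "CREATE TABLE trips "
            "(pickup_location_id INTEGER, passenger_count INTEGER, pickup_at TEXT)"
        )
        con.executemany("INSERT INTO trips VALUES (?, ?, ?)", scan(rows, since))
        result = con.execute(sql).fetchall()
        # The expectation reads the same in-memory table: no data is moved.
        passed = expectation(con.execute("SELECT passenger_count FROM trips").fetchall())
        return result, passed
    finally:
        con.close()


pickups, healthy = run_fused(RAW_ROWS, "2019-04-01", PICKUPS_SQL, mean_passengers_ok)
```

In the naive isomorphic plan, each of the three stages would instead read its input from, and write its output to, shared storage.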

4.5. Serverless runtimes

When the execution plan is finalized, the computation needs to happen in a fast, reliable, scalable way. Following the functional definitions of pipelines, a serverless runtime is the natural choice in terms of abstraction: the user specifies what needs to happen, and the Bauplan platform runs the code in an optimized environment where OS, container, and runtime are under its control [16]. In recent years, serverless has become an overloaded term, used to vaguely denote a cluster of features not necessarily related [17, 18] and not necessarily important for (or even, at odds with) data pipelines: scale-to-zero, price-per-second, “infinite” and instantaneous concurrency, stateless execution model [19]. We identified a few essential properties for our serverless platform:

• multi-language support with flexible dependencies (Fig. 2): considering SQL code can be run in a Python interpreter connected to object storage (see duckdb below), the requirement can be satisfied by a Python runtime allowing an arbitrary combination of interpreter version and dependencies¹²;

• runtime hardware allocation: the same transformation logic should run with 10GB or 20GB of memory depending on the underlying artifacts;

• data locality: given that data pipelines are first and foremost about moving data, we need to maintain function isolation at the runtime level but allow for shared resources at the artifacts level – moving data is slow and expensive, and object storage should be treated as a last resort [20];

• pausing functions: since a fresh Spark context takes a while to be created, it is typically re-used in a stateful manner. However, since “freezing” a container after initialization would make startup time negligible, we could run stateless commands over ephemeral containers.
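The first two properties in the list above (per-function dependencies and run-time memory allocation) can be made concrete with a small sketch. Everything here is hypothetical and for illustration only: the text mentions a `@requirements` decorator, but its real signature is not given, so the `python` keyword, the `RuntimeSpec` class, the `with_memory_for` helper, the memory tiers and the 2x headroom rule are all assumptions.

```python
from dataclasses import dataclass, field, replace


@dataclass
class RuntimeSpec:
    """Hypothetical description of the environment one function needs."""
    python: str = "3.11"
    packages: dict = field(default_factory=dict)
    memory_gb: int = 10


def requirements(packages, python="3.11"):
    """Attach a per-function RuntimeSpec; the function body is left untouched."""
    def decorate(fn):
        fn.runtime_spec = RuntimeSpec(python=python, packages=dict(packages))
        return fn
    return decorate


def with_memory_for(spec, artifact_gb, tiers=(10, 20), headroom=2.0):
    """Pick the smallest memory tier that fits the input with some headroom."""
    needed = artifact_gb * headroom
    for tier in tiers:
        if needed <= tier:
            return replace(spec, memory_gb=tier)
    raise ValueError(f"no tier large enough for {artifact_gb} GB of input")


@requirements({"pandas": "2.0.0"})
def trips_expectation(values):
    # Toy body: the decorator only records metadata for the platform.
    return sum(values) / len(values) > 10


# The same logic is given 10GB or 20GB depending on the artifact it reads.
small = with_memory_for(trips_expectation.runtime_spec, artifact_gb=3)
large = with_memory_for(trips_expectation.runtime_spec, artifact_gb=8)
```

Keeping the specification on the function rather than on a cluster is what allows two nodes of the same DAG to pin different package versions.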

We evaluated AWS Lambda¹³, OpenWhisk¹⁴ and OpenLambda¹⁵ as off-the-shelf frameworks, but none of them fully satisfied the desiderata above: as typical use cases for serverless are micro-services and glue code in cloud infrastructure, it is not surprising that existing tools would be sub-optimal for our scenarios. Steps in data DAGs have almost opposite requirements when compared to typical functions-as-a-service: startup time is somewhat important, but since the bottleneck is data reading and processing, we play in the 200-1000 ms regime, not 0-200 ms; on the other hand, resources required to compute aggregations require more fine-grained tuning. For these reasons, we invested, as a differentiating feature, in building an orchestration and memory management layer to support workloads in which horizontal scalability is less important than vertical elasticity and efficient data processing.

To support SQL, we leverage duckdb [21] as our query engine, given its performance, flexibility and full compatibility with our formats¹⁶; to support Python, we built custom containerized runtimes and a container manager: furthermore, we were able to exploit the power-law in package utilization [22] to limit overall download times with an efficient local, disk-based cache.¹⁷ Our solution allows for fast startup time (300ms), complete runtime isolation at the function level, and customizable sharing policies within the functions in a single DAG execution: as our target deployment model is initially “Bring Your Own Cloud”, the usual security concerns of multi-tenant virtualization do not apply [23].

Finally, we wish to stress that containerization is an active area of research, with exciting possibilities offered by new frameworks such as WASM [24]: through an ongoing collaboration with the research group behind SOCK [22], we are actively iterating on this component.

12 Note how the function-first approach provides a level of control – i.e. specifying packages per function – that is impossible in conventional Spark applications.

13 https://aws.amazon.com/lambda/

14 https://openwhisk.apache.org/

15 https://github.com/open-lambda/open-lambda

16 An example of running serverless queries has been open-sourced at https://github.com/BauplanLabs/quack-reduce.

17 We plan to release a Lambda-based generic runtime for Python functions that leverages object storage for caching.

4.6. Interacting with the platform

Similar to other popular data tools, interactions between Bauplan users and the platform happen through the CLI, as pipelines get written in the IDE of choice. With the intention of satisfying first the semantics implied by the scenarios in Table 1, the CLI experience is centered around two main commands, query and run:

• bauplan query -q "SELECT * FROM trips": synchronous, point-wise interactions with pre-built artifacts are handled through query. As discussed, time-travel is a first-class abstraction, so the same command takes an additional argument to specify the intended branch (if not current): -b feat_1.

• bauplan run: asynchronous, DAG-long interactions are handled through run; starting from the pipeline code in the IDE, issuing run starts the intelligence and execution processes depicted in Fig. 3. As DAGs are modular and snapshotted at each execution, additional arguments allow replaying an arbitrary DAG for debugging and inspection: for example, -run-id 12 -m pickups+ will re-execute in a sandboxed way the same code over the same data as the run with id = 12, starting from the pickups artifact and running all its children.

With the goal of truly lowering the bar for data work, the CLI-first approach is easy to learn and easy to extend: in fact, the semantics of run mirrors tools that are popular in our user base (dbt and Metaflow). Moreover, CLI commands are easy for machines to execute as well: since querying and visualizing data in the terminal is not ideal with large datasets, it is trivial to wrap commands in an application layer users are comfortable with, e.g. a dashboard or a Python notebook.

5. Conclusion and Future work

We started our journey designing Bauplan by considering – and dismissing – two ways to build towards the DLH vision: re-purposing existing Big Data tools, or building a new platform from scratch. Mirroring the Firebolt experience [1], we found that re-using existing open source components as initial “Lego bricks” can be a powerful third way of getting closer to the goal, without necessarily breaking the bank. While the “lean startup” playbook [25] of rapid market-driven pivots is not readily applicable to data platforms, re-using components allowed the team to converge quicker to a working end-to-end system, test its strengths and weaknesses with early adopters, and place more informed bets on which features are responsible for the greater marginal value.

There are obviously many other interesting areas that remain to be addressed, e.g. securing data through seamless, yet secure authentication, parallelizing SQL execution, using logs and machine learning to further optimize the experience behind the scenes. Moreover, truly manifesting the DLH vision in the product is a long journey: starting from open source tools was the right choice, but as the platform progresses it is likely we will wander far more into the unknowns to better meet market demands.

As Rome was indeed not linted, tested, built nor deployed in a day, we look forward to sharing with the community the next steps of our adventure in future publications.

A. Sample data pipeline

We report the full code for the running example of this paper (Section 4.1), as schematically depicted in Fig. 3. Please note that steps are transformed into a DAG thanks to a simple naming convention: children tables refer to parents (Step 3 below referring to Step 1 table), while Python testing functions comply with the table_expectation syntax.

Step 1 (trips): read raw data (as stored under an Iceberg table taxi_table) for a target time window, and extract important columns into a new trips table.

    SELECT
        pickup_location_id,
        passenger_count as count,
        dropoff_location_id
    FROM
        taxi_table
    WHERE
        pickup_at >= '2019-04-01'

Step 2 (trips_expectation): we take Step 1 output (a table named trips), convert it to a DataFrame and run a statistical check using Python. Similar to declarative data science frameworks such as Metaflow [15], Python decorators are used to express directly in code constraints on the target runtime.

    @requirements({'pandas': '2.0.0'})
    [...]
        return m > 10

Step 3 (pickups): we take Step 1 output (a table named trips), and produce a new table pickups by aggregating and sorting trip data.

    SELECT
        pickup_location_id,
        dropoff_location_id,
        COUNT(*) AS counts
    FROM
        trips
    GROUP BY
        pickup_location_id,
        dropoff_location_id
    ORDER BY
        counts DESC

Acknowledgments

We are immensely grateful to the open source and data community, and we plan to continue our contributions to open source and open science in this new venture as well. In particular, we wish to thank the PyIceberg, Open Lambda and Nessie teams, with whom we have been collaborating in the past few months while starting Bauplan. Finally, we wish to thank Tyler Caraza-Harter and Ryan Vilim for precious feedback on a previous version of this work.

References

[1] Pasumansky, Wagner, Assembling a query [...] (Eds.), 1st International Workshop on Composable Data Management Systems, CDMS@VLDB 2022, Sydney, Australia, September 9, 2022, 2022. URL: https://cdmsworkshop.github.io/2022/Proceedings/ShortPapers/Paper1_MoshaPasumansky.pdf.

[2] M. A. Zaharia, Ghodsi, Xin, M. Armbrust, Lakehouse: A new generation of open platforms that unify data warehousing and advanced analyt[...] Research, 2021.

[3] Dehghani, Data Mesh, O'Reilly Media, Inc., 2022. URL: https://www.oreilly.com/library/view/data-mesh/9781492092384/.

[4] S. Shankar, R. Garcia, J. M. Hellerstein, A. G. Parameswaran, Operationalizing machine learning: An interview study, 2022. URL: https://arxiv.org/abs/2209.09125. doi:10.48550/ARXIV.2209.09125.

[5] Zaharia, Chambers, Spark: the definitive [...] Media, Inc., 2018. URL: https://www.oreilly.com/library/view/spark-the-definitive/9781491912201/.

[6] T. D. Damji Jules, Brooke Wenig, Lee, Learning Spark: Lightning-Fast Data Analytics, O'Reilly Media, Inc., 2020.

[7] Z. Wang, Understanding the challenges and as[...]tions, 2021 IEEE/ACM 43rd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion) (2021) 132-134.

[8] Tang, He, Yu, Li, Li, A survey on spark [...] machine learning, and applications, IEEE Transactions on Knowledge and Data Engineering 34 (2022) 71-91. doi:10.1109/TKDE.2020.2975652.

[9] Junqué de Fortuny, Martens, Provost, Predic[...] Big Data 1 (2013) 215-226. URL: https://doi.org/10.1089/big.2013.0037. doi:10.1089/big.2013.0037. arXiv:https://doi.org/10.1089/big.2013.0037, pMID: 27447254.

[10] Tagliabue, Greco, J.-F. Roy, Bianchi, G. Cassani, B. Yu, P. J. Chia, Sigir 2021 e-commerce workshop data challenge, in: SIGIR eCom 2021, 2021.

[11] Tagliabue, Bianchi, Schnabel, G. Attanasio, C. Greco, G. d. S. P. Moreira, P. J. Chia, Evalrs: a rounded evaluation of recommender systems, 2022. URL: https://arxiv.org/abs/2207.05772. doi:10.48550/ARXIV.2207.05772.

[12] Tagliabue, You do not need a bigger boat: Recommendations at reasonable scale in a (mostly) serverless and open stack, RecSys '21, Association [...] URL: https://doi.org/10.1145/3460231.3474604. doi:10.1145/3460231.3474604.

[13] McSherry, Isard, D. G. Murray, Scalability! but at what cost?, in: USENIX Workshop on Hot Topics in Operating Systems, 2015.

[14] Alstott, E. T. Bullmore, D. Plenz, powerlaw: A python package for analysis of heavy-tailed distributions, PLoS ONE 9 (2013).

[15] Tagliabue, Bowne-Anderson, Tuulos, [...] machine learning with open-source metaflow, ArXiv abs/2303.11761 (2023).

[16] Hendrickson, Sturdevant, T. Harter, V. Venkataramani, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, Serverless computation with openlambda, in: Proceedings of the 8th USENIX Conference on Hot Topics in Cloud Computing, HotCloud'16, USENIX Association, USA, 2016, p. 33-39.

[17] Li, Guo, J. Cheng, Chen, He, Guo, The [...] for design architecture, ACM Comput. Surv. 54 (2022). URL: https://doi.org/10.1145/3508360. doi:10.1145/3508360.

[18] J. Schleier-Smith, V. Sreekanti, A. Khandelwal, J. Carreira, N. J. Yadwadkar, R. A. Popa, J. E. Gonzalez, I. Stoica, D. A. Patterson, What server[...] phase of cloud computing, Commun. ACM 64 (2021) 76-84. URL: https://doi.org/10.1145/3406011. doi:10.1145/3406011.

[19] Jangda, Pinckney, Brun, Guha, Formal [...] Program. Lang. 3 (2019). URL: https://doi.org/10.1145/3360575. doi:10.1145/3360575.

[20] Mahgoub, Shankar, Mitra, Klimovic, S. Chaterji, S. Bagchi, SONIC: Application-aware [...] in: 2021 USENIX Annual Technical Conference (USENIX ATC 21), USENIX Association, 2021, pp. 285-301. URL: https://www.usenix.org/conference/atc21/presentation/mahgoub.

[21] Raasveldt, Mühleisen, Duckdb: An embeddable analytical database, in: Proceedings of the 2019 International Conference on Management of Data, SIGMOD '19, Association for Computing Machinery, New York, NY, USA, 2019, p. 1981-1984. URL: https://doi.org/10.1145/3299869.3320212. doi:10.1145/3299869.3320212.

[22] Oakes, Yang, Zhou, Houck, T. Harter, [...] containers, in: Proceedings of the 2018 USENIX [...] Conference, USENIX ATC '18, USENIX Association, USA, 2018, p. 57-69.

[23] Agache, Brooker, Iordache, A. Liguori, [...]tions, in: 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20), USENIX Association, Santa Clara, CA, 2020, pp. 419-434. URL: https://www.usenix.org/conference/nsdi20/presentation/agache.

[24] Rossberg, WebAssembly Core Specification, W3C (2019). URL: https://www.w3.org/TR/wasm-core-1/.

[25] E. Ries, The lean startup: how constant [...] York, 2011. URL: http://www.amazon.de/dp/0670921602/ref=sr_1_2?ie=UTF8&qid=1396199893&sr=8-2&keywords=eric+ries.