<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hamilton: enabling software engineering best practices for data transformations via generalized dataflow graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefan Krawczyk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elijah ben Izzy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danielle Quinn</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Stitch Fix</institution>
          ,
          <addr-line>1 Montgomery Tower, Suite 1500, 94104, San Francisco, California</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>41</fpage>
      <lpage>50</lpage>
      <abstract>
        <p>While data science, as a high-level consumer and producer within data ecosystems, has grown in prevalence within organizations, software engineering practices for data science code bases have not. Stereotypical data science code is not known for unit test coverage, ease of documentation, reusability, or enabling quick incremental development as it grows. Over time, this lack of software engineering quality impacts a maintainer's ability to make progress within a data ecosystem. The data platform team at Stitch Fix created Hamilton to solve these software engineering pain points with respect to data transformations. It does this by requiring a programming paradigm change that enables straightforward specification and execution of dataflow graphs. Hamilton has enabled data science teams at Stitch Fix to scale their code bases to support 4000+ data transformations by ensuring that transformation code is always unit testable, documentation friendly, easily curated, reusable, and amenable to fast incremental development. Hamilton also enables transparently scaling computation onto distributed systems such as Dask, Ray, and Spark, without requiring a rewrite of data transform logic. Hamilton therefore represents a novel approach to modeling dataflows that is decoupled from materialization concerns, and presents an industry-pragmatic avenue for building a simpler user experience for high-level data ecosystem practitioners. Hamilton is available as open source code.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>With the shift to "Full Stack Data Science"[1], data scientists are expected to not only do data science, but also engineer and manage data pipelines for their production models. This additional responsibility places burdens on data scientists, who no longer hand off their ideas to a software engineering team for implementation and maintenance. This burden becomes especially acute in the domain of time-series forecasting, where data transformation needs involve creating an ever increasing number of features (columns) in a dataframe (table) for use with model fitting/forecasting. To create better time-series forecasts, one is continually seeking to add more features by incorporating new data, updating existing features, and deriving new features from existing ones. The majority of features are the product of a chain of transformations over other features. At Stitch Fix, the Forecasting, Estimation, and Demand (FED) team had curated a code base over the course of several years to produce a dataframe for fitting time-series models with thousands of such features. Unfortunately, maintaining and adding features to the code base had become burdensome to the point where their delivery of work slowed significantly. Unit-testing was virtually non-existent, documentation was scattered and inconsistent, and determining feature lineage grew in difficulty with the number of transforms.</p>
      <p>
        The Hamilton framework[2] was therefore conceived to mitigate the FED team's software engineering pain points. Specifically, Hamilton enables a simpler paradigm to create, maintain, and execute code for data engineering, especially in the case of highly complex data transformation dependency chains. Hamilton does this by deriving a directed acyclic graph (DAG) of dependencies using specially defined Python functions that describe the user's intended dataflow. Altogether, Hamilton makes incremental development, code reuse, unit testing, determining lineage, and documentation natural and straightforward. Furthermore, it provides avenues to quickly and easily scale computation onto various distributed computation frameworks, e.g. Ray[
        <xref ref-type="bibr" rid="ref37">3</xref>
        ]/Spark[4]/Dask[5], without changing much code.
      </p>
      <p>We will first provide some examples of typical software engineering pain points with data transformations at Stitch Fix, then talk about related tooling, and spend the rest of this report diving into Hamilton's programming paradigm. We will show the benefits this paradigm brings, provide a lightweight evaluation of the framework, and finish with a summary and a description of future work.</p>
      <sec id="sec-1-0">
        <title>2. Software engineering pain points with data transformations</title>
        <p>Listing 1 demonstrates the pain points we encountered at Stitch Fix. It demonstrates creating data transforms that represent features to fit a time-series model.</p>
        <preformat>
1  # create_features.py
2  import pandas as pd
3  from library import loader, is_holiday, is_uk_holiday
4
5  def compute_bespoke_feature(df: pd.DataFrame) -&gt; pd.Series:
6      """Some documentation explaining what this is"""
7      return (df['A'] - df['B'] + df['C']) * loader.get_weights()
8
9  def multiply_columns(col1: pd.Series,
10                      col2: pd.Series) -&gt; pd.Series:
11     """Some documentation explaining what this is"""
12     return col1 * col2
13
14 def run(dates, config):
15     df = loader.load_actuals(dates)  # e.g. spend, signups
16     if config['region'] == 'UK':
17         df['holidays'] = is_uk_holiday(df['year'], df['week'])
18     else:
19         df['holidays'] = is_holiday(df['year'], df['week'])
20     df['avg_3wk_spend'] = df['spend'].rolling(3).mean()
21     df['acquisition_cost'] = df['spend'] / df['signups']
22     df['spend_shift_3weeks'] = df['spend'].shift(3)
23     df['special_feature1'] = compute_bespoke_feature(df)
24     df['spend_b'] = multiply_columns(df['acquisition_cost'], df['B'])
25     save_df(df, "some_location")
26 if __name__ == '__main__':
27     run(dates=..., config=...)
        </preformat>
        <p>Listing 1: Example script that loads data, transforms data into features, and saves them</p>
        <p>At only twenty-seven lines, the code in Listing 1 looks innocuous. However, scaling this script from six to 1000+ data transforms (as occurred at Stitch Fix) presents the following problems:</p>
      </sec>
      <sec id="sec-1-1">
        <title>2.1. Inconsistent unit test coverage</title>
        <p>Only three of the derived features lend themselves towards straightforward unit testing. One cannot unit test the inline dataframe manipulations without running the entire script, so the code base inevitably has minimal, if any, test coverage. In such a codebase, it is difficult to determine behavioral changes when code changes.</p>
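        <p>To make this concrete, consider the transforms in Listing 1: only standalone functions such as multiply_columns can be exercised in isolation, while the inline manipulations (e.g. avg_3wk_spend) have no callable unit to target without running run() end to end. A minimal sketch of the kind of test that is possible (the fixture data is ours, for illustration only):</p>

```python
import pandas as pd

def multiply_columns(col1: pd.Series, col2: pd.Series) -> pd.Series:
    """Standalone transform from Listing 1 -- testable in isolation."""
    return col1 * col2

def test_multiply_columns():
    # a plain function call; no script execution or data loading required
    result = multiply_columns(pd.Series([1.0, 2.0]), pd.Series([3.0, 4.0]))
    pd.testing.assert_series_equal(result, pd.Series([3.0, 8.0]))

test_multiply_columns()
```

        <p>By contrast, a transform written inline against the central dataframe offers no such seam, which is why coverage erodes as the script grows.</p>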
      </sec>
      <sec id="sec-1-2">
        <title>2.2. Code readability and documentation</title>
        <p>Well organized code with documentation is critical for a
maintainer to understand and contribute to a codebase.
It ensures information is not siloed in the original
developer’s mind, and that newcomers to the codebase can
quickly become productive. In Listing 1, code readability and documentation are tragically lost between inline manipulations, functions, and the organization of the run
function. Identifying the logic used to derive a feature is
far from trivial, even with the best developer tools.</p>
      </sec>
      <sec id="sec-1-3">
        <title>2.3. Difficulty in tracing data lineage</title>
        <p>Listing 1 demonstrates the highly heterogeneous nature of data transformation code. The run function:
1. loads some data into a central dataframe object (line 15).
2. adds and derives features through various means:
   a) inline code that directly alters the dataframe (lines 20-22).
   b) a function that takes the whole dataframe and assigns the result to a new column (lines 5, 23).
   c) a function that uses columns from the central dataframe and assigns the result to a new column (lines 9, 17, 19, 24).
   d) a conditional branch that changes the implementation used to compute a column based on some configuration (lines 16-19).
3. contains only sporadic documentation.</p>
        <p>At six features, tracing lineage of inputs to a data transform is not particularly difficult. At 1000+ data transforms, however, this is a challenging task. At Stitch Fix, there are chains of transformation that span over fourteen such functions, with the average transformation chain length just over five.</p>
        <p>In order to add a new data transform, a developer has to make a decision as to where to put it. It could be at the end of the run function, or ideally near some logical grouping of transforms. However, there is no forcing function for a developer to do so, which inevitably leads to critical transform code spread throughout the entire codebase. A "spaghetti" codebase like this results in slow and frustrating debug cycles, requiring the cognitive burden of internalizing a mental map of computation in order to identify and fix problems. Ability to debug is then heavily correlated with tenure on the team!</p>
      </sec>
      <sec id="sec-1-3b">
        <title>2.4. Integration testing requires calculating all data transforms</title>
        <p>While feature generating scripts such as Listing 1 are initially quick to execute, they grow into a large monolith. In order to test the integration of a new feature, one has to run the entire script. As the script inevitably grows with the increasing complexity of a problem space, it takes longer to run, and thus longer to iterate on, fix bugs, and improve.</p>
      </sec>
      <sec id="sec-1-3c">
        <title>2.5. Code Reuse &amp; Duplication</title>
        <p>Because transform logic is not well encapsulated, code reuse is difficult to achieve outside of the current context of the script. Good software engineering practices advise consistently refactoring code for reuse; however, this is easy to skip. It is simpler for a data scientist to instead find the relevant code and cut &amp; paste it into their new context, especially when they are short on time. Left unchecked, this behavior creates more monolithic scripts and propagates the problem.</p>
      </sec>
      <sec id="sec-1-3d">
        <title>3. Related tooling</title>
        <p>In industry, there are a few tools that come to mind when discussing some of the pain points above.</p>
        <sec id="sec-1-3d-1">
          <title>3.1. Lineage/Data Catalogs</title>
          <p>OpenLineage[7] is a framework for data lineage collection and analysis. It aims to provide an open standard to enable disjoint tools to emit lineage metadata that can then be centrally tracked and curated. It requires a user to implement the standard, as well as maintain infrastructure to collect the emitted lineage metadata. It is designed for tracking materialization of whole data sets; it cannot track lineage at a columnar level.</p>
          <p>
            Data catalogs like Datahub[
            <xref ref-type="bibr" rid="ref14">8</xref>
            ] and Amundsen[9] are systems of record with which one can emit and store lineage and other metadata (e.g. for GDPR purposes). They require one to explicitly integrate with their APIs to capture this information. They are only as useful as the information provided to them, so a developer needs to explicitly consider integration as part of their development workflow.
          </p>
        </sec>
        <sec id="sec-1-4">
          <title>3.2. Data Quality</title>
          <p>
            When one thinks about data transformations and testing data, one often thinks of Pandera[
            <xref ref-type="bibr" rid="ref12">10</xref>
            ], Deequ[
            <xref ref-type="bibr" rid="ref20">11</xref>
            ], or Great Expectations[12].
          </p>
          <p>Pandera is a stateless, lightweight API for performing data validation on Pandas dataframes (i.e. in-memory tables). Its focus is to provide a quick mechanism to define expectations in code to create robust data processing pipelines. It has a small Python dependency footprint, so it is easy to install and embed within a pipeline, enabling it to live close to transform logic.</p>
          <p>Deequ is a stateful, heavy-weight framework that requires peripheral services to operate. It is built on top of Apache Spark and aims to define "unit tests for data" that help validate data quality expectations over large datasets. After a dataset has been constructed, the user defines expectations over that data, which are then checked via execution on Apache Spark.</p>
          <p>Great Expectations, like Deequ, is also a heavy-weight framework, but is more broadly applicable to Python. It allows one to validate, document, and profile data to ensure data quality. It follows a similar implementation pattern to Deequ, as one needs to explicitly integrate it after dataset construction into a dataflow.</p>
          <p>None of these frameworks are meant to be run like unit tests, and thus are not designed for testing transform logic.</p>
          <p>As for the user experience, one has to explicitly add data quality test(s) into a dataflow. Determining how to add tests, when to add tests, and how to maintain them as dataflows evolve places extra burden on the dataflow developer. For example, it is possible to change data transform logic and forget to update data quality expectations if they are defined in separate steps of the dataflow, located in a different file in the code base, or stored externally in a datastore. Analogously, if a data quality check fails, it can be similarly difficult to determine what source code generated the data, if one does not link the data quality test appropriately via naming or documentation.</p>
        </sec>
        <sec id="sec-1-3d-3">
          <title>3.3. Orchestration Frameworks</title>
          <p>Similar in approach to Hamilton are orchestration frameworks [13, 14, 15, 16]. They too model their operations via a DAG; however, their focus is modeling a user's end to end workflow at a macro-level. Specifically, they model discrete steps, at each of which an artifact is created and data is materialized. For example, in one step, raw data is ingested, transformed, and saved as a table, and in a subsequent step, a machine learning model is trained on that data and that model is saved. These frameworks also do not try to address any software engineering pain points a data transformation developer might have.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Hamilton Framework</title>
      <p>The Hamilton framework alleviates the pain points described in Section 2 through three distinct concepts:
• Hamilton functions: the low-level unit of work developers use to encode dataflow components.
• Function DAG: the representation of the dataflow's dependency structure, built by combining function definitions.
• Driver code: the code used to execute Hamilton functions, by specifying the functions used to build the DAG, the inputs to execution, and the parts of the DAG to run.</p>
      <p>For those eager to see a simple Hello World, we direct readers to Listing 5 in the Appendix.</p>
      <sec id="sec-2-1">
        <title>4.1. Hamilton Functions</title>
        <p>Hamilton functions force a novel programming paradigm
on the user. Like regular Python functions, they
encapsulate computational logic. However, the user is not
responsible for invoking functions and assigning the results to
a variable. Instead, this is encoded in the structure of
the function itself in a declarative manner. The function
name serves to specify, or declare, the intended output
variable, and the function input parameters (as well as
their type-annotations) map to expected input variables,
i.e. declared dependencies. In the context of creating
a dataframe, the function name serves as the intended
output column name, and the function input parameters
serve as the expected input columns/values. Type annotations on function parameters and return values are required by the Hamilton framework.</p>
        <p>Note (1): Hamilton can be used to model any Python object creation. For the remainder of this paper, we will stick to the context of creating pandas dataframes. Note (2): if Hamilton functions have wildly different Python dependency requirements, using Hamilton is still possible; one would just partition DAG execution into multiple steps matching the different Python dependency requirements.</p>
        <p>Listing 2 shows an example of the Hamilton paradigm and what it is replacing. Hamilton's breakdown of the example function's components is demonstrated in Table 1. By defining functions in this manner, the developer specifies their intended dataflow. This method of writing Python functions has a variety of implications:</p>
        <preformat>
1 # rather than
2 df['acquisition_cost'] = df['spend'] / df['signups']
3
4 # a user would instead write
5 def acquisition_cost(
6     signups: pd.Series, spend: pd.Series) -&gt; pd.Series:
7     """Example showing a simple Hamilton function"""
8     return spend / signups
        </preformat>
        <p>Listing 2: the core Hamilton programming paradigm with dataframes</p>
        <sec id="sec-2-1-1">
          <title>4.1.1. Verbosity</title>
          <p>This approach increases the lines of code required to describe simple operations. However, the benefits outweigh the cost: inputs are clearly specified, and logic is automatically encapsulated in named functions.</p>
        </sec>
        <sec id="sec-2-1-1b">
          <title>4.1.2. Unit testing</title>
          <p>As Hamilton functions contain well encapsulated logic and clearly specify inputs, all data transform code is unit testable!</p>
        </sec>
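        <p>To illustrate unit testability (a sketch in pytest style; the fixture data and expected values are ours, not from the paper), the acquisition_cost function from Listing 2 can be tested as a plain Python function, with no driver or DAG involved:</p>

```python
import pandas as pd

def acquisition_cost(signups: pd.Series, spend: pd.Series) -> pd.Series:
    """Example showing a simple Hamilton function (Listing 2)."""
    return spend / signups

def test_acquisition_cost():
    # a Hamilton function is just a function: call it directly with fixture data
    result = acquisition_cost(signups=pd.Series([10.0, 20.0]),
                              spend=pd.Series([100.0, 40.0]))
    pd.testing.assert_series_equal(result, pd.Series([10.0, 2.0]))

test_acquisition_cost()
```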
        <sec id="sec-2-1-2">
          <title>4.1.3. Code readability and documentation</title>
          <p>1. Encapsulating feature logic in functions implies a natural location for documentation (namely the Python docstring).
2. Coupling the name of the function with a reusable downstream artifact forces more meaningful naming. It is trivial to determine the definition of a feature and locate its usage: one simply needs to search the code base for a function with that name, or for functions that take it as an input parameter.</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>4.1.4. Vector friendly computation</title>
          <p>In the case of creating dataframes, the Hamilton
programming paradigm pushes a user to write a function to create
a single column, with inputs as columns as well. This
naturally leads the developer to write logic that can utilize
vector computation, which often speeds up execution.</p>
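          <p>As a sketch of why this matters (a micro-example of ours, not from the paper): a column-in, column-out function naturally expresses whole-Series arithmetic that pandas executes in vectorized form, whereas inline row-wise code invites Python-level loops.</p>

```python
import pandas as pd

spend = pd.Series(range(1, 100_001), dtype="float64")
signups = pd.Series(range(1, 100_001), dtype="float64")

def acquisition_cost(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """Column-in, column-out: a single vectorized operation."""
    return spend / signups

vectorized = acquisition_cost(spend, signups)
# the element-by-element equivalent, typically far slower in pure Python
looped = pd.Series([s / n for s, n in zip(spend, signups)])
assert vectorized.equals(looped)
```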
        </sec>
        <sec id="sec-2-1-4">
          <title>4.1.5. Functions as the core interface</title>
          <p>Python functions have well defined boundaries: inputs go in, and one output comes out. They can be serialized, inspected, and executed. Therefore, functions are used as a universal interface and building block for both the user experience and the framework. A user does not need to implement nor understand a special interface to use the core Hamilton features. Similarly, the framework, without knowing the exact shape of the function beforehand, has a clear object with which to work, where it can wrap a user's functions to inject operational concerns via decorators (see 4.2), or at run time (see 4.3.3).</p>
        </sec>
        <sec id="sec-2-1-5">
          <title>4.2. Advanced Hamilton Functions</title>
          <p>In an effort to encapsulate operational concerns and reduce repetitive function logic, Hamilton comes with a variety of decorators. Decorators primarily fulfill one of the following purposes:
1. Determining whether a function should exist. if/else blocks are dropped in favor of readable annotations (e.g. @config in listing 4).
2. Parameterizing function creation. A single function can create multiple nodes.
3. Simplifying function logic by promoting reuse. Syntactic sugar can help reduce verbosity and repeated code (e.g. @extract_columns in listing 4).
4. Modeling operational concerns in a modular manner. For example, adding metadata for GDPR purposes, or specifying run time data expectations.</p>
          <p>Hamilton decorators are extensible and can also be layered to enable highly expressive functions.</p>
          <p>Note, as functions are the core interface (see 4.1.5), the abstraction provided by Hamilton's decorator system gives a platform team, for example, a clear and decoupled way to plug into the user's function writing experience, while providing a clear way to manage and service their decorator implementations. Done correctly, user function definitions remain static through platform changes.</p>
          <p>With respect to data ecosystems, we will explain two relevant Hamilton decorators: @check_output() and @tag(). We direct readers to the Hamilton documentation [2] for more information on other decorators.</p>
          <sec id="sec-2-1-5-1">
            <title>4.2.1. @check_output</title>
            <p>In machine learning (ML) dataflows, data quality issues are a common cause of model problems. It is a best practice to set up data expectations to mitigate these problems. However, as explained in section 3.2, one typically needs to integrate such a concern into a dataflow explicitly. With Hamilton, integrating data quality expectations is less burdensome, as this takes the form of a lightweight Python decorator, @check_output(), with which one can simply annotate their Hamilton functions. Doing so enables transform logic and data expectations to be co-located, without cluttering the user's dataflow. There is no need to maintain separate code bases and data stores, or manually integrate checks as an explicit step of a dataflow. Therefore, maintenance and operational costs are low for adding runtime data quality checks to a dataflow.</p>
            <p>At DAG construction time, Hamilton automatically adds nodes to the DAG to check the output of the decorated function. At run time, after executing the user function, Hamilton validates the provided expectations, surfacing data quality errors to the dataflow developer via logging, or stopping execution altogether if desired. See listing 4 for an example of usage.</p>
          </sec>
          <sec id="sec-2-1-5-2">
            <title>4.2.2. @tag</title>
            <p>As data systems and environments change over time, different metadata needs arise. Rather than requiring explicit integrations with metadata systems, or enforcing a specific schema, Hamilton enables a lightweight way to annotate functions with such concerns. @tag() takes in string key-value pairs, and is thus amenable to annotating functions with anything relevant to your particular data ecosystem, e.g. ownership, source table names, GDPR concerns, project names, etc. These tags are then attached to nodes in the DAG, which can then be used as a basis for querying for nodes, or asking graph questions of the DAG. See listing 4 for an example of usage.</p>
          </sec>
        </sec>
        <sec id="sec-2-1-6">
          <title>4.3. The Function DAG</title>
          <p>The function DAG is the framework's representation of the nodes that should be executed and the dependencies between them.</p>
          <sec id="sec-2-1-6-1">
            <title>4.3.1. Node Creation</title>
            <p>Hamilton resolves the mapping of functions (e.g. listing 2) to nodes. In the case of Hamilton functions annotated with one or more decorators, a resolution step occurs to determine how many nodes to create (e.g. in the case of a parameterized function), and what the nodes should be named. Functions beginning with _ are presumed to be helper functions and are thus excluded from the DAG.</p>
          </sec>
          <sec id="sec-2-1-6-2">
            <title>4.3.2. Constructing the DAG</title>
            <p>Hamilton compiles the DAG from a list of Python modules containing Hamilton functions and optional configuration. It collects the relevant functions to create nodes, determines node dependencies, and assigns edges between them. Any dependency that does not map to a known node is marked as a required input for execution.</p>
          </sec>
          <sec id="sec-2-1-6-3">
            <title>4.3.3. Walking the DAG</title>
            <p>Given desired outputs, a topological sorting of the DAG is performed to determine the execution order. As the DAG is walked, additional operational concerns are injected, e.g. checking inputs and matching against function input types, delegating function computation, and constructing the final object returned from execution.</p>
          </sec>
        </sec>
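        <p>The co-location idea behind @check_output() can be sketched with an illustrative stand-in decorator (the parameter names below are ours, not Hamilton's actual API; see the Hamilton documentation for the real interface):</p>

```python
import functools
import pandas as pd

def check_output(dtype=None, allow_nans=True):
    """Illustrative stand-in for a runtime output check (not Hamilton's actual API)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            out = fn(*args, **kwargs)
            # validate the expectations declared next to the transform logic
            if dtype is not None and str(out.dtype) != dtype:
                raise TypeError(f"{fn.__name__}: expected {dtype}, got {out.dtype}")
            if not allow_nans and out.isna().any():
                raise ValueError(f"{fn.__name__}: output contains NaNs")
            return out
        return wrapper
    return decorator

@check_output(dtype="float64", allow_nans=False)
def acquisition_cost(spend: pd.Series, signups: pd.Series) -> pd.Series:
    return spend / signups
```

        <p>The expectation lives on the function itself, so the transform logic and its data quality check change in the same place.</p>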
      </sec>
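      <p>The behavior described in sections 4.3.1-4.3.3 can be approximated with a small standalone sketch (ours, not Hamilton's implementation; the function names are illustrative): parameter names declare dependencies, helper functions prefixed with _ are skipped, and requested outputs are resolved by recursively computing upstream nodes, which yields a valid topological order.</p>

```python
import inspect

# toy Hamilton-style functions: names declare outputs, parameters declare inputs
def acquisition_cost(spend: float, signups: float) -> float:
    return spend / signups

def spend_b(acquisition_cost: float, B: float) -> float:
    return acquisition_cost * B

def build_dag(funcs):
    """Map each non-helper function name to the names of its dependencies."""
    return {f.__name__: list(inspect.signature(f).parameters)
            for f in funcs if not f.__name__.startswith("_")}

def execute(funcs, outputs, inputs):
    """Resolve requested outputs by recursively computing upstream nodes."""
    nodes = {f.__name__: f for f in funcs}
    cache = dict(inputs)  # user-provided inputs seed the computation
    def resolve(name):
        if name not in cache:
            if name not in nodes:
                raise KeyError(f"{name} is a required input")
            deps = [resolve(d) for d in inspect.signature(nodes[name]).parameters]
            cache[name] = nodes[name](*deps)
        return cache[name]
    return {out: resolve(out) for out in outputs}

funcs = [acquisition_cost, spend_b]
assert build_dag(funcs) == {"acquisition_cost": ["spend", "signups"],
                            "spend_b": ["acquisition_cost", "B"]}
assert execute(funcs, ["spend_b"],
               {"spend": 100.0, "signups": 10.0, "B": 3.0}) == {"spend_b": 30.0}
```

      <p>Any dependency that maps to no function (here spend, signups, and B) surfaces as a required input, mirroring section 4.3.2.</p>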
      <sec id="sec-2-2">
        <title>4.4. Driver Code</title>
        <p>Driver code steers execution of the Function DAG,
providing a convenient abstraction layer. Thus the developer
never has to interact with the DAG itself, and instead
utilizes the driver to run and manage their dataflow. It
handles the following:</p>
        <sec id="sec-2-2-1">
          <title>4.4.1. DAG Instantiation</title>
          <p>The Driver directs construction of the Function DAG. Creation of the driver is as simple as the following:</p>
          <preformat>
1 from hamilton import driver
2 from funcs import spend_forecast, spend_data_loader
3
4 config = {...}
5 modules = [spend_data_loader, spend_forecast]
6 dr = driver.Driver(config, *modules, adapter=...)
          </preformat>
          <p>Listing 3: Sample Driver code to instantiate a DAG</p>
          <p>[Figure 1: visualization of an execution path, with user-defined inputs (actuals, signups, spend, B) flowing into acquisition_cost and then spend_b.]</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>4.4.2. DAG Execution</title>
          <p>The driver has two primary methods:</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>4.5. Benefits of Hamilton</title>
        <p>With respect to a data scientist’s workflow, we have found
the following benefits when using Hamilton.</p>
        <sec id="sec-2-3-1">
          <title>4.5.1. Incremental Development</title>
          <p>Rather than requiring execution of a monolithic script, Hamilton pushes the dataflow creator towards incremental, test-driven development. As dataflows are composed of discrete, unit-testable components, modifications to produce new data can be started locally by conducting test-driven development on the function itself. As node execution only requires running upstream dependencies, integrating with the full dataflow is straightforward. The developer need only request computation of the new node via the Hamilton driver to integration test the new addition.</p>
        </sec>
        <sec id="sec-2-3-1b">
          <title>4.5.2. Debugging</title>
          <p>Hamilton makes debugging dataflows simpler by providing a standard, methodical approach. One can isolate bugs by determining the erroneous output, finding the same-name function definition, debugging that logic, and, if no error is found, repeating the trace through each upstream dependency. Standard debugging procedures (such as code-diffing, breakpoints, and bisection) gain in value due to Hamilton's logical mapping of code to produced data. For example, to debug spend_b from our contrived example (listing 1), it is straightforward to visualize its execution path, Figure 1, and thus determine what needs to be debugged.</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>4.5.3. Documentation</title>
          <p>
            The confluence of:
• using function documentation strings
• one-to-one mapping of outputs to functions
• the ability to visualize the DAG and execution
paths
• the @tag() decorator for adding extra metadata
enables a clear and straightforward means to document
transform logic in a standardized way. The function
documentation string is perfect for long form explanations,
and can be exposed via tooling such as sphinx[
            <xref ref-type="bibr" rid="ref27">17</xref>
            ]. The
mapping of function names to outputs ensures that
function names and input parameters are meaningful while
also enabling one to quickly locate the definition of an
output. The ability to visualize the DAG and execution
paths helps provide a big picture mental model for those
learning the code base. The @tag() decorator makes it
easy to add additional metadata concerns, without
cluttering the transform logic itself.
          </p>
        </sec>
        <sec id="sec-2-3-3">
          <title>4.5.4. Central Definition Store</title>
          <p>A common problem for machine learning practitioners is that of leveraging others' work. Most industry solutions target materialized data, e.g. [18], rather than the code itself. As the code in Hamilton maps directly to outputs, module organization is highly incentivized. Curating all modules into a single repository (as the FED team did at Stitch Fix) provides a straightforward approach for a team to refer to and reuse work.</p>
        </sec>
        <sec id="sec-2-3-4">
          <title>4.5.5. Transparent Scaling</title>
          <p>Most distributed computation frameworks, e.g. Dask, Ray, and Spark, follow a lazy execution model: they build a DAG of the computation required prior to distributing execution. As Hamilton's Function DAG is structured using the same approach, it can provide a layer of indirection between dataflow definition and method of execution. In practice, this means that most Hamilton functions do not need modification to run on these distributed computation systems, unless the data type they operate over is not supported by that system. For example, both Spark and Dask implement the Pandas dataframe API, so a user would not have to change their Pandas code to scale to a Dask or Spark cluster, other than changing how they load data for execution.</p>
        </sec>
        <sec id="sec-2-3-5">
          <title>4.5.6. Source Code Based Lineage</title>
          <p>The declarative nature of Hamilton enables an entire end to end ML workflow to be modeled. Column-level lineage from source, to machine learning feature, to the model that consumes it, generally requires additional integration work to ensure its emission and storage, e.g. with Amundsen. With Hamilton, no such integration or system is required. The declarative functions can model this entire process with any tooling that is Python based, as the function source code becomes the source of truth. To build a standalone lightweight lineage system, one need only pair the function definitions, driver code, and configuration with a source code version control system (e.g. git) to snapshot the code (e.g. git commit) when an artifact is created, enabling reconstruction of the DAG for lineage querying purposes.</p>
          <sec id="sec-2-3-5-1">
            <title>5.3. Qualitative assessment</title>
            <p>The initial success criteria for the Hamilton project were all qualitative measures: namely, that a core data science team adopted the tooling, enjoyed using it, and was able to deliver on their business objectives. On all accounts, Hamilton delivered successfully, without any detractors. Since then, two and a half years in production have passed and the same qualitative measures still hold. The team manages over 4000 data transforms, which represents almost a decade of work, written by at least fifteen different team members.</p>
            <p>…take a whole day for a team member to complete prior to Hamilton. After Hamilton, this task takes no more than two hours, which represents a 4x improvement!</p>
          </sec>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6. Summary</title>
      <p>Hamilton is a novel dataflow framework that makes data transformation engineering in Python straightforward. By representing dataflows as a series of simple Python functions, Hamilton produces code that is easy to read and decoupled from execution. This results in transform logic that is always unit testable and documentation friendly, provides lineage out of the box, enables lightweight run time data quality checks, and unlocks fast iteration and debug cycles. It has enabled the FED team at Stitch Fix to scale, managing over 4000 data transforms that create features for time-series modeling. In addition, Hamilton provides a layer of indirection that transparently scales computation onto various distributed computation frameworks (such as Ray, Spark, and Dask), as materialization is decoupled from function transform definitions. This opens the door for exciting future work.</p>
      <sec id="sec-3-2">
        <title>4.5.7. Lineage for Data Privacy/Provenance Concerns</title>
        <p>Hamilton unlocks the ability to provide fine-grained lineage of computation. With the growth of privacy concerns and data regulation, organizations need to know what data comes in, where it goes, and how it is used. Hamilton functions can be marked (via @tag()) with privacy or regulation concerns, e.g. that an output contains Personally Identifiable Information (PII), enabling one to easily surface answers to questions of data usage and data impact from the structure of the DAG.</p>
        <sec id="sec-3-2a">
          <title>5. Evaluation</title>
          <sec id="sec-3-2a-1">
            <title>5.1. Adoption</title>
            <p>To enjoy the benefits of Hamilton, one must use the paradigm. For existing systems, this means a migration needs to occur, which has been the largest friction point to adopting Hamilton. Internally, teams with active feature development for time-series forecasting have been the most prolific adopters, as they are willing to pay the migration/adoption cost to reap the paradigm's benefits. Externally (since October 2021), at minimum, teams using Pandas and wanting to improve software engineering hygiene have been Hamilton's best adopters.</p>
          </sec>
        </sec>
        <sec id="sec-3-2-1">
          <title>5.2. Quantitative assessment</title>
          <p>A quantitative assessment of Hamilton’s benefits to a
team is challenging, as one would have to construct a
tightly controlled experiment, e.g. like [19]. In an
industry environment, however, it is hard to secure resourcing
for such an endeavor. That said, anecdotally, for the FED
team, a monthly feature engineering task of adding and
adjusting data transformations for model fitting used to</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>7. Future Work</title>
      <p>Here we highlight three avenues of future work. For
more, see the open issues in Hamilton’s GitHub repository.</p>
      <sec id="sec-4-1">
        <title>7.1. Source code based data governance</title>
        <p>With Hamilton, one can encode a rich repository of
metadata (see section 4.5.7) into the source code directly.
Because source code is required to perform data
transformations, keeping transform logic synchronized with
tags, data quality checks, and documentation is a simpler
proposition than having that metadata in separate
independent steps of a dataflow or separate systems.
Therefore the source code itself could conceivably be used as a
reliable base for data governance.</p>
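<p>As an illustration, such metadata can be read directly off the function objects. The sketch below uses a simplified stand-in for Hamilton’s @tag decorator (not Hamilton’s actual implementation) to show transform logic and governance metadata living side by side in source code:</p>
<p>
```python
import pandas as pd


def tag(**metadata):
    # Hypothetical stand-in for Hamilton's @tag decorator: it simply
    # attaches governance metadata to the function object itself.
    def decorator(fn):
        fn.__tags__ = dict(metadata)
        return fn
    return decorator


@tag(pii="true", source="user_profiles", owner="data-platform")
def email_domain(email: pd.Series) -> pd.Series:
    """Transform logic and its governance metadata live side by side."""
    return email.str.split("@").str[1]


# Any tool that can import the module can read the metadata straight
# from source-controlled code -- no separate metadata system required.
print(email_domain.__tags__["pii"])
print(email_domain(pd.Series(["a@example.com", "b@test.org"])).tolist())
```
</p>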
        <p>However, how to expose this information for
consumption requires more thought. Does one build directly on
top of the source code? Or does one emit this information
to an existing system, such as a data catalog? For the
former, a new system would need to be built. For the
latter, one could integrate a continuous integration
system that publishes changes when source code is snapshot
(i.e. committed), or augment the Hamilton driver/DAG
walking methodology to emit this information at DAG
instantiation/execution time.</p>
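<p>For the latter route, a continuous integration step could import the transform modules and emit their tag metadata as a document for an external catalog. The sketch below is purely illustrative: the tag stand-in, emit_metadata function, and simulated module are all invented for the example and are not Hamilton’s API:</p>
<p>
```python
import json
import types


def tag(**metadata):
    # Hypothetical stand-in for Hamilton's @tag decorator.
    def decorator(fn):
        fn.__tags__ = dict(metadata)
        return fn
    return decorator


@tag(pii="true", owner="data-platform")
def email_domain(email: str) -> str:
    return email.split("@")[1]


def emit_metadata(module) -> str:
    # Collect every tagged transform in a module into a JSON document
    # that a CI job could publish to a data catalog on each commit.
    entries = [
        {"name": name, "tags": fn.__tags__}
        for name, fn in vars(module).items()
        if callable(fn) and hasattr(fn, "__tags__")
    ]
    return json.dumps(entries, sort_keys=True)


# Simulate a module of transform functions for the example.
my_functions = types.SimpleNamespace(email_domain=email_domain)
print(emit_metadata(my_functions))
```
</p>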
        <p>Similarly, data access/use policies could also be a target
for source code based governance. By tagging functions
that ingest data sources with appropriate data policies,
one could, prior to DAG execution, walk the DAG to
ensure the requesting user and requested DAG execution
meets the policy requirements for those data sources.</p>
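<p>A minimal sketch of this pre-execution policy check follows. It uses plain Python functions and a simplified dependency walk (parameter names resolved to upstream function names, as in Hamilton’s paradigm) rather than Hamilton’s actual driver; the tag decorator is again a hypothetical stand-in:</p>
<p>
```python
import inspect


def tag(**metadata):
    # Hypothetical stand-in for Hamilton's @tag decorator.
    def decorator(fn):
        fn.__tags__ = dict(metadata)
        return fn
    return decorator


@tag(policy="pii")
def raw_emails() -> list:
    return ["a@example.com", "b@test.org"]


def email_count(raw_emails: list) -> int:
    return len(raw_emails)


FUNCTIONS = {fn.__name__: fn for fn in [raw_emails, email_count]}


def required_policies(output: str) -> set:
    # Walk upstream dependencies (parameter name == upstream function
    # name, as in Hamilton) and collect every policy tag encountered.
    policies, stack = set(), [output]
    while stack:
        fn = FUNCTIONS[stack.pop()]
        policies.update(getattr(fn, "__tags__", {}).get("policy", "").split())
        stack.extend(p for p in inspect.signature(fn).parameters)
    policies.discard("")
    return policies


# Before executing, check the requester's clearances against the DAG.
user_clearances = {"pii"}
needed = required_policies("email_count")
print(needed)                            # prints: {'pii'}
print(needed.issubset(user_clearances))  # prints: True
```
</p>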
      </sec>
      <sec id="sec-4-2">
        <title>7.2. Compiling to an orchestration framework</title>
        <p>A common problem with ML tooling is choosing an
orchestration system. This is a big decision, because
companies rarely change this infrastructure. As Hamilton
functions do not define or set materialization concerns,
it cannot be used in place of an orchestration framework
such as Airflow [15], where computation is split into
discrete steps and materialized to a data store in between
steps. If one were to provide node groupings and a
materialization function, then it would be straightforward
to compile the Hamilton Function DAG into any
existing framework. Programmatically defining orchestration
would also unlock the possibility of low-cost
infrastructure migrations while avoiding vendor lock-in.</p>
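<p>To sketch the idea: given node groupings and a materialization function, each grouping compiles into a discrete task that reads its inputs from a store and persists its outputs, which is the execution shape orchestration frameworks expect. The code below is a toy illustration of this compilation, not Hamilton’s API (the names GROUPS, STORE, and make_task are invented for the example):</p>
<p>
```python
import inspect


def spend_doubled(spend: list) -> list:
    return [s * 2 for s in spend]


def spend_total(spend_doubled: list) -> int:
    return sum(spend_doubled)


# Hypothetical inputs to a compiler: node groupings plus a
# materialization function that persists results between steps.
GROUPS = [["spend_doubled"], ["spend_total"]]
FUNCTIONS = {"spend_doubled": spend_doubled, "spend_total": spend_total}
STORE = {}  # stands in for a data store between orchestrated steps


def materialize(name, value):
    STORE[name] = value


def make_task(group):
    # Compile one node grouping into a discrete orchestration step that
    # reads upstream values from the store and materializes its outputs.
    def task():
        for name in group:
            fn = FUNCTIONS[name]
            args = {p: STORE[p] for p in inspect.signature(fn).parameters}
            materialize(name, fn(**args))
    return task


# An orchestrator (Airflow, etc.) would run these tasks in order.
STORE["spend"] = [10, 20, 30]
for task in map(make_task, GROUPS):
    task()
print(STORE["spend_total"])  # prints: 120
```
</p>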
      </sec>
      <sec id="sec-4-3">
        <title>7.3. Modeling your entire data warehouse independently of materialization concerns</title>
        <p>Common industry data tools and orchestration
frameworks leak materialization concerns into the user
experience. For example, using SQL, the end user has to think
in tables. This naturally cascades to how data is
materialized and transferred between workflows. What if,
instead, one could model the dependencies of one’s data
transforms, independently of how and where the data is
stored? The declarative nature of Hamilton unlocks this
possibility.
</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>A. A full Hamilton Hello World Example</title>
      <p>## --- in my_functions.py
import pandas as pd


def avg_3wk_spend(spend: pd.Series) -&gt; pd.Series:
    """Rolling 3 week average spend."""
    return spend.rolling(3).mean()


def spend_per_signup(spend: pd.Series, signups: pd.Series) -&gt; pd.Series:
    """The cost per signup in relation to spend."""
    return spend / signups


def spend_mean(spend: pd.Series) -&gt; float:
    """Shows function creating a scalar. In this case
    it computes the mean of the entire column."""
    return spend.mean()


def spend_zero_mean(spend: pd.Series, spend_mean: float) -&gt; pd.Series:
    """Shows function that takes a scalar. In this
    case to zero mean spend."""
    return spend - spend_mean


def spend_std_dev(spend: pd.Series) -&gt; float:
    """Function that computes the standard deviation
    of the spend column."""
    return spend.std()</p>
      <p>## in run.py
import pandas as pd
from hamilton import driver
import my_functions  # we import user functions here

initial_columns = {  # load from actuals or wherever -- this is our initial data we use as input. Example values:
    'signups': pd.Series([1, 10, 50, 100, 200, 400]),
    'spend': pd.Series([10, 10, 20, 40, 40, 50]),
}
# instantiate the DAG - multiple modules can be passed
dr = driver.Driver(initial_columns, my_functions)
# we need to specify what we want in the final dataframe
output_columns = [
    'spend',
    'signups',
    'avg_3wk_spend',
    'spend_per_signup',
    'spend_mean',
    'spend_zero_mean',
    'spend_std_dev',
]
# by default execution returns a dataframe
df = dr.execute(output_columns)
print(df.to_string())

# To visualize do 'pip install sf-hamilton[visualization]' if you want these to work
dr.visualize_execution(output_columns, './my_dag.dot', {})
dr.display_all_functions('./my_full_dag.dot')</p>
      <p>Listing 5: A full hello world example.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>Eric Colson, Beware the data science pin factory: The power of the full-stack data science generalist and the perils of division of labor through function, 2019. URL: https://multithreaded.stitchfix.com/blog/2019/03/11/FullStackDS-Generalists/.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>Stefan Krawczyk, Elijah ben Izzy, Danielle Quinn, Hamilton: a framework for defining dataflows, 2021. URL: https://github.com/stitchfix/hamilton.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>P. Moritz, Ray: A Distributed Execution Engine, UC Berkeley, 2019. URL: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-124.html.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, et al., Apache Spark: a unified engine for big data processing, Commun. ACM 59 (2016) 56-65. URL: http://doi.acm.org/10.1145/2934664. doi:10.1145/2934664.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>Various, Dask: Library for dynamic task scheduling, 2016. URL: https://dask.org.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>Pandas dev. team, pandas-dev/pandas: Pandas, 2020. URL: https://doi.org/10.5281/zenodo.3509134. doi:10.5281/zenodo.3509134.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>Various, OpenLineage: An open framework for data lineage collection and analysis, 2017. URL: https://openlineage.io/.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>Various, Datahub, 2020. URL: https://github.com/datahub-project/datahub.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>Various, Amundsen, 2019. URL: https://github.com/amundsen-io/amundsen.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>Niels Bantilan, pandera: Statistical Data Validation of Pandas Dataframes, in: Proceedings of the 19th Python in Science Conference, 2020, pp. 116-124. doi:10.25080/Majora-342d178e-010.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>S. Schelter, D. Lange, P. Schmidt, M. Celikel, F. Biessmann, et al., Automating large-scale data quality verification, Proceedings of the VLDB Endowment 11 (2018) 1781-1794.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>Various, Great expectations, 2017. URL: https://github.com/great-expectations/great_expectations.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>Various, Metaflow: a framework for real-life data science, 2020. URL: https://github.com/Netflix/metaflow.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>Various, Prefect workflow management system, 2017. URL: https://github.com/PrefectHQ/prefect.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>Various, Apache airflow, 2015. URL: https://github.com/apache/airflow.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>Various, Dagster: An orchestration platform for the development, production, and observation of data assets, 2020. URL: https://github.com/dagster-io/dagster.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>Georg Brandl, Sphinx documentation, 2008. URL: https://www.sphinx-doc.org/en/master/.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>T. Kakantousis, A. Kouzoupis, F. Buso, G. Berthou, J. Dowling, S. Haridi, Horizontally scalable ml pipelines with a feature store, in: Proc. 2nd SysML Conf., Palo Alto, USA, 2019.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>D. L. Moody, Cognitive load effects on end user understanding of conceptual models: An experimental analysis, in: A. Benczúr, J. Demetrovics, et al. (Eds.), Advances in Databases and Information Systems, Springer Berlin Heidelberg, Berlin, Heidelberg, 2004, pp. 129-143.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>