Hamilton: enabling software engineering best practices for data transformations via generalized dataflow graphs Stefan Krawczyk1,* , Elijah ben Izzy1 and Danielle Quinn1 1 Stitch Fix, 1 Montgomery Tower, Suite 1500, 94104, San Francisco, California, USA Abstract While data science, as a high level consumer of and producer to data ecosystems, has grown in prevalence within organizations, software engineering practices for data science code bases have not. Stereotypical data science code is not known for unit testing coverage, ease of documentation, reuseability, or enabling quick incremental development as it grows. Over time, this lack of software engineering quality impacts the maintainers ability to make progress within a data ecosystem. The data platform team at Stitch Fix created Hamilton to solve these software engineering pain points with respect to data transformations. It does this by requiring a programming paradigm change that enables straightforward specification and execution of dataflow graphs. Hamilton has enabled data science teams at Stitch Fix to scale their code bases to support 4000+ data transformations, by ensuring that transformation code is always unit testable, documentation friendly, easily curated, reuseable, and amenable to fast incremental development. Hamilton also enables transparently scaling computation onto distributed systems such as Dask, Ray, and Spark, without requiring a rewrite of data transform logic. Hamilton therefore represents a novel approach to modeling dataflows that is decoupled from materialization concerns, and presents an industry pragmatic avenue for building a simpler user experience for high level data ecosystem practitioners. Hamilton is available as open source code. 1. Introduction lineage grew in difficulty with the number of transforms. The Hamilton framework[2] was therefore conceived With the shift to "Full Stack Data Science"[1], data scien- to mitigate the FED team’s software engineering pain tists are expected to not only do data science, but also points. Specifically, Hamilton enables a simpler paradigm engineer and manage data pipelines for their production to create, maintain, and execute code for data engineer- models. This additional responsibility places burdens on ing, especially in the case of highly complex data trans- data scientists, who no longer hand off their ideas off to a formation dependency chains. Hamilton does this by software engineering team for implementation and main- deriving a directed acyclic graph (DAG) of dependencies tenance. This burden becomes especially acute in the using specially defined Python functions that describe the domain of time-series forecasting, where data transfor- user’s intended dataflow. Altogether, Hamilton makes mation needs involve creating an ever increasing number incremental development, code reuse, unit testing, deter- of features (columns) in a dataframe (table) for use with mining lineage, and documentation natural and straight- model fitting/forecasting. To create better time-series forward. Furthermore, it provides avenues to quickly forecasts, one is continually seeking to add more features and easily scale computation onto various distributed by incorporating new data, updating existing features, computation frameworks, e.g. Ray[3]/Spark[4]/Dask[5], and deriving new features from existing ones. The ma- without changing much code. jority of features are the product of a chain of transfor- We will first provide some examples of typical soft- mations over other features. At Stitch Fix, the Forecast- ware engineering pain points with data transformations ing, Estimation, and Demand (FED) team had curated a at Stitch Fix, then talk about related tooling, and spend code base over the course of several years, to produce a the rest of this report diving into Hamilton’s program- dataframe for fitting time-series models with thousands ming paradigm. We will show the benefits this paradigm of such features. Unfortunately, maintaining and adding brings, provide a lightweight evaluation of the frame- features to the code base had become burdensome to the work, and finish with a summary and a description of point where their delivery of work slowed significantly. future work. Unit-testing was virtually non-existent, documentation was scattered and inconsistent, and determining feature 2. Software engineering pain Proc. of the First International Workshop on Data Ecosystems (DEco’22), September 5, 2022, Sydney, Australia points with data Corresponding author transformations * $ stefank@cs.stanford.edu (S. Krawczyk); elijah.benizzy@stitchfix.com (E. b. Izzy); danielle.quinn@stitchfix.com (D. Quinn) Since software engineering pain points are somewhat © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). subjective, we present the following Python script using CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) Pandas[6] to illustrate common software engineering 41 pain points we encountered at Stitch Fix. It demonstrates 4. relies heavily on code execution order; line 21 has creating data transforms that represent features to fit a to occur before line 24. time-series model. At only twenty-seven lines, the code in Listing 1 looks 1# create_features.py innocuous. However, scaling this script from six to 1000+ 2 import pandas as pd data transforms (as occurred at Stitch Fix) presents the 3 from library import loader, is_holiday, is_uk_holiday following problems: 4 5 def compute_bespoke_feature(df: pd.DataFrame) -> pd. Series: 2.1. Inconsistent unit test coverage 6 """Some documentation explaining what this is""" 7 return (df[’A’] - df[’B’] + df[’C’]) * loader. Only three of the derived features lend themselves to- get_weights() wards straightforward unit testing. One cannot unit test 8 9 def multiply_columns(col1: pd.Series, the inline dataframe manipulations without running the 10 col2: pd.Series) -> pd.Series: entire script, so the code base inevitably has minimal, if 11 """Some documentation explaining what this is""" any, test coverage. In such a codebase, it is difficult to 12 return col1 * col2 determine behavioral changes when code changes. 13 14 def run(dates, config): 15 df = loader.load_actuals(dates) # e.g. spend, 2.2. Code readability and documentation signups 16 if config[’region’] == ’UK’: Well organized code with documentation is critical for a 17 df[’holidays’] = is_uk_holiday(df[’year’], df maintainer to understand and contribute to a codebase. [’week’]) It ensures information is not siloed in the original devel- else: 18 oper’s mind, and that newcomers to the codebase can 19 df[’holidays’] = is_holiday(df[’year’], df[’ week’]) quickly become productive. In Listing 1, code readability 20 df[’avg_3wk_spend’] = df[’spend’].rolling(3).mean and documentation is tragically lost between inline ma- () nipulations, functions, and the organization of the run 21 df[’acquisition_cost’] = df[’spend’] / df[’ function. Identifying the logic used to derive a feature is signups’] far from trivial, even with the best developer tools. 22 df[’spend_shift_3weeks’] = df[’spend’].shift(3) 23 df[’special_feature1’] = compute_bespoke_feature( df) 2.3. Difficulty in tracing data lineage 24 df[’spend_b’] = multiply_columns(df[’ acquisition_cost’], df[’B’]) At six features, tracing lineage of inputs to a data trans- 25 save_df(df, "some_location") form is not particularly difficult. At 1000+ data trans- 26 if __name__ == ’__main__’: forms, however, this is a challenging task. At Stitch Fix, 27 run(dates=..., config=...) there are chains of transformation that span over four- Listing 1: Example script that loads data, transforms data teen such functions, with the average transformation into features, and saves them chain length just over five. In order to add a new data transform, a developer Listing 1 demonstrates the highly heterogenous nature has to make a decision as to where to put it. It could of data transformation code. The run function: be at the end of the run function, or ideally near some 1. loads some data into a central dataframe object logical grouping of transforms. However there is no (line 15). forcing function for a developer to do so, which inevitably 2. adds and derives features through various means: leads to critical transform code spread throughout the a) inline code that directly alters the dataframe entire codebase. A "spaghetti" codebase like this results in (lines 20-22). slow and frustrating debug cycles, requiring the cognitive b) a function that takes the whole dataframe burden of internalizing a mental map of computation in and assigns the result to a new column order to identify and fix problems. Ability to debug is (lines 5, 23). then heavily correlated with tenure on the team! c) a function that uses columns from the cen- tral dataframe and assigns the result to a 2.4. Integration testing requires new column (lines 9, 17, 19, 24). calculating all data transforms d) a conditional branch that changes the im- plementation used to compute a column While feature generating scripts such as Listing 1 are based on some configuration (lines 16-19). initially quick to execute, they grow into a large monolith. 3. contains only sporadic documentation. In order to test the integration of a new feature, one has to run the entire script. As the script inevitably grows 42 with the increasing complexity of a problem space, it that help validate data quality expectations over large takes longer to run, and thus longer to iterate on, fix datasets. After a dataset has been constructed, the user bugs, and improve. defines expectations over that data, that are then checked via execution on Apache Spark. 2.5. Code Reuse & Duplication Great Expectations, like Deequ, is also a heavy-weight framework, but is more broadly applicable to python. Because transform logic is not well encapsulated, code It allows one to validate, document, and profile data to reuse is difficult to achieve outside of the current context ensure data quality. It follows a similar implementation of the script. Good software engineering practices advise pattern to Deequ, as one needs to explicitly integrate it consistently refactoring code for reuse, however this is after dataset construction into a dataflow. easy to skip. It is simpler for a data scientist to instead None of the frameworks are meant to be run like unit find the relevant code and cut & paste it to their new tests, and thus are not designed for testing transform context, especially when they are scarce for time. Left logic. unchecked, this behavior creates more monolithic scripts As for the user experience, one has to explicitly add and propagates the problem. data quality test(s) into a dataflow. Determining how to add tests, when to add tests, and how to maintain them as dataflows evolve causes extra burden on the 3. Related tooling dataflow developer. For example, it is possible to change In industry, there are a few tools that come to mind when data transform logic and forget to update data quality discussing some of the pain points above. expectations if they are defined in separate steps of the dataflow, located in a different file in the code base, or stored externally in a datastore. Analogously, if a data 3.1. Lineage/Data Catalogs quality check fails, it can be similarly difficult to deter- OpenLineage[7] is an framework for data lineage collec- mine what source code generated the data, if one does tion and analysis. It aims to provide an open standard to not link the data quality test appropriately via naming enable disjoint tools to emit lineage metadata that can or documentation. then be centrally tracked and curated. It requires a oner to implement the standard, as well as maintain infras- 3.3. Orchestration Frameworks tructure to collect the emitted lineage metadata. It is designed for tracking materialization of whole data sets. Similar in approach to Hamilton are orchestration frame- It cannot track lineage at a columnar level. works [13, 14, 15, 16]. They too model their operations Data catalogs like Datahub[8] and Amundsen[9], are via a DAG, however their focus is modeling a user’s end systems of record with which one can emit and store to end workflow at a macro-level. Specifically, they model lineage and other metadata (e.g. for GDPR purposes). discrete steps at each of which an artifact is created and They require one to explicitly integrate with their APIs data is materialized. For example, in one step, raw data to capture this information. They are only as useful as the is ingested and transformed and saved as a table, and in information provided to them, so a developer needs to ex- a subsequent step, a machine learning model is trained plicitly consider integration as part of their development on that data and that model is saved. workflow. These frameworks also do not try to address any soft- ware engineering pain points a data transformation de- veloper might have. 3.2. Data Quality When one thinks about data transformations and test- ing data, one often thinks of Pandera[10], Deequ[11], or 4. Hamilton Framework Great Expectations[12]. The Hamilton framework alleviates the pain points de- Pandera is a stateless lightweight API for performing scribed in Section 2 through three distinct concepts: data validation on Pandas dataframes (i.e. in memory tables). Its focus is to provide a quick mechanism to define • Hamilton functions: the low-level unit of work expectations in code to create robust data processing developers use to encode dataflow components. pipelines. It has a small python dependency footprint so • Function DAG: The representation of the dataflow’s is easy to install and embed within a pipeline, enabling dependency structure, built by combining func- it to live close to transform logic. tion definitions. Deequ is a stateful, heavy-weight framework, that re- • Driver code: the code used to execute Hamilton quires peripheral services to operate. It is built on top functions by specifying the functions used to of Apache Spark and aims to define "unit tests for data" 43 build the DAG, the inputs to execution, and the 4.1.2. Unit Testing parts of the DAG to run. As Hamilton functions contain well encapsulated logic For those eager to see a simple Hello World we direct and clearly specify inputs, all data transform code is unit readers to Listing 5 in the Appendix. testable! 4.1. Hamilton Functions 4.1.3. Code readability and documentation Hamilton functions force a novel programming paradigm 1. Encapsulating feature logic in functions implies on the user. Like regular Python functions, they encapsu- a natural location for documentation (namely the late computational logic. However, the user is not respon- Python docstring). sible for invoking functions and assigning the results to 2. Coupling the name of the function with a reusable a variable. Instead, this is encoded in the structure of downstream artifact forces more meaningful nam- the function itself in a declarative manner. The function ing. It is trivial to determine the definition of a name serves to specify, or declare, the intended output feature and locate its usage. One needs to sim- variable, and the function input parameters (as well as ply search the code base for a function with that their type-annotations) map to expected input variables, name or which has that as an input parameter. i.e. declared dependencies. In the context of creating a dataframe, the function name serves as the intended 4.1.4. Vector friendly computation output column name, and the function input parameters In the case of creating dataframes, the Hamilton program- serve as the expected input columns/values. Type annota- ming paradigm pushes a user to write a function to create tions on the function and the variables are required by a single column, with inputs as columns as well. This nat- the Hamilton Framework. urally leads the developer to write logic that can utilize Note (1), Hamilton can be used to model any python vector computation, which often speeds up execution. object creation. For the remainder of this paper, we will stick to the context of creating pandas dataframes. Note (2), if Hamilton functions have wildly different python 4.1.5. Functions as the core interface dependency requirements, using Hamilton is still possi- Python functions have well defined boundaries; inputs ble, one would just partition DAG execution into multiple go in, and one output comes out. They can be serialized, steps matching the different python dependency require- inspected, and executed. Therefore, functions are used as ments. a universal interface and building block for both the user 1# rather than experience and the framework. A user does not need to 2 df[’acquisition_cost’] = df[’spend’] / df[’signups’] implement nor understand a special interface to use the 3 core Hamilton features. Similarly, the framework, with- 4# a user would instead write out knowing the exact shape of the function beforehand, 5 def acquisition_cost( 6 signups: pd.Series, spend: pd.Series) -> pd. has a clear object with which to work with, where it can Series: wrap a user’s functions to inject operational concerns via 7 """Example showing a simple Hamilton function""" decorators (see 4.2), or at run time (see 4.3.3). 8 return spend / signups Listing 2: the core Hamilton programming paradigm 4.2. Advanced Hamilton Functions with dataframes In an effort to encapsulate operational concerns and re- Listing 2 shows an example of the Hamilton paradigm duce repetitive function logic, Hamilton comes with a and what it is replacing. Hamilton’s breakdown of the variety of decorators. Decorators primarily fulfill one of example function’s components is demonstrated in Table the following purposes: 1. By defining functions in this manner, the developer 1. Determining whether a function should exist. if specifies their intended dataflow. This method of writing else blocks are dropped in favor of readable anno- Python functions has a variety of implications: tations (e.g. @config in listing 4). 2. Parameterizing function creation. A single func- 4.1.1. Verbosity tion can create multiple nodes. This approach increases the lines of code required to 3. Simplifying function logic by promoting reuse. describe simple operations. However, the benefits out- Syntactic sugar can help reduce verbosity and re- weigh the cost. Inputs are clearly specified, and logic is peated code (e.g. @extract_columns in listing automatically encapsulated in named functions. 4. 44 Table 1 How functions become nodes in a the DAG using the function defined in Listing 2 as an example. Function Name acquisition_cost Node name Type-hints pd.Series Node input & output types Parameter Names signups, spend Upstream dependencies Documentation Example showing a simple hamilton function Node Documentation Function Body return spend/signups Node Definition 4. Modeling operational concerns in a modular man- 4.2.2. @tag ner. For example, adding metadata for GDPR pur- As data systems and environments change over time, poses, or specifying run time data expectations. different metadata needs arise. Rather than requiring Hamilton decorators are extensible and can also be explicit integrations with metadata systems, or enforcing layered to enable highly expressive functions. a specific schema, Hamilton enables a lightweight way to Note, as functions are the core interface (see 4.1.5), the annotate functions with such concerns. @tag() takes in abstraction provided by Hamilton’s decorator system en- string key value pairs, and is thus amenable to annotat- ables, a platform team for example, a clear and decoupled ing functions with anything relevant to your particular way to plug into the user’s function writing experience, data ecosystem. E.g. ownership, source table names, while providing a clear way to manage and service their GDPR concerns, project names, etc. These tags are then decorator implementations. Done correctly, user func- attached to nodes in the DAG, which then can be used as tion definitions remains static to platform changes. a basis for querying for nodes, or asking graph questions With respect to data ecosystems, we will explain two of the DAG. See listing 4 for an example of usage. relevant Hamilton decorators: @check_output() and @tag(). We direct readers to the Hamilton documenta- 4.3. The Function DAG tion [2] for more information on other decorators. The function DAG is the framework’s representation of the nodes that should be executed and the dependencies 4.2.1. @check_output between them. In machine learning (ML) dataflows, data quality issues are a common cause of model problems. It is a best prac- 4.3.1. Node Creation tice to setup data expectations to mitigate these prob- lems. However, as explained in section 3.2, one needs Hamilton resolves the mapping of functions (e.g. listing to additionally integrate such a concern into a dataflow 2) to nodes. In the case of Hamilton functions annotated explicitly. With Hamilton, integrating data quality expec- with one or more decorators, a resolution step occurs to tations are less burdensome, as this takes the form of a determine how many nodes to create (e.g. in case of a lightweight Python decorator @check_output(), with parameterized function), and what the nodes should be which one can simply annotate their Hamilton functions. named. Functions beginning with _ are presumed to be Doing so enables transform logic and data expectations helper functions and thus excluded from inclusion in the to be co-located, without cluttering the user’s dataflow. DAG. There is no need to maintain separate code bases and data stores, or manually integrate checks as an explicit step 4.3.2. Constructing the DAG of a dataflow. Therefore maintenance and operational Hamilton compiles the DAG from a list of Python mod- costs are low for adding runtime data quality checks to a ules containing Hamilton functions and optional con- dataflow. figuration. It collects the relevant functions to create At DAG construction time, Hamilton automatically nodes, determines node dependencies, and assigns edges adds nodes to the DAG to check the output of the dec- between them. Any dependency that does not map to a orated function. At run time, after executing the user known node is marked as a required input for execution. function, Hamilton validates the provided expectations, surfacing data quality errors to the dataflow developer via logging, or stopping execution altogether if desired. 4.3.3. Walking the DAG See listing 4 for an example of usage. Given desired outputs, a topological sorting of the DAG is performed to determine the execution order. As the DAG is walked, additional operational concerns are injected, e.g. checking inputs and matching against function input 45 types, delegating function computation, and constructing the final object returned from execution. UD: dates 4.4. Driver Code Driver code steers execution of the Function DAG, pro- viding a convenient abstraction layer. Thus the developer actuals never has to interact with the DAG itself, and instead utilizes the driver to run and manage their dataflow. It handles the following: signups spend 4.4.1. DAG Instantiation The Driver directs construction of the Function DAG. Creation of the driver is as simple as the following: B acquisition_cost 1 from hamilton import driver 2 from funcs import spend_forecast, spend_data_loader 3 4 config={...} spend_b 5 modules = [spend_data_loader, spend_forecast] 6 dr = driver.Driver(config, *modules, adapter=...) Listing 3: Sample Driver code to instantiate a DAG Figure 1: Example rendering produced by running The call to instantiate the driver accepts a config ar- visualize_execution() on an instantiated DAG, if one gument. This takes the form of a dictionary with string was interested in computing spend_b from Listing 1 as imple- keys and Python objects as values, that serves two pur- mented in Hamilton in Listing 4. Hamilton makes it straight- forward to determine what is required to compute a feature. poses: (1) it helps determine the shape of the DAG when UD refers to user defined input. Note: for diagram legi- coupled with appropriate decorators (section 4.2); (2) it bility, we omitted displaying the validation nodes that the sets inputs that a user wants to be invariant between @check_output() decorator would add to the DAG. DAG execution runs. Meanwhile, the adapter argument (optional) controls execution (such as delegating to Dask), and determines the object type returned from DAG exe- 1 # in a module, e.g. my_functions.py cution. 2 3 @tag(source="prod.denormalized", owner="team:DE") 4.4.2. DAG Execution 4 @extract_columns(’year’, ’week’, ’spend’, ’signups’, ’A’, ’B’, ’C’) The driver has two primary methods: 5 def actuals(dates: ’a_date_type’) -> pd.DataFrame: 6 return loader.load_actuals(dates) 1. execute(outputs_wanted, inputs, 7 overrides) executes the DAG, computing only 8 @check_output(data_type=np.float64, allow_nans=False) what is required to create the output, and returns 9 def weights() -> pd.Series: a python object, e.g. a Pandas dataframe. 10 return loader.get_weights() 11 2. visualize_execution(outputs_wanted, 12 @config.when(region=’UK’) inputs, ...) visualizes the parts of the DAG 13 def holidays__uk(year: pd.Series, week: pd.Series) -> required for execution. pd.Series: 14 return is_uk_holiday(year, week) Note that the developer can pass parameters to the DAG 15 through two Python dictionaries: inputs and overrid- 16 @config.when(region=’US’) es. Inputs specifies runtime inputs to the DAG, provid- 17 def holidays__us(year: pd.Series, week: pd.Series) -> ing requisite dependencies that are not satisfied by exist- pd.Series: 18 return is_holiday(year, week) ing nodes. Overrides enables the developer to bypass 19 execution of specified nodes, effectively short-circuiting 20 def avg_3wk_spend(spend: pd.Series) -> pd.Series: their computation. Hamilton will forego computation 21 return spend.rolling(3).mean() of any upstream node depended on solely by overrid- 22 den nodes. By offering these parametrization capabili- 23 def acquisition_cost(spend: pd.Series, signups: pd. Series) -> pd.Series: ties, Hamilton enables precise control over the dataflow’s 24 return spend / signups structure and execution. 46 25 code-diffing, breakpoints, and bisection) gain in value 26 def spend_shift_3weeks(spend: pd.Series) -> pd.Series due to Hamilton’s logical mapping of code to produced : data. For example, to debug spend_b from our contrived 27 return spend.shift(3) 28 example (listing 1), it is straightforward to visualize it’s 29 def special_feature1(A: pd.Series, B: pd.Series, C: execution path, Figure 1, and thus determine what needs pd.Series, weights: pd.Series) -> pd.Series: to be debugged. 30 """Some documentation explaining what this is""" return (A - B + C) * weights 4.5.3. Documentation 31 32 33 @check_output(data_type=np.float64, range=(0.0, The confluence of: 100.0), allow_nans=False) 34 def spend_b(acquisition_cost: pd.Series, B: pd.Series • using function documentation strings ) -> pd.Series: • one-to-one mapping of outputs to functions 35 """documentation to explain this function""" 36 return acquisition_cost * B • the ability to visualize the DAG and execution 37 paths 38 ## In a separate script/module, e.g. run.py, • the @tag() decorator for adding extra metadata 39 ## code to create and execute the DAG 40 from hamilton import driver enables a clear and straightforward means to document 41 import my_functions transform logic in a standardized way. The function doc- 42 umentation string is perfect for long form explanations, 43 config = {...} # configuration 44 modules = [my_functions] # modules to crawl and can be exposed via tooling such as sphinx[17]. The 45 dr = driver.Driver(config, *modules) mapping of function names to outputs ensures that func- 46 df = dr.execute([’year’, ’week’, ’holidays’, ’ tion names and input parameters are meaningful while acquisition_cost’, ...]) # materialize also enabling one to quickly locate the definition of an 47 save_df(df, "some_location") # save result output. The ability to visualize the DAG and execution Listing 4: Hamilton version of the earlier example script paths helps provide a big picture mental model for those in Listing 1, with four decorators used to show learning the code base. The @tag() decorator makes it example usage. easy to add additional metadata concerns, without clut- tering the transform logic itself. 4.5. Benefits of Hamilton 4.5.4. Central Definition Store With respect to a data scientist’s workflow, we have found A common problem for machine learning practitioners is the following benefits when using Hamilton. that of leveraging other’s work. Most industry solutions target materialized data, e.g. [18], rather than the code 4.5.1. Incremental Development itself. As the code in Hamilton maps directly to outputs, module organization is highly incentivized. Curating all Rather than requiring execution of a monolithic script, modules into a single repository (as the FED team did Hamilton pushes the dataflow creator towards incremen- at Stitch Fix) provides a straightforward approach for a tal, test-driven, development. As dataflows are composed team to refer to and reuse work. of discrete, unit-testable components, modifications to produce new data can be started locally by conducting 4.5.5. Transparent Scaling test-driven development on the function itself. As node execution only requires running upstream dependencies, Most distributed computation frameworks follow a lazy integrating with the full dataflow is straightforward. The execution model e.g. Dask, Ray, and Spark. They build a developer need only request computation of the new DAG of the computation required prior to distributing ex- node via the Hamilton driver to integration test the new ecution. As Hamilton’s Function DAG is structured using addition. the same approach, it can provide a layer of indirection between dataflow definition and method of execution. In 4.5.2. Debugging practice, this means that most Hamilton Functions do not need modification to run on these distributed compu- Hamilton makes debugging dataflows simpler by provid- tation systems, unless the data type they operate over is ing a standard methodical approach. One can isolate bugs not supported by that system. For example, both Spark by determining the erroneous output, finding the same- and Dask implement the Pandas dataframe API, so a user name function definition, debugging that logic, and if would not have to change their Pandas code to scale to no error is found, repeat tracing through each upstream a Dask or Spark cluster, other than changing how they dependency. Standard debugging procedures (such as load data for execution. 47 4.5.6. Source Code Based Lineage take a whole day for a team member to complete prior to Hamilton. After Hamilton, this task takes no more than The declarative nature of Hamilton enables an entire end two hours, which represents a 4x improvement! to end ML workflow to be modeled. Column level lin- eage from source, to machine learning feature, to model that consumes it, generally requires additional integra- 5.3. Qualitative assessment tion work to ensure it’s emission and storage, e.g. with The initial success criteria for the Hamilton project were Amundsen. With Hamilton, no such integration or sys- all qualitative measures. Namely, that a core data sci- tem is required. The declarative functions can model ence team adopted the tooling, enjoyed using it, and this entire process with any tooling that is python based, were able to deliver on their business objectives. On all as the function source code becomes the source of truth. accounts, Hamilton delivered successfully, without any To build a standalone lightweight lineage system, one detractors. Since then, two and a half years in production need to only pair the function definitions, driver code have passed and the same qualitative measures still hold. and configuration, with a source code version control The team manages over 4000 data transforms, which system (e.g. git) to snapshot the code (e.g. git commit) represents almost a decade of work, written by at least when an artifact is created, to enable reconstruction of fifteen different team members. the DAG for lineage querying purposes. 4.5.7. Lineage for Data Privacy/Provenance 6. Summary Concerns Hamilton is a novel dataflow framework that makes data Hamilton unlocks the ability to provide fine grained lin- transformation engineering in Python straightforward. eage of computation. With the growth of privacy con- By representing dataflows as a series of simple Python cerns and data regulation, organizations need to know functions, Hamilton produces code that is easy to read what data comes in, where it goes, and how it is used. and decoupled from execution. This results in trans- Hamilton functions can be marked (via @tag() with form logic that is always unit testable and documenta- privacy or regulation concerns, e.g. that it contains Per- tion friendly, provides lineage out of the box, enables sonally Identifiable Information (PII), enabling one to lightweight run time data quality checks, and unlocks easily surface answers to questions of data usage and fast iteration and debug cycles. It has enabled the FED data impact from the structure of the DAG. team at Stitch Fix to scale, managing over 4000 data trans- forms that create features for time-series modeling. In addition, Hamilton provides a layer of indirection 5. Evaluation that transparently scales computation onto various dis- tributed computation frameworks (such as Ray, Spark, 5.1. Adoption and Dask) as materialization is decoupled from function To enjoy the benefits of Hamilton, one must use the transform definitions. This opens the door for exciting paradigm. For existing systems, this means a migration future work. needs to occur, which has been the largest friction point to adopting Hamilton. Internally, teams with active fea- ture development for time-series forecasting have been 7. Future Work the most prolific adopters, as they are the willing to pay Here we highlight three avenues of future work. For the migration/adoption cost to reap the paradigm’s bene- more, see open issues in Hamilton’s github repository. fits. Externally (since October 2021), at minimum, teams using Pandas and wanting to improve software engineer- ing hygiene have been Hamilton’s best adopters. 7.1. Source code based data governance With Hamilton, one can encode a rich repository of 5.2. Quantitative assessment metadata (see section 4.5.7) into the source code directly. Because source code is required to perform data trans- A quantitative assessment of Hamilton’s benefits to a formations, keeping transform logic synchronized with team is challenging, as one would have to construct a tags, data quality checks, and documentation is a simpler tightly controlled experiment, e.g. like [19]. In an indus- proposition than having that metadata in separate inde- try environment, however, it is hard to secure resourcing pendent steps of a dataflow or separate systems. There- for such an endeavor. That said, anecdotally, for the FED fore the source code itself could conceivably be used as a team, a monthly feature engineering task of adding and reliable base for data governance. adjusting data transformations for model fitting used to 48 However, how to expose this information for consump- function, 2019. URL: https://multithreaded.stitchfix. tion requires more thought. Does one build directly on com/blog/2019/03/11/FullStackDS-Generalists/. top of the source code? Or does one emit this information [2] Stefan Krawczyk, Elijah ben Izzy, Danielle Quinn, to an existing system, such as a data catalog? For the A scalable general purpose micro-framework for former, a new system would need to be built. For the defining dataflows, 2021. URL: https://github.com/ latter, one could integrate a continuous integration sys- stitchfix/hamilton. tem that publishes changes when source code is snapshot [3] P. Moritz, Ray: A Distributed Execution Engine (i.e. committed), or augment the Hamilton driver/DAG for the Machine Learning Ecosystem, Ph.D. the- walking methodology to emit this information at DAG sis, EECS Department, University of California, instantiation/execution time. Berkeley, 2019. URL: http://www2.eecs.berkeley. Similarly, data access/use policies could also be a target edu/Pubs/TechRpts/2019/EECS-2019-124.html. for source code based governance. By tagging functions [4] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Arm- that ingest data sources with appropriate data policies, brust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, one could, prior to DAG execution, walk the DAG to M. J. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, ensure the requesting user and requested DAG execution I. Stoica, Apache spark: a unified engine for big data meets the policy requirements for those data sources. processing, Commun. ACM 59 (2016) 56–65. URL: http://doi.acm.org/10.1145/2934664. doi:10.1145/ 2934664. 7.2. Compiling to an orchestration [5] Various, Dask: Library for dynamic task scheduling, framework 2016. URL: https://dask.org. A common problem with ML tooling is choosing an or- [6] Pandas dev. team, pandas-dev/pandas: Pandas, chestration system. This is a big decision, because com- 2020. URL: https://doi.org/10.5281/zenodo.3509134. panies rarely change this infrastructure. As Hamilton doi:10.5281/zenodo.3509134. functions do not define or set materialization concerns, [7] Various, An open framework for data lineage col- it cannot be used in place of an orchestration framework lection and analysis, 2017. URL: https://openlineage. such as Airflow[15], where computation is split into dis- io/. crete steps and materialized to a data store in between [8] Various, Datahub, 2020. URL: https://github.com/ steps. If one were to provide node groupings and a ma- datahub-project/datahub. terialization function, then it would be straightforward [9] Various, Amundsen, 2019. URL: https://github.com/ to compile the Hamilton Function DAG into any exist- amundsen-io/amundsen. ing framework. Programmatically defining orchestration [10] Niels Bantilan, pandera: Statistical Data Valida- would also unlock the possibility for low cost infrastruc- tion of Pandas Dataframes, in: Meghann Agar- ture migrations, while avoiding vendor lock in. wal, Chris Calloway, Dillon Niederhut, David Shupe (Eds.), Proceedings of the 19th Python in Science Conference, 2020, pp. 116 – 124. doi:10.25080/ 7.3. Modeling your entire data warehouse Majora-342d178e-010. independently of materialization [11] S. Schelter, D. Lange, P. Schmidt, M. Celikel, F. Biess- concerns mann, A. Grafberger, Automating large-scale data Common industry data tools and orchestration frame- quality verification, Proceedings of the VLDB En- works leak materialization concerns into the user experi- dowment 11 (2018) 1781–1794. ence. For example, using SQL, the end user has to think [12] Various, Great expectations, 2017. URL: in tables. This naturally cascades to how data is mate- https://github.com/great-expectations/great_ rialized and transferred between workflows. What if, expectations. instead, one could model the dependencies of one’s data [13] Various, Metaflow: a framework for real-life transforms, independently of how and where the data is data science, 2020. URL: https://github.com/Netflix/ stored? The declarative nature of Hamilton unlocks this metaflow. possibility. [14] Various, Prefect workflow management system, 2017. URL: https://github.com/PrefectHQ/prefect. [15] Various, Apache airflow, 2015. URL: https://github. References com/apache/airflow. [16] Various, Dagster: An orchestration platform for the [1] Eric Colson, Beware the data science pin factory: development, production, and observation of data The power of the full-stack data science gener- assets, 2020. URL: https://github.com/dagster-io/ alist and the perils of division of labor through dagster. [17] Georg Brandl, Sphinx documentation, 2008. URL: 49 https://www.sphinx-doc.org/en/master/. 39 # Note: these values don’t have to be all series, [18] T. Kakantousis, A. Kouzoupis, F. Buso, G. Berthou, they could be scalar. ’signups’: pd.Series([1, 10, 50, 100, 200, 400]), J. Dowling, S. Haridi, Horizontally scalable ml 40 41 ’spend’: pd.Series([10, 10, 20, 40, 40, 50]), pipelines with a feature store, in: Proc. 2nd SysML 42 } Conf., Palo Alto, USA, 2019. 43 # instantiate the DAG - multiple modules can be [19] D. L. Moody, Cognitive load effects on end user passed understanding of conceptual models: An experi- 44 dr = driver.Driver(initial_columns, my_functions) mental analysis, in: A. Benczúr, J. Demetrovics, 45 # we need to specify what we want in the final dataframe G. Gottlob (Eds.), Advances in Databases and Infor- 46 output_columns = [ mation Systems, Springer Berlin Heidelberg, Berlin, 47 ’spend’, Heidelberg, 2004, pp. 129–143. 48 ’signups’, 49 ’avg_3wk_spend’, 50 ’spend_per_signup’, A. A full Hamilton Hello World 51 52 ] ’spend_zero_mean_unit_variance’ Example 53 # 54 df by default execution returns a dataframe = dr.execute(output_columns) 55 print(df.to_string()) 1 ## --- in my_functions.py 56 2 import pandas as pd 57 # To visualize do ‘pip install sf-hamilton[ 3 visualization]‘ if you want these to work 4 def avg_3wk_spend(spend: pd.Series) -> pd.Series: 58 dr.visualize_execution(output_columns, ’./my_dag.dot’ 5 """Rolling 3 week average spend.""" , {}) 6 return spend.rolling(3).mean() 59 dr.display_all_functions(’./my_full_dag.dot’) 7 8 Listing 5: A full hello world example. 9 def spend_per_signup(spend: pd.Series, signups: pd. Series) -> pd.Series: 10 """The cost per signup in relation to spend.""" 11 return spend / signups 12 13 14 def spend_mean(spend: pd.Series) -> float: 15 """Shows function creating a scalar. In this case it computes the mean of the entire column.""" 16 return spend.mean() 17 18 19 def spend_zero_mean(spend: pd.Series, spend_mean: float) -> pd.Series: 20 """Shows function that takes a scalar. In this case to zero mean spend.""" 21 return spend - spend_mean 22 23 24 def spend_std_dev(spend: pd.Series) -> float: 25 """Function that computes the standard deviation of the spend column.""" 26 return spend.std() 27 28 29 def spend_zero_mean_unit_variance(spend_zero_mean: pd .Series, spend_std_dev: float) -> pd.Series: 30 """Function showing one way to make spend have zero mean and unit variance.""" 31 return spend_zero_mean / spend_std_dev 32 33 ## in run.py 34 import pandas as pd 35 from hamilton import driver 36 import my_functions # we import user functions here 37 38 initial_columns = { # load from actuals or wherever -- this is our initial data we use as input. 50