<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hamilton: enabling software engineering best practices for data transformations via generalized dataflow graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefan Krawczyk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elijah ben Izzy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danielle Quinn</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Stitch Fix</institution>
          ,
          <addr-line>1 Montgomery Tower, Suite 1500, 94104, San Francisco, California</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>41</fpage>
      <lpage>50</lpage>
      <abstract>
        <p>While data science, as a high-level consumer and producer within data ecosystems, has grown in prevalence within organizations, software engineering practices for data science code bases have not. Stereotypical data science code is not known for unit test coverage, ease of documentation, reusability, or enabling quick incremental development as it grows. Over time, this lack of software engineering quality impacts a maintainer's ability to make progress within a data ecosystem. The data platform team at Stitch Fix created Hamilton to solve these software engineering pain points with respect to data transformations. It does this by requiring a programming paradigm change that enables straightforward specification and execution of dataflow graphs. Hamilton has enabled data science teams at Stitch Fix to scale their code bases to support 4000+ data transformations by ensuring that transformation code is always unit testable, documentation friendly, easily curated, reusable, and amenable to fast incremental development. Hamilton also enables transparently scaling computation onto distributed systems such as Dask, Ray, and Spark, without requiring a rewrite of data transform logic. Hamilton therefore represents a novel approach to modeling dataflows that is decoupled from materialization concerns, and presents an industry-pragmatic avenue for building a simpler user experience for high-level data ecosystem practitioners. Hamilton is available as open source code.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>With the shift to "Full Stack Data Science"[1], data scientists are expected to not only do data science, but also engineer and manage data pipelines for their production models. This additional responsibility places burdens on data scientists, who no longer hand off their ideas to a software engineering team for implementation and maintenance. This burden becomes especially acute in the domain of time-series forecasting, where data transformation needs involve creating an ever increasing number of features (columns) in a dataframe (table) for use with model fitting/forecasting. To create better time-series forecasts, one is continually seeking to add more features by incorporating new data, updating existing features, and deriving new features from existing ones. The majority of features are the product of a chain of transformations over other features. At Stitch Fix, the Forecasting, Estimation, and Demand (FED) team had curated a code base over the course of several years to produce a dataframe for fitting time-series models with thousands of such features. Unfortunately, maintaining and adding features to the code base had become burdensome to the point where their delivery of work slowed significantly. Unit-testing was virtually non-existent, documentation was scattered and inconsistent, and determining feature lineage grew in difficulty with the number of transforms.</p>
      <p>
        The Hamilton framework[2] was therefore conceived to mitigate the FED team's software engineering pain points. Specifically, Hamilton enables a simpler paradigm to create, maintain, and execute code for data engineering, especially in the case of highly complex data transformation dependency chains. Hamilton does this by deriving a directed acyclic graph (DAG) of dependencies using specially defined Python functions that describe the user's intended dataflow. Altogether, Hamilton makes incremental development, code reuse, unit testing, determining lineage, and documentation natural and straightforward. Furthermore, it provides avenues to quickly and easily scale computation onto various distributed computation frameworks, e.g. Ray[
        <xref ref-type="bibr" rid="ref37">3</xref>
        ]/Spark[4]/Dask[5], without changing much code.
      </p>
      <p>We will first provide some examples of typical software engineering pain points with data transformations at Stitch Fix, then talk about related tooling, and spend the rest of this report diving into Hamilton's programming paradigm. We will show the benefits this paradigm brings, provide a lightweight evaluation of the framework, and finish with a summary and a description of future work.</p>
      <sec id="sec-1-0">
        <title>2. Software engineering pain points with data transformations</title>
        <p>Listing 1 demonstrates the pain points we encountered at Stitch Fix. It demonstrates creating data transforms that represent features to fit a time-series model.</p>
        <preformat>
1  # create_features.py
2  import pandas as pd
3  from library import loader, is_holiday, is_uk_holiday
4
5  def compute_bespoke_feature(df: pd.DataFrame) -&gt; pd.Series:
6      """Some documentation explaining what this is"""
7      return (df['A'] - df['B'] + df['C']) * loader.get_weights()
8
9  def multiply_columns(col1: pd.Series,
10                      col2: pd.Series) -&gt; pd.Series:
11     """Some documentation explaining what this is"""
12     return col1 * col2
13
14 def run(dates, config):
15     df = loader.load_actuals(dates)  # e.g. spend, signups
16     if config['region'] == 'UK':
17         df['holidays'] = is_uk_holiday(df['year'], df['week'])
18     else:
19         df['holidays'] = is_holiday(df['year'], df['week'])
20     df['avg_3wk_spend'] = df['spend'].rolling(3).mean()
21     df['acquisition_cost'] = df['spend'] / df['signups']
22     df['spend_shift_3weeks'] = df['spend'].shift(3)
23     df['special_feature1'] = compute_bespoke_feature(df)
24     df['spend_b'] = multiply_columns(df['acquisition_cost'], df['B'])
25     save_df(df, "some_location")
26 if __name__ == '__main__':
27     run(dates=..., config=...)
        </preformat>
        <p>Listing 1: Example script that loads data, transforms data into features, and saves them</p>
        <p>At only twenty-seven lines, the code in Listing 1 looks innocuous. However, scaling this script from six to 1000+ data transforms (as occurred at Stitch Fix) presents the following problems:</p>
      </sec>
      <sec id="sec-1-1">
        <title>2.1. Inconsistent unit test coverage</title>
        <p>Only three of the derived features lend themselves towards straightforward unit testing. One cannot unit test the inline dataframe manipulations without running the entire script, so the code base inevitably has minimal, if any, test coverage. In such a codebase, it is difficult to determine behavioral changes when code changes.</p>
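        <p>To make this concrete, consider the transforms in Listing 1: only standalone functions such as multiply_columns can be exercised in isolation, while the inline manipulations (e.g. avg_3wk_spend) have no callable unit to target without running run() end to end. A minimal sketch of the kind of test that is possible (the fixture data is ours, for illustration only):</p>

```python
import pandas as pd

def multiply_columns(col1: pd.Series, col2: pd.Series) -> pd.Series:
    """Standalone transform from Listing 1 -- testable in isolation."""
    return col1 * col2

def test_multiply_columns():
    # a plain function call; no script execution or data loading required
    result = multiply_columns(pd.Series([1.0, 2.0]), pd.Series([3.0, 4.0]))
    pd.testing.assert_series_equal(result, pd.Series([3.0, 8.0]))

test_multiply_columns()
```

        <p>By contrast, a transform written inline against the central dataframe offers no such seam, which is why coverage erodes as the script grows.</p>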
      </sec>
      <sec id="sec-1-2">
        <title>2.2. Code readability and documentation</title>
        <p>Well organized code with documentation is critical for a
maintainer to understand and contribute to a codebase.
It ensures information is not siloed in the original
developer’s mind, and that newcomers to the codebase can
quickly become productive. In Listing 1, code readability and documentation are tragically lost between inline manipulations, functions, and the organization of the run
function. Identifying the logic used to derive a feature is
far from trivial, even with the best developer tools.</p>
      </sec>
      <sec id="sec-1-3">
        <title>2.3. Difficulty in tracing data lineage</title>
        <p>Listing 1 demonstrates the highly heterogeneous nature of data transformation code. The run function:
1. loads some data into a central dataframe object (line 15).
2. adds and derives features through various means:
   a) inline code that directly alters the dataframe (lines 20-22).
   b) a function that takes the whole dataframe and assigns the result to a new column (lines 5, 23).
   c) a function that uses columns from the central dataframe and assigns the result to a new column (lines 9, 17, 19, 24).
   d) a conditional branch that changes the implementation used to compute a column based on some configuration (lines 16-19).
3. contains only sporadic documentation.</p>
        <p>At six features, tracing lineage of inputs to a data transform is not particularly difficult. At 1000+ data transforms, however, this is a challenging task. At Stitch Fix, there are chains of transformation that span over fourteen such functions, with the average transformation chain length just over five.</p>
        <p>In order to add a new data transform, a developer has to make a decision as to where to put it. It could be at the end of the run function, or ideally near some logical grouping of transforms. However, there is no forcing function for a developer to do so, which inevitably leads to critical transform code spread throughout the entire codebase. A "spaghetti" codebase like this results in slow and frustrating debug cycles, requiring the cognitive burden of internalizing a mental map of computation in order to identify and fix problems. Ability to debug is then heavily correlated with tenure on the team!</p>
      </sec>
      <sec id="sec-1-3b">
        <title>2.4. Integration testing requires calculating all data transforms</title>
        <p>While feature generating scripts such as Listing 1 are initially quick to execute, they grow into a large monolith. In order to test the integration of a new feature, one has to run the entire script. As the script inevitably grows with the increasing complexity of a problem space, it takes longer to run, and thus longer to iterate on, fix bugs, and improve.</p>
      </sec>
      <sec id="sec-1-3c">
        <title>2.5. Code Reuse &amp; Duplication</title>
        <p>Because transform logic is not well encapsulated, code reuse is difficult to achieve outside of the current context of the script. Good software engineering practices advise consistently refactoring code for reuse; however, this is easy to skip. It is simpler for a data scientist to instead find the relevant code and cut &amp; paste it into their new context, especially when they are short on time. Left unchecked, this behavior creates more monolithic scripts and propagates the problem.</p>
      </sec>
      <sec id="sec-1-3d">
        <title>3. Related tooling</title>
        <p>In industry, there are a few tools that come to mind when discussing some of the pain points above.</p>
        <sec id="sec-1-3d-1">
          <title>3.1. Lineage/Data Catalogs</title>
          <p>OpenLineage[7] is a framework for data lineage collection and analysis. It aims to provide an open standard to enable disjoint tools to emit lineage metadata that can then be centrally tracked and curated. It requires a user to implement the standard, as well as maintain infrastructure to collect the emitted lineage metadata. It is designed for tracking materialization of whole data sets; it cannot track lineage at a columnar level.</p>
          <p>
            Data catalogs like Datahub[
            <xref ref-type="bibr" rid="ref14">8</xref>
            ] and Amundsen[9] are systems of record with which one can emit and store lineage and other metadata (e.g. for GDPR purposes). They require one to explicitly integrate with their APIs to capture this information. They are only as useful as the information provided to them, so a developer needs to explicitly consider integration as part of their development workflow.
          </p>
        </sec>
        <sec id="sec-1-4">
          <title>3.2. Data Quality</title>
          <p>
            When one thinks about data transformations and testing data, one often thinks of Pandera[
            <xref ref-type="bibr" rid="ref12">10</xref>
            ], Deequ[
            <xref ref-type="bibr" rid="ref20">11</xref>
            ], or Great Expectations[12].
          </p>
          <p>Pandera is a stateless, lightweight API for performing data validation on Pandas dataframes (i.e. in-memory tables). Its focus is to provide a quick mechanism to define expectations in code to create robust data processing pipelines. It has a small Python dependency footprint, so it is easy to install and embed within a pipeline, enabling it to live close to transform logic.</p>
          <p>Deequ is a stateful, heavy-weight framework that requires peripheral services to operate. It is built on top of Apache Spark and aims to define "unit tests for data" that help validate data quality expectations over large datasets. After a dataset has been constructed, the user defines expectations over that data, which are then checked via execution on Apache Spark.</p>
          <p>Great Expectations, like Deequ, is also a heavy-weight framework, but is more broadly applicable to Python. It allows one to validate, document, and profile data to ensure data quality. It follows a similar implementation pattern to Deequ, as one needs to explicitly integrate it after dataset construction into a dataflow.</p>
          <p>None of these frameworks are meant to be run like unit tests, and thus are not designed for testing transform logic.</p>
          <p>As for the user experience, one has to explicitly add data quality test(s) into a dataflow. Determining how to add tests, when to add tests, and how to maintain them as dataflows evolve places extra burden on the dataflow developer. For example, it is possible to change data transform logic and forget to update data quality expectations if they are defined in separate steps of the dataflow, located in a different file in the code base, or stored externally in a datastore. Analogously, if a data quality check fails, it can be similarly difficult to determine what source code generated the data, if one does not link the data quality test appropriately via naming or documentation.</p>
        </sec>
        <sec id="sec-1-3d-3">
          <title>3.3. Orchestration Frameworks</title>
          <p>Similar in approach to Hamilton are orchestration frameworks [13, 14, 15, 16]. They too model their operations via a DAG; however, their focus is modeling a user's end to end workflow at a macro-level. Specifically, they model discrete steps, at each of which an artifact is created and data is materialized. For example, in one step, raw data is ingested, transformed, and saved as a table, and in a subsequent step, a machine learning model is trained on that data and that model is saved. These frameworks also do not try to address any software engineering pain points a data transformation developer might have.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Hamilton Framework</title>
      <p>The Hamilton framework alleviates the pain points described in Section 2 through three distinct concepts:
• Hamilton functions: the low-level unit of work developers use to encode dataflow components.
• Function DAG: the representation of the dataflow's dependency structure, built by combining function definitions.
• Driver code: the code used to execute Hamilton functions, by specifying the functions used to build the DAG, the inputs to execution, and the parts of the DAG to run.</p>
      <p>For those eager to see a simple Hello World, we direct readers to Listing 5 in the Appendix.</p>
      <sec id="sec-2-1">
        <title>4.1. Hamilton Functions</title>
        <p>Hamilton functions force a novel programming paradigm
on the user. Like regular Python functions, they
encapsulate computational logic. However, the user is not
responsible for invoking functions and assigning the results to
a variable. Instead, this is encoded in the structure of
the function itself in a declarative manner. The function
name serves to specify, or declare, the intended output
variable, and the function input parameters (as well as
their type-annotations) map to expected input variables,
i.e. declared dependencies. In the context of creating
a dataframe, the function name serves as the intended
output column name, and the function input parameters
serve as the expected input columns/values. Type annotations on function parameters and return values are required by the Hamilton framework.</p>
        <p>Note (1): Hamilton can be used to model any Python object creation. For the remainder of this paper, we will stick to the context of creating pandas dataframes. Note (2): if Hamilton functions have wildly different Python dependency requirements, using Hamilton is still possible; one would just partition DAG execution into multiple steps matching the different Python dependency requirements.</p>
        <p>Listing 2 shows an example of the Hamilton paradigm and what it is replacing. Hamilton's breakdown of the example function's components is demonstrated in Table 1. By defining functions in this manner, the developer specifies their intended dataflow. This method of writing Python functions has a variety of implications:</p>
        <preformat>
1 # rather than
2 df['acquisition_cost'] = df['spend'] / df['signups']
3
4 # a user would instead write
5 def acquisition_cost(
6     signups: pd.Series, spend: pd.Series) -&gt; pd.Series:
7     """Example showing a simple Hamilton function"""
8     return spend / signups
        </preformat>
        <p>Listing 2: the core Hamilton programming paradigm with dataframes</p>
        <sec id="sec-2-1-1">
          <title>4.1.1. Verbosity</title>
          <p>This approach increases the lines of code required to describe simple operations. However, the benefits outweigh the cost: inputs are clearly specified, and logic is automatically encapsulated in named functions.</p>
        </sec>
        <sec id="sec-2-1-1b">
          <title>4.1.2. Unit testing</title>
          <p>As Hamilton functions contain well encapsulated logic and clearly specify inputs, all data transform code is unit testable!</p>
        </sec>
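        <p>To illustrate unit testability (a sketch in pytest style; the fixture data and expected values are ours, not from the paper), the acquisition_cost function from Listing 2 can be tested as a plain Python function, with no driver or DAG involved:</p>

```python
import pandas as pd

def acquisition_cost(signups: pd.Series, spend: pd.Series) -> pd.Series:
    """Example showing a simple Hamilton function (Listing 2)."""
    return spend / signups

def test_acquisition_cost():
    # a Hamilton function is just a function: call it directly with fixture data
    result = acquisition_cost(signups=pd.Series([10.0, 20.0]),
                              spend=pd.Series([100.0, 40.0]))
    pd.testing.assert_series_equal(result, pd.Series([10.0, 2.0]))

test_acquisition_cost()
```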
        <sec id="sec-2-1-2">
          <title>4.1.3. Code readability and documentation</title>
          <p>1. Encapsulating feature logic in functions implies a natural location for documentation (namely the Python docstring).
2. Coupling the name of the function with a reusable downstream artifact forces more meaningful naming. It is trivial to determine the definition of a feature and locate its usage: one simply needs to search the code base for a function with that name, or for functions that take it as an input parameter.</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>4.1.4. Vector friendly computation</title>
          <p>In the case of creating dataframes, the Hamilton
programming paradigm pushes a user to write a function to create
a single column, with inputs as columns as well. This
naturally leads the developer to write logic that can utilize
vector computation, which often speeds up execution.</p>
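          <p>As a sketch of why this matters (a micro-example of ours, not from the paper): a column-in, column-out function naturally expresses whole-Series arithmetic that pandas executes in vectorized form, whereas inline row-wise code invites Python-level loops.</p>

```python
import pandas as pd

spend = pd.Series(range(1, 100_001), dtype="float64")
signups = pd.Series(range(1, 100_001), dtype="float64")

def acquisition_cost(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """Column-in, column-out: a single vectorized operation."""
    return spend / signups

vectorized = acquisition_cost(spend, signups)
# the element-by-element equivalent, typically far slower in pure Python
looped = pd.Series([s / n for s, n in zip(spend, signups)])
assert vectorized.equals(looped)
```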
        </sec>
        <sec id="sec-2-1-4">
          <title>4.1.5. Functions as the core interface</title>
          <p>Python functions have well defined boundaries: inputs go in, and one output comes out. They can be serialized, inspected, and executed. Therefore, functions are used as a universal interface and building block for both the user experience and the framework. A user does not need to implement nor understand a special interface to use the core Hamilton features. Similarly, the framework, without knowing the exact shape of the function beforehand, has a clear object with which to work, where it can wrap a user's functions to inject operational concerns via decorators (see 4.2), or at run time (see 4.3.3).</p>
        </sec>
        <sec id="sec-2-1-5">
          <title>4.2. Advanced Hamilton Functions</title>
          <p>In an effort to encapsulate operational concerns and reduce repetitive function logic, Hamilton comes with a variety of decorators. Decorators primarily fulfill one of the following purposes:
1. Determining whether a function should exist. if/else blocks are dropped in favor of readable annotations (e.g. @config in listing 4).
2. Parameterizing function creation. A single function can create multiple nodes.
3. Simplifying function logic by promoting reuse. Syntactic sugar can help reduce verbosity and repeated code (e.g. @extract_columns in listing 4).
4. Modeling operational concerns in a modular manner. For example, adding metadata for GDPR purposes, or specifying run time data expectations.</p>
          <p>Hamilton decorators are extensible and can also be layered to enable highly expressive functions.</p>
          <p>Note, as functions are the core interface (see 4.1.5), the abstraction provided by Hamilton's decorator system gives a platform team, for example, a clear and decoupled way to plug into the user's function writing experience, while providing a clear way to manage and service their decorator implementations. Done correctly, user function definitions remain static through platform changes.</p>
          <p>With respect to data ecosystems, we will explain two relevant Hamilton decorators: @check_output() and @tag(). We direct readers to the Hamilton documentation [2] for more information on other decorators.</p>
          <sec id="sec-2-1-5-1">
            <title>4.2.1. @check_output</title>
            <p>In machine learning (ML) dataflows, data quality issues are a common cause of model problems. It is a best practice to set up data expectations to mitigate these problems. However, as explained in section 3.2, one typically needs to integrate such a concern into a dataflow explicitly. With Hamilton, integrating data quality expectations is less burdensome, as this takes the form of a lightweight Python decorator, @check_output(), with which one can simply annotate their Hamilton functions. Doing so enables transform logic and data expectations to be co-located, without cluttering the user's dataflow. There is no need to maintain separate code bases and data stores, or manually integrate checks as an explicit step of a dataflow. Therefore, maintenance and operational costs are low for adding runtime data quality checks to a dataflow.</p>
            <p>At DAG construction time, Hamilton automatically adds nodes to the DAG to check the output of the decorated function. At run time, after executing the user function, Hamilton validates the provided expectations, surfacing data quality errors to the dataflow developer via logging, or stopping execution altogether if desired. See listing 4 for an example of usage.</p>
          </sec>
          <sec id="sec-2-1-5-2">
            <title>4.2.2. @tag</title>
            <p>As data systems and environments change over time, different metadata needs arise. Rather than requiring explicit integrations with metadata systems, or enforcing a specific schema, Hamilton enables a lightweight way to annotate functions with such concerns. @tag() takes in string key-value pairs, and is thus amenable to annotating functions with anything relevant to your particular data ecosystem, e.g. ownership, source table names, GDPR concerns, project names, etc. These tags are then attached to nodes in the DAG, which can then be used as a basis for querying for nodes, or asking graph questions of the DAG. See listing 4 for an example of usage.</p>
          </sec>
        </sec>
        <sec id="sec-2-1-6">
          <title>4.3. The Function DAG</title>
          <p>The function DAG is the framework's representation of the nodes that should be executed and the dependencies between them.</p>
          <sec id="sec-2-1-6-1">
            <title>4.3.1. Node Creation</title>
            <p>Hamilton resolves the mapping of functions (e.g. listing 2) to nodes. In the case of Hamilton functions annotated with one or more decorators, a resolution step occurs to determine how many nodes to create (e.g. in the case of a parameterized function), and what the nodes should be named. Functions beginning with _ are presumed to be helper functions and are thus excluded from the DAG.</p>
          </sec>
          <sec id="sec-2-1-6-2">
            <title>4.3.2. Constructing the DAG</title>
            <p>Hamilton compiles the DAG from a list of Python modules containing Hamilton functions and optional configuration. It collects the relevant functions to create nodes, determines node dependencies, and assigns edges between them. Any dependency that does not map to a known node is marked as a required input for execution.</p>
          </sec>
          <sec id="sec-2-1-6-3">
            <title>4.3.3. Walking the DAG</title>
            <p>Given desired outputs, a topological sorting of the DAG is performed to determine the execution order. As the DAG is walked, additional operational concerns are injected, e.g. checking inputs and matching against function input types, delegating function computation, and constructing the final object returned from execution.</p>
          </sec>
        </sec>
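        <p>The co-location idea behind @check_output() can be sketched with an illustrative stand-in decorator (the parameter names below are ours, not Hamilton's actual API; see the Hamilton documentation for the real interface):</p>

```python
import functools
import pandas as pd

def check_output(dtype=None, allow_nans=True):
    """Illustrative stand-in for a runtime output check (not Hamilton's actual API)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            out = fn(*args, **kwargs)
            # validate the expectations declared next to the transform logic
            if dtype is not None and str(out.dtype) != dtype:
                raise TypeError(f"{fn.__name__}: expected {dtype}, got {out.dtype}")
            if not allow_nans and out.isna().any():
                raise ValueError(f"{fn.__name__}: output contains NaNs")
            return out
        return wrapper
    return decorator

@check_output(dtype="float64", allow_nans=False)
def acquisition_cost(spend: pd.Series, signups: pd.Series) -> pd.Series:
    return spend / signups
```

        <p>The expectation lives on the function itself, so the transform logic and its data quality check change in the same place.</p>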
      </sec>
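      <p>The behavior described in sections 4.3.1-4.3.3 can be approximated with a small standalone sketch (ours, not Hamilton's implementation; the function names are illustrative): parameter names declare dependencies, helper functions prefixed with _ are skipped, and requested outputs are resolved by recursively computing upstream nodes, which yields a valid topological order.</p>

```python
import inspect

# toy Hamilton-style functions: names declare outputs, parameters declare inputs
def acquisition_cost(spend: float, signups: float) -> float:
    return spend / signups

def spend_b(acquisition_cost: float, B: float) -> float:
    return acquisition_cost * B

def build_dag(funcs):
    """Map each non-helper function name to the names of its dependencies."""
    return {f.__name__: list(inspect.signature(f).parameters)
            for f in funcs if not f.__name__.startswith("_")}

def execute(funcs, outputs, inputs):
    """Resolve requested outputs by recursively computing upstream nodes."""
    nodes = {f.__name__: f for f in funcs}
    cache = dict(inputs)  # user-provided inputs seed the computation
    def resolve(name):
        if name not in cache:
            if name not in nodes:
                raise KeyError(f"{name} is a required input")
            deps = [resolve(d) for d in inspect.signature(nodes[name]).parameters]
            cache[name] = nodes[name](*deps)
        return cache[name]
    return {out: resolve(out) for out in outputs}

funcs = [acquisition_cost, spend_b]
assert build_dag(funcs) == {"acquisition_cost": ["spend", "signups"],
                            "spend_b": ["acquisition_cost", "B"]}
assert execute(funcs, ["spend_b"],
               {"spend": 100.0, "signups": 10.0, "B": 3.0}) == {"spend_b": 30.0}
```

      <p>Any dependency that maps to no function (here spend, signups, and B) surfaces as a required input, mirroring section 4.3.2.</p>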
      <sec id="sec-2-2">
        <title>4.4. Driver Code</title>
        <p>Driver code steers execution of the Function DAG,
providing a convenient abstraction layer. Thus the developer
never has to interact with the DAG itself, and instead
utilizes the driver to run and manage their dataflow. It
handles the following:</p>
        <sec id="sec-2-2-1">
          <title>4.4.1. DAG Instantiation</title>
          <p>The Driver directs construction of the Function DAG. Creation of the driver is as simple as the following:</p>
          <preformat>
1 from hamilton import driver
2 from funcs import spend_forecast, spend_data_loader
3
4 config = {...}
5 modules = [spend_data_loader, spend_forecast]
6 dr = driver.Driver(config, *modules, adapter=...)
          </preformat>
          <p>Listing 3: Sample Driver code to instantiate a DAG</p>
          <p>[Figure 1: visualization of an execution path, with user-defined inputs (actuals, signups, spend, B) flowing into acquisition_cost and then spend_b.]</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>4.4.2. DAG Execution</title>
          <p>The driver has two primary methods:</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>4.5. Benefits of Hamilton</title>
        <p>With respect to a data scientist’s workflow, we have found
the following benefits when using Hamilton.</p>
        <sec id="sec-2-3-1">
          <title>4.5.1. Incremental Development</title>
          <p>Rather than requiring execution of a monolithic script, Hamilton pushes the dataflow creator towards incremental, test-driven development. As dataflows are composed of discrete, unit-testable components, modifications to produce new data can be started locally by conducting test-driven development on the function itself. As node execution only requires running upstream dependencies, integrating with the full dataflow is straightforward. The developer need only request computation of the new node via the Hamilton driver to integration test the new addition.</p>
        </sec>
        <sec id="sec-2-3-1b">
          <title>4.5.2. Debugging</title>
          <p>Hamilton makes debugging dataflows simpler by providing a standard, methodical approach. One can isolate bugs by determining the erroneous output, finding the same-name function definition, debugging that logic, and, if no error is found, repeating the trace through each upstream dependency. Standard debugging procedures (such as code-diffing, breakpoints, and bisection) gain in value due to Hamilton's logical mapping of code to produced data. For example, to debug spend_b from our contrived example (listing 1), it is straightforward to visualize its execution path, Figure 1, and thus determine what needs to be debugged.</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>4.5.3. Documentation</title>
          <p>
            The confluence of:
• using function documentation strings
• one-to-one mapping of outputs to functions
• the ability to visualize the DAG and execution
paths
• the @tag() decorator for adding extra metadata
enables a clear and straightforward means to document
transform logic in a standardized way. The function
documentation string is perfect for long form explanations,
and can be exposed via tooling such as sphinx[
            <xref ref-type="bibr" rid="ref27">17</xref>
            ]. The
mapping of function names to outputs ensures that
function names and input parameters are meaningful while
also enabling one to quickly locate the definition of an
output. The ability to visualize the DAG and execution
paths helps provide a big picture mental model for those
learning the code base. The @tag() decorator makes it
easy to add additional metadata concerns, without
cluttering the transform logic itself.
          </p>
        </sec>
        <sec id="sec-2-3-3">
          <title>4.5.4. Central Definition Store</title>
          <p>A common problem for machine learning practitioners is that of leveraging others' work. Most industry solutions target materialized data, e.g. [18], rather than the code itself. As the code in Hamilton maps directly to outputs, module organization is highly incentivized. Curating all modules into a single repository (as the FED team did at Stitch Fix) provides a straightforward approach for a team to refer to and reuse work.</p>
        </sec>
        <sec id="sec-2-3-4">
          <title>4.5.5. Transparent Scaling</title>
          <p>Most distributed computation frameworks, e.g. Dask, Ray, and Spark, follow a lazy execution model: they build a DAG of the computation required prior to distributing execution. As Hamilton's Function DAG is structured using the same approach, it can provide a layer of indirection between dataflow definition and method of execution. In practice, this means that most Hamilton functions do not need modification to run on these distributed computation systems, unless the data type they operate over is not supported by that system. For example, both Spark and Dask implement the Pandas dataframe API, so a user would not have to change their Pandas code to scale to a Dask or Spark cluster, other than changing how they load data for execution.</p>
        </sec>
        <sec id="sec-2-3-5">
          <title>4.5.6. Source Code Based Lineage</title>
          <p>The declarative nature of Hamilton enables an entire end to end ML workflow to be modeled. Column-level lineage from source, to machine learning feature, to the model that consumes it, generally requires additional integration work to ensure its emission and storage, e.g. with Amundsen. With Hamilton, no such integration or system is required. The declarative functions can model this entire process with any tooling that is Python based, as the function source code becomes the source of truth. To build a standalone lightweight lineage system, one need only pair the function definitions, driver code, and configuration with a source code version control system (e.g. git) to snapshot the code (e.g. git commit) when an artifact is created, enabling reconstruction of the DAG for lineage querying purposes.</p>
          <sec id="sec-2-3-5-1">
            <title>5.3. Qualitative assessment</title>
            <p>The initial success criteria for the Hamilton project were all qualitative measures: namely, that a core data science team adopted the tooling, enjoyed using it, and was able to deliver on their business objectives. On all accounts, Hamilton delivered successfully, without any detractors. Since then, two and a half years in production have passed and the same qualitative measures still hold. The team manages over 4000 data transforms, which represents almost a decade of work, written by at least fifteen different team members.</p>
            <p>…take a whole day for a team member to complete prior to Hamilton. After Hamilton, this task takes no more than two hours, which represents a 4x improvement!</p>
          </sec>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6. Summary</title>
      <p>Hamilton is a novel dataflow framework that makes data transformation engineering in Python straightforward. By representing dataflows as a series of simple Python functions, Hamilton produces code that is easy to read and decoupled from execution. This results in transform logic that is always unit testable and documentation friendly, provides lineage out of the box, enables lightweight run time data quality checks, and unlocks fast iteration and debug cycles. It has enabled the FED team at Stitch Fix to scale, managing over 4000 data transforms that create features for time-series modeling. In addition, Hamilton provides a layer of indirection that transparently scales computation onto various distributed computation frameworks (such as Ray, Spark, and Dask), as materialization is decoupled from function transform definitions. This opens the door for exciting future work.</p>
      <sec id="sec-3-2">
        <title>4.5.7. Lineage for Data Privacy/Provenance Concerns</title>
        <p>Hamilton unlocks the ability to provide fine-grained lineage of computation. With the growth of privacy concerns and data regulation, organizations need to know what data comes in, where it goes, and how it is used. Hamilton functions can be marked (via @tag()) with privacy or regulation concerns, e.g. that an output contains Personally Identifiable Information (PII), enabling one to easily surface answers to questions of data usage and data impact from the structure of the DAG.</p>
        <sec id="sec-3-2a">
          <title>5. Evaluation</title>
          <sec id="sec-3-2a-1">
            <title>5.1. Adoption</title>
            <p>To enjoy the benefits of Hamilton, one must use the paradigm. For existing systems, this means a migration needs to occur, which has been the largest friction point to adopting Hamilton. Internally, teams with active feature development for time-series forecasting have been the most prolific adopters, as they are willing to pay the migration/adoption cost to reap the paradigm's benefits. Externally (since October 2021), at minimum, teams using Pandas and wanting to improve software engineering hygiene have been Hamilton's best adopters.</p>
          </sec>
        </sec>
        <sec id="sec-3-2-1">
          <title>5.2. Quantitative assessment</title>
          <p>A quantitative assessment of Hamilton’s benefits to a
team is challenging, as one would have to construct a
tightly controlled experiment, e.g. like [19]. In an
industry environment, however, it is hard to secure resourcing
for such an endeavor. That said, anecdotally, for the FED
team, a monthly feature engineering task of adding and
adjusting data transformations for model fitting used to</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>7. Future Work</title>
      <p>Here we highlight three avenues of future work. For
more, see the open issues in Hamilton’s GitHub repository.</p>
      <sec id="sec-4-1">
        <title>7.1. Source code based data governance</title>
        <p>With Hamilton, one can encode a rich repository of
metadata (see section 4.5.7) into the source code directly.
Because source code is required to perform data
transformations, keeping transform logic synchronized with
tags, data quality checks, and documentation is a simpler
proposition than having that metadata in separate
independent steps of a dataflow or separate systems.
Therefore the source code itself could conceivably be used as a
reliable base for data governance.</p>
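<p>As an illustration, such metadata can be read directly off the function objects. The sketch below uses a simplified stand-in for Hamilton’s @tag decorator (not Hamilton’s actual implementation) to show transform logic and governance metadata living side by side in source code:</p>
<p>
```python
import pandas as pd


def tag(**metadata):
    # Hypothetical stand-in for Hamilton's @tag decorator: it simply
    # attaches governance metadata to the function object itself.
    def decorator(fn):
        fn.__tags__ = dict(metadata)
        return fn
    return decorator


@tag(pii="true", source="user_profiles", owner="data-platform")
def email_domain(email: pd.Series) -> pd.Series:
    """Transform logic and its governance metadata live side by side."""
    return email.str.split("@").str[1]


# Any tool that can import the module can read the metadata straight
# from source-controlled code -- no separate metadata system required.
print(email_domain.__tags__["pii"])
print(email_domain(pd.Series(["a@example.com", "b@test.org"])).tolist())
```
</p>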
        <p>However, how to expose this information for
consumption requires more thought. Does one build directly on
top of the source code? Or does one emit this information
to an existing system, such as a data catalog? For the
former, a new system would need to be built. For the
latter, one could integrate a continuous integration
system that publishes changes when source code is snapshot
(i.e. committed), or augment the Hamilton driver/DAG
walking methodology to emit this information at DAG
instantiation/execution time.</p>
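<p>For the latter route, a continuous integration step could import the transform modules and emit their tag metadata as a document for an external catalog. The sketch below is purely illustrative: the tag stand-in, emit_metadata function, and simulated module are all invented for the example and are not Hamilton’s API:</p>
<p>
```python
import json
import types


def tag(**metadata):
    # Hypothetical stand-in for Hamilton's @tag decorator.
    def decorator(fn):
        fn.__tags__ = dict(metadata)
        return fn
    return decorator


@tag(pii="true", owner="data-platform")
def email_domain(email: str) -> str:
    return email.split("@")[1]


def emit_metadata(module) -> str:
    # Collect every tagged transform in a module into a JSON document
    # that a CI job could publish to a data catalog on each commit.
    entries = [
        {"name": name, "tags": fn.__tags__}
        for name, fn in vars(module).items()
        if callable(fn) and hasattr(fn, "__tags__")
    ]
    return json.dumps(entries, sort_keys=True)


# Simulate a module of transform functions for the example.
my_functions = types.SimpleNamespace(email_domain=email_domain)
print(emit_metadata(my_functions))
```
</p>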
        <p>Similarly, data access/use policies could also be a target
for source code based governance. By tagging functions
that ingest data sources with appropriate data policies,
one could, prior to DAG execution, walk the DAG to
ensure the requesting user and requested DAG execution
meets the policy requirements for those data sources.</p>
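<p>A minimal sketch of this pre-execution policy check follows. It uses plain Python functions and a simplified dependency walk (parameter names resolved to upstream function names, as in Hamilton’s paradigm) rather than Hamilton’s actual driver; the tag decorator is again a hypothetical stand-in:</p>
<p>
```python
import inspect


def tag(**metadata):
    # Hypothetical stand-in for Hamilton's @tag decorator.
    def decorator(fn):
        fn.__tags__ = dict(metadata)
        return fn
    return decorator


@tag(policy="pii")
def raw_emails() -> list:
    return ["a@example.com", "b@test.org"]


def email_count(raw_emails: list) -> int:
    return len(raw_emails)


FUNCTIONS = {fn.__name__: fn for fn in [raw_emails, email_count]}


def required_policies(output: str) -> set:
    # Walk upstream dependencies (parameter name == upstream function
    # name, as in Hamilton) and collect every policy tag encountered.
    policies, stack = set(), [output]
    while stack:
        fn = FUNCTIONS[stack.pop()]
        policies.update(getattr(fn, "__tags__", {}).get("policy", "").split())
        stack.extend(p for p in inspect.signature(fn).parameters)
    policies.discard("")
    return policies


# Before executing, check the requester's clearances against the DAG.
user_clearances = {"pii"}
needed = required_policies("email_count")
print(needed)                            # prints: {'pii'}
print(needed.issubset(user_clearances))  # prints: True
```
</p>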
      </sec>
      <sec id="sec-4-2">
        <title>7.2. Compiling to an orchestration framework</title>
        <p>A common problem with ML tooling is choosing an
orchestration system. This is a big decision, because
companies rarely change this infrastructure. As Hamilton
functions do not define or set materialization concerns,
it cannot be used in place of an orchestration framework
such as Airflow [15], where computation is split into
discrete steps and materialized to a data store in between
steps. If one were to provide node groupings and a
materialization function, then it would be straightforward
to compile the Hamilton Function DAG into any
existing framework. Programmatically defining orchestration
would also unlock the possibility of low-cost
infrastructure migrations while avoiding vendor lock-in.</p>
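<p>To sketch the idea: given node groupings and a materialization function, each grouping compiles into a discrete task that reads its inputs from a store and persists its outputs, which is the execution shape orchestration frameworks expect. The code below is a toy illustration of this compilation, not Hamilton’s API (the names GROUPS, STORE, and make_task are invented for the example):</p>
<p>
```python
import inspect


def spend_doubled(spend: list) -> list:
    return [s * 2 for s in spend]


def spend_total(spend_doubled: list) -> int:
    return sum(spend_doubled)


# Hypothetical inputs to a compiler: node groupings plus a
# materialization function that persists results between steps.
GROUPS = [["spend_doubled"], ["spend_total"]]
FUNCTIONS = {"spend_doubled": spend_doubled, "spend_total": spend_total}
STORE = {}  # stands in for a data store between orchestrated steps


def materialize(name, value):
    STORE[name] = value


def make_task(group):
    # Compile one node grouping into a discrete orchestration step that
    # reads upstream values from the store and materializes its outputs.
    def task():
        for name in group:
            fn = FUNCTIONS[name]
            args = {p: STORE[p] for p in inspect.signature(fn).parameters}
            materialize(name, fn(**args))
    return task


# An orchestrator (Airflow, etc.) would run these tasks in order.
STORE["spend"] = [10, 20, 30]
for task in map(make_task, GROUPS):
    task()
print(STORE["spend_total"])  # prints: 120
```
</p>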
      </sec>
      <sec id="sec-4-3">
        <title>7.3. Modeling your entire data warehouse independently of materialization concerns</title>
        <p>Common industry data tools and orchestration
frameworks leak materialization concerns into the user
experience. For example, using SQL, the end user has to think
in tables. This naturally cascades to how data is
materialized and transferred between workflows. What if,
instead, one could model the dependencies of one’s data
transforms, independently of how and where the data is
stored? The declarative nature of Hamilton unlocks this
possibility.
</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>A. A full Hamilton Hello World Example</title>
      <p>## --- in my_functions.py
import pandas as pd


def avg_3wk_spend(spend: pd.Series) -&gt; pd.Series:
    """Rolling 3 week average spend."""
    return spend.rolling(3).mean()


def spend_per_signup(spend: pd.Series, signups: pd.Series) -&gt; pd.Series:
    """The cost per signup in relation to spend."""
    return spend / signups


def spend_mean(spend: pd.Series) -&gt; float:
    """Shows function creating a scalar. In this case
    it computes the mean of the entire column."""
    return spend.mean()


def spend_zero_mean(spend: pd.Series, spend_mean: float) -&gt; pd.Series:
    """Shows function that takes a scalar. In this
    case to zero mean spend."""
    return spend - spend_mean


def spend_std_dev(spend: pd.Series) -&gt; float:
    """Function that computes the standard deviation
    of the spend column."""
    return spend.std()</p>
      <p>## in run.py
import pandas as pd
from hamilton import driver
import my_functions  # we import user functions here

initial_columns = {  # load from actuals or wherever -- this is our initial data we use as input. Example values:
    'signups': pd.Series([1, 10, 50, 100, 200, 400]),
    'spend': pd.Series([10, 10, 20, 40, 40, 50]),
}
# instantiate the DAG - multiple modules can be passed
dr = driver.Driver(initial_columns, my_functions)
# we need to specify what we want in the final dataframe
output_columns = [
    'spend',
    'signups',
    'avg_3wk_spend',
    'spend_per_signup',
    'spend_mean',
    'spend_zero_mean',
    'spend_std_dev',
]
# by default execution returns a dataframe
df = dr.execute(output_columns)
print(df.to_string())

# To visualize do 'pip install sf-hamilton[visualization]' if you want these to work
dr.visualize_execution(output_columns, './my_dag.dot', {})
dr.display_all_functions('./my_full_dag.dot')</p>
      <p>Listing 5: A full hello world example.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>Eric Colson, Beware the data science pin factory: The power of the full-stack data science generalist and the perils of division of labor through function, 2019. URL: https://multithreaded.stitchfix.com/blog/2019/03/11/FullStackDS-Generalists/.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>Stefan Krawczyk, Elijah ben Izzy, Danielle Quinn, Hamilton: a framework for defining dataflows, 2021. URL: https://github.com/stitchfix/hamilton.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>P. Moritz, Ray: A Distributed Execution Engine, UC Berkeley, 2019. URL: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-124.html.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, et al., Apache Spark: a unified engine for big data processing, Commun. ACM 59 (2016) 56-65. URL: http://doi.acm.org/10.1145/2934664. doi:10.1145/2934664.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>Various, Dask: Library for dynamic task scheduling, 2016. URL: https://dask.org.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>Pandas dev. team, pandas-dev/pandas: Pandas, 2020. URL: https://doi.org/10.5281/zenodo.3509134. doi:10.5281/zenodo.3509134.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>Various, OpenLineage: An open framework for data lineage collection and analysis, 2017. URL: https://openlineage.io/.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>Various, Datahub, 2020. URL: https://github.com/datahub-project/datahub.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>Various, Amundsen, 2019. URL: https://github.com/amundsen-io/amundsen.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>Niels Bantilan, pandera: Statistical Data Validation of Pandas Dataframes, in: Proceedings of the 19th Python in Science Conference, 2020, pp. 116-124. doi:10.25080/Majora-342d178e-010.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>S. Schelter, D. Lange, P. Schmidt, M. Celikel, F. Biessmann, et al., Automating large-scale data quality verification, Proceedings of the VLDB Endowment 11 (2018) 1781-1794.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>Various, Great expectations, 2017. URL: https://github.com/great-expectations/great_expectations.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>Various, Metaflow: a framework for real-life data science, 2020. URL: https://github.com/Netflix/metaflow.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>Various, Prefect workflow management system, 2017. URL: https://github.com/PrefectHQ/prefect.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>Various, Apache airflow, 2015. URL: https://github.com/apache/airflow.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>Various, Dagster: An orchestration platform for the development, production, and observation of data assets, 2020. URL: https://github.com/dagster-io/dagster.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>Georg Brandl, Sphinx documentation, 2008. URL: https://www.sphinx-doc.org/en/master/.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>T. Kakantousis, A. Kouzoupis, F. Buso, G. Berthou, J. Dowling, S. Haridi, Horizontally scalable ml pipelines with a feature store, in: Proc. 2nd SysML Conf., Palo Alto, USA, 2019.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>D. L. Moody, Cognitive load effects on end user understanding of conceptual models: An experimental analysis, in: A. Benczúr, J. Demetrovics, et al. (Eds.), Advances in Databases and Information Systems, Springer Berlin Heidelberg, Berlin, Heidelberg, 2004, pp. 129-143.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>