Hamilton: enabling software engineering best practices for
data transformations via generalized dataflow graphs
Stefan Krawczyk1,* , Elijah ben Izzy1 and Danielle Quinn1
1 Stitch Fix, 1 Montgomery Tower, Suite 1500, 94104, San Francisco, California, USA
stefank@cs.stanford.edu (S. Krawczyk); elijah.benizzy@stitchfix.com (E. ben Izzy); danielle.quinn@stitchfix.com (D. Quinn)


Abstract
While data science, as a high-level consumer and producer of data within data ecosystems, has grown in prevalence within organizations, software engineering practices for data science code bases have not. Stereotypical data science code is not known for unit testing coverage, ease of documentation, reusability, or enabling quick incremental development as it grows. Over time, this lack of software engineering quality impacts a maintainer's ability to make progress within a data ecosystem. The data platform team at Stitch Fix created Hamilton to solve these software engineering pain points with respect to data transformations. It does this by requiring a programming paradigm change that enables straightforward specification and execution of dataflow graphs. Hamilton has enabled data science teams at Stitch Fix to scale their code bases to support 4000+ data transformations, by ensuring that transformation code is always unit testable, documentation friendly, easily curated, reusable, and amenable to fast incremental development. Hamilton also enables transparently scaling computation onto distributed systems such as Dask, Ray, and Spark, without requiring a rewrite of data transform logic. Hamilton therefore represents a novel approach to modeling dataflows that is decoupled from materialization concerns, and presents an industry-pragmatic avenue for building a simpler user experience for high-level data ecosystem practitioners. Hamilton is available as open source code.


1. Introduction

With the shift to "Full Stack Data Science"[1], data scientists are expected to not only do data science, but also engineer and manage data pipelines for their production models. This additional responsibility places burdens on data scientists, who no longer hand their ideas off to a software engineering team for implementation and maintenance. This burden becomes especially acute in the domain of time-series forecasting, where data transformation needs involve creating an ever increasing number of features (columns) in a dataframe (table) for use with model fitting/forecasting. To create better time-series forecasts, one is continually seeking to add more features by incorporating new data, updating existing features, and deriving new features from existing ones. The majority of features are the product of a chain of transformations over other features. At Stitch Fix, the Forecasting, Estimation, and Demand (FED) team had curated a code base over the course of several years to produce a dataframe for fitting time-series models with thousands of such features. Unfortunately, maintaining and adding features to the code base had become burdensome to the point where their delivery of work slowed significantly. Unit-testing was virtually non-existent, documentation was scattered and inconsistent, and determining feature lineage grew in difficulty with the number of transforms.

The Hamilton framework[2] was therefore conceived to mitigate the FED team's software engineering pain points. Specifically, Hamilton enables a simpler paradigm to create, maintain, and execute code for data engineering, especially in the case of highly complex data transformation dependency chains. Hamilton does this by deriving a directed acyclic graph (DAG) of dependencies from specially defined Python functions that describe the user's intended dataflow. Altogether, Hamilton makes incremental development, code reuse, unit testing, determining lineage, and documentation natural and straightforward. Furthermore, it provides avenues to quickly and easily scale computation onto various distributed computation frameworks, e.g. Ray[3]/Spark[4]/Dask[5], without changing much code.

We will first provide some examples of typical software engineering pain points with data transformations at Stitch Fix, then talk about related tooling, and spend the rest of this report diving into Hamilton's programming paradigm. We will show the benefits this paradigm brings, provide a lightweight evaluation of the framework, and finish with a summary and a description of future work.


2. Software engineering pain points with data transformations

Since software engineering pain points are somewhat subjective, we present the following Python script using Pandas[6] to illustrate common software engineering pain points we encountered at Stitch Fix. It demonstrates creating data transforms that represent features to fit a time-series model.
1  # create_features.py
2  import pandas as pd
3  from library import loader, is_holiday, is_uk_holiday, save_df
4
5  def compute_bespoke_feature(df: pd.DataFrame) -> pd.Series:
6      """Some documentation explaining what this is"""
7      return (df['A'] - df['B'] + df['C']) * loader.get_weights()
8
9  def multiply_columns(col1: pd.Series,
10                      col2: pd.Series) -> pd.Series:
11     """Some documentation explaining what this is"""
12     return col1 * col2
13
14 def run(dates, config):
15     df = loader.load_actuals(dates)  # e.g. spend, signups
16     if config['region'] == 'UK':
17         df['holidays'] = is_uk_holiday(df['year'], df['week'])
18     else:
19         df['holidays'] = is_holiday(df['year'], df['week'])
20     df['avg_3wk_spend'] = df['spend'].rolling(3).mean()
21     df['acquisition_cost'] = df['spend'] / df['signups']
22     df['spend_shift_3weeks'] = df['spend'].shift(3)
23     df['special_feature1'] = compute_bespoke_feature(df)
24     df['spend_b'] = multiply_columns(df['acquisition_cost'], df['B'])
25     save_df(df, "some_location")
26 if __name__ == '__main__':
27     run(dates=..., config=...)

Listing 1: Example script that loads data, transforms data into features, and saves them
Listing 1 demonstrates the highly heterogeneous nature of data transformation code. The run function:

    1. loads some data into a central dataframe object (line 15).
    2. adds and derives features through various means:
        a) inline code that directly alters the dataframe (lines 20-22).
        b) a function that takes the whole dataframe and assigns the result to a new column (lines 5, 23).
        c) a function that uses columns from the central dataframe and assigns the result to a new column (lines 9, 17, 19, 24).
        d) a conditional branch that changes the implementation used to compute a column based on some configuration (lines 16-19).
    3. contains only sporadic documentation.
    4. relies heavily on code execution order; line 21 has to occur before line 24.

At only twenty-seven lines, the code in Listing 1 looks innocuous. However, scaling this script from six to 1000+ data transforms (as occurred at Stitch Fix) presents the following problems:

2.1. Inconsistent unit test coverage

Only three of the derived features lend themselves towards straightforward unit testing. One cannot unit test the inline dataframe manipulations without running the entire script, so the code base inevitably has minimal, if any, test coverage. In such a codebase, it is difficult to determine behavioral changes when code changes.
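To make the testability gap concrete, the extracted multiply_columns function from Listing 1 can be exercised in isolation, while the inline manipulations on lines 20-22 cannot be targeted without executing run() end to end. A minimal sketch using pytest (test values are illustrative):

# test_create_features.py -- an illustrative sketch
import pandas as pd

from create_features import multiply_columns


def test_multiply_columns():
    col1 = pd.Series([1.0, 2.0, 3.0])
    col2 = pd.Series([2.0, 2.0, 2.0])
    result = multiply_columns(col1, col2)
    pd.testing.assert_series_equal(result, pd.Series([2.0, 4.0, 6.0]))

# No analogous test can target line 21 directly: the expression
# df['acquisition_cost'] = df['spend'] / df['signups'] only runs inside
# run(), which also loads and saves data as side effects.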
2.2. Code readability and documentation

Well organized code with documentation is critical for a maintainer to understand and contribute to a codebase. It ensures information is not siloed in the original developer's mind, and that newcomers to the codebase can quickly become productive. In Listing 1, code readability and documentation are tragically lost between inline manipulations, functions, and the organization of the run function. Identifying the logic used to derive a feature is far from trivial, even with the best developer tools.

2.3. Difficulty in tracing data lineage

At six features, tracing lineage of inputs to a data transform is not particularly difficult. At 1000+ data transforms, however, this is a challenging task. At Stitch Fix, there are chains of transformation that span over fourteen such functions, with the average transformation chain length just over five.

In order to add a new data transform, a developer has to make a decision as to where to put it. It could be at the end of the run function, or ideally near some logical grouping of transforms. However, there is no forcing function for a developer to do so, which inevitably leads to critical transform code spread throughout the entire codebase. A "spaghetti" codebase like this results in slow and frustrating debug cycles, requiring the cognitive burden of internalizing a mental map of computation in order to identify and fix problems. The ability to debug is then heavily correlated with tenure on the team!

2.4. Integration testing requires calculating all data transforms

While feature generating scripts such as Listing 1 are initially quick to execute, they grow into a large monolith. In order to test the integration of a new feature, one has to run the entire script. As the script inevitably grows with the increasing complexity of a problem space, it takes longer to run, and thus longer to iterate on, fix bugs, and improve.

2.5. Code Reuse & Duplication

Because transform logic is not well encapsulated, code reuse is difficult to achieve outside of the current context of the script. Good software engineering practices advise consistently refactoring code for reuse; however, this is easy to skip. It is simpler for a data scientist to instead find the relevant code and cut & paste it into their new context, especially when they are short on time. Left unchecked, this behavior creates more monolithic scripts and propagates the problem.
3. Related tooling

In industry, there are a few tools that come to mind when discussing some of the pain points above.

3.1. Lineage/Data Catalogs

OpenLineage[7] is a framework for data lineage collection and analysis. It aims to provide an open standard to enable disjoint tools to emit lineage metadata that can then be centrally tracked and curated. It requires an owner to implement the standard, as well as maintain infrastructure to collect the emitted lineage metadata. It is designed for tracking materialization of whole data sets; it cannot track lineage at a columnar level.

Data catalogs like Datahub[8] and Amundsen[9] are systems of record with which one can emit and store lineage and other metadata (e.g. for GDPR purposes). They require one to explicitly integrate with their APIs to capture this information. They are only as useful as the information provided to them, so a developer needs to explicitly consider integration as part of their development workflow.

3.2. Data Quality

When one thinks about data transformations and testing data, one often thinks of Pandera[10], Deequ[11], or Great Expectations[12].

Pandera is a stateless, lightweight API for performing data validation on Pandas dataframes (i.e. in-memory tables). Its focus is to provide a quick mechanism to define expectations in code to create robust data processing pipelines. It has a small Python dependency footprint, so it is easy to install and embed within a pipeline, enabling it to live close to transform logic.

Deequ is a stateful, heavy-weight framework that requires peripheral services to operate. It is built on top of Apache Spark and aims to define "unit tests for data" that help validate data quality expectations over large datasets. After a dataset has been constructed, the user defines expectations over that data, which are then checked via execution on Apache Spark.

Great Expectations, like Deequ, is also a heavy-weight framework, but is more broadly applicable to Python. It allows one to validate, document, and profile data to ensure data quality. It follows a similar implementation pattern to Deequ, as one needs to explicitly integrate it into a dataflow after dataset construction.

None of these frameworks are meant to be run like unit tests, and thus they are not designed for testing transform logic.

As for the user experience, one has to explicitly add data quality test(s) into a dataflow. Determining how to add tests, when to add tests, and how to maintain them as dataflows evolve places extra burden on the dataflow developer. For example, it is possible to change data transform logic and forget to update data quality expectations if they are defined in separate steps of the dataflow, located in a different file in the code base, or stored externally in a datastore. Analogously, if a data quality check fails, it can be similarly difficult to determine what source code generated the data, if one does not link the data quality test appropriately via naming or documentation.

3.3. Orchestration Frameworks

Similar in approach to Hamilton are orchestration frameworks [13, 14, 15, 16]. They too model their operations via a DAG; however, their focus is modeling a user's end to end workflow at a macro-level. Specifically, they model discrete steps, at each of which an artifact is created and data is materialized. For example, in one step, raw data is ingested, transformed, and saved as a table, and in a subsequent step, a machine learning model is trained on that data and that model is saved.

These frameworks also do not try to address any software engineering pain points a data transformation developer might have.


4. Hamilton Framework

The Hamilton framework alleviates the pain points described in Section 2 through three distinct concepts:

    • Hamilton functions: the low-level unit of work developers use to encode dataflow components.
    • Function DAG: the representation of the dataflow's dependency structure, built by combining function definitions.
    • Driver code: the code used to execute Hamilton functions by specifying the functions used to build the DAG, the inputs to execution, and the parts of the DAG to run.

For those eager to see a simple Hello World, we direct readers to Listing 5 in the Appendix.
4.1. Hamilton Functions

Hamilton functions force a novel programming paradigm on the user. Like regular Python functions, they encapsulate computational logic. However, the user is not responsible for invoking functions and assigning the results to a variable. Instead, this is encoded in the structure of the function itself in a declarative manner. The function name serves to specify, or declare, the intended output variable, and the function input parameters (as well as their type-annotations) map to expected input variables, i.e. declared dependencies. In the context of creating a dataframe, the function name serves as the intended output column name, and the function input parameters serve as the expected input columns/values. Type annotations on the function and the variables are required by the Hamilton framework.

Note (1): Hamilton can be used to model any Python object creation. For the remainder of this paper, we will stick to the context of creating Pandas dataframes. Note (2): if Hamilton functions have wildly different Python dependency requirements, using Hamilton is still possible; one would just partition DAG execution into multiple steps matching the different Python dependency requirements.
# rather than
df['acquisition_cost'] = df['spend'] / df['signups']

# a user would instead write
def acquisition_cost(signups: pd.Series, spend: pd.Series) -> pd.Series:
    """Example showing a simple Hamilton function"""
    return spend / signups

Listing 2: the core Hamilton programming paradigm with dataframes
Listing 2 shows an example of the Hamilton paradigm and what it is replacing. Hamilton's breakdown of the example function's components is demonstrated in Table 1. By defining functions in this manner, the developer specifies their intended dataflow. This method of writing Python functions has a variety of implications:

Table 1
How functions become nodes in the DAG, using the function defined in Listing 2 as an example.

    Function component    Example from Listing 2                           Role in the DAG node
    Function Name         acquisition_cost                                 Node name
    Type-hints            pd.Series                                        Node input & output types
    Parameter Names       signups, spend                                   Upstream dependencies
    Documentation         "Example showing a simple Hamilton function"     Node documentation
    Function Body         return spend / signups                           Node definition

4.1.1. Verbosity

This approach increases the lines of code required to describe simple operations. However, the benefits outweigh the cost. Inputs are clearly specified, and logic is automatically encapsulated in named functions.

4.1.2. Unit Testing

As Hamilton functions contain well encapsulated logic and clearly specify inputs, all data transform code is unit testable!
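For example, the acquisition_cost function from Listing 2 is a plain Python function, so a test requires no Hamilton machinery at all (a minimal sketch; test values are illustrative):

# test_my_functions.py -- an illustrative sketch; no Hamilton imports needed
import pandas as pd

from my_functions import acquisition_cost


def test_acquisition_cost():
    spend = pd.Series([10.0, 20.0, 40.0])
    signups = pd.Series([1.0, 10.0, 100.0])
    actual = acquisition_cost(signups=signups, spend=spend)
    pd.testing.assert_series_equal(actual, pd.Series([10.0, 2.0, 0.4]))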
4.1.3. Code readability and documentation

    1. Encapsulating feature logic in functions implies a natural location for documentation (namely the Python docstring).
    2. Coupling the name of the function with a reusable downstream artifact forces more meaningful naming. It is trivial to determine the definition of a feature and locate its usage. One needs to simply search the code base for a function with that name, or which has that name as an input parameter.

4.1.4. Vector friendly computation

In the case of creating dataframes, the Hamilton programming paradigm pushes a user to write a function to create a single column, with inputs as columns as well. This naturally leads the developer to write logic that can utilize vector computation, which often speeds up execution.

4.1.5. Functions as the core interface

Python functions have well defined boundaries; inputs go in, and one output comes out. They can be serialized, inspected, and executed. Therefore, functions are used as a universal interface and building block for both the user experience and the framework. A user does not need to implement nor understand a special interface to use the core Hamilton features. Similarly, the framework, without knowing the exact shape of the function beforehand, has a clear object to work with, and can wrap a user's functions to inject operational concerns via decorators (see 4.2) or at run time (see 4.3.3).

4.2. Advanced Hamilton Functions

In an effort to encapsulate operational concerns and reduce repetitive function logic, Hamilton comes with a variety of decorators. Decorators primarily fulfill one of the following purposes:

    1. Determining whether a function should exist. if/else blocks are dropped in favor of readable annotations (e.g. @config in Listing 4).
    2. Parameterizing function creation. A single function can create multiple nodes.
    3. Simplifying function logic by promoting reuse. Syntactic sugar can help reduce verbosity and repeated code (e.g. @extract_columns in Listing 4).
    4. Modeling operational concerns in a modular manner. For example, adding metadata for GDPR purposes, or specifying run time data expectations.

Hamilton decorators are extensible and can also be layered to enable highly expressive functions.

Note, as functions are the core interface (see 4.1.5), the abstraction provided by Hamilton's decorator system gives, for example, a platform team a clear and decoupled way to plug into the user's function writing experience, while providing a clear way to manage and service their decorator implementations. Done correctly, user function definitions remain static through platform changes.

With respect to data ecosystems, we will explain two relevant Hamilton decorators: @check_output() and @tag(). We direct readers to the Hamilton documentation [2] for more information on other decorators.
4.2.1. @check_output

In machine learning (ML) dataflows, data quality issues are a common cause of model problems. It is a best practice to set up data expectations to mitigate these problems. However, as explained in section 3.2, one typically needs to integrate such a concern into a dataflow explicitly. With Hamilton, integrating data quality expectations is less burdensome, as this takes the form of a lightweight Python decorator, @check_output(), with which one can simply annotate their Hamilton functions. Doing so enables transform logic and data expectations to be co-located, without cluttering the user's dataflow. There is no need to maintain separate code bases and data stores, or manually integrate checks as an explicit step of a dataflow. Therefore maintenance and operational costs are low for adding runtime data quality checks to a dataflow.

At DAG construction time, Hamilton automatically adds nodes to the DAG to check the output of the decorated function. At run time, after executing the user function, Hamilton validates the provided expectations, surfacing data quality errors to the dataflow developer via logging, or stopping execution altogether if desired. See Listing 4 for an example of usage.

4.2.2. @tag

As data systems and environments change over time, different metadata needs arise. Rather than requiring explicit integrations with metadata systems, or enforcing a specific schema, Hamilton enables a lightweight way to annotate functions with such concerns. @tag() takes in string key-value pairs, and is thus amenable to annotating functions with anything relevant to your particular data ecosystem, e.g. ownership, source table names, GDPR concerns, project names, etc. These tags are then attached to nodes in the DAG, which can then be used as a basis for querying for nodes, or asking graph questions of the DAG. See Listing 4 for an example of usage.

4.3. The Function DAG

The Function DAG is the framework's representation of the nodes that should be executed and the dependencies between them.

4.3.1. Node Creation

Hamilton resolves the mapping of functions (e.g. Listing 2) to nodes. In the case of Hamilton functions annotated with one or more decorators, a resolution step occurs to determine how many nodes to create (e.g. in the case of a parameterized function) and what the nodes should be named. Functions beginning with _ are presumed to be helper functions and are thus excluded from the DAG.
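As a sketch of these naming rules (function bodies are illustrative):

import pandas as pd

def _fill_gaps(series: pd.Series) -> pd.Series:
    # leading underscore: treated as a helper, so no node is created
    return series.fillna(0.0)

def total_spend(spend: pd.Series) -> pd.Series:
    # becomes a node named "total_spend" with "spend" as its dependency;
    # calling the helper directly is plain Python, invisible to the DAG
    return _fill_gaps(spend).cumsum()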
4.3.2. Constructing the DAG

Hamilton compiles the DAG from a list of Python modules containing Hamilton functions and optional configuration. It collects the relevant functions to create nodes, determines node dependencies, and assigns edges between them. Any dependency that does not map to a known node is marked as a required input for execution.
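For instance, in the following sketch nothing defines spend or signups, so both become required inputs that must be supplied at execution time (see the driver's inputs argument in section 4.4.2):

# dataflow.py -- an illustrative sketch
import pandas as pd

def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    # "spend" and "signups" are not produced by any function in this
    # module, so Hamilton marks them as required inputs for execution
    return spend / signups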
4.3.3. Walking the DAG

Given desired outputs, a topological sorting of the DAG is performed to determine the execution order. As the DAG is walked, additional operational concerns are injected, e.g. checking inputs and matching against function input types, delegating function computation, and constructing the final object returned from execution.
4.4. Driver Code

Driver code steers execution of the Function DAG, providing a convenient abstraction layer. Thus the developer never has to interact with the DAG itself, and instead utilizes the driver to run and manage their dataflow. It handles the following:

4.4.1. DAG Instantiation

The Driver directs construction of the Function DAG. Creation of the driver is as simple as the following:

from hamilton import driver
from funcs import spend_forecast, spend_data_loader

config = {...}
modules = [spend_data_loader, spend_forecast]
dr = driver.Driver(config, *modules, adapter=...)

Listing 3: Sample Driver code to instantiate a DAG
The call to instantiate the driver accepts a config argument. This takes the form of a dictionary with string keys and Python objects as values, and serves two purposes: (1) it helps determine the shape of the DAG when coupled with appropriate decorators (section 4.2); (2) it sets inputs that a user wants to be invariant between DAG execution runs. Meanwhile, the optional adapter argument controls execution (such as delegating to Dask), and determines the object type returned from DAG execution.
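For instance, with the functions later shown in Listing 4, the config drives which holidays implementation becomes a node, as @config.when() resolves the choice at DAG construction time rather than with runtime if/else branches (a sketch; the module name follows Listing 4):

from hamilton import driver
import my_functions

# {'region': 'UK'} selects holidays__uk, which is registered
# in the DAG as the node "holidays"
dr = driver.Driver({'region': 'UK'}, my_functions)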
4.4.2. DAG Execution

The driver has two primary methods:

    1. execute(outputs_wanted, inputs, overrides) executes the DAG, computing only what is required to create the output, and returns a Python object, e.g. a Pandas dataframe.
    2. visualize_execution(outputs_wanted, inputs, ...) visualizes the parts of the DAG required for execution.

Note that the developer can pass parameters to the DAG through two Python dictionaries: inputs and overrides. Inputs specifies runtime inputs to the DAG, providing requisite dependencies that are not satisfied by existing nodes. Overrides enables the developer to bypass execution of specified nodes, effectively short-circuiting their computation. Hamilton will forego computation of any upstream node depended on solely by overridden nodes. By offering these parametrization capabilities, Hamilton enables precise control over the dataflow's structure and execution.
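Continuing with the driver instantiated above and the functions in Listing 4, a sketch of parameterized execution (input values are illustrative):

import pandas as pd

# dates is not produced by any node, so it must be passed as an input
df = dr.execute(['spend_b'],
                inputs={'dates': pd.date_range('2022-01-01', periods=6)})

# Overriding acquisition_cost bypasses its computation; upstream nodes
# needed only by it are skipped entirely
df = dr.execute(['spend_b'],
                inputs={'dates': pd.date_range('2022-01-01', periods=6)},
                overrides={'acquisition_cost': pd.Series([1.0] * 6)})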
[Figure 1: a rendered execution-path DAG. User-defined input (UD) dates feeds actuals, from which signups and spend are extracted; those feed acquisition_cost, which together with B feeds spend_b.]
Figure 1: Example rendering produced by running visualize_execution() on an instantiated DAG, if one was interested in computing spend_b from Listing 1 as implemented in Hamilton in Listing 4. Hamilton makes it straightforward to determine what is required to compute a feature. UD refers to user defined input. Note: for diagram legibility, we omitted displaying the validation nodes that the @check_output() decorator would add to the DAG.

# in a module, e.g. my_functions.py
# (imports added here for completeness)
import numpy as np
import pandas as pd
from hamilton.function_modifiers import check_output, config, extract_columns, tag
from library import loader, is_holiday, is_uk_holiday, save_df

@tag(source="prod.denormalized", owner="team:DE")
@extract_columns('year', 'week', 'spend', 'signups', 'A', 'B', 'C')
def actuals(dates: 'a_date_type') -> pd.DataFrame:
    return loader.load_actuals(dates)

@check_output(data_type=np.float64, allow_nans=False)
def weights() -> pd.Series:
    return loader.get_weights()

@config.when(region='UK')
def holidays__uk(year: pd.Series, week: pd.Series) -> pd.Series:
    return is_uk_holiday(year, week)

@config.when(region='US')
def holidays__us(year: pd.Series, week: pd.Series) -> pd.Series:
    return is_holiday(year, week)

def avg_3wk_spend(spend: pd.Series) -> pd.Series:
    return spend.rolling(3).mean()

def acquisition_cost(spend: pd.Series, signups: pd.Series) -> pd.Series:
    return spend / signups

def spend_shift_3weeks(spend: pd.Series) -> pd.Series:
    return spend.shift(3)

def special_feature1(A: pd.Series, B: pd.Series, C: pd.Series, weights: pd.Series) -> pd.Series:
    """Some documentation explaining what this is"""
    return (A - B + C) * weights

@check_output(data_type=np.float64, range=(0.0, 100.0), allow_nans=False)
def spend_b(acquisition_cost: pd.Series, B: pd.Series) -> pd.Series:
    """documentation to explain this function"""
    return acquisition_cost * B

## In a separate script/module, e.g. run.py,
## code to create and execute the DAG
from hamilton import driver
import my_functions

config = {...}  # configuration
modules = [my_functions]  # modules to crawl
dr = driver.Driver(config, *modules)
df = dr.execute(['year', 'week', 'holidays', 'acquisition_cost', ...])  # materialize
save_df(df, "some_location")  # save result

Listing 4: Hamilton version of the earlier example script in Listing 1, with four decorators used to show example usage.
4.5. Benefits of Hamilton

With respect to a data scientist's workflow, we have found the following benefits when using Hamilton.

4.5.1. Incremental Development

Rather than requiring execution of a monolithic script, Hamilton pushes the dataflow creator towards incremental, test-driven development. As dataflows are composed of discrete, unit-testable components, modifications to produce new data can be started locally by conducting test-driven development on the function itself. As node execution only requires running upstream dependencies, integrating with the full dataflow is straightforward. The developer need only request computation of the new node via the Hamilton driver to integration test the new addition.

4.5.2. Debugging

Hamilton makes debugging dataflows simpler by providing a standard, methodical approach. One can isolate bugs by determining the erroneous output, finding the same-name function definition, debugging that logic, and, if no error is found, repeating the process through each upstream dependency. Standard debugging procedures (such as code-diffing, breakpoints, and bisection) gain in value due to Hamilton's logical mapping of code to produced data. For example, to debug spend_b from our contrived example (Listing 1), it is straightforward to visualize its execution path (Figure 1), and thus determine what needs to be debugged.

4.5.3. Documentation

The confluence of:

    • using function documentation strings
    • one-to-one mapping of outputs to functions
    • the ability to visualize the DAG and execution paths
    • the @tag() decorator for adding extra metadata

enables a clear and straightforward means to document transform logic in a standardized way. The function documentation string is perfect for long form explanations, and can be exposed via tooling such as Sphinx[17]. The mapping of function names to outputs ensures that function names and input parameters are meaningful, while also enabling one to quickly locate the definition of an output. The ability to visualize the DAG and execution paths helps provide a big picture mental model for those learning the code base. The @tag() decorator makes it easy to add additional metadata concerns, without cluttering the transform logic itself.

4.5.4. Central Definition Store

A common problem for machine learning practitioners is that of leveraging others' work. Most industry solutions target materialized data, e.g. [18], rather than the code itself. As the code in Hamilton maps directly to outputs, module organization is highly incentivized. Curating all modules into a single repository (as the FED team did at Stitch Fix) provides a straightforward approach for a team to refer to and reuse work.

4.5.5. Transparent Scaling

Most distributed computation frameworks, e.g. Dask, Ray, and Spark, follow a lazy execution model: they build a DAG of the computation required prior to distributing execution. As Hamilton's Function DAG is structured using the same approach, it can provide a layer of indirection between dataflow definition and method of execution. In practice, this means that most Hamilton functions do not need modification to run on these distributed computation systems, unless the data type they operate over is not supported by that system. For example, both Spark and Dask implement the Pandas dataframe API, so a user would not have to change their Pandas code to scale to a Dask or Spark cluster, other than changing how they load data for execution.
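As a sketch of what this can look like with Dask (the module path and class names below reflect the open source package at the time of writing and may differ between versions):

from dask.distributed import Client
from hamilton import base, driver
from hamilton.experimental import h_dask

import my_functions

client = Client()  # a local Dask cluster, for this sketch
adapter = h_dask.DaskGraphAdapter(client, base.PandasDataFrameResult())
# the function definitions are unchanged -- only the driver setup differs
dr = driver.Driver({'region': 'UK'}, my_functions, adapter=adapter)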
4.5.6. Source Code Based Lineage

The declarative nature of Hamilton enables an entire end to end ML workflow to be modeled. Column-level lineage, from source, to machine learning feature, to the model that consumes it, generally requires additional integration work to ensure its emission and storage, e.g. with Amundsen. With Hamilton, no such integration or system is required. The declarative functions can model this entire process with any tooling that is Python based, as the function source code becomes the source of truth. To build a standalone lightweight lineage system, one need only pair the function definitions, driver code, and configuration with a source code version control system (e.g. git) to snapshot the code (e.g. git commit) when an artifact is created, enabling reconstruction of the DAG for lineage querying purposes.

4.5.7. Lineage for Data Privacy/Provenance Concerns

Hamilton unlocks the ability to provide fine grained lineage of computation. With the growth of privacy concerns and data regulation, organizations need to know what data comes in, where it goes, and how it is used. Hamilton functions can be marked (via @tag()) with privacy or regulation concerns, e.g. that an output contains Personally Identifiable Information (PII), enabling one to easily surface answers to questions of data usage and data impact from the structure of the DAG.
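A sketch of what this could look like; the tag key is arbitrary, and the driver introspection call shown (list_available_variables()) is illustrative and may differ between versions:

import pandas as pd
from hamilton.function_modifiers import tag

@tag(PII="true", owner="team:DE")
def customer_email(raw_customers: pd.DataFrame) -> pd.Series:
    return raw_customers['email']

# with an instantiated driver dr, one can then ask graph questions, e.g.
pii_nodes = [var.name for var in dr.list_available_variables()
             if var.tags.get("PII") == "true"]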
5. Evaluation

5.1. Adoption

To enjoy the benefits of Hamilton, one must use the paradigm. For existing systems, this means a migration needs to occur, which has been the largest friction point to adopting Hamilton. Internally, teams with active feature development for time-series forecasting have been the most prolific adopters, as they are willing to pay the migration/adoption cost to reap the paradigm's benefits. Externally (since October 2021), teams using Pandas and wanting to improve software engineering hygiene have been Hamilton's best adopters.

5.2. Quantitative assessment

A quantitative assessment of Hamilton's benefits to a team is challenging, as one would have to construct a tightly controlled experiment, e.g. like [19]. In an industry environment, however, it is hard to secure resourcing for such an endeavor. That said, anecdotally, for the FED team, a monthly feature engineering task of adding and adjusting data transformations for model fitting used to take a whole day for a team member to complete prior to Hamilton. After Hamilton, this task takes no more than two hours, which represents a 4x improvement!

5.3. Qualitative assessment

The initial success criteria for the Hamilton project were all qualitative measures: namely, that a core data science team adopted the tooling, enjoyed using it, and was able to deliver on their business objectives. On all accounts, Hamilton delivered successfully, without any detractors. Since then, two and a half years in production have passed and the same qualitative measures still hold. The team manages over 4000 data transforms, which represent almost a decade of work, written by at least fifteen different team members.


6. Summary

Hamilton is a novel dataflow framework that makes data transformation engineering in Python straightforward. By representing dataflows as a series of simple Python functions, Hamilton produces code that is easy to read and decoupled from execution. This results in transform logic that is always unit testable and documentation friendly, provides lineage out of the box, enables lightweight run time data quality checks, and unlocks fast iteration and debug cycles. It has enabled the FED team at Stitch Fix to scale, managing over 4000 data transforms that create features for time-series modeling.

In addition, Hamilton provides a layer of indirection that transparently scales computation onto various distributed computation frameworks (such as Ray, Spark, and Dask), as materialization is decoupled from function transform definitions. This opens the door for exciting future work.


7. Future Work

Here we highlight three avenues of future work. For more, see the open issues in Hamilton's GitHub repository.

7.1. Source code based data governance

With Hamilton, one can encode a rich repository of metadata (see section 4.5.7) into the source code directly. Because source code is required to perform data transformations, keeping transform logic synchronized with tags, data quality checks, and documentation is a simpler proposition than having that metadata in separate independent steps of a dataflow or in separate systems. Therefore the source code itself could conceivably be used as a reliable base for data governance.
However, how to expose this information for consumption requires more thought. Does one build directly on top of the source code? Or does one emit this information to an existing system, such as a data catalog? For the former, a new system would need to be built. For the latter, one could integrate a continuous integration system that publishes changes when source code is snapshot (i.e. committed), or augment the Hamilton driver/DAG walking methodology to emit this information at DAG instantiation/execution time.

Similarly, data access/use policies could also be a target for source code based governance. By tagging functions that ingest data sources with appropriate data policies, one could, prior to DAG execution, walk the DAG to ensure the requesting user and requested DAG execution meet the policy requirements for those data sources.

7.2. Compiling to an orchestration framework

A common problem with ML tooling is choosing an orchestration system. This is a big decision, because companies rarely change this infrastructure. As Hamilton functions do not define or set materialization concerns, Hamilton cannot be used in place of an orchestration framework such as Airflow[15], where computation is split into discrete steps and materialized to a data store in between steps. If one were to provide node groupings and a materialization function, then it would be straightforward to compile the Hamilton Function DAG into any existing framework. Programmatically defining orchestration would also unlock the possibility of low cost infrastructure migrations, while avoiding vendor lock-in.

7.3. Modeling your entire data warehouse independently of materialization concerns

Common industry data tools and orchestration frameworks leak materialization concerns into the user experience. For example, using SQL, the end user has to think in tables. This naturally cascades into how data is materialized and transferred between workflows. What if, instead, one could model the dependencies of one's data transforms independently of how and where the data is stored? The declarative nature of Hamilton unlocks this possibility.


References

[1] Eric Colson, Beware the data science pin factory: The power of the full-stack data science generalist and the perils of division of labor through function, 2019. URL: https://multithreaded.stitchfix.com/blog/2019/03/11/FullStackDS-Generalists/.
[2] Stefan Krawczyk, Elijah ben Izzy, Danielle Quinn, A scalable general purpose micro-framework for defining dataflows, 2021. URL: https://github.com/stitchfix/hamilton.
[3] P. Moritz, Ray: A Distributed Execution Engine for the Machine Learning Ecosystem, Ph.D. thesis, EECS Department, University of California, Berkeley, 2019. URL: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-124.html.
[4] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, I. Stoica, Apache Spark: a unified engine for big data processing, Commun. ACM 59 (2016) 56-65. URL: http://doi.acm.org/10.1145/2934664. doi:10.1145/2934664.
[5] Various, Dask: Library for dynamic task scheduling, 2016. URL: https://dask.org.
[6] Pandas dev. team, pandas-dev/pandas: Pandas, 2020. URL: https://doi.org/10.5281/zenodo.3509134. doi:10.5281/zenodo.3509134.
[7] Various, An open framework for data lineage collection and analysis, 2017. URL: https://openlineage.io/.
[8] Various, Datahub, 2020. URL: https://github.com/datahub-project/datahub.
[9] Various, Amundsen, 2019. URL: https://github.com/amundsen-io/amundsen.
[10] Niels Bantilan, pandera: Statistical Data Validation of Pandas Dataframes, in: Meghann Agarwal, Chris Calloway, Dillon Niederhut, David Shupe (Eds.), Proceedings of the 19th Python in Science Conference, 2020, pp. 116-124. doi:10.25080/Majora-342d178e-010.
[11] S. Schelter, D. Lange, P. Schmidt, M. Celikel, F. Biessmann, A. Grafberger, Automating large-scale data quality verification, Proceedings of the VLDB Endowment 11 (2018) 1781-1794.
[12] Various, Great Expectations, 2017. URL: https://github.com/great-expectations/great_expectations.
[13] Various, Metaflow: a framework for real-life data science, 2020. URL: https://github.com/Netflix/metaflow.
[14] Various, Prefect workflow management system, 2017. URL: https://github.com/PrefectHQ/prefect.
[15] Various, Apache Airflow, 2015. URL: https://github.com/apache/airflow.
[16] Various, Dagster: An orchestration platform for the development, production, and observation of data assets, 2020. URL: https://github.com/dagster-io/dagster.
[17] Georg Brandl, Sphinx documentation, 2008. URL: https://www.sphinx-doc.org/en/master/.
[18] T. Kakantousis, A. Kouzoupis, F. Buso, G. Berthou, J. Dowling, S. Haridi, Horizontally scalable ML pipelines with a feature store, in: Proc. 2nd SysML Conf., Palo Alto, USA, 2019.
[19] D. L. Moody, Cognitive load effects on end user understanding of conceptual models: An experimental analysis, in: A. Benczúr, J. Demetrovics, G. Gottlob (Eds.), Advances in Databases and Information Systems, Springer Berlin Heidelberg, Berlin, Heidelberg, 2004, pp. 129-143.


A. A full Hamilton Hello World Example

## --- in my_functions.py
import pandas as pd

def avg_3wk_spend(spend: pd.Series) -> pd.Series:
    """Rolling 3 week average spend."""
    return spend.rolling(3).mean()

def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """The cost per signup in relation to spend."""
    return spend / signups

def spend_mean(spend: pd.Series) -> float:
    """Shows function creating a scalar. In this case it computes
    the mean of the entire column."""
    return spend.mean()

def spend_zero_mean(spend: pd.Series, spend_mean: float) -> pd.Series:
    """Shows function that takes a scalar. In this case to zero mean spend."""
    return spend - spend_mean

def spend_std_dev(spend: pd.Series) -> float:
    """Function that computes the standard deviation of the spend column."""
    return spend.std()

def spend_zero_mean_unit_variance(spend_zero_mean: pd.Series, spend_std_dev: float) -> pd.Series:
    """Function showing one way to make spend have zero mean and unit variance."""
    return spend_zero_mean / spend_std_dev

## in run.py
import pandas as pd
from hamilton import driver
import my_functions  # we import user functions here

initial_columns = {  # load from actuals or wherever -- this is our initial data we use as input.
    # Note: these values don't have to all be series, they could be scalar.
    'signups': pd.Series([1, 10, 50, 100, 200, 400]),
    'spend': pd.Series([10, 10, 20, 40, 40, 50]),
}
# instantiate the DAG - multiple modules can be passed
dr = driver.Driver(initial_columns, my_functions)
# we need to specify what we want in the final dataframe
output_columns = [
    'spend',
    'signups',
    'avg_3wk_spend',
    'spend_per_signup',
    'spend_zero_mean_unit_variance',
]
# by default execution returns a dataframe
df = dr.execute(output_columns)
print(df.to_string())

# To visualize, do `pip install sf-hamilton[visualization]` if you want these to work
dr.visualize_execution(output_columns, './my_dag.dot', {})
dr.display_all_functions('./my_full_dag.dot')

Listing 5: A full hello world example.