Lineage + Hamilton in 10 minutes. Spend less time debugging your… | by Stefan Krawczyk | May, 2023

By virtue of writing code in the Hamiltonian way, you are defining computation within functions, and then specifying via function input arguments, how things connect, encoding lineage. Taking this code and connecting with e.g. a version control system (e.g. git), then also provides you with the means to snapshot lineage at points in time! Because you have to update code to change how computation operates, you, by definition, then update lineage without you having to do anything 😎.

Image showing code to DAG mapping
Overview of the Hamilton paradigm. Instead of procedural code you write functions, that define how computation should proceed. Defining functions in this way maps exactly to lineage!

At a high level the recipe for ‘Lineage as Code’ and then subsequently accessing it is:

  1. Write Hamilton code & version it.
  2. Instantiate a Hamilton Driver, it’ll then have a representation of how data and compute flow as defined by your Hamilton code. The Driver object can then emit/provide information on lineage!
  3. Using your version control system, you can then go back in time to understand how lineage changes over time, since it’s encoded in code!

Adding metadata to make lineage more useful!

Lineage gets more useful, when you can also attach metadata to it. Hamilton by default enables you to express how things connect, but by attaching extra metadata, we can now connect business and company concepts via the lineage you have encoded with Hamilton.

For example, you can annotate functions that pull data with concerns such as PII, ownership, importance, etc., and annotate functions that create important artifacts, e.g. models, data sets, with similar information. With these functions now annotated, the set of questions lineage can answer is now much greater! Other systems, in contrast, require you to put this metadata somewhere else; a YAML file, a different part of the code base, or separately curate this information in some other place. Philosophically with Hamilton, we believe it makes the most sense to annotate the actual code directly, since that ensures that the source of truth of code & metadata is always up to date. With Hamilton, it’s easier to maintain both since there’s only a single place needed to make a change.

In Hamilton, the way to add extra metadata is through the use of the `@tag` and `@tag_outputs` decorators. They allow one to specify arbitrary string key-value pairs. This provides flexibility and the ability for you and your organization to define tags and values that make sense for your context.

For example, the following code adds a bunch of metadata to:

> The titanic_data function specifying its source, an owner, importance, contains PII, and links to some internal wiki.
> The age and sex columns, specifying that they are PII.
Note: I’m showing a slightly more complex example with multiple decorators just to show that things don’t get more complex than this — the code is still quite readable!

@tag_outputs(age={"PII": "true"}, sex={"PII": "true"})
def titanic_data(index_col: str, location: str) -> pd.DataFrame:
"""Returns a dataframe, that then has its columns extracted."""
# ... contents of function not important ... code skipped for brevity

While the following code helps provide more context from a business perspective on the model being created:

@tag(owner="data-science", importance="production", artifact="model")
def fit_random_forest(
prefit_random_forest: base.ClassifierMixin,
X_train: pd.DataFrame,
y_train: pd.Series,
) -> base.ClassifierMixin:
"""Returns a fit RF model."""
# ... contents of function not important ... code skipped for brevity

With this extra metadata now attached to functions and therefore lineage, we can use that as context to ask more useful questions. More on that below.

The basic mechanics of answering lineage questions relies on instantiating a Hamilton Driver, which under the hood creates a directed acyclic graph (DAG) to represent its view of the world, and then using the functions the Driver has. Let’s list the relevant Driver functions and what they do.

Visualizing lineage (i.e. displaying the DAG):

  • display_*() there are three display_* functions. One that helps you display how everything defined connects. Then a function to only visualize what’s upstream, and then one to visualize what is downstream of a given function/node.
  • visualize_execution() which helps visualize everything that is required to produce some outputs. Useful for displaying what .execute() does.
  • visualize_path_between() which helps visualize a path between two nodes.

Getting access to metadata for lineage needs:

With Hamilton, using the functions above, you get programmatic access to lineage as well as visualizing it, which means that you can ask these questions in your CI systems, or in a notebook that uses the code that’s running in production, or anywhere that python runs!

To recap what we’ve talked about thus far, the general recipe to answer lineage questions is therefore as follows:

  1. Write Hamilton code.
  2. Use @tag to annotate functions.
  3. Instantiate a Driver that creates your DAG.
  4. Use the Driver functions to ask/answer questions.

To keep this post short, we won’t do an extensive deep dive into their use, we’ll just surface some general questions that you can answer in a straightforward manner with Hamilton, with some examples using the functions above. The toy problem we’ll be using is an end-to-end model pipeline in Hamilton that builds a model to predict Titanic survival; the code and more information can be found in the Hamilton repository’s Lineage example. Terminology side note: we use function and node interchangeably, since in Hamilton a function becomes a node in our visualizations.

(1) What sequence of operations produced this data/model?

This is a common question to ask, how do we get from A->B, where A can be some data, `->` is opaque to us, and B can be some artifact (either more data or a model/object).

With Hamilton, by virtue of writing code in the style prescribed by Hamilton, you clearly and easily define the sequence of operations of what should happen and how it relates! So if you can’t answer the question by looking at the code itself, you can ask Hamilton for help. In the context of our example, to understand how say the feature_encoders are created, without knowing much about them at all, we can ask the Hamilton Driver to visualize how they’re created for us:

# create the driver
dr = driver.Driver(config, data_loading, features, sets, model_pipeline, adapter=adapter)
# visualize how it's created
[features.encoders], "encoder_lineage", {"format": "png"}, inputs=inputs

Which outputs:

Example lineage visualization. Image by author.

You can then pair this with walking through your code base to more easily navigate and understand what is going on.

(2) Whose/What data sources led to this artifact/model?

This is useful for debugging data issues, as well as understanding what teams and data sources an artifact/model depends on. For our Titanic example, say there is something funky with our Random Forest model and we want to double check, for the current production model, what the data sources are and who owns them so we can go ping them. To determine this, we can write the following code:

# create the driver
dr = driver.Driver(config, data_loading, features, sets, model_pipeline, adapter=adapter)
# Gives us all the operations that are upstream of creating the output.
upstream_nodes = dr.what_is_upstream_of("fit_random_forest")

The above two lines create a Driver, and then pull all upstream nodes of fit_random_forest . Then we can iterate through the nodes and pull the information we want:

teams = []
# iterate through
for node in upstream_nodes:
# filter to nodes that we're interested in getting information about
if node.tags.get("source"):
# append for output
"team": node.tags.get("owner"),
"source": node.tags.get("source"),
# [{'team': 'data-engineering',
# 'function': 'titanic_data',
# 'source': 'prod.titanic'}]

(3) Who/What is downstream of this transform?

Answering this question is effectively the complement of (2). You commonly encounter it when someone wants to make a change to a feature, or a data source. Using our Titanic example, say we’re on data engineering and want to change the source data. How could we determine what the artifacts that use this data are and who owns them?

We use the what_is_downstream_of() Driver function to get nodes that are downstream:

# create the driver
dr = driver.Driver(config, data_loading, features, sets, model_pipeline, adapter=adapter)
# Gives us all the operations that are upstream of creating the output.
downstream_nodes = dr.what_is_downstream_of("titanic_data")

Then analogously like in (2), we just loop through and pull out the information we want:

artifacts = []
for node in downstream_nodes:
# if it's an artifact function
if node.tags.get("artifact"):
# pull out the information we want
"team": node.tags.get("owner", "FIX_ME"),
"artifact": node.tags.get("artifact"),
# [{'team': 'data-science', 'function': 'training_set_v1', 'artifact': 'training_set'},
# {'team': 'data-science', 'function': 'fit_random_forest', 'artifact': 'model'},
# {'team': 'data-science', 'function': 'encoders', 'artifact': 'encoders'}]

(4) What is defined as PII data, and what does it end up in?

With regulations these days, this is becoming a more common question that needs to be answered. Building off of the above, we can combine a few Driver functions to answer this type of question.

In the context of our Titanic example, let’s say our compliance team has come to us to understand how we’re using PII data, i.e. what artifacts does it end up in? They want this report every month. Well, with Hamilton we can write a script to programmatically get lineage information related to PII data. To start we need to get everything marked PII:

# create the driver
dr = driver.Driver(config, data_loading, features, sets, model_pipeline, adapter=adapter)
# using a list comprehension to get all things marked PII
pii_nodes = [n for n in dr.list_available_variables()
if n.tags.get("PII") == "true"]

Then to get all artifacts that are downstream, we just need to ask for all nodes that are downstream, and then filter to ones with the “artifact” tag:

pii_to_artifacts = {}
# loop through each PII node
for node in pii_nodes:
pii_to_artifacts[] = []
# ask what is downstream
downstream_nodes = dr.what_is_downstream_of(
for dwn_node in downstream_nodes:
# Filter to nodes of interest
if dwn_node.tags.get("artifact"):
# pull out information
"team": dwn_node.tags.get("owner"),
"artifact": dwn_node.tags.get("artifact"),
# {'age': [{'artifact': 'training_set',
# 'function': 'training_set_v1',
# 'team': 'data-science'},
# {'artifact': 'model',
# 'function': 'fit_random_forest',
# 'team': 'data-science'}],
# 'sex': [{'artifact': 'training_set',
# 'function': 'training_set_v1',
# 'team': 'data-science'},
# {'artifact': 'encoders', 'function': 'encoders', 'team': None},
# {'artifact': 'model',
# 'function': 'fit_random_forest',
# 'team': 'data-science'}]}

Some of you might be thinking why not have this type of lineage view as you develop? Great idea! Since we’re an open source project we’d love some help here; if you’re interested in testing/contributing, we have an alpha version of a VSCode extension, started by Thierry Jean, that can help you visualize lineage as you type. We’d love contributions.

Alpha Hamilton VSCode extension started by Thierry Jean. Image by author.

With Hamilton, your code defines lineage. This means you get lineage out of the box without the need for another system, and when coupled with a version control system and extra metadata, a very straightforward and lightweight means to understand how data & code connect.

Hamilton enables programmatic access to the lineage and metadata you encode, which enables you to place this in CI jobs, scripts, or anywhere that python runs.

I hope you enjoyed this quick overview, and if you’re excited by any of this or want more resources, here are some links:

Source link

Leave a Comment