Build Deployable Machine Learning Pipelines | by John Adeojo | Jun, 2023

Now that we have established a high-level view, let’s get into some of the core components of this pipeline.

Project Structure

conf
├───base
│   └───parameters
└───local
data
├───01_raw
├───02_intermediate
├───03_primary
├───04_feature
├───05_model_input
├───06_models
│   ├───experiment_run
│   │   └───model
│   │       ├───logs
│   │       │   ├───test
│   │       │   ├───training
│   │       │   └───validation
│   │       └───training_checkpoints
│   └───experiment_run_0
│       └───model
│           ├───logs
│           │   ├───test
│           │   ├───training
│           │   └───validation
│           └───training_checkpoints
├───07_model_output
└───08_reporting
source
└───pipelines
    └───train_model

Kedro provides a templated directory structure that is created when you initialise a project. From this base, you can programmatically add further pipelines to the structure. This standardisation means every machine learning project follows the same layout, making projects easier to document and maintain.

Data Management

Data plays a crucial role in machine learning. The ability to track your data becomes even more essential when employing machine learning models in a commercial setting. You often find yourself facing audits, or the necessity to productionise or reproduce your pipeline on someone else’s machine.

Kedro offers two methods for enforcing best practice in data management. The first is a directory structure designed for machine learning workloads, with distinct locations for the intermediate tables produced during data transformation and for model artefacts. The second is the data catalogue: as part of the Kedro workflow, you register datasets in a YAML configuration file (catalog.yml), which lets you reference them from your pipelines. This may feel unusual at first, but it makes it easy for you, and anyone else working on your pipeline, to track data.
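By way of illustration (the dataset names and filepaths below are hypothetical, not the article's actual entries), a catalogue registration in conf/base/catalog.yml takes roughly this shape:

```yaml
# Hypothetical entries in conf/base/catalog.yml
companies_raw:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv

model_input_table:
  type: pandas.ParquetDataSet
  filepath: data/05_model_input/model_input.parquet
```

Any node can then name these datasets as inputs or outputs, and Kedro handles the loading and saving for you.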

Orchestration — Nodes and Pipelines

This is really where the magic happens. Kedro provides you with pipeline functionality straight out of the box.

The basic building blocks of a pipeline are nodes. Each executable piece of code can be wrapped in a node, which is simply a Python function that accepts inputs and returns outputs. A pipeline is then constructed as a series of nodes: you declare each node along with the names of its inputs and outputs, and Kedro determines the execution order.
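The real classes live in kedro.pipeline; as a dependency-free sketch of the idea only (the Node class and runner below are illustrative, not Kedro's implementation), execution order can be resolved from each node's declared inputs and outputs:

```python
# Illustrative sketch of Kedro-style nodes and pipelines; NOT Kedro's code.
# A "node" is just a function plus the names of its inputs and outputs.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Node:
    func: Callable
    inputs: list
    outputs: list

def run_pipeline(nodes, catalog):
    """Run nodes in dependency order: a node runs once all its inputs exist."""
    pending = list(nodes)
    while pending:
        ready = [n for n in pending if all(i in catalog for i in n.inputs)]
        if not ready:
            raise ValueError("Unresolvable dependencies in pipeline")
        for n in ready:
            results = n.func(*(catalog[i] for i in n.inputs))
            if len(n.outputs) == 1:
                results = (results,)
            catalog.update(zip(n.outputs, results))
            pending.remove(n)
    return catalog

# Two toy nodes, deliberately declared out of execution order.
train = Node(lambda feats: f"model({feats})", ["features"], ["model"])
preprocess = Node(lambda raw: raw.strip().lower(), ["raw_data"], ["features"])

catalog = run_pipeline([train, preprocess], {"raw_data": "  RAW  "})
print(catalog["model"])  # the runner executed preprocess before train
```

The point is the declaration style: you list nodes in any order, and the dependency graph, not the list order, dictates execution.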

Once constructed, pipelines are registered in the project's pipeline_registry.py file. The beauty of this approach is that you can create multiple pipelines, which is particularly useful in machine learning, where you might have a data processing pipeline, a model training pipeline, an inference pipeline, and so forth.
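Registration amounts to returning a mapping of pipeline names to pipeline objects from a register_pipelines() function. Sketched with placeholder lists standing in for real kedro.pipeline.Pipeline objects:

```python
# Sketch of Kedro's pipeline-registry pattern; the list values are
# placeholders for real kedro.pipeline.Pipeline objects.
def register_pipelines() -> dict:
    data_processing = ["clean_node", "feature_node"]  # placeholder pipeline
    training = ["split_node", "train_node"]           # placeholder pipeline
    return {
        "data_processing": data_processing,
        "training": training,
        # "__default__" is the pipeline Kedro runs when none is named
        "__default__": data_processing + training,
    }

pipelines = register_pipelines()
print(sorted(pipelines))  # ['__default__', 'data_processing', 'training']
```

Each named entry can then be run on its own, so you can re-run training without repeating data processing, for example.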

Once set up, it’s straightforward enough to modify aspects of your pipeline.

[Code snippet: example script]
[Code snippet: example pipeline script]


Kedro’s best practice stipulates that all configuration is handled through the provided parameters.yml file. From a machine learning perspective, hyperparameters fall into this category. This streamlines experimentation: you can simply swap one parameters.yml file of hyperparameters for another, and the files themselves are easy to track.

I have also included the locations of my Ludwig deep neural network model.yaml and the data source within the parameters.yml configuration. Should the model or the location of the data change — for instance, when moving between developers’ machines — it would be incredibly straightforward to adjust these settings.
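The article's actual file is shown as a screenshot; a hypothetical parameters.yml of the shape described (all paths and values below are illustrative placeholders) might read:

```yaml
# Illustrative parameters.yml; paths and hyperparameters are placeholders
model_definition: conf/base/model.yaml   # Ludwig model config location
data_location: data/01_raw/training_data.csv

learning_rate: 0.001
batch_size: 64
epochs: 20
```

Moving to another machine then means editing two lines of configuration rather than hunting through pipeline code.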

[Code snippet: contents of the parameters.yml file]


Kedro includes a requirements.txt file as part of the templated structure. This makes it really straightforward to monitor your environment and exact library versions. However, should you prefer, you can employ other environment management methods, such as an environment.yml file.
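For illustration (the package pins below are hypothetical, not the article's actual environment), a requirements.txt simply lists exact versions:

```
# Illustrative requirements.txt; pin the versions your project actually uses
kedro==0.18.10
ludwig==0.7.4
pandas==1.5.3
```

Anyone reproducing the pipeline can then recreate the environment with pip install -r requirements.txt.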
