Data Scientists are now expected to write production code to deploy their machine learning algorithms. Therefore, we need to be aware of software engineering standards and methods to ensure our models are deployed robustly and effectively. One such tool that is very well known in the developer community is make. This is a powerful command-line tool on UNIX-like systems that developers have used for a long time, and in this article I want to show how it can be used to build efficient machine learning pipelines.
make is a terminal command, run from the shell just like cd, that is available on most UNIX-like operating systems such as macOS and Linux. Unlike cd, which is built into the shell, make is a standalone executable.
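If you are not sure whether make is present on your machine, a quick check along these lines should tell you. This is only a sketch: the variable name make_path is my own, and how you install make if it is missing depends on your system.

```shell
#!/bin/sh
# Look up make on the PATH; command -v is the POSIX way to locate a command.
if command -v make >/dev/null 2>&1; then
  make_path=$(command -v make)   # make_path is a hypothetical variable name
  echo "make found at: $make_path"
else
  make_path=""
  echo "make not found; install it with your system's package manager"
fi
```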
The purpose of make is to simplify your workflow by breaking it down into logical groups of shell commands.
It is used widely by developers and is also being adopted by Data Scientists as it simplifies the machine learning pipeline and enables more robust production deployment.
make is a powerful tool that Data Scientists should be utilising for the following reasons:
- Automate the setup of machine learning environments
- Clearer end-to-end pipeline documentation
- Easier to test models with different parameters
- Obvious structure and execution of your project
A Makefile is the file that the make command reads and executes from. Each rule in it has three components:
- Targets: These are the files you are trying to build, or a .PHONY target if you are just carrying out commands and not producing a file.
- Dependencies: Files or other targets that must exist and be up to date before this target's commands are run.
- Commands: As it says on the tin, this is the list of shell steps that produce the target. Each command line must be indented with a tab.
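To see how the three components sit together, here is a minimal sketch written as a shell script that generates a tiny Makefile and runs it. The file names raw.txt and clean.txt and the target name tidy are made up for illustration, and the script assumes make is installed.

```shell
#!/bin/sh
# Sketch: one file-producing rule and one .PHONY rule.
# raw.txt, clean.txt and tidy are hypothetical names chosen for this demo.
set -eu

workdir=$(mktemp -d)   # scratch directory so nothing real is touched
cd "$workdir"

printf 'raw data\n' > raw.txt   # stand-in for a source data file

# Write the Makefile with printf so each command line starts with a real tab (\t),
# which make requires.
{
  printf 'clean.txt: raw.txt\n'                    # target: dependency
  printf '\ttr a-z A-Z < raw.txt > clean.txt\n'    # command that builds the target
  printf '\n'
  printf '.PHONY: tidy\n'                          # tidy is a task, not a file
  printf 'tidy:\n'
  printf '\trm -f clean.txt\n'                     # command for the phony target
} > Makefile

make clean.txt            # runs the command because clean.txt does not exist yet
built=$(cat clean.txt)    # contents of the freshly built target
make clean.txt            # does no work: clean.txt is already up to date
make tidy                 # runs every time it is requested, because tidy is .PHONY
```

The second make clean.txt call does nothing because the target is already newer than its dependency. This is the behaviour that lets make skip pipeline steps whose inputs have not changed.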
Let’s run through a very simple example to make this theory concrete.