From Chaos to Consistency: Docker for Data Scientists | by Egor Howell | May, 2023


An introduction and application of Docker for Data Scientists

Photo by Ian Taylor on Unsplash

But it works on my machine?

This is a classic meme in the tech community, especially for Data Scientists who want to ship their amazing machine-learning model, only to learn that the production machine has a different operating system. Far from ideal.

However…

There is a solution thanks to these wonderful things called containers and tools to control them such as Docker.

In this post, we will dive into what containers are and how you can build and run them using Docker. The use of containers and Docker has become an industry standard and common practice for data products. As a Data Scientist, learning these tools is an invaluable addition to your arsenal.

Docker is a service that helps build, run, and manage code and applications in containers.

Now you may be wondering, what is a container?

On the surface, a container is very similar to a virtual machine (VM). It is a small, isolated environment where everything is self-‘contained’ and can be run on any machine. The primary selling point of containers and VMs is their portability, allowing your application or model to run seamlessly on any on-premise server, local machine, or cloud platform such as AWS.

The main difference between containers and VMs is how they use their host computer’s resources. Containers are a lot more lightweight as they do not actively partition the hardware resources of the host machine. I will not delve into the full technical details here; however, if you want to understand a bit more, I have linked a great article explaining their differences here.

Docker is then simply a tool we use to create, manage and run these containers with ease. It is one of the main reasons why containers have become very popular, as it enables developers to easily deploy applications and models that run anywhere.

Diagram by author.

There are three main elements we need to run a container using Docker:

  • Dockerfile: A text file that contains the instructions for how to build a Docker image.
  • Docker Image: A blueprint or template to create a Docker container.
  • Docker Container: An isolated environment that provides everything an application or machine learning model needs to run. Includes things such as dependencies and OS versions.
Diagram by author.

There are also a few other key points to note:

  • Docker Daemon: A background process (daemon) that deals with incoming requests to Docker.
  • Docker Client: A shell interface that enables the user to speak to Docker through its daemon.
  • DockerHub: Similar to GitHub, a place where developers can share their Docker images.

Homebrew

The first thing you should install is Homebrew (link here). It is dubbed the ‘missing package manager for macOS’ and is very useful for anyone coding on their Mac.

To install Homebrew, simply run the command given on their website:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Verify Homebrew is installed by running brew help.

Docker

Now with Homebrew installed, you can install Docker by running brew install docker. Verify Docker is installed by running which docker; the output should not raise any errors and should look like this:

/opt/homebrew/bin/docker

Colima

The final step is to install Colima. Simply run brew install colima and verify it is installed with which colima. Again, the output should look like this:

/opt/homebrew/bin/colima

Now you might be wondering, what on earth is Colima?

Colima is a software package that enables container runtimes on macOS. In layman’s terms, Colima creates the environment for containers to work on our system. To achieve this, it runs a Linux virtual machine with a daemon that Docker can communicate with using the client-server model.

Alternatively, you can install Docker Desktop instead of Colima. However, I prefer Colima for a few reasons: it’s free, more lightweight, and I like working in the terminal!

See this blog post for more arguments in favour of Colima.

Workflow

Below is an example of how Data Scientists and Machine Learning Engineers can deploy their model using Docker:

Diagram by author.

The first step is obviously to build their amazing model. Then, you need to capture everything used to run the model, such as the Python version and package dependencies, in a requirements file. The final step is to use that requirements file inside the Dockerfile.

If this seems completely arbitrary to you at the moment don’t worry, we will go over this process step by step!

Basic Model

Let’s start by building a basic model. The provided code snippet displays a simple implementation of the Random Forest classification model on the famous Iris dataset:

Dataset from Kaggle with a CC0 licence.

GitHub Gist by author.

This file is called basic_rf_model.py for reference.
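In case the embedded gist does not render, here is a minimal sketch of what basic_rf_model.py could contain. The exact train/test split and hyperparameters are assumptions, not the author's original values:

```python
# basic_rf_model.py
# Train a Random Forest classifier on the Iris dataset and report accuracy.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the Iris dataset (bundled with scikit-learn).
X, y = load_iris(return_X_y=True)

# Hold out a portion of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit the Random Forest and evaluate on the held-out set.
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds)}")
```

Running python basic_rf_model.py will print the model's accuracy on the held-out test set.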

Create Requirements File

Now that we have our model ready, we need to create a requirements.txt file to house all the dependencies that underpin the running of our model. In this simple example, we luckily only rely on the scikit-learn package. Therefore, our requirements.txt will simply look like this:

scikit-learn==1.2.2

You can check the version you are running on your computer with the pip show scikit-learn command.

Create Dockerfile

Now we can finally create our Dockerfile!

So, in the same directory as requirements.txt and basic_rf_model.py, create a file named Dockerfile. Inside the Dockerfile we will have the following:

GitHub Gist by author.

Let’s go over line by line to see what it all means:

  • FROM python:3.9: This is the base image for our image
  • MAINTAINER egor@some.email.com: This indicates who maintains this image
  • WORKDIR /src: Sets the working directory of the image to /src
  • COPY . .: Copies the current directory’s files into the image’s working directory
  • RUN pip install -r requirements.txt: Installs the requirements from the requirements.txt file into the Docker environment
  • CMD ["python", "basic_rf_model.py"]: Tells the container to execute the command python basic_rf_model.py and run the model
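Putting the lines described above together, the Dockerfile should look like this (reconstructed from the bullet points, in case the gist does not render):

```dockerfile
FROM python:3.9
MAINTAINER egor@some.email.com
WORKDIR /src
COPY . .
RUN pip install -r requirements.txt
CMD ["python", "basic_rf_model.py"]
```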

Initiate Colima & Docker

The next step is to set up the Docker environment. First, we need to boot up Colima:

colima start

After Colima has started up, check that the Docker commands are working by running:

docker ps

It should return something like this:

CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

This is good and means both Colima and Docker are working as expected!

Note: the docker ps command lists all the current running containers.

Build Image

Now it is time to build our first Docker Image from the Dockerfile that we created above:

docker build . -t docker_medium_example

The -t flag sets the name of the image, and the . tells Docker to build from the current directory.

If we now run docker images, we should see something like this:

Image from author.
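If the screenshot does not render, the docker images output should resemble the listing below. The CREATED and SIZE values here are purely illustrative; the image ID is the one used in the next step, but yours will differ:

```
REPOSITORY              TAG       IMAGE ID       CREATED          SIZE
docker_medium_example   latest    bb59f770eb07   10 seconds ago   1.2GB
```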

Congrats, the image has been built!

Run Container

After the image has been created, we can run it as a container using the IMAGE ID listed above:

docker run bb59f770eb07

Output:

Accuracy: 0.9736842105263158

This is because all the container has done is run the basic_rf_model.py script!

Extra Information

This tutorial only scratches the surface of what Docker can do and be used for. There are many more features and commands to learn to understand Docker. A great, detailed tutorial is available on the Docker website, which you can find here.

One cool feature is that you can run the container in interactive mode and go into its shell. For example, if we run:

docker run -it bb59f770eb07 /bin/bash

you will enter the Docker container’s shell, and it should look something like this:

Image by author.

We also used the ls command to show all the files in the Docker working directory.

Docker and containers are fantastic tools for ensuring Data Scientists’ models can run anywhere and anytime with no issues. They do this by creating small, isolated compute environments, called containers, that package everything the model needs to run effectively. Containers are easy to use and lightweight, which is why they have become common industry practice. In this article, we went over a basic example of how you can package your model into a container using Docker. The process was simple and seamless, so it is something Data Scientists can pick up quickly.

Full code used in this article can be found at my GitHub here:

(All emojis designed by OpenMoji — the open-source emoji and icon project. License: CC BY-SA 4.0)


