Introduction and Applications of Docker for Data Scientists
But it works on my machine!
It’s a classic meme in the tech community, especially for data scientists who want to ship their amazing machine-learning models, only to find that the production machine has a different operating system. Far from ideal.
However, there is a solution, thanks to these wonderful things called containers and the tools that control them, such as Docker.
In this post, we will learn what containers are and how you can create and run them using Docker. The use of containers and Docker for data products has become an industry standard and common practice. As a data scientist, learning these tools is an invaluable addition to your arsenal.
Docker is a service that helps build, run, and execute code and applications in containers.
Now you may be wondering: what is a container?
Ostensibly, a container is similar to a Virtual Machine (VM). It is a small isolated environment where everything is ‘self-contained’ and can be run on any machine. The primary selling point of containers and VMs is their portability, which allows your applications or models to run seamlessly on any on-premises server, local machine, or cloud platform such as AWS.
The main difference between containers and VMs is how they use their host computer’s resources. Containers are much more lightweight because they share the host’s operating system kernel instead of partitioning the host’s hardware and booting a full guest operating system. I won’t go into full technical details here; however, I’ve linked a great article explaining their differences if you want to understand a bit more.
Docker is just a tool that we use to easily create, manage and run these containers. It is one of the main reasons why containers have become so popular, as it enables developers to easily deploy applications and models that run anywhere.
To run a container using Docker we need three main elements:
- Dockerfile: A text file containing the instructions for building a Docker image.
- Docker image: A blueprint or template for creating Docker containers.
- Docker container: An isolated environment that provides everything needed to run an application or machine learning model. Things like dependencies and OS versions are included.
There are also some other key terms to note:
- Docker daemon: A background process (daemon) that deals with incoming requests to Docker.
- Docker client: A command-line interface that enables the user to talk to Docker through its daemon.
- Docker Hub: Similar to GitHub, a place where developers can share their Docker images.
Homebrew
The first thing you should install is Homebrew (link here). It’s dubbed ‘the missing package manager for macOS’ and is very useful for anyone coding on their Mac.
To install Homebrew, simply run the command provided on their website:
/bin/bash -c "$(curl -fsSL
Verify that Homebrew is installed by running brew help.
Docker
Now, with Homebrew installed, you can install Docker by running brew install docker. Verify that Docker is installed by running which docker. The output should not contain any errors and should look like this:
/opt/homebrew/bin/docker
Colima
The last part is to install Colima. Simply run brew install colima and verify that it is installed with which colima. Again, the output should look like this:
/opt/homebrew/bin/colima
Now you may be wondering: what is Colima?
Colima is a software package that provides container runtimes on macOS. In simpler terms, Colima creates the environment that containers need in order to work on our system. To achieve this, it runs a Linux virtual machine with a daemon that the Docker client can communicate with using the client-server model.
Alternatively, you can install Docker Desktop instead of Colima. However, I prefer Colima for a few reasons: it’s free, more lightweight and I like working in the terminal!
For more arguments for Colima, see this blog post here.
Workflow
Below is an example of how data scientists and machine learning engineers can deploy their models using Docker:
The first step is obviously to build the model. Then, you need to capture everything required to run the model, such as the Python version and package dependencies, in a requirements file. The last step is to use that requirements file inside the Dockerfile.
If this seems unclear to you at this point, don’t worry, we’ll go through the process step by step!
Basic Model
Let’s start by creating a basic model. The code for this step is a simple implementation of a random forest classification model on the famous Iris dataset.
Dataset from Kaggle with CC0 license.
This file is called basic_rf_model.py for reference.
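As the original listing is not reproduced here, the following is a minimal sketch of what basic_rf_model.py could look like. It is an assumption rather than the author’s exact code: it loads the copy of Iris bundled with scikit-learn instead of the Kaggle file, and the function name train_and_score, the 25% test split and the random seed are all illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_and_score(random_state: int = 42) -> float:
    """Train a random forest on Iris and return accuracy on a held-out split."""
    # Assumption: use scikit-learn's bundled Iris data, not the Kaggle CSV.
    X, y = load_iris(return_X_y=True)

    # Hold out 25% of the rows for evaluation, keeping class balance.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=random_state, stratify=y
    )

    model = RandomForestClassifier(n_estimators=100, random_state=random_state)
    model.fit(X_train, y_train)

    return accuracy_score(y_test, model.predict(X_test))


if __name__ == "__main__":
    print(f"Accuracy: {train_and_score()}")
```

The exact accuracy depends on the split and seed, so it will not necessarily match the figure printed later in this post.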
Create Requirements File
Now that our model is ready, we need to make a requirements.txt file to hold all the dependencies that underpin the running of our model. In this simple example, we fortunately only rely on the scikit-learn package. Therefore, our requirements.txt will just look like this:
scikit-learn==1.2.2
You can check the version installed on your computer by running pip show scikit-learn.
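If you would rather generate the pinned line than type it by hand, something like the sketch below may help. It uses the standard library’s importlib.metadata to look up installed versions; the helper names pinned_requirement and build_requirements are hypothetical, and leaving a missing package unpinned is just one possible choice.

```python
from importlib.metadata import PackageNotFoundError, version


def pinned_requirement(package: str) -> str:
    """Return a 'package==X.Y.Z' line for a package installed locally."""
    return f"{package}=={version(package)}"


def build_requirements(packages: list[str]) -> str:
    """Build the text of a requirements file for the given package names."""
    lines = []
    for package in packages:
        try:
            lines.append(pinned_requirement(package))
        except PackageNotFoundError:
            # Assumption: if the package is not installed here, list it unpinned.
            lines.append(package)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Prints the file content; redirect it to requirements.txt if it looks right.
    print(build_requirements(["scikit-learn"]), end="")
```

Pinning an exact version keeps the container’s environment the same as the one the model was developed in.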
Create Dockerfile
Now we can finally build our Dockerfile!
So, in the same directory as requirements.txt and basic_rf_model.py, create a file named Dockerfile. It contains six instructions.
Let’s go line by line to see what each one means:
- FROM python:3.9: This is the base image for our image.
- MAINTAINER egor@some.email.com: Indicates who maintains this image (newer Docker versions prefer a LABEL instruction for this).
- WORKDIR /src: Sets the working directory of the image to be /src.
- COPY . .: Copies the files in the current directory into the image’s working directory.
- RUN pip install -r requirements.txt: Installs the requirements from the requirements.txt file into the Docker environment.
- CMD ["python", "basic_rf_model.py"]: Tells the container to execute the command python basic_rf_model.py and run the model.
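To see the instructions together as one file, here is a small sketch that assembles them in the order described above and writes them to disk. The instructions themselves come from the walkthrough; the names DOCKERFILE_LINES, render_dockerfile and write_dockerfile are hypothetical helpers, and you can just as well type the six lines into a file called Dockerfile by hand.

```python
from pathlib import Path

# The six instructions from the walkthrough, in order.
DOCKERFILE_LINES = [
    "FROM python:3.9",
    "MAINTAINER egor@some.email.com",
    "WORKDIR /src",
    "COPY . .",
    "RUN pip install -r requirements.txt",
    'CMD ["python", "basic_rf_model.py"]',
]


def render_dockerfile(lines=DOCKERFILE_LINES) -> str:
    """Join the instructions into the text of a Dockerfile."""
    return "\n".join(lines) + "\n"


def write_dockerfile(directory: str = ".") -> Path:
    """Write the Dockerfile into the given directory and return its path."""
    path = Path(directory) / "Dockerfile"
    path.write_text(render_dockerfile())
    return path
```

Calling write_dockerfile() from the project directory places the file next to requirements.txt and basic_rf_model.py, which is where docker build expects to find it.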
Start Colima and Docker
The next step is setting up the Docker environment. First, we need to boot up Colima:
colima start
After Colima has started, check that Docker is working by running the command:
docker ps
It should return something like this:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
This is good and means that both Colima and Docker are working as expected!
Note: The docker ps command lists all currently running containers.
Create Image
Now it’s time to build our first Docker image from the Dockerfile we created above:
docker build . -t docker_medium_example
The -t flag sets the name of the image, and the . tells Docker to build from the current directory.
If we now run docker images, we should see something like this:
Congratulations, the image is created!
Run Container
Once the image is created, we can run it as a container using the IMAGE ID listed above:
docker run bb59f770eb07
Output:
Accuracy: 0.9736842105263158
This is because all the container has done is run the basic_rf_model.py script!
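If you want to drive the build and run steps from Python, for example inside a larger deployment script, a sketch along these lines may be useful. It only wraps the same docker build and docker run commands with subprocess. The function names are hypothetical, the container is started by image name rather than IMAGE ID, and the --rm flag (which removes the container once it exits) is an addition not used elsewhere in this post.

```python
import shutil
import subprocess


def build_command(tag: str, context: str = ".") -> list[str]:
    """Command that builds an image named `tag` from the given directory."""
    return ["docker", "build", context, "-t", tag]


def run_command(image: str) -> list[str]:
    """Command that runs a container from `image` and removes it afterwards."""
    return ["docker", "run", "--rm", image]


def build_and_run(tag: str, context: str = ".") -> str:
    """Build the image, run it once, and return whatever the container printed."""
    if shutil.which("docker") is None:
        raise RuntimeError("docker CLI not found on PATH")

    # check=True raises CalledProcessError if either step fails.
    subprocess.run(build_command(tag, context), check=True)
    result = subprocess.run(
        run_command(tag), check=True, capture_output=True, text=True
    )
    return result.stdout
```

With Colima running, calling build_and_run("docker_medium_example") from the project directory should return the accuracy line printed by the script.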
Additional Information
This tutorial is just scratching the surface of what Docker can do and be used for. There are many more features and commands to learn in order to understand Docker. There is a very detailed tutorial on the Docker website which you can find here.
A nice feature is that you can run the container in interactive mode and go into its shell. For example, if we run:
docker run -it bb59f770eb07 /bin/bash
You will enter the Docker container and it should look something like this:
We also used the ls command to show all the files in the Docker working directory.
Docker and containers are great tools for making sure data scientists’ models can run anywhere and anytime without any issues. They do this by creating small isolated compute environments, called containers, that hold everything the model needs to run effectively. Containers are easy to use and lightweight, which has made them common industry practice. In this article, we looked at a basic example of how to package your model into a container using Docker. The process was simple and seamless, so it is something data scientists can learn and pick up quickly.
The full code used in this article can be found on my GitHub here:
(All emojis designed by OpenMoji, the open-source emoji and icon project. License: CC BY-SA 4.0)