Pipeline in a Container: Docker Essentials for Data Engineers
Master the fundamentals of Docker for robust data engineering projects
As a data engineer, you always look for ways to streamline workflows, improve efficiency, and tackle complex challenges. But have you ever struggled with inconsistent environments, dependency conflicts, or difficulties scaling and deploying your data pipelines?
Docker is a game-changer for data engineers. Its containerisation technology lets you package your applications and their dependencies into lightweight, portable containers. This means you can say goodbye to "works on my machine" problems and hello to seamless collaboration and deployment.
In this article, you and I will explore Docker and how it can transform your data engineering workflows. We'll cover the fundamentals of Docker, its architecture, and how it differs from virtual machines.
You'll learn how to Dockerise your applications, manage containers, and scale your data pipelines through real-world examples and best practices. By the end of this article, you'll understand the concepts behind Docker and be ready to leverage its power in your projects.
So, let's get started and unlock the potential of Docker for data engineering!
Reading time: 8 minutes
📦 What is Docker? (And What It's Not)
Docker is an open-source platform that enables developers to create, deploy, and run applications in containers.
Containers are lightweight, standalone executable packages that include everything needed to run an application – code, runtime, system tools, libraries, and settings.
Think of a container as an environment that can run an application in isolation from the host system and other containers. Docker achieves this isolation through kernel namespaces and control groups (cgroups).
Now, let's address a common misconception: Docker differs from a virtual machine (VM). While both technologies aim to provide isolation and portability, they differ in their approach and level of abstraction.
Virtual machines rely on a hypervisor to create and manage many virtual machines on a single physical server. Each VM runs its own complete operating system (OS) and has allocated resources, such as CPU, memory, and storage. The hypervisor abstracts the hardware, allowing VMs to share the same physical resources.
In contrast, Docker containers share the host machine's operating system kernel and use a lightweight virtualisation layer provided by the Docker engine. Instead of virtualising the hardware, containers virtualise the operating system, making them more lightweight and efficient than VMs.
Here are some critical differences between Docker containers and virtual machines:
🔋 Resource Efficiency: Containers share the host OS kernel and resources, resulting in lower overhead and faster startup times than VMs, which need an entire OS for each instance.
🔒 Isolation Level: VMs provide strong isolation at the hardware level, while containers provide process-level isolation, sharing the host kernel. This means containers are less isolated than VMs but offer better performance and resource efficiency.
📈 Scalability: Containers can scale up or down based on demand due to their lightweight nature. Spinning up new containers is much faster than provisioning new VMs, making containers well-suited for scalable and elastic workloads (see the quick startup check after this list).
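If you want to feel the startup difference yourself, here's a quick, hedged check. It assumes Docker is installed and can pull the small alpine image from Docker Hub:

```bash
# Time how long it takes to start a container, run a command, and tear it down.
# The first run includes the image download; later runs typically finish in well under a second.
time docker run --rm alpine echo "hello from a container"
```

Booting a full VM to run the same one-line command would typically take far longer.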
I want to note that Docker and virtual machines are not mutually exclusive. In fact, you can run Docker containers inside VMs to achieve better isolation and security or to run containers on top of different operating systems.
Now that you know what Docker is, let's look at how it works.
👷 How Docker Works: A Peek Behind the Curtain
Docker has three main parts:
🧑💼 Daemon: The Docker daemon is the background service that manages the containers and images.
📟 Client: The Docker client is the command-line tool that lets you talk to the daemon.
🛒 Registries: Docker registries are like app stores for Docker images, where you can store and share your pre-built images. (The short example after this list shows all three parts working together.)
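Here's an illustrative set of commands showing the three parts in action. It assumes Docker is installed, and python:3.12-slim is just an example image tag:

```bash
# The client asks the daemon for details about itself and the host
docker info

# The client tells the daemon to pull an image from a registry (Docker Hub by default)
docker pull python:3.12-slim

# List the images the daemon now stores locally
docker images
```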
To create a Docker image, you start with a Dockerfile – a simple text file that tells Docker how to build your image. It's like a recipe that lists the ingredients (base image, libraries, tools) and steps to put them together. Once you have your Dockerfile, you can use the docker build command to create an image based on those instructions.
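Here's a minimal sketch of what that recipe can look like, assuming a small Python script called pipeline.py with its dependencies listed in requirements.txt (both file names are illustrative):

```dockerfile
# Start from an official, slim Python base image
FROM python:3.12-slim

# Do everything inside /app in the image
WORKDIR /app

# Install dependencies first so Docker can cache this layer between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the script itself into the image
COPY pipeline.py .

# Run the script when a container starts
CMD ["python", "pipeline.py"]
```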
With your Docker image ready, you can run a container using the docker run command. Just tell Docker which image to use, and poof! Docker will start up a container based on that image, and your application will run in its own little world.
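In practice, that's just two commands – a minimal sketch, assuming the Dockerfile above sits in the current directory and my-pipeline is an image name you pick yourself:

```bash
# Build an image from the Dockerfile in the current directory and tag it
docker build -t my-pipeline .

# Start a container from that image; --rm removes the container when it exits
docker run --rm my-pipeline
```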
🛠️ Real-World Examples: Docker in Action
Let's see Docker in action with some real examples!
Scripts 🐍
Say you have a Python script that you want to share with your friends. Instead of worrying about whether they have the correct version of Python or the right libraries, you can "Dockerise" your script: create a Dockerfile that starts from a Python base image, copies your script into the image, and sets the command that runs it. Anyone with Docker can then run your script with a single command, regardless of their machine.
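That single command could look something like this – a sketch, assuming you tag and push the image to Docker Hub under an account name like yourname (purely illustrative):

```bash
# Tag the local image with your registry account name, then push it (requires docker login first)
docker tag my-pipeline yourname/my-pipeline:latest
docker push yourname/my-pipeline:latest

# On any machine with Docker – no local Python setup required
docker run --rm yourname/my-pipeline:latest
```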
Third-Party Tools 🧙
Or maybe you need a quick way to run Jupyter or another tool built by somebody else. With Docker, it's easy: use an official image from Docker Hub, and with one command you'll have a working Jupyter container. You don't have to worry about installation, configuration, or port conflicts – Docker handles everything.
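For example, here's a hedged one-liner using the community jupyter/minimal-notebook image from Docker Hub (swap in whichever Jupyter image you prefer; the tag and port are the image's defaults at the time of writing):

```bash
# Start a Jupyter server and expose its default port 8888 on the host;
# the container prints a URL with a login token in its logs
docker run --rm -p 8888:8888 jupyter/minimal-notebook
```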
Data Platforms 🏔️
But what about bigger setups, like a data platform built around Apache Iceberg? That's where Docker Compose comes in. With Docker Compose, you can describe and manage your Iceberg catalogue, query engine, BI tool, and networking in a single YAML file.
Just list the services, their settings, and how they should work together, and Docker Compose will start up the whole stack with a single command.
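To give you a feel for it, here's a stripped-down sketch of such a file. It assumes a REST catalogue image (tabulario/iceberg-rest), MinIO as S3-compatible storage, and Trino as the query engine; the image names, ports, and credentials are illustrative, and a real stack would need more wiring (catalogue properties, warehouse paths, and so on):

```yaml
services:
  catalog:
    image: tabulario/iceberg-rest     # Iceberg REST catalogue
    ports:
      - "8181:8181"

  minio:
    image: minio/minio                # S3-compatible storage for the table files
    command: server /data
    environment:
      MINIO_ROOT_USER: admin          # illustrative credentials only
      MINIO_ROOT_PASSWORD: password
    ports:
      - "9000:9000"

  trino:
    image: trinodb/trino              # Query engine that reads the Iceberg tables
    ports:
      - "8080:8080"
    depends_on:
      - catalog
      - minio
```

A single `docker compose up -d` starts the whole stack, and `docker compose down` tears it down again.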
🏅 The Benefits of Docker for Data Engineering
So, why should you care about Docker as a data engineer? Here are some key benefits:
🫙 Isolation and Dependency Management: Docker lets you encapsulate each data platform component and all of its dependencies in its own container. This avoids conflicts between different tools and versions and makes it easier to manage complex data pipelines with many moving parts.
♻️ Reproducibility: With Docker, you can create consistent environments across development, testing, and production. No more "works on my machine" problems (a phrase I've repeated far too often in the past)! This makes it a breeze to collaborate with others and ensures your applications run the same way wherever they're deployed.
🚢 Simplified Deployment and Orchestration: Docker containers are lightweight and portable, which makes them easy to deploy and scale. You can manage and coordinate large-scale data pipelines with orchestration tools like Kubernetes and Nomad.
🧪 Experimentation and Technology Adoption: Docker provides a safe and isolated environment to test new technologies without worrying about impacting your host system. This means you can experiment with new tools and frameworks and stay on the cutting edge of data engineering.
🗄️ Handling Databases with Docker
One question people often ask me is how to handle databases. While running databases inside containers can be convenient for development and testing, I don't recommend it for production environments. Here's why:
🚤 Performance: Running databases inside containers can introduce extra overhead and performance implications. Containers may not have direct access to the host's resources, such as storage and network, which can impact database performance.
📊 Scalability: Scaling databases horizontally (adding more instances) or vertically (increasing resources) can be more challenging when running inside containers. Managing and scaling databases independently from the application containers is often easier.
💽 Backup and Recovery: Database backups and recovery operations can be more complex when databases run inside containers. Managing backups and recovery for databases running on the host or a dedicated database server is generally simpler and more reliable.
Instead of running databases inside containers in production, a common approach is to:
☁️ Run databases on dedicated hosts or managed database services (e.g., Amazon RDS, Google Cloud SQL) optimised for database workloads.
🔃 Use containers for your application services and connect them to the external databases using connection strings or environment variables.
📂 If you need to run databases in containers for specific use cases, ensure that you have proper data persistence mechanisms, such as Docker volumes or mounted external storage (see the sketch after this list).
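For those development and testing cases, here's a minimal sketch using the official postgres image, a named volume for persistence, and a user-defined network so your application container can reach the database by name (the names, password, and connection string are illustrative):

```bash
# A user-defined network lets containers resolve each other by container name
docker network create data-net

# Run Postgres with its data directory on a named volume so data survives container restarts
docker run -d --name dev-postgres \
  --network data-net \
  -e POSTGRES_PASSWORD=devpassword \
  -v pgdata:/var/lib/postgresql/data \
  postgres:16

# Run your application container on the same network and hand it the connection string
docker run --rm --network data-net \
  -e DATABASE_URL="postgresql://postgres:devpassword@dev-postgres:5432/postgres" \
  my-pipeline
```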
And that is all you need to know for now. Let's wrap it up.
🏁 Summary
In this article, you and I explored Docker and its relevance to data engineering. We focused on its ability to package, distribute, and run applications in a standardised way using containers.
I discussed the benefits of using Docker for data engineering, such as isolation, reproducibility, and simplified deployment. Additionally, you learned some best practices for handling databases in a containerised microservices architecture.
Incorporating Docker into your toolkit as a data engineer can enhance your ability to build, deploy, and manage data products and pipelines.
So go ahead, dive in, and start exploring Docker's possibilities!
Until next time,
Yordan
📚 Picks of the Week
There was a lot of noise around Apache Iceberg in the last few days. Read this great work of collaboration between and to learn what Iceberg is. (link)
Do you want to master MapReduce? Here's another outstanding work of data engineering collaboration between and . (link)
How do these fine individuals collaborate so well? Check this story (half free, half paid) about how and communicate effectively. (link)
Did you enjoy this article? Hit the ❤️ button or share it with a friend or coworker. 🙏🏻