How to Install a GPU-Ready Docker Environment on Ubuntu (Step-by-Step + Code + Explanations)
Complete guide to setting up GPU-ready Docker environment using Ansible automation for ML, AI, and HPC workloads
Difficulty: 🟡 Intermediate
Estimated Time: 30-45 minutes
Prerequisites: Basic Linux knowledge, Ansible experience helpful, NVIDIA GPU, Ubuntu/Debian system
What You'll Learn
This tutorial covers essential GPU-ready Docker concepts and tools:
- Ansible Automation - Complete automation of GPU Docker setup
- Docker Installation - Installing Docker Engine and Docker Compose
- NVIDIA Driver Setup - Installing and configuring GPU drivers
- Container Toolkit - Setting up NVIDIA Container Toolkit
- GPU Validation - Testing and verifying GPU access in containers
- Production Configuration - Advanced setup for production workloads
Prerequisites
- Basic Linux knowledge
- Ansible experience helpful
- NVIDIA GPU
- Ubuntu/Debian system
Related Tutorials
- Docker Best Practices 2025 - Production Docker strategies
- Multi-Stage Docker Builds - Advanced build optimization
- Docker GPU PyTorch Setup - PyTorch GPU containers
Introduction
Setting up a GPU-ready Docker environment is essential for ML, AI, and HPC workloads — but doing it manually is time-consuming and error-prone. With Ansible, you can automate the entire process across multiple machines — installing Docker, NVIDIA drivers, NVIDIA container toolkit, and nvidia-docker2 — all in the correct order.
This guide provides:
- Step-by-step role execution
- Equivalent shell commands
- Clear explanations of what each command does
Ansible Role Execution Order
- role: ../roles/docker
- role: ../roles/nvidia_driver
- role: ../roles/nvidia_container_toolkit
- role: ../roles/nvidia_docker
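For reference, a minimal playbook that applies these roles in order might look like the sketch below; the host group name (gpu_nodes) is a placeholder, and the relative role paths simply follow the layout above.

- hosts: gpu_nodes        # placeholder group name for your GPU machines
  become: yes
  roles:
    - role: ../roles/docker
    - role: ../roles/nvidia_driver
    - role: ../roles/nvidia_container_toolkit
    - role: ../roles/nvidia_docker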
Docker Role — Install Docker Engine & Compose
Ansible Tasks
- name: Install Docker CE
  apt:
    name:
      - docker-ce
      - docker-ce-cli
      - containerd.io
    state: present
    update_cache: yes

- name: Install Docker Compose
  get_url:
    url: https://github.com/docker/compose/releases/download/v2.36.2/docker-compose-linux-x86_64
    dest: /usr/bin/docker-compose
    mode: '0755'
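Since the later roles restart the Docker service, it can help to make sure the daemon is started and enabled at boot right after installation. This extra task is not part of the original role, just an optional sketch:

- name: Ensure Docker service is started and enabled
  systemd:
    name: docker
    state: started
    enabled: yes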
Shell Commands + Explanations
sudo apt update # Updates the package list
sudo apt install -y docker-ce docker-ce-cli containerd.io # Installs Docker and its core components
wget https://github.com/docker/compose/releases/download/v2.36.2/docker-compose-linux-x86_64 -O docker-compose # Downloads Docker Compose v2
sudo mv docker-compose /usr/bin/docker-compose # Moves it to the executable path
sudo chmod +x /usr/bin/docker-compose # Grants execution permission
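Note that docker-ce, docker-ce-cli, and containerd.io come from Docker's own APT repository rather than Ubuntu's default archives, so the install above assumes that repository is already configured. A minimal sketch of that setup, following Docker's standard instructions:

sudo install -m 0755 -d /etc/apt/keyrings  # Creates the directory for APT signing keys
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg  # Adds Docker's GPG key in keyring format
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" | sudo tee /etc/apt/sources.list.d/docker.list  # Registers the Docker repository
sudo apt update  # Refreshes the package list so docker-ce becomes installable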
NVIDIA Driver Role — Install NVIDIA GPU Driver
Ansible Tasks
- name: Install dependencies
  apt:
    name:
      - build-essential
      - dkms
      - pciutils
      - wget
    state: present

- name: Install NVIDIA Driver
  apt:
    name: "{{ nvidia_driver_package }}"
    state: present

- name: Check if reboot is required
  shell: 'lsmod | grep -i nvidia || echo "reboot_needed"'
  register: nvidia_driver_loaded

- name: Reboot if driver not loaded
  reboot:
    msg: "Rebooting to load NVIDIA driver"
  when: '"reboot_needed" in nvidia_driver_loaded.stdout'
Shell Commands + Explanations
sudo apt install -y build-essential dkms pciutils wget # Installs build tools and kernel module manager
sudo apt install -y nvidia-driver-535 # Installs the NVIDIA GPU driver (replace with desired version)
lsmod | grep -i nvidia # Checks if the driver is currently loaded
# If output is empty:
sudo reboot # Reboot the system so the kernel loads the new driver
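After the reboot, it is worth confirming on the host that the driver is active before moving on to the container tooling:

nvidia-smi           # Should print the driver version and list your GPU(s)
lsmod | grep nvidia  # Should now show the nvidia kernel modules loaded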
NVIDIA Container Toolkit Role — Configure Docker to Use GPU
Ansible Tasks
- name: Download NVIDIA GPG key
  shell: curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

- name: Add NVIDIA container repo
  shell: |
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
      sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
      tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

- name: Install container toolkit
  apt:
    name: nvidia-container-toolkit
    state: present
    update_cache: yes

- name: Configure Docker daemon
  copy:
    content: |
      {
        "runtimes": {
          "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
          }
        },
        "default-runtime": "nvidia"
      }
    dest: /etc/docker/daemon.json

- name: Restart Docker
  systemd:
    name: docker
    state: restarted
Shell Commands + Explanations
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# Fetches NVIDIA's GPG key and converts it to keyring format for secure APT use
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Adds the NVIDIA container runtime repository
sudo apt update
sudo apt install -y nvidia-container-toolkit # Installs the toolkit for enabling GPU in Docker
# Set NVIDIA as default runtime
echo '{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker # Restart Docker to apply the changes
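As an alternative to writing /etc/docker/daemon.json by hand, recent toolkit versions ship the nvidia-ctk helper, which generates the same runtime configuration. Shown here as an optional sketch:

sudo nvidia-ctk runtime configure --runtime=docker --set-as-default  # Registers the nvidia runtime and makes it the default
sudo systemctl restart docker  # Restart Docker to apply the changes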
NVIDIA Docker Role — Install NVIDIA Docker 2
Ansible Tasks
- name: Add NVIDIA Docker GPG key
  apt_key:
    url: https://nvidia.github.io/nvidia-docker/gpgkey
    state: present

- name: Add NVIDIA Docker repository
  shell: |
    distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list

- name: Install NVIDIA Docker 2
  apt:
    name: nvidia-docker2
    state: present
    update_cache: yes

- name: Restart Docker
  systemd:
    name: docker
    state: restarted
Shell Commands + Explanations
curl -fsSL https://nvidia.github.io/nvidia-docker/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-docker-keyring.gpg
# Fetches the NVIDIA Docker GPG key and stores it in keyring format so APT can validate packages
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-docker-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# Dynamically selects the repo for your OS version (e.g. ubuntu20.04) and points it at the keyring added above
sudo apt update
sudo apt install -y nvidia-docker2 # Installs Docker CLI extensions for GPU support
sudo systemctl restart docker
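Before the final test, you can confirm that Docker has actually registered the NVIDIA runtime:

docker info | grep -i runtime  # Should list "nvidia" among the runtimes and show it as the default runtime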
Final Test: Run GPU-Enabled Container
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
Expected Output: A table listing your NVIDIA GPU(s) — confirming that the container can access them.
Docker Compose Example with GPU Support
version: "3.9"
services:
gpu-app:
image: nvidia/cuda:12.4.1-base-ubuntu22.04
container_name: cuda-gpu-container
deploy:
resources:
reservations:
devices:
- capabilities: [gpu]
runtime: nvidia
environment:
- NVIDIA_VISIBLE_DEVICES=all
command: nvidia-smi
What This Does:
- image - Uses NVIDIA's official CUDA image
- runtime: nvidia - Tells Docker to use the NVIDIA runtime
- devices: capabilities: [gpu] - Requests access to GPU hardware
- NVIDIA_VISIBLE_DEVICES=all - Exposes all available GPUs
- command: nvidia-smi - Runs the NVIDIA CLI to show GPU details
How to Run It
Make sure your system:
- Has the NVIDIA driver installed
- Has nvidia-container-toolkit installed
- Uses Docker with the NVIDIA runtime enabled
Then run:
docker compose up
You should see output from nvidia-smi inside the container!
Advanced Configuration Options
Custom GPU Selection
version: "3.9"
services:
gpu-app:
image: nvidia/cuda:12.4.1-base-ubuntu22.04
deploy:
resources:
reservations:
devices:
- capabilities: [gpu]
count: 2 # Use 2 GPUs
driver: nvidia
environment:
- NVIDIA_VISIBLE_DEVICES=0,1 # Only use GPUs 0 and 1
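Compose can also pin specific GPUs by ID instead of by count, which is useful when a service must land on particular devices. A variant of the reservation block above using device_ids (sketch):

    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
              driver: nvidia
              device_ids: ["0", "1"]  # Pin this service to GPUs 0 and 1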
Multi-GPU TensorFlow Example
version: "3.9"
services:
tensorflow-gpu:
image: tensorflow/tensorflow:latest-gpu
ports:
- "8888:8888"
volumes:
- ./notebooks:/tf/notebooks
deploy:
resources:
reservations:
devices:
- capabilities: [gpu]
environment:
- NVIDIA_VISIBLE_DEVICES=all
- TF_FORCE_GPU_ALLOW_GROWTH=true
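To confirm that TensorFlow inside the container actually sees the GPUs, a quick one-off check against the service defined above (sketch):

docker compose run --rm tensorflow-gpu python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# Expected: a non-empty list of PhysicalDevice entries, one per visible GPU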
Testing and Validation
Test GPU Access
# Test basic GPU access
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
# Test CUDA compiler availability (nvcc ships with the -devel image, not the -base image)
docker run --rm --gpus all nvidia/cuda:12.4.1-devel-ubuntu22.04 nvcc --version
# Test PyTorch GPU access
docker run --rm --gpus all pytorch/pytorch:latest python -c "import torch; print(torch.cuda.is_available())"
Performance Benchmarking
# Run the CUDA deviceQuery sample (recent CUDA images no longer bundle the samples,
# so clone and build them inside a -devel image):
docker run --rm --gpus all nvidia/cuda:12.4.1-devel-ubuntu22.04 \
    bash -c "apt-get update && apt-get install -y git make && \
             git clone --depth 1 --branch v12.4 https://github.com/NVIDIA/cuda-samples.git && \
             cd cuda-samples/Samples/1_Utilities/deviceQuery && make && ./deviceQuery"
Troubleshooting Common Issues
Issue: "nvidia-container-runtime: command not found"
Solution: Ensure nvidia-container-toolkit is properly installed and Docker is restarted.
Issue: "docker: Error response from daemon: OCI runtime create failed"
Solution: Check that NVIDIA drivers are loaded and nvidia-container-runtime is configured.
Issue: GPU not visible in container
Solution: Verify the --gpus all flag is present and check the NVIDIA_VISIBLE_DEVICES environment variable.
Issue: CUDA version mismatch
Solution: Ensure container CUDA version is compatible with host NVIDIA driver version.
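A quick way to spot such a mismatch is to compare the CUDA version supported by the host driver with the CUDA version baked into the image:

nvidia-smi  # The "CUDA Version" in the header is the highest CUDA runtime the host driver supports
docker run --rm nvidia/cuda:12.4.1-base-ubuntu22.04 env | grep CUDA_VERSION  # The image's CUDA version must not exceed it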
Production Deployment Considerations
Security
- Use specific GPU device IDs instead of --gpus all in production (see the example after this list)
- Implement resource limits and quotas
- Use read-only filesystems when possible
- Implement proper network isolation
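For example, rather than --gpus all, expose only the devices a workload actually needs:

docker run --rm --gpus device=0 nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi        # Only GPU 0 is visible
docker run --rm --gpus '"device=0,2"' nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi  # Only GPUs 0 and 2 (note the quoting for multiple IDs)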
Performance
- Monitor GPU utilization with nvidia-smi
- Use GPU memory profiling tools
- Implement proper batch sizing
- Consider multi-instance GPU (MIG) for A100/H100
Monitoring
- Set up GPU metrics collection
- Monitor container GPU usage
- Implement alerting for GPU failures
- Track GPU memory utilization
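A lightweight way to track utilization and memory from the host is to poll nvidia-smi in query mode; dedicated exporters (e.g. DCGM) are the more production-grade option:

# Poll GPU and memory utilization every 5 seconds in CSV form (Ctrl-C to stop)
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv -l 5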
Conclusion
Setting up a GPU-accelerated Docker environment doesn't have to be painful. With a structured Ansible approach, you can automate everything:
- Install Docker CE and Docker Compose
- Deploy and validate the correct NVIDIA GPU driver
- Set up the NVIDIA Container Toolkit for GPU container support
- Finalize with NVIDIA Docker 2 for seamless CUDA integration
Whether you're preparing a single server or provisioning dozens of GPU nodes for ML, AI, or data science workloads, this Ansible-powered method guarantees consistency, scalability, and speed.
Now you can run nvidia-smi inside containers like a boss.
Take it further by combining these roles into a full CI/CD pipeline or integrating it with your infrastructure-as-code strategy.
Tags: #AnsibleAutomation #DockerGPU #NvidiaDocker2 #DevOpsTools #GPURuntime #MLInfrastructure #CloudDevOps #CudaDocker #InfrastructureAsCode