How to Install a GPU-Ready Docker Environment on Ubuntu (Step-by-Step + Code + Explanations)

Complete guide to setting up a GPU-ready Docker environment using Ansible automation for ML, AI, and HPC workloads

Quick Navigation

Difficulty: 🟡 Intermediate
Estimated Time: 30-45 minutes
Prerequisites: Basic Linux knowledge, Ansible experience helpful, NVIDIA GPU, Ubuntu/Debian system

What You'll Learn

This tutorial covers essential GPU-ready Docker concepts and tools:

  • Ansible Automation - Complete automation of GPU Docker setup
  • Docker Installation - Installing Docker Engine and Docker Compose
  • NVIDIA Driver Setup - Installing and configuring GPU drivers
  • Container Toolkit - Setting up NVIDIA Container Toolkit
  • GPU Validation - Testing and verifying GPU access in containers
  • Production Configuration - Advanced setup for production workloads

Prerequisites

  • Basic Linux knowledge
  • Ansible experience helpful
  • NVIDIA GPU
  • Ubuntu/Debian system

Introduction

Setting up a GPU-ready Docker environment is essential for ML, AI, and HPC workloads — but doing it manually is time-consuming and error-prone. With Ansible, you can automate the entire process across multiple machines — installing Docker, NVIDIA drivers, NVIDIA container toolkit, and nvidia-docker2 — all in the correct order.

This guide provides:

  • Step-by-step role execution
  • Equivalent shell commands
  • Clear explanations of what each command does

Ansible Role Execution Order

- role: ../roles/docker
- role: ../roles/nvidia_driver
- role: ../roles/nvidia_container_toolkit
- role: ../roles/nvidia_docker
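
For context, a minimal playbook that wires these roles together might look like the sketch below (the gpu_nodes group name and the site.yml filename are illustrative placeholders, not part of the original roles):

# site.yml: applies the roles in the order listed above
- hosts: gpu_nodes
  become: true
  roles:
    - role: ../roles/docker
    - role: ../roles/nvidia_driver
    - role: ../roles/nvidia_container_toolkit
    - role: ../roles/nvidia_docker

You would then run it with something like ansible-playbook -i inventory.ini site.yml -K, pointing at your own inventory.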

Docker Role — Install Docker Engine & Compose

Ansible Tasks

- name: Install Docker CE
  apt:
    name:
      - docker-ce
      - docker-ce-cli
      - containerd.io
    state: present
    update_cache: yes

- name: Install Docker Compose
  get_url:
    url: https://github.com/docker/compose/releases/download/v2.36.2/docker-compose-linux-x86_64
    dest: /usr/bin/docker-compose
    mode: '0755'

Shell Commands + Explanations

sudo apt update  # Updates the package list
sudo apt install -y docker-ce docker-ce-cli containerd.io  # Installs Docker and its core components
wget https://github.com/docker/compose/releases/download/v2.36.2/docker-compose-linux-x86_64 -O docker-compose  # Downloads Docker Compose v2
sudo mv docker-compose /usr/bin/docker-compose  # Moves it to the executable path
sudo chmod +x /usr/bin/docker-compose  # Grants execution permission
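
Note that docker-ce lives in Docker's own APT repository, not in Ubuntu's default sources, so the role is assumed to have registered that repository already. If it hasn't, a minimal sketch following Docker's documented setup looks like this:

sudo install -m 0755 -d /etc/apt/keyrings  # Creates the keyring directory if it doesn't exist
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg  # Fetches and dearmors Docker's GPG key
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" | sudo tee /etc/apt/sources.list.d/docker.list  # Registers Docker's repository
sudo apt update  # Refreshes the package index so docker-ce becomes installable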

NVIDIA Driver Role — Install NVIDIA GPU Driver

Ansible Tasks

- name: Install dependencies
  apt:
    name:
      - build-essential
      - dkms
      - pciutils
      - wget
    state: present

- name: Install NVIDIA Driver
  apt:
    name: "{{ nvidia_driver_package }}"
    state: present

- name: Check if reboot is required
  shell: 'lsmod | grep -i nvidia || echo "reboot_needed"'
  register: nvidia_driver_loaded

- name: Reboot if driver not loaded
  reboot:
    msg: "Rebooting to load NVIDIA driver"
  when: '"reboot_needed" in nvidia_driver_loaded.stdout'

Shell Commands + Explanations

sudo apt install -y build-essential dkms pciutils wget  # Installs build tools and kernel module manager
sudo apt install -y nvidia-driver-535  # Installs the NVIDIA GPU driver (replace with desired version)
lsmod | grep -i nvidia  # Checks if the driver is currently loaded
# If output is empty:
sudo reboot  # Reboot the system so the kernel loads the new driver
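
If you're unsure which driver version to pin (the nvidia_driver_package variable above), Ubuntu's ubuntu-drivers helper can suggest one, and nvidia-smi confirms the result after the reboot. A quick sketch:

sudo apt install -y ubuntu-drivers-common  # Provides the ubuntu-drivers helper
ubuntu-drivers devices  # Lists detected GPUs and the recommended driver package
nvidia-smi  # After the reboot, shows the loaded driver version and your GPU(s)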

NVIDIA Container Toolkit Role — Configure Docker to Use GPU

Ansible Tasks

- name: Download NVIDIA GPG key
  shell: curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

- name: Add NVIDIA container repo
  shell: |
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

- name: Install container toolkit
  apt:
    name: nvidia-container-toolkit
    state: present

- name: Configure Docker daemon
  copy:
    content: |
      {
        "runtimes": {
          "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
          }
        },
        "default-runtime": "nvidia"
      }
    dest: /etc/docker/daemon.json

- name: Restart Docker
  systemd:
    name: docker
    state: restarted

Shell Commands + Explanations

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# Fetches NVIDIA's GPG key and converts it to keyring format for secure APT use

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Adds the NVIDIA container runtime repository

sudo apt update
sudo apt install -y nvidia-container-toolkit  # Installs the toolkit for enabling GPU in Docker

# Set NVIDIA as default runtime
echo '{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}' | sudo tee /etc/docker/daemon.json

sudo systemctl restart docker  # Restart Docker to apply the changes
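
Alternatively, recent toolkit versions ship the nvidia-ctk CLI, which registers the runtime by editing /etc/docker/daemon.json in place rather than overwriting it (setting "default-runtime" is still a manual edit as shown above):

sudo nvidia-ctk runtime configure --runtime=docker  # Adds the nvidia runtime entry while preserving existing daemon.json settings
sudo systemctl restart docker  # Restart Docker to apply the change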

NVIDIA Docker Role — Install NVIDIA Docker 2

Note: nvidia-docker2 is a legacy compatibility package; with Docker 19.03+ the Container Toolkit above plus the --gpus flag is enough. The role is included for setups that still expect the nvidia-docker wrapper.

Ansible Tasks

- name: Add NVIDIA Docker GPG key
  apt_key:
    url: https://nvidia.github.io/nvidia-docker/gpgkey
    state: present

- name: Add NVIDIA Docker repository
  shell: |
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list

- name: Install NVIDIA Docker 2
  apt:
    name: nvidia-docker2
    state: present

- name: Restart Docker
  systemd:
    name: docker
    state: restarted

Shell Commands + Explanations

curl -fsSL https://nvidia.github.io/nvidia-docker/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-docker-keyring.gpg
# Adds the NVIDIA Docker GPG key to validate packages

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-docker-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# Dynamically sets the repo based on your OS version (e.g. ubuntu20.04) and points APT at the keyring created above

sudo apt update
sudo apt install -y nvidia-docker2  # Installs Docker CLI extensions for GPU support
sudo systemctl restart docker

Final Test: Run GPU-Enabled Container

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

Expected Output: A table listing your NVIDIA GPU(s) — confirming that the container can access them.

Docker Compose Example with GPU Support

version: "3.9"

services:
  gpu-app:
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    container_name: cuda-gpu-container
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    command: nvidia-smi

What This Does:

  • image: Uses NVIDIA's official CUDA image
  • runtime: nvidia: Tells Docker to use the NVIDIA runtime
  • devices: capabilities: [gpu]: Requests access to GPU hardware
  • NVIDIA_VISIBLE_DEVICES=all: Exposes all available GPUs
  • command: nvidia-smi: Runs the NVIDIA CLI to show GPU details

How to Run It

Make sure your system:

  • Has the NVIDIA driver installed
  • Has nvidia-container-toolkit installed
  • Uses Docker with the NVIDIA runtime enabled

Then run:

docker compose up

You should see output from nvidia-smi inside the container!

Advanced Configuration Options

Custom GPU Selection

version: "3.9"

services:
  gpu-app:
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
              count: 2  # Use 2 GPUs
              driver: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0,1  # Only use GPUs 0 and 1
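
If you want to pin particular GPUs rather than request a count, the Compose device reservation also accepts device_ids (a sketch of just the deploy block; count and device_ids are mutually exclusive):

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]  # Expose only GPUs 0 and 1 to the service
              capabilities: [gpu]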

Multi-GPU TensorFlow Example

version: "3.9"

services:
  tensorflow-gpu:
    image: tensorflow/tensorflow:latest-gpu
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/tf/notebooks
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - TF_FORCE_GPU_ALLOW_GROWTH=true

Testing and Validation

Test GPU Access

# Test basic GPU access
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

# Test the CUDA compiler (nvcc ships in the -devel image, not the -base image)
docker run --rm --gpus all nvidia/cuda:12.4.1-devel-ubuntu22.04 nvcc --version

# Test PyTorch GPU access
docker run --rm --gpus all pytorch/pytorch:latest python -c "import torch; print(torch.cuda.is_available())"

Performance Benchmarking

# Run the deviceQuery sample as a quick benchmark/sanity check.
# Recent CUDA images no longer bundle the samples, so build it from the cuda-samples
# repository inside a -devel image (the v12.4 tag is assumed to match the image's CUDA version)
docker run --rm --gpus all nvidia/cuda:12.4.1-devel-ubuntu22.04 bash -c "\
  apt-get update && apt-get install -y --no-install-recommends git build-essential && \
  git clone --depth 1 --branch v12.4 https://github.com/NVIDIA/cuda-samples.git && \
  cd cuda-samples/Samples/1_Utilities/deviceQuery && make && ./deviceQuery"

Troubleshooting Common Issues

Issue: "nvidia-container-runtime: command not found"

Solution: Ensure nvidia-container-toolkit is properly installed and Docker is restarted.

Issue: "docker: Error response from daemon: OCI runtime create failed"

Solution: Check that NVIDIA drivers are loaded and nvidia-container-runtime is configured.
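
A quick way to narrow this down is to check each half independently: the host driver and Docker's runtime registration.

nvidia-smi  # On the host: errors out if the NVIDIA driver isn't loaded
docker info | grep -i runtime  # Should list "nvidia" among the available runtimes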

Issue: GPU not visible in container

Solution: Verify --gpus all flag and check NVIDIA_VISIBLE_DEVICES environment variable.

Issue: CUDA version mismatch

Solution: Ensure container CUDA version is compatible with host NVIDIA driver version.

Production Deployment Considerations

Security

  • Use specific GPU device IDs instead of --gpus all in production
  • Implement resource limits and quotas
  • Use read-only filesystems when possible
  • Implement proper network isolation

Performance

  • Monitor GPU utilization with nvidia-smi (see the snippet after this list)
  • Use GPU memory profiling tools
  • Implement proper batch sizing
  • Consider multi-instance GPU (MIG) for A100/H100
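
A minimal way to poll utilization from the host with nvidia-smi (the 5-second interval is just an example):

nvidia-smi --query-gpu=timestamp,name,utilization.gpu,memory.used,memory.total --format=csv -l 5
# Prints one CSV line per GPU every 5 seconds, a handy stopgap before wiring up proper metrics collection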

Monitoring

  • Set up GPU metrics collection
  • Monitor container GPU usage
  • Implement alerting for GPU failures
  • Track GPU memory utilization

Conclusion

Setting up a GPU-accelerated Docker environment doesn't have to be painful. With a structured Ansible approach, you can automate everything:

  • Install Docker CE and Docker Compose
  • Deploy and validate the correct NVIDIA GPU driver
  • Set up the NVIDIA Container Toolkit for GPU container support
  • Finalize with NVIDIA Docker 2 for seamless CUDA integration

Whether you're preparing a single server or provisioning dozens of GPU nodes for ML, AI, or data science workloads, this Ansible-powered method guarantees consistency, scalability, and speed.

Now you can run nvidia-smi inside containers like a boss.

Take it further by combining these roles into a full CI/CD pipeline or integrating it with your infrastructure-as-code strategy.


Tags: #AnsibleAutomation #DockerGPU #NvidiaDocker2 #DevOpsTools #GPURuntime #MLInfrastructure #CloudDevOps #CudaDocker #InfrastructureAsCode