Ollama + Docker GPU Setup — Command Reference
System: Intel i3-9100F · 4 cores · 7.7 GB RAM · AMD Radeon RX 550 (no ROCm support)
Status: Running CPU-only inference
1. Diagnose GPU & System
# Check number of CPU cores
nproc
# CPU model and specs
lscpu | grep -E "Model name|CPU\(s\)|Thread|MHz"
# Available RAM
free -h
# Check for NVIDIA GPU
lspci | grep -i nvidia
# Check for AMD GPU
lspci | grep -i amd
lspci | grep -i radeon
# Check GPU device files (should show card0, renderD128)
ls /dev/dri/
# Check NVIDIA driver on host
nvidia-smi
# Check nvidia-container-toolkit version
nvidia-ctk --version
2. Verify Ollama Container GPU Usage
# Check what model is loaded and on which processor
docker exec -it ollama ollama ps
# Check if GPU is visible inside container (NVIDIA)
docker exec -it ollama nvidia-smi
# Check how container was started (look for DeviceRequests / --gpus)
docker inspect ollama | grep -A5 "DeviceRequests"
# Monitor container resource usage
docker stats ollama
docker stats --no-stream
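The same information is available over the HTTP API; a quick sketch, assuming Ollama is listening on the default port 11434:
# Server version (confirms the API is reachable)
curl http://localhost:11434/api/version
# Models currently loaded in memory (JSON equivalent of "ollama ps")
curl http://localhost:11434/api/ps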
3. Install NVIDIA Drivers (Host)
# Check recommended driver version
sudo apt install ubuntu-drivers-common
sudo ubuntu-drivers devices
# Option A: Auto-install recommended driver
sudo ubuntu-drivers autoinstall
# Option B: Manual install
sudo apt install nvidia-driver-550
# Reboot required after driver install
sudo reboot
# Verify after reboot
nvidia-smi
4. Install NVIDIA Container Toolkit
# Add GPG key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# Add repo (with signed-by pointing at the keyring added above)
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install
sudo apt update && sudo apt install -y nvidia-container-toolkit
# Configure Docker to use NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
# Restart Docker
sudo systemctl restart docker
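To confirm the toolkit works end to end before touching Ollama, a common smoke test is running nvidia-smi in a throwaway CUDA container (the image tag below is just an example; any recent nvidia/cuda base tag works):
# Should print the same GPU table as nvidia-smi on the host
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi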
5. Run Ollama Container
NVIDIA GPU
docker run -d \
--gpus all \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama
AMD GPU (ROCm, RX 5000+ series only)
docker run -d \
--device /dev/kfd \
--device /dev/dri \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama:rocm
CPU Only (optimized)
# Remove any existing container before recreating with new settings
docker stop ollama && docker rm ollama
docker run -d \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
--cpus=$(nproc) \
-e OLLAMA_NUM_THREADS=$(nproc) \
-e OLLAMA_KEEP_ALIVE=5m \
-e OLLAMA_NUM_PARALLEL=1 \
-e OLLAMA_MAX_LOADED_MODELS=1 \
ollama/ollama
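Whichever variant you start, a single non-streaming generate call is a quick smoke test; a sketch assuming a small model such as qwen2.5:1.5b is already pulled:
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:1.5b",
  "prompt": "Say hello in one word.",
  "stream": false
}'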
6. Docker Compose (Recommended)
services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    # Remove the deploy: block on CPU-only hosts (like this system)
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - OLLAMA_NUM_THREADS=4
      - OLLAMA_NUM_PARALLEL=1
      - OLLAMA_KEEP_ALIVE=5m
      - OLLAMA_MAX_LOADED_MODELS=1
      - NVIDIA_VISIBLE_DEVICES=all

volumes:
  ollama:
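Assuming the file above is saved as docker-compose.yml in the current directory:
# Start in the background
docker compose up -d
# Follow logs to confirm startup
docker compose logs -f ollama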
7. Key Environment Variables
| Variable | Description | Default |
|---|---|---|
| OLLAMA_NUM_THREADS | CPU threads to use | All cores |
| OLLAMA_NUM_PARALLEL | Parallel requests per model | 1 |
| OLLAMA_MAX_LOADED_MODELS | Max models in memory at once | 3 × GPU count |
| OLLAMA_KEEP_ALIVE | How long to keep a model loaded | 5m |
| OLLAMA_MAX_QUEUE | Max queued requests | 512 |
| CUDA_VISIBLE_DEVICES | Which NVIDIA GPUs to use | all |
| OLLAMA_GPU_MEMORY_FRACTION | Fraction of VRAM to use | 1.0 |
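To confirm which of these variables actually reached the container:
# List all OLLAMA_* variables set inside the running container
docker exec ollama env | grep '^OLLAMA'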
8. Pull & Run Models
# Pull a model
docker exec -it ollama ollama pull gemma2:2b
# Run interactively
docker exec -it ollama ollama run gemma2:2b
# List downloaded models
docker exec -it ollama ollama list
# Remove a model
docker exec -it ollama ollama rm gemma2:2b
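The same operations are exposed over the HTTP API; for example, listing downloaded models:
# JSON equivalent of "ollama list"
curl http://localhost:11434/api/tags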
9. Recommended Models by Hardware
For CPU / Low RAM (≤ 8 GB, this system)
| Model | RAM Needed | Speed (est.) |
|---|---|---|
| qwen2.5:1.5b | ~1.2 GB | 6–10 tok/s ✅ Best |
| gemma2:2b-instruct-q4_K_M | ~1.5 GB | 4–8 tok/s ✅ |
| phi3.5:mini | ~2.5 GB | ⚠️ Tight |
| Any 7B+ model | 4–5+ GB | ❌ Avoid |
For GPU (VRAM guide)
| VRAM | Max Model Size |
|---|---|
| 6–8 GB | 7B (q4) |
| 10–12 GB | 13B (q4) |
| 16 GB | 13B (q8) / 14B |
| 24 GB | 32B (q4) |
10. Quantization Reference
| Format | RAM Usage | Speed | Quality |
|---|---|---|---|
| q2 | Lowest | Fastest | Lower |
| q4_K_M | Balanced ✅ | Good | Good |
| q8 | High | Slower | Best |
# Pull a specific quantized version
docker exec -it ollama ollama pull gemma2:2b-instruct-q4_K_M
11. Manual Layer Control (Advanced)
Create a Modelfile to force specific GPU layers:
FROM llama3.1:8b
PARAMETER num_gpu 25
PARAMETER num_thread 4
docker exec -it ollama ollama create mymodel -f Modelfile
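Note that ollama create runs inside the container, so the Modelfile must exist there too. One way (paths are examples) is docker cp:
# Copy the Modelfile from the host into the container, then build
docker cp Modelfile ollama:/root/Modelfile
docker exec -it ollama ollama create mymodel -f /root/Modelfile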
⚠️ Manual layer control can reduce VRAM usage but significantly lowers speed. Use only when needed to fit a larger model.
12. Keep Model Loaded in Memory
# Keep model alive for 30 minutes via API
curl http://localhost:11434/api/generate -d '{
"model": "gemma2:2b",
"keep_alive": "30m"
}'
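Conversely, a keep_alive of 0 asks Ollama to unload the model immediately, which frees RAM on a constrained box like this one:
# Unload the model right away
curl http://localhost:11434/api/generate -d '{
  "model": "gemma2:2b",
  "keep_alive": 0
}'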
13. Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| 100% CPU in ollama ps | No GPU passthrough | Recreate container with --gpus all |
| nvidia-smi not found in container | GPU not passed through | Add --gpus all flag |
| nvidia-smi fails on host | Drivers not installed | Install NVIDIA drivers + reboot |
| lspci shows no GPU | No GPU on machine | Use CPU mode or upgrade instance |
| AMD GPU not working | ROCm not supported on old cards | RX 5000+ required for ROCm |
| Out of memory / crash | Model too large for RAM | Use smaller/quantized model |
| Slow inference | Swap being used | Free RAM, use smaller model (see swap check below) |
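For the "slow inference" case, check whether the system is actually swapping:
# Nonzero swap "used" while a model is loaded usually means the model does not fit in RAM
free -h
swapon --show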
14. AMD ROCm GPU Compatibility
❌ RX 550 (Lexa/GCN 4.0) is NOT supported by ROCm
| Generation | Example Cards | ROCm Support |
|---|---|---|
| GCN 4.0 | RX 550, RX 560, RX 580 | ❌ No |
| GCN 5.0 / RDNA 1 | RX 5500, RX 5700 | ✅ Yes |
| RDNA 2 | RX 6600, RX 6700, RX 6900 | ✅ Yes |
| RDNA 3 | RX 7600, RX 7900 | ✅ Yes |
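If ROCm is installed, rocminfo reports the gfx target it sees for each card; a rough check (output format may vary by ROCm version):
# GCN 4.0 cards such as the RX 550 report a gfx8xx target, which current ROCm builds do not support
rocminfo | grep -i gfx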