Homelab, 17 April 2026

Ollama

Ollama + Docker GPU Setup — Command Reference

System: Intel i3-9100F · 4 cores · 7.7GB RAM · AMD Radeon RX 550 (no ROCm support)
Status: Running CPU-only inference


1. Diagnose GPU & System

# Check number of CPU cores
nproc

# CPU model and specs
lscpu | grep -E "Model name|CPU\(s\)|Thread|MHz"

# Available RAM
free -h

# Check for NVIDIA GPU
lspci | grep -i nvidia

# Check for AMD GPU
lspci | grep -i amd
lspci | grep -i radeon

# Check GPU device files (should show card0, renderD128)
ls /dev/dri/

# Check NVIDIA driver on host
nvidia-smi

# Check nvidia-container-toolkit version
nvidia-ctk --version

2. Verify Ollama Container GPU Usage

# Check what model is loaded and on which processor
docker exec -it ollama ollama ps

# Check if GPU is visible inside container (NVIDIA)
docker exec -it ollama nvidia-smi

# Check how container was started (look for DeviceRequests / --gpus)
docker inspect ollama | grep -A5 "DeviceRequests"

# Monitor container resource usage
docker stats ollama
docker stats --no-stream
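
Ollama's startup logs also say whether a GPU was detected; grepping them is a rough check (log wording varies by version, so treat the pattern as an approximation).

# Scan the container logs for GPU detection messages
docker logs ollama 2>&1 | grep -iE "gpu|cuda|rocm"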

3. Install NVIDIA Drivers (Host)

# Check recommended driver version
sudo apt install ubuntu-drivers-common
sudo ubuntu-drivers devices

# Option A: Auto-install recommended driver
sudo ubuntu-drivers autoinstall

# Option B: Manual install
sudo apt install nvidia-driver-550

# Reboot required after driver install
sudo reboot

# Verify after reboot
nvidia-smi
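
If nvidia-smi still fails after the reboot, confirm the kernel module actually loaded (a minimal check, not a full diagnosis).

# The module should be listed once the driver is active
lsmod | grep nvidia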

4. Install nvidia-container-toolkit (NVIDIA only)

# Add GPG key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

# Add repo
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install
sudo apt update && sudo apt install -y nvidia-container-toolkit

# Configure Docker to use NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker

# Restart Docker
sudo systemctl restart docker
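
Before recreating the Ollama container, a quick end-to-end check of the toolkit is to run nvidia-smi in a throwaway container (assumes the host driver from step 3 already works).

# Sanity check: the runtime should inject the driver utilities into the container
docker run --rm --gpus all ubuntu nvidia-smi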

5. Run Ollama Container

NVIDIA GPU

docker run -d \
  --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

AMD GPU (ROCm — RX 5000+ series only)

docker run -d \
  --device /dev/kfd \
  --device /dev/dri \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama:rocm

CPU Only (optimized)

docker stop ollama && docker rm ollama

docker run -d \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  --cpus=$(nproc) \
  -e OLLAMA_NUM_THREADS=$(nproc) \
  -e OLLAMA_KEEP_ALIVE=5m \
  -e OLLAMA_NUM_PARALLEL=1 \
  -e OLLAMA_MAX_LOADED_MODELS=1 \
  ollama/ollama
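
Whichever variant you start, a quick way to confirm the server is reachable on the published port:

# The API should answer with its version string
curl http://localhost:11434/api/version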

6. Docker Compose (NVIDIA GPU)

services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - OLLAMA_NUM_THREADS=4
      - OLLAMA_NUM_PARALLEL=1
      - OLLAMA_KEEP_ALIVE=5m
      - OLLAMA_MAX_LOADED_MODELS=1
      - NVIDIA_VISIBLE_DEVICES=all

volumes:
  ollama:
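
Assuming the file above is saved as docker-compose.yml in the current directory:

# Start the stack and follow the logs
docker compose up -d
docker compose logs -f ollama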

7. Key Environment Variables

Variable                      Description                      Default
OLLAMA_NUM_THREADS            CPU threads to use               All cores
OLLAMA_NUM_PARALLEL           Parallel requests per model      1
OLLAMA_MAX_LOADED_MODELS      Max models in memory at once     3× GPUs
OLLAMA_KEEP_ALIVE             How long to keep model loaded    5m
OLLAMA_MAX_QUEUE              Max queued requests              512
CUDA_VISIBLE_DEVICES          Which NVIDIA GPUs to use         all
OLLAMA_GPU_MEMORY_FRACTION    Fraction of VRAM to use          1.0
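
To check which of these are actually set in the running container (only variables passed explicitly at docker run time will show up):

# List Ollama-related environment variables inside the container
docker exec ollama env | grep OLLAMA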

8. Pull & Run Models

# Pull a model
docker exec -it ollama ollama pull gemma2:2b

# Run interactively
docker exec -it ollama ollama run gemma2:2b

# List downloaded models
docker exec -it ollama ollama list

# Remove a model
docker exec -it ollama ollama rm gemma2:2b
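
Models can also be exercised without an interactive shell through the HTTP API; a minimal non-streaming request (assuming gemma2:2b is already pulled):

# One-shot generation via the API
curl http://localhost:11434/api/generate -d '{
  "model": "gemma2:2b",
  "prompt": "Say hello in one sentence.",
  "stream": false
}'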

9. Model Recommendations

For CPU / Low RAM (≤ 8GB) — This System

Model                        RAM Needed   Speed (est.)
qwen2.5:1.5b                 ~1.2 GB      6–10 tok/s ✅ Best
gemma2:2b-instruct-q4_K_M    ~1.5 GB      4–8 tok/s ✅
phi3.5:mini                  ~2.5 GB      ⚠️ Tight
Any 7B+ model                4–5+ GB      ❌ Avoid
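
To get real tok/s numbers on this box instead of estimates, ollama run can print timing stats; the --verbose flag is assumed here, so check your CLI version if it errors.

# Benchmark a small model and print eval rate stats
docker exec -it ollama ollama run qwen2.5:1.5b --verbose "Summarize what a homelab is in two sentences."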

For GPU (VRAM guide)

VRAM       Max Model Size
6–8 GB     7B (q4)
10–12 GB   13B (q4)
16 GB      13B (q8) / 14B
24 GB      32B (q4)

10. Quantization Reference

Format    RAM Usage     Speed     Quality
q2        Lowest        Fastest   Lower
q4_K_M    Balanced ✅    Good      Good
q8        High          Slower    Best

# Pull a specific quantized version
docker exec -it ollama ollama pull gemma2:2b-instruct-q4_K_M
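
To confirm which quantization a downloaded model actually uses:

# Show model details, including the quantization level
docker exec -it ollama ollama show gemma2:2b-instruct-q4_K_M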

11. Manual Layer Control (Advanced)

Create a Modelfile to force specific GPU layers:

# Modelfile
FROM llama3.1:8b
PARAMETER num_gpu 25
PARAMETER num_thread 4

# Copy the Modelfile into the container, then build the model from it
docker cp Modelfile ollama:/tmp/Modelfile
docker exec -it ollama ollama create mymodel -f /tmp/Modelfile

⚠️ Manual layer control can reduce VRAM usage but significantly lowers speed. Use only when needed to fit a larger model.
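
After building the model, run it once and check ollama ps; the PROCESSOR column shows how the layers were split between CPU and GPU, which is the quickest way to confirm the num_gpu setting took effect.

# Load the custom model, then check the CPU/GPU split
docker exec -it ollama ollama run mymodel "hi"
docker exec -it ollama ollama ps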


12. Keep Model Loaded in Memory

# Keep model alive for 30 minutes via API
curl http://localhost:11434/api/generate -d '{
  "model": "gemma2:2b",
  "keep_alive": "30m"
}'
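
Setting keep_alive to 0 in the same request unloads the model immediately, which is worth knowing on a RAM-constrained box like this one.

# Unload the model right away
curl http://localhost:11434/api/generate -d '{
  "model": "gemma2:2b",
  "keep_alive": 0
}'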

13. Troubleshooting

Problem                              Cause                             Fix
100% CPU in ollama ps                No GPU passthrough                Recreate container with --gpus all
nvidia-smi not found in container    GPU not passed through            Add --gpus all flag
nvidia-smi fails on host             Drivers not installed             Install NVIDIA drivers + reboot
lspci shows no GPU                   No GPU on machine                 Use CPU mode or upgrade instance
AMD GPU not working                  ROCm not supported on old cards   RX 5000+ required for ROCm
Out of memory / crash                Model too large for RAM           Use smaller/quantized model
Slow inference                       Swap being used                   Free RAM, use smaller model
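
For the "Slow inference" row, a quick check of whether the host is dipping into swap during generation usually settles it.

# Watch memory and swap while a prompt is running
free -h
vmstat 1 5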

14. AMD ROCm GPU Compatibility

RX 550 (Lexa/GCN 4.0) is NOT supported by ROCm

Generation          Example Cards               ROCm Support
GCN 4.0             RX 550, RX 560, RX 580      ❌ No
GCN 5.0 / RDNA 1    RX 5500, RX 5700            ✅ Yes
RDNA 2              RX 6600, RX 6700, RX 6900   ✅ Yes
RDNA 3              RX 7600, RX 7900            ✅ Yes
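
To identify the exact chip and its PCI IDs before comparing against AMD's ROCm support list:

# Show the GPU's model name and PCI vendor/device IDs
lspci -nn | grep -Ei "vga|3d|display"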