Fix NVIDIA Driver + Install Qwen3.5 27B Q8_0 via Ollama

Context

Run Qwen3.5 27B locally at Q8_0 quantization, split across 2x RTX 3090. NVIDIA driver modules are mismatched with the running kernel — must fix first.

Hardware Layout

┌─────────────────────────────────────────────────────┐
│                    HOST SYSTEM                      │
│   CPU: x86_64    RAM: 128 GB    Ollama 0.17.5       │
├─────────────────────────────────────────────────────┤
│                                                     │
│  ┌──────────────────┐    ┌──────────────────┐       │
│  │  GPU 0 (65:00)   │    │  GPU 1 (b3:00)   │       │
│  │  RTX 3090        │    │  RTX 3090        │       │
│  │  24 GB VRAM      │    │  24 GB VRAM      │       │
│  │                  │    │                  │       │
│  │  ┌────────────┐  │    │  ┌────────────┐  │       │
│  │  │ Q8_0 Layers│  │    │  │ Q8_0 Layers│  │       │
│  │  │   ~14 GB   │  │    │  │   ~14 GB   │  │       │
│  │  └────────────┘  │    │  └────────────┘  │       │
│  └────────┬─────────┘    └────────┬─────────┘       │
│           │                       │                 │
│           └─────────┐  ┌──────────┘                 │
│                      │  │                           │
│               ┌─────┴──┴─────┐                      │
│               │    Ollama    │                      │
│               │ Tensor Split │                      │
│               │  (automatic) │                      │
│               └──────────────┘                      │
└─────────────────────────────────────────────────────┘
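Ollama splits layers across all visible CUDA devices on its own. If the host ever gains more GPUs and the split should stay on these two cards, the standard CUDA variable can be set on the service; a minimal sketch, assuming the stock systemd install:

sudo systemctl edit ollama
# in the override, add:
#   [Service]
#   Environment="CUDA_VISIBLE_DEVICES=0,1"
sudo systemctl restart ollama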

Current Problem

Kernel:  6.17.0-19-generic  ◄── running
Modules: 6.14.0-27-generic  ◄── installed (MISMATCH!)
                                  ╰─► nvidia-smi FAILS
                                  ╰─► GPUs invisible to Ollama
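To confirm the mismatch by hand (standard Ubuntu tooling, nothing NVIDIA-specific assumed):

uname -r                                      # running kernel (6.17.0-19-generic)
ls /lib/modules                               # kernels that have module trees installed
dpkg -l 'linux-modules-nvidia-*' | grep ^ii   # prebuilt NVIDIA module packages present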

Step 1: Fix NVIDIA Driver

Install the NVIDIA kernel modules that match the running kernel (apt install upgrades the named packages if present and pulls in the missing module build for 6.17):

sudo apt update
sudo apt install -y nvidia-driver-550 linux-modules-nvidia-550-generic-hwe-24.04
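If apt reports that no prebuilt module package exists for the 6.17 kernel, the DKMS variant rebuilds the modules against whichever kernel is booted (package name assumes the 550 driver series used above):

sudo apt install -y nvidia-dkms-550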

Reboot to load new modules:

sudo reboot

Verify after reboot:

nvidia-smi

Expected output (abridged; both GPUs visible):

+-------------------------+-------------------------+
| GPU 0: RTX 3090         | GPU 1: RTX 3090         |
| 24576 MiB VRAM          | 24576 MiB VRAM          |
| Driver: 550.163.01      | Driver: 550.163.01      |
+-------------------------+-------------------------+
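Beyond nvidia-smi, it is worth checking that Ollama's startup scan found both cards; this assumes Ollama runs under systemd, as elsewhere in this note:

nvidia-smi -L                                      # should list two RTX 3090s
journalctl -u ollama -b --no-pager | grep -i gpu   # GPU discovery lines from the current boot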

Step 2: Pull Qwen3.5 27B Q8_0 via Ollama

Check existing models first, then pick the path that matches what the Ollama library offers:

ollama list    # check existing models

Option A: Ollama library has it

ollama pull qwen3.5:27b-q8_0

Option B: Custom GGUF from HuggingFace (if the Q8_0 tag isn't in the library)

  1. Download the .gguf file (~28 GB; see the download sketch below)
  2. Create a Modelfile:
FROM ./qwen3.5-27b-q8_0.gguf
  3. Build the model:
ollama create qwen3.5-27b:q8_0 -f Modelfile
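A minimal sketch for the download in step 1, using huggingface-cli; the repo path below is a placeholder, substitute the actual HuggingFace repo hosting the Q8_0 GGUF:

# <user>/<repo> is hypothetical; replace with the real repo id
huggingface-cli download <user>/<repo> qwen3.5-27b-q8_0.gguf --local-dir .

After ollama create, confirm the model registered:

ollama list
ollama show qwen3.5-27b:q8_0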

Step 3: Run and Verify

ollama run qwen3.5:27b-q8_0 "Hello, what model are you?"

Verify the GPU split in a second terminal while the model is running:

# Check GPU memory usage:
nvidia-smi

# Check Ollama logs for layer distribution:
journalctl -u ollama --no-pager | tail -30
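Ollama can also report the placement of the loaded model directly (run while the model is still loaded):

ollama ps    # the PROCESSOR column should read 100% GPU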

Expected GPU memory usage during inference:

┌──────────────────┐    ┌──────────────────┐
│   GPU 0: 3090    │    │   GPU 1: 3090    │
│  ~14 GB / 24 GB  │    │  ~14 GB / 24 GB  │
│  ████████░░░░    │    │  ████████░░░░    │
│  ~58% VRAM used  │    │  ~58% VRAM used  │
└──────────────────┘    └──────────────────┘
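For a live view while tokens are generating, poll nvidia-smi once a second:

watch -n 1 nvidia-smi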

Verification Checklist

  1. nvidia-smi shows both GPUs after reboot
  2. ollama pull completes successfully (~28 GB download)
  3. ollama run responds correctly with GPU acceleration
  4. Both GPUs show VRAM usage during inference