Fix NVIDIA Driver + Install Qwen3.5 27B Q8_0 via Ollama

Context

Run Qwen3.5 27B locally at Q8_0 quantization, split across 2x RTX 3090. NVIDIA driver modules are mismatched with the running kernel — must fix first.

Hardware Layout

┌─────────────────────────────────────────────────────┐
│                    HOST SYSTEM                      │
│   CPU: x86_64    RAM: 128 GB    Ollama 0.17.5       │
├─────────────────────────────────────────────────────┤
│                                                     │
│  ┌──────────────────┐    ┌──────────────────┐       │
│  │  GPU 0 (65:00)   │    │  GPU 1 (b3:00)   │       │
│  │  RTX 3090        │    │  RTX 3090        │       │
│  │  24 GB VRAM      │    │  24 GB VRAM      │       │
│  │                  │    │                  │       │
│  │  ┌────────────┐  │    │  ┌────────────┐  │       │
│  │  │ Q8_0 Layers│  │    │  │ Q8_0 Layers│  │       │
│  │  │   ~14 GB   │  │    │  │   ~14 GB   │  │       │
│  │  └────────────┘  │    │  └────────────┘  │       │
│  └────────┬─────────┘    └────────┬─────────┘       │
│           │                       │                 │
│           └─────────┐  ┌──────────┘                 │
│                      │  │                           │
│               ┌─────┴──┴─────┐                      │
│               │    Ollama    │                      │
│               │ Tensor Split │                      │
│               │  (automatic) │                      │
│               └──────────────┘                      │
└─────────────────────────────────────────────────────┘
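Ollama splits layers across all visible CUDA devices on its own. If the host ever gains more GPUs and the split should stay on these two cards, the standard CUDA variable can be set on the service; a minimal sketch, assuming the stock systemd install:

sudo systemctl edit ollama
# in the override, add:
#   [Service]
#   Environment="CUDA_VISIBLE_DEVICES=0,1"
sudo systemctl restart ollama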

Current Problem

Kernel:  6.17.0-19-generic  ◄── running
Modules: 6.14.0-27-generic  ◄── installed (MISMATCH!)
                                  ╰─► nvidia-smi FAILS
                                  ╰─► GPUs invisible to Ollama
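To confirm the mismatch by hand (standard Ubuntu tooling, nothing NVIDIA-specific assumed):

uname -r                                      # running kernel (6.17.0-19-generic)
ls /lib/modules                               # kernels that have module trees installed
dpkg -l 'linux-modules-nvidia-*' | grep ^ii   # prebuilt NVIDIA module packages present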

Step 1: Fix NVIDIA Driver

Install the NVIDIA kernel modules that match the running kernel (apt install upgrades the named packages if present and pulls in the missing module build for 6.17):

sudo apt update
sudo apt install -y nvidia-driver-550 linux-modules-nvidia-550-generic-hwe-24.04
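If apt reports that no prebuilt module package exists for the 6.17 kernel, the DKMS variant rebuilds the modules against whichever kernel is booted (package name assumes the 550 driver series used above):

sudo apt install -y nvidia-dkms-550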

Reboot to load new modules:

sudo reboot

Verify after reboot:

nvidia-smi

Expected output (abridged; both GPUs visible):

+-------------------------+-------------------------+
| GPU 0: RTX 3090         | GPU 1: RTX 3090         |
| 24576 MiB VRAM          | 24576 MiB VRAM          |
| Driver: 550.163.01      | Driver: 550.163.01      |
+-------------------------+-------------------------+
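Beyond nvidia-smi, it is worth checking that Ollama's startup scan found both cards; this assumes Ollama runs under systemd, as elsewhere in this note:

nvidia-smi -L                                      # should list two RTX 3090s
journalctl -u ollama -b --no-pager | grep -i gpu   # GPU discovery lines from the current boot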

Step 2: Pull Qwen3.5 27B Q8_0 via Ollama

Check existing models first, then pick the path that matches what the Ollama library offers:

ollama list    # check existing models

Option A: Ollama library has it

ollama pull qwen3.5:27b-q8_0

Option B: Custom GGUF from HuggingFace (if the Q8_0 tag isn't in the library)

  1. Download the .gguf file (~28 GB; see the download sketch below)
  2. Create a Modelfile:
FROM ./qwen3.5-27b-q8_0.gguf
  3. Build the model:
ollama create qwen3.5-27b:q8_0 -f Modelfile
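A minimal sketch for the download in step 1, using huggingface-cli; the repo path below is a placeholder, substitute the actual HuggingFace repo hosting the Q8_0 GGUF:

# <user>/<repo> is hypothetical; replace with the real repo id
huggingface-cli download <user>/<repo> qwen3.5-27b-q8_0.gguf --local-dir .

After ollama create, confirm the model registered:

ollama list
ollama show qwen3.5-27b:q8_0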

Step 3: Run and Verify

ollama run qwen3.5:27b-q8_0 "Hello, what model are you?"

Verify the GPU split in a second terminal while the model is running:

# Check GPU memory usage:
nvidia-smi

# Check Ollama logs for layer distribution:
journalctl -u ollama --no-pager | tail -30
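Ollama can also report the placement of the loaded model directly (run while the model is still loaded):

ollama ps    # the PROCESSOR column should read 100% GPU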

Expected GPU memory usage during inference:

┌──────────────────┐    ┌──────────────────┐
│   GPU 0: 3090    │    │   GPU 1: 3090    │
│  ~14 GB / 24 GB  │    │  ~14 GB / 24 GB  │
│  ████████░░░░    │    │  ████████░░░░    │
│  ~58% VRAM used  │    │  ~58% VRAM used  │
└──────────────────┘    └──────────────────┘
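For a live view while tokens are generating, poll nvidia-smi once a second:

watch -n 1 nvidia-smi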

Verification Checklist

  1. nvidia-smi shows both GPUs after reboot
  2. ollama pull completes successfully (~28 GB download)
  3. ollama run responds correctly with GPU acceleration
  4. Both GPUs show VRAM usage during inference