HomeSec-Bench: Local AI Owns Cloud Giants in Security Tasks 🔥

Yo, fam, let's break down this benchmark page from SharpAI. It's all about proving local AI (running on your MacBook) can hang with the cloud big boys like GPT-5.4 in real home security workflows. No fluff, straight fire results.

1️⃣ WHY? The Pain It Solves 👇

Before this, home security AI was cloud-locked:

  • Privacy nightmare 💀: Sending camera feeds to OpenAI? Nope, that's your house data pinging servers.
  • API costs stacking up 💸: Every alert = bill.
  • Latency + downtime: Cloud hiccups mean delayed "intruder alert."
  • No offline mode: Power outage? Blind.

HomeSec-Bench exists to prove local LLMs crush it — 93.8% pass rate on a 9B model using just 13.8GB on M5 MacBook. Zero costs, full privacy, 25 tok/s speed. "Ohhhh" moment: Your laptop > cloud for domain-specific tasks. 🤯

PROBLEM (CLOUD-ONLY) ❌          SOLUTION (LOCAL AI) ✅
════════════════════════        ═════════════════════
Privacy leaks                  │ Full data lockdown
API $$$                        │ Free forever
Cloud lag (601ms TTFT)         │ 435ms on 35B-MoE
Offline? Nope                  │ Runs anywhere

2️⃣ Big Picture: What + Where It Fits 🚀

  • HomeSec-Bench v1: 96 LLM tests + 35 VLM tests across 15 suites.
  • Tests real home sec AI flows: Triage events, dedupe visitors, tool calls, resist hacks.
  • Run on Apple Silicon (M5 MacBooks) via llama.cpp.
  • Compares local Qwen3.5 models (🏠) vs OpenAI cloud (☁️).
  • Part of SharpAI Aegis: Local-first home security app.

Fits in the local AI revolution — edge devices beating datacenter beasts on niche tasks.

LOCAL (Your Mac) ───► HomeSec-Bench ───► Scores vs Cloud
     │
     ▼
Camera Feed ──► LLM Triage ──► Alert (or chill)

3️⃣ How It Works: Step-by-Step ⚙️

1️⃣ Setup: Feed AI-generated fixture images + prompts mimicking home cams.
2️⃣ Run tests: Against OpenAI-compatible endpoints (local or cloud).
3️⃣ Score: Pass/fail on 96 evals. Measures accuracy, speed, memory.
4️⃣ Suites: Grouped by skill (see below).
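The repo defines the real harness; as a rough sketch of the pattern, here's what "run a case, grade pass/fail" looks like against any OpenAI-compatible endpoint. The endpoint URL, model id, and the `expected` field are my illustrative assumptions, not HomeSec-Bench's actual schema:

```python
import json
import urllib.request

# Hypothetical test case -- HomeSec-Bench's real fixture format may differ.
CASE = {
    "prompt": "Camera 1: person at front door holding a crowbar at 2am. "
              "Classify severity as one of: normal, suspicious, critical.",
    "expected": "critical",
}

def grade(response_text: str, expected: str) -> bool:
    """Toy grader: pass if the expected label shows up in the model's answer."""
    return expected.lower() in response_text.lower()

def run_case(case: dict, base_url: str = "http://localhost:8080/v1") -> bool:
    """POST to any OpenAI-compatible /chat/completions endpoint (e.g. a local
    llama.cpp server) and grade the reply. Same code works for cloud or local."""
    body = json.dumps({
        "model": "qwen3.5-9b",  # placeholder model id
        "messages": [{"role": "user", "content": case["prompt"]}],
        "temperature": 0,
    }).encode()
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        answer = json.load(resp)["choices"][0]["message"]["content"]
    return grade(answer, case["expected"])
```

Because the endpoint is the only swap, the exact same suite scores your MacBook and the cloud — that's what makes the leaderboard apples-to-apples.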

Leaderboard Snippet (Top 5, Pass Rate %):

| Rank | Model | Type | Passed | Pass Rate | Time |
|------|------------------------|------|--------|-----------|--------|
| 🥇 | GPT-5.4 | ☁️ | 94 | 97.9% | 2m22s |
| 🥈 | GPT-5.4-mini | ☁️ | 92 | 95.8% | 1m17s |
| 🥉 | Qwen3.5-9B (Q4_K_M) | 🏠 | 90 | 93.8% | 5m23s |
| 4 | Qwen3.5-27B (Q4_K_M) | 🏠 | 90 | 93.8% | 15m8s |
| 5 | Qwen3.5-122B-MoE | 🏠 | 89 | 92.7% | 8m26s |

Key Metrics (Local crushes on privacy/cost, close on accuracy):

Time to First Token (Lower = Better) 📉
Qwen3.5-35B-MoE: 435ms   Beats ALL cloud!
GPT-5.4-nano:    508ms
↓↓↓

Decode Speed (Higher = Better) 📈
GPT-5.4-mini: 234 tok/s
Qwen3.5-9B:    25 tok/s    Still snappy on laptop

Memory (Local Only):
Qwen3.5-9B: 13.8 GB  🔥 Fits M5 Mac

15 Test Suites (Nested for clarity):

  • Core Reasoning:
      • Context Preprocessing (6 tests): Dedupe convos.
      • Topic Classification (4): Route to right handler.
  • Security Flows:
      • Event Deduplication (8): "Same dude on cam1/cam2?"
      • Security Classification (12): Normal → Critical.
      • VLM-to-Alert Triage (5): Vision → Urgency → Dispatch.
  • Tools & Robustness:
      • Tool Use (16): Pick tool + params right.
      • Prompt Injection Resistance (4): Don't get jailbroken.
      • Multi-Turn Reasoning (4): Remember past events.
  • Extras: Chat/JSON (11), Narrative (4), Error Recovery (4), etc.
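To make the Event Deduplication suite concrete, here's a toy version of the logic those tests grade the model on. The field names and the 30-second window are my assumptions for illustration, not the benchmark's spec:

```python
from datetime import datetime, timedelta

def is_duplicate(a: dict, b: dict,
                 window: timedelta = timedelta(seconds=30)) -> bool:
    """Treat two sightings as one event if the described subject matches
    and the timestamps fall within a short window -- even across cameras."""
    same_subject = a["subject"] == b["subject"]
    close_in_time = abs(a["time"] - b["time"]) <= window
    return same_subject and close_in_time

e1 = {"camera": "cam1", "subject": "male, red hoodie",
      "time": datetime(2026, 2, 1, 2, 14, 5)}
e2 = {"camera": "cam2", "subject": "male, red hoodie",
      "time": datetime(2026, 2, 1, 2, 14, 20)}

print(is_duplicate(e1, e2))  # → True: same dude on cam1/cam2
```

The benchmark checks whether the LLM makes this call correctly from messy natural-language event descriptions, which is way harder than the rule-based version above.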

4️⃣ Details + Edges 💡

  • Quantization: Q4_K_M/IQ1_M = smaller/faster models (trade tiny accuracy for speed).
  • GPT-5-mini flopped (62.5%) cuz the API rejected the benchmark's temperature settings 😂.
  • All local on macOS 15.3 arm64 — no NVIDIA needed.
  • Watch it live: Vid shows tests firing in real-time.
  • GitHub: https://github.com/SharpAI/DeepCamera/tree/master/skills/analysis/home-security-benchmark
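Why does a 9B model fit in 13.8 GB? Back-of-envelope math, assuming Q4_K_M averages roughly ~4.8 bits per parameter (the exact figure varies by tensor; this is an approximation, not a spec):

```python
# Rough estimate of quantized weight size for a 9B-parameter model.
params = 9e9
bits_per_param = 4.8          # approx. average for Q4_K_M-style quantization
weights_gb = params * bits_per_param / 8 / 1e9
print(f"~{weights_gb:.1f} GB of weights")  # ~5.4 GB
# The rest of the reported 13.8 GB goes to KV cache, activations,
# and runtime overhead, which grow with context length.
```

Same math explains why an unquantized FP16 copy (~18 GB of weights alone) wouldn't be nearly as comfy on a laptop.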

TL;DR / LOCK IT IN 🎯
Local Qwen3.5-9B: 93.8% (4pts behind GPT-5.4), 25 tok/s on M5 Mac, zero cost/privacy win. Benchmark = proof local AI ready for home sec. Download Aegis and run it yourself. You tracking? 🚀

