HomeSec-Bench: Local AI Owns Cloud Giants in Security Tasks 🔥
Yo, fam, let's break down this benchmark page from SharpAI. It's all about proving local AI (running on your MacBook) can hang with the cloud big boys like GPT-5.4 in real home security workflows. No fluff, straight fire results.
1️⃣ WHY? The Pain It Solves 👇
Before this, home security AI was cloud-locked:
- Privacy nightmare 💀: Sending camera feeds to OpenAI? Nope, that's your house data pinging servers.
- API costs stacking up 💸: Every alert = bill.
- Latency + downtime: Cloud hiccups mean delayed "intruder alert."
- No offline mode: Power outage? Blind.
HomeSec-Bench exists to prove local LLMs crush it — 93.8% pass rate on a 9B model using just 13.8GB on M5 MacBook. Zero costs, full privacy, 25 tok/s speed. "Ohhhh" moment: Your laptop > cloud for domain-specific tasks. 🤯
PROBLEM (CLOUD-ONLY) ❌ SOLUTION (LOCAL AI) ✅
════════════════════════ ═════════════════════
Privacy leaks │ Full data lockdown
API $$$ │ Free forever
Cloud lag (601ms TTFT) │ 435ms on 35B-MoE
Offline? Nope │ Runs anywhere
2️⃣ Big Picture: What + Where It Fits 🚀
- HomeSec-Bench v1: 96 LLM tests + 35 VLM tests across 15 suites.
- Tests real home sec AI flows: Triage events, dedupe visitors, tool calls, resist hacks.
- Run on Apple Silicon (M5 MacBooks) via llama.cpp.
- Compares local Qwen3.5 models (🏠) vs OpenAI cloud (☁️).
- Part of SharpAI Aegis: Local-first home security app.
Fits in the local AI revolution — edge devices beating datacenter beasts on niche tasks.
LOCAL (Your Mac) ───► HomeSec-Bench ───► Scores vs Cloud
│
▼
Camera Feed ──► LLM Triage ──► Alert (or chill)
3️⃣ How It Works: Step-by-Step ⚙️
1️⃣ Setup: Feed AI-generated fixture images + prompts mimicking home cams. 2️⃣ Run tests: Against OpenAI-compatible endpoints (local or cloud). 3️⃣ Score: Pass/fail on 96 evals. Measures accuracy, speed, memory. 4️⃣ Suites: Grouped by skill (see below).
Leaderboard Snippet (Top 5, Pass Rate %):
| Rank | Model | Type | Passed | Pass Rate | Time | |------|------------------------|-------|--------|-----------|---------| | 🥇 | GPT-5.4 | ☁️ | 94 | 97.9% | 2m22s | | 🥈 | GPT-5.4-mini | ☁️ | 92 | 95.8% | 1m17s | | 🥉 | Qwen3.5-9B (Q4_K_M) | 🏠 | 90 | 93.8%| 5m23s | | 4 | Qwen3.5-27B (Q4_K_M) | 🏠 | 90 | 93.8%|15m8s | | 5 | Qwen3.5-122B-MoE | 🏠 | 89 | 92.7% | 8m26s |
Key Metrics (Local crushes on privacy/cost, close on accuracy):
Time to First Token (Lower = Better) 📉
Qwen3.5-35B-MoE: 435ms ← Beats ALL cloud!
GPT-5.4-nano: 508ms
↓↓↓
Decode Speed (Higher = Better) 📈
GPT-5.4-mini: 234 tok/s
Qwen3.5-9B: 25 tok/s ← Still snappy on laptop
Memory (Local Only):
Qwen3.5-9B: 13.8 GB 🔥 Fits M5 Mac
15 Test Suites (Nested for clarity):
- Core Reasoning: • Context Preprocessing (6 tests): Dedupe convos. • Topic Classification (4): Route to right handler.
- Security Flows: • Event Deduplication (8): "Same dude on cam1/cam2?" • Security Classification (12): Normal → Critical. • VLM-to-Alert Triage (5): Vision → Urgency → Dispatch.
- Tools & Robustness: • Tool Use (16): Pick tool + params right. • Prompt Injection Resistance (4): Don't get jailbroken. • Multi-Turn Reasoning (4): Remember past events.
- Extras: • Chat/JSON (11), Narrative (4), Error Recovery (4), etc.
4️⃣ Details + Edges 💡
- Quantization: Q4_K_M/IQ1_M = smaller/faster models (trade tiny accuracy for speed).
- GPT-5-mini flopped (62.5%) cuz API whined about temp settings 😂.
- All local on macOS 15.3 arm64 — no NVIDIA needed.
- Watch it live: Vid shows tests firing in real-time.
- GitHub: https://github.com/SharpAI/DeepCamera/tree/master/skills/analysis/home-security-benchmark
TL;DR / LOCK IT IN 🎯
Local Qwen3.5-9B: 93.8% (4pts behind GPT-5.4), 25 tok/s on M5 Mac, zero cost/privacy win. Benchmark = proof local AI ready for home sec. Download Aegis and run it yourself. You tracking? 🚀