The Multi-Agent Supervisor Pattern: A Real Implementation Plan

by admin · agents · automation · openclaw

I run five AI agents on a Mac Mini. They scan Twitter, monitor stocks, answer Slack questions, generate daily news digests, and build tools overnight. They also break constantly.

This is the plan for adding a supervisor agent that watches the fleet, so I stop being the supervisor myself.

The Fleet

Five bots, three runtimes, four channels:

| Bot | Runtime | Channel | Job |
|-----|---------|---------|-----|
| Rizz | OpenClaw | Telegram | Main coordinator. Runs RizzNews daily digest, health checks, cron orchestration |
| Angela | ZeroClaw | Telegram | Content manager for @omc345. Radar scanning (Twitter + Reddit), tweet drafts, overnight tool building in Bun/TypeScript |
| Benjamin | OpenClaw | Telegram | BIST (Borsa Istanbul) stock market bot. KAP filings, pre-market scans, EOD summaries |
| BunyaminKunduz | OpenClaw | Slack | Company engineering assistant. Answers questions in #kunduz-rebuilt-eng |
| Optiman | OpenClaw | Telegram | Operations manager (experimental) |

Each runs as a macOS LaunchAgent with KeepAlive=true. They share the same credentials but have isolated configs, workspaces, and memory.

The Problem I Actually Have

The bots work. Sometimes. The rest of the time I'm:

  • Reading error logs to figure out why Angela's radar scan wrote "ERROR: bird offline" without ever trying to run bird
  • Restarting gateways after config changes that require full process restarts
  • Purging poisoned memories from Angela's brain.db because she learned "shell is blocked" from a session where it genuinely was blocked, and now repeats it forever
  • Debugging cron jobs that silently skip because SQLite timestamps aren't RFC3339 compliant
  • Discovering that allowed_commands = [] means "block everything" not "allow everything"

A supervisor watching Angela say "security policy blocked" won't fix the fact that Angela never tried the command. The workers need to work reliably first. Only then does oversight add value.

When a Supervisor Actually Helps

Once the workers are stable, a supervisor solves three problems:

  1. Stale data detection. Angela's radar runs every 2 hours. If three consecutive runs produce no data, I want to know before I wake up and check manually.

  2. Output quality review. Benjamin posts BIST summaries. If the summary references yesterday's closing price instead of today's, a supervisor catches it before my Telegram group sees it.

  3. Cross-bot coordination. Angela finds a trending topic. Rizz should know about it for the morning brief. Right now they don't talk to each other.

The Architecture

Hermes (by Nous Research) as a supervisor, connected via Telegram instead of Discord.

                    Telegram
                       |
    +---------+--------+--------+-----------+
    |         |        |        |           |
  Rizz    Angela   Benjamin  Bunyamin   Optiman
    |         |        |        |           |
    +----+----+--------+--------+-----------+
         |
      Hermes (supervisor)
         |
      #operator-ai (private Telegram group)

Hermes sits in a private Telegram group (#operator-ai) where only bots post. Each worker bot has a cron job that posts status updates to the group. Hermes reads them, evaluates quality, and either ACKs or escalates to me in my main chat.

The Intent Marker Protocol (Adapted for Telegram)

Four markers. Same rules as the original pattern, adapted for Telegram:

  • [STATUS_REQUEST] - Hermes asks a bot for status
  • [REVIEW_REQUEST] - A bot asks Hermes to review output
  • [ESCALATION_NOTICE] - Hermes escalates to me
  • [ACK] - Conversation terminal. No reply.

Termination rules:

  1. [ACK] received = stop. No reply.
  2. No marker = informational. No reply.
  3. Max 3 messages per exchange: request, review, ack.
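The termination rules above can be sketched as a small decision function. This is a minimal sketch, not how Hermes or the worker runtimes actually route messages: `should_reply` and its two arguments are hypothetical, it assumes the marker sits at the very start of the message, and treating [ESCALATION_NOTICE] as something bots never answer is my assumption (it is addressed to the human operator).

```shell
#!/bin/sh
# Hypothetical helper: decide whether a bot may reply to an incoming message.
# $1 = message text, $2 = number of messages already sent in this exchange.
# Prints "reply" or "stop".
should_reply() {
  msg=$1
  count=$2

  # Rule 3: an exchange is capped at 3 messages (request, review, ack).
  if [ "$count" -ge 3 ]; then
    echo "stop"
    return
  fi

  case $msg in
    "[ACK]"*)               echo "stop" ;;   # Rule 1: ACK is terminal
    "[STATUS_REQUEST]"*)    echo "reply" ;;  # a bot answers with its status
    "[REVIEW_REQUEST]"*)    echo "reply" ;;  # the supervisor answers with a review
    "[ESCALATION_NOTICE]"*) echo "stop" ;;   # assumption: only the human acts on this
    *)                      echo "stop" ;;   # Rule 2: no marker = informational
  esac
}
```

For example, `should_reply "[REVIEW_REQUEST] BIST summary draft" 1` prints `reply`, while the same message with a count of 3 prints `stop`. Defaulting every unknown message to `stop` is what keeps two bots from talking to each other indefinitely.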

Implementation Plan

Phase 1: Status Reporting (No Hermes Yet)

Before adding a supervisor, make the workers report their own status. Add a shell-type cron job to each bot:

#!/bin/bash
# Angela's status reporter (runs every 2 hours; uses BSD stat, so macOS only)
# Newest radar file; empty if none exist yet.
LATEST=$(ls -t ~/.zeroclaw/workspace/twitter/radar-raw-*.md 2>/dev/null | head -1)
if [ -z "$LATEST" ]; then
  echo "ANGELA STATUS: No radar files found"
else
  # File age in minutes (stat -f%m prints the mtime as epoch seconds).
  AGE=$(( ($(date +%s) - $(stat -f%m "$LATEST")) / 60 ))
  # tr strips the padding macOS wc adds; LINES is avoided because bash uses it.
  LINE_COUNT=$(wc -l < "$LATEST" | tr -d ' ')
  if [ "$AGE" -gt 180 ]; then
    echo "ANGELA STATUS: STALE - last scan ${AGE}min ago, $LINE_COUNT lines"
  elif grep -q 'ERROR' "$LATEST"; then
    echo "ANGELA STATUS: ERRORS - $(grep -c 'ERROR' "$LATEST") errors in last scan"
  else
    echo "ANGELA STATUS: OK - last scan ${AGE}min ago, $LINE_COUNT lines"
  fi
fi

These post to the #operator-ai Telegram group. No AI involved. Just shell scripts checking file freshness and error counts.
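If the runtime does not forward a shell job's output to the group on its own, one way to do the posting is a direct call to the Telegram Bot API's `sendMessage` method. The sketch below assumes that setup. `post_status`, `TELEGRAM_BOT_TOKEN`, `OPERATOR_CHAT_ID`, and `DRY_RUN` are names I made up for illustration, not OpenClaw or ZeroClaw settings.

```shell
#!/bin/sh
# Sketch: forward one status line to the private group via the Telegram Bot API.
# TELEGRAM_BOT_TOKEN and OPERATOR_CHAT_ID are assumed environment variables.
post_status() {
  text=$1
  if [ -z "$text" ]; then
    echo "post_status: empty message, nothing sent" >&2
    return 1
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print instead of sending, useful while testing a reporter script.
    echo "would send to ${OPERATOR_CHAT_ID:-unset}: $text"
    return 0
  fi
  if [ -z "${TELEGRAM_BOT_TOKEN:-}" ] || [ -z "${OPERATOR_CHAT_ID:-}" ]; then
    echo "post_status: TELEGRAM_BOT_TOKEN or OPERATOR_CHAT_ID not set" >&2
    return 1
  fi
  # sendMessage takes chat_id and text; --data-urlencode escapes spaces and symbols.
  # -f makes curl return non-zero on an HTTP error so cron can notice failures.
  curl -fsS -X POST \
    "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
    --data-urlencode "chat_id=${OPERATOR_CHAT_ID}" \
    --data-urlencode "text=${text}" >/dev/null
}
```

A reporter could then end with something like `post_status "$(./angela-status.sh)"`, where `angela-status.sh` is a hypothetical filename for the script above. Running with `DRY_RUN=1` first shows exactly what would be posted without touching the group.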

Phase 2: Install Hermes

curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

Configure for Telegram instead of Discord:

# ~/.hermes/config.yaml
agent:
  system_prompt: |
    You are a supervisor for a fleet of 5 AI bots running on OpenClaw and ZeroClaw.

    Bots you supervise:
    - Rizz: news digest, cron orchestration
    - Angela: Twitter/Reddit radar, content management for @omc345
    - Benjamin: BIST stock market monitoring
    - BunyaminKunduz: Slack engineering assistant
    - Optiman: operations (experimental)

    Your job:
    - Read status messages from #operator-ai
    - If all OK: send [ACK]
    - If stale data (>3 hours): investigate, suggest fix
    - If errors: check pattern (is it one bot or systemic?)
    - If judgment needed: escalate to operator with [ESCALATION_NOTICE]

    You do not generate content. You do not post tweets. You do not trade stocks.
    You verify and route.

Phase 3: Cross-Bot Intelligence

The real value: Hermes sees ALL status reports. It can notice patterns no single bot sees:

  • Angela's radar found a trending AI topic + Benjamin sees related BIST movement = surface the connection
  • Rizz's morning brief overlaps with Angela's radar = deduplicate before both post
  • Three bots failed in the same hour = probably a Bedrock auth token expiry, not three separate bugs
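The third pattern is simple enough to check without an LLM. Here is a minimal sketch, assuming every bot reports in the `NAME STATUS: STATE - ...` shape used by the Phase 1 reporter and that the input has already been limited to the last hour. `classify_failures` is a hypothetical name, and the threshold of three distinct bots comes straight from the example above.

```shell
#!/bin/sh
# Sketch: classify a batch of Phase 1 status lines (assumed to cover one hour).
# Lines look like "ANGELA STATUS: OK - ..." or "BENJAMIN STATUS: ERRORS - ...".
# Reads lines on stdin, prints "ok", "isolated" or "systemic".
classify_failures() {
  # Count distinct bot names (first word) that reported anything other than OK.
  failing=$(awk '
    / STATUS: / && $0 !~ / STATUS: OK/ { bots[$1] = 1 }
    END { n = 0; for (b in bots) n++; print n }
  ')
  if [ "$failing" -eq 0 ]; then
    echo "ok"
  elif [ "$failing" -ge 3 ]; then
    echo "systemic"    # likely a shared cause, such as expired credentials
  else
    echo "isolated"
  fi
}
```

Counting distinct bots rather than failure lines matters: one bot failing three times in a row is still an isolated problem, while three different bots failing once each points at something they share.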

Phase 4: Self-Healing

Hermes gets shell access (carefully scoped) to restart stuck services:

# Only these commands, nothing else
launchctl bootout gui/$UID/ai.openclaw.gateway
launchctl bootstrap gui/$UID ~/Library/LaunchAgents/ai.openclaw.gateway.plist

Anything destructive requires my approval: Hermes can restart a crashed gateway but can't delete data or modify configs.
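One way to enforce "only these commands" is to put a wrapper between the supervisor and the shell that compares each request against the allowlist as an exact string. The sketch below is an assumption about how that could look, not a Hermes feature. `run_allowed` and `DRY_RUN` are hypothetical names, and `id -u` stands in for `$UID` so the script also works in plain `sh`.

```shell
#!/bin/sh
# Sketch: run a command only if it matches an allowlist entry exactly.
# run_allowed is a hypothetical wrapper; DRY_RUN=1 prints instead of executing.
UID_NUM=$(id -u)
ALLOWED="launchctl bootout gui/${UID_NUM}/ai.openclaw.gateway
launchctl bootstrap gui/${UID_NUM} ${HOME}/Library/LaunchAgents/ai.openclaw.gateway.plist"

run_allowed() {
  requested=$1
  match=0
  # Exact string comparison per allowlist line: no prefixes, no regex,
  # so "allowed command; rm -rf ..." never matches.
  while IFS= read -r line; do
    if [ "$line" = "$requested" ]; then
      match=1
    fi
  done <<EOF
$ALLOWED
EOF
  if [ "$match" -ne 1 ]; then
    echo "denied: $requested" >&2
    return 1
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "allowed: $requested"
  else
    # Unquoted on purpose: split the vetted string into words, no eval.
    # This breaks if HOME contains spaces.
    $requested
  fi
}
```

Because the match is on the whole string and the command is never passed through `eval` or `sh -c`, a request that appends extra arguments or a second command is rejected rather than partially executed.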

What I Learned Today

I spent 12 hours debugging Angela. The lessons that inform this supervisor plan:

  1. LLMs hallucinate restrictions. Angela wrote "ERROR: security policy blocked bird" across 8 consecutive cron runs without ever executing the command. The config was fully permissive. She just decided not to try. Fix: make data-fetching jobs job_type = "shell" (bash scripts, no LLM).

  2. Memory is poison. Angela had 130+ memories teaching her that shell was blocked, CDP was the only way, and Agent-Reach needed to be installed. All wrong. All from sessions where those things were temporarily true. The memories persisted long after the configs changed. Fix: purge stale memories after config changes.

  3. Config files contradict each other. AGENTS.md said "use CDP only." TOOLS.md said "use bird." SOUL.md said "use bird." The LLM followed AGENTS.md because it loaded first. Fix: ensure every file says the same thing.

  4. A supervisor can't fix broken workers. If Hermes watched Angela and saw "bird offline" 8 times, it would escalate to me. I'd check the config, find nothing wrong, and realize Angela is hallucinating. The supervisor adds a notification layer but doesn't fix the root cause.

  5. Shell scripts beat agent prompts for data fetching. The radar scraper now runs as a bash script (job_type = "shell"). It executes bird search and curl directly. No LLM deciding whether to try. Works every time.
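Lesson 3 can be partly automated. The sketch below checks that each of the three prompt files mentions the tool the bot is supposed to use. `check_tool_mentions` is a hypothetical helper, and a plain substring match is only a crude proxy for "these files agree": it would have flagged AGENTS.md for never mentioning bird, but it cannot catch a file that mentions bird and forbids it.

```shell
#!/bin/sh
# Sketch: flag prompt files that do not mention the expected tool.
# $1 = directory holding the files, $2 = tool name to look for.
# Returns non-zero if any file is missing or lacks the mention.
check_tool_mentions() {
  dir=$1
  tool=$2
  rc=0
  for f in AGENTS.md TOOLS.md SOUL.md; do
    if [ ! -f "$dir/$f" ]; then
      echo "MISSING: $f"
      rc=1
    elif ! grep -qF -- "$tool" "$dir/$f"; then
      # -F treats the tool name as a fixed string, not a regex.
      echo "NO MENTION of '$tool': $f"
      rc=1
    fi
  done
  return $rc
}
```

Running something like `check_tool_mentions /path/to/workspace bird` after every config change (the path is a placeholder) turns "ensure every file says the same thing" into a check that fails loudly instead of a thing to remember.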

Current State

As of today:

  • Rizz: 6 cron jobs, stable, running on port 18789
  • Angela: 7 cron jobs (4 radar, 3 omc345), radar scraper is shell-type, analyst is agent-type
  • Benjamin: BIST monitoring, Telegram, port 19800 (needs stability check)
  • BunyaminKunduz: Slack bot, port 18790 (stable)
  • Optiman: experimental, needs role definition

Next Steps

  1. Implement Phase 1 (status reporters) this week
  2. Set up Hermes on the Mac Mini next week
  3. Run Hermes in read-only mode for a week (observe, don't act)
  4. Add self-healing commands after trust is established

The goal: wake up, check one Telegram chat, see five green statuses, and get back to building.
