What If You Could Call Your Claude Code Sessions?

What if you could pick up your phone, dial a number, and talk to your running Claude Code session? Not a fresh AI chat — your actual session, with all its context: your codebase, your git state, your running tests.

This is a brainstorm on how that could work, and why cmux makes it uniquely viable.

Two Approaches

Approach A: Route voice through a live cmux session

This is the interesting one. The call doesn't start a new AI conversation — it connects to a running Claude Code session inside cmux.

The flow works like this (a code sketch follows the list):

  1. You call a Twilio number
  2. Twilio streams your audio over WebSocket to a bridge server
  3. The server transcribes your speech in real time (Deepgram or Whisper)
  4. Transcribed text gets sent to your cmux workspace via cmux send
  5. Claude Code processes it — reads files, edits code, runs commands
  6. The bridge captures Claude's response via cmux read-screen
  7. Response gets converted to speech (ElevenLabs or OpenAI TTS)
  8. Audio streams back to your phone
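
A minimal sketch of steps 4-6 in TypeScript. Only the cmux send and cmux read-screen commands come from the flow above; the poll-until-stable heuristic for detecting the reply is my assumption, not anything cmux promises.

  import { execFile } from 'node:child_process';
  import { promisify } from 'node:util';
  const run = promisify(execFile);

  // Send a transcribed utterance into the live session, then wait for the reply.
  async function askSession(transcript: string): Promise<string> {
    const before = (await run('cmux', ['read-screen'])).stdout;
    await run('cmux', ['send', transcript]); // step 4: type into the session
    // Step 6: poll until the screen has changed and then stopped changing.
    let prev = '';
    let screen = before;
    for (let i = 0; i < 30 && (screen === before || screen !== prev); i++) {
      prev = screen;
      await new Promise((r) => setTimeout(r, 2000));
      screen = (await run('cmux', ['read-screen'])).stdout;
    }
    // Whatever is new on screen is treated as Claude's response.
    return screen.startsWith(before) ? screen.slice(before.length) : screen;
  }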

You're not talking to a generic AI. You're talking to the same session that knows your project, your branch, your recent changes.

Approach B: Claude API direct

Simpler but less powerful. Your server receives the call, transcribes audio, calls the Claude API directly with tools, converts the response to speech. Works fine, but it's a fresh context every time — no access to your running sessions or accumulated state.
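
For contrast, here's Approach B in miniature, using the Anthropic SDK for Node. The model string is a placeholder; the point to notice is that nothing in this call can see your workspaces.

  import Anthropic from '@anthropic-ai/sdk';

  const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

  // One-shot answer: a fresh context on every call, no session state.
  async function answer(transcript: string): Promise<string> {
    const msg = await client.messages.create({
      model: 'claude-sonnet-4-20250514', // placeholder; pick a current model
      max_tokens: 512,
      messages: [{ role: 'user', content: transcript }],
    });
    return msg.content.map((b) => (b.type === 'text' ? b.text : '')).join('');
  }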

Why cmux Is the Key Ingredient

cmux exposes Unix socket primitives that make Approach A possible (composed in the sketch after the list):

  • cmux send <text> — types into a live terminal session. This is how transcribed voice commands reach Claude.
  • cmux read-screen — captures terminal output. This is how the bridge reads Claude's responses.
  • cmux list-workspaces — shows all running sessions. Enables "which workspace should I talk to?" routing.
  • cmux sidebar-state — reads status without parsing terminal output. Quick status checks become trivial.
  • cmux notify — desktop notifications. Alert when a call comes in.
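
Composed, a call handler might start like this. The notify message argument and the one-workspace-per-line output of list-workspaces are assumptions about cmux's CLI shape; only the command names come from the list above.

  import { execFile } from 'node:child_process';
  import { promisify } from 'node:util';
  const run = promisify(execFile);

  // On an incoming call: alert the desktop, then gather routing candidates.
  async function onIncomingCall(caller: string): Promise<string[]> {
    await run('cmux', ['notify', `Incoming voice call from ${caller}`]);
    const { stdout } = await run('cmux', ['list-workspaces']);
    return stdout.split('\n').filter(Boolean); // assumed: one workspace per line
  }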

With cmux-claude-pro's 16-hook integration, the bridge could also read sidebar state (progress bars, status pills, activity logs) for instant status summaries.

The Stack

Phone gateway: Twilio or Vonage — receives and places calls, streams audio. (Telegram's Bot API could cover a voice-message variant, but not live call streaming.)

Speech-to-text: Deepgram (streaming, ~200ms latency), Whisper, or AssemblyAI.

Brain: Either Claude API with tools, or cmux routing to a live Claude Code session.

Text-to-speech: ElevenLabs, OpenAI TTS, or Play.ht — converts responses to natural voice.

Actions: cmux send/read-screen for session control, or Claude's native tool use.
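
Since every layer has drop-in alternatives, the seams are worth keeping explicit. One possible shape, with purely illustrative names:

  // Each field is swappable independently: Deepgram vs. Whisper, API vs. cmux, etc.
  interface VoiceBridge {
    gateway: 'twilio' | 'vonage';                                  // phone gateway
    stt: (audio: AsyncIterable<Buffer>) => AsyncIterable<string>;  // speech-to-text
    brain: (text: string) => Promise<string>;  // Claude API or a live cmux session
    tts: (text: string) => Promise<Buffer>;    // response back to audio
  }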

Things to Think About

Latency. Twilio's Media Streams WebSocket delivers audio in near-real time. Deepgram streaming adds ~200ms. Claude's response time varies. TTS adds another ~200ms. Total round-trip: 2-5 seconds. Acceptable for a coding assistant, but not instant.

Session routing. With multiple workspaces running (kunduz, some-api-server, react-app), the bridge needs to know where to send commands. Could be voice-based ("talk to kunduz") or a simple menu.
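
A naive take on the voice-based option, with the same one-per-line assumption about list-workspaces output; substring matching on a normalized utterance is just a placeholder heuristic.

  import { execFile } from 'node:child_process';
  import { promisify } from 'node:util';
  const run = promisify(execFile);

  // "talk to kunduz" -> "kunduz"; "talk to some api server" -> "some-api-server"
  async function routeByVoice(utterance: string): Promise<string | null> {
    const { stdout } = await run('cmux', ['list-workspaces']);
    const workspaces = stdout.split('\n').filter(Boolean);
    const spoken = utterance.toLowerCase();
    return workspaces.find((w) =>
      spoken.includes(w.toLowerCase().replace(/[-_]/g, ' '))) ?? null;
  }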

Security. You're giving phone access to a system that can edit code, run commands, and push to git. Minimum: caller ID verification and a PIN. Twilio has built-in voice verification.
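
The PIN gate is a few lines with Twilio's Node helper library (the TwiML builder below is real; the allow-listed number and routes are made up for the sketch).

  import express from 'express';
  import twilio from 'twilio';

  const ALLOWED = new Set(['+15555550100']); // example caller ID allow-list
  const app = express();
  app.use(express.urlencoded({ extended: false })); // Twilio posts form-encoded

  app.post('/voice', (req, res) => {
    const twiml = new twilio.twiml.VoiceResponse();
    if (!ALLOWED.has(req.body.From)) {
      twiml.reject(); // unknown caller ID: drop the call outright
    } else {
      const gather = twiml.gather({ numDigits: 4, action: '/pin' });
      gather.say('Enter your PIN.');
    }
    res.type('text/xml').send(twiml.toString());
  });
  app.listen(3000);
  // A /pin handler would then check req.body.Digits before bridging to cmux.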

What works well over voice. Status checks ("how's the build?"), high-level direction ("fix the failing test"), and monitoring ("any errors?"). What doesn't: detailed code review, complex multi-file changes, anything where you'd want to see the diff.

A Minimal Prototype

Maybe 200 lines of Node.js (the core wiring is sketched after the list):

  1. A Twilio phone number that receives calls and streams audio via WebSocket
  2. A server that runs Deepgram STT on the incoming audio
  3. A cmux bridge that sends transcribed text to a hardcoded workspace and reads the response
  4. OpenAI TTS to convert the response to audio and stream it back
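
And pieces 1-3 wired together with the ws package. The Twilio media-frame JSON and Deepgram result shape below follow their documented streaming formats; askSession is the bridge function sketched earlier, and the TTS return leg is left out.

  import { WebSocketServer, WebSocket } from 'ws';

  declare function askSession(transcript: string): Promise<string>; // from the earlier sketch

  const wss = new WebSocketServer({ port: 8080 }); // Twilio <Stream> points here

  wss.on('connection', (call) => {
    // Deepgram's streaming endpoint; params match Twilio's 8 kHz mulaw audio.
    const dg = new WebSocket(
      'wss://api.deepgram.com/v1/listen?encoding=mulaw&sample_rate=8000',
      { headers: { Authorization: `Token ${process.env.DEEPGRAM_API_KEY}` } },
    );

    call.on('message', (raw) => {
      const msg = JSON.parse(raw.toString());
      if (msg.event === 'media' && dg.readyState === WebSocket.OPEN) {
        dg.send(Buffer.from(msg.media.payload, 'base64')); // raw audio frames
      }
    });

    dg.on('message', async (raw) => {
      const result = JSON.parse(raw.toString());
      const text = result.channel?.alternatives?.[0]?.transcript;
      if (text && result.is_final) {
        const reply = await askSession(text); // piece 3: the cmux bridge
        console.log('would speak:', reply);   // piece 4 (omitted): TTS back to the caller
      }
    });
  });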

The cmux primitives do the heavy lifting. The rest is plumbing.

The Bottom Line

Every piece of this stack exists today. The unique ingredient is cmux's socket-based terminal control, which turns "talk to an AI" into "talk to your running AI session." That session knows your codebase, your git state, your running tests. A fresh API call doesn't.

The real question isn't whether it's possible — it's whether speaking code commands while away from the keyboard beats just walking back to the keyboard. For quick status checks and high-level steering? Probably yes. For anything detailed? Probably not.

But the fact that the infrastructure exists to try it is interesting enough to write down.
