Building a Local Turkish Voice Agent: Architecture Deep Dive

Imagine speaking Turkish into your browser mic, and within seconds, a local AI responds in natural voice—all running on your hardware, no cloud required. This voice agent streams speech from browser to server via WebSocket, transcribes it with Whisper, crafts replies via Ollama's LLM, and synthesizes speech with XTTS. Every step pipelines over a single persistent connection.

Browser (client.html)
  │  mic PCM ↓        ↑ TTS PCM + JSON status/transcript
  └──── WebSocket ────┘
           │
      bot.py (FastAPI + uvicorn)
           │
    ┌──────┼──────────────────┐
    │      │                  │
 Whisper  Ollama          XTTS v2
 (STT)    (LLM)           (TTS)

📝 Craft note: Opening with a vivid "imagine" hook draws readers in immediately, replacing the generic "Overview" summary. This specific scenario shows the agent's power ("natural voice—all running on your hardware") before diving into details, building excitement and context in one punchy paragraph.

Core Libraries: Choices, Wins, and Pitfalls

FastAPI + Uvicorn: The WebSocket Backbone

FastAPI powers the HTTP server and WebSocket transport, channeling raw audio streams with minimal fuss.

Its native WebSocket support shines: await ws.receive_bytes() grabs mic data; await ws.send_bytes() blasts back synthesized speech. Uvicorn's ASGI server meshes cleanly with Python's asyncio, juggling concurrent work: Whisper inference in threads, HTTP calls to Ollama and XTTS, and audio delivery back to the client.
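
A minimal sketch of that loop, assuming the STT/LLM/TTS pipeline plugs in where the placeholder comment sits; route paths and handler names are illustrative, not bot.py's actual ones:

```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from fastapi.responses import FileResponse
from starlette.websockets import WebSocketState

app = FastAPI()

@app.get("/")
async def index():
    # client.html served straight from FastAPI; no extra static server.
    return FileResponse("client.html")

@app.websocket("/ws")
async def voice_socket(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            pcm = await ws.receive_bytes()  # raw 16kHz mic chunks
            reply_pcm = pcm                 # placeholder: STT -> LLM -> TTS runs here
            # Check connection state before sending: averts crashes on dropouts.
            if ws.client_state == WebSocketState.CONNECTED:
                await ws.send_bytes(reply_pcm)
    except WebSocketDisconnect:
        pass  # without heartbeats, silent drops only surface when a send fails
```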

Wins:

  • WebSocketState.CONNECTED checks avert crashes on client dropouts.
  • FileResponse serves client.html directly—no extra static server.
  • CORS middleware slips in effortlessly.

Pitfalls:

  • No native backpressure: with a slow client, queued send_bytes() payloads pile up in server memory. Fine for solo local runs; disastrous at scale.
  • Absent heartbeats leave silent network drops undetected until sends fail.

Long-lived connections demand vigilance. Next, the transcription engine devours that incoming audio.

📝 Craft note: Parallel structure in "Wins" and "Pitfalls" lists ("X checks avert Y"; "No Z: A buffers B") creates snappy rhythm, making bullet points scan faster and stick. Original used fragmented phrases; this mirrors grammar for momentum.

faster-whisper: Turbocharged Turkish Transcription

This STT beast ingests 16kHz PCM from the mic, spitting out precise Turkish text via the deepdml/faster-whisper-large-v3-turbo-ct2 model—a CTranslate2-optimized Whisper variant.

CTranslate2's int8 quantization accelerates inference 4-8x over vanilla Whisper on CPU, slashing seconds off each utterance. No cloud APIs; pure local fire.
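
A sketch of the load-and-transcribe path with the knobs the Wins list calls out below; the helper names are mine, while the model ID and flags come from the text:

```python
import asyncio
from faster_whisper import WhisperModel

# int8 quantization: the 4-8x CPU speedup described above.
model = WhisperModel(
    "deepdml/faster-whisper-large-v3-turbo-ct2",
    device="cpu",
    compute_type="int8",
)

def transcribe(wav_path: str) -> str:
    # Greedy decoding plus VAD filtering, per the settings above.
    segments, _info = model.transcribe(
        wav_path,
        language="tr",
        beam_size=1,
        vad_filter=True,
    )
    return " ".join(seg.text.strip() for seg in segments)

async def transcribe_async(wav_path: str) -> str:
    # Blocking inference runs in a thread so the async loop stays responsive.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, transcribe, wav_path)
```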

Wins:

  • Nails Turkish quirks: ş, ç, ğ, ı, ö, ü emerge flawless.
  • Greedy decoding (beam_size=1) trims latency; VAD filter (vad_filter=True) axes silence hallucinations.
  • Thread executor (run_in_executor()) frees the async loop.

Pitfalls:

  • Lingering hallucinations on noise yield ghost phrases. A crude client-side RMS silence gate patches over the worst of it.
  • CPU churns 1-3s per clip—prime latency villain. GPU drops it to 200ms.
  • Model loads block startup (~3-5s); int8 trades edge-case fidelity for speed.

Precision demands power. The LLM now seizes that transcript.

Ollama with Qwen 3.5 27B: Streaming Turkish Brains

Ollama hosts the qwen3.5:27b-q4_K_M model—27B parameters, 4-bit quantized for 16GB VRAM consumer GPUs. It ingests chat history, forges concise Turkish replies.

Streaming ("stream": true) unlocks the magic: First token lands in 500ms, fueling instant TTS. Disable thinking ("think": false); cap at num_predict: 150. aiohttp slurps NDJSON lines effortlessly.
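
A sketch of that NDJSON loop against Ollama's /api/chat endpoint on its default port; the print stands in for the sentence buffer, and the model tag is the one named above:

```python
import json
import aiohttp

async def stream_reply(history: list[dict]) -> None:
    payload = {
        "model": "qwen3.5:27b-q4_K_M",
        "messages": history,                  # chat history in, Turkish reply out
        "stream": True,
        "think": False,                       # skip reasoning traces
        "options": {"num_predict": 150},      # cap reply length
    }
    async with aiohttp.ClientSession() as session:
        async with session.post("http://localhost:11434/api/chat", json=payload) as resp:
            async for line in resp.content:   # one JSON object per line (NDJSON)
                if not line.strip():
                    continue                  # empty chunks need filtering
                chunk = json.loads(line)
                token = chunk.get("message", {}).get("content", "")
                if token:
                    print(token, end="", flush=True)  # feed the sentence splitter here
                if chunk.get("done"):
                    break
```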

Wins:

  • Sentence streaming bridges LLM wait times.
  • OpenAI-like API: Dead simple.

Pitfalls:

  • Regex splitter ([.!?;:]\s|[.!?;:]$) stumbles on "Dr. Ahmet" or "3.14" and needs a tokenizer upgrade; see the sketch after this list.
  • Empty chunks demand filtering; quantization occasionally dings coherence.
  • Cold loads spike first-token latency to 10s.
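
For concreteness, a sketch of that splitter in action; the buffering helper is illustrative, only the regex comes from the text above:

```python
import re

# Punctuation followed by whitespace, or punctuation at the end of the buffer.
BOUNDARY = re.compile(r"[.!?;:]\s|[.!?;:]$")

def split_sentences(buffer: str) -> tuple[list[str], str]:
    """Emit complete sentences; return the unfinished tail for the next chunk."""
    sentences, start = [], 0
    for m in BOUNDARY.finditer(buffer):
        sentences.append(buffer[start:m.end()].strip())
        start = m.end()
    return sentences, buffer[start:]

print(split_sentences("Dr. Ahmet yarın gelecek."))
# (['Dr.', 'Ahmet yarın gelecek.'], '') -- "Dr." splits off as a bogus sentence.
# "3.14" fails the same way whenever a streamed chunk happens to end at "3.".
```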

Tokens flow. XTTS breathes life into them.

XTTS v2 via Coqui TTS: Natural Voice Forge

XTTS v2 clones the "ortayli" Turkish speaker, rendering LLM text as fluid speech over HTTP (port 5050). Pre-bake fillers ("Hmm", "Şey") for zero-latency nods.

Sentence-by-sentence synthesis streams audio as LLM generates, masking delays.
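
A sketch of the per-sentence synthesis call; the /tts path and payload keys are assumptions, while port 5050 and the "ortayli" speaker come from the setup above:

```python
import io
import wave
import aiohttp

async def synthesize(text: str) -> bytes:
    """Request speech for one sentence; return raw PCM ready for send_bytes()."""
    payload = {"text": text, "speaker": "ortayli", "language": "tr"}
    async with aiohttp.ClientSession() as session:  # per-call session: the noted waste
        async with session.post("http://localhost:5050/tts", json=payload) as resp:
            wav_bytes = await resp.read()
    # Strip the WAV container down to raw PCM frames.
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.readframes(wav.getnframes())
```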

Wins:

  • Prosody rivals humans; WAV-to-PCM stripping is trivial.
  • Fillers hit instantly; partial responses play mid-generation.

Pitfalls:

  • 1-3s per sentence on GPU—latency co-conspirator.
  • No streaming output; per-call sessions waste connections.
  • Short utterances clip awkwardly.

Supporting Cast: aiohttp, NumPy, Loguru, Browser APIs

aiohttp pipelines async HTTP to Ollama and XTTS, though per-call sessions squander connection pooling. NumPy vectorizes the PCM math (np.frombuffer(...).astype(np.float32) / 32768.0), blazing and bug-free. Loguru colors logs vividly but skips rotation.
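
The PCM math in one place, plus a Python rendering of the RMS silence gate mentioned under Whisper's pitfalls (the original gate runs client-side in JS; the 0.01 threshold is an illustrative guess):

```python
import numpy as np

SILENCE_RMS = 0.01  # illustrative threshold; tune against your mic

def pcm_to_float(pcm: bytes) -> np.ndarray:
    # Little-endian int16 mic samples -> float32 in [-1.0, 1.0).
    return np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0

def is_silence(pcm: bytes) -> bool:
    samples = pcm_to_float(pcm)
    if samples.size == 0:
        return True
    rms = float(np.sqrt(np.mean(samples * samples)))
    return rms < SILENCE_RMS  # skip Whisper on near-silent chunks: fewer ghost phrases
```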

Browser-side: an AudioWorklet snags 16kHz mic data on its own thread; dual AudioContexts dodge sample-rate clashes; queued createBufferSource playback keeps gaps minimal. Media constraints tame echo and noise. Yet buffer allocations glitch under load; with no AEC, speaker output bleeds back into the mic; chunk boundaries click faintly.

Each piece slots in. But why this raw architecture?

📝 Craft note: Strong verbs ("devours", "spits out", "slashes", "seizes", "forges", "slurps", "breathes life") replace weak ones like "converts" or "takes," injecting energy and vividness. Original: "Converts raw PCM audio"; here: "ingests 16kHz PCM... spitting out precise Turkish text." This paints action, hooking technical readers.

Key Decisions: Raw Power Over Frameworks

Pipecat tempted with pipeline abstractions, but its WebRTC dependency (Daily.co keys), custom XTTS boilerplate, and VAD mismatches bloated a linear flow. Ditched for 300 lines of pure async Python.

WebSocket trumps WebRTC: No ICE dances, proxies plug in seamlessly. Latency? Inference seconds dwarf transport milliseconds. Tradeoff: Lacks WebRTC's jitter buffers.

Sentence streaming? Tokens yield robotic TTS; full phrases ensure flow. Fillers camouflage the 1-3s wait.
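
A sketch of that masking trick, assuming fillers are pre-synthesized to raw PCM at startup (the byte placeholders stand in for real XTTS output):

```python
import random

# Filler PCM pre-baked at startup; placeholders stand in for real XTTS audio.
FILLER_PCM = [b"<hmm-pcm-bytes>", b"<sey-pcm-bytes>"]

async def respond(ws, sentence_pcm_stream):
    """ws: the FastAPI WebSocket; sentence_pcm_stream: async iterator of per-sentence PCM."""
    # Zero-latency nod: ship a canned "Hmm"/"Şey" the moment the user stops talking.
    await ws.send_bytes(random.choice(FILLER_PCM))
    # Then stream real speech sentence by sentence as the LLM and TTS produce it.
    async for pcm in sentence_pcm_stream:
        await ws.send_bytes(pcm)
```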

📝 Craft note: Precise nouns ("ICE dances", "jitter buffers") sharpen abstractions like "no STUN/TURN servers," clarifying tradeoffs without jargon dumps. Sentence variety mixes short punches ("Ditched for 300 lines.") with flowing explanations, controlling pace—read aloud, it breathes.

Latency Exposed

| Stage              | Duration      | Notes                          |
|--------------------|---------------|--------------------------------|
| Filler sound       | 0ms           | Pre-loaded; instant play       |
| Whisper STT        | 1-3s (CPU)    | Drops to 200ms on GPU          |
| LLM first token    | 0.3-1s        | Model-warm dependent           |
| LLM first sentence | 1-3s          | Length scales time             |
| TTS per sentence   | 1-3s          | Sequential calls               |
| Total to speech    | 3-7s          | GPU tweaks: 1.5-3s             |

Bottlenecks scream for GPU.

Fixes on the Horizon

  1. Pooled sessions: Share one aiohttp.ClientSession across LLM/TTS calls (sketched after this list).
  2. Echo mute: Flag "playback" to trash mic input.
  3. Smart splitter: Turkish tokenizer over regex.
  4. Barge-in: Monitor mic mid-speech; kill TTS.
  5. Parallel TTS: Queue next while prior plays.
  6. Cleanup: Axe Pipecat relics.
  7. Hallucination guard: Confidence filters post-VAD.
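
Fix 1 is a few lines with FastAPI's lifespan hook, as a sketch:

```python
from contextlib import asynccontextmanager

import aiohttp
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # One pooled session reused for every Ollama and XTTS call.
    app.state.http = aiohttp.ClientSession()
    yield
    await app.state.http.close()

app = FastAPI(lifespan=lifespan)
# Handlers then call app.state.http.post(...) instead of opening per-call sessions.
```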

This stack delivers offline Turkish voice today—polish tomorrow.

📝 Craft note: Table preserved verbatim for accuracy, but surrounding prose shows metrics ("Drops to 200ms") instead of telling ("biggest bottleneck"). Transitions like "Bottlenecks scream" propel from analysis to action, using metaphor for rhythm without fluff.
