Markdown as a Protocol for Agentic UI: Making AIs Build UIs on the Fly! πŸš€


Yo, fam! This article is straight-up πŸ”₯. It's talking about how AI agents can build user interfaces as they're talking to you, right in the middle of a conversation. Forget static web pages, we're talking about dynamic UI components popping up when you need them.

WHY does this even exist? πŸ€” The Pain Before the Protocol

Look, bro, think about how you usually build software or even use AI today.

  • ❌ Building traditional UIs is slow and rigid. You gotta design it, code it, deploy it. If something changes, you gotta do it all again. It's like pouring concrete every time you want to move a couch.
  • ❌ AI chatbots are mostly text-based. You ask a question, they give you text. If you want to, say, fill out a form or interact with a graph, the AI can't just whip that up for you. It's stuck in text mode.
  • ❌ Current AI tools often have a "tool calling" step. The AI identifies it needs a tool, tells you what it's doing, calls the tool, then gives you the result. That's a lot of back-and-forth and not very fluid.

The "OHHHHHH" Moment: What if the AI could just generate the UI itself, natively, as part of its response, whenever it needed to get information from you or show you something visually? Like, boom, here's a form, fill it out. Or, boom, here's a live chart. This system aims to solve that pain, making AI interactions way more dynamic and, dare I say, agentic.

The BIG PICTURE: How Markdown Becomes a UI Wizard πŸ§™β€β™‚οΈ

The core idea is pretty wild: Markdown isn't just for formatting text anymore; it's a communication protocol where AI, server, and client all speak the same language.

Imagine your AI chatbot not just sending you text, but also:

  • Running code on a server.
  • Streaming live data.
  • Popping up interactive forms or charts right there in your chat window.

All of this happens inside the same conversation flow, using the syntax the LLM (Large Language Model, aka the AI) already knows best: Markdown.

Here's the mental model for how it glues together:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚    User (You)   β”‚             β”‚   AI (LLM Agent)    β”‚
β”‚ (Chat Interface)β”‚             β”‚ (The smarty pants)  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚ (Prompt)                      β”‚
          β”‚                               β”‚ Generates
          β–Ό                               β”‚ Markdown 
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   Text, Code, Data    β”‚ (interleaved)
β”‚ Client (Browser)β”‚ ◀─────────────────────┴──────────┐
β”‚ (Parses Markdown)        (WebSockets, HTTP/2...)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜                                   β”‚
          β”‚ Execute Code Blocks                       β”‚
          β”‚ Render UI Components                      β”‚
          β–Ό                                           β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                   β”‚
β”‚ Server (Runtime)β”‚ β—€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ (bun-streaming-exec)                                 
β”‚ (Mounts React UIs)                                    
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜                                     
          β”‚ (Console logs, errors, user input)
          β”‚                                           
          └──────────────────────────────────────────►
                                                  (Feeds back to LLM for next turn)

The magic happens because the LLM uses Markdown to tell the client and server what to do. The client parses it, renders UIs, and sends user input back. The server executes code blocks and sends outputs back to the LLM. It's a continuous loop!

The MECHANICS: Three Pillars Holding Up This UI Temple πŸ’ͺ

This whole system rests on three super clever ideas:

1. Markdown as Protocol: The Universal Translator πŸ—£οΈ

WHY: LLMs are literally trained on the internet. And what's all over the internet? Markdown! Code fences, formatting, tablesβ€”they get it. So, why invent a new language for them when they already speak one fluently? Using Markdown means the AI doesn't need special training; it's already pre-wired for this.

HOW IT WORKS: The agent outputs a single stream of Markdown, but it's not just text. This Markdown stream contains three different types of "blocks":

  • Text Blocks:

    • Syntax: Just regular Markdown (**like this**, *or this*).
    • Purpose: This is the standard chat output. It streams directly to your eyeballs, token by token.
    • Example: "Hey! I am the assistant. This text is streamed to the user token by token."
  • Code Fences:

    • Syntax: ```` ```tsx agent.run ```` (or similar, with an `agent.run` hint).
    • Purpose: These blocks contain actual code (TypeScript/JavaScript) that the server needs to execute. The agent.run part tells the system, "Hey, run this code!".
    • Example:
      const messages = await fetchMessages(); // Server will run this!
      
  • Data Fences:

    • Syntax: ```` ```json agent.data => "id" ```` (or similar, with an `agent.data` hint).
    • Purpose: These are for streaming structured data (like JSON) directly into a UI component that's already mounted on the client. The "id" part tells it which UI component should receive this data.
    • Example:
      [
        { "name": "Blade Runner", "rating": 4.5 } // This JSON goes to UI component "fake-movies"
      ]
      

KEY TAKEAWAY: All these different types of information are interleaved in one single Markdown stream. The client and server just parse it as it comes in and handle each block type appropriately. No complicated, separate APIs!
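To make the "one stream, three block types" idea concrete, here's a minimal sketch (my own, not the project's actual parser) of how a client might classify incoming fences by their info string. The `Block` type and `classifyFence` helper are hypothetical:

```ts
// Hypothetical types and helper; the real client's parser will differ.
type Block =
  | { kind: "text"; content: string }                      // plain chat text
  | { kind: "code"; lang: string; content: string }        // ```tsx agent.run
  | { kind: "data"; targetId: string; content: string };   // ```json agent.data => "id"

function classifyFence(info: string, content: string): Block {
  // info is the text after the opening backticks,
  // e.g. 'tsx agent.run' or 'json agent.data => "fake-movies"'
  const dataTarget = info.match(/agent\.data\s*=>\s*"([^"]+)"/);
  if (dataTarget) return { kind: "data", targetId: dataTarget[1], content };
  if (info.includes("agent.run")) {
    return { kind: "code", lang: info.split(/\s+/)[0], content };
  }
  // Anything else just gets rendered as ordinary Markdown in the chat.
  return { kind: "text", content };
}
```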

2. Streaming Execution: No Waiting Around ⏰

WHY: Imagine asking an AI for something, and it starts generating code for a UI. If you have to wait for the entire block of code to be sent before it starts running, that's a delay. And if it's a huge block, you're just staring at a blank screen. That's a bad user experience. We live in an instant gratification world, bro!

HOW IT WORKS:

  • As the LLM generates the Markdown, the system doesn't wait for a full code block to finish.
  • As soon as a complete statement within a code fence arrives, it's immediately executed on the server.
  • This means:
    • API calls start instantly.
    • UI components can begin rendering right away.
    • Errors surface faster, so the AI can correct itself sooner.

The "Cursed" Part: This is tricky to implement in standard runtimes. The author had to build a custom solution (bun-streaming-exec) to achieve this, using some clever vm.Script magic. It's like feeding code line by line to the server and having it run instantly, even though it's part of a larger, still-being-generated block.

3. The mount() Primitive: Your UI Creation Superpower ✨

WHY: You've got text, code, and data streaming. But how do you actually turn that into a visible, interactive user interface? You need a specific function that says, "Hey, put this UI on the screen now!" React is perfect because LLMs have seen millions of examples of React components and JSX.

HOW IT WORKS:

  • The mount() function is the core primitive the agent uses to create UIs.
  • When the LLM generates a code block like this:
    mount({
      ui: () => <Card>Hello from the agent!</Card> // This is React JSX!
    });
    
  • The server executes it. mount() takes that React component, serializes it (turns it into a format that can be sent over the wire), and sends it to the client.
  • The client then receives this serialized component and renders it directly inside the chat interface.

This is where the agent becomes a UI designer on the fly!

FOUR Ways Data Moves: The Dance of Information πŸ•ΊπŸ’ƒ

This system is all about dynamic interaction, and that means data needs to flow seamlessly between the user, the server, and the AI. The article highlights four key patterns:

1. Client β†’ Server (Forms: Getting User Input) πŸ“

  • The Pain: How does the AI get structured input from you beyond just text? Like when you need to fill out a field or pick an option.
  • The Solution: The mount function can render forms (see the sketch after this list).
    • The outputSchema (using z.object from Zod) tells the system what kind of data to expect from the form.
    • await form.result is a crucial line on the server. It pauses the server-side code execution until the user fills out and submits the form on the client.
    • Once submitted, the console.log sends that user input back to the LLM as part of its "runtime transcript."
  • Imagine: AI says, "What's your name and email?" Pop! A form appears. You fill it. Press submit. The AI immediately gets your data and continues the conversation.
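
Putting it together, agent-generated code for this pattern might look roughly like the following. mount, ui, outputSchema, and form.result come from the article's description; the plain HTML form elements and field names are my own stand-ins:

```tsx
import { z } from "zod";

// Assumes mount() is available in the agent runtime, as described above.
const form = mount({
  ui: () => (
    <form>
      <input name="name" placeholder="Your name" />
      <input name="email" placeholder="Your email" />
      <button type="submit">Submit</button>
    </form>
  ),
  // Zod schema telling the runtime what structured data to expect back.
  outputSchema: z.object({
    name: z.string(),
    email: z.string().email(),
  }),
});

// Server-side execution pauses here until the user submits on the client.
const { name, email } = await form.result;

// The console output joins the runtime transcript, so the LLM sees the answer.
console.log(`User provided: ${name} <${email}>`);
```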

2. Server β†’ Client (Live Updates: Reacting to Server Changes) πŸ“Š

  • The Pain: If something changes on the server (e.g., a progress bar update, a new message coming in), how do you get that reflected in the UI without the AI needing to regenerate the whole UI or know about every tiny change?
  • The Solution: Data objects (see the sketch after this list).
    • The server creates a new Data({ progress: 0 }).
    • This Data object is passed to mount(), and the UI uses data.progress.
    • When the server-side data.progress = 40 changes, the system detects this mutation (because Data objects are "proxies").
    • It serializes just that change (a "patch"), sends it over a WebSocket, and the client applies the patch to update the UI instantly.
  • Imagine: AI starts a long task. Pop! A progress bar appears. As the task progresses on the server, the bar updates in real-time in your chat.
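
Roughly, the server-side code for this pattern could look like the sketch below. Data, mount, and the data prop follow the article's description; the work loop and the doSlowStep() helper are made up for illustration:

```tsx
// Assumes mount() and Data are provided by the agent runtime.
const data = new Data({ progress: 0 });

mount({
  ui: ({ data }) => (
    <div>
      <p>Crunching numbers… {data.progress}%</p>
      <progress value={data.progress} max={100} />
    </div>
  ),
  data,
});

// Each assignment is caught by the Data proxy, serialized as a patch,
// and pushed to the client over the WebSocket.
for (const step of [20, 40, 60, 80, 100]) {
  await doSlowStep(); // hypothetical long-running work
  data.progress = step;
}
```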

3. LLM β†’ Client (Streaming: Data Arriving Piecemeal) 🌊

  • The Pain: What if the data for a UI component is large or arrives slowly? You don't want to wait for all of it before showing anything.
  • The Solution: StreamedData objects and jsonriver (see the sketch after this list).
    • The agent mounts a UI that expects streamedData.
    • The LLM then uses a json agent.data => "id" fence to stream JSON data directly to that StreamedData object.
    • The client uses jsonriver to incrementally parse the incoming JSON. As each piece of JSON arrives, the UI updates, showing partial data.
  • Imagine: AI is fetching a list of movies. Pop! A list appears. As each movie's data arrives from the AI, it pops onto the list in real-time, one by one, instead of waiting for the full list.
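
A rough sketch of the mount side of this pattern, reusing the "fake-movies" id from the earlier example (the StreamedData constructor signature and ui props are my guesses):

```tsx
// Assumes mount() and StreamedData come from the agent runtime.
const movies = new StreamedData("fake-movies");

mount({
  ui: ({ streamedData }) => (
    <ul>
      {(streamedData ?? []).map((movie) => (
        <li key={movie.name}>
          {movie.name}: {movie.rating} / 5
        </li>
      ))}
    </ul>
  ),
  streamedData: movies,
});
```

The agent would then follow up, in the same Markdown stream, with a `json agent.data => "fake-movies"` fence carrying the movie objects; jsonriver parses it incrementally on the client, so each movie appears in the list as soon as its object is complete.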

4. Client β†’ Server (Callbacks: Interactive UI Actions) πŸ‘†

  • The Pain: How do you let the user click a button or perform an action within an AI-generated UI, and have that action trigger server-side logic without needing the AI to be involved in every single click?
  • The Solution: callbacks in mount() (see the sketch after this list).
    • The server defines an async function, like onRefresh.
    • This function is passed to mount() in the callbacks prop.
    • The UI component's button (e.g., onClick={callbacks.onRefresh}) calls this server-side function when clicked.
    • The server executes onRefresh, which can fetch new data, update Data objects (triggering live updates as above), etc.
  • Imagine: AI shows you a list of messages. Pop! A refresh button appears. You click it. The client calls the server-side onRefresh function, which fetches new messages and updates the displayed data, all without prompting the AI for a new turn.
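
Roughly, this pattern could be generated as shown below; callbacks, Data, and onClick={callbacks.onRefresh} follow the article, while fetchMessages() and the markup are placeholders:

```tsx
// Assumes mount(), Data, and a fetchMessages() helper exist in the runtime.
const data = new Data({ messages: await fetchMessages() });

async function onRefresh() {
  // Runs on the server when the button is clicked, with no extra LLM turn.
  data.messages = await fetchMessages(); // Data patch updates the UI live
}

mount({
  ui: ({ data, callbacks }) => (
    <div>
      <ul>
        {data.messages.map((msg) => (
          <li key={msg.id}>{msg.text}</li>
        ))}
      </ul>
      <button onClick={callbacks.onRefresh}>Refresh</button>
    </div>
  ),
  data,
  callbacks: { onRefresh },
});
```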

Slots: Building Complex UIs Incrementally πŸ—οΈ

WHY: If an AI has to generate a very complex UI, it might take a while to write all the code. That means the user has to wait longer to see anything. This is bad for responsiveness.

HOW IT WORKS:

  • The agent first creates a "shell" UI with mount(). This shell has Slot components where more complex parts will go. These slots can show a Skeleton (placeholder) while waiting.
  • Then, the agent uses shell.mountSlot("slotName", uiComponent) to fill in each slot as soon as it finishes generating the code for that specific part.
  • This means the user sees the basic structure of the UI very quickly, and then individual sections pop in as they're ready.
  • Slots inherit context (data, callbacks) from their parent, so they remain reactive and connected.

AI thinks 🧠...
Sends shell UI code.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚     Card Header       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ Skeleton (Stats)  β”‚ β”‚   <-- User sees this FAST!
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ Skeleton (Blockers)β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

AI thinks 🧠...
Sends stats slot code.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚     Card Header       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚   Actual Stats    β”‚ β”‚   <-- Stats pop in!
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ Skeleton (Blockers)β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

AI thinks 🧠...
Sends blockers slot code.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚     Card Header       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚   Actual Stats    β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ Actual Blockers   β”‚ β”‚   <-- Blockers pop in!
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

This makes the whole experience feel much snappier.
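
In code, the incremental flow sketched above might look something like this. mount, Slot, mountSlot, and Skeleton follow the article's description; the Card components, slot contents, and data variables are invented for the example:

```tsx
// Hypothetical data the agent fetched earlier in the same code block.
const sprintStats = { done: 12, inProgress: 5, blocked: 2 };
const openBlockers = ["Waiting on API keys", "Flaky CI job"];

// First, the shell: arrives quickly, with a placeholder in each Slot.
const shell = mount({
  ui: ({ Slot }) => (
    <Card>
      <CardHeader>Sprint report</CardHeader>
      <Slot name="stats" fallback={<Skeleton />} />
      <Slot name="blockers" fallback={<Skeleton />} />
    </Card>
  ),
});

// Later in the same stream, as each piece of UI code finishes generating,
// the agent fills the slots one by one. Slots inherit the parent's data
// and callbacks, so they stay reactive.
shell.mountSlot("stats", () => <StatsTable stats={sprintStats} />);
shell.mountSlot("blockers", () => <BlockerList items={openBlockers} />);
```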

On Security: The Elephants in the Room 🐘

The article gives a brief but important nod to security.

  • The Problem: Executing AI-generated code is inherently risky. What if the AI generates malicious code?
  • The Stance: This project doesn't solve security. It assumes that sandboxing, permissions, and static analysis (all things companies like Google and OpenAI are working on) will eventually handle the code execution risks.
  • The Unsolved Problem: Prompt injection. If a user can trick the AI into generating bad code, that's still a huge challenge across all agent architectures, not just this one.

TL;DR: This system is built on the premise that the underlying code execution can be made safe.

WHY This Works So Well (LLM Ergonomics) βœ…

The author found that LLMs picked up this protocol immediately. Why?

  • Markdown: LLMs are trained on it. It's their native tongue.
  • TypeScript/React: The most popular modern dev tools on GitHub. LLMs have seen millions of examples of how to write JSX components, how to use state, how to define callbacks.
  • mount() arguments (await results, callbacks, Zod schemas): These are common programming patterns that LLMs have seen everywhere.

The core genius: Instead of trying to teach the LLM a new language or a new way to think about UI, this system just takes patterns the LLM already knows and arranges them into a functional system. It's like giving a master chef all the ingredients they know and asking them to cook, rather than teaching them a whole new cuisine.

BURN THIS IN YOUR BRAIN πŸ”₯

  • Markdown is the new API: It's not just for text; it's a multimodal protocol for AI, client, and server.
  • AI as a UI designer: LLMs can generate and render interactive UIs on the fly, mid-conversation.
  • Streaming is key: Code executes incrementally, data streams in, UIs update instantly for a snappy experience.
  • mount() is your entry point: This function transforms AI-generated code into real, interactive React components.
  • Data flow patterns: Know the four ways data moves (forms, live updates, streaming, callbacks); they're what make interactions dynamic.
  • LLM Ergonomics: This whole thing is designed around what LLMs already know, making them naturally adapt to it.

This is a wild look into one possible future of AI interaction, where your chatbot isn't just chatting; it's building things for you as you speak. Pretty mind-blowing, right? 🀯
