AutoResearch on Steroids: What Happens When an AI Gets a GPU Cluster? 🚀
Yo, fam! Ever heard of Karpathy's autoresearch? It's basically an AI trying to make another AI better, one tiny step at a time. But what if that AI got its hands on a whole farm of GPUs? That's what this article is all about, and spoiler alert: it goes wild!
1. WHY? The Pain of Slow AI Research & Why Supercharging It Matters 🐢
Look, dude, training neural networks is like trying to bake a perfect cake: you gotta try different ingredients (hyperparameters), different ovens (architectures), and different baking times (training steps). Each "experiment" takes time, and you only know if it worked after it's done.
Before:
┌─────────────────────────┐
│ 🧑‍💻 Human Researcher     │
│  • Tries idea A         │
│    (waits 5 min)        │
│  • Sees result          │
│  • Tries idea B         │
│    (waits 5 min)        │
│  • ...                  │
└─────────────────────────┘
            │
            ▼
┌─────────────────────────┐
│ 🐢 Karpathy's Agent     │
│ (1 GPU - sequential)    │
│  • Edits train.py       │
│    (30s)                │
│  • Runs experiment      │
│    (5 min)              │
│  • Checks result        │
│    (30s)                │
│  • ... (one by one!)    │
└─────────────────────────┘
This sequential process is a massive bottleneck. Imagine you're writing a book and can only write one word, wait for an editor to read it, then write the next word. It's painstakingly slow!
The problem with "one experiment at a time":
- 🐢 Slow as molasses: Each experiment takes precious minutes, and AI research often requires hundreds or thousands of experiments.
- **Blinders:** You can only explore local improvements. If the best solution is two steps away from your current best and requires two interacting changes, a sequential agent might never find it, because each change alone doesn't look promising.
- No big picture: It's hard to see overall trends or interaction effects between different settings when you're just incrementally tweaking one thing.
- Idling agent: The AI itself (the "agent" that edits code and plans next steps) just sits there twiddling its digital thumbs while the GPU is busy. What a waste! 😴
The "Ohhhhhh" moment: What if we could run a bunch of these experiments at the same time? What if the AI had many GPUs and could blast through options simultaneously? That's the pain point this research solves: speeding up AI advancement by letting an AI parallelize its own research. 🔥
2. The BIG Picture: Agent-Driven Parallel Research 🤯
Alright, so the core idea is simple but powerful: take Karpathy's "autoresearch" agent, which automates the tweak-and-test cycle, and give it access to a cluster of GPUs instead of just one.
Here's the setup:
- The Brain (Agent): This is Claude Code (or a similar AI), acting as the researcher. Its goal: minimize `val_bpb` (validation bits per byte) for a neural network within a 5-minute training budget.
- The Experiment: The agent modifies a `train.py` file, runs it on a GPU for 5 minutes, and records the `val_bpb`.
- The Accelerator: Instead of one lonely GPU, we're talking about a whole squad of 16 GPUs (some H100s, some H200s, because why not mix it up?).
- The Orchestrator: SkyPilot is the magic glue here. It's an open-source tool that lets the agent programmatically launch and manage these GPU clusters across different cloud providers (or Kubernetes, like in this case). The agent actually understands how to use SkyPilot via a "skill."
The transformation:
┌─────────────────────┐     ┌─────────────────────┐
│ 🐢 Single GPU       │     │ 🚀 GPU Cluster      │
│  • 1 Experiment     │     │  • 16 Experiments   │
│    per 5 min slot   │     │    per 5 min slot   │
│  • Sequential       │     │  • Parallel         │
│  • Greedy search    │     │  • Factorial grids  │
└─────────────────────┘     └─────────────────────┘
          │                           │
          ▼                           ▼
┌──────────────────┐     ┌─────────────────────────────────┐
│ Agent + 1 GPU    │     │ Agent + SkyPilot                │
│                  │     │ (talking to 16 GPUs)            │
│ IDLE TIME!!      │ ──▶ │ ⚡ No more idle time!           │
└──────────────────┘     └─────────────────────────────────┘
The result? The agent can now test 16 ideas simultaneously, dramatically accelerating the research process. It's like going from asking one person a question at a time to polling 16 people at once!
3. How It Works: The Mechanics of Auto-Scaling AI Research 🛠️
Let's break down the actual process, step-by-step.
3.1 The Autoresearch Project Structure
The underlying Karpathy autoresearch project has three key files:

- `prepare.py`:
  - What it does: Handles data downloading, tokenizer training, and sets up the dataloader and evaluation function.
  - Agent's interaction: The agent is not allowed to modify this. It's read-only, defining the problem constraints.
- `train.py`:
  - What it does: Contains the actual neural network model definition (GPT), the optimizer, and the training loop.
  - Agent's interaction: This is the agent's playground! It can tweak anything in here:
    - Model architecture (e.g., width, depth)
    - Hyperparameters (learning rate, batch size, weight decay)
    - Optimizer settings
    - ...as long as the code still runs!
- `program.md`:
  - What it does: These are the instructions for the agent itself.
  - Agent's interaction: Tells the agent:
    - What it can change (`train.py`).
    - How to evaluate results (minimize `val_bpb`).
    - When to keep or discard changes.
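To make that playground concrete, here's a minimal, hypothetical sketch of the kind of knobs a `train.py` like this exposes. The names and the width-from-aspect-ratio convention are assumptions for illustration, not the actual file's contents:

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    """Illustrative knobs an agent might tweak in train.py (names are hypothetical)."""
    # Architecture
    n_layer: int = 6
    aspect_ratio: int = 64        # one convention: derive width from depth * aspect ratio
    # Optimization
    learning_rate: float = 3e-3
    batch_size: int = 32
    weight_decay: float = 0.1
    adam_betas: tuple = (0.9, 0.95)

    @property
    def model_dim(self) -> int:
        # Wider models come from a larger aspect ratio at fixed depth.
        return self.n_layer * self.aspect_ratio

cfg = TrainConfig(aspect_ratio=96)
print(cfg.model_dim)  # 6 * 96 = 576
```

Each experiment is then just "this config, run for 5 minutes, report `val_bpb`."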
3.2 SkyPilot: The Agent's Cloud Superpower 💪
This is where SkyPilot comes in to unlock parallelism.
- The Agent Learns SkyPilot:
  - The `instructions.md` given to Claude Code points it to a "SkyPilot skill."
  - This skill teaches the agent how to interact with the SkyPilot command-line tools (`sky launch`, `sky exec`, `sky logs`, `sky status`).
  - Think of it like giving the agent a new app on its phone: it now knows how to use it!
- Defining an Experiment (`experiment.yaml`):
  - Each experiment is defined in a YAML file.
  - `resources`: Specifies what GPU it needs (e.g., `H100:1` or `H200:1`). Also, importantly, `infra: k8s` dictates it should run on a Kubernetes cluster.
  - `workdir`: Where the code lives.
  - `envs`: Environment variables for the experiment (e.g., `EXPERIMENT_ID`, `EXPERIMENT_DESC`).
  - `setup`: Commands to run once when the cluster first starts (e.g., `pip install uv`, `uv sync`, `uv run prepare.py`).
  - `run`: The actual command to execute the training, capture logs, and parse `val_bpb`.
    - It uses `tee run.log` to write to both stdout and a file.
    - It then `grep`s and `awk`s the `val_bpb` and `peak_vram_mb` from the log, indicating the experiment's status and results.
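Pulling those fields together, a sketch of what such an `experiment.yaml` could look like. The exact file from the run isn't shown in this article, so treat the values, log format, and metric lines here as illustrative assumptions:

```yaml
# Sketch of an experiment spec matching the description above (values are illustrative).
resources:
  accelerators: H100:1   # or H200:1
  infra: k8s             # run on the Kubernetes cluster

workdir: .               # ship the local autoresearch checkout

envs:
  EXPERIMENT_ID: exp-00
  EXPERIMENT_DESC: baseline

setup: |
  pip install uv
  uv sync
  uv run prepare.py      # data + tokenizer setup (the read-only part)

run: |
  uv run train.py 2>&1 | tee run.log
  # Surface the metrics so the agent can read them back via `sky logs`.
  grep 'val_bpb' run.log | tail -1 | awk '{print $NF}'
  grep 'peak_vram_mb' run.log | tail -1 | awk '{print $NF}'
```

Swapping `H100:1` for `H200:1` is all it takes to target the faster tier of GPUs.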
- Launching Parallel Experiments:
  - The agent uses `sky launch` to provision new GPU clusters based on `experiment.yaml`.
    - Example: `sky launch gpu-01 experiment.yaml -d --env EXPERIMENT_ID=exp-01`
    - The `-d` means "detached": fire it off and immediately get back to planning the next thing.
    - It creates a named cluster (e.g., `gpu-01`).
  - To run subsequent experiments on an already running cluster, it uses `sky exec`.
    - Example: `sky exec gpu-01 experiment.yaml -d --env EXPERIMENT_ID=exp-02`
    - This queues the new experiment to run as soon as the previous one finishes on `gpu-01`, meaning zero idle time for that specific GPU.
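The launch pattern can be captured as a small helper that builds the command strings described above. It builds rather than executes them so the pattern is easy to see; the helper itself is hypothetical:

```python
def sky_cmd(cluster: str, exp_id: str, first_on_cluster: bool) -> str:
    """Build the SkyPilot command for one experiment, per the pattern above.

    `sky launch` provisions a named cluster; `sky exec` queues work onto an
    existing one so the GPU never sits idle between jobs.
    """
    verb = "launch" if first_on_cluster else "exec"
    return f"sky {verb} {cluster} experiment.yaml -d --env EXPERIMENT_ID={exp_id}"

# One wave of 16 experiments across clusters gpu-01..gpu-16:
wave1 = [sky_cmd(f"gpu-{i:02d}", f"exp-{i:02d}", first_on_cluster=True)
         for i in range(1, 17)]

# A follow-up experiment queued on an already-running cluster:
print(sky_cmd("gpu-01", "exp-17", first_on_cluster=False))
# sky exec gpu-01 experiment.yaml -d --env EXPERIMENT_ID=exp-17
```

The agent does the equivalent of this every wave, just by emitting shell commands instead of Python.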
- Monitoring and Decision Making:
  - The agent uses `sky logs <cluster-name>` to see the output of its running or finished experiments.
  - It parses these logs to extract `val_bpb` (the performance metric).
  - Based on these results, it updates `train.py` for the next wave of experiments.
  - It maintains a record of the "best" configuration found so far and commits improved `train.py` versions.
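The log-parsing step can be sketched as a tiny helper. The log line format here is an assumption, mirroring the `grep`/`awk` pattern described for the `run` section:

```python
import re

def parse_metrics(log_text: str) -> dict:
    """Extract the last reported val_bpb and peak_vram_mb from an experiment log.

    Assumes lines like 'val_bpb: 0.981' appear in the log (format is illustrative).
    """
    metrics = {}
    for key in ("val_bpb", "peak_vram_mb"):
        hits = re.findall(rf"{key}[:=]\s*([0-9.]+)", log_text)
        if hits:
            metrics[key] = float(hits[-1])  # keep the final (end-of-training) value
    return metrics

log = """
step 900  | val_bpb: 0.992
step 1000 | val_bpb: 0.981
peak_vram_mb: 61440
"""
print(parse_metrics(log))  # {'val_bpb': 0.981, 'peak_vram_mb': 61440.0}
```

Whatever the exact format, the point is the same: the agent turns raw `sky logs` output into numbers it can rank.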
3.3 The Loop: Autonomous Research Cycle
The agent essentially enters this super-powered loop:
1. Evaluate Current State: Check the scores of all recently finished experiments.
2. Generate Hypotheses: Based on what it learned, it proposes new changes to `train.py` (e.g., "try different learning rates," "increase model width").
3. Prepare Experiments: For each hypothesis, it generates a (slightly modified) `train.py` and a corresponding `experiment.yaml`.
4. Launch in Parallel: It uses `sky launch -d` (for new clusters) or `sky exec -d` (for existing, idle clusters) to blast these experiments onto its 16 GPUs.
5. Monitor: It periodically checks `sky logs` for results.
6. Repeat: Once enough results are in, it goes back to step 1.
This fully autonomous, parallel loop allows it to run a ton of experiments very quickly.
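The steps above can be sketched as a toy loop. Everything here is illustrative: `run_wave` stands in for the `sky launch`/`sky exec`/`sky logs` round-trip, and `propose` stands in for the agent editing `train.py`:

```python
def research_loop(run_wave, propose, initial_cfg, waves=3, gpus=16):
    """Toy version of the agent's autonomous loop (names are illustrative).

    run_wave(configs) -> {config: val_bpb}  # stands in for launching + reading logs
    propose(best_cfg, n) -> configs         # stands in for editing train.py
    """
    best_cfg, best_bpb = initial_cfg, float("inf")
    for _ in range(waves):
        candidates = propose(best_cfg, gpus)       # step 2-3: generate hypotheses
        results = run_wave(candidates)             # step 4-5: run all in parallel
        wave_best = min(results, key=results.get)  # step 1: evaluate the wave
        if results[wave_best] < best_bpb:          # keep only real improvements
            best_cfg, best_bpb = wave_best, results[wave_best]
    return best_cfg, best_bpb

# Toy usage: "configs" are just learning rates, and the fake objective prefers 3e-3.
propose = lambda best, n: [best * f for f in (0.5, 0.8, 1.0, 1.25, 2.0)][:n]
run_wave = lambda cfgs: {c: abs(c - 3e-3) for c in cfgs}
best, score = research_loop(run_wave, propose, initial_cfg=1e-3, waves=4, gpus=5)
```

Even this toy version shows the key property: each wave evaluates many hypotheses for the price (in wall-clock time) of one.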
4. Details + Edge Cases: The Agent's Clever Moves 🧠
Beyond just brute-force parallelism, the agent started to show some real "intelligence."
4.1 Emergent Research Phases 📈
The agent didn't "plan" a grand strategy. Its research phases naturally emerged from the results:
- Phase 1: Hyperparameter Sweeps (~200 experiments):
  - What: It tested all the "easy" knobs: batch size, Adam betas, weight decay, learning rates.
  - How parallelism helped: Instead of checking one learning rate, then another, then another, it checked 10-13 variations at once.
  - Results: Quickly got `val_bpb` down from 1.003 to 0.981.
- Phase 2: Architecture Discovery (~220 more experiments):
  - What: Realized hyperparameters weren't yielding big gains, so it started playing with model architecture (specifically, `ASPECT_RATIO`, which controls `model_dim`).
  - How parallelism helped: This was the biggest win for parallelism. It tested seven different aspect ratios (`AR=48, 64, 72, 80, 90, 96, 112`) simultaneously in one 5-minute wave. Serially, this would be 30+ minutes just to gather the initial data.
  - Results: Found that `AR=96` (a wider model) was a game-changer, dropping `val_bpb` to 0.977. It saw the trend instantly!
- Phase 3: Fine-tuning the Wider Model (~140 more experiments):
  - What: Now that it had a better architecture, it re-tuned hyperparameters specifically for that wider model.
  - Results: Small but steady improvements to 0.975.
- Phase 4: Optimizer Tuning (~140 more experiments):
  - What: Focused on tweaking the `Muon` optimizer's specific parameters.
  - How parallelism helped: Found `muon_beta2=0.98` by testing 5 values in one parallel wave.
  - Results: Another significant late-stage jump to 0.974.
- Phase 5: Diminishing Returns (~210 more experiments):
  - What: Tried even more combinatorial tweaks, but the big gains were gone.
  - Results: Improvement flattened out, signaling it had likely extracted most of the juice under the current constraints.
4.2 Exploiting Heterogeneous Hardware 🔧
This is where it gets really spicy! The agent was given a mix of GPUs: H100s and H200s. It wasn't explicitly told the difference, but it figured it out:
- Observation: It noticed that experiments run on H200s (even with the exact same `train.py`) consistently yielded better `val_bpb` scores (~0.005 lower).
- Inference: It linked this to H200s being faster, leading to more training steps within the fixed 5-minute budget.
- Strategic Shift (Self-Correction): The agent then started using the H100s for initial screening of many ideas in parallel (the "cheap" GPUs) and only promoted the most promising candidates to the H200s for final validation (the "expensive" GPUs with better results).
  - This is a classic human research strategy, invented autonomously by the AI!
- Catching Nuances: It even found cases where a setting that looked good on an H100 (fewer training steps) was actually worse on an H200 (more training steps). This highlights how the environment can affect optimal configurations.
Ascii Art: The Emergent Strategy
┌───────────┐                 ┌───────────┐
│ H100 GPUs │                 │ H200 GPUs │
│ (Cheap,   │                 │ (Faster,  │
│  Many)    │                 │  Few)     │
└─────┬─────┘                 └─────▲─────┘
      │ "Screen Many Ideas!"       │ "Validate Best!"
      ▼                            │
┌───────────────────────────┐      ┌───────────────────────────┐
│ Agent's INITIAL IDEAS     │      │ Agent's "TOP PICKS"       │
│ (Many hypotheses tested)  ├─────▶│ (Refined & Confirmed)     │
└───────────────────────────┘Promote└──────────────────────────┘
              │
              └──(Agent learns which are 'best')──▶ BETTER AI MODELS!
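That screen-then-promote pattern can be sketched in a few lines. This is a toy model with made-up scores; the real agent did this by routing SkyPilot jobs, not by calling Python functions:

```python
def two_tier_search(candidates, screen, validate, promote_k=3):
    """Screen many configs on cheap hardware, re-validate only the top-k on fast hardware.

    screen(cfg) and validate(cfg) return val_bpb (lower is better); they stand in
    for H100 and H200 runs respectively.
    """
    # Tier 1: cheap, wide screening (think: one H100 per candidate, all in parallel).
    screened = sorted(candidates, key=screen)
    finalists = screened[:promote_k]
    # Tier 2: expensive validation of only the most promising configs.
    return min(finalists, key=validate)

# Toy usage: configs are aspect ratios; pretend 96 is truly best, and the faster
# GPU (more steps in 5 min) shifts every score down by a constant 0.005.
truth = {48: 1.001, 64: 0.990, 72: 0.985, 80: 0.981, 90: 0.978, 96: 0.977, 112: 0.980}
screen = lambda ar: truth[ar] + 0.004     # cheap, slightly pessimistic estimate
validate = lambda ar: truth[ar] - 0.005   # faster GPU: lower val_bpb
best = two_tier_search(list(truth), screen, validate)  # -> 96
```

The win is economic: most candidates only ever consume cheap GPU time, and the scarce fast GPUs are reserved for configs that already look good.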
4.3 Cost Efficiency 💲
The entire 8-hour, ~910-experiment run using 16 GPUs cost under $300 in GPU compute (plus a tiny Claude API cost, around $9). That makes scaling research this way surprisingly cost-effective: far cheaper than a team of human researchers, and roughly 9x faster than the same agent grinding through those experiments one GPU at a time.
BURN THIS IN YOUR BRAIN 🔥
- The WHY: AI research, especially hyperparameter tuning and architecture search, is slow and iterative. Before this, agents were stuck doing one experiment at a time, like a human researcher with only one laptop.
- The SOLUTION: Give the agent access to many GPUs (via SkyPilot) so it can run experiments in parallel.
- The IMPACT:
- Speed: 9x faster progress (8 hours vs. 72 hours for the same results).
- Strategy Shift: Agent moved from greedy, incremental search to exploring wide factorial grids, avoiding getting stuck in local optima.
- Emergent Intelligence: The agent discovered heterogeneous hardware and developed a sophisticated two-tier research strategy to exploit it, showing an impressive level of autonomous adaptation.
- SkyPilot's Role: It's the critical "operating system" that allows the agent to self-manage cloud resources and parallelize its workload.
This is a peek into the future, my guy, where AIs aren't just doing tasks, they're autonomously figuring out how to make themselves better, faster, and more efficiently. Wild stuff, right? 🤯