I'm currently taking an agentic AI course, and one of the things that keeps coming up is the idea of AI agents that don't just answer questions — they do things. Autonomously. In a loop. Without you babysitting them.
So when I stumbled across Andrej Karpathy's autoresearch project, I had to try it. The pitch is simple and kind of wild: give an AI agent a real LLM training setup, tell it to experiment, and go to sleep. Wake up to a better model.
I'm not a researcher at a big AI lab. I'm just a guy with a 2008 MacBook Pro running Ubuntu, an agentic AI course under my belt, and a RunPod account. Here's exactly what happened.
What is autoresearch?
Karpathy's autoresearch is a minimal single-file project. There are really only three things that matter:
- prepare.py — downloads training data and trains a tokenizer. You run it once and never touch it again.
- train.py — a full GPT model, optimizer, and training loop in one file. This is what the AI agent edits.
- program.md — your instructions to the agent. This is what you write and iterate on.
The loop works like this: the agent reads program.md, modifies train.py, runs a
5-minute training experiment, checks if the validation metric (val_bpb — lower is better)
improved, keeps the change if it did or reverts it if it didn't, and repeats. Forever. Or until you stop it.
On a fast GPU you get roughly 12 experiments per hour — about 100 overnight.
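The keep-or-revert loop is simple enough to sketch. This is a minimal illustration, not Karpathy's actual harness: it assumes train.py prints a line like `val_bpb: 0.997264` to stdout and that the repo is a git checkout, so a failed experiment can be discarded by restoring the file.

```python
import re
import subprocess


def parse_val_bpb(log: str) -> float:
    """Pull the validation metric out of train.py's stdout."""
    return float(re.search(r"val_bpb:\s*([\d.]+)", log).group(1))


def run_experiment() -> float:
    """One ~5-minute training run; returns its val_bpb (lower is better)."""
    out = subprocess.run(
        ["uv", "run", "train.py"], capture_output=True, text=True, check=True
    ).stdout
    return parse_val_bpb(out)


def research_loop() -> None:
    """Run experiments until you stop it (Ctrl-C, or kill the tmux session)."""
    best = run_experiment()  # establish the baseline
    while True:
        # ... the agent edits train.py here (the actual research step) ...
        score = run_experiment()
        if score < best:  # improved: commit the change
            best = score
            subprocess.run(
                ["git", "commit", "-am", f"val_bpb {score:.6f}"], check=True
            )
        else:  # regressed: throw it away
            subprocess.run(["git", "checkout", "--", "train.py"], check=True)
```

In the real project the agent itself decides what to try between runs; the sketch only shows the bookkeeping around each experiment.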
Renting an H100 on RunPod
My MacBook definitely wasn't running this. The project requires an NVIDIA GPU, and Karpathy tested on an H100.
RunPod makes renting cloud GPUs surprisingly painless. You create an account, add credits, and spin up a "pod" — a containerized VM with your chosen GPU attached. Billing is per-second, so you're not locked into hourly minimums.
I picked the H100 SXM 80GB on Community Cloud at $2.69/hr. For an overnight run that's about $22. Way cheaper than anything AWS or Google Cloud would charge for the same hardware.
A few things I learned setting it up:
- Use the RunPod PyTorch template — it comes pre-loaded with CUDA, Python, and everything you need so you're not installing GPU drivers from scratch.
- Set your container disk to at least 20GB — the training data download needs the space.
- Create a Network Volume and clone your repo into /workspace — this storage survives even if you terminate the pod, so your results are safe.
- There's no "Stop" button on some pod configurations — only Terminate. Since my repo was on the Network Volume, terminating was fine. Everything important persisted.
- Enable SSH access so you can connect from your terminal. RunPod gives you an exact SSH command to copy-paste.
The Baseline
Once everything was set up I ran train.py manually once to get a baseline before handing things
over to the agent.
```
val_bpb: 0.997264
training_seconds: 300.1
mfu_percent: 38.50
num_params_M: 50.3
```
So: a 50M parameter model, 38.5% GPU utilization (lots of room to improve), and a val_bpb of 0.997264. That's the number to beat.
I added this to program.md so the agent would know its starting point.
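The shape of that file matters more than the exact wording. A minimal program.md along these lines — an illustrative sketch, not my verbatim file — covers the three things the agent needs: the baseline, the goal, and the constraints.

```markdown
# Goal
Improve val_bpb (lower is better) by editing train.py and running
`uv run train.py` for ~5-minute experiments.

# Baseline (H100 SXM 80GB)
- val_bpb: 0.997264
- mfu_percent: 38.50
- num_params_M: 50.3

# Constraints
- Do not modify prepare.py.
- Keep a change only if val_bpb improves; otherwise revert it.
```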
Unleashing Claude Code
The agent I used was Claude Code — Anthropic's agentic coding tool. I installed it on the
pod, set my API key, and kicked it off inside a tmux session so it would keep running after I
closed my SSH connection.
The prompt I gave it:
> Please read program.md carefully. Our baseline val_bpb is 0.997264 with 38.5% MFU on an H100 SXM 80GB. Your job is to autonomously run experiments by modifying train.py, running uv run train.py, checking if val_bpb improved, keeping the change if it did and reverting if it didn't, then repeating. Run as many experiments as possible overnight. Focus on improving val_bpb — lower is better. Do not modify prepare.py.
Then I detached from tmux (Ctrl+B, D), closed my laptop, and went to sleep.
What Happened Overnight
In the morning I SSH'd back in, reattached to the tmux session, and found that it had run over 100 experiments overnight — but only 16 passed (improved val_bpb) and were committed to a new branch. Here's the full progression of the winning runs:
| # | Change | val_bpb before | val_bpb after | Δ |
|---|---|---|---|---|
| 1 | depth=9: more params with acceptable step count | 0.998781 | 0.997468 | -0.001313 |
| 2 | matrix_lr=0.05: higher Muon LR converges faster | 0.997468 | 0.996470 | -0.000998 |
| 3 | warmdown_ratio=0.6: longer cooldown helps | 0.996470 | 0.995290 | -0.001180 |
| 4 | embedding_lr=0.8: higher embedding LR helps | 0.995290 | 0.995149 | -0.000141 |
| 5 | unembedding_lr=0.006: higher lm_head LR helps | 0.995149 | 0.994054 | -0.001095 |
| 6 | batch_size=218: 2x more optimizer steps is huge win | 0.994054 | 0.981629 | -0.012425 |
| 7 | x0_lambdas init=0.15: stronger residual helps | 0.981629 | 0.981323 | -0.000306 |
| 8 | x0_lambdas init=0.2: even stronger residual helps | 0.981323 | 0.980177 | -0.001146 |
| 9 | matrix_lr=0.04: lower LR better with smaller batch | 0.980177 | 0.979518 | -0.000659 |
| 10 | warmdown_ratio=0.65: slightly longer cooldown | 0.979518 | 0.979514 | -0.000004 |
| 11 | muon momentum warmup 200 steps | 0.979514 | 0.979094 | -0.000420 |
| 12 | matrix_lr=0.045 | 0.979094 | 0.979026 | -0.000068 |
| 13 | weight_decay=0.15 | 0.979026 | 0.978706 | -0.000320 |
| 14 | window_pattern=S: all short attention gives more steps | 0.978706 | 0.978668 | -0.000038 |
| 15 | short window=seq_len//4: faster attention, more steps | 0.978668 | 0.978627 | -0.000041 |
| 16 | muon ns_steps=4: faster optimizer step | 0.978627 | 0.977733 | -0.000894 |
Final val_bpb: 0.977733 — an improvement of 0.019531 from the baseline.
The single biggest jump was experiment #6: doubling the batch size. That one change alone dropped val_bpb by 0.012 — more than half of the total overnight improvement. The agent correctly identified this as the highest-leverage change and then spent the rest of the night fine-tuning around it.
What I Learned
The agent is surprisingly good at this. Claude Code didn't just randomly tweak numbers. It explored depth, learning rates, batch sizes, optimizer settings, and attention patterns in a logical sequence. Each commit message is a coherent hypothesis with a result.
GPU utilization matters. Starting at 38.5% MFU left a lot of room. The agent found that changes like short attention windows and larger batch sizes pushed more work through the H100 per minute, directly translating to better models.
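MFU (model FLOPs utilization) is easy to sanity-check by hand. The sketch below assumes the standard ~6 × params × tokens estimate for transformer training FLOPs and an H100 SXM dense bf16 peak of roughly 989 TFLOPS; the token throughput is back-derived for illustration, not a number I logged.

```python
# Rough MFU estimate: achieved training FLOPs/s over the GPU's peak FLOPs/s.
H100_PEAK_FLOPS = 989e12  # approximate H100 SXM dense bf16 peak


def mfu(params: float, tokens_per_sec: float,
        peak_flops: float = H100_PEAK_FLOPS) -> float:
    achieved = 6 * params * tokens_per_sec  # ~6 FLOPs per param per token
    return 100 * achieved / peak_flops      # as a percent


# Illustrative throughput for a 50.3M-param model: ~1.26M tokens/sec
# lands near the observed 38.5% MFU.
value = mfu(50.3e6, 1.26e6)
```

Anything that raises tokens/sec at fixed parameter count — shorter attention windows, bigger batches — shows up directly in this number, which is why the agent's throughput-oriented changes paid off.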
The cost was reasonable. The total run cost me around $22 for the overnight session. For 100+ experiments on an H100 — 16 of which were winners — that's hard to beat.
program.md is the real skill. The agent is only as good as the instructions you give it.
Writing a clear, well-structured program.md — with the baseline, the goal, and the
constraints — is what separates a useful overnight run from a chaotic one.
What's Next
I want to run this same experiment locally on my ASRock Creator Radeon AI Pro R9700 — a 32GB VRAM AMD card. That means getting the training loop running on ROCm instead of CUDA. AMD's ROCm support has improved a lot recently and the R9700 is a serious piece of hardware, so I'm curious whether the agent finds similar optimizations on a different architecture.
If the local ROCm run works, I could run continuous overnight experiments for free instead of paying for cloud GPU time. More on that in the next post.
The full experiment branch is on GitHub: autoresearchTest/autoresearch/arch-exploration
Questions or ideas? Reach out — I'm always happy to talk about this stuff.