I Let an AI Agent Run 300+ ML Experiments on My AMD GPU Overnight. Here's What It Found.


I pointed autoresearch at an AMD Radeon AI PRO R9700 — RDNA4, gfx1201, 32 GB of VRAM — and let it run ~316 experiments over several sessions. The headline finding wasn't an architecture trick or a clever optimizer. It was that torch.compile works on this GPU now, and the popular AMD fork I started from had it disabled.

There's a project called autoresearch — a simplified, single-GPU descendant of Karpathy's nanochat — that does something beautifully stupid: it loops overnight, modifying a train.py, running short pretraining experiments, keeping only the changes that improve validation loss, and iterating. It's autonomous ML research with the patience of a machine and the compute budget of a hobbyist.
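The core loop is simple enough to sketch in a few lines. A minimal, hypothetical version — `propose` and `run` stand in for the agent's edit-and-train cycle and are not autoresearch's actual API:

```python
def greedy_experiment_loop(baseline_bpb, propose, run, budget=10):
    """Keep a proposed change only if it improves validation bits-per-byte."""
    best_bpb = baseline_bpb
    kept = []
    for _ in range(budget):
        change = propose()      # agent proposes an edit to train.py
        bpb = run(change)       # short fixed-wall-clock training run
        if bpb < best_bpb:      # lower val_bpb is better
            best_bpb = bpb
            kept.append(change)
        # otherwise: revert the change and move on
    return best_bpb, kept
```

The greedy accept/revert rule is what makes overnight autonomy safe: a bad proposal costs one experiment's wall-clock budget and nothing else.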

The Setup

The stack: ROCm 7.2.0, PyTorch 2.9.1+rocm6.3, Ubuntu 24.04, a Ryzen 9 9950X3D on the CPU side. The model is a 25–50M parameter GPT with SwiGLU MLPs, rotary position embeddings, value embeddings (ResFormer-style), and a MuonAdamW optimizer. Training data is ClimbMix (general web text).
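For readers who haven't seen the gated MLP variant: SwiGLU replaces the standard two-matrix FFN with a gate, an up-projection, and a down-projection. A minimal NumPy sketch — shapes and names are mine for illustration, not the repo's:

```python
import numpy as np

def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU MLP: down( silu(gate(x)) * up(x) ).

    Assumed shapes: x (d,), w_gate and w_up (d, h), w_down (h, d),
    where h = d * ffn_mult."""
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))   # silu(z) = z * sigmoid(z)
    return (silu * (x @ w_up)) @ w_down
```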

I started from andyluo7's AMD fork of autoresearch, which disables torch.compile on ROCm and falls back to PyTorch SDPA instead of Flash Attention 3. At the time the fork was written, that was the right call — compile didn't work. I didn't question it.

Each experiment gets a fixed wall-clock budget (5, 10, or 15 minutes). The validation metric is val_bpb (bits per byte), which is vocab-size-independent, so architecture changes get a fair comparison. The agent — Claude Code — proposes a change, runs the experiment, checks whether val_bpb improved, and either keeps or reverts.
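Why bits per byte is vocab-independent: it converts the per-token cross-entropy into total bits and divides by the byte length of the raw text, so a tokenizer that packs more bytes into each token doesn't get credit for "easier" tokens. A sketch of the conversion (the repo's exact accounting may differ):

```python
import math

def val_bpb(mean_loss_nats, n_tokens, n_bytes):
    """Bits per byte from mean cross-entropy in nats/token.

    total bits = mean_loss_nats * n_tokens / ln(2);
    dividing by the UTF-8 byte count of the same text puts
    models with different vocab sizes on equal footing."""
    return mean_loss_nats * n_tokens / math.log(2) / n_bytes
```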

Five Minutes: Squeezing Blood from a Stone

Session 1 was pure architecture and hyperparameter exploration: 231 experiments, 35 kept. Baseline val_bpb: 1.711. Final: 1.170 — a 31% improvement.

The biggest wins were surprisingly mundane. Cutting TOTAL_BATCH_SIZE from 2^19 to 2^15 tokens gave more optimizer steps in the fixed time window. Reducing depth from 8 to 5 layers. Shrinking the FFN multiplier from 4 to 3. Fine-grained tuning of learning rates, initialization scales, and betas. Nothing exotic — just the agent methodically searching the space one axis at a time.
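As a rough before/after picture of the kept changes — field names here are illustrative, not train.py's actual variables:

```python
from dataclasses import dataclass

@dataclass
class Config:
    total_batch_size: int  # tokens per optimizer step
    n_layers: int
    ffn_mult: int

baseline = Config(total_batch_size=2**19, n_layers=8, ffn_mult=4)
tuned_5min = Config(total_batch_size=2**15, n_layers=5, ffn_mult=3)
```

Every one of these moves trades model capacity for optimizer steps inside the fixed wall-clock window.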

Session 2 was where things got interesting in a "negative results are results" way. Twenty-five experiments testing training dynamics tricks: EMA, SWA, z-loss, multi-token prediction, LayerDrop, label smoothing, dropout, gradient noise injection, Lookahead optimizer, and more.

Zero kept. Every single one was net negative.

The key insight: in a tight 5-minute budget with depth 5 and MuonAdamW, training never plateaus. The final checkpoint is strictly the best one, so weight averaging can't help. Gradient noise injection was catastrophic because Muon already normalizes gradients — adding noise on top just destroys signal. The standard pretraining priors (EMA helps, z-loss helps) simply don't carry over to this regime.
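The weight-averaging point is easy to see with a toy model: if the iterates are still improving monotonically, any average over past checkpoints is "older" than the latest one, so its loss is worse. A minimal sketch — a quadratic loss stands in for training here; nothing in it comes from the repo:

```python
def ema_vs_final(steps=50, decay=0.9, lr=0.2):
    """Gradient descent on loss(w) = (w - 1)^2, tracking an EMA of w.

    Because w approaches the optimum monotonically, the EMA always
    lags the latest iterate and never wins."""
    w, ema = 0.0, 0.0
    for _ in range(steps):
        w -= lr * 2.0 * (w - 1.0)          # gradient step toward w = 1
        ema = decay * ema + (1.0 - decay) * w
    loss = lambda x: (x - 1.0) ** 2
    return loss(w), loss(ema)              # (final-checkpoint loss, EMA loss)
```

EMA and SWA only pay off once training oscillates around a basin — which a 5-minute run never reaches.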

Cross-seed variance was measured at ~0.005 val_bpb, so these weren't noise — the effects were real and reliably negative.

Session 3 tried structural shifts: optimizer swaps (full AdamW, Lion), parallel transformer blocks, NormFormer, tied embeddings, MoE, differential attention. Fifteen experiments, one marginal keep.

Two findings stood out. First, Muon is essential — both AdamW and Lion were significantly worse. Second, "throughput is king." Any change that cost even 10% per-step speed got punished harder than any capacity gain could compensate. At low MFU, wall-clock efficiency dominates everything.

The Budget Realization

After 271 experiments at 5 minutes, the config was ruthlessly optimized for that specific budget. So I tried something obvious: what happens if I just give it more time?

Budget    val_bpb   Code changes
5 min     1.170     Fully tuned
10 min    1.123     None
15 min    1.101     None

No code changes. Just longer training. And the optimal architecture shifted — at 15 minutes, depth 7 beat depth 5, the matrix learning rate wanted 0.015 instead of 0.02, and the warmdown ratio moved from 0.85 to 0.93. After ~20 re-exploration experiments at 15 minutes: val_bpb 1.090.

This is worth internalizing: hyperparameter optima are functions of your compute budget. An agent that runs until it plateaus under one budget will not find the optimum under a different budget unless you explicitly tell it to re-explore. The 5-minute champion is not the 15-minute champion.

The torch.compile Story

This is the actual headline of the post.

The AMD fork I started from disables torch.compile because it didn't work on ROCm when the fork was written. I had Claude Code test whether that was still true with ROCm 7.2 + PyTorch 2.9.1.

It was not.

Re-enabling torch.compile (the change is literally uncommenting a line):

model = torch.compile(model)

Then switching to mode="max-autotune":

model = torch.compile(model, mode="max-autotune")

What happened? The default Triton kernel configurations for the small dim=512 matmul shapes in this model are not well-tuned for RDNA4. max-autotune spends extra compile time searching over kernel configs and found materially better ones. The bottleneck was kernel inefficiency (the model is memory-bandwidth-bound), not launch overhead.

After max-autotune, batch size dynamics also reversed. At 5 minutes with low MFU, larger batches hurt (fewer steps). At 15 minutes with efficient kernels, larger batches helped — the matmuls hit better kernel shapes AND gradient quality improved. A sweep found the optimum at 48k tokens — a non-power-of-2 that beat both 32k and 64k.

Stacking additional inductor flags and fusions (aggressive_fusion, loop_ordering_after_fusion, coordinate_descent_tuning, fusing the MLP gate+up projections via cat-in-forward, fusing QKV) each shaved another 0.0001–0.002 off val_bpb.
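The flag stack looks roughly like this. Treat it as a sketch: these `torch._inductor.config` options exist in recent PyTorch releases, but availability and defaults vary by version, so test each flag on its own rather than taking the combination as given:

```python
import torch
import torch._inductor.config as inductor_config

# Each flag trades extra compile time for better kernel choices;
# in the sessions above, each was kept only after it improved
# val_bpb by itself.
inductor_config.coordinate_descent_tuning = True
inductor_config.aggressive_fusion = True
inductor_config.loop_ordering_after_fusion = True

# Stand-in module; the post compiles the full GPT this way.
model = torch.compile(torch.nn.Linear(512, 512), mode="max-autotune")
```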

Final result: val_bpb 1.0446 at ~20% MFU.

From 1.711 to 1.0446 — a 39% improvement. And the single biggest contributor wasn't 231 architecture experiments. It was enabling a compiler flag that the ecosystem had caught up to.

What Did It Actually Learn?

Let's be honest about what 50M parameters trained for 15 minutes can do.

Untrained (random weights): Pure token salad — random word fragments with no discernible pattern.

5-min trained (val_bpb 1.170): Topical words appear but degenerate into repetition loops:

"The meaning of life is meaning meaning meaning meaning meaning..."

15-min, pre-compile (val_bpb 1.090): Topical vocabulary with variety but no syntax:

"life, living, lived, loved, peace, happiness, survival, creatures, organisms, generations, ecosystems, suffering"

Final (val_bpb 1.046): Trace amounts of English structure. For the prompt "In a distant galaxy far away":

"...far far north away away away the stretch away closer to make distance..."

Note the function words ("to") and the verb phrase "to make." Still overwhelmingly word salad, but the syntax is starting to show through. The model clearly learned topical vocabulary clustering. Sentence-level structure needs more compute than 50M params × 15 minutes can provide.

This isn't a limitation of the GPU or the method — it's just physics. Language models need scale to learn grammar.


Takeaways for AMD GPU Users

  1. Try torch.compile again. If you're on ROCm 7.x with a recent PyTorch, it probably works now. Don't trust old forks that disable it — test it yourself. It was the single biggest win in this entire experiment.
  2. Use mode="max-autotune". The compile time is longer but the kernel search finds substantially better configurations, especially for smaller matmul shapes common in sub-100M models. The default kernels are not optimized for every GPU.
  3. Your hyperparameters are budget-specific. If you tuned at one training duration and then scale up, re-tune. The optima move.
  4. Muon matters. If you're using MuonAdamW, don't swap it for AdamW or Lion at this scale. It's not close.
  5. Profile before you architecture-search. I spent 271 experiments squeezing 31% out of architecture. One compiler flag matched that. Check your MFU first.

This work builds on Andrej Karpathy's nanochat and andyluo7's AMD fork. The autonomous agent was Claude Code. Hardware was a single AMD Radeon AI PRO R9700 (32 GB). Total experiment time was roughly 48 hours of GPU time across all sessions.
