I pointed autoresearch at an AMD Radeon AI PRO R9700 — RDNA4, gfx1201, 32 GB of VRAM — and let
it run ~316 experiments over several sessions. The headline finding wasn't an architecture trick or a clever
optimizer. It was that torch.compile works on this GPU now, and the popular AMD fork I started
from had it disabled.
There's a project called autoresearch
— a simplified, single-GPU descendant of Karpathy's nanochat — that does something beautifully
stupid: it loops overnight, modifying a train.py, running short pretraining experiments, keeping
only the changes that improve validation loss, and iterating. It's autonomous ML research with the patience
of a machine and the compute budget of a hobbyist.
## The Setup
The stack: ROCm 7.2.0, PyTorch 2.9.1+rocm6.3, Ubuntu 24.04, a Ryzen 9 9950X3D on the CPU side. The model is a 25–50M parameter GPT with SwiGLU MLPs, rotary position embeddings, value embeddings (ResFormer-style), and a MuonAdamW optimizer. Training data is ClimbMix (general web text).
I started from andyluo7's AMD fork
of autoresearch, which disables torch.compile on ROCm and falls back to PyTorch SDPA instead
of Flash Attention 3. At the time the fork was written, that was the right call — compile didn't
work. I didn't question it.
Each experiment gets a fixed wall-clock budget (5, 10, or 15 minutes). The validation metric is
val_bpb (bits per byte), which is vocab-size-independent, so architecture changes get a fair
comparison. The agent — Claude Code — proposes a change, runs the experiment, checks whether
val_bpb improved, and either keeps or reverts.
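The loop itself is small enough to sketch. Everything below is illustrative: `run_experiment` stands in for a real budgeted pretraining run, and the bytes-per-token figure is an assumption. What matters is the keep-or-revert control flow and the val_bpb conversion (cross-entropy in nats per token, divided by ln 2 times mean bytes per token, which is what makes the metric vocab-size-independent).

```python
import math

def loss_to_bpb(ce_loss_nats: float, bytes_per_token: float) -> float:
    """Convert per-token cross-entropy (nats) to bits per byte."""
    return ce_loss_nats / (math.log(2) * bytes_per_token)

def run_experiment(config: dict) -> float:
    """Stand-in for a budgeted pretraining run; returns val loss in nats.
    Toy objective: pretend depth 5 and lr 0.02 are optimal."""
    return 2.8 + 0.05 * abs(config["depth"] - 5) + 10 * abs(config["lr"] - 0.02)

def research_loop(base_config: dict, proposals: list) -> tuple:
    """Keep a proposed change only if it improves val_bpb; else revert."""
    best_bpb = loss_to_bpb(run_experiment(base_config), bytes_per_token=4.0)
    config, kept = dict(base_config), 0
    for change in proposals:
        candidate = {**config, **change}
        bpb = loss_to_bpb(run_experiment(candidate), bytes_per_token=4.0)
        if bpb < best_bpb:  # improvement: keep the change
            config, best_bpb, kept = candidate, bpb, kept + 1
        # otherwise: revert (the candidate is simply discarded)
    return config, best_bpb, kept

cfg, bpb, kept = research_loop(
    {"depth": 8, "lr": 0.01},
    [{"depth": 5}, {"lr": 0.02}, {"depth": 12}],
)
print(cfg, kept)  # the depth-12 proposal is reverted
```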
## Five Minutes: Squeezing Blood from a Stone
Session 1 was pure architecture and hyperparameter exploration: 231 experiments, 35 kept.
Baseline val_bpb: 1.711. Final: 1.170 — a 31% improvement.
The biggest wins were surprisingly mundane. Cutting TOTAL_BATCH_SIZE from 2¹⁹ to
2¹⁵ tokens gave more optimizer steps in the fixed time window. Reducing depth from 8 to 5 layers.
Shrinking the FFN multiplier from 4 to 3. Fine-grained tuning of learning rates, initialization scales, and
betas. Nothing exotic — just the agent methodically searching the space one axis at a time.
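The arithmetic behind the batch-size win is worth making explicit: under a fixed wall clock, with token throughput roughly constant, the number of optimizer steps is inversely proportional to the global batch size (reading the two batch sizes as 2¹⁹ and 2¹⁵ tokens). The throughput figure below is an illustrative assumption, not a measurement from these runs.

```python
def steps_in_budget(budget_s: float, tokens_per_s: float, batch_tokens: int) -> int:
    """Optimizer steps completed in a fixed wall-clock budget,
    assuming throughput (tokens/sec) is roughly batch-size-independent."""
    return int(budget_s * tokens_per_s / batch_tokens)

TOK_S = 200_000  # illustrative throughput, not a measured number
steps_big = steps_in_budget(300, TOK_S, 2**19)    # 5 min at 2^19-token batches
steps_small = steps_in_budget(300, TOK_S, 2**15)  # 5 min at 2^15-token batches
print(steps_big, steps_small)  # 114 1831
```

Sixteen times more steps for the optimizer in the same five minutes, at the cost of noisier gradients per step.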
Session 2 was where things got interesting in a "negative results are results" way. Twenty-five experiments testing training dynamics tricks: EMA, SWA, z-loss, multi-token prediction, LayerDrop, label smoothing, dropout, gradient noise injection, Lookahead optimizer, and more.
Zero kept. Every single one was net negative.
The key insight: in a tight 5-minute budget with depth 5 and MuonAdamW, training never plateaus. The final checkpoint is strictly the best one, so weight averaging can't help. Gradient noise injection was catastrophic because Muon already normalizes gradients — adding noise on top just destroys signal. The standard pretraining priors (EMA helps, z-loss helps) simply don't carry over to this regime.
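The no-plateau argument shows up even in a toy sketch: an EMA of the weights is a trailing average, so on a trajectory that is still improving monotonically when the budget expires, the average necessarily sits behind the final checkpoint. The scalar below is purely illustrative.

```python
def ema_trajectory(decay: float = 0.99, steps: int = 500) -> tuple:
    """One scalar weight converging monotonically toward 0 (its optimum),
    tracked by an exponential moving average as EMA/SWA tricks do."""
    w, ema = 1.0, 1.0
    for _ in range(steps):
        w *= 0.98                       # still improving every step
        ema = decay * ema + (1 - decay) * w
    return w, ema

w, ema = ema_trajectory()
# The averaged weights sit strictly behind the final checkpoint.
print(w < ema)  # True: EMA lags a monotone trajectory
```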
Cross-seed variance was measured at ~0.005 val_bpb, so these weren't noise — the
effects were real and reliably negative.
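A noise floor like that is just the spread of val_bpb across repeated runs of the same config with different seeds; an effect is credible only when it clears a couple of standard deviations. The seed results below are hypothetical numbers for illustration, not data from these sessions.

```python
import statistics

# Hypothetical val_bpb from rerunning one config with different seeds.
seed_runs = [1.168, 1.172, 1.165, 1.174, 1.171]
noise = statistics.stdev(seed_runs)

def is_real_effect(delta_bpb: float, noise_floor: float, k: float = 2.0) -> bool:
    """Treat a change as real only if it exceeds k standard deviations
    of the cross-seed spread."""
    return abs(delta_bpb) > k * noise_floor

print(round(noise, 4))
print(is_real_effect(0.002, noise))   # within noise: can't call it
print(is_real_effect(0.030, noise))   # well outside noise: a real effect
```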
Session 3 tried structural shifts: optimizer swaps (full AdamW, Lion), parallel transformer blocks, NormFormer, tied embeddings, MoE, differential attention. Fifteen experiments, one marginal keep.
Two findings stood out. First, Muon is essential — both AdamW and Lion were significantly worse. Second, "throughput is king." Any change that cost even 10% per-step speed got punished harder than any capacity gain could compensate. At low MFU, wall-clock efficiency dominates everything.
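"Throughput is king" is easy to quantify with an MFU estimate: achieved training FLOP/s divided by hardware peak, using the standard ~6N FLOPs-per-token approximation for dense transformer training. All numbers below, including the peak figure, are illustrative assumptions rather than specs or measurements from this setup.

```python
def mfu(params: float, tokens_per_s: float, peak_flops: float) -> float:
    """Model FLOPs utilization via the ~6*N FLOPs/token approximation
    for dense transformer training (fwd + bwd)."""
    achieved = 6 * params * tokens_per_s
    return achieved / peak_flops

# Illustrative only: 50M params, hypothetical throughput and peak.
print(round(mfu(50e6, 400_000, 1.0e15), 3))  # 0.12, i.e. 12% MFU
```

At 10-20% MFU, a 10% per-step slowdown directly costs 10% of the tokens seen in the budget, which is why capacity-for-speed trades kept losing.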
## The Budget Realization
After 271 experiments at 5 minutes, the config was ruthlessly optimized for that specific budget. So I tried something obvious: what happens if I just give it more time?
| Budget | val_bpb | Code changes |
|---|---|---|
| 5 min | 1.170 | Fully tuned |
| 10 min | 1.123 | None |
| 15 min | 1.101 | None |
No code changes. Just longer training. And the optimal architecture shifted — at 15 minutes, depth 7
beat depth 5, the matrix learning rate wanted 0.015 instead of 0.02, and the warmdown ratio moved from 0.85
to 0.93. After ~20 re-exploration experiments at 15 minutes: val_bpb 1.090.
This is worth internalizing: hyperparameter optima are functions of your compute budget. An agent that runs until it plateaus under one budget will not find the optimum under a different budget unless you explicitly tell it to re-explore. The 5-minute champion is not the 15-minute champion.
## The torch.compile Story
This is the actual headline of the post.
The AMD fork I started from disables torch.compile because it didn't work on ROCm when the
fork was written. I had Claude Code test whether that was still true with ROCm 7.2 + PyTorch 2.9.1.
It was not.
Re-enabling torch.compile (the change is literally uncommenting a line):
- val_bpb: 1.090 → 1.077
- MFU: 10.5% → 13.3%
- VRAM: 10.4 GB → 6.1 GB
- Training steps in 15 min: 2,474 → 3,126 (+26%)
Then switching to `mode="max-autotune"`:

```python
model = torch.compile(model, mode="max-autotune")
```
- val_bpb: 1.077 → 1.053
- MFU: 13.3% → 19.0%
- Training steps: 3,120 → 4,458 (+43%)
- Step time: 289ms → 203ms
What happened? The default Triton kernel configurations for the small dim=512 matmul shapes
in this model are not well-tuned for RDNA4. max-autotune spends extra compile time searching
over kernel configs and found materially better ones. The model was bottlenecked by inefficient kernels, not by launch overhead.
After max-autotune, batch size dynamics also reversed. At 5 minutes with low MFU, larger
batches hurt (fewer steps). At 15 minutes with efficient kernels, larger batches helped — the matmuls
hit better kernel shapes AND gradient quality improved. A sweep found the optimum at 48k tokens — a
non-power-of-2 that beat both 32k and 64k.
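A sweep like that is trivial to script. In the sketch below, `evaluate` is a stand-in for a full budgeted training run, with a toy objective whose minimum is planted at 48k tokens purely to mirror the result above.

```python
def evaluate(batch_tokens: int) -> float:
    """Stand-in for a fixed-budget training run returning val_bpb.
    Toy objective with its minimum planted at 48k (49,152) tokens."""
    return 1.05 + 1e-12 * (batch_tokens - 49_152) ** 2

# Include non-power-of-2 candidates, not just the usual 32k/64k endpoints.
candidates = [32_768, 40_960, 49_152, 57_344, 65_536]
best = min(candidates, key=evaluate)
print(best)  # 49152
```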
Stacking further tweaks (the inductor flags `aggressive_fusion`,
`loop_ordering_after_fusion`, and `coordinate_descent_tuning`, plus fusing the MLP gate+up
projections via cat-in-forward and fusing QKV) each shaved another 0.0001–0.002 off val_bpb.
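The flags above live in `torch._inductor.config`. A minimal sketch of how they would be set before compiling; whether each attribute exists depends on your PyTorch version, so verify against your install rather than taking this fragment as definitive.

```python
import torch
import torch._inductor.config as inductor_config

# Flag availability varies across PyTorch versions; check which
# attributes torch._inductor.config actually exposes on your install.
inductor_config.aggressive_fusion = True
inductor_config.loop_ordering_after_fusion = True
inductor_config.coordinate_descent_tuning = True

model = torch.nn.Linear(512, 512)  # stand-in for the real GPT
compiled = torch.compile(model, mode="max-autotune")
```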
Final result: val_bpb 1.0446 at ~20% MFU.
From 1.711 to 1.0446 — a 39% improvement. And the single biggest contributor wasn't 231 architecture experiments. It was enabling a compiler flag that the ecosystem had caught up to.
## What Did It Actually Learn?
Let's be honest about what 50M parameters trained for 15 minutes can do.
Untrained (random weights): Pure token salad — random word fragments with no discernible pattern.
5-min trained (val_bpb 1.170): Topical words appear but degenerate into repetition loops:
"The meaning of life is meaning meaning meaning meaning meaning..."
15-min, pre-compile (val_bpb 1.090): Topical vocabulary with variety but no syntax:
"life, living, lived, loved, peace, happiness, survival, creatures, organisms, generations, ecosystems, suffering"
Final (val_bpb 1.0446): Trace amounts of English structure. For the prompt "In a distant galaxy far away":
"...far far north away away away the stretch away closer to make distance..."
Note the function words ("to") and the verb phrase "to make." Still overwhelmingly word salad, but the syntax is starting to show through. The model clearly learned topical vocabulary clustering. Sentence-level structure needs more compute than 50M params × 15 minutes can provide.
This isn't a limitation of the GPU or the method — it's just physics. Language models need scale to learn grammar.
## What I'd Try Next
- Longer budgets. The 5→15 min transition showed that optimal configs shift with budget. A 1-hour or overnight budget would likely find different depth/width/LR optima and might cross the threshold into rudimentary syntax.
- Bigger model. With `torch.compile` dropping VRAM from 10.4 to 6.1 GB, there's ~26 GB of headroom on this 32 GB card. A 200M+ parameter model might use it well.
- Narrower data. TinyStories or similar constrained-vocabulary datasets would let the model learn actual sentence structure at this parameter count — useful for validating whether the training dynamics are sound even if the capacity isn't there for general language.
- Flash Attention on ROCm. The current setup uses PyTorch SDPA. If composable kernel or CK-based Flash Attention works on RDNA4, that's another MFU step.
## Takeaways for AMD GPU Users
- Try `torch.compile` again. If you're on ROCm 7.x with a recent PyTorch, it probably works now. Don't trust old forks that disable it — test it yourself. It was the single biggest win in this entire experiment.
- Use `mode="max-autotune"`. The compile time is longer but the kernel search finds substantially better configurations, especially for smaller matmul shapes common in sub-100M models. The default kernels are not optimized for every GPU.
- Your hyperparameters are budget-specific. If you tuned at one training duration and then scale up, re-tune. The optima move.
- Muon matters. If you're using MuonAdamW, don't swap it for AdamW or Lion at this scale. It's not close.
- Profile before you architecture-search. I spent 271 experiments squeezing 31% out of architecture. One compiler flag matched that. Check your MFU first.
This work builds on Andrej Karpathy's nanochat and andyluo7's AMD fork. The autonomous agent was Claude Code. Hardware was a single AMD Radeon AI PRO R9700 (32 GB). Total experiment time was roughly 48 hours of GPU time across all sessions.