Karpathy has been posting about AutoResearch — pointing an agent at a research question, not a task list, and letting the experimental loop run. His framing: the bottleneck in research isn’t compute or even ideas, it’s experiment throughput. If an agent can close the design-run-interpret cycle autonomously, you’re not just going faster — you’re doing structurally different work.

I had a question sitting open after a two-week experiment on prompt-based self-improvement. The same harness, same task, same pipeline: +9pp for a 3B Llama, -8pp for a 7B Qwen, +11pp for a 14B Phi. A follow-up study produced a three-factor model to explain it, but the model rested on 8 data points, with several obvious falsification attempts still unrun.

I wrote a task file, pointed Claude Code at it, and stopped intervening. The instructions:

> You are doing exploratory research — not executing a fixed list. Each iteration, decide what is most worth testing given what the reports currently say.
>
> NEVER STOP. Do not pause to ask if you should continue. Do not check in. Run experiments, update the reports, generate new experiments from the results, repeat.

The stopping condition wasn’t time-based:

> Output `<promise>EXPERIMENTS</promise>` when the three-factor model has survived at least 3 independent falsification attempts — or when a factor has been refuted and a revised model proposed and documented in the report.

I seeded a priority queue — narrow the threshold gap between 13% and 17% baseline, run self-GRPO on a high-baseline rejector, test whether the pattern holds cross-dataset — and went to sleep.

By morning it had run through the queue, extended it with experiments I hadn’t anticipated, and updated the reports in place. It ran models I hadn’t listed because intermediate results pointed at them. When an API failure corrupted a run, it noted the failure and returned later with a modified approach. The three-factor model didn’t survive intact — at least one factor got substantially revised based on what it found.

What made it work wasn’t elaborate prompting. The research infrastructure had to exist first: the harness, the checkpointing, structured reports the agent could read and update. Given that, the two things that mattered were a falsifiable hypothesis and a stopping condition tied to epistemic state rather than iteration count. “Run N experiments” produces N experiments. “Stop when the model has been stress-tested or overturned” produces experiments oriented toward actually answering the question.
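The control flow this implies can be sketched in a few lines of Python. Everything here is hypothetical naming — the report is reduced to a toy dataclass, where the real agent reads and updates structured report files — but the shape is the point: the loop has no iteration cap, only an epistemic exit condition.

```python
from collections import deque
from dataclasses import dataclass, replace

@dataclass
class ReportState:
    survived: int = 0        # independent falsification attempts survived
    refuted: bool = False    # a factor has been refuted
    revised: bool = False    # a revised model has been documented

    def stop(self) -> bool:
        # Epistemic stopping condition: the model has survived 3
        # falsification attempts, or it was overturned and a revision
        # written up. Iteration count never appears.
        return self.survived >= 3 or (self.refuted and self.revised)

def research_loop(queue, run, propose, state):
    # Pop the next experiment, run it, fold the result into the report
    # state, and extend the queue with whatever follow-ups the result
    # suggests — including experiments no one seeded.
    while queue and not state.stop():
        result = run(queue.popleft())
        state = result(state)                 # a result updates the report
        queue.extend(propose(result, state))
    return state

# Toy run in which every experiment survives its falsification attempt
# and suggests no follow-ups.
def toy_run(exp):
    return lambda s: replace(s, survived=s.survived + 1)

final = research_loop(deque(range(10)), toy_run, lambda r, s: [], ReportState())
# final.survived is 3: the loop stops on epistemic state, not queue length
```

A fixed-N loop would drain all ten seeded experiments regardless of what they showed; this one exits the moment the report says the question is answered, and keeps going — growing its own queue — when it isn’t.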

That’s the shift Karpathy is pointing at. A task list encodes what you already think is worth doing. A goal with a falsification condition lets the evidence determine the path.