Can a Model Teach Itself With Prompts Instead of Gradients?

The question I’ve been thinking about: can an LLM, a stateless machine, teach itself? Does it have the introspection to understand its mistakes and know how to improve? I spent the last few days running an experiment based on a paper called Training-Free GRPO. The core idea: instead of fine-tuning a model with reward signals, you extract natural-language “experiences” from its own successes and failures and inject them back into future prompts. ...

March 26, 2026 · 11 min · Burton Ye
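The loop the post describes, solve, distill a lesson, inject it into the next prompt, can be sketched roughly as follows. This is a minimal illustration under my own assumptions: every name (`solve`, `extract_experience`, `training_free_loop`) and all prompt wording are hypothetical placeholders, not the paper's API.

```python
def solve(question, experiences, model):
    """Ask the model, prepending any accumulated experiences to the prompt.

    `model` is any callable taking a prompt string and returning a string.
    """
    prefix = ""
    if experiences:
        prefix = ("Lessons from past attempts:\n"
                  + "\n".join(f"- {e}" for e in experiences) + "\n\n")
    return model(prefix + question)

def extract_experience(question, attempt, correct, model):
    """Have the model distill one reusable natural-language lesson."""
    verdict = "succeeded" if correct else "failed"
    return model(
        f"You {verdict} on this problem:\n{question}\n"
        f"Your attempt:\n{attempt}\n"
        "State one short, general lesson for similar problems.")

def training_free_loop(tasks, model, check):
    """Accumulate experiences across tasks instead of updating weights.

    `tasks` is a list of (question, reference_answer) pairs and `check`
    judges an attempt against the reference answer.
    """
    experiences = []
    for question, answer in tasks:
        attempt = solve(question, experiences, model)
        correct = check(attempt, answer)
        experiences.append(
            extract_experience(question, attempt, correct, model))
    return experiences
```

The point of the sketch is that the only thing carried between tasks is the `experiences` list, a prompt-level substitute for the gradient update.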

Which Models Actually Benefit From Prompt-Injected Experiences?

The previous experiment ended with an unresolved anomaly. Three models improved when a strong teacher (DeepSeek V3.2) injected procedural experiences into their prompts. One — Qwen 2.5 7B — regressed, and kept regressing regardless of what experiences it received or how much token budget it was given. The cross-injection experiments showed it wasn’t the content; it was something about how Qwen handles injected lists at all. That left an open question: is this a Qwen thing, or does it happen to any model that’s already competent at the task? ...

March 26, 2026 · 5 min · Burton Ye

Hard Examples Are All You Need for GRPO

This post summarizes a paper I co-authored with Benjamin Pikus and Pratyush Ranjan Tiwari. The full paper is on arXiv. Fine-tuning a language model with GRPO is expensive. Collecting and annotating training data is expensive. So if you can only afford to train on 10% of your data, which 10% should you pick? The intuitive answer might be: a representative sample. Maybe some easy, some hard, some in the middle. That’s what random selection gives you. ...

August 20, 2025 · 4 min · Burton Ye
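One simple way to operationalize “train on the hardest 10%” is to estimate each example’s difficulty (for instance, the model’s empirical pass rate over a few rollouts) and keep the lowest-scoring slice. This is a hedged sketch of that ranking step only, under my assumptions; the paper’s actual selection criterion may differ, and `pass_rates` is assumed to come from rollouts you have already scored.

```python
def select_hard_fraction(examples, pass_rates, fraction=0.10):
    """Keep the `fraction` of examples with the lowest pass rate.

    `examples` and `pass_rates` are parallel lists; a lower pass rate
    is treated as a harder example.
    """
    ranked = sorted(zip(examples, pass_rates), key=lambda pair: pair[1])
    k = max(1, int(len(ranked) * fraction))  # always keep at least one
    return [example for example, _ in ranked[:k]]
```

Contrast with random selection, which would return a difficulty-representative mix instead of concentrating the budget on the examples the model fails most often.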