The question I’ve been thinking about: can an LLM, a stateless machine, teach itself? Does it have the introspection to understand its mistakes and know how to improve?
I spent the last few days running an experiment based on a paper called Training-Free GRPO. The core idea: instead of fine-tuning a model with reward signals, you extract natural-language “experiences” from its own successes and failures and inject them back into future prompts.
The obvious question: does it actually work?
Short answer: sometimes. And the follow-up question I ended up caring more about – can a stronger model teach a weaker one better than the weaker one teaches itself? – has a cleaner answer.
The setup
I tested 4 models (Llama 3.2 3B, Gemma 3 4B, Qwen 2.5 7B, Phi-4 ~14B) on self-referential logic puzzles. Each puzzle presents 7 statements like “at least 3 of these 7 statements are true” and “the number of true statements is prime.” You have to enumerate T = 0 through 7, check which assignments are self-consistent, and return the valid counts.
This task was chosen deliberately: single correct integer answer, baselines in the 10-25% range (room to improve), and it rewards the kind of systematic enumeration that a well-crafted experience rule could teach.
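The enumeration the task rewards is mechanical enough to sketch. Below is a minimal, hypothetical checker (not from the experiment’s codebase): each statement is modeled as a predicate on the candidate true-count T, which is all the puzzle’s statement types (“at least 3 are true,” “the count is prime”) actually depend on.

```python
def consistent_counts(statements):
    """Enumerate candidate true-counts T and keep the self-consistent ones.

    Each statement is a predicate on (T, n): its truth depends only on how
    many of the n statements are true."""
    n = len(statements)
    valid = []
    for t in range(n + 1):
        truths = [s(t, n) for s in statements]
        if sum(truths) == t:        # the assumed count must match itself
            valid.append(t)
    return valid

# Hypothetical 3-statement puzzle (not from the experiment's dataset):
stmts = [
    lambda t, n: t >= 1,            # "at least 1 of these is true"
    lambda t, n: t % 2 == 0,        # "the number of true statements is even"
    lambda t, n: t == 2,            # "exactly 2 statements are true"
]
print(consistent_counts(stmts))     # [1] – only T=1 is self-consistent
```

A single-statement liar (“exactly 0 of these statements are true”) returns an empty list: no count is self-consistent, which is exactly the paradox case the real puzzles are built around.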
The pipeline runs 3 epochs. Each training epoch generates G=8 rollouts per problem (99 problems, 792 total calls), then runs a 4-stage LLM pipeline that summarizes each trajectory, compares successful vs. failed attempts per problem, decides which experiences to add/update/delete, and consolidates a final library of up to 8 experiences. That library gets injected into the next epoch’s prompts. Evaluation uses 3 rollouts per problem (Mean@3 and Pass@3), run separately at each checkpoint.
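Structurally, the loop looks like the sketch below. Everything here is a stand-in – the real student and all four pipeline stages are LLM calls, and the pipeline stub is a caricature – but it makes the control flow concrete: G rollouts per problem, then one pipeline pass that rewrites a capped experience library, which feeds the next epoch’s prompts.

```python
import random
random.seed(0)

def model(prompt):
    """Stand-in for the student: returns whether a rollout was correct."""
    return random.random() < 0.2   # ~20% accuracy, like the real baselines

def pipeline(results, experiences, cap=8):
    """Stand-in for the 4-stage LLM pipeline (summarize, contrast successes
    vs. failures, decide add/update/delete, consolidate). Here it just
    appends one rule per epoch and enforces the size cap."""
    all_failed = [pid for pid, outs in results.items() if not any(outs)]
    if all_failed:
        experiences = experiences + [f"rule learned from problem {all_failed[0]}"]
    return experiences[-cap:]      # consolidation: keep at most `cap`

experiences = []
for epoch in range(3):             # 3 training epochs
    results = {}
    for pid in range(99):          # 99 problems
        prompt = "\n".join(experiences) + f"\nproblem {pid}"
        results[pid] = [model(prompt) for _ in range(8)]   # G = 8 rollouts
    experiences = pipeline(results, experiences)

print(len(experiences))
```

The teacher-distillation variant described below is the same structure with a single epoch and a different model running `pipeline`.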
In the basic version, the same model runs both the task and the experience pipeline. Student and teacher simultaneously.
I also ran a second experiment: replace the teacher with DeepSeek V3.2 and have it analyze each student model’s baseline rollouts in a single pass. Teacher distillation – one pipeline invocation, ~150 API calls, no iterative loop. The 4-stage pipeline is identical; the only difference is which model runs stages 1-4.
Self-learning results
The epoch-by-epoch progressions (Mean@3 across 99 problems, 3 rollouts each):
| Model | Baseline | Epoch 1 | Epoch 2 | Epoch 3 | Pattern |
|---|---|---|---|---|---|
| Phi-4 (~14B) | 10.4% | 16.5% | 15.5% | 18.2% | Dip then recovery |
| Llama 3.2 3B | 4.4% | 11.4% | 12.1% | 10.8% | Peak at 2, regress |
| Gemma 3 4B | 16.8% | 12.5% | 20.9% | 13.1% | Crash, spike, crash |
| Qwen 2.5 7B | 22.6% | 18.5% | 18.9% | 13.5% | Plateau then crash |
Two models ended above baseline, one whipsawed and finished below where it started, and one plateaued below baseline before collapsing. The two that ended positive started with the lowest baselines.
The instability is the main story. Gemma’s trajectory is the most striking: it lost 4.3pp going into epoch 1, recovered 8.4pp in epoch 2, then gave back 7.8pp in epoch 3. That’s not noise – it’s a library that happened to capture something useful in epoch 2 and then overwrote it entirely in epoch 3. Whatever worked didn’t survive. Llama peaked at epoch 2 and regressed 1.3pp in epoch 3. Even Phi-4, which continued improving into epoch 3, showed a 1pp dip in epoch 2 before recovering.
The experience memorization problem
The most interesting finding wasn’t in the headline results. It was in what happened to Phi-4’s experience library across epochs – specifically in the incremental checkpoint runs (not the main fresh 3-epoch run in the table above, which kept 8 general experiences through epoch 3).
In earlier incremental Phi-4 runs, the experience library shows a clear degradation pattern:
Epoch 1 – generalizable rule with an anchor:
“When evaluating truth-value assignments, explicitly classify each number as prime, composite, or neither (e.g., 1 is neither prime nor composite) to avoid misclassification errors.”
Epoch 2 – the rule starts collapsing into a specific case:
“When evaluating truth-value assignments for T=2, ensure exactly two statements are true. For example, verify that only Statements 2 and 6 are true.”
Epoch 3 – full case memorization:
“For T=3, verify that Statement 1 is false because having 4 true statements contradicts the condition of ‘at least 2 true.’”
By epoch 3 of those runs, every experience was prefixed “For T=3” and described a specific statement’s truth value from a specific training puzzle. These aren’t strategies. They’re answers to problems the model already solved. When injected into a held-out problem, they’re noise.
The main 3-epoch run didn’t degrade this severely – Phi-4 is large enough to maintain more abstract experiences for longer. But the incremental runs show the underlying drift mechanism clearly: the consolidation stage gravitates toward text that references the most frequently encountered training cases, and over enough passes, that content stops transferring to held-out problems. It’s the same mechanism that likely caps smaller models’ gains earlier.
Why Gemma crashed and Qwen plateaued
Gemma’s crash-spike-crash pattern came down to library volatility. The pipeline replaced nearly its entire experience library every epoch, so epoch 2’s useful content had no protection going into epoch 3. The epoch-to-epoch swings (-4.3pp, +8.4pp, -7.8pp) show a library that’s being rebuilt from scratch each time rather than iteratively refined.
Qwen’s failure is more interesting. Qwen generates verbose chain-of-thought traces and regularly hit the 2000-token output ceiling. Across two independent training runs it accumulated 161 and 345 truncations out of ~6000 calls each. A truncated response can’t receive credit even if the reasoning was on track.
Qwen’s truncation rate in eval also rose from 2.0% at baseline to 5.7% with teacher experiences injected. Part of that is presumably the injected experiences lengthening the prompt, though the two eval runs weren’t a controlled ablation, so the effect isn’t cleanly isolable. What’s clear is that Qwen’s verbose traces are incompatible with the 2000-token ceiling in a way no other model’s were.
Teacher distillation: one pass, all four models
The second experiment was to run DeepSeek V3.2 on each student’s baseline rollouts and generate experiences from the outside. Same 4-stage pipeline, one pass, no iterative refinement.
Note: the baselines here come from separate eval runs than the self-learning table above, so the same model can show different baselines (2-5pp variance from sampling noise). The comparisons within each section are internally consistent, but the cross-section head-to-head is comparing deltas from different starting points.
| Model | Baseline | With Teacher | Delta |
|---|---|---|---|
| Llama 3.2 3B | 10.4% | 20.5% | +10.1pp |
| Phi-4 (~14B) | 13.1% | 17.5% | +4.4pp |
| Gemma 3 4B | 11.5% | 16.1% | +4.7pp |
| Qwen 2.5 7B | 25.6% | 22.6% | -3.0pp |
Head-to-head against the best self-learning epoch – keeping in mind self-learning ran 3 epochs (99×8×3 = ~2,400 training rollouts plus 3 pipeline passes) while teacher distillation ran the pipeline once on 1 epoch of baseline rollouts:
| Model | Self-learning best | Teacher (1 pass) | Winner |
|---|---|---|---|
| Phi-4 (~14B) | +10.8pp | +4.4pp | Self-learning |
| Llama 3.2 3B | +8.8pp | +10.1pp | Teacher |
| Gemma 3 4B | -3.0pp | +4.7pp | Teacher (decisively) |
| Qwen 2.5 7B | -7.7pp | -3.0pp | Neither |
Teacher distillation beats self-learning on Llama (+10.1pp vs. +8.8pp) using a fraction of the compute. For the two smallest models, one pass from a strong teacher outperforms three epochs of self-improvement.
Why the size crossover makes sense
Self-learning asks a model to (1) generate rollouts, (2) reason about why some succeeded and others failed, (3) articulate generalizable principles, and (4) apply those principles to future problems. Steps 2 and 3 require accurate meta-cognition. A 3B model failing at self-referential logic probably can’t correctly diagnose its own errors.
The teacher bypasses that entirely. DeepSeek V3.2 analyzing a 3B model’s rollouts produces experiences like:
“Enumerate T from 0 to total statements. For each T, compute F = total - T, then evaluate each statement’s condition against T/F.”
“For XOR statements, first evaluate sub-statements A and B. The XOR is true only if exactly one of A, B is true.”
None of these reference specific T values or specific statement numbers. They’re procedures, not answers. A model receiving these on a held-out problem can apply them.
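The two quoted rules translate directly into code, which is part of why they transfer. A hypothetical rendering of what “applying” them looks like on a 7-statement puzzle (the compound statement is made up for illustration):

```python
def xor_statement(a, b):
    """Teacher's XOR rule: true only if exactly one of A, B is true."""
    return a != b

n = 7                                # a 7-statement puzzle
for t in range(n + 1):               # teacher's rule: enumerate T = 0..n
    f = n - t                        # teacher's rule: F = total - T
    # A made-up compound statement:
    # "(exactly 3 statements are false) XOR (at least 5 are true)"
    value = xor_statement(f == 3, t >= 5)
    print(t, value)
```

Each iteration evaluates the statement against a candidate (T, F) pair – exactly the procedure the teacher’s experiences prescribe, with no reference to any particular training puzzle’s answer.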
At ~14B, Phi-4 has enough meta-cognitive capacity to generate useful self-diagnoses. And with 3 epochs of iterative refinement vs. the teacher’s single pass, self-learning accumulates more task-relevant procedural knowledge than a one-shot extraction can match. The advantage isn’t just who’s analyzing the mistakes – it’s also how many rounds of refinement the library gets.
The practical implication: teacher distillation is most valuable precisely where self-reflection is least reliable, and it’s most efficient when you’re comparing it against a budget-constrained version of self-learning.
The Qwen puzzle
Qwen regresses under both approaches. The max_tokens=4000 teacher distillation run (designed to test whether the 2000-token ceiling was the culprit) actually made things worse: -5.05pp with zero truncations, versus -3.0pp at the original ceiling. The truncation hypothesis is refuted. Eliminating every truncated response didn’t help at all.
The leading explanation: Qwen’s instruction tuning biases it toward treating prepended procedural lists as context to discuss, not directives to follow. This is a model-level incompatibility with the injection mechanism, not a quality issue with the experiences themselves.
Cross-injection experiments: how we know it’s the mechanism, not the content
I ran Qwen with Llama’s experience library – content derived entirely from Llama’s failure modes, with no input from Qwen’s own rollouts. Qwen regressed to -11.1pp, with a 19.5% truncation rate (vs. 5.7% with its own experiences). The content doesn’t matter. Qwen treats the injected block as context to elaborate on rather than instructions to follow, and that verbosity pushes its responses past any token ceiling. It’s a generation mode triggered by the presence of the injected list, not by what the list says.
The strategy interference hypothesis – that Qwen’s high baseline (25.6%) means it has an internalized approach that explicit instructions disrupt – predicts the opposite of what we see. Foreign experiences, which shouldn’t conflict with Qwen’s internalized approach at all, caused more regression than its own targeted experiences. Content specificity doesn’t explain the pattern.
For comparison: running Llama with Qwen’s experiences produced +2.69pp – much less than Llama’s +9.42pp gain from its own experiences, but still positive. Llama uses injected content as instructions regardless of source. Qwen does the opposite regardless of source.
There’s also a strange inversion. Llama had the fewest queries with useful learning signal – only 12 of 99 had mixed success across rollouts, meaning 87 problems were either all-wrong or all-right. Qwen had the most – 48 of 99. Yet Llama matched self-learning’s best result and Qwen regressed. A model that always fails the same way may be a cleaner diagnostic target for the teacher than one that fails inconsistently.
Open questions
The paper’s core idea holds up. Prompt injection can compound across iterations when the experiences are abstract and stable. The failure modes are mostly library management problems – memorization drift, library churn, context saturation – not a flaw in the fundamental mechanism. But several things remain genuinely unclear to me:
Is there a natural stopping signal? Multiple models peaked at epoch 2 then regressed. Can you detect this in real-time – when training accuracy stops climbing or experience churn exceeds a threshold – and halt before the library degrades?
Why does library churn correlate with failure? Llama retained most of its experiences between epochs and improved. Gemma replaced nearly everything every epoch and degraded. Is this a cause or a symptom? Would forcing a retention floor stabilize the weaker models, or just lock in bad experiences?
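One way to make the stopping and churn questions concrete: measure churn as the fraction of the library replaced between epochs, track the training-accuracy delta, and halt when either signal degrades. A sketch with made-up thresholds (the real pipeline has no such monitor; treating experiences as exact strings for set intersection is a simplification, since the pipeline also rewrites entries in place):

```python
def should_stop(prev_lib, new_lib, prev_acc, new_acc,
                max_churn=0.5, min_gain=0.0):
    """Halt when the library is being rebuilt rather than refined
    (high churn), or when training accuracy has stopped climbing."""
    kept = len(set(prev_lib) & set(new_lib))
    churn = 1.0 - kept / max(len(prev_lib), 1)
    return churn > max_churn or (new_acc - prev_acc) <= min_gain

# A Gemma-like epoch: near-total replacement trips the churn check
# even though accuracy went up.
print(should_stop(["a", "b", "c"], ["x", "y", "z"], 0.125, 0.209))  # True
```

Under this rule, Gemma’s epoch-2-to-3 transition would have halted training at the peak; whether that avoids the crash or just freezes a lucky library is exactly the open question.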
What’s the right library size for a given model? Llama (3B) thrived with 6 concise experiences. Gemma (4B) struggled with 8. Phi-4 (14B) handled 8 fine. Is there a relationship between model size and the number of instructions it can usefully follow at once?
Where is the self-teaching ceiling? Phi-4 self-taught (+10.8pp) outperformed its teacher-taught result (+4.4pp), but smaller models showed the opposite. Is there a model-size threshold where self-diagnosis becomes reliable, and does it depend on task difficulty?
Can teacher distillation iterate? The teacher results used a single pass over baseline rollouts. What happens if you run a second pass – the teacher analyzes the student’s post-experience rollouts and refines the library? Does that compound or saturate?
Why did the model with the least signal improve the most? Llama had only 12 contrastive groups out of 99 in epoch 1 – most problems were all-wrong. Qwen had 48. Yet Llama beat self-learning with teacher distillation. One interpretation: concentrated, consistent failures are a cleaner diagnostic target than inconsistent ones. But the cross-injection result adds another layer – Qwen’s rich signal is irrelevant if it rejects the experiences anyway.
Does experience-student alignment scale? Cross-injection shows experiences work best when derived from the student’s own failures. Llama gained +9.42pp from its own experiences but only +2.69pp from Qwen’s. How much of this is failure-mode targeting vs. the teacher simply having more contrastive signal to work with? And does a teacher analyzing a stronger student’s near-correct rollouts produce experiences that transfer to weaker students?
Is Qwen’s regression purely a context problem? Answered. Raising max_tokens to 4000 made it worse. Cross-injection with foreign experiences made it much worse (-11.1pp). The regression is injection-mechanism incompatibility, not context saturation or strategy interference. What’s not yet tested: whether different injection placements (system prompt, few-shot) recover performance for models like Qwen that reject list-prepended experiences.