The previous experiment ended with an unresolved anomaly. Three models improved when a strong teacher (DeepSeek V3.2) injected procedural experiences into their prompts. One, Qwen 2.5 7B, regressed, and kept regressing regardless of what experiences it received or how much token budget it was given. The cross-injection experiments showed the content wasn't to blame; the problem was how Qwen handles injected lists in the first place.
That left an open question: is this a Qwen thing, or does it happen to any model that's already competent at the task?
The test
To find out, I ran 4 new models through the same protocol: baseline eval on 99 self-referential logic puzzles, then eval again with Llama’s pre-computed experience library injected into every prompt.
The models were chosen specifically to break the correlations present in the original study:
| Model | Purpose |
|---|---|
| Llama 3.2 1B | Does the receiver property extend to 1B? |
| Mistral 7B | Same size as Qwen 2.5 7B, different family |
| Ministral 3B | Same size as Llama 3.2 3B (a strong receiver), different family |
| Qwen3 8B | Newer Qwen generation — does it change the behavior? |
Using a fixed injection source (Llama’s 6-experience checkpoint) across all models means any differences in outcome reflect how each model responds to the same content, not differences in experience quality.
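The fixed-injection setup can be sketched in a few lines of Python. The experience strings and the function name below are illustrative placeholders, not the actual contents of Llama's 6-experience checkpoint:

```python
# Illustrative sketch of the fixed-injection protocol: the same
# pre-computed experience library is prepended to every puzzle prompt.
# The experience strings here are placeholders, not the real checkpoint.

EXPERIENCES = [
    "Restate the puzzle's self-reference explicitly before reasoning.",
    "Enumerate candidate truth assignments instead of guessing.",
    # ...the real checkpoint holds 6 experiences
]

def build_prompt(puzzle: str, experiences: list[str]) -> str:
    """Prepend the injected experience block to the task prompt."""
    block = "\n".join(f"- {e}" for e in experiences)
    return (
        "Procedural experiences from prior attempts:\n"
        f"{block}\n\n"
        f"Puzzle:\n{puzzle}\n"
        "Answer:"
    )
```

Because the block is byte-identical for every model, any between-model difference in outcome is attributable to the receiver, which is the design point above.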
Results
| Model | Family | Baseline | With Llama Experiences | Δ |
|---|---|---|---|---|
| Llama 3.2 1B | Llama | 12.5% | 19.5% | +7.1pp |
| Mistral 7B | Mistral | 12.5% | 20.2% | +7.7pp |
| Ministral 3B | Mistral | 22.9% | 14.8% | -8.1pp |
| Qwen3 8B | Qwen3 | 25.6% | 29.0% | +3.4pp |
The Mistral result is the one that matters most. Ministral 3B rejects experiences hard (-8.1pp). Mistral 7B receives them strongly (+7.7pp). Same family, same injection source, opposite outcomes. This directly rules out family-specific instruction tuning as the explanation.
The full picture
Combining with the original 4-model study:
| Model | Family | Baseline | ΔMean@3 | Verdict |
|---|---|---|---|---|
| Llama 3.2 1B | Llama | 12.5% | +7.1pp | Receiver |
| Llama 3.2 3B | Llama | 10.4% | +8.75pp | Receiver |
| Mistral 7B | Mistral | 12.5% | +7.7pp | Receiver |
| Phi-4 (~14B) | Phi | 13.1% | +4.4pp | Receiver |
| Gemma 3 4B | Gemma | 11.5% | +4.7pp | Receiver |
| Ministral 3B | Mistral | 22.9% | -8.1pp | Rejector |
| Qwen 2.5 7B | Qwen | 25.6% | -11.1pp | Rejector |
| Qwen3 8B | Qwen3 | 25.6% | +3.4pp | Receiver* |

\*Qwen3 8B sits above the threshold but still improves; the next section explains why.
Every model below ~15% baseline improves. Every model above ~20% regresses, with the single asterisked exception covered below. Family and size have no predictive power: Ministral 3B (3B params, Mistral family) rejects while Llama 3.2 1B (1B params, Llama family) receives. The split tracks task capability, not architecture.
The threshold is approximately 15–20% Mean@3 on this specific task. Below it, the model hasn’t found a reliable approach and benefits from procedural guidance. Above it, it already has one, and injecting a different procedure disrupts it.
This is a task-specific threshold. A model that rejects experiences here, where it starts from 22%, might be a strong receiver on a task where it starts from 5%.
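Read as a heuristic, the threshold is a three-way classifier. A minimal sketch; the 15/20 cutoffs are the values measured on this task and would need re-measuring anywhere else, and the function name is mine:

```python
# Capability-threshold heuristic from the receiver/rejector split.
# Cutoffs are task-specific; re-measure them for any new task.

def predict_injection_effect(baseline_mean_at_3: float) -> str:
    """Predict injection outcome from baseline accuracy (0-100 scale)."""
    if baseline_mean_at_3 < 15.0:
        return "receiver"   # no reliable strategy yet; guidance likely helps
    if baseline_mean_at_3 > 20.0:
        return "rejector"   # internalized strategy; injection likely disrupts
    return "uncertain"      # inside the threshold band
```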
The exception: Qwen3 and extended thinking
Qwen3 8B has the same baseline as Qwen 2.5 7B (25.6%) but improves rather than regresses. The token statistics explain why:
| Model | Avg completion tokens | Max |
|---|---|---|
| Llama 1B | 1609 | 2000 |
| Mistral 7B | 1451 | 1621 |
| Ministral 3B | 923 | 1524 |
| Qwen3 8B | 3540 | 7379 |
The max_tokens ceiling for all runs was 2000. Qwen3 8B returned an average of 3540 tokens per response. Qwen3 models run in extended thinking mode by default on OpenRouter, generating internal chain-of-thought that’s returned separately and isn’t subject to the max_tokens ceiling the same way visible output is.
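A quick way to spot this from logged usage data: any model whose average completion length exceeds the visible max_tokens ceiling must be emitting tokens through a separate channel. A small sketch over the table values above (the dict layout is mine, not the study's logging format):

```python
# Flag models whose average completion length exceeds the visible
# max_tokens ceiling, a sign that a separate reasoning channel
# (extended thinking) is in play. Values are (avg, max) from the table.

MAX_TOKENS = 2000
token_stats = {
    "Llama 1B":     (1609, 2000),
    "Mistral 7B":   (1451, 1621),
    "Ministral 3B": (923, 1524),
    "Qwen3 8B":     (3540, 7379),
}

thinking_suspects = [
    model for model, (avg, _peak) in token_stats.items() if avg > MAX_TOKENS
]
print(thinking_suspects)  # ['Qwen3 8B']
```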
This matters because it changes how injected experiences interact with generation. For a single-pass model, the injected experience block enters the prompt before generation begins — it has to be incorporated immediately into the output stream. A model with an internalized strategy can find that flow disrupted when it has to accommodate a prepended procedure. For an extended-thinking model, the injected content becomes an input to a deliberative reasoning loop that can evaluate it, use what’s relevant, and discard what isn’t. The injection doesn’t override the generation path; it informs the reasoning that precedes it.
The practical implication: o1-style and thinking models may be resistant to the rejection mechanism even at high baseline capability.
What this means for the original Qwen puzzle
The original study attributed Qwen 2.5 7B’s rejection primarily to instruction-following mismatch (H1) — something specific about how Qwen handles prepended procedural lists. The new data shifts the explanation. Ministral 3B rejects despite having no connection to Qwen’s training lineage. The more general mechanism is H3: models with effective internalized strategies are disrupted by injected procedural content. The disruption is a property of having a strategy, not of being Qwen.
H1 may still contribute for Qwen specifically — its particular instruction tuning might amplify the disruption. But the primary driver is task capability, not model identity.
Three factors
The complete picture of why models respond so differently to experience injection:
1. Task-specific baseline capability predicts the receiver/rejector split. This is the dominant factor, holds across all tested families and sizes, and is task-specific rather than model-specific.
2. Generation mode modulates Factor 1 at high baseline. Single-pass models with effective strategies reject injection. Extended-thinking models with the same capability level don’t — the deliberative loop absorbs the content without disruption.
3. Experience-student alignment modulates magnitude for receivers. Experiences derived from the student’s own failures yield larger gains than cross-model experiences (+8.75pp own vs +2.69pp foreign for Llama 3.2 3B). Cross-model experiences still help receivers, but less precisely.
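The three factors can be folded into one rough decision heuristic. A hedged sketch: the cutoffs come from this task, the extended-thinking override rests on a single data point (Qwen3 8B), and `should_inject` is an illustrative name, not code from the study:

```python
# Rough decision heuristic combining the three factors above.
# Cutoffs are task-specific; the extended-thinking override rests on
# one data point (Qwen3 8B). Treat both as assumptions.

def should_inject(baseline: float, extended_thinking: bool,
                  own_rollout_experiences: bool) -> tuple[bool, str]:
    """Return (inject?, rationale) for a baseline accuracy in percent."""
    if baseline < 15.0:                        # factor 1: no reliable strategy
        note = "low baseline: procedural guidance likely helps"
        if own_rollout_experiences:            # factor 3: alignment boosts gain
            note += "; own-rollout experiences should maximize the gain"
        return True, note
    if baseline > 20.0:                        # factor 1: internalized strategy
        if extended_thinking:                  # factor 2: thinking loop filters
            return True, "high baseline, but the thinking loop can absorb it"
        return False, "high baseline, single-pass: injection likely disrupts"
    return True, "inside the threshold band: measure before committing"
```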
Practical upshot
Before injecting experiences into a model’s prompts:
- Measure baseline on the target task first. Below ~15%: injection likely helps. Above ~20%: it likely hurts.
- The threshold is task-specific. Don’t assume a model that rejects here would reject on a different task where it starts lower.
- Extended-thinking models are worth trying at high baseline — they may avoid disruption entirely.
- Generate experiences from the target model’s own rollouts for maximum gain; cross-model experiences are useful but less targeted.