The previous experiment ended with an unresolved anomaly. Three models improved when a strong teacher (DeepSeek V3.2) injected procedural experiences into their prompts. One, Qwen 2.5 7B, regressed, and kept regressing regardless of what experiences it received or how much token budget it was given. The cross-injection experiments showed the content wasn't to blame; the problem was how Qwen handles injected lists in the first place.
That left an open question: is this a Qwen thing, or does it happen to any model that's already competent at the task?
The test
To find out, I ran 4 new models through the same protocol: baseline eval on 99 self-referential logic puzzles, then eval again with Llama’s pre-computed experience library injected into every prompt.
The models were chosen specifically to break the correlations present in the original study:
| Model | Purpose |
|---|---|
| Llama 3.2 1B | Does the receiver property extend to 1B? |
| Mistral 7B | Same size as Qwen 2.5 7B, different family |
| Ministral 3B | Same size as Llama 3.2 3B (a strong receiver), different family |
| Qwen3 8B | Newer Qwen generation — does it change the behavior? |
Using a fixed injection source (Llama’s 6-experience checkpoint) across all models means any differences in outcome reflect how each model responds to the same content, not differences in experience quality.
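The fixed-injection setup can be sketched in a few lines of Python. The experience strings and the function name below are illustrative placeholders, not the actual contents of Llama's 6-experience checkpoint:

```python
# Illustrative sketch of the fixed-injection protocol: the same
# pre-computed experience library is prepended to every puzzle prompt.
# The experience strings here are placeholders, not the real checkpoint.

EXPERIENCES = [
    "Restate the puzzle's self-reference explicitly before reasoning.",
    "Enumerate candidate truth assignments instead of guessing.",
    # ...the real checkpoint holds 6 experiences
]

def build_prompt(puzzle: str, experiences: list[str]) -> str:
    """Prepend the injected experience block to the task prompt."""
    block = "\n".join(f"- {e}" for e in experiences)
    return (
        "Procedural experiences from prior attempts:\n"
        f"{block}\n\n"
        f"Puzzle:\n{puzzle}\n"
        "Answer:"
    )
```

Because the block is byte-identical for every model, any between-model difference in outcome is attributable to the receiver, which is the design point above.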
Results
| Model | Family | Baseline | With Llama Experiences | Δ |
|---|---|---|---|---|
| Llama 3.2 1B | Llama | 12.5% | 19.5% | +7.1pp |
| Mistral 7B | Mistral | 12.5% | 20.2% | +7.7pp |
| Ministral 3B | Mistral | 22.9% | 14.8% | -8.1pp |
| Qwen3 8B | Qwen3 | 25.6% | 29.0% | +3.4pp |
The Mistral result is the one that matters most. Ministral 3B rejects experiences hard (-8.1pp). Mistral 7B receives them strongly (+7.7pp). Same family, same injection source, opposite outcomes. This directly rules out family-specific instruction tuning as the explanation.
The full picture
Combining with the original 4-model study:
| Model | Family | Baseline | ΔMean@3 | Verdict |
|---|---|---|---|---|
| Llama 3.2 1B | Llama | 12.5% | +7.1pp | Receiver |
| Llama 3.2 3B | Llama | 10.4% | +8.75pp | Receiver |
| Mistral 7B | Mistral | 12.5% | +7.7pp | Receiver |
| Phi-4 (~14B) | Phi | 13.1% | +4.4pp | Receiver |
| Gemma 3 4B | Gemma | 11.5% | +4.7pp | Receiver |
| Ministral 3B | Mistral | 22.9% | -8.1pp | Rejector |
| Qwen 2.5 7B | Qwen | 25.6% | -11.1pp | Rejector |
| Qwen3 8B | Qwen3 | 25.6% | +3.4pp | Receiver* |

\*Qwen3 8B sits above the threshold but still improves; the next section explains why.
Every model below ~15% baseline improves. Every model above ~20% regresses, with the single asterisked exception covered below. Family and size have no predictive power: Ministral 3B (3B params, Mistral family) rejects while Llama 3.2 1B (1B params, Llama family) receives. The split tracks task capability, not architecture.
The threshold is approximately 15–20% Mean@3 on this specific task. Below it, the model hasn’t found a reliable approach and benefits from procedural guidance. Above it, it already has one, and injecting a different procedure disrupts it.
This is a task-specific threshold. A model that rejects experiences here, where it starts from 22%, might be a strong receiver on a task where it starts from 5%.
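Read as a heuristic, the threshold is a three-way classifier. A minimal sketch; the 15/20 cutoffs are the values measured on this task and would need re-measuring anywhere else, and the function name is mine:

```python
# Capability-threshold heuristic from the receiver/rejector split.
# Cutoffs are task-specific; re-measure them for any new task.

def predict_injection_effect(baseline_mean_at_3: float) -> str:
    """Predict injection outcome from baseline accuracy (0-100 scale)."""
    if baseline_mean_at_3 < 15.0:
        return "receiver"   # no reliable strategy yet; guidance likely helps
    if baseline_mean_at_3 > 20.0:
        return "rejector"   # internalized strategy; injection likely disrupts
    return "uncertain"      # inside the threshold band
```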
The exception: Qwen3 and extended thinking
Qwen3 8B has the same baseline as Qwen 2.5 7B (25.6%) but improves rather than regresses. The token statistics explain why:
| Model | Avg completion tokens | Max |
|---|---|---|
| Llama 1B | 1609 | 2000 |
| Mistral 7B | 1451 | 1621 |
| Ministral 3B | 923 | 1524 |
| Qwen3 8B | 3540 | 7379 |
The max_tokens ceiling for all runs was 2000. Qwen3 8B returned an average of 3540 tokens per response. Qwen3 models run in extended thinking mode by default on OpenRouter, generating internal chain-of-thought that’s returned separately and isn’t subject to the max_tokens ceiling the same way visible output is.
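A quick way to spot this from logged usage data: any model whose average completion length exceeds the visible max_tokens ceiling must be emitting tokens through a separate channel. A small sketch over the table values above (the dict layout is mine, not the study's logging format):

```python
# Flag models whose average completion length exceeds the visible
# max_tokens ceiling, a sign that a separate reasoning channel
# (extended thinking) is in play. Values are (avg, max) from the table.

MAX_TOKENS = 2000
token_stats = {
    "Llama 1B":     (1609, 2000),
    "Mistral 7B":   (1451, 1621),
    "Ministral 3B": (923, 1524),
    "Qwen3 8B":     (3540, 7379),
}

thinking_suspects = [
    model for model, (avg, _peak) in token_stats.items() if avg > MAX_TOKENS
]
print(thinking_suspects)  # ['Qwen3 8B']
```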
This matters because it changes how injected experiences interact with generation. For a single-pass model, the injected experience block enters the prompt before generation begins — it has to be incorporated immediately into the output stream. A model with an internalized strategy can find that flow disrupted when it has to accommodate a prepended procedure. For an extended-thinking model, the injected content becomes an input to a deliberative reasoning loop that can evaluate it, use what’s relevant, and discard what isn’t. The injection doesn’t override the generation path; it informs the reasoning that precedes it.
The practical implication: o1-style and thinking models may be resistant to the rejection mechanism even at high baseline capability.
What this means for the original Qwen puzzle
The original study attributed Qwen 2.5 7B’s rejection primarily to instruction-following mismatch (H1) — something specific about how Qwen handles prepended procedural lists. The new data shifts the explanation. Ministral 3B rejects despite having no connection to Qwen’s training lineage. The more general mechanism is H3: models with effective internalized strategies are disrupted by injected procedural content. The disruption is a property of having a strategy, not of being Qwen.
H1 may still contribute for Qwen specifically — its particular instruction tuning might amplify the disruption. But the primary driver is task capability, not model identity.
Three factors
The complete picture of why models respond so differently to experience injection:
1. Task-specific baseline capability predicts the receiver/rejector split. This is the dominant factor, holds across all tested families and sizes, and is task-specific rather than model-specific.
2. Generation mode modulates Factor 1 at high baseline. Single-pass models with effective strategies reject injection. Extended-thinking models with the same capability level don’t — the deliberative loop absorbs the content without disruption.
3. Experience-student alignment modulates magnitude for receivers. Experiences derived from the student’s own failures yield larger gains than cross-model experiences (+8.75pp own vs +2.69pp foreign for Llama 3.2 3B). Cross-model experiences still help receivers, but less precisely.
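The three factors can be folded into one rough decision heuristic. A hedged sketch: the cutoffs come from this task, the extended-thinking override rests on a single data point (Qwen3 8B), and `should_inject` is an illustrative name, not code from the study:

```python
# Rough decision heuristic combining the three factors above.
# Cutoffs are task-specific; the extended-thinking override rests on
# one data point (Qwen3 8B). Treat both as assumptions.

def should_inject(baseline: float, extended_thinking: bool,
                  own_rollout_experiences: bool) -> tuple[bool, str]:
    """Return (inject?, rationale) for a baseline accuracy in percent."""
    if baseline < 15.0:                        # factor 1: no reliable strategy
        note = "low baseline: procedural guidance likely helps"
        if own_rollout_experiences:            # factor 3: alignment boosts gain
            note += "; own-rollout experiences should maximize the gain"
        return True, note
    if baseline > 20.0:                        # factor 1: internalized strategy
        if extended_thinking:                  # factor 2: thinking loop filters
            return True, "high baseline, but the thinking loop can absorb it"
        return False, "high baseline, single-pass: injection likely disrupts"
    return True, "inside the threshold band: measure before committing"
```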
Practical upshot
Before injecting experiences into a model’s prompts:
- Measure baseline on the target task first. Below ~15%: injection likely helps. Above ~20%: it likely hurts.
- The threshold is task-specific. Don’t assume a model that rejects here would reject on a different task where it starts lower.
- Extended-thinking models are worth trying at high baseline — they may avoid disruption entirely.
- Generate experiences from the target model’s own rollouts for maximum gain; cross-model experiences are useful but less targeted.