This post summarizes a paper I co-authored with Benjamin Pikus and Pratyush Ranjan Tiwari. The full paper is on arXiv.
Fine-tuning a language model with GRPO is expensive. Collecting and annotating training data is expensive. So if you can only afford to train on 10% of your data, which 10% should you pick?
The intuitive answer might be: a representative sample. Maybe some easy, some hard, some in the middle. That’s what random selection gives you.
The actual answer, at least for GRPO on reasoning tasks: pick the hardest examples. Not a mix. The hardest 10%.
The performance gap is large enough to matter. Training on the hardest subset yields accuracy gains of 34-47 percentage points on GSM8K and Tracking Shuffled Objects across the models we tested (Qwen3-4B, Qwen3-14B, Phi-4). Training on the easiest subset gets you 3-15 percentage points. That’s a 30+ point spread from the same budget, same number of steps, same everything except which examples you chose.
## Why GRPO cares about difficulty
GRPO works by sampling a group of rollouts for each prompt, computing their rewards, and learning from the contrast between them. If a model gets every rollout in a group right, the advantages go to zero and nothing is learned. Same if it gets every rollout wrong.
The learning signal only exists when the group has mixed outcomes.
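The contrast GRPO learns from can be sketched in a few lines. This is a simplified, sequence-level version for illustration only: real implementations normalize per token and combine the advantage with a clipped policy ratio, but the zero-variance degeneracy is the same.

```python
def group_advantages(rewards):
    """Group-relative advantages: reward minus group mean, scaled by
    the group standard deviation.

    If every rollout in the group gets the same reward (all correct or
    all wrong), the std is zero and every advantage is zero: nothing to
    learn from this prompt at this step.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0.0:
        return [0.0] * n  # degenerate group: no learning signal
    return [(r - mean) / std for r in rewards]

# A mixed group yields contrastive advantages; uniform groups yield zeros.
print(group_advantages([1, 1, 0, 0]))  # → [1.0, 1.0, -1.0, -1.0]
print(group_advantages([1, 1, 1, 1]))  # → [0.0, 0.0, 0.0, 0.0]
```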
Easy examples are problematic for exactly this reason. At the start of training, a model already solves them most of the time. By a few hundred steps in, it solves them almost always. The group collapses to all-correct and stops contributing any gradient.
Hard examples stay mixed. The model gets some right and some wrong throughout training, which means the contrastive signal stays alive.
We measured this directly with a metric we call “learnable percentage” – the fraction of training steps where the within-group reward standard deviation is nonzero. Across all models and selection strategies, learnable percentage correlates strongly with final performance (R² = 0.66). The mechanism isn’t subtle.
| Strategy | GSM8K Learnable % (Qwen3-4B) | GSM8K Accuracy Gain |
|---|---|---|
| Easy | 3.7% | +3.5pp |
| Random | 19.0% | +29.5pp |
| Medium | 24.5% | +26.7pp |
| Hard | 34.1% | +34.2pp |
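Computing learnable percentage from logged per-step rollout rewards is straightforward. A minimal sketch (function and variable names here are illustrative, not from the paper's code):

```python
import statistics

def learnable_pct(reward_groups):
    """Fraction of training steps whose rollout group has nonzero
    within-group reward standard deviation.

    reward_groups: one list of rollout rewards per training step.
    """
    live = sum(1 for g in reward_groups if statistics.pstdev(g) > 0)
    return live / len(reward_groups)

# An easy example quickly collapses to all-correct groups...
easy_steps = [[1, 1, 0, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
# ...while a hard one stays mixed and keeps contributing gradient.
hard_steps = [[0, 1, 0, 0], [1, 0, 0, 1], [0, 1, 1, 0], [1, 0, 1, 1]]
print(learnable_pct(easy_steps))  # → 0.25
print(learnable_pct(hard_steps))  # → 1.0
```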
## Out-of-distribution: only hard training generalizes
After training on GSM8K subsets, we evaluated Qwen3-4B on AIME2025-I – competition math, much harder than the training distribution. The results are stark.
| Training subset | AIME2025 Pass@8 |
|---|---|
| Base model | 33.3% |
| Easy | 33.3% |
| Random | 33.3% |
| Medium | 26.7% |
| Hard | 40.0% |
Easy, random, and base model all tie at 33.3%. Medium actually regresses. Hard is the only condition that improves. Training on easy examples doesn’t just fail to help on harder problems – it produces a model indistinguishable from no training at all.
This matters beyond the benchmark. The goal of GRPO fine-tuning is usually some form of generalization, not just fitting the training distribution. If easy examples can’t produce OOD gains, they’re providing essentially no value.
## A simpler version of the same insight
We also ran an experiment that doesn’t require the multi-sample difficulty estimation machinery. Instead of ranking by pass@k, just split examples into two buckets: ones the base model gets wrong, and ones it gets right. Train on each bucket separately.
“Base wrong” examples consistently match or beat “base right” examples across models and tasks, even when the base right set is much larger. On GSM8K, the average relative improvement of base wrong over size-matched base right is 14.3%. Base wrong also slightly outperforms training on all examples (85% vs 82% average accuracy), suggesting that base right examples aren’t just neutral – they dilute the useful signal.
The practical implication: you can identify high-value training examples without any multi-sample probing. Just run the base model once on each example and keep the ones it fails. That’s a much cheaper filter.
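The single-pass filter amounts to one comparison per example. A sketch, assuming a list of examples with gold answers and one base-model prediction per example (the field names and data format are hypothetical; adapt to your own inference loop):

```python
def base_wrong_filter(examples, model_answers):
    """Keep only the examples the base model fails on a single pass.

    examples: list of dicts, each with an 'answer' field (illustrative).
    model_answers: the base model's one-shot answer per example.
    """
    return [ex for ex, pred in zip(examples, model_answers)
            if pred != ex["answer"]]

examples = [{"q": "2+2", "answer": "4"}, {"q": "17*23", "answer": "391"}]
preds = ["4", "389"]  # base model nails the easy one, misses the hard one
print(base_wrong_filter(examples, preds))  # keeps only the 17*23 example
```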
## What this means in practice
If you’re curating data for GRPO fine-tuning, the takeaway is concrete:
- Don’t sample randomly. A representative mix of easy and hard examples wastes budget on the easy ones.
- Probe the base model first. Run each candidate example through the model a few times. Keep the ones with low success rates.
- If budget is tight, go further. The harder the subset, the better – within reason. The relationship between difficulty and performance is monotonic in our experiments.
- Easy examples aren’t free. They’re not just neutral filler. They actively crowd out learning signal by consuming training steps where no advantage exists.
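The probe-and-select step above can be sketched as follows. `solve` stands in for any callable that runs a rollout and checks correctness; the signatures and the choice of k are illustrative, not the paper's pipeline:

```python
def estimate_difficulty(solve, example, k=8):
    """Empirical failure rate over k attempts (higher = harder).
    With a stochastic sampler, k attempts approximate 1 - pass rate."""
    fails = sum(0 if solve(example) else 1 for _ in range(k))
    return fails / k

def hardest_subset(examples, solve, frac=0.10, k=8):
    """Rank examples by failure rate and keep the hardest fraction."""
    ranked = sorted(examples,
                    key=lambda ex: estimate_difficulty(solve, ex, k),
                    reverse=True)
    return ranked[: max(1, int(len(ranked) * frac))]

# Toy demo: examples are ints; the "model" solves anything below 90,
# so the hardest 10% of 0..99 is exactly the examples it always fails.
pool = list(range(100))
picked = hardest_subset(pool, lambda x: x < 90, frac=0.10)
print(sorted(picked))  # → [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
```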
The mechanism here isn’t specific to any particular task or model family. It falls directly out of how GRPO computes advantages. Any training setup where the reward variance within a group collapses will stop learning, and easy examples cause that collapse faster.
The full paper has results across more model configurations and both tasks, plus appendix details on hyperparameters. Code is linked in the paper if you want to reproduce the difficulty estimation and selection pipeline.