Reward Hacking in RLHF: What Can Go Wrong
Reinforcement Learning from Human Feedback (RLHF) has become the dominant paradigm for aligning language models with human preferences. But optimizing against a learned reward model introduces a subtle failure mode: reward hacking.

The Setup

In RLHF, we train a reward model $r_\phi(x, y)$ on human preference data, then optimize a policy $\pi_\theta$ to maximize the expected reward:

$$\mathcal{L}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \left[ r_\phi(x, y) - \beta\, \text{KL}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right]$$

...
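As a minimal sketch of this objective, the KL-regularized reward for a single sampled response can be estimated from per-token log-probabilities under the policy and the reference model; the function name, the specific log-probability values, and the Monte Carlo KL estimate shown here are illustrative assumptions, not a particular library's API:

```python
def rlhf_objective(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """KL-regularized objective for one sampled response y ~ pi_theta(.|x).

    The KL term is approximated by the standard per-sample estimate
    sum_t [log pi_theta(y_t) - log pi_ref(y_t)] over the response tokens.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate

# Illustrative numbers: the policy assigns higher probability to its own
# sample than the reference does, so it pays a positive KL penalty.
policy_lp = [-0.5, -0.7, -0.2]   # log pi_theta(y_t | x, y_<t)
ref_lp    = [-0.9, -1.1, -0.6]   # log pi_ref(y_t | x, y_<t)

obj = rlhf_objective(reward=2.0,
                     policy_logprobs=policy_lp,
                     ref_logprobs=ref_lp,
                     beta=0.1)
# KL estimate is 1.2, so the objective is 2.0 - 0.1 * 1.2 = 1.88
```

The $\beta$ coefficient controls the trade-off: with $\beta = 0$ the policy is free to chase raw reward, which is exactly the regime where reward hacking bites hardest.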