Reward Hacking in RLHF: What Can Go Wrong

Reinforcement Learning from Human Feedback (RLHF) has become the dominant paradigm for aligning language models with human preferences. But optimizing against a learned reward model introduces a subtle failure mode: reward hacking.

The Setup

In RLHF, we train a reward model $r_\phi(x, y)$ on human preference data, then optimize a policy $\pi_\theta$ to maximize the expected reward:

$$\mathcal{L}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)} \left[ r_\phi(x, y) - \beta \, \text{KL}\left(\pi_\theta \| \pi_{\text{ref}}\right) \right]$$

The KL penalty term keeps the policy close to the reference (supervised fine-tuned) model, acting as a regularizer. But in practice, choosing $\beta$ is tricky: too low and you get reward hacking, too high and you barely move from the base model. ...
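To make the objective concrete, here is a minimal sketch of the shaped per-token reward that many RLHF implementations optimize in practice, written in PyTorch. The reward model's scalar score lands on the final token, and every token pays a KL penalty estimated from the log-probability gap between the policy and the reference model. Function and variable names are illustrative, not from any particular library.

```python
import torch

def kl_penalized_reward(reward, logprobs_policy, logprobs_ref, beta=0.1):
    """Per-token shaped reward for RLHF (illustrative sketch).

    reward:          scalar score r_phi(x, y) from the reward model
    logprobs_policy: per-token log pi_theta(y_t | x, y_<t), shape (seq_len,)
    logprobs_ref:    per-token log pi_ref(y_t | x, y_<t),   shape (seq_len,)
    beta:            KL penalty coefficient
    """
    # Single-sample estimate of the KL term: log pi_theta - log pi_ref.
    kl = logprobs_policy - logprobs_ref

    # Every token is penalized for drifting from the reference model...
    shaped = -beta * kl

    # ...and the reward model's score is applied at the final token.
    shaped[-1] += reward
    return shaped

# Toy example: a 5-token response where the policy has drifted from pi_ref.
logp_pi = torch.tensor([-1.2, -0.8, -0.5, -0.9, -1.1])
logp_ref = torch.tensor([-1.5, -1.0, -1.0, -1.2, -1.3])
print(kl_penalized_reward(0.7, logp_pi, logp_ref, beta=0.05))
```

The tradeoff in the text lives entirely in `beta`: a larger value tightens the leash to $\pi_{\text{ref}}$ (less room to hack the reward model, but also less movement from the base model), while a smaller value lets the policy chase $r_\phi$ wherever it leads.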

August 3, 2025 · 2 min · Burton Ye