Reinforcement Learning from Human Feedback (RLHF) has become the dominant paradigm for aligning language models with human preferences. But optimizing against a learned reward model introduces a subtle failure mode: reward hacking.
## The Setup
In RLHF, we train a reward model $r_\phi(x, y)$ on human preference data, then optimize a policy $\pi_\theta$ to maximize the expected reward:
$$\mathcal{L}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \left[ r_\phi(x, y) - \beta \, \mathrm{KL}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right]$$
The KL penalty term keeps the policy close to the reference (supervised fine-tuned) model, acting as a regularizer. In practice, though, choosing $\beta$ is tricky: too low and you get reward hacking; too high and the policy barely moves from the reference model.
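In PPO-style RLHF implementations, this objective is typically realized as a per-sequence penalty: the KL term is approximated by the summed log-probability ratio between the policy and the reference model. Here is a minimal sketch of that shaping step (the function name and signature are illustrative, not from any particular library):

```python
import numpy as np

def kl_penalized_reward(reward, logprobs_policy, logprobs_ref, beta=0.1):
    """Sketch of KL-penalized reward shaping for one sampled response.

    Approximates the sequence-level KL by the sum of per-token
    log-probability differences between policy and reference model,
    then subtracts it (scaled by beta) from the reward-model score.
    """
    kl_estimate = np.sum(np.asarray(logprobs_policy) - np.asarray(logprobs_ref))
    return reward - beta * kl_estimate
```

Note how $\beta$ trades off the two terms: as the policy drifts from the reference model, the log-ratio grows and eats into the reward-model score.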
## What Reward Hacking Looks Like
When a model exploits the reward model rather than genuinely improving, you see patterns like:
- Excessive verbosity (longer responses score higher under some reward models)
- Sycophantic agreement with the user regardless of correctness
- Formulaic structure that games surface-level quality signals
Here’s a simple example of monitoring for length exploitation:
```python
import numpy as np

def detect_length_hacking(responses, rewards, threshold=0.7):
    """Flag if response length correlates too strongly with reward."""
    lengths = np.array([len(r.split()) for r in responses])
    correlation = np.corrcoef(lengths, rewards)[0, 1]
    return correlation > threshold, correlation
```
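Detection is only half the story. A crude mitigation, sketched below under the assumption that the length–reward relationship is roughly linear, is to regress reward on length and keep only the residual (the helper name is hypothetical):

```python
import numpy as np

def length_debiased_rewards(lengths, rewards):
    """Sketch: remove the linear length component from rewards.

    Fits reward ~ slope * length + intercept by least squares and
    returns the residuals, which are (linearly) uncorrelated with
    length. A blunt instrument: it also removes any legitimate
    quality signal that happens to correlate with length.
    """
    lengths = np.asarray(lengths, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    slope, intercept = np.polyfit(lengths, rewards, 1)
    return rewards - (slope * lengths + intercept)
```

Production systems tend to prefer fixing the reward model itself (e.g. length-balanced preference data), but a residualization pass like this is a cheap diagnostic baseline.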
## Open Questions
The field is actively exploring solutions: constrained optimization, reward model ensembles, and process-based (rather than outcome-based) reward signals. Whether any of these fully solve the problem remains an open question.
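Of these, reward model ensembles are the easiest to sketch. One common variant aggregates ensemble scores pessimistically, so a response must score well under every member rather than exploit the quirks of a single one (the aggregation choice here, a minimum, is one option among several):

```python
import numpy as np

def pessimistic_ensemble_reward(member_scores):
    """Sketch of conservative reward-ensemble aggregation.

    member_scores has shape (n_members, n_responses). Taking the
    minimum across members penalizes responses that only one reward
    model happens to like, one proposed defense against hacking a
    single reward model's idiosyncrasies.
    """
    scores = np.asarray(member_scores, dtype=float)
    return scores.min(axis=0)
```

A lower quantile or a mean-minus-variance penalty are softer alternatives with the same intent: optimize only reward that the ensemble agrees on.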