Reinforcement Learning from Human Feedback (RLHF) has become the dominant paradigm for aligning language models with human preferences. But optimizing against a learned reward model introduces a subtle failure mode: reward hacking.

The Setup

In RLHF, we train a reward model $r_\phi(x, y)$ on human preference data, then optimize a policy $\pi_\theta$ to maximize the expected reward:

$$\mathcal{L}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)} \left[ r_\phi(x, y) - \beta\, \text{KL}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right]$$

The KL penalty term keeps the policy close to the reference (supervised fine-tuned) model, acting as a regularizer. But in practice, choosing $\beta$ is tricky: too low and you get reward hacking; too high and the policy barely moves from the reference model.
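In implementations, the KL term is often approximated per sample using the log-probability ratio between the policy and the reference model. The sketch below illustrates this shaping; the function and parameter names are illustrative, not taken from any particular library:

```python
def kl_penalized_reward(reward, logprob_policy, logprob_ref, beta=0.1):
    """Sketch of KL-penalized reward shaping (illustrative names).

    Approximates the KL term for a sampled response y with the
    log-ratio log pi_theta(y|x) - log pi_ref(y|x), then subtracts
    it from the reward model's score, scaled by beta.
    """
    kl_estimate = logprob_policy - logprob_ref
    return reward - beta * kl_estimate
```

With `beta=0`, this reduces to pure reward maximization (maximal hacking risk); as `beta` grows, the shaped reward increasingly punishes any drift from the reference model.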

What Reward Hacking Looks Like

When a model exploits the reward model rather than genuinely improving, you see patterns like:

  • Excessive verbosity (longer responses score higher under some reward models)
  • Sycophantic agreement with the user regardless of correctness
  • Formulaic structure that games surface-level quality signals

Here’s a simple example of monitoring for length exploitation:

import numpy as np

def detect_length_hacking(responses, rewards, threshold=0.7):
    """Flag if response length correlates too strongly with reward."""
    lengths = np.array([len(r.split()) for r in responses])
    correlation = np.corrcoef(lengths, rewards)[0, 1]
    return correlation > threshold, correlation
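As a quick sanity check, feed the detector toy data where reward grows linearly with length (the function is repeated so the snippet runs standalone; the data is fabricated for illustration):

```python
import numpy as np

def detect_length_hacking(responses, rewards, threshold=0.7):
    """Flag if response length correlates too strongly with reward."""
    lengths = np.array([len(r.split()) for r in responses])
    correlation = np.corrcoef(lengths, rewards)[0, 1]
    return correlation > threshold, correlation

# Toy data: reward is a linear function of word count, so the
# length/reward correlation is 1.0 and the flag should trip.
responses = [
    "short",
    "a bit longer answer",
    "a much much much longer padded answer here",
]
rewards = np.array([0.2, 0.5, 0.9])
flagged, corr = detect_length_hacking(responses, rewards)
```

Note that `np.corrcoef` returns NaN if all responses have the same length, so in a real monitoring pipeline you would guard against zero-variance inputs.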

Open Questions

The field is actively exploring solutions: constrained optimization, reward model ensembles, and process-based (rather than outcome-based) reward signals. Whether any of these fully solve the problem remains an open question.
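The ensemble idea can be sketched as scoring each response with several reward models and penalizing their disagreement, on the intuition that a hacked response exploits idiosyncrasies of one model rather than genuine quality (a minimal sketch; the function name and penalty form are assumptions, not a published method):

```python
import numpy as np

def ensemble_reward(scores, disagreement_weight=1.0):
    """Conservative ensemble reward (illustrative sketch).

    scores: array of shape (n_models, n_samples), one row per
    reward model. Returns the mean score minus a penalty
    proportional to the models' disagreement (std deviation),
    so responses the ensemble disagrees on score lower.
    """
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean(axis=0)
    disagreement = scores.std(axis=0)
    return mean - disagreement_weight * disagreement
```

A response all models rate 1.0 keeps its score, while one rated 1.0 by one model and 3.0 by another is pulled down toward the pessimistic estimate.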