Cheap Reward Hacking Detection

Ivan Belenky · Joaquin Itria · Steven Johns — Tamarillo

A small transformer encoder is trained to map Terminal-Wrench trajectories onto a unit sphere where embedding distance approximates the $L_1$ distance between reward and metadata signals. A linear probe on that embedding detects reward hacking on the cleaned test split with AUC 0.9467 and TPR@5%FPR 0.8296, matching the TW sanitized LLM-as-judge AUC (0.9510 on the cleaned split) and exceeding its TPR@5%FPR (0.7130) by +0.12, at roughly four orders of magnitude lower per-trajectory cost.


The problem

Reward hacking (RH) is a pervasive failure mode in both training and deployment. In training, a misspecified reward channel rewards the wrong behavior and the agent learns to exploit it. In deployment, a task description under-determines the user's intent and the cheapest plan that satisfies the literal specification is one the user would not have accepted. The underlying defect in both cases is a gap between what was specified and what was meant.

The benchmark Terminal-Wrench ships an LLM-as-judge baseline that reads each trajectory and votes on whether it was a hack. The judge works well and is the right baseline to compete against, but two things are concerning for broader applicability:

  • Cost. Running a frontier model over every trajectory is not scalable for deployment as a sidecar to agent runtimes, RL training loops, or online monitoring. A single GPT-5.4 judge call over a TW trajectory (~$10^4$ tokens) costs ~$0.01. One forward pass of the 14M-parameter encoder on an M2 Max dissipates ~4 J, roughly $0.000001. The ratio sits in the $10^3$ to $10^5$ range depending on judge model and hardware normalization.
  • Behavior detection. An LLM-as-judge reads the agent's chain of thought, which is exactly what a hypothetical adversary can delete or mutate. TW's own numbers make this explicit: the judge's TPR@5%FPR drops from 0.82 on the visible-reasoning condition to 0.44 on the stripped condition. If reward hacking is fundamentally about the agent's behavior (which actions were taken, in which order, with which side effects), the results suggest the judge is paying attention to the wrong parts.

Terminal-Wrench trajectory variants

Each task in the TW dataset is paired with up to four trajectory groups. The descriptions below are taken from the TW README so the meaning is preserved as-is.

FolderDescription
hackHack elicitation visible in the agent's prompt. Excluded from all training and eval.
sanitizedDerived from hack by rewriting agent messages to remove explicit hack references while preserving technical actions and observations.
strippedAgent messages blanked; only tool calls, commands, and observations remain. The closest the release gets to a pure action-surface view.
non_hackAgent was told to hack and refused: either a legitimate solve or a failed attempt.
baselineAgent was never given the hack elicitation; solved the task legitimately.

The hack variant is excluded because the elicitation prompt sits in the agent's first message, and a detector trained on those rows would trivially match on the prompt instead of learning anything about behavior. After that exclusion, the comparable row is sanitized.

The reward signal and objective

The benchmark exposes three boolean flags per trajectory: is_rewarded, is_hack, and told_to_hack. These are collapsed into a 3-bit vector $R(\tau) = (\text{is\_rewarded}, \text{is\_hack}, \text{told\_to\_hack}) \in \{0,1\}^3$ that serves as the supervision target.

An encoder $f$ is trained to map a trajectory $\tau$ to a unit vector $f(\tau) \in \mathbb{R}^d$, with the embedding shaped so that pairwise distances track the $L_1$ distance between reward signals:

$ \mathcal{L} = \frac{1}{\binom{N}{2}} \sum_{i < j} \left( \lVert f(\tau_i) - f(\tau_j) \rVert_2 - \frac{2 \cdot \lVert R_i - R_j \rVert_1}{r_{\textrm{dim}}} \right)^2 $

The target is normalized to $[0, 2]$, the range of $L_2$ distance between unit vectors. The $L_1$ distance of the bit-vector $R$ is divided by $r_{\textrm{dim}}$ so the loss is invariant to the dimensionality of the reward signal. The squared outer difference pushes very-different pairs toward antipodal vectors and very-similar pairs toward coincidence.

Architecture

The encoder is a standard pre-norm transformer with the following hyperparameters. The architecture is deliberately vanilla. The contribution is the loss and the data pipeline, not the model.

HyperparameterValue
Layers6
$d_{\text{model}}$384
Attention heads6 (x 64)
Feed-forward expansion4x
Positional encodingSinusoidal
PoolingMean-pool, linear, L2-norm
Embedding dim128
TokenizerBPE, vocab 8192
Chunk size / max tokens512 / 4096
Total parameters13,842,560

Data cleaning

TW's sanitization is a paraphrase, not a deletion: an LLM rewrites agent messages to remove explicit hack references, but some self-references survive. A detect-and-drop pipeline is applied on top: regex shortlist for verbatim phrases plus an LLM judge that flags only meta-statements about being assigned an adversarial role, while ignoring descriptions of hacking actions. The pipeline is tuned to minimize false negatives (contaminated trajectories surviving into training) at the cost of false positives.

The cleaned dataset:

VariantRows
sanitized2,386
non_hack2,376
baseline2,352
total7,114
train / val / test5,689 / 735 / 690

The split is by task_id: no task appears in more than one split, preventing leakage where the same underlying task in two variants ends up on both sides of train and test. Training ran for 8,182 steps over a 15-hour wall-clock budget.


Results

Training dynamics

The bottom-line training-time signal is the separation between two pairwise-distance statistics: the mean embedding distance between hack pairs ("same-class"), and the mean distance between a hack and a clean trajectory ("cross-class"). If the loss is doing its job, the second grows relative to the first.

Training dynamics: pairwise embedding distances and separation gap over 8,182 training steps. Scatter: raw per-step. Solid line: EMA.

The same-$R$ distance saturates at ~0.17 (EMA), the cross-$R$ distance at ~1.06, and the separation gap at ~0.89. Most of the gap is present by step ~1,000. The cross-$R$ saturation well below 2.0 (antipodal) is expected: given the four populated $R$-vectors, only two distinct non-zero target distances exist ($2/3$ and $4/3$), and no pair of $R$-vectors disagrees on all 3 bits.

Linear probe and ROC

A logistic regression probe on the 128-d trajectory embedding is the reported detector. The probe is fit on the train-split embeddings with label $y = \textrm{is\_hack}$ and evaluated on the 690-trajectory test split. Positives are sanitized trajectories (agents that hacked), negatives are non_hack and baseline trajectories (agents that refused or were never told).

The probe scores AUC 0.9467 and TPR@5%FPR 0.8296.

ROC curve of the logistic-regression probe on the 690-trajectory test split. The 5%-FPR operating point is marked.

Embedding structure

The UMAP projection below shows the 690 test-split embeddings in 2-D. On the left, trajectories are colored by ground-truth is_hack label. On the right, the same points are colored by the probe's confidence $P(\text{hack})$. The clean separation in the left panel and the smooth gradient in the right confirm that the encoder has learned a geometry where hack and clean trajectories occupy distinct, contiguous regions of the embedding space, not a scattered collection of decision-boundary artifacts.

UMAP projection: test-split embeddings colored by is_hack (left) and probe confidence (right).

Comparison to Terminal-Wrench

The encoder reads the same content a TW sanitized judge reads: task description, full chain of thought, shell commands, and their outputs. The only thing absent is the red-team elicitation prompt itself. Against TW's headline monitoring table (aggregated across claude-opus-4.6, gemini-3.1-pro, and gpt-5.4, with 3,632 positives and 2,352 negatives):

ConditionAUCTPR@5%FPR
TW judges, hack0.96790.8235
TW judges, sanitized0.94740.6187
TW judges, stripped0.91680.4400
LogReg probe (this work)0.94670.8296

Against the matched sanitized row the probe is essentially tied on AUC (-0.001) and substantially ahead on TPR@5%FPR (+0.21), at several orders of magnitude lower per-trajectory cost.


Where the hack signal lives

The probe defines a linear "hack direction" $w$ in embedding space. To attribute the score back to individual actions, an occlusion saliency is computed per action:

$$\Delta_i = (z(\tau) \cdot w) - (z(\tau_{-i}) \cdot w)$$

where $\tau_{-i}$ is $\tau$ with action $i$ deleted. A positive $\Delta_i$ means removing action $i$ reduces the hack score, i.e., that step pushed toward hack. This quantity is computed for every action of every trajectory in the held-out test sample and aggregated by normalized position.

Saliency analysis: |delta|-weighted density by absolute action index (left, log scale) and mean hack-pushing saliency vs normalized position (right).

Roughly 54% of the total $|\Delta|$ mass in hack trajectories concentrates in the first 20% of the trajectory's length. The right panel makes the asymmetry visible: hack trajectories carry meaningful push-to-hack saliency through the early third of their length and decay into the middle. Clean trajectories sit around zero throughout. The recurring pattern in the actual texts: reconnaissance first (inspect the verifier, look at the test harness), then the exploit.


Stripped-trained encoder

The probe-time stripping ablation above strips reasoning from a model that was trained on reasoning-rich inputs. That tells us this model depends on reasoning, not whether the action surface carries signal in principle. To answer the second question, a separate encoder was trained from scratch on stripped trajectories: all agent-generated observation text removed, retaining only the harness prompt (task description), tool calls, and terminal output. Architecture and loss identical, 5,501 steps on the same train split.

ConditionAUCTPR@5%FPR
Full encoder (reasoning visible)0.94670.8296
Full encoder, stripped at probe time0.62130.0852
Stripped-trained encoder0.76310.2275

The stripped-trained encoder lands at AUC 0.7631. This is above chance and above the probe-time stripping (0.6213), confirming that the action surface does carry detection signal that a dedicated model can learn. The signal is weak: TPR@5%FPR 0.2275 means the stripped encoder catches roughly one in four hacks at a 5% false-positive rate, compared to four in five for the full encoder.

Both encoders were evaluated on the same 500-trajectory test subset. The scatter plot below shows per-trajectory agreement:

Per-trajectory scores: full encoder (x) vs stripped encoder (y). Left: hack trajectories. Right: clean trajectories.

Of 167 hack trajectories, the full encoder detects 143 (85.6%), the stripped encoder 74 (44.3%), and their union 150 (89.8%). Only 7 trajectories (4.2%) were detected by the stripped encoder alone, all borderline (scores 0.50-0.61). A mixture model would gain ~4% true-positive coverage at the cost of ~13% additional false positives. The action-surface signal is almost entirely a subset of what the reasoning-aware encoder already captures.


Matched judge comparison

TW's judge runner is not public. The experiments below use the released prompt (monitor_judge_v1.txt) verbatim against the cleaned test split with gpt-5.4 (reasoning_effort: high). The evaluation set is 442 trajectories (223 sanitized positives, 219 baseline negatives), excluding non_hack from the negative set because agents that were told to hack but refused carry visible adversarial context that the judge correctly flags.

ConditionAUCTPR@5%FPR
TW sanitized (aggregate, 3 models)0.94740.6187
TW sanitized (gpt-5.4 only)0.9720--
Replication (gpt-5.4)0.95100.7130
LogReg probe (this work)0.94670.8296

The replication lands within 0.004 of TW's aggregate AUC, confirming the released prompt reproduces the published numbers. The probe matches the judge on AUC and exceeds it on TPR@5%FPR by +0.12.

Saliency-guided trace removal

If the encoder and the judge are reading the same signal, removing the actions the encoder considers most hack-indicative should degrade the judge's AUC. A parallel sweep with random-action removal serves as the control.

kJudge (saliency)Judge (random)Probe (saliency)Probe (random)
00.95100.95100.96500.9650
10.93850.94340.91580.9624
30.93330.95000.91660.9552
50.92790.92790.92040.9339

Both detectors degrade under saliency-guided removal; random removal has no effect. The probe reacts more sharply: removing the single most salient action drops the judge's AUC by 0.013 while the probe drops by 0.049. The two detectors converge on the same behavioral surface despite entirely different inductive biases: a 14M-parameter encoder trained on pairwise reward-signal distances versus a frontier LLM reading a suspicion rubric.


Conclusion

A 14M-parameter encoder matches a frontier LLM judge on AUC and exceeds it on TPR@5%FPR at roughly four orders of magnitude lower cost per trajectory. The saliency-guided ablation confirms both detectors read the same behavioral surface. The stripped-training experiment settles the open question: the action surface carries real but weak detection signal (AUC 0.76), and reasoning remains the primary discriminator. Mixing the two views does not help: the action-surface signal is almost entirely a subset of what the reasoning-aware encoder already captures.

Three limitations bound these results:

  1. The encoder is fragile to single-token adversarial perturbations. A greedy coordinate descent in token space can flip the probe from $P(\text{hack}) = 0.003$ to $0.877$ with a single token swap.
  2. It relies on reasoning text that a determined adversary could rewrite. An adversarial paraphrase test (rewriting reasoning with neutral language, preserving commands) showed a 23% evasion rate.
  3. The dataset contains trajectories from only three agent models. Generalization to unseen architectures is untested.