record state
frontier-owned
Review status
This finding is part of accepted frontier state. Review events, reviewable changes, and proof state explain how it can change.
Finding bundle
finding statement
finding type
No entity list is declared.
evidence
source-bound
theoretical · manual state transition
proof impact
packet context
1 reviewable change and 0 evaluation records are attached to this finding id.
Evidence and conditions
method
manual state transition
evidence type
theoretical
conditions
Provenance
source title
Circuit-Aware Reward Training: A Mechanistic Framework for Longtail Robustness in RLHF (2025)
authors
reviewer:will-blair
Circuit-Aware Reward Training methodology identifies specialized neural circuits in RLHF reward models responsible for longtail distribution failures and reward hacking, predicting that mechanistic oversight via circuit ablation reduces spurious reward alignment by >40% on adversarial examples.
vs_7e9999ec54d14123 · manual_curation
outgoing
vf_efe9ddeab6b12e54 · Circuit-Aware Reward Training claim (index 7) depends on scaling Contextual Decomposition to attribution of sub-circuit interactions (index 10); causality validation cannot proceed without sub-circuit level precision.
incoming
No incoming links.
events
vev_18565102b346f9f6 · finding.asserted · Manual finding added to frontier state
reviewer:will-blair · 2026-05-29
reviewable changes
vpr_432ec2d717f65e1b · finding.add · Manual finding added to frontier state
applied · reviewer:will-blair · 2026-05-29
evaluations
No evaluation record targets this finding id.