evidence boundary
unknown · theoretical
An evidence atom is an inspectable support unit. It is not a finding by itself; it supports or challenges a finding through review.
Evidence atom
finding binding
bound
Circuit-Aware Reward Training methodology identifies specialized neural circuits in RLHF reward models responsible for long-tail distribution failures and reward hacking, predicting that mechanistic oversight via circuit ablation reduces spurious reward alignment by >40% on adversarial examples.
source binding
source-bound
vs_7e9999ec54d14123
review context
unverified
1 reviewable change and 0 evaluation records target this atom or its bound objects.
Evidence statement
Circuit-Aware Reward Training methodology identifies specialized neural circuits in RLHF reward models responsible for long-tail distribution failures and reward hacking, predicting that mechanistic oversight via circuit ablation reduces spurious reward alignment by >40% on adversarial examples.
extraction method
manual_curation
support relation
unknown
condition refs
vcnd_4dd0586038ea8cae
Caveats
events
vev_18565102b346f9f6 · finding.asserted · Manual finding added to frontier state
reviewer:will-blair · 2026-05-29
reviewable changes
vpr_432ec2d717f65e1b · finding.add · Manual finding added to frontier state
applied · reviewer:will-blair · 2026-05-29
evaluations
No evaluation rows are attached.
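A minimal sketch of how the record above could be modeled as a data structure. The class name, field names, and the rule for deriving the review status are assumptions that mirror the labels on this page; they are not the tool's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvidenceAtom:
    """Hypothetical model of an evidence atom; fields mirror the page labels."""
    statement: str
    finding_binding: str                 # e.g. "bound"
    source_ref: Optional[str]            # e.g. "vs_7e9999ec54d14123"
    extraction_method: str               # e.g. "manual_curation"
    support_relation: str = "unknown"
    condition_refs: List[str] = field(default_factory=list)
    evaluations: List[str] = field(default_factory=list)

    def review_status(self) -> str:
        # Assumption: an atom with no attached evaluation rows is "unverified".
        return "reviewed" if self.evaluations else "unverified"

    def is_finding(self) -> bool:
        # An atom is not a finding by itself; it only supports or
        # challenges a finding through review.
        return False


atom = EvidenceAtom(
    statement="(text of the Evidence statement field)",
    finding_binding="bound",
    source_ref="vs_7e9999ec54d14123",
    extraction_method="manual_curation",
    condition_refs=["vcnd_4dd0586038ea8cae"],
)
print(atom.review_status())
```

Keeping `evaluations` as an explicit list, rather than a stored status flag, makes the "unverified" label a derived property of the record, which matches the page showing no evaluation rows attached.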