source boundary
frontier-owned · declared
A source record is provenance. It supports a finding only through evidence atoms, extraction spans, and reviewed finding bundles.
Source record
finding bindings
record context
Findings bound to this source through source ids, evidence atoms, provenance, or reviewed source-record slots.
evidence atoms
materialized
Evidence atoms pin exact spans, measurements, selectors, or curation assertions to the source.
review context
inspectable
1 reviewable change and 0 evaluations are attached through this source or its findings.
Locator and citation
locator
title:Circuit-Aware Reward Training: A Mechanistic Framework for Longtail Robustness in RLHF (2025)
imported
2026-05-29T02:57:20.534941+00:00
extraction mode
manual_curation
authors
reviewer:will-blair
Caveats
No source-specific caveats are recorded.
theoretical · vf_9e8edcb419fd0229
Circuit-Aware Reward Training methodology identifies specialized neural circuits in RLHF reward models responsible for longtail distribution failures and reward hacking, predicting that mechanistic oversight via circuit ablation reduces spurious reward alignment by >40% on adversarial examples.
events
vev_18565102b346f9f6 · finding.asserted · Manual finding added to frontier state
reviewer:will-blair · 2026-05-29
reviewable changes
vpr_432ec2d717f65e1b · finding.add · Manual finding added to frontier state
applied · reviewer:will-blair · 2026-05-29
evaluations
No evaluation rows are attached.