Open-transformer circuit evidence

id: vfr_c94f4f79b4e341ad
license: CC-BY-4.0
findings: 42
accepted core: 14
contested: 0

avg conf: 0.73

Finding bundle

Circuit-Aware Reward Training methodology identifies specialized neural circuits in RLHF reward models responsible for longtail distribution failures and reward hacking, predicting that mechanistic oversight via circuit ablation reduces spurious reward alignment by >40% on adversarial examples.

id: vf_9e8edcb419fd0229
frontier: Open-transformer circuit evidence
version: 1
confidence: 0.65

record state

frontier-owned

Review status

This finding is part of accepted frontier state. Review events, reviewable changes, and proof state explain how it can change.

unreviewed

finding statement

finding type

theoretical

No entity list is declared.

evidence

source-bound

1 atom

theoretical · manual state transition

proof impact

packet context

1 event

1 reviewable change and 0 evaluation records are attached to this finding id.

Evidence and conditions

method

manual state transition

evidence type

theoretical

conditions

species_unverified
species_verified
text: Preliminary results on single reward model architecture (Llama-7B); generalization to state-of-the-art RLHF systems and ensemble effects untested; requires causality validation via targeted intervention.
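
The condition above calls for validation via targeted intervention. As a rough illustration only, the sketch below shows one way the ">40%" prediction could be checked: zero out a candidate circuit and compare mean reward on adversarial inputs before and after. Everything here is an assumption rather than the source's method: the toy linear reward model, the names `toy_reward` and `relative_reduction`, the choice of unit 2 as the "circuit", and the reading of "spurious reward alignment" as mean reward assigned to adversarial examples.

```python
from typing import FrozenSet, Sequence


def toy_reward(hidden: Sequence[float], weights: Sequence[float],
               ablate: FrozenSet[int] = frozenset()) -> float:
    """Hypothetical linear reward head: weighted sum of hidden units.

    Units whose index is in `ablate` are zero-ablated (dropped from the sum).
    """
    if len(hidden) != len(weights):
        raise ValueError("hidden and weights must have the same length")
    return sum(w * h for i, (w, h) in enumerate(zip(weights, hidden))
               if i not in ablate)


def relative_reduction(baseline: Sequence[float],
                       ablated: Sequence[float]) -> float:
    """Fractional drop in mean reward after ablation (0.4 means 40%)."""
    if not baseline or len(baseline) != len(ablated):
        raise ValueError("need two non-empty sequences of equal length")
    base_mean = sum(baseline) / len(baseline)
    abl_mean = sum(ablated) / len(ablated)
    if base_mean == 0:
        raise ValueError("baseline mean reward is zero")
    return (base_mean - abl_mean) / base_mean


# Made-up numbers: two adversarial inputs, three hidden units.
WEIGHTS = [0.5, 0.5, 1.0]
ADVERSARIAL = [[1.0, 1.0, 2.0], [2.0, 0.0, 1.0]]
CIRCUIT = frozenset({2})  # assumed "spurious" circuit, chosen for illustration

baseline = [toy_reward(h, WEIGHTS) for h in ADVERSARIAL]
ablated = [toy_reward(h, WEIGHTS, CIRCUIT) for h in ADVERSARIAL]
reduction = relative_reduction(baseline, ablated)
print(f"relative reduction: {reduction:.0%}")  # prints: relative reduction: 60%
```

In a real reward model the ablation would be applied to internal activations rather than to a hand-written sum, and the comparison would also need a clean (non-adversarial) set to confirm that ablation does not simply lower all rewards.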

Provenance

source title

Circuit-Aware Reward Training: A Mechanistic Framework for Longtail Robustness in RLHF (2025)

authors

reviewer:will-blair

Source records

source record

declared

Circuit-Aware Reward Training: A Mechanistic Framework for Longtail Robustness in RLHF (2025)

vs_7e9999ec54d14123

title: Circuit-Aware Reward Training: A Mechanistic Framework for Longtail Robustness in RLHF (2025)

manual_curation

Evidence atoms

vea_b3c422643b5d7bd5 · theoretical · unknown
Circuit-Aware Reward Training methodology identifies specialized neural circuits in RLHF reward models responsible for longtail distribution failures and reward hacking, predicting that mechanistic oversight via circuit ablation reduces spurious reward alignment by >40% on adversarial examples.
vs_7e9999ec54d14123 · manual_curation

Typed links

outgoing

depends · vf_efe9ddeab6b12e54
The Circuit-Aware Reward Training claim (index 7) depends on scaling Contextual Decomposition to the attribution of sub-circuit interactions (index 10); causal validation cannot proceed without sub-circuit-level precision.

incoming

No incoming links.

Review, event, and evaluation records

events

vev_18565102b346f9f6 · finding.asserted
Manual finding added to frontier state
reviewer:will-blair · 2026-05-29

reviewable changes

vpr_432ec2d717f65e1b · finding.add
Manual finding added to frontier state
applied · reviewer:will-blair · 2026-05-29

evaluations

No evaluation record targets this finding id.
