frontiers / frontier

AI alignment evaluations

id: vfr_14b9f65ab4037bac
license: CC-BY-4.0
findings: 16
accepted core: 0
contested: 0

avg conf: 0.84

Source record

Open Problems in RLHF (Casper et al., 2023); Reward Hacking empirical study (2024)

id: vs_8f73b3eac7b38303
frontier: AI alignment evaluations
year: 2023
type: paper

source boundary

frontier-owned

declared

A source record is provenance. It supports a finding only through evidence atoms, extraction spans, and reviewed finding bundles.

finding bindings

record context

1 finding

Findings bound to this source through source ids, evidence atoms, provenance, or reviewed source-record slots.

evidence atoms

materialized

1 atom

Evidence atoms pin exact spans, measurements, selectors, or curation assertions to the source.

review context

inspectable

1 event

1 reviewable change and 0 evaluations are attached through this source or its findings.

Locator and citation

external source

locator: title:Open Problems in RLHF (Casper et al., 2023); Reward Hacking empirical study (2024)
imported: 2026-05-29T02:53:36.923652+00:00
extraction mode: manual_curation
authors: reviewer:will-blair

Caveats

No source-specific caveats are recorded.

Bound findings

Reward hacking in reinforcement learning from human feedback (RLHF) systems shows that models optimize formal reward specifications rather than intended values, especially under misspecified objectives.
observational · vf_cfa59d594a8d0a81
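
As a rough illustration of the finding above, the following minimal sketch shows how selecting for a misspecified proxy reward can diverge from the intended value. Everything in it is hypothetical and is not taken from the cited sources: the `Candidate` type, the `proxy_reward` heuristic (which rewards length and the word "definitely"), and the `true_quality` scores are assumptions chosen only to make the gap visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    true_quality: float  # hypothetical "intended value"; hidden from the optimizer


def proxy_reward(c: Candidate) -> float:
    # Hypothetical misspecified reward: favors long, confident-sounding answers.
    return len(c.text.split()) + 2.0 * c.text.count("definitely")


# Toy candidate responses with made-up quality scores (illustrative only).
CANDIDATES = [
    Candidate("Paris.", 1.0),
    Candidate(
        "It is definitely definitely the largest and most famous city, definitely.",
        0.2,
    ),
    Candidate("I am not sure.", 0.5),
]


def pick(cands, score):
    # Return the candidate that maximizes the given scoring function.
    return max(cands, key=score)


# Optimizing the formal reward specification selects the padded answer...
hacked = pick(CANDIDATES, proxy_reward)
# ...while the intended value would have selected the short, correct one.
intended = pick(CANDIDATES, lambda c: c.true_quality)

print("proxy-optimal:", hacked.text)
print("intended-optimal:", intended.text)
```

In this toy setup the proxy-optimal and intended-optimal choices differ, which is the pattern the finding describes; real RLHF reward models are learned rather than hand-written, so this sketch only conveys the shape of the failure.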

Evidence atoms

vea_2ce720351ad43905 · theoretical · unknown
Reward hacking in reinforcement learning from human feedback (RLHF) systems shows that models optimize formal reward specifications rather than intended values, especially under misspecified objectives.

Review, event, and evaluation records

events

vev_589eaf4db7683caa · finding.asserted
Manual finding added to frontier state
reviewer:will-blair · 2026-05-29

reviewable changes

vpr_8290e375a8ed6612 · finding.add
Manual finding added to frontier state
applied · reviewer:will-blair · 2026-05-29

evaluations

No evaluation rows are attached.
