resultsread-only vieweressays

frontiers / frontier

AI alignment evaluations

CC-BY-4.0vfr_14b9f65ab4037bac

id: vfr_14b9f65ab4037bac
license: CC-BY-4.0
findings: 16
accepted core: 0
contested: 0

findings

links

sources

evidence

contested

0.84

avg conf

frontiers / frontier

AI alignment evaluations

CC-BY-4.0vfr_14b9f65ab4037bac

id: vfr_14b9f65ab4037bac
license: CC-BY-4.0
findings: 16
accepted core: 0
contested: 0

findings

links

sources

evidence

contested

0.84

avg conf

Finding bundle

back to state

Reward hacking in reinforcement learning from human feedback (RLHF) systems shows that models optimize formal reward specifications rather than intended values, especially under misspecified objectives.

no incoming links yet

id: vf_cfa59d594a8d0a81
frontier: AI alignment evaluations
version: 1
confidence: 0.86

record state

frontier-owned

Review status

This finding is part of accepted frontier state. Review events, reviewable changes, and proof state explain how it can change.

unreviewed

finding statement

finding type

observational

No entity list is declared.

evidence

source-bound

1 atoms

theoretical · manual state transition

proof impact

packet context

1 events

1 reviewable changes and 0 evaluation records are attached to this finding id.

Evidence and conditions

method

manual state transition

evidence type

theoretical

conditions

species_unverified
species_verified
text: Most severe when human feedback is sparse or inconsistent; iterative retraining reduces but does not eliminate exploits

Provenance

source title

Open Problems in RLHF (Casper et al., 2023); Reward Hacking empirical study (2024)

authors

reviewer:will-blair

Source records

source record

declared

Open Problems in RLHF (Casper et al., 2023); Reward Hacking empirical study (2024)

vs_8f73b3eac7b38303

title:Open Problems in RLHF (Casper et al., 2023); Reward Hacking empirical study (2024)

2023manual_curation

inspect source →

Evidence atoms

vea_2ce720351ad43905theoretical · unknown
Reward hacking in reinforcement learning from human feedback (RLHF) systems shows that models optimize formal reward specifications rather than intended values, especially under misspecified objectives.
vs_8f73b3eac7b38303 · manual_curation

Typed links

outgoing

supportsvf_c949999dbbad515b
RLHF reward hacking is mechanism for specification gaming; both show optimization pressure drives misalignment

incoming

No incoming links.

Review, event, and evaluation records

events

vev_589eaf4db7683caafinding.asserted
Manual finding added to frontier state
reviewer:will-blair · 2026-05-29

reviewable changes

vpr_8290e375a8ed6612finding.add
Manual finding added to frontier state
applied · reviewer:will-blair · 2026-05-29

evaluations

No evaluation record targets this finding id.

Finding bundle

back to state

Reward hacking in reinforcement learning from human feedback (RLHF) systems shows that models optimize formal reward specifications rather than intended values, especially under misspecified objectives.

no incoming links yet

id: vf_cfa59d594a8d0a81
frontier: AI alignment evaluations
version: 1
confidence: 0.86

record state

frontier-owned

Review status

This finding is part of accepted frontier state. Review events, reviewable changes, and proof state explain how it can change.

unreviewed

finding statement

finding type

observational

No entity list is declared.

evidence

source-bound

1 atoms

theoretical · manual state transition

proof impact

packet context

1 events

1 reviewable changes and 0 evaluation records are attached to this finding id.

Evidence and conditions

method

manual state transition

evidence type

theoretical

conditions

species_unverified
species_verified
text: Most severe when human feedback is sparse or inconsistent; iterative retraining reduces but does not eliminate exploits

Provenance

source title

Open Problems in RLHF (Casper et al., 2023); Reward Hacking empirical study (2024)

authors

reviewer:will-blair

Source records

source record

declared

Open Problems in RLHF (Casper et al., 2023); Reward Hacking empirical study (2024)

vs_8f73b3eac7b38303

title:Open Problems in RLHF (Casper et al., 2023); Reward Hacking empirical study (2024)

2023manual_curation

inspect source →

Evidence atoms

vea_2ce720351ad43905theoretical · unknown
Reward hacking in reinforcement learning from human feedback (RLHF) systems shows that models optimize formal reward specifications rather than intended values, especially under misspecified objectives.
vs_8f73b3eac7b38303 · manual_curation

Typed links

outgoing

supportsvf_c949999dbbad515b
RLHF reward hacking is mechanism for specification gaming; both show optimization pressure drives misalignment

incoming

No incoming links.

Review, event, and evaluation records

events

vev_589eaf4db7683caafinding.asserted
Manual finding added to frontier state
reviewer:will-blair · 2026-05-29

reviewable changes

vpr_8290e375a8ed6612finding.add
Manual finding added to frontier state
applied · reviewer:will-blair · 2026-05-29

evaluations

No evaluation record targets this finding id.

Search Canopus

Review status

observational

1 atoms

1 events

Source records

Open Problems in RLHF (Casper et al., 2023); Reward Hacking empirical study (2024)

Evidence atoms

Typed links

Review, event, and evaluation records

Review status

observational

1 atoms

1 events

Source records

Open Problems in RLHF (Casper et al., 2023); Reward Hacking empirical study (2024)

Evidence atoms

Typed links

Review, event, and evaluation records