record state
frontier-owned
Review status
This finding is part of accepted frontier state. Review events, reviewable changes, and proof state explain how it can change.
Finding bundle
finding statement
finding type
No entity list is declared.
evidence
source-bound
theoretical · manual state transition
proof impact
packet context
1 reviewable change and 0 evaluation records are attached to this finding id.
Evidence and conditions
method
manual state transition
evidence type
theoretical
conditions
Provenance
source title
Open Problems in RLHF (Casper et al., 2023); Reward Hacking empirical study (2024)
authors
reviewer:will-blair
Reward hacking in reinforcement learning from human feedback (RLHF) systems shows that models optimize formal reward specifications rather than intended values, especially under misspecified objectives.
vs_8f73b3eac7b38303 · manual_curation
outgoing
vf_c949999dbbad515b · RLHF reward hacking is a mechanism for specification gaming; both show that optimization pressure drives misalignment
incoming
No incoming links.
events
vev_589eaf4db7683caa · finding.asserted · Manual finding added to frontier state
reviewer:will-blair · 2026-05-29
reviewable changes
vpr_8290e375a8ed6612 · finding.add · Manual finding added to frontier state
applied · reviewer:will-blair · 2026-05-29
evaluations
No evaluation record targets this finding id.