vela-science/vela

AI alignment evaluations

id: vfr_14b9f65ab4037bac
license: CC-BY-4.0
findings: 16
accepted core: 0
contested: 0

links: 14
sources: 16
evidence: 16
avg conf: 0.84

Finding bundle

Scalable oversight approaches (iterated amplification, recursive reward modeling, debate) provide frameworks for human oversight of superhuman tasks, but they assume the honest strategy can simulate the AI system for exponentially many steps—an assumption that breaks for sufficiently advanced models.

2 inferred

id: vf_491436508804de41
frontier: AI alignment evaluations
version: 1
confidence: 0.81

record state

frontier-owned

Review status

This finding is part of accepted frontier state. Review events, reviewable changes, and proof state explain how it can change.

unreviewed

finding statement

finding type

theoretical

No entity list is declared.

evidence

source-bound

1 atom

theoretical · manual state transition

proof impact

packet context

1 event

1 reviewable change and 0 evaluation records are attached to this finding id.

Evidence and conditions

method

manual state transition

evidence type

theoretical

conditions

species_unverified
species_verified
text: Breaks for superhuman capability gaps; debate methodology assumes honest argumentation is distinguishable from skilled deception

Provenance

source title

Scalable Oversight review (2024); Doubly-Efficient Debate (Brown-Cohen et al., ICML 2024)

authors

reviewer:will-blair

Source records

1

source record

Scalable Oversight review (2024); Doubly-Efficient Debate (Brown-Cohen et al., ICML 2024)

vs_6930b4944d805a78

title: Scalable Oversight review (2024); Doubly-Efficient Debate (Brown-Cohen et al., ICML 2024)

2024 · manual_curation

Evidence atoms

1

vea_2914f552cb7ce6ec · theoretical · unknown
Scalable oversight approaches (iterated amplification, recursive reward modeling, debate) provide frameworks for human oversight of superhuman tasks, but they assume the honest strategy can simulate the AI system for exponentially many steps—an assumption that breaks for sufficiently advanced models.
vs_6930b4944d805a78 · manual_curation

Typed links

2

outgoing

No outgoing links.

incoming

Red-teaming protocols using multi-round automatic adversarial prompting can expose jailbreaks in 86% of undefended models, but attack success rates improve when adversaries analyze failed attempts iteratively.
supports · vf_4927eb9384edd7ce
AI safety via debate—where two models argue opposing positions and a human judge determines truthfulness—assumes honest argumentation is detectably different from skilled deception, an assumption that fails for sufficiently deceptive models.
contradicts · vf_7bff72eaad13e7e2

Review, event, and evaluation records

2

events

vev_76ad0bcfebef4559 · finding.asserted
Manual finding added to frontier state
reviewer:will-blair · 2026-05-29

reviewable changes

vpr_316479af44e3c3af · finding.add
Manual finding added to frontier state
applied · reviewer:will-blair · 2026-05-29

evaluations

No evaluation record targets this finding id.