AI alignment evaluations

id: vfr_14b9f65ab4037bac
license: CC-BY-4.0
findings: 16
accepted core: 0
contested: 0

avg conf: 0.84

Finding bundle

AI safety via debate—where two models argue opposing positions and a human judge determines truthfulness—assumes honest argumentation is detectably different from skilled deception, an assumption that fails for sufficiently deceptive models.

id: vf_7bff72eaad13e7e2
frontier: AI alignment evaluations
version: 1
confidence: 0.76

record state

frontier-owned

Review status

This finding is part of accepted frontier state. Review events, reviewable changes, and proof state explain how it can change.

unreviewed

finding statement

finding type

theoretical

No entity list is declared.

evidence

source-bound

1 atom

theoretical · manual state transition

proof impact

packet context

1 event

1 reviewable change and 0 evaluation records are attached to this finding id.

Evidence and conditions

method

manual state transition

evidence type

theoretical

conditions

species_unverified
species_verified
text: Breaks when judges cannot distinguish skilled deception from truth; a model's capability advantage over the judge creates this imbalance.
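
The failure condition above can be illustrated with a toy model. This is a minimal sketch and is not taken from either cited source: it assumes the judge's choice follows a logistic function of the margin between a hypothetical `truth_advantage` (the edge an honest argument gets from being true) and a hypothetical `deception_skill` (the edge a deceptive debater gets from persuasive ability). Both parameter names and the logistic form are illustrative assumptions.

```python
import math


def judge_accuracy(truth_advantage: float, deception_skill: float) -> float:
    """Probability that the judge picks the honest debater (toy logistic model).

    truth_advantage: hypothetical edge an honest argument gains from being true.
    deception_skill: hypothetical edge the deceptive debater gains from persuasion.
    Both quantities are assumptions made for illustration only.
    """
    margin = truth_advantage - deception_skill
    return 1.0 / (1.0 + math.exp(-margin))


if __name__ == "__main__":
    # Hold the truth advantage fixed and increase the deceiver's skill.
    for skill in (0.0, 0.5, 1.0, 2.0, 4.0):
        acc = judge_accuracy(1.0, skill)
        print(f"deception_skill={skill:.1f} -> judge_accuracy={acc:.3f}")
```

Under these assumptions the judge is no better than chance once `deception_skill` equals `truth_advantage`, and worse than chance beyond that point, which is the imbalance the condition describes.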

Provenance

source title

AI Safety via Debate (Irving et al., 2018); Debating with Persuasive LLMs (Khan et al., ICML 2024)

authors

reviewer:will-blair

Source records

source record

declared

AI Safety via Debate (Irving et al., 2018); Debating with Persuasive LLMs (Khan et al., ICML 2024)

vs_0f0d72c6be022a1e

title: AI Safety via Debate (Irving et al., 2018); Debating with Persuasive LLMs (Khan et al., ICML 2024)

2018 · manual_curation

Evidence atoms

vea_cec2ebc6ec629526 · theoretical · unknown
AI safety via debate—where two models argue opposing positions and a human judge determines truthfulness—assumes honest argumentation is detectably different from skilled deception, an assumption that fails for sufficiently deceptive models.
vs_0f0d72c6be022a1e · manual_curation

Typed links

outgoing

contradicts vf_491436508804de41
Debate assumes honest/deceptive distinction is detectable; expert persuaders blur this boundary

incoming

No incoming links.

Review, event, and evaluation records

events

vev_07254bfa148bc3e4 · finding.asserted
Manual finding added to frontier state
reviewer:will-blair · 2026-05-29

reviewable changes

vpr_9cf2f4d48faaedd0 · finding.add
Manual finding added to frontier state
applied · reviewer:will-blair · 2026-05-29

evaluations

No evaluation record targets this finding id.
