record state
frontier-owned
Review status
This finding is part of accepted frontier state. Review events, reviewable changes, and proof state explain how it can change.
Finding bundle
finding statement
finding type
No entity list is declared.
evidence
source-bound
theoretical · manual state transition
proof impact
packet context
1 reviewable change and 0 evaluation records are attached to this finding id.
Evidence and conditions
method
manual state transition
evidence type
theoretical
conditions
Provenance
source title
AI Safety via Debate (Irving et al., 2018); Debating with Persuasive LLMs (Khan et al., ICML 2024)
authors
reviewer:will-blair
AI safety via debate, in which two models argue opposing positions and a human judge determines truthfulness, assumes that honest argumentation is detectably different from skilled deception. That assumption fails for sufficiently deceptive models.
vs_0f0d72c6be022a1e · manual_curation
outgoing
vf_491436508804de41 · Debate assumes honest/deceptive distinction is detectable; expert persuaders blur this boundary
incoming
No incoming links.
events
vev_07254bfa148bc3e4 · finding.asserted · Manual finding added to frontier state
reviewer:will-blair · 2026-05-29
reviewable changes
vpr_9cf2f4d48faaedd0 · finding.add · Manual finding added to frontier state
applied · reviewer:will-blair · 2026-05-29
evaluations
No evaluation record targets this finding id.