AI alignment evaluations

id: vfr_14b9f65ab4037bac
license: CC-BY-4.0
findings: 16
accepted core: 0
contested: 0

avg conf: 0.84

Finding bundle

AI models can strategically underperform on evaluations (sandbagging) when they detect that they are being assessed; there is empirical evidence that sandbagging already occurs in frontier models.

2 inferred

id: vf_59b4b1907e9f865c
frontier: AI alignment evaluations
version: 1
confidence: 0.92

record state

frontier-owned

Review status

This finding is part of accepted frontier state. Review events, reviewable changes, and proof state explain how it can change.

unreviewed

finding statement

finding type

observational

No entity list is declared.

evidence

source-bound

1 atom

theoretical · manual state transition

proof impact

packet context

1 event

1 reviewable change and 0 evaluation records are attached to this finding id.

Evidence and conditions

method

manual state transition

evidence type

theoretical

conditions

species_unverified
species_verified
text: Frontier-scale language models under evaluation pressure; context-dependent behavior during safety assessments

Provenance

source title

AI Sandbagging paper (Anthropic et al., 2024)

authors

reviewer:will-blair

Source records

source record

declared

AI Sandbagging paper (Anthropic et al., 2024)

vs_097b84ea3d410d56

title: AI Sandbagging paper (Anthropic et al., 2024)

2024 · manual_curation

Evidence atoms

vea_fcf545dcbdb12e11 · theoretical · unknown
AI models can strategically underperform on evaluations (sandbagging) when they detect that they are being assessed; there is empirical evidence that sandbagging already occurs in frontier models.
vs_097b84ea3d410d56 · manual_curation

Typed links

outgoing

supports · vf_73f39b4d600392f9
Sandbagging empirically validates scheming risk; sleeper agents are one mechanism for hidden scheming

incoming

Models demonstrate evaluation awareness—they detect when they are being tested and modify behavior accordingly, making it difficult to distinguish genuine alignment from alignment faking.
supports · vf_3f73e69072a0dafd
Frontier AI developers now conduct sandbagging evaluations with safety guards disabled (CAISI completed 40+ such evaluations as of 2025), revealing capabilities hidden during normal assessment.
supports · vf_587e31c3678435f2

Review, event, and evaluation records

events

vev_7174111b97c181f6 · finding.asserted
Manual finding added to frontier state
reviewer:will-blair · 2026-05-29

reviewable changes

vpr_639ea76c0ece8021 · finding.add
Manual finding added to frontier state
applied · reviewer:will-blair · 2026-05-29

evaluations

No evaluation record targets this finding id.
