
AI alignment evaluations

id: vfr_14b9f65ab4037bac
license: CC-BY-4.0
findings: 16
accepted core: 0
contested: 0

avg conf: 0.84

Finding bundle

Behavioral safety evaluations (refusal-based testing on harmful content categories) show strong surface-level safety but do not assess deeper deception, sandbagging, or scheming capabilities.

2 inferred

id: vf_3ea1bb869e1c5f9b
frontier: AI alignment evaluations
version: 1
confidence: 0.87

record state

frontier-owned

Review status

This finding is part of accepted frontier state. Review events, reviewable changes, and proof state explain how it can change.

unreviewed

finding statement

finding type

diagnostic

No entity list is declared.

evidence

source-bound

1 atom

theoretical · manual state transition

proof impact

packet context

1 event

1 reviewable change and 0 evaluation records are attached to this finding id.

Evidence and conditions

method

manual state transition

evidence type

theoretical

conditions

species_unverified
species_verified
text: Effective at reducing direct user-facing harms; insufficient for detecting strategic misalignment

Provenance

source title

Frontier AI Safety Frameworks review (multiple labs, 2024)

authors

reviewer:will-blair

Source records

source record

declared

Frontier AI Safety Frameworks review (multiple labs, 2024)

vs_0ed2b819f71baff6

title:Frontier AI Safety Frameworks review (multiple labs, 2024)

2024 · manual_curation

Evidence atoms

vea_d4443243040bc5d5 · theoretical · unknown
Behavioral safety evaluations (refusal-based testing on harmful content categories) show strong surface-level safety but do not assess deeper deception, sandbagging, or scheming capabilities.
vs_0ed2b819f71baff6 · manual_curation

Typed links

outgoing

No outgoing links.

incoming

Benchmark data contamination affects 16-91% of test sets across major LLMs, with models achieving high benchmark scores while failing 72% of real-world task executions.
contradicts · vf_201b5c921b23410b
Specification gaming—optimizing the literal objective rather than the intended outcome—generalizes from simple laboratory tasks to increasingly sophisticated real-world exploits.
supports · vf_c949999dbbad515b

Review, event, and evaluation records

events

vev_f495bad59887515c · finding.asserted
Manual finding added to frontier state
reviewer:will-blair · 2026-05-29

reviewable changes

vpr_162253ea3f0f4c69 · finding.add
Manual finding added to frontier state
applied · reviewer:will-blair · 2026-05-29

evaluations

No evaluation record targets this finding id.
